# 1. Population and Sample

First things first. In every statistical analysis you should define both your population and your sample. Distinguishing them matters in descriptive statistics because they require different formulas to calculate variance and standard deviation, and in inferential statistics because you draw conclusions about your population from a representative sample.


> **Population:** All items of interest (this may take a while to gather...). It all comes down to how you define your study. The broader it is, the bigger the population will be.
  
> **Sample:** A sample is simply a subset of your population. In many cases you will never have access to every item in your study, and therefore you will never have access to the whole population. So, when you define your study, you set a strategy to collect data, and that group of data will form your sampling base. Look at it like your DW, which must be accessible.

![image.png](attachment:image.png)

> ## 1.1. Sample

Several factors determine how large your sample needs to be.

 - **Variability:** The size of your sample matters little if the characteristics of your population are homogeneous. The more heterogeneous your population is, the bigger your sample needs to be.<br>
 - **Precision:** If I want more precision, I need a bigger sample. The narrower the range of our interval, the higher the precision.<br>
 - **Level of confidence:** For more confidence, I need a bigger sample.<br>
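
To see how the three factors above pull on the sample size, here is a minimal sketch. It assumes we want to estimate a mean and uses the textbook approximation n = (z · σ / E)², where σ is the population standard deviation (variability), E is the margin of error (precision) and z is the normal critical value for the chosen confidence level. The function name and the numbers are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Approximate sample size needed to estimate a mean within +/- margin."""
    # Two-sided critical value of the standard normal distribution
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

# Hypothetical baseline: sigma = 10, margin of error = 2, 95% confidence
baseline = sample_size_for_mean(sigma=10, margin=2, confidence=0.95)
more_variable = sample_size_for_mean(sigma=20, margin=2, confidence=0.95)
more_precise = sample_size_for_mean(sigma=10, margin=1, confidence=0.95)
more_confident = sample_size_for_mean(sigma=10, margin=2, confidence=0.99)

print(baseline, more_variable, more_precise, more_confident)
```

Each of the three variations asks for a larger sample than the baseline, which matches the list above.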

The most important thing about a sample is that it is representative of our population. Here are a few types of sampling:

- **Random Sampling:** Every data point has the same probability of being chosen

- **Stratified Sampling:** Divide the population into groups with homogeneous characteristics and sample from each group. This lets me impose restrictions on the variables in my sample

![image.png](attachment:image.png)


- **Cluster Sampling:** The sampling is conducted on clusters of the population, which works when the clusters are similar to one another yet heterogeneous internally. Each cluster is a sampling unit

![image.png](attachment:image.png)
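
The three sampling types above can be sketched with the standard library. The population, the `region` field and the sample sizes are all hypothetical, and for simplicity the same regions play the role of strata in one case and of clusters in the other, which a real study would not necessarily do.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100 people, 25 in each of four regions
regions = ["north", "south", "east", "west"]
population = [{"id": i, "region": regions[i % 4]} for i in range(100)]

# Random sampling: every item has the same chance of being picked
simple_sample = random.sample(population, k=12)

# Group the population by region
groups = {}
for person in population:
    groups.setdefault(person["region"], []).append(person)

# Stratified sampling: draw the same number of people from every region
stratified_sample = [
    person
    for members in groups.values()
    for person in random.sample(members, k=3)
]

# Cluster sampling: pick whole regions at random and keep everyone inside them
chosen_regions = random.sample(sorted(groups), k=2)
cluster_sample = [person for region in chosen_regions for person in groups[region]]

print(len(simple_sample), len(stratified_sample), len(cluster_sample))
```

Note the difference in what gets randomized: individuals in the first two cases, entire groups in the last one.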

---

# 2. Types of Data

There can be a lot of confusion about how many types of data exist. Here is an image from the ChiSquared Innovations website that I hope makes the picture clearer.

![image.png](attachment:image.png)

So, we have:

> **Categorical Data (or Qualitative)**
- *Nominal:* Brands, Subject, etc.
- *Ordinal:* When there is an underlying order: Very Much | Much | More or Less | Hate
- *Binary:* We can have two types of binary data <br><br>
    - *Symmetric:* When the two options have the same intrinsic value: Male / Female
    - *Asymmetric:* The outcomes are not equally important: Usually we assign 1 to the positive or most important outcome
    <br>
    
> **Numerical Data (or Quantitative)**
- *Discrete:* The outcomes can be counted: Number of apples; Number of chairs
- *Continuous:* The value can fall anywhere within a range and can always be measured with more decimal places: Weight of a person

> We also have 2 more concepts for **Quantitative Data**
- *Interval:* A classic example is temperature. The 0 is arbitrary whether I measure in Celsius or in Fahrenheit, because these scales were created by convention. Differences are meaningful, but ratios are not
- *Ratio:* This type of data has a true or meaningful zero: Number of objects, time or distance (I can say that I have 5 times more shoes than another person)
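
A quick way to feel the difference between interval and ratio data is to change the unit and see whether a ratio survives. The sketch below uses hypothetical values; the conversion factors are the usual Celsius to Fahrenheit formula and an approximate kilometres to miles factor.

```python
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

KM_TO_MILES = 0.621371  # approximate conversion factor

# Interval scale: "20 degrees is twice 10 degrees" only holds in Celsius
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio scale: "20 km is twice 10 km" holds in any unit
ratio_km = 20 / 10
ratio_miles = (20 * KM_TO_MILES) / (10 * KM_TO_MILES)

print("temperature:", ratio_celsius, ratio_fahrenheit)
print("distance:", ratio_km, ratio_miles)
```

The temperature ratio changes with the unit because the zero point is a convention, while the distance ratio stays at 2 because zero distance means the same thing in every unit.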

---

# 3. Measures of Central Tendency

A measure of central tendency tries to describe a set of data by identifying the central position within that set.

> ## 3.1 Mean

The rock star measure of central tendency! The mean is most useful when dealing with continuous data, and one of its important properties is that it minimises the error in the prediction of any one value in your data set: it is the value that produces the smallest sum of squared deviations from the data points.

However, it has the disadvantage of being sensitive to outliers.

![image.png](attachment:image.png)

In [1]:
numbers = [1, 2, 3, 4, 5]

In [2]:
total = 0

for num in numbers:
    total = total + num

print(total)

15


In [3]:
elements = len(numbers)

In [4]:
mean = total/elements
mean

3.0

*Let's add an outlier*

In [5]:
total_out = total + 100
total_out

115

In [6]:
mean_2 = total_out/(len(numbers)+1)
mean_2

19.166666666666668

> ## 3.2 Mode

The mode is the value with the highest frequency in our data set.
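
A minimal sketch of finding the mode by counting frequencies, using hypothetical data. A data set can have more than one mode, so the helper (a name made up for this example) returns a list.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

single_mode = modes([2, 3, 3, 5, 7, 3, 5])  # 3 appears three times
two_modes = modes([1, 1, 2, 2, 3])          # 1 and 2 both appear twice

print(single_mode, two_modes)
```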

> ## 3.3 Median

The median is the middle value of our data set once it is sorted: half of the observations fall below it and half above.

![image.png](attachment:image.png)

In [7]:
import numpy as np

a = np.arange(13, 33, step = 2)
a

array([13, 15, 17, 19, 21, 23, 25, 27, 29, 31])

In [8]:
len(a)

10

In [9]:
# The array has an even number of elements, so there are two central values
central_obs = len(a) // 2
central_obs

5

Since the data set is already sorted in ascending order and Python starts counting positions at zero, the two central values sit at positions 5 - 1 = 4 and 5. With an even number of elements, the median is the average of these two values.

In [10]:
median = (a[central_obs - 1] + a[central_obs]) / 2
median

22.0

Neither the mode nor the median uses every value in the data set in its calculation, so they are less sensitive to outliers.

Comparing the mean to the median is also a good way to evaluate symmetry: in a symmetric distribution the two are close, and a large gap between them points to skew.
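
As a small illustration, the sketch below reuses the numbers from the mean example (1 to 5, then the same list with the outlier 100) and compares mean and median with the standard library `statistics` module.

```python
from statistics import mean, median

symmetric = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 5, 100]

gap_symmetric = mean(symmetric) - median(symmetric)
gap_with_outlier = mean(with_outlier) - median(with_outlier)

print("symmetric:", mean(symmetric), median(symmetric), gap_symmetric)
print("with outlier:", mean(with_outlier), median(with_outlier), gap_with_outlier)
```

The outlier drags the mean far above the median, while the median barely moves.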

---

# 4. Measures of Dispersion

These measures let us see how our data is spread around its measures of central tendency.

> ## 4.1. Mean Absolute Deviation

With the MAD we can calculate the average absolute distance between each point in our data set and the mean.

![image.png](attachment:image.png)

We calculate the MAD because we are interested in the deviation of the points from the mean, and we don't want the negative deviations (values below the mean) to cancel out the positive ones, so we use the absolute deviations.

In [11]:
import pandas as pd  
import numpy as np

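In [12]:
# create a list (named `values` to avoid shadowing the built-in `list`)
values = [93, 10, 30, 15, 64, 35]

In [13]:
# create a series
series = pd.Series(values)

In [14]:
# Series.mad() was removed in pandas 2.0, so compute the mean absolute deviation explicitly
mad = (series - series.mean()).abs().mean()
mad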

24.888888888888886

> ## 4.2. Variance

![image.png](attachment:image.png)

The variance measures how far a group of data points deviates from the mean. It is never negative because each deviation is squared before being averaged.

If the variance around the mean is a large number, our data is very spread out.

The numerator of this formula, which sums all the squared deviations, is known as the **Sum of Squares**. This brings two problems. First, squaring gives more weight to points far from the mean, such as outliers. Second, the variance is not in the same units as the scores in our data set; it is measured in squared units. This means we cannot place it on our frequency distribution and cannot directly relate its value to the values in our data set. The standard deviation addresses this second problem.
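
Here is a minimal sketch of the calculation, reusing the values from the MAD example. Since the population and the sample use different formulas, both are shown: the population variance divides the Sum of Squares by n, the sample variance by n - 1. The variable names are my own.

```python
from statistics import pvariance, variance

data = [93, 10, 30, 15, 64, 35]

mean_value = sum(data) / len(data)

# Sum of Squares: squared deviations of every point from the mean
sum_of_squares = sum((x - mean_value) ** 2 for x in data)

population_variance = sum_of_squares / len(data)    # divide by n
sample_variance = sum_of_squares / (len(data) - 1)  # divide by n - 1

# Cross-check against the standard library implementations
print(population_variance, pvariance(data))
print(sample_variance, variance(data))
```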


> ## 4.3. Standard Deviation

![image.png](attachment:image.png)

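The standard deviation is the square root of the variance, which brings the measure back to the units of the data. It is most useful when the data is not significantly skewed and has no outliers.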
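
A short sketch, again on the values from the MAD example and using the population formulas from the standard library, showing that the standard deviation is just the square root of the variance and is therefore expressed in the same units as the data.

```python
import math
from statistics import pstdev, pvariance

data = [93, 10, 30, 15, 64, 35]

variance_pop = pvariance(data)     # in squared units
std_pop = math.sqrt(variance_pop)  # back in the units of the data

print(variance_pop, std_pop, pstdev(data))
```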

> ## 4.4. Range

![image.png](attachment:image.png)

Let's assume that we have a list_A with 5 values out of 25 and a list_B with 5 values out of 100

In [15]:
list_A = [10, 15, 18, 20, 20]
list_B = [30, 35, 40, 45, 50]

In [16]:
range_A = max(list_A) - min(list_A)
range_A

10

In [17]:
range_B = max(list_B) - min(list_B)
range_B

20

At first glance, list_B looks more spread out than list_A (a range of 20 against 10). However, if we look at the Coefficient of Range our conclusion will change

In [18]:
coef_A = range_A / (max(list_A) + min(list_A))
coef_B = range_B / (max(list_B) + min(list_B))

In [19]:
print('coef_A: ', coef_A)
print('coef_B: ', coef_B)

coef_A:  0.3333333333333333
coef_B:  0.25


The Coefficient of Range is the range standardized by the sum of the max value and the min value.

Since the two lists were taken from samples of different sizes, their ranges cannot be compared directly, and that is why we use the coefficient.

In this case, the coefficient shows that list_A (0.33) is relatively more dispersed than list_B (0.25), the opposite of what the raw ranges suggest.

> ## 4.5. Quartiles

![image.png](attachment:image.png)

Quartiles are useful to get an idea of how spread out our data is. If we add a boxplot to our analysis, we can easily identify outliers in our data set. Outliers can help us find errors in databases; they should be excluded when interpreting the data, as they tend to influence the mean and standard deviation

- **Q1 and Q3:** The quartiles are order statistics that divide the distribution into four equal parts. With observations in ascending order, the 1st quartile and the 3rd quartile mark the points below which 25% and 75% of the observations fall, respectively.

- **IQR = Q3 - Q1:** The interquartile range is a robust measure of dispersion, defined as the difference between the 3rd and 1st quartiles. It covers the central 50% of the observations, which means that neither the smallest 25% nor the largest 25% of the observations are involved in its calculation.


The larger the box, the greater the dispersion. The height of the box represents the IQR.

The interquartile range has some of the same drawbacks as the range, i.e. two distributions can have the same IQR even though their dispersion is very different.
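
The quartiles and the IQR can be computed with the standard library, as in the sketch below. The data is hypothetical (the array from the median example plus one extreme value). The 1.5 × IQR fences are an assumption on my part: they are the convention most boxplots use to flag outliers, not something defined in this notebook.

```python
from statistics import quantiles

data = [13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 90]

# method="inclusive" treats the data as the whole population and
# interpolates linearly between neighbouring observations
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Common boxplot convention (assumed here): points beyond 1.5 * IQR are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr, outliers)
```

Other quantile methods interpolate slightly differently, so the exact quartile values can change from one tool to another on small data sets.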

Quartiles are useful, but they are also somewhat limited because they do not take into account every score in our group of data. To get a more representative idea of spread we need to take into account the actual values of each score in a data set. 

The absolute deviation, variance and standard deviation are such measures


> ## 4.6. Skewness

We often assume that the data we test is normally distributed, since this is a common assumption in statistical tests.

However, our data can sometimes be skewed. Skewness is the measure of asymmetry in a probability distribution.

![image.png](attachment:image.png)

In a situation of asymmetrical data, the median is considered a better measure of central tendency than the mean. The more skewed the distribution, the greater the difference between the median and mean, and the greater emphasis should be placed on using the median as opposed to the mean.

A good measurement for the skewness of a distribution is Pearson's skewness coefficient, which provides a quick estimate of a distribution's symmetry.

A common way to calculate skewness:

![image.png](attachment:image.png)

![image.png](attachment:image.png)
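
As a rough sketch, here is one version of Pearson's coefficient, the second (median) skewness coefficient, 3 × (mean - median) / standard deviation. I am assuming this variant for illustration; it may not be the exact formula shown in the figures. The data sets are hypothetical, and the sample standard deviation is used.

```python
from statistics import mean, median, stdev

def pearson_median_skewness(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std."""
    return 3 * (mean(values) - median(values)) / stdev(values)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 5, 100]
left_skewed = [-100, 1, 2, 3, 4, 5]

skew_symmetric = pearson_median_skewness(symmetric)
skew_right = pearson_median_skewness(right_skewed)
skew_left = pearson_median_skewness(left_skewed)

print(skew_symmetric, skew_right, skew_left)
```

A value of zero suggests symmetry, a positive value a longer right tail (mean above median) and a negative value a longer left tail.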

> ## 4.7. Kurtosis

![image.png](attachment:image.png)

Kurtosis tells us whether our data set is heavy-tailed (more outliers) or light-tailed (fewer outliers).

- **Mesokurtic:** The excess kurtosis is zero (a kurtosis of 3), as in the normal distribution.

- **Leptokurtic:** The tail of the distribution is heavy (outliers present) and kurtosis is higher than that of the normal distribution.

- **Platykurtic:** The tail of the distribution is light and kurtosis is lower than that of the normal distribution.
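
To make the three cases concrete, here is a minimal sketch that computes excess kurtosis from the second and fourth central moments (m4 / m2² - 3, the population version without any small-sample correction). The function name and the data sets are hypothetical.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth moment over squared variance, minus 3."""
    n = len(values)
    mean_value = sum(values) / n
    m2 = sum((x - mean_value) ** 2 for x in values) / n
    m4 = sum((x - mean_value) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]            # evenly spread, light tails
heavy_tailed = [5, 5, 5, 5, 5, 5, 5, 5, -20, 30]  # mostly central, two extremes

kurt_flat = excess_kurtosis(flat)
kurt_heavy = excess_kurtosis(heavy_tailed)

print(kurt_flat, kurt_heavy)
```

The evenly spread data comes out negative (platykurtic) and the data with two extreme values comes out positive (leptokurtic).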

### References

https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php

https://statistics.laerd.com/statistical-guides/measures-of-spread-range-quartiles.php

https://mba-lectures.com/statistics/descriptive-statistics/338/mean-absolute-deviation.html

https://www.emathzone.com/tutorials/basic-statistics/range-and-coefficient-of-range.html

https://towardsdatascience.com/skewness-kurtosis-simplified-1338e094fc85