# Measures of Central Tendency
> **Central tendency** refers to the measures used to determine the "center" of a data distribution.
* Measures of central tendency:
1. Mean
2. Median
3. Mode

**Suppose** we have the data {1, 2, 3, 4, 5, 6, 7}.

1. **Mean (Average)**
    * The formula depends on whether the data is a population or a sample.
        * Population data (size `N`): the symbol `μ` denotes the population mean.
            * Formula: μ = ( ΣXi ) / N
            * Where:  
                - μ = Population mean  
                - Xi = Each individual value in the population  
                - N = Total number of values in the population
        
        * Sample data (size `n`): the symbol `x̄` denotes the sample mean.
            * Formula: x̄ = ( ΣXi ) / n
            * Where:  
                - x̄ = Sample mean  
                - Xi = Each individual value in the sample  
                - n = Total number of values in the sample 
                
2. **Median**
    Suppose `x = [1, 2, 2, 3, 4, 5, 100]`.
    The mean is μ = 117 / 7 ≈ 16.71. Adding 100 to the data set pulls the mean far away from the typical values; such an extreme value is called an outlier.
    * The median is robust to outliers, so it is preferred when outliers are present.
    * Finding the median takes two steps.
        1. Sort all the numbers.
        2. Find the central element. There are two cases:
            * Odd length: the data set above has 7 values, so the central element is the 4th value, `3`.
            * Even length: if we append one more value, 101, the length becomes 8 and there are two central values, `[3, 4]`. The median is their average: ( 3 + 4 ) / 2 = `3.5`.

3. **Mode**
    * The mode is the most frequent value in the data.
    * The mode is most commonly used for categorical features.
        * Example: Age = [23, 45, 21, 2, -, -], Weight = [-, -, 78, 90, 82], Gender = [M, F], where `-` marks a missing value.
        * To fill missing values in a numeric feature, compute the mean and fill with it.
        * If the feature contains outliers, fill with the median instead.
        * For a categorical feature such as Gender, fill with the mode.
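The fill rules above can be sketched in plain Python. This is a minimal illustration, not a definitive method: `MISSING`, `fill_missing`, and the `gender` list are hypothetical names and values introduced here, and the median is applied to Weight only to show the mechanics.

```python
from statistics import mean, median, mode

MISSING = None  # hypothetical marker standing in for the "-" entries

age = [23, 45, 21, 2, MISSING, MISSING]
weight = [MISSING, MISSING, 78, 90, 82]
gender = ["M", "F", "F", MISSING]  # made-up observations for illustration


def fill_missing(values, statistic):
    """Replace every missing entry with statistic(observed values)."""
    observed = [v for v in values if v is not MISSING]
    fill_value = statistic(observed)
    return [fill_value if v is MISSING else v for v in values]


age_filled = fill_missing(age, mean)          # numeric feature: mean
weight_filled = fill_missing(weight, median)  # outlier-prone feature: median
gender_filled = fill_missing(gender, mode)    # categorical feature: mode

print(age_filled)
print(weight_filled)
print(gender_filled)
```

The statistic is passed in as a function so the same helper covers all three rules.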

**Use Cases**

import pandas as pd

data = [1, 2, 2, 3, 4, 5]

# Mean of the original data
mean = pd.Series(data).mean()
print(f"Mean of {data}: {mean}")

# Append an outlier and recompute: the mean is pulled sharply upward
data.append(100)
mean = pd.Series(data).mean()
print(f"Mean of {data}: {mean}")

# The median is barely affected by the outlier
median = pd.Series(data).median()
print(f"Median is: {median}")

# Mode: the most frequent value (mode() returns a Series, so take the first entry)
data = [1, 2, 2, 3, 3, 3, 4, 5, 5, 5, 5]
mode = pd.Series(data).mode()
print(f"Mode: {mode[0]}")

Output:

Mean of [1, 2, 2, 3, 4, 5]: 2.8333333333333335
Mean of [1, 2, 2, 3, 4, 5, 100]: 16.714285714285715
Median is: 3.0
Mode: 5
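
The sort-then-pick-the-middle procedure for the median, including the odd and even cases, can be written out by hand. This is a small sketch for illustration; `median_by_hand` is a hypothetical helper name, and in practice a library function would normally be used.

```python
def median_by_hand(values):
    """Median via the two steps: sort, then find the central element(s)."""
    ordered = sorted(values)  # step 1: sort all the numbers
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd length: a single central element
        return ordered[mid]
    # Even length: average the two central elements
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [1, 2, 2, 3, 4, 5, 100]
even_data = odd_data + [101]

print(median_by_hand(odd_data))   # 3
print(median_by_hand(even_data))  # 3.5
```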


## Measure of Dispersion (Variance and Standard Deviation)
> ***Measures of dispersion*** describe how spread out the data is. Central tendency (mean, median, mode) locates the center of a distribution; dispersion describes how far the values spread around that center. It is measured by two key metrics: **Variance** and **Standard Deviation**.

### Differentiation by Population and Sample

#### 1. Population (\(N\))
##### Variance:
Variance determines the spread of the distribution: a large variance means the values are widely spread around the mean, and a small variance means they are tightly clustered around it.

**Formula:**  
$$
\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}
$$

Where:
- \( \sigma^2 \) = Population variance  
- \( X_i \) = Each individual data point  
- \( \mu \) = Population mean  
- \( N \) = Total number of data points in the population  

##### Standard Deviation:
**Formula:**  
$$
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}
$$

Where:
- \( \sigma \) = Population standard deviation  

---

#### 2. Sample (\(n\))
##### Variance:
**Formula:**  
$$
s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}
$$

Where:
- \( s^2 \) = Sample variance  
- \( X_i \) = Each individual data point  
- \( \bar{X} \) = Sample mean  
- \( n \) = Total number of data points in the sample  

##### Standard Deviation:
**Formula:**  
$$
s = \sqrt{s^2} = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}
$$

Where:
- \( s \) = Sample standard deviation  

---

### Key Differences:
1. **Population variance and standard deviation** use \(N\) (total number of data points).
2. **Sample variance and standard deviation** use \(n - 1\) (Bessel’s correction) to avoid underestimating the population variance.
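
The effect of Bessel's correction can be seen in a small simulation. The sketch below assumes a made-up population (the integers 1 to 100), a sample size of 5, and a fixed random seed; these choices are illustrative only. It repeatedly draws samples and averages the two variance estimates.

```python
import random
from statistics import pvariance

random.seed(0)  # fixed seed so the run is repeatable

population = list(range(1, 101))        # hypothetical population
true_variance = pvariance(population)   # population variance (divides by N)

n = 5            # sample size
trials = 20_000  # number of samples drawn

sum_biased = 0.0
sum_corrected = 0.0
for _ in range(trials):
    sample = random.choices(population, k=n)  # draw with replacement
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)  # sum of squared deviations
    sum_biased += ss / n           # divide by n
    sum_corrected += ss / (n - 1)  # divide by n - 1 (Bessel's correction)

mean_biased = sum_biased / trials
mean_corrected = sum_corrected / trials

print(f"Population variance:        {true_variance:.2f}")
print(f"Average estimate with n:    {mean_biased:.2f}")
print(f"Average estimate with n-1:  {mean_corrected:.2f}")
```

On average, dividing by \(n\) lands below the population variance, while dividing by \(n - 1\) lands close to it.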



---

### Example Calculation:
#### Given data:  
\[
\text{Data} = [1, 2, 2, 3, 4, 5]
\]

#### Mean (\(\bar{x}\)) calculation:  
\[
\bar{x} = \frac{1 + 2 + 2 + 3 + 4 + 5}{6} = 2.83
\]

| \( x \) | \( \bar{x} \) | \( x - \bar{x} \) | \( (x - \bar{x})^2 \) |
|---------|-------------|----------------|----------------|
| 1       | 2.83        | -1.83          | 3.36           |
| 2       | 2.83        | -0.83          | 0.69           |
| 2       | 2.83        | -0.83          | 0.69           |
| 3       | 2.83        | 0.17           | 0.03           |
| 4       | 2.83        | 1.17           | 1.36           |
| 5       | 2.83        | 2.17           | 4.69           |
| **Σ**   | **-**       | **0**          | **10.83**      |

The squared deviations are computed from the unrounded mean \(17/6\) and then rounded to two decimals.

---

### Population Variance:
Treating the six values as the whole population, \(\mu = \bar{x} = 2.83\).

**Formula:**
\[
\sigma^2 = \frac{\sum (x - \bar{x})^2}{N}
\]

Substituting values:
\[
\sigma^2 = \frac{10.83}{6} = 1.81
\]

---

### Sample Variance:
**Formula:**
\[
s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
\]

Substituting values:
\[
s^2 = \frac{10.83}{5} = 2.17
\]

---

### Final Results:
- **Population Variance** (\(\sigma^2\)) = **1.81**  
- **Sample Variance** (\(s^2\)) = **2.17**  
- **Sample Standard Deviation**:  
  \[
  s = \sqrt{s^2} = \sqrt{2.1667} \approx 1.472
  \]
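
The worked example can be checked with a few lines of Python. This sketch follows the formulas step by step and then cross-checks the results against the standard library's `statistics` functions; the variable names are chosen here for illustration.

```python
from statistics import mean, pvariance, variance, stdev

data = [1, 2, 2, 3, 4, 5]

x_bar = mean(data)
# Sum of squared deviations from the mean
ss = sum((x - x_bar) ** 2 for x in data)

pop_var = ss / len(data)           # population variance: divide by N
sample_var = ss / (len(data) - 1)  # sample variance: divide by n - 1
sample_std = sample_var ** 0.5     # sample standard deviation

print(round(pop_var, 2))     # 1.81
print(round(sample_var, 2))  # 2.17
print(round(sample_std, 3))  # 1.472

# Cross-check against the library implementations
assert abs(pop_var - pvariance(data)) < 1e-9
assert abs(sample_var - variance(data)) < 1e-9
assert abs(sample_std - stdev(data)) < 1e-9
```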


<img src="../img/varience_sd.png" height="480px" width="720px" alt="Photo">


# *Percentiles* and *Quartiles*

* ***Percentiles***
    * A *percentile* is a value below which a certain percentage of observations lie.
        * Scoring in the 95th percentile means the student got better marks than 95% of all students.
    * `Example`:
        * Data = {1, 2, 3, 4, 5}
        * What percentage of the values in the data are even?
         $$\frac {2}{5} = 0.4 = 40\text{ percent} $$
        
        `Example using percentiles`

        dataset = {2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12}, n = 20. What is the percentile rank of 10?

        16 values lie below 10, so:

        $$\frac{16}{20} \times 100 = 80\text{th percentile}$$

        * For 11, 17 values lie below it:

        $$\frac{17}{20} \times 100 = 85\text{th percentile}$$

        * Which value is the 25th percentile?

        $$\frac{25}{100} \times (20 + 1) = 5.25 \text{ (position in the sorted data)}$$

        Position 5.25 is not a whole number, so we average the values at the 5th and 6th positions: \( \frac{5 + 5}{2} = 5 \).
    
    * `Formula`
        $$\text{Percentile rank of } x = \frac{\text{Number of values below } x}{n} \times 100$$
        Conversely, to find the value at a given percentile, compute its position in the sorted data:
        $$\text{Position} = \frac{\text{Percentile}}{100} \times (n + 1)$$

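* ***Quartiles***
    * ***Quartiles*** split the sorted data into four equal parts: the 1st quartile (Q1) is the 25th percentile and the 3rd quartile (Q3) is the 75th percentile. They are used to find outliers in a dataset, typically visualized with a `boxplot`.
    * Interquartile range: IQR = Q3 - Q1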
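
The percentile and quartile calculations above can be sketched in plain Python. The function names are hypothetical. When the position is fractional, this sketch interpolates linearly between the two neighboring values instead of averaging them; on this dataset both approaches give the same answers. Library functions may use other position conventions, so their results can differ slightly.

```python
def percentile_rank(data, x):
    """Percentage of values in data that lie strictly below x."""
    below = sum(1 for v in data if v < x)
    return 100 * below / len(data)


def value_at_percentile(data, p):
    """Value at the p-th percentile, using position = p/100 * (n + 1)."""
    ordered = sorted(data)
    n = len(ordered)
    position = p / 100 * (n + 1)  # 1-based position in the sorted data
    lower = int(position)         # whole part of the position
    fraction = position - lower   # fractional part
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    # Interpolate between the values at positions lower and lower + 1
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


dataset = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12]

print(percentile_rank(dataset, 10))  # 80.0
print(percentile_rank(dataset, 11))  # 85.0

q1 = value_at_percentile(dataset, 25)
q3 = value_at_percentile(dataset, 75)
iqr = q3 - q1
print(q1, q3, iqr)  # 5.0 9.0 4.0
```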