<a href="https://colab.research.google.com/github/davidofitaly/algebra_stats_probability_ds_notes/blob/main/03_descriptive_inferential_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### *Mean and Weighted Mean*

##### The **mean** and **weighted mean** are fundamental measures of the central tendency of a dataset. This section also distinguishes the **sample mean** from the **population mean**.



##### **Arithmetic Mean:**

##### The **mean** (or **arithmetic mean**) represents the average value of a dataset. It can be calculated for either a **population** or a **sample**.


##### **Population Mean ($\mu$):**

##### The **population mean** is the average of all values in an entire population. For a given population it is a fixed value (a parameter), not something that changes from one draw to the next.

##### **Formula:**
$$
\mu = \frac{\sum_{i=1}^N x_i}{N}
$$

##### **Where:**
- $x_i$: The $i$-th value in the population,
- $N$: Total number of values in the population.

##### **Sample Mean ($\bar{x}$):**

##### The **sample mean** is the average of the values in a sample, which is a subset of the population. It is used to estimate the population mean.

##### **Formula:**
$$
\bar{x} = \frac{\sum_{i=1}^n x_i}{n}
$$

##### **Where:**
- $x_i$: The $i$-th value in the sample,
- $n$: Total number of values in the sample.

##### **Example (Population and Sample):**

Given the population $[3, 5, 7, 9]$:
$$
\mu = \frac{3 + 5 + 7 + 9}{4} = \frac{24}{4} = 6
$$

For a sample $[3, 9]$ taken from the population:
$$
\bar{x} = \frac{3 + 9}{2} = \frac{12}{2} = 6
$$
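
The two formulas above translate directly into code. The following is a minimal sketch in plain Python; the variable names (`population`, `sample`, `mu`, `x_bar`) are chosen here for illustration and are not part of the original notes.

```python
# Population and sample from the worked example above.
population = [3, 5, 7, 9]
sample = [3, 9]

# Population mean: sum of all N values divided by N.
mu = sum(population) / len(population)

# Sample mean: sum of the n sampled values divided by n.
x_bar = sum(sample) / len(sample)

print(mu)     # 6.0
print(x_bar)  # 6.0
```

In this example the sample mean happens to equal $\mu$ exactly; a different sample from the same population would generally give a different value.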


##### **Weighted Mean:**

The **weighted mean** accounts for the importance (or weight) of each value, whether in a population or sample.

##### **Formula:**
$$
\text{Weighted Mean (} \mu_w \text{ or } \bar{x}_w \text{)} = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}
$$

##### **Where:**
- $x_i$: The $i$-th value,
- $w_i$: The weight associated with $x_i$,
- $\sum_{i=1}^n w_i$: The total of all weights.

##### **Example (Weighted Mean):**

Given values $[10, 20, 30]$ with weights $[1, 2, 3]$:
$$
\mu_w = \frac{1 \cdot 10 + 2 \cdot 20 + 3 \cdot 30}{1 + 2 + 3}
= \frac{10 + 40 + 90}{6}
= \frac{140}{6} \approx 23.33
$$
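
As a rough sketch, the weighted mean formula can be written as a small helper. The function name `weighted_mean` and its input checks are assumptions made for this illustration, not something defined in the notes.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    # Numerator: each value multiplied by its weight, then summed.
    weighted_sum = sum(w * x for w, x in zip(weights, values))
    return weighted_sum / total_weight


result = weighted_mean([10, 20, 30], [1, 2, 3])
print(round(result, 2))  # 23.33
```

With equal weights the result reduces to the ordinary arithmetic mean: `weighted_mean([10, 20, 30], [1, 1, 1])` gives `20.0`.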


##### **Key Distinctions:**

1. **Population Mean ($\mu$):**
   - Considers all values in the population,
   - Fixed and unchanging.

2. **Sample Mean ($\bar{x}$):**
   - Uses only a subset of the population,
   - May vary between samples and is used to estimate $\mu$.

3. **Weighted Mean ($\mu_w$ or $\bar{x}_w$):**
   - Adjusts for the relative importance of values,
   - Useful in scenarios where some data points carry more significance.
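
To make the second distinction concrete, the sketch below enumerates every possible sample of size 2 from the population $[3, 5, 7, 9]$ used earlier. The sample size of 2 and the choice to enumerate all samples with `itertools.combinations` (rather than drawing them at random) are assumptions made to keep the illustration deterministic.

```python
from itertools import combinations

population = [3, 5, 7, 9]
mu = sum(population) / len(population)

# Every possible sample of size 2, drawn without replacement.
sample_means = [sum(s) / len(s) for s in combinations(population, 2)]

# The sample mean changes from sample to sample ...
print(sample_means)  # [4.0, 5.0, 6.0, 6.0, 7.0, 8.0]

# ... but averaged over all possible samples it equals the population mean.
print(sum(sample_means) / len(sample_means))  # 6.0
```

The population mean stays at 6 throughout, while individual sample means range from 4 to 8.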


### *Median*

##### The **median** is a measure of central tendency that represents the middle value in a dataset when the values are arranged in ascending or descending order. Unlike the mean, the median is largely unaffected by extreme values or outliers, because it depends on the order of the values rather than on their magnitudes.



##### **Key Characteristics:**

1. **Order Matters:**  
   To determine the median, the data must first be arranged in order (ascending or descending).

2. **Robustness:**  
   The median is resistant to extreme values, making it ideal for skewed datasets.

3. **Applicability:**  
   The median can be calculated for both populations and samples.



##### **Calculating the Median:**

1. **For Odd Number of Observations ($n$):**  
   The median is the value at the middle position:
   $$
   \text{Median} = x_{\frac{n+1}{2}}
   $$

2. **For Even Number of Observations ($n$):**  
   The median is the average of the two middle values:
   $$
   \text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}}{2}
   $$


##### **Example:**

**Dataset 1 (Odd Number of Values):**  
$[3, 5, 7, 9, 11]$  
- Arrange in ascending order: $[3, 5, 7, 9, 11]$  
- Middle position: 3rd value, which is $7$  
$$
\text{Median} = 7
$$

**Dataset 2 (Even Number of Values):**  
$[2, 4, 6, 8, 10, 12]$  
- Arrange in ascending order: $[2, 4, 6, 8, 10, 12]$  
- Middle values: $6$ and $8$  
$$
\text{Median} = \frac{6 + 8}{2} = 7
$$
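
The two positional formulas can be sketched as a single function. The name `median_by_position` is hypothetical; the only subtlety is that the formulas use 1-based positions while Python lists are 0-based.

```python
def median_by_position(values):
    """Median of a dataset using the odd/even positional formulas."""
    ordered = sorted(values)  # the data must be ordered first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: 1-based position (n + 1) / 2 is 0-based index n // 2.
        return ordered[mid]
    # Even n: 1-based positions n/2 and n/2 + 1 are 0-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_position([3, 5, 7, 9, 11]))      # 7
print(median_by_position([2, 4, 6, 8, 10, 12]))  # 7.0
```

Because the function sorts its input, the values can be passed in any order.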



##### **Median vs. Mean:**

1. **Sensitivity to Outliers:**  
   - The **mean** is pulled toward extreme values, while the **median** changes little or not at all.

2. **Use Cases:**  
   - The **median** is preferred for skewed datasets (e.g., income, house prices).
   - The **mean** works well for roughly symmetric datasets without strong outliers.
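
A small experiment illustrates the difference in sensitivity. The dataset with the outlier is hypothetical: it is the odd-sized example from above with its largest value replaced by 1000. The functions `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

data = [3, 5, 7, 9, 11]
with_outlier = [3, 5, 7, 9, 1000]  # hypothetical: 11 replaced by an extreme value

# Without the outlier, mean and median agree (both are 7).
print(mean(data), median(data))

# With the outlier, the mean jumps to 204.8 while the median stays at 7.
print(mean(with_outlier), median(with_outlier))
```

A single extreme value moves the mean far from the bulk of the data, whereas the median still reports the middle observation.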


### *Mode*


##### The **mode** is a measure of central tendency that identifies the most frequently occurring value(s) in a dataset. It is especially useful for categorical or discrete data and can be applied to both populations and samples.



##### **Key Characteristics:**

1. **Frequency-Based:**  
   The mode is the value(s) that appear most often in the dataset.

2. **Applicability:**  
   - For numerical data: Finds the most common value.  
   - For categorical data: Identifies the most frequent category.

3. **Types of Datasets:**  
   - A dataset can have:
     - **No mode:** If all values occur with the same frequency.  
     - **One mode (Unimodal):** If a single value dominates.  
     - **Two modes (Bimodal):** If two values have the highest frequency.  
     - **Multiple modes (Multimodal):** If more than two values share the highest frequency.



##### **Calculating the Mode:**

1. Arrange the data (optional, but helpful for clarity).  
2. Count the frequency of each value.  
3. Identify the value(s) with the highest frequency.
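
The counting steps above can be sketched with `collections.Counter`. The function name `modes` and the small test datasets are made up for this illustration. The function follows the convention stated in these notes that a dataset in which all values are equally frequent has no mode; other references and libraries may treat that case differently.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)  # step 2: frequency of each value
    if not counts:
        return []
    top = max(counts.values())  # the highest frequency
    winners = [v for v, c in counts.items() if c == top]  # step 3
    # Convention used in these notes: if several distinct values all
    # share the same frequency, the dataset has no mode.
    if len(counts) > 1 and len(winners) == len(counts):
        return []
    return winners


print(modes([1, 2, 2, 3]))     # [2]      unimodal
print(modes([1, 2, 2, 3, 3]))  # [2, 3]   bimodal
print(modes([1, 2, 3]))        # []       no mode
```

The length of the returned list indicates whether the data is unimodal, bimodal, or multimodal.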



##### **Example 1 (Numerical Data):**  
Dataset: $[2, 4, 4, 6, 6, 6, 8, 10]$  
- Frequency:  
  $2 \rightarrow 1$, $4 \rightarrow 2$, $6 \rightarrow 3$, $8 \rightarrow 1$, $10 \rightarrow 1$  
- **Mode:** $6$ (appears 3 times)



##### **Example 2 (Categorical Data):**  
Dataset: $[\text{Red}, \text{Blue}, \text{Red}, \text{Green}, \text{Blue}, \text{Blue}]$  
- Frequency:  
  $\text{Red} \rightarrow 2$, $\text{Blue} \rightarrow 3$, $\text{Green} \rightarrow 1$  
- **Mode:** $\text{Blue}$ (appears 3 times)
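
Both examples can be checked with Python's standard `statistics` module, which handles numerical and categorical data alike. This is a brief sketch; `multimode` requires Python 3.8 or newer, and the third dataset is a hypothetical bimodal case added for illustration.

```python
from statistics import mode, multimode

numbers = [2, 4, 4, 6, 6, 6, 8, 10]
colors = ["Red", "Blue", "Red", "Green", "Blue", "Blue"]

print(mode(numbers))  # 6
print(mode(colors))   # Blue

# multimode returns every value that shares the highest frequency.
print(multimode([1, 1, 2, 2, 3]))  # [1, 2]
```

Note that `multimode` returns all values when every value is equally frequent, which differs from the "no mode" convention described earlier.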



##### **Mode vs. Mean and Median:**

1. **Dataset Type:**  
   - The **mode** is ideal for categorical data.  
   - The **mean** requires numerical data; the **median** requires data that can at least be ordered.

2. **Skewed Data:**  
   - The **mode** marks the peak of the distribution, that is, the most common value. In a skewed distribution it can differ noticeably from both the mean and the median.

3. **Multiple Values:**  
   - A dataset can have several modes, whereas its mean and median are each a single value.

