# Mean
### Theory of Mean

The mean (more precisely, the arithmetic mean), often referred to as the average, is a measure of central tendency in statistics. It is the sum of all values in a dataset divided by the number of values. The mean summarizes the entire dataset in a single value and is widely used in data analysis.

### Formula
For a dataset with values $x_1, x_2, \ldots, x_n$, the mean $\bar{x}$ is calculated as:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
$$

### Properties
- The mean is sensitive to extreme values (outliers).
- It is most representative when the data is roughly symmetrically distributed and free of outliers.
- The mean is used in many statistical analyses and is the basis for other measures such as variance and standard deviation.
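To illustrate the last property, here is a minimal sketch of how variance and standard deviation are built on top of the mean. It assumes the population (divide-by-n) definitions and uses the small dataset `[2, 4, 6, 8, 10]`; the variable names are illustrative only.

```python
import math
import statistics

data = [2, 4, 6, 8, 10]

# Step 1: the mean is the reference point for every deviation.
mean = sum(data) / len(data)

# Step 2: variance is the mean of the squared deviations from that mean
# (population definition, dividing by n; an assumption of this sketch).
squared_deviations = [(x - mean) ** 2 for x in data]
population_variance = sum(squared_deviations) / len(data)

# Step 3: standard deviation is the square root of the variance.
population_std = math.sqrt(population_variance)

print(f"Mean:{mean}")                   # Mean:6.0
print(f"Variance:{population_variance}")  # Variance:8.0
print(f"Std:{population_std:.3f}")      # Std:2.828

# Cross-check against the standard library.
assert math.isclose(population_variance, statistics.pvariance(data))
```

The sample variance divides by n - 1 instead of n; `statistics.variance` implements that version.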

### Example
If the dataset is [2, 4, 6, 8, 10], the mean is:

$$
\bar{x} = \frac{2 + 4 + 6 + 8 + 10}{5} = 6
$$
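The formula translates almost directly into code. The helper below is a small sketch (the name `arithmetic_mean` is hypothetical, not a library function) that reproduces the worked example and rejects an empty dataset, for which the mean is undefined.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by how many there are."""
    values = list(values)
    if not values:
        # Dividing by n = 0 is undefined, so fail with a clear message.
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / len(values)


result = arithmetic_mean([2, 4, 6, 8, 10])
print(f"Mean:{result}")  # Mean:6.0
```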

# Mean using manual calculation

In [1]:
Data=[12, 15, 18, 20, 22, 25, 28, 30, 42, 51]
Mean=sum(Data)/len(Data)
print(f"Mean:{Mean}")

Mean:26.3


# Mean using numpy library

In [2]:
import numpy as np
Data=[12, 15, 18, 20, 22, 25, 28, 30, 42, 51]
Mean=np.mean(Data)
print(f"Mean:{Mean}")

Mean:26.3


# Mean using statistics library

In [3]:
import statistics as stats
Data=[12, 15, 18, 20, 22, 25, 28, 30, 42, 51]
Mean=stats.mean(Data)
print(f"Mean:{Mean}")

Mean:26.3


# Applying the mean to real-world data

In [5]:
import seaborn as sns

iris=sns.load_dataset('iris')
iris.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
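A column mean is the same formula applied to one column of the table. As a rough sketch that runs without downloading the dataset, the five preview rows above can be parsed with the standard library and averaged. The result describes only these five rows, not the full `iris` dataset.

```python
import csv
import io
from statistics import fmean

# The five preview rows shown by iris.head(), copied as CSV text.
preview = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
"""

rows = list(csv.DictReader(io.StringIO(preview)))

# Pull out one column and convert the text values to floats.
sepal_lengths = [float(row["sepal_length"]) for row in rows]

preview_mean = fmean(sepal_lengths)
print(f"Mean of Sepal Length (first 5 rows): {preview_mean:.2f}")  # 4.86
```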


In [10]:
import seaborn as sns
import numpy as np

iris=sns.load_dataset('iris')
Mean=np.mean(iris["sepal_length"])
print(f"Mean of Sepal Length is {Mean}")

Mean of Sepal Length is 5.843333333333334


# Exercises: Mean (6 questions)

1. Compute the mean of the list `Data = [13, 17, 19, 21, 23, 27, 29, 33, 40, 49]` manually (use `sum` and `len`) and show the calculation.

2. Compute the mean of `Data` using numpy and using the statistics module. Verify both results match the manual calculation.

3. Using the `iris` DataFrame, compute the mean of the `sepal_length` column. Round the result to three decimal places.

4. Compute the mean `sepal_length` for each species in the `iris` dataset (setosa, versicolor, virginica). Which species has the largest mean sepal length?

5. Investigate outlier impact: remove the largest value in `Data` and recompute the mean. How much did the mean change? Explain why the mean is sensitive to outliers.

6. For a dataset with clear outliers, which measure of central tendency would you prefer (mean, median, or mode)? Provide a short justification and give an example using `Data` or `iris`.


# 1. Manual Calculation

In [12]:
Data=[13, 17, 19, 21, 23, 27, 29, 33, 40, 49]
Mean=sum(Data)/len(Data)
print(f"Mean:{Mean}")

Mean:27.1


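# 2. Verifying with numpy and statistics

In [14]:
import numpy as np
import statistics as stats

Data=[13, 17, 19, 21, 23, 27, 29, 33, 40, 49]
# Both results should match the manual calculation (27.1)
print(f"Mean (numpy):{np.mean(Data)}")
print(f"Mean (statistics):{stats.mean(Data)}")

Mean (numpy):27.1
Mean (statistics):27.1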


# 3. Mean of real-world data

In [16]:
import seaborn as sns
import numpy as np

Data=sns.load_dataset("iris")
Mean=np.mean(Data["sepal_length"])
print(f"Mean of sepal length is {round(Mean,3)}")

Mean of sepal length is 5.843


# 4. Mean of each species

In [24]:
import seaborn as sns
import numpy as np

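Data = sns.load_dataset("iris")
# Select each species' sepal_length values before averaging;
# averaging the boolean mask alone would give the share of rows, not a length.
Mean = {
    "setosa": np.mean(Data.loc[Data["species"] == "setosa", "sepal_length"]),
    "versicolor": np.mean(Data.loc[Data["species"] == "versicolor", "sepal_length"]),
    "virginica": np.mean(Data.loc[Data["species"] == "virginica", "sepal_length"])
}
# max() on a dict compares keys, so use key=Mean.get to compare the means
largest = max(Mean, key=Mean.get)
print(f"{largest} has largest mean sepal length")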


virginica has largest mean sepal length
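The grouping logic can also be written without any third-party library. The sketch below uses made-up `(label, value)` records (they are hypothetical and are NOT iris measurements) to show the two steps: collect values per group, then average each group. Note that `max` applied to a dict compares its keys, so a `key` function is needed to pick the group with the largest mean.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical toy records for illustration only (not iris data).
records = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0), ("c", 4.0)]


def group_means(pairs):
    """Return {label: mean of the values carrying that label}."""
    groups = defaultdict(list)
    for label, value in pairs:
        groups[label].append(value)
    return {label: fmean(values) for label, values in groups.items()}


means = group_means(records)
largest = max(means, key=means.get)  # compares the means, not the labels

print(means)    # {'a': 2.0, 'b': 12.0, 'c': 4.0}
print(largest)  # b
print(max(means))  # c  (largest label alphabetically, not the largest mean)
```

With pandas, the same result is usually obtained with `DataFrame.groupby("species")["sepal_length"].mean()`.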


# 5. Investigate Outlier Impact

### Why is the Mean Sensitive to Outliers?

The mean takes into account every value in a dataset. When an outlier (a value that is much larger or smaller than the rest) is present, it can significantly affect the mean:

- **Influence of Extreme Values:** Outliers pull the mean toward themselves, making the mean less representative of the majority of the data.
- **Example:** In the dataset `[13, 17, 19, 21, 23, 27, 29, 33, 40, 49]`, the value `49` is much larger than the other numbers. This single large value increases the mean noticeably compared to if it were not present.
- **Conclusion:** Because the mean is calculated using all values, it is highly sensitive to outliers. In datasets with extreme values, the mean may not accurately reflect the "typical" value.

In [27]:
import numpy as np
Data = [13, 17, 19, 21, 23, 27, 29, 33, 40, 49]
print(f"Mean before removing max value:{np.mean(Data)}")
Data.remove(max(Data))
print(f"Mean after removing max value:{np.mean(Data)}")

Mean before removing max value:27.1
Mean after removing max value:24.666666666666668
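To answer "how much did the mean change?", the difference can be computed directly. Removing a value $x$ from $n$ values with mean $\bar{x}$ changes the mean by $(\bar{x} - x)/(n - 1)$, so the further $x$ lies from the mean, the larger the shift. The sketch below checks this identity on `Data`; it works on a copy so the original list is not modified, and the lowercase variable names are illustrative.

```python
data = [13, 17, 19, 21, 23, 27, 29, 33, 40, 49]
n = len(data)
largest = max(data)

mean_before = sum(data) / n

# Work on a copy so the original list stays intact.
trimmed = data.copy()
trimmed.remove(largest)
mean_after = sum(trimmed) / len(trimmed)

change = mean_after - mean_before

# Predicted shift from the identity (mean - x) / (n - 1).
predicted = (mean_before - largest) / (n - 1)

print(f"Change in mean:{round(change, 3)}")        # Change in mean:-2.433
print(f"Predicted change:{round(predicted, 3)}")   # Predicted change:-2.433
```

Dropping a single value out of ten lowers the mean by about 2.43, roughly 9% of its original value.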


# 6. Justification

## Choosing a Measure of Central Tendency with Outliers

When a dataset contains clear outliers, the **median** is generally the preferred measure of central tendency.

### Justification
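- **Median** is the middle value when the data is ordered (or the average of the two middle values when the count is even). It depends only on the values' order, so extremely large or small values (outliers) barely affect it.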
- **Mean** is sensitive to outliers and can be pulled away from the center of the majority of the data.
- **Mode** is useful for categorical data or when the most frequent value is of interest, but may not represent the center for numerical data.

### Example
Consider the dataset: `Data = [13, 17, 19, 21, 23, 27, 29, 33, 40, 49]`
- **Mean:** $\frac{13 + 17 + 19 + 21 + 23 + 27 + 29 + 33 + 40 + 49}{10} = 27.1$
- **Median:** The middle values are 23 and 27, so $\frac{23 + 27}{2} = 25$

Here, the mean (27.1) is higher than the median (25) because the largest values, especially 49, pull the mean upward. The median better represents the typical value in the dataset.

**In summary:** For datasets with outliers, the median is a more robust and reliable measure of central tendency.

In [30]:
import numpy as np
Data = [13, 17, 19, 21, 23, 27, 29, 33, 40, 49]
mean_value = np.mean(Data)
median_value = np.median(Data)
print(f"Mean:{mean_value}")
print(f"Median:{median_value}")

Mean:27.1
Median:25.0
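The difference in robustness becomes clearer when the largest value is made more extreme. In the sketch below, `49` is replaced by a hypothetical `490` (an assumption made purely for illustration, not a value from the dataset): the mean jumps, while the median does not move.

```python
import statistics

data = [13, 17, 19, 21, 23, 27, 29, 33, 40, 49]

# Hypothetical variant: swap the largest value for a far more extreme one.
exaggerated = data[:-1] + [490]

print(f"Mean:{statistics.mean(data)}")             # Mean:27.1
print(f"Median:{statistics.median(data)}")         # Median:25.0
print(f"Mean:{statistics.mean(exaggerated)}")      # Mean:71.2
print(f"Median:{statistics.median(exaggerated)}")  # Median:25.0
```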
