# Mean without outliers (Optional Challenge)



📚 As you already know, the mean is defined by:

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_{n-1} + x_n}{n}$$

⚠️ However, a single outlier can pull the mean far away from the bulk of the observations.

💪 The median is a more robust measure of central tendency.
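To make the difference concrete, here is a minimal sketch comparing the two measures before and after adding one extreme value. The lists `values` and `values_with_outlier` are made up for illustration and are not part of the challenge.

```python
import numpy as np

# Hypothetical data, chosen only for illustration
values = [1, 2, 3, 4, 5]
values_with_outlier = values + [100]

mean_clean = np.mean(values)
mean_dirty = np.mean(values_with_outlier)
median_clean = np.median(values)
median_dirty = np.median(values_with_outlier)

# The mean jumps when the extreme value is added, the median barely moves
print(f"mean:   {mean_clean} -> {mean_dirty:.2f}")
print(f"median: {median_clean} -> {median_dirty}")
```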

🤔 But what if we could create a function `mean_without_outliers` to compute, as the name says, the mean without outliers?




## Preliminary step: defining `outliers`

This question implies a preliminary step: what is an `outlier`?

For each observation:

* `option 1:` We could consider that an outlier is an observation with a **`z-score`** below -3 or above 3, for example.
    - But this implies a strong assumption: you are assuming that your distribution is Gaussian.
    - We could also be stricter by replacing the 3-std limit with 2, or looser by replacing it with 4 or 5...

* `option 2:` We could use the definition of an outlier in a **`whisker boxplot`**, where an outlier is an observation that lies below `Q1 - 1.5 IQR` or above `Q3 + 1.5 IQR`
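As a rough sketch of how the two definitions translate into code, the snippet below flags outliers with NumPy boolean masks. The array `obs` and the names `is_outlier_z` and `is_outlier_box` are hypothetical, and the cutoff of 3 and the factor 1.5 are the conventions mentioned above rather than fixed rules.

```python
import numpy as np

# Hypothetical observations, for illustration only
obs = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 50])

# Option 1: z-score rule (an observation is flagged when |z| > 3)
z = (obs - obs.mean()) / obs.std()
is_outlier_z = np.abs(z) > 3

# Option 2: boxplot rule (flagged when outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR])
q1, q3 = np.percentile(obs, [25, 75])
iqr = q3 - q1
is_outlier_box = (obs < q1 - 1.5 * iqr) | (obs > q3 + 1.5 * iqr)

print(obs[is_outlier_z])
print(obs[is_outlier_box])
```

Both masks can then be negated (`~is_outlier_z`) to keep only the observations that are not flagged.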



In [1]:
import numpy as np
import pandas as pd

In [2]:
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

## Outliers defined by Z-score

### Draft

- For your sample, compute:
    - the mean
    - the standard deviation
    - the z-score of each observation
- Remove the outliers (observations with a z-score higher than your cutoff or lower than -cutoff)
- Compute the mean with the remaining elements

Once you are satisfied with your steps, you can wrap these steps up into a single function in the next section of this notebook.

In [None]:
# YOUR CODE HERE

### `mean_without_outliers_z_score`

In [8]:
def mean_without_outliers_z_score(elements):
    '''Return the mean of a list of elements without outliers, using the z-score.'''
    mean = np.mean(elements)
    std_dev = np.std(elements)  # population standard deviation (ddof=0)
    # z-score: number of standard deviations between an observation and the mean
    z_scores = [(x - mean) / std_dev for x in elements]
    cutoff = 3
    # keep only the observations whose z-score lies within [-cutoff, cutoff]
    filtered_elements = [x for x, z in zip(elements, z_scores) if -cutoff <= z <= cutoff]
    filtered_mean = np.mean(filtered_elements)
    return filtered_mean

mean_without_outliers = mean_without_outliers_z_score(sample)
print("Mean of Sample without Outliers:", mean_without_outliers)

Mean of Sample without Outliers: 5.5
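The z-score rule depends on the chosen cutoff, which the preliminary step notes could be stricter or looser than 3. One possible variant, sketched here under a hypothetical function name with a hypothetical `cutoff` argument, exposes that choice as a parameter. It also handles the case where every observation is identical, which would otherwise divide by a standard deviation of zero.

```python
import numpy as np

def mean_without_outliers_z_score_cutoff(elements, cutoff=3):
    """Hypothetical variant: mean of the elements whose |z-score| <= cutoff."""
    values = np.asarray(elements, dtype=float)
    std_dev = values.std()  # population standard deviation (ddof=0)
    if std_dev == 0:
        # All observations are identical: nothing can be an outlier
        return values.mean()
    z_scores = (values - values.mean()) / std_dev
    return values[np.abs(z_scores) <= cutoff].mean()

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
for cutoff in (2, 3, 4):
    print(cutoff, mean_without_outliers_z_score_cutoff(sample, cutoff))
```

With this sample, a cutoff of 4 is loose enough that the value 100 is kept, so the result falls back to the plain mean.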


## Outliers defined by the boxplot

### Draft

- For your sample, compute:
    - Q1
    - Q3
    - IQR
    - the lower bound Q1 - 1.5 IQR
    - the upper bound Q3 + 1.5 IQR
- Remove the outliers (observations that are lower than the lower bound or greater than the upper bound)
- Compute the mean with the remaining elements

Once you are satisfied with your steps, you can wrap these steps up into a single function in the next section of this notebook.

In [None]:
# YOUR CODE HERE

### `mean_without_outliers_boxplot`

In [9]:
def mean_without_outliers_boxplot(elements):
    '''Return the mean of elements without outliers, using the boxplot definition.'''
    # first and third quartiles
    Q1 = np.percentile(elements, 25)
    Q3 = np.percentile(elements, 75)

    # interquartile range and the two bounds beyond which a value is an outlier
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # keep only the observations that lie within the bounds
    filtered_elements = [x for x in elements if lower_bound <= x <= upper_bound]
    filtered_mean = np.mean(filtered_elements)

    return filtered_mean

mean_without_outliers = mean_without_outliers_boxplot(sample)
print("Mean of Sample without Outliers:", mean_without_outliers)

Mean of Sample without Outliers: 5.5
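To check where the boxplot bounds fall for this sample, the intermediate values can be inspected one by one. This is only a sketch: it assumes NumPy's default (linear) interpolation for `np.percentile`, and the name `flagged` is introduced here for illustration.

```python
import numpy as np

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

# Quartiles with NumPy's default linear interpolation
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Observations that fall outside the bounds
flagged = [x for x in sample if x < lower_bound or x > upper_bound]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("bounds:", lower_bound, upper_bound)
print("flagged as outliers:", flagged)
```

Other quartile conventions can give slightly different bounds on small samples, so the set of flagged values may depend on the interpolation method.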


## Comparisons

*Uncomment the following cell*

In [None]:
# data = {'method': ['mean', 'mean filtering by z-score', 'mean filtering by boxplot'],
#         'result': [np.mean(sample), mean_without_outliers_z_score(sample), mean_without_outliers_boxplot(sample)]}
# comparison_df = pd.DataFrame(data=data)
# round(comparison_df, 2)

👏 If you managed to finish this optional challenge, congrats!

💾 Do not forget to `git add/commit/push` your work!