# Mean without outliers (Optional Challenge)



📚 As you already know, the mean is defined by:

$$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_{n-1} + x_n}{n}$$

⚠️ However, a single outlier can pull the mean far away from the bulk of the data.

💪 The median is a more robust measure of central tendency.
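To make the difference concrete, here is a small sketch comparing the two measures on a toy list in which one value is far from the rest. The list `values` is an illustrative assumption chosen for this example.

```python
import numpy as np

# Hypothetical toy data: ten regular values and one extreme value
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

mean_value = np.mean(values)      # pulled upwards by the extreme value
median_value = np.median(values)  # middle value of the sorted list

print(f"mean   = {mean_value:.2f}")
print(f"median = {median_value:.2f}")
```

The mean lands above ten of the eleven values, while the median stays in the middle of them.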

🤔 But what if we could create a function `mean_without_outliers` that computes, as the name says, the mean without outliers?




## Preliminary step: defining `outliers`

This question requires a preliminary step: deciding what counts as an `outlier`.

For each observation:

* `option 1:` We could consider that an outlier is an observation with a **`z-score`** below -3 or above 3, for example.
    - This relies on a strong assumption: that your distribution is roughly Gaussian.
    - We could also be stricter by replacing the 3-standard-deviation limit with 2, or looser by replacing it with 4 or 5.

* `option 2:` We could use the definition of an outlier in a **`box-and-whisker plot`**, where an outlier is an observation that lies below `Q1 - 1.5 IQR` or above `Q3 + 1.5 IQR`, with `IQR = Q3 - Q1` the interquartile range.
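Written with the notation of the mean formula, and assuming $\sigma$ denotes the standard deviation of the observations and $c$ the chosen cutoff (3 in the example), the two rules can be stated as:

$$ z_i = \frac{x_i - \bar{x}}{\sigma}, \qquad x_i \text{ is an outlier if } |z_i| > c $$

$$ IQR = Q_3 - Q_1, \qquad x_i \text{ is an outlier if } x_i < Q_1 - 1.5 \, IQR \ \text{ or } \ x_i > Q_3 + 1.5 \, IQR $$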



In [None]:
import numpy as np
import pandas as pd

In [None]:
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

## Outliers defined by Z-score

### Draft

- For your sample, compute:
    - the mean
    - the standard deviation
    - the z-score of each observation
- Remove the outliers (observations with a z-score higher than your cutoff or lower than -cutoff)
- Compute the mean with the remaining elements

Once you are satisfied with your steps, you can wrap these steps up into a single function in the next section of this notebook.

In [None]:
sample_mean = np.mean(sample)
sample_mean

14.090909090909092

In [None]:
sample_std = np.std(sample)
sample_std

27.304526915561905
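A detail worth knowing: `np.std` divides by $n$ by default (`ddof=0`, the population standard deviation), whereas pandas' `Series.std` divides by $n-1$ (`ddof=1`, the sample standard deviation). The sketch below uses its own copy of the data under the assumed name `values` to show both. Which convention to use for z-scores is a choice; the steps in this notebook keep NumPy's default.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]  # same numbers as `sample`

population_std = np.std(values)           # ddof=0: divide by n
sample_std_np = np.std(values, ddof=1)    # ddof=1: divide by n - 1
sample_std_pd = pd.Series(values).std()   # pandas defaults to ddof=1

print(population_std, sample_std_np, sample_std_pd)
```

This is why a pandas summary of the same data reports a slightly larger standard deviation than `np.std`.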

In [None]:
sample_z_scores = (np.array(sample) - sample_mean) / sample_std  # convert the list so the subtraction is element-wise
sample_z_scores

array([-0.47944098, -0.44281701, -0.40619305, -0.36956909, -0.33294512,
       -0.29632116, -0.2596972 , -0.22307323, -0.18644927, -0.14982531,
        3.14633142])

In [None]:
cutoff = 3 # standard deviations

In [None]:
np.abs(sample_z_scores) >= cutoff

array([False, False, False, False, False, False, False, False, False,
       False,  True])

In [None]:
filtered = [x for x, z in zip(sample, sample_z_scores) if abs(z) < cutoff]
filtered

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [None]:
np.mean(filtered)

5.5
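The same filtering can also be written with a NumPy boolean mask instead of a list comprehension. This is only an alternative sketch; it redefines the data under new, hypothetical names (`values`, `z`, `keep`, `kept_values`) so that it runs on its own.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100])
cutoff = 3  # standard deviations

z = (values - values.mean()) / values.std()  # z-score of each observation
keep = np.abs(z) < cutoff                    # True for the observations we keep

kept_values = values[keep]                   # boolean indexing drops the outliers
print(kept_values.tolist())
print(kept_values.mean())
```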

### `mean_without_outliers_z_score`

In [None]:
def mean_without_outliers_z_score(elements):
    '''Return the mean of a list of elements without outliers, using the z-score.'''
    # $CHALLENGIFY_BEGIN
    # mean and std
    mu = np.mean(elements)
    sigma = np.std(elements)

    # z-scores (computed from the function argument, not from the global sample)
    z_scores = (np.array(elements) - mu) / sigma

    # keep only the elements whose z-score is strictly between -cutoff and +cutoff
    cutoff = 3
    filtered = [x for x, z in zip(elements, z_scores) if abs(z) < cutoff]

    return np.mean(filtered)
    # $CHALLENGIFY_END

In [None]:
mean_without_outliers_z_score(sample)

5.5
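As noted in the preliminary step, the 3-standard-deviation limit is a choice. One possible variation, shown here as a sketch rather than the expected solution, exposes it as a parameter. The name `mean_without_outliers_z_score_cutoff` and its `cutoff` argument are hypothetical, and the early return when the standard deviation is zero is an added assumption to avoid dividing by zero.

```python
import numpy as np

def mean_without_outliers_z_score_cutoff(elements, cutoff=3):
    """Hypothetical variant: mean of the elements whose |z-score| is below cutoff."""
    values = np.asarray(elements, dtype=float)
    sigma = values.std()
    if sigma == 0:
        # All values are identical: there is nothing to filter
        return values.mean()
    z_scores = (values - values.mean()) / sigma
    return values[np.abs(z_scores) < cutoff].mean()

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
print(mean_without_outliers_z_score_cutoff(data))            # default cutoff of 3
print(mean_without_outliers_z_score_cutoff(data, cutoff=4))  # looser cutoff: 100 is kept
```

With the looser cutoff, the value 100 (z-score of about 3.15) is no longer treated as an outlier, so the result falls back to the plain mean.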

## Outliers defined by the boxplot

### Draft

- For your sample, compute:
    - Q1
    - Q3
    - IQR
    - the lower bound Q1 - 1.5 IQR
    - the upper bound Q3 + 1.5 IQR
- Remove the outliers (observations that are lower than the lower bound or greater than the upper bound)
- Compute the mean with the remaining elements

Once you are satisfied with your steps, you can wrap these steps up into a single function in the next section of this notebook.

In [None]:
sample

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

In [None]:
pd.Series(sample).describe()

count     11.000000
mean      14.090909
std       28.637229
min        1.000000
25%        3.500000
50%        6.000000
75%        8.500000
max      100.000000
dtype: float64

In [None]:
q1 = pd.Series(sample).describe()["25%"]
q3 = pd.Series(sample).describe()["75%"]

In [None]:
iqr = q3 - q1
iqr

5.0
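The quartiles can also be obtained directly with `np.percentile`, whose default linear interpolation matches the values reported by `describe()` here. A minimal sketch on a standalone copy of the data (the variable names are illustrative):

```python
import numpy as np

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

# 25th and 75th percentiles in a single call
first_quartile, third_quartile = np.percentile(values, [25, 75])
interquartile_range = third_quartile - first_quartile

print(first_quartile, third_quartile, interquartile_range)
```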

In [None]:
lower_bound_outliers = q1 - 1.5 * iqr
lower_bound_outliers

-4.0

In [None]:
upper_bound_outliers = q3 + 1.5 * iqr
upper_bound_outliers

16.0

In [None]:
sample_array = np.array(sample)  # convert once so the comparisons are element-wise
filtered_indices = np.where(np.logical_and(sample_array >= lower_bound_outliers, sample_array <= upper_bound_outliers))[0]
filtered_indices

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
filtered_values = np.array(sample)[filtered_indices]
np.mean(filtered_values)

5.5

### `mean_without_outliers_boxplot`

In [None]:
def mean_without_outliers_boxplot(elements):
    '''Return the mean of elements without outliers, using the boxplot definition.'''
    # $CHALLENGIFY_BEGIN
    # convert once so that comparisons and indexing also work for plain lists
    values = np.array(elements)

    # statistics
    stats = pd.Series(values).describe()
    q1 = stats["25%"]
    q3 = stats["75%"]

    # computing the inter-quartile range and the bounds
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr

    # filtering: keep the values inside [lower, upper]
    filtered_values = values[(values >= lower) & (values <= upper)]

    return np.mean(filtered_values)
    # $CHALLENGIFY_END

In [None]:
mean_without_outliers_boxplot(sample)

5.5

## Comparisons

*Uncomment the following cell*

In [None]:
# data = {'method': ['mean', 'mean filtering by z-score', 'mean filtering by boxplot'],
#         'result': [np.mean(sample), mean_without_outliers_z_score(sample), mean_without_outliers_boxplot(sample)]}
# comparison_df = pd.DataFrame(data=data)
# round(comparison_df, 2)

| | method | result |
|---|---|---|
| 0 | mean | 14.09 |
| 1 | mean filtering by z-score | 5.5 |
| 2 | mean filtering by boxplot | 5.5 |
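On this sample both filters remove the same observation, but they do not always agree. With very few observations, a single extreme value inflates the standard deviation so much that its own z-score stays under 3, while the boxplot rule still flags it. The sketch below illustrates this on a hypothetical six-value list. The helper names `z_score_mean` and `boxplot_mean` are illustrative stand-ins for the two functions of this notebook, rewritten so that the block runs on its own.

```python
import numpy as np

def z_score_mean(elements, cutoff=3):
    """Mean after dropping observations with |z-score| >= cutoff."""
    values = np.asarray(elements, dtype=float)
    z_scores = (values - values.mean()) / values.std()
    return values[np.abs(z_scores) < cutoff].mean()

def boxplot_mean(elements):
    """Mean after dropping observations outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]."""
    values = np.asarray(elements, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    inside = (values >= q1 - 1.5 * iqr) & (values <= q3 + 1.5 * iqr)
    return values[inside].mean()

small_sample = [1, 2, 3, 4, 5, 100]  # hypothetical data

print(z_score_mean(small_sample))  # 100 is kept: its z-score is about 2.23
print(boxplot_mean(small_sample))  # 100 is removed by the boxplot rule
```

In this case the z-score filter returns the unfiltered mean, which is one reason the boxplot definition is often preferred for small samples.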


👏 If you managed to finish this optional challenge, congrats!

💾 Do not forget to `git add/commit/push` your work!