![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=Mathematics/CentralTendencyAndOutlier/outliers-and-central-tendency.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Effect of Outliers on Central Tendency

This notebook explores outliers and how they affect central tendency. Remember that the measures of central tendency are the mean, median, and mode of a data set. Check out [this other notebook](../CentralTendency/central-tendency.ipynb) for more information about central tendency.

<img style="float: right;" src="images/PunnyOutlier.jpg" width="400" height="300">

## What is an Outlier?

An outlier is a value that "lies outside" (is much smaller or larger than) most of the other values in a set of data.

Some outliers are genuine values, while others result from errors in data collection or organization.

Let's look at an example: <br>
Here is a data set: $26, 23, 27, 25, 28, 29, 24, 99$ <br>
Let's put it in order: $23, 24, 25, 26, 27, 28, 29, 99$ <br>
The value $99$ is much larger than all of the other points, so we can call $99$ an outlier.
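"Much larger" can also be checked with a calculation. The sketch below uses the 1.5 × IQR rule of thumb, which is one common convention and not the definition this notebook relies on. The names `lower_fence`, `upper_fence`, and `flagged` are chosen here for illustration.

```python
from statistics import quantiles

dataset = [26, 23, 27, 25, 28, 29, 24, 99]

# Split the data into quarters; q1 and q3 are the first and third quartiles
q1, _, q3 = quantiles(dataset, n=4)
iqr = q3 - q1  # interquartile range: the spread of the middle half of the data

# Assumed rule of thumb: flag anything more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged = [value for value in dataset if value < lower_fence or value > upper_fence]
print('flagged as possible outliers:', flagged)
```

A rule like this only flags candidates. Whether a flagged value really is out of place still depends on what the data represents.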

## Central Tendency

Let's calculate the central tendency for this example:

In [None]:
dataset = [26, 23, 27, 25, 28, 29, 24, 99]
from statistics import mean, median, multimode
print('mean:', mean(dataset))
print('median:', median(dataset))
print('mode:', multimode(dataset))

`multimode` lists every value here because each one appears exactly once. Since no value is repeated, this data set has no mode.

Then we will remove the outlier ($99$) and recalculate:

In [None]:
dataset = [26, 23, 27, 25, 28, 29, 24]
print('mean:', mean(dataset))
print('median:', median(dataset))
print('mode:', multimode(dataset))

What changes do you notice? What changed the most?
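One way to check your answer is to subtract the two results directly. This is a small sketch using the same eight values. The names `with_outlier`, `without_outlier`, `mean_change`, and `median_change` are introduced here only for illustration.

```python
from statistics import mean, median

with_outlier = [26, 23, 27, 25, 28, 29, 24, 99]
# Build the second list by dropping the outlier (99)
without_outlier = [value for value in with_outlier if value != 99]

# How far does each measure move when the outlier is removed?
mean_change = mean(with_outlier) - mean(without_outlier)
median_change = median(with_outlier) - median(without_outlier)
print('change in mean:', mean_change)
print('change in median:', median_change)
```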

## A Bigger Set of Data

Let's look at some data about the sodium (salt) content in different common foods to see if we can find any outliers.

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/Mathematics/CentralTendencyAndOutlier/data/example-data.csv')
df

Now let's calculate the mean, median, and mode:

In [None]:
print('mean:', df['Sodium Content'].mean())
print('median:', df['Sodium Content'].median())
print('mode:', df['Sodium Content'].mode()[0])  # the [0] means we want just the first value, although there is only one

We can visualize this dataset as a histogram, which allows us to group together foods that have a similar sodium content into "bins". You can control how many bins are on the graph by using the slider below.

Can you tell if there's an outlier when there are only a couple of bins? How about when there are a lot of bins?

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
steps = []
for i in range(15):
    fig.add_trace(go.Histogram(x=df['Sodium Content'], nbinsx=i+1, visible=False))
    visible_list = [False] * 15
    visible_list[i] = True
    steps.append(dict(label=i+1, method='restyle', args=[{'visible':visible_list}]))
fig.data[0]['visible'] = True
sliders = [dict(active=0, currentvalue={'prefix': 'Number of Bins: '}, steps=steps)]
fig.update_layout(sliders=sliders, title='Interactive Histogram of Sodium Content in 30 Products from Australian Supermarkets')
fig.show()

When there are a lot of bins, it's really easy to see that there's an outlier. If we look at the data, we know that's soy sauce. So let's remove it and see how the central tendency is affected.

In [None]:
df2 = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/Mathematics/CentralTendencyAndOutlier/data/example-data-no-outlier.csv')
fig2 = go.Figure()
steps2 = []
for i in range(15):
    fig2.add_trace(go.Histogram(x=df2['Sodium Content'], nbinsx=i+1, visible=False))
    visible_list = [False] * 15
    visible_list[i] = True
    steps2.append(dict(label=i+1, method='restyle', args=[{'visible':visible_list}]))
fig2.data[0]['visible'] = True
sliders2 = [dict(active=0, currentvalue={'prefix': 'Number of Bins: '}, steps=steps2)]
fig2.update_layout(sliders=sliders2, title='Interactive Histogram of Sodium Content in 30 Products from Australian Supermarkets (without outlier)')
fig2.show()

In [None]:
print('original mean:', df['Sodium Content'].mean())
print('new mean:', df2['Sodium Content'].mean())
print('original median:', df['Sodium Content'].median())
print('new median:', df2['Sodium Content'].median())
print('original mode:', df['Sodium Content'].mode()[0])
print('new mode:', df2['Sodium Content'].mode()[0])

Excluding soy sauce has a large effect on the mean, because soy sauce is much saltier than the other foods.

Should soy sauce be excluded though? It is a common food that many people eat, so it's not out of place in the data. To be called an outlier, a data point should be out of place in your data set, not just a large or small value.

Let's look at a scatterplot of another data set with obvious outliers, and then with the outliers removed:

In [None]:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2.5, 4.8, 25, 9.6, 12, 14.4, 22.8, 19, 21.7]

import plotly.express as px
px.scatter(x=x, y=y, title='Scatterplot with Trendline and Outliers', trendline='ols').show()
for i in [6, 2]:  # remove the 7th and 3rd values, starting from the higher index to avoid reindexing errors
    x.pop(i)
    y.pop(i)
px.scatter(x=x, y=y, title='Scatterplot with Trendline without Outliers', trendline='ols').show()

We can see that removing those outliers had a large effect on the position of the [trendline](https://simple.wikipedia.org/wiki/Linear_regression).
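To put numbers on that change, here is a minimal sketch that fits a least-squares line by hand, with and without the two outliers. The helper `fit_line` and the `_all` / `_kept` variable names are hypothetical, written for this illustration rather than taken from Plotly.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

x_all = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y_all = [2.5, 4.8, 25, 9.6, 12, 14.4, 22.8, 19, 21.7]

# Keep every point except the 3rd and 7th (indices 2 and 6), as in the cell above
x_kept = [x for i, x in enumerate(x_all) if i not in (2, 6)]
y_kept = [y for i, y in enumerate(y_all) if i not in (2, 6)]

slope_all, intercept_all = fit_line(x_all, y_all)
slope_kept, intercept_kept = fit_line(x_kept, y_kept)
print('with outliers:    slope', round(slope_all, 2), 'intercept', round(intercept_all, 2))
print('without outliers: slope', round(slope_kept, 2), 'intercept', round(intercept_kept, 2))
```

With the outliers removed, the line gets steeper and starts much closer to zero, which matches what the two scatterplots show.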

We use measures of central tendency to communicate important information about a data set without describing every value in the set. Usually the mean (or a trendline) is a good representation, since it is affected by all the points, but that's not always a good thing. If there are outliers in the data, the median might be a more accurate representation of the middle of the data set.

For the food data, if we don't eat soy sauce often, then it would be an outlier and the mean value would be less accurate than the median. If we use the median, we can include data about soy sauce and it would still be an accurate representation of the salt we eat.

## Conclusion

In this notebook, we learned how an outlier affects central tendency. 

When an outlier is added to (or removed from) a data set:
* mean changes the most 
* median changes a little bit
* mode doesn't change unless there are multiple outliers with the same value

Also, not all larger or smaller values should be called outliers and excluded from data. 
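The three bullet points above can be checked with a small sketch. The values below are hypothetical, chosen so that one value is repeated and the data set has a mode. They are not taken from the sodium data.

```python
from statistics import mean, median, multimode

# Hypothetical data, chosen so that one value (4) is repeated
small_set = [3, 4, 4, 5, 6]
with_outlier = small_set + [50]  # 50 is an assumed outlier

# Print each measure before and after the outlier is added
for name, measure in [('mean', mean), ('median', median), ('mode', multimode)]:
    print(name, 'before:', measure(small_set), 'after:', measure(with_outlier))
```

Try changing `50` to other values, or adding it twice, to see when the mode starts to change.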

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)