![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fcallysto%2Fcurriculum-notebooks&branch=master&subPath=Mathematics/CentralTendencyAndOutlier/outliers-and-central-tendency.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a>

# Effect of Outliers on Central Tendency

This notebook focuses on what an outlier is and how it affects central tendency. Remember that central tendency refers to the mean, median, and mode of a data set. If you need a review of central tendency, check out the previous notebook on that topic.

Things you will learn in this notebook:
* What an outlier is
* How an outlier affects mean
* How an outlier affects median
* How an outlier affects mode
* Why outliers can be a problem

<img style="float: right;" src="images/PunnyOutlier.jpg" width="400" height="300">

## What is an outlier?

An outlier is a value that "lies outside" (is much smaller or larger than) most of the other values in a set of data.

Let's look at an example: <br>
Here is a data set: $26, 23, 27, 25, 28, 29, 24, 99$ <br>
Let's put it in order: $23, 24, 25, 26, 27, 28, 29, 99$ <br> 
$99$ is larger than all the other data, by a lot. <br> 
Therefore, we can call $99$ an outlier.

An outlier can also be a data point that is out of place, or one that might be the result of a mistake made when the data was collected.
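
One informal way to see how far a value "lies outside" the rest is to measure how far each value is from the median. The sketch below does this for the example data. The function name `distances_from_median` is made up for illustration, and this is only one way of looking at the data, not an official test for outliers.

```python
import statistics

def distances_from_median(values):
    """Return (value, distance from the median) pairs, sorted by value."""
    middle = statistics.median(values)
    return [(v, abs(v - middle)) for v in sorted(values)]

data = [26, 23, 27, 25, 28, 29, 24, 99]

for value, distance in distances_from_median(data):
    print(value, "is", distance, "away from the median")
```

Every value is within 3.5 of the median except 99, which is 72.5 away.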

## Central tendency

First let's look at the central tendency of the example above. 

Mean = $\frac{23+24+25+26+27+28+29+99}{8} = \frac{281}{8} = 35.125$. <br>
There are eight values, so the median is halfway between the two middle values, 26 and 27, which is 26.5. <br>
There is no mode because no value appears more than once.

Then we will remove the outlier (99) and recalculate the central tendency.

Mean = $\frac{23+24+25+26+27+28+29}{7} = \frac{182}{7} = 26$. <br>
There are now seven values, so the median is the middle value, 26. <br>
There is no mode because no value appears more than once.
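
The calculations above can be checked with Python's built-in `statistics` module. This is a minimal sketch: the helper names `find_modes` and `central_tendency` are made up for illustration, and `find_modes` returns an empty list when no value repeats, to match the "no mode" answer above.

```python
import statistics
from collections import Counter

def find_modes(values):
    """Return the most repeated value(s), or an empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value appears more than once, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)

def central_tendency(values):
    """Return the mean, the median, and the list of modes as a tuple."""
    return statistics.mean(values), statistics.median(values), find_modes(values)

with_outlier = [23, 24, 25, 26, 27, 28, 29, 99]
without_outlier = [v for v in with_outlier if v != 99]

print("with 99:   ", central_tendency(with_outlier))
print("without 99:", central_tendency(without_outlier))
```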

What changes do you notice? What changed the most?

## A bigger set of data

Let's look at some data about the sodium content (amount of salt) in different common foods. Is there an outlier?

In [None]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/Mathematics/CentralTendencyAndOutlier/data/example-data.csv')
df

Now let's calculate the mean, median, and mode.

In [None]:
print('mean:', df['Sodium Content'].mean())
print('median:', df['Sodium Content'].median())
print('mode:', df['Sodium Content'].mode()[0])  # the [0] means we want just the first value, although there is only one

We can visualize this dataset as a histogram, which groups foods with a similar sodium content into "bins". You can control how many bins are on the graph by using the slider below. Watch how the graph changes.

Can you tell if there's an outlier when there are only a couple of bins? How about when there are a lot of bins?
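
To get a feel for what the number of bins does, here is a rough sketch that counts values in equal-width bins by hand. It uses the small example data set from earlier rather than the sodium data, the function name `bin_counts` is made up for illustration, and it assumes the values are not all the same. It is a simplification and not how Plotly chooses its bins.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The largest value belongs in the last bin, not one past the end.
        index = min(int((v - lowest) / width), n_bins - 1)
        counts[index] += 1
    return counts

data = [23, 24, 25, 26, 27, 28, 29, 99]

for n_bins in (2, 4, 19):
    print(n_bins, "bins:", bin_counts(data, n_bins))
```

With more bins, a run of empty bins opens up between the main group and 99, which is what makes the outlier stand out.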

In [None]:
import plotly.express as px
px.histogram(df, x='Sodium Content', title='Histogram of Sodium Content in 30 Products from Australian Supermarkets', nbins=20)

In [None]:
import plotly.graph_objects as go
fig = go.Figure()
steps = []
for i in range(15):
    fig.add_trace(go.Histogram(x=df['Sodium Content'], nbinsx=i+1, visible=False))
    visible_list = [False] * 15
    visible_list[i] = True
    steps.append(dict(label=i+1, method='restyle', args=[{'visible':visible_list}]))
fig.data[0]['visible'] = True
sliders = [dict(active=0, currentvalue={'prefix': 'Number of Bins: '}, steps=steps)]
fig.update_layout(sliders=sliders, title='Interactive Histogram of Sodium Content in 30 Products from Australian Supermarkets')
fig.show()

When there are a lot of bins, it's really easy to see that there's an outlier. If we look at the data, we know that's soy sauce. So let's remove it and see how the central tendency is affected.

In [None]:
df2 = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/Mathematics/CentralTendencyAndOutlier/data/example-data-no-outlier.csv')
fig2 = go.Figure()
steps2 = []
for i in range(15):
    fig2.add_trace(go.Histogram(x=df2['Sodium Content'], nbinsx=i+1, visible=False))
    visible_list = [False] * 15
    visible_list[i] = True
    steps2.append(dict(label=i+1, method='restyle', args=[{'visible':visible_list}]))
fig2.data[0]['visible'] = True
sliders2 = [dict(active=0, currentvalue={'prefix': 'Number of Bins: '}, steps=steps2)]  # use steps2, the steps built for fig2
fig2.update_layout(sliders=sliders2, title='Interactive Histogram of Sodium Content in 30 Products from Australian Supermarkets (without outlier)')
fig2.show()

Should soy sauce be excluded, though? It is a common food that many people own, so it's not out of place in the data. It's just very salty.

To be called an outlier, a data point should be out of place in your data set, not just a large or small value.

-------

#### Now try adding your own data points into this set. 

Try adding really big and/or really small numbers. You can also add many repeating numbers to change the mode. <br>
**Restart**: press to remove all your outliers to start again.<br>
**Add Outlier**: press to add a new outlier. You can add as many as you want. (The value has to be between 0 and 10,000.) <br>
**Calculate**: press to calculate the central tendency and see a histogram of the dataset with your outlier(s).<br>
**Compare**: press to see the central tendency before and after your outliers are added. How do your outliers change the central tendency?

In [None]:
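# Imports needed by this cell. The names dataWithoutOutlier, centralTendencyWithoutOutlier,
# computeCenTendency, plotHistogram, and Table are assumed to be defined in an earlier setup cell.
import IPython.display
import ipywidgets as widgets
from ipywidgets import interact
import numpy as np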
datasetWithOutlier = dataWithoutOutlier.copy()

add = widgets.Button(description='Add Outlier',disabled=False,button_style='')
remove = widgets.Button(description='Restart',disabled=False,button_style='')
outlier = widgets.BoundedIntText(value=None,max=10000,description='Outlier:',disabled=False)
compute = widgets.Button(description='Calculate',disabled=False,button_style='')
compare = widgets.Button(description='Compare',disabled=False,button_style='')

def displayButtons():
    IPython.display.clear_output()
    IPython.display.display(remove)
    IPython.display.display(add)
    IPython.display.display(compute)
    IPython.display.display(compare)

def addOutlier(a):
    global datasetWithOutlier
    IPython.display.clear_output()
    IPython.display.display(remove)
    IPython.display.display(add)
    IPython.display.display(outlier)
    IPython.display.display(compute)
    IPython.display.display(compare)
    datasetWithOutlier.append(outlier.value)
    print("Current Dataset:")
    print(datasetWithOutlier)

def showOutlier(a):
    IPython.display.clear_output()
    IPython.display.display(remove)
    IPython.display.display(add)
    IPython.display.display(outlier)
    IPython.display.display(compute)
    IPython.display.display(compare)
    outlier.observe(addOutlier, 'value')
    print("Current Dataset:")
    print(datasetWithOutlier)
    
def reset(a):
    global datasetWithOutlier  # without this, the assignment below would only create a local variable
    displayButtons()
    datasetWithOutlier = dataWithoutOutlier.copy()
    print("Current Dataset:")
    print(datasetWithOutlier)
    
def callPlottingFunctionWithOutlier(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(datasetWithOutlier, num_bins, 'Sodium Content', 'values', 'Histogram with Outliers')

def calculate(a):
    global centralTendencyWithOutlier
    centralTendencyWithOutlier = computeCenTendency(datasetWithOutlier)
    print(centralTendencyWithOutlier)
    interact(callPlottingFunctionWithOutlier, num_bins=widgets.IntSlider(min=1,max=100,step=1,value=50));  # a histogram needs at least 1 bin
    
def compareTendencies(a):
    centralTendencyWithoutOutlierArray = np.around(np.asarray(centralTendencyWithoutOutlier), 3)
    centralTendencyWithOutlierArray = np.around(np.asarray(centralTendencyWithOutlier), 3) 
    arr = { 'Central Tendency ':  np.array(['Mean','Median','Mode' ]),
            'Before adding outlier ':  np.array([centralTendencyWithoutOutlierArray[0],  centralTendencyWithoutOutlierArray[1], 
                                            centralTendencyWithoutOutlierArray[2] ]),
            'After adding outlier ': np.array([centralTendencyWithOutlierArray[0], centralTendencyWithOutlierArray[1],
                                           centralTendencyWithOutlierArray[2]])}
    print(Table(arr))
    

displayButtons()
print("Current Dataset:")
print(datasetWithOutlier)
compare.on_click(compareTendencies)
compute.on_click(calculate)
add.on_click(showOutlier)
remove.on_click(reset)


### Effects of Outliers 

##### Mean
From the above results and histogram, we see that adding outliers can change the mean dramatically.<br>
This is because we need to add all the values to determine the mean value.<br>
Outliers are often values that are much larger or smaller than the other values.<br>
When we add these values to the sum, the average can change a lot.
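
The size of that change can be worked out directly: adding one value $x$ to $n$ values with mean $m$ gives a new mean of $\frac{n \cdot m + x}{n + 1}$. Here is a small sketch using the seven values from the first example (the function name `mean_after_adding` is made up for illustration):

```python
def mean_after_adding(old_mean, n, new_value):
    """Mean of a data set after one more value is added to its n values."""
    return (n * old_mean + new_value) / (n + 1)

old_mean, n = 26, 7  # mean and size of 23, 24, 25, 26, 27, 28, 29

for new_value in (26, 99, 999):
    print("add", new_value, "-> new mean", mean_after_adding(old_mean, n, new_value))
```

The further the new value is from the old mean, the further the mean gets pulled towards it.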

##### Median
For the median, we need to consider the middle number(s).<br>
Adding one outlier adds a single point at the far end of our data set, so the middle of the ordered data moves over by at most one position. <br>
The median might change a little, or it might not change at all.
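
A quick sketch of this idea, again using the seven values from the first example. The function name `median_after_adding` is made up for illustration. Notice that the size of the outlier makes no difference, only which end of the data it lands on.

```python
import statistics

def median_after_adding(values, new_value):
    """Median of the data set once new_value has been added to it."""
    return statistics.median(values + [new_value])

data = [23, 24, 25, 26, 27, 28, 29]
print("original median:", statistics.median(data))

for outlier in (99, 999, 1_000_000, -500):
    print("add", outlier, "-> median", median_after_adding(data, outlier))
```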

##### Mode
The mode is very unlikely to change, because the mode is the most repeated value.<br>
Unless you add many outliers, all with the same value, the mode probably won't change.<br>
And if there are multiple outliers with the same value, then maybe they should be kept in the dataset, because they might be important.
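
Here is a small sketch of both cases. The values are made up for illustration (they are not from the sodium data), and so is the function name `most_common_value`.

```python
from collections import Counter

def most_common_value(values):
    """Return the value that appears most often (ties go to the first one seen)."""
    return Counter(values).most_common(1)[0][0]

data = [40, 55, 55, 55, 60, 70]  # made-up values; 55 appears three times

one_outlier = data + [900]             # a single outlier
many_same_outliers = data + [900] * 4  # four outliers with the same value

print("original mode:         ", most_common_value(data))
print("one outlier added:     ", most_common_value(one_outlier))
print("four identical added:  ", most_common_value(many_same_outliers))
```

Only the last case changes the mode, because 900 now appears more often than 55.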


### What should we use to represent data?

The reason we use central tendency is to share important information about a data set without listing every value in it. Usually the mean is used to represent data because it is affected by every point, but that's not always a good thing. If there's an outlier in the data, the mean may not be the best way to represent the middle of the data. The median might be a more truthful representation of the middle in that case. 

In the food data, we might want to represent how much sodium is in the food we eat nearly every day. If we don't eat soy sauce often, then it would be an outlier. In that case, the mean tells us we eat a lot more sodium every day than the median does. If we use the mean to represent this data, someone could argue that we eat too much sodium, but that's not true, since we don't eat soy sauce often. If we use the median instead, we can still include soy sauce in the data and represent what we eat more truthfully.

## Conclusion

In this notebook, we learned how an outlier affects central tendency. 

When an outlier is added to (or removed from) a data set:
* Mean changes the most 
* Median changes a little bit, or not at all
* Mode doesn't change unless there are multiple outliers with the same value

Also, not all very large or very small values should be called outliers and excluded from the data. Whether a value is an outlier depends on the context of the data. 

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)