![alt text][top-banner]

[top-banner]: ./callysto-top-banner.jpg

In [37]:
%%html

<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [38]:
import csv
from collections import Counter

import IPython
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from astropy.table import Table
from ipywidgets import interact, widgets

# Effect of Outliers on Central Tendency

This notebook focuses on what an outlier is and how it affects central tendency. Remember that central tendency refers to the mean, median, and mode of a data set. If you need a review of central tendency, check out this previous notebook.

Things you will learn in this notebook:
* What an outlier is
* How an outlier affects mean
* How an outlier affects median
* How an outlier affects mode
* Why outliers can be a problem

<img style="float: right;" src="CC66_materials/PunnyOutlier.jpg" width="400" height="300">

## What is an outlier?

An outlier is a value that "lies outside" (is much smaller or larger than) most of the other values in a set of data.

Let's look at an example: <br>
Here is a data set: $26, 23, 27, 25, 28, 29, 24, 99$ <br>
Let's put it in order: $23, 24, 25, 26, 27, 28, 29, 99$ <br>
$99$ is larger than all the other data, by a lot. <br> 
Therefore, we can call $99$ an outlier.

An outlier is also data that is out of place, or that might be a mistake when it was collected.

## Central tendency

First let's look at the central tendency of the example above. 

Mean = $\frac{23+24+25+26+27+28+29+99}{8} = \frac{281}{8} = 35.125$. <br>
The median is found between 26 and 27, so the median is 26.5. <br>
There is no mode because no value appears more than once.

Then we will remove the outlier (99) and recalculate the central tendency.

Mean = $\frac{23+24+25+26+27+28+29}{7} = \frac{182}{7} = 26$. <br>
The median is the middle number so the median is 26. <br>
There is still no mode because no value appears more than once.
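
These hand calculations can be double-checked in Python. The sketch below uses the standard `statistics` module; the helper name `central_tendency` is our own and is not part of this notebook's code.

```python
from statistics import mean, median, multimode

def central_tendency(values):
    """Return (mean, median, mode); mode is None when no single value appears most often."""
    modes = multimode(values)  # every value tied for most frequent
    mode = modes[0] if len(modes) == 1 else None
    return mean(values), median(values), mode

with_outlier = [23, 24, 25, 26, 27, 28, 29, 99]
without_outlier = [v for v in with_outlier if v != 99]

print("With 99:   ", central_tendency(with_outlier))
print("Without 99:", central_tendency(without_outlier))
```

Running it gives a mean of 35.125 and a median of 26.5 with the outlier, and a mean and median of 26 without it, matching the work above.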

What changes do you notice? What changed the most?

### Try it yourself

Here are the ratings of a new restaurant out of 10:

$4.5, 7, 15, 3.5, 6, 5, 9, 10, 1$

In [39]:
answer622 = widgets.RadioButtons(options=['Yes', 'No'], value = None)
answernext = widgets.RadioButtons(options=['1', '10', '15', '-1'], value = None)

question622 = 'Is there an outlier?'
questionnext = 'Which value is the outlier?'

def checknext(d):
    IPython.display.clear_output(wait=False)
    display(question622, answer622)
    display(questionnext, answernext)
    if answernext.value == '15':
        print("You're right! 15 is an outlier because the ratings are only supposed to be out of 10.")
    else:
        print("Not quite, try again.")


def check622(d):
    IPython.display.clear_output(wait=False)
    display(question622, answer622)
    if answer622.value == 'Yes':
        print("Correct!")
    else:
        print("Actually there is an outlier.")
    display(questionnext, answernext)
    answernext.observe(checknext, 'value')
        
IPython.display.clear_output(wait=False)
display(question622,answer622 )
answer622.observe(check622, 'value')

Is there an outlier?


RadioButtons(options=('Yes', 'No'), value=None)

In [40]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question622">
 <button id = "622"
onclick="toggle('answer622');">Solution</button> 
</div>
<div style="display:none" id="answer622">
An outlier is a value that does not seem to fit with the rest of the data. <br/>
In this case, we are looking at ratings that should be between 0 and 10. <br/>
All the values in this data set are within that range except for 15. <br/>
So 15 must be the outlier. <br/>

-1 would also be an outlier if it were in the data set, but it is not, so it is not the right answer. <br/>

</div>

</body>
</html>
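
The reasoning in the solution (a rating out of 10 cannot be above 10) can be written as a simple range check. This is a minimal sketch; the names `LOWEST`, `HIGHEST`, and `impossible` are our own.

```python
ratings = [4.5, 7, 15, 3.5, 6, 5, 9, 10, 1]
LOWEST, HIGHEST = 0, 10  # a rating out of 10 must fall in this range

# Keep only the ratings that fall outside the allowed range
impossible = [r for r in ratings if not (LOWEST <= r <= HIGHEST)]
print(impossible)  # [15]
```

A check like this only works when we know the valid range in advance, as we do for ratings.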

## A bigger set of data

Let's look at some data about the sodium content (amount of salt) in different common foods. Is there an outlier?

<img style="float: center;" src="CC66_materials/walking-food.jpg" width="500" height="200">

In [41]:
df = pd.read_csv('CC66_materials/exampleDataSource.csv')
print(df)

                             Product   Sodium content
0             Tomato sauce (Brand A)              871
1             Tomato sauce (Brand B)             1250
2                          Soy sauce             6458
3                    Tinned tomatoes              270
4                      Four bean mix              250
5                       Corn kernels              205
6                 Mushrooms (tinned)              340
7                       Tomato paste              843
8                 Chunky pasta sauce              482
9                      Chicken stock              521
10                        Beef stock              450
11                           Peanuts              360
12               Beef strogonoff mix              780
13            Dry roasted macadamias              340
14                           Cashews              510
15                            Butter              460
16                  Sunflower spread              380
17                       Bak

In [42]:
data = []
with open('CC66_materials/exampleData.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(int(row[1]))

Now let's calculate the mean, median, and mode using the computer.

In [43]:
def computeCenTendency(dataset):

    # Mean value
    mean = np.mean(dataset)
    print("Mean: ", round(mean, 3))

    # Median value
    # np.median sorts a copy of the data, so the original list is left unchanged
    median = np.median(dataset)
    print("Median: ", round(median, 3))

    # Mode
    # Tally how often each value appears, as (number of hits, value) pairs,
    # then sort so the most frequent value comes first
    hits = sorted(((tally, item) for item, tally in Counter(dataset).items()), reverse=True)
    # There is a mode only if one value appears more often than every other value
    if len(hits) == 1 or hits[0][0] > hits[1][0]:
        print("Mode:", round(hits[0][1], 3), "(appeared", hits[0][0], "times.)")
    else:
        print("There is no mode")

    # The most frequent value is returned even when it is tied with another value
    return mean, median, hits[0][1]

centralTendency = []
centralTendency = computeCenTendency(data)

Mean:  508.857
Median:  459.0
Mode: 340 (appeared 2 times.)


Now let's show this data in a graph. But not just any graph. We will display it in a histogram. This way we can group together foods that have a similar sodium content into "bins". You can control how many bins are on the graph by using the slider below. Watch how the graph changes.

Can you tell if there's an outlier when there are only a couple of bins? How about when there are a lot of bins?

In [44]:
def plotHistogram(x_values, num_bins, xLabel, yLabel, histTitle):
    n, bins, patches = plt.hist(x_values, num_bins, facecolor='blue', alpha=0.5)

    plt.xlabel(xLabel)
    plt.ylabel(yLabel)
    plt.title(histTitle)
    plt.show()
    
def callPlottingFunction(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(data, num_bins , 'Sodium Content', 'values', 'Histogram of 30 products in Australian supermarkets')

# A histogram needs at least one bin, so the slider starts at 1
interact(callPlottingFunction, num_bins=widgets.IntSlider(min=1,max=100,step=1,value=50));

interactive(children=(IntSlider(value=50, description='num_bins'), Output()), _dom_classes=('widget-interact',…
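
The effect of the number of bins can also be checked with numbers instead of a picture. The sketch below is an illustration under an assumption: it takes the data set to be the 28 sodium values printed later in this notebook plus the soy sauce value (6458) from the table above. The helper name `describe_bins` is our own.

```python
import numpy as np

# Assumed data: the 28 values printed later in this notebook, plus soy sauce (6458)
sodium = [871, 1250, 270, 250, 205, 340, 843, 482, 521, 450, 360, 780, 340, 510,
          460, 380, 290, 458, 335, 355, 590, 600, 595, 547, 1036, 410, 190, 530]
sodium_with_soy_sauce = sodium + [6458]

def describe_bins(values, num_bins):
    """Count how many values land in each bin and how many bins are empty."""
    counts, _edges = np.histogram(values, bins=num_bins)
    return counts, int(np.sum(counts == 0))

for num_bins in (2, 50):
    counts, empty = describe_bins(sodium_with_soy_sauce, num_bins)
    print(num_bins, "bins ->", empty, "empty bins; last bin holds", int(counts[-1]), "value(s)")
```

With 50 bins, a long run of empty bins separates the last value from all the others, which is what makes it stand out in the plot.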

When there are a lot of bins, it's really easy to see that there's an outlier. If we look at the data, we know that's soy sauce. So let's remove it and see how the central tendency is affected.

In [45]:
dataWithoutOutlier = []
with open('CC66_materials/exampleData_NoOutlier.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        dataWithoutOutlier.append(int(row[1]))
print("Sodium content read from the file: ", dataWithoutOutlier)

centralTendencyWithoutOutlier = []
centralTendencyWithoutOutlier = computeCenTendency(dataWithoutOutlier)

def callPlottingFunctionNoOutlier(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(dataWithoutOutlier, num_bins , 'Sodium Content', 'values', 'Histogram of 30 products in Australian supermarkets')

# A histogram needs at least one bin, so the slider starts at 1
interact(callPlottingFunctionNoOutlier, num_bins=widgets.IntSlider(min=1,max=100,step=1,value=50));

Sodium content read from the file:  [871, 1250, 270, 250, 205, 340, 843, 482, 521, 450, 360, 780, 340, 510, 460, 380, 290, 458, 335, 355, 590, 600, 595, 547, 1036, 410, 190, 530]
Mean:  508.857
Median:  459.0
Mode: 340 (appeared 2 times.)


interactive(children=(IntSlider(value=50, description='num_bins'), Output()), _dom_classes=('widget-interact',…

Should soy sauce be excluded though? It is a common food that many people own, so it's not out of place in the data. It's just very salty.

**An outlier should really be out of place in your data set, not just a large or small value.**

-------

#### Now try adding your own data points into this set. 

Try adding really big and/or really small numbers. You can also add many repeating numbers to change the mode. Just press the button below to add a new outlier. You can add as many as you want.

In [46]:
# Work on a copy so the data without outliers stays unchanged
datasetWithOutlier = list(dataWithoutOutlier)

add = widgets.Button(description='Add Outlier',disabled=False,button_style='')
outlier = widgets.IntText(value=None,description='Outlier:',disabled=False)

def addOutlier(a):
    # Runs whenever the number in the text box changes
    IPython.display.clear_output()
    IPython.display.display(add)
    IPython.display.display(outlier)
    datasetWithOutlier.append(outlier.value)
    print(datasetWithOutlier)
    print("You can press the button Add Outlier again to add even more outliers!")

def showOutlier(a):
    # Runs when the button is pressed: show the text box for the new value
    IPython.display.clear_output()
    IPython.display.display(add)
    IPython.display.display(outlier)

# Register the observer only once, so each new value is added a single time
outlier.observe(addOutlier, 'value')

IPython.display.display(add)
add.on_click(showOutlier)

Button(description='Add Outlier', style=ButtonStyle())

Now we will compute the central tendency and produce a histogram of the data set with your outliers added.

In [47]:
compute = widgets.Button(description='Calculate',disabled=False,button_style='')
global centralTendencyWithOutlier

def callPlottingFunctionWithOutlier(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(datasetWithOutlier, num_bins, 'Sodium Content', 'values', 'Histogram with Outliers')

def calculate(a):
    global centralTendencyWithOutlier
    centralTendencyWithOutlier = computeCenTendency(datasetWithOutlier)
    print(centralTendencyWithOutlier)
    # A histogram needs at least one bin, so the slider starts at 1
    interact(callPlottingFunctionWithOutlier, num_bins=widgets.IntSlider(min=1,max=100,step=1,value=50));
    
IPython.display.display(compute)
compute.on_click(calculate)

Button(description='Calculate', style=ButtonStyle())

Click the Compare button to see the central tendency before your outliers were added next to the central tendency after they were added. How do your outliers change the central tendency?

In [48]:
compare = widgets.Button(description='Compare',disabled=False,button_style='')

def compareTendencies(a):
    centralTendencyWithoutOutlierArray = np.around(np.asarray(centralTendencyWithoutOutlier), 3)
    centralTendencyWithOutlierArray = np.around(np.asarray(centralTendencyWithOutlier), 3) 
    arr = { 'Central Tendency ':  np.array(['Mean','Median','Mode' ]),
            'Before adding outlier ':  np.array([centralTendencyWithoutOutlierArray[0],  centralTendencyWithoutOutlierArray[1], 
                                            centralTendencyWithoutOutlierArray[2] ]),
            'After adding outlier ': np.array([centralTendencyWithOutlierArray[0], centralTendencyWithOutlierArray[1],
                                           centralTendencyWithOutlierArray[2]])}
    print(Table(arr))
    
IPython.display.display(compare)
compare.on_click(compareTendencies)

Button(description='Compare', style=ButtonStyle())

*Please note that if you press these buttons out of order, you might get errors.<br>
If that happens, just select the box with the error then press 'shift' and 'enter' at the same time.*

### Effects of Outliers 

##### Mean
From the above results and histogram, we see that adding outliers can change the mean dramatically.<br>
This is because we need to add all the values to determine the mean value.<br>
Outliers are often values that are much larger or smaller than the other values.<br>
When we add these values to the sum, the average can change a lot.

##### Median
For the median, we need to consider the middle number(s).<br>
Adding one outlier adds a single point at the far end of our data set, so everything else shifts over only one spot. <br>
The median might change or might not.

##### Mode
The mode is very unlikely to change, because mode is the most repeated value.<br>
Unless you add many outliers, all with the same value, the mode probably won't change.
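
These three effects can be seen in the sodium data itself. The sketch below is an illustration under an assumption: it treats the 28 values printed earlier as the data without the outlier and adds the soy sauce value (6458) from the table to get the data with the outlier.

```python
from statistics import mean, median, mode

# Assumed data: the 28 values printed earlier, with soy sauce (6458) added as the outlier
sodium = [871, 1250, 270, 250, 205, 340, 843, 482, 521, 450, 360, 780, 340, 510,
          460, 380, 290, 458, 335, 355, 590, 600, 595, 547, 1036, 410, 190, 530]
sodium_with_soy_sauce = sodium + [6458]

for label, values in (("Without soy sauce", sodium), ("With soy sauce", sodium_with_soy_sauce)):
    print(label, "-> mean:", round(mean(values), 3),
          "median:", median(values), "mode:", mode(values))
```

Under that assumption, the mean rises from about 508.857 to 714, the median moves only from 459 to 460, and the mode stays at 340.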

-------------- 

## Test yourself 

### Question 1

The Purple Tornadoes played 8 games this season and these are their scores.<br>
15, 10, 18, 11, 1, 18, 25, 12

In [49]:
answer642 = widgets.RadioButtons(options=['25 and 18', '1', '25 and 1', '1 and 10', '25', 'None of these'],
                             value = None,  description='Choices:')

question642 = "Which of these values could be outliers?"

def check642(g):
    IPython.display.clear_output(wait=False)
    display(question642, answer642)
    if answer642.value == '25 and 1':
        print("That's right!")
    else:
        print("Not quite. Try again.")

IPython.display.clear_output(wait=False)
display(question642, answer642)
answer642.observe(check642, 'value')

Which of these values could be outliers?


RadioButtons(description='Choices:', options=('25 and 18', '1', '25 and 1', '1 and 10', '25', 'None of these')…

In [50]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question642">
 <button id = "642"
onclick="toggle('answer642');">Solution</button> 
</div>
<div style="display:none" id="answer642">
Outliers are values which are much bigger or smaller than the rest of the data.<br />
Since there is only one value in the 20s, which is 25, that might be an outlier.<br/>
Also there is only one value less than 10, which is 1, so that might also be an outlier.<br/>

Everything else is pretty close together.<br>
<br>
We do not know for sure if these are real outliers, but they could be, so we call them potential outliers. <br>

</div>

</body>
</html>
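
One informal way to see which values sit apart from the rest is to sort the scores and look at the gap between each value and the next. This is only a rough heuristic for illustration, not a formal test and not something this notebook's code does.

```python
scores = [15, 10, 18, 11, 1, 18, 25, 12]
ordered = sorted(scores)

# Gap between each value and the next one in sorted order
gaps = [b - a for a, b in zip(ordered, ordered[1:])]

print(ordered)  # [1, 10, 11, 12, 15, 18, 18, 25]
print(gaps)     # [9, 1, 1, 3, 3, 0, 7]
```

The two largest gaps are at the ends of the list, next to 1 and 25, which is why those two scores are the potential outliers.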

### Question 2

Sam's class recorded the heights of 10 of the students in the class in centimeters. This is the data they got:<br>
120, 125, 133, 146, 180, 152, 154, 138, 122, 140<br>
The current mean height of these 10 students is 141 cm.<br>
The current median height is 139 cm. <br>

In [51]:
def display(question, answerList):
    print(question)
    IPython.display.display(answerList)

In [52]:
question611 = "If we decide that 180 cm is an outlier and remove it, what will happen to the central tendency?"
answer611 = widgets.RadioButtons(options=['mean will increase', 'median will stay the same', 'mean will decrease', 'None of the above'],
                             value= None, description='Choices:')


def checkAnswer(a):
    IPython.display.clear_output(wait=False)
    display(question611, answer611)
    if answer611.value == 'mean will decrease':
        print("Correct Answer!")
    else:
        print("Sorry, that's not right! Try again.")

display(question611, answer611)

answer611.observe(checkAnswer, 'value')

If we decide that 180 cm is an outlier and remove it, what will happen to the central tendency?


RadioButtons(description='Choices:', options=('mean will increase', 'median will stay the same', 'mean will de…

In [53]:
%%HTML
<html>
<head>
<script type="text/javascript">

<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question611">
 <button id = "611"
onclick="toggle('answer611');">Solution</button> 
</div>
<div style="display:none" id="answer611">
Since 180 cm is larger than the rest of the data, removing it as an outlier will decrease the mean. <br/>
If the outlier were smaller than the rest of the data, removing it would increase the mean. <br/>
<br/>
Since there were 10 values, the median was the average of the 5th and 6th values when the list is sorted. <br/>
After the largest value is removed, the median becomes the 5th value, which is smaller than the average of the 5th and 6th values. <br/>
This is only because those 2 values are not the same number; otherwise the median would stay the same.


</div>

</body>
</html>
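
The solution can be confirmed with a short calculation. This is a minimal sketch using the standard `statistics` module; the variable names are our own.

```python
from statistics import mean, median

heights = [120, 125, 133, 146, 180, 152, 154, 138, 122, 140]
heights_without_180 = [h for h in heights if h != 180]

print("All 10 students:", mean(heights), median(heights))
print("Without 180 cm: ", round(mean(heights_without_180), 1), median(heights_without_180))
```

The mean drops from 141 cm to about 136.7 cm, and the median drops from 139 cm to 138 cm.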

### Question 3

A teacher collected the grades of one test and these are the results:<br>
100, 78, 98, 78, 78, 75, 55, 86, 100<br>
Also, 4 students missed the test, so they got a grade of zero for that test.

In [54]:
question612 = " What is the mode grade for all the students in the test? "

answer612 = widgets.RadioButtons(options=['78', '100', '0', '78 and 100', 'There is none'],
                              value = None, description='Choices:')

def check612(b):
    IPython.display.clear_output(wait=False)
    display(question612, answer612)
    if answer612.value == '0':
        print("That's right!")
    else:
        print("Sorry, that's not right, try again. Don't forget the mode is the value that repeats the most.")

IPython.display.clear_output(wait=False)
display(question612, answer612)
answer612.observe(check612, 'value')

 What is the mode grade for all the students in the test? 


RadioButtons(description='Choices:', options=('78', '100', '0', '78 and 100', 'There is none'), value=None)

In [55]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question612">
 <button id = "612"
onclick="toggle('answer612');">Solution</button> 
</div>
<div style="display:none" id="answer612">
Since there are three 78s, you might think that 78 would be the mode. <br/>
But we are also counting the 4 students who got a zero. <br/>
Therefore there are 4 zeros in the dataset, so the mode is 0. <br/>


</div>

</body>
</html>

In [60]:
answer613 = widgets.RadioButtons(options=['Yes, to 100', 'Yes, to 78', 'Yes, to 0', 'No, it stays the same'],
                             value =None, description='Choices:')

question613 = "If we don't count the students who missed the test, does the mode change? If so, to what?"

def check613(e):
    IPython.display.clear_output(wait=False)
    display(question613, answer613)
    if answer613.value == 'Yes, to 78':
        print("You are correct!")
    else:
        print("Not quite, Try again.")
        
IPython.display.clear_output(wait=False)
display(question613, answer613)
answer613.observe(check613, 'value')

If we don't count the students who missed the test, does the mode change? If so, to what?


RadioButtons(description='Choices:', options=('Yes, to 100', 'Yes, to 78', 'Yes, to 0', 'No, it stays the same…

In [57]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question613">
 <button id = "613"
onclick="toggle('answer613');">Solution</button> 
</div>
<div style="display:none" id="answer613">

When we remove the students who missed the test, the mode will definitely change, since their zeros were the mode. <br />
Since there are three 78s and only two 100s, the new mode will be 78. <br/>
If there were three 100s, then there would be two modes: 78 and 100.

</div>

</body>
</html>
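
Both mode questions can be checked by counting. The sketch below uses `multimode` from the standard `statistics` module, which returns every value tied for most frequent; the variable names are our own.

```python
from statistics import multimode

grades = [100, 78, 98, 78, 78, 75, 55, 86, 100]
all_students = grades + [0] * 4  # four students missed the test and scored zero

print(multimode(all_students))  # [0]
print(multimode(grades))        # [78]
```

If a third 100 were added to `grades`, `multimode` would return both 100 and 78, because the two values would be tied.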

### Extra Hard Question

A student has gotten the following marks on his tests: 87, 95, 76, and 88. He wants a mean that is at least 85 and only has one more test.

In [58]:
answer631 = widgets.RadioButtons(options=['78', '78.5', '79', '80', 'None of the above'],
                             value =None, description='Choices:')

question631 = "What is the minimum mark he must get on the last test in order to achieve that average?"

def check631(e):
    IPython.display.clear_output(wait=False)
    display(question631, answer631)
    if answer631.value == '79':
        print("Correct Answer!")
    else:
        print("Wrong answer! Try again.")
        
IPython.display.clear_output(wait=False)
display(question631, answer631)
answer631.observe(check631, 'value')

What is the minimum mark he must get on the last test in order to achieve that average?


RadioButtons(description='Choices:', options=('78', '78.5', '79', '80', 'None of the above'), value=None)

In [59]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question631">
 <button id = "631"
onclick="toggle('answer631');">Solution</button> 
</div>
<div style="display:none" id="answer631">
The minimum mark is what we need to find. To find the average of all his marks (the known ones, plus the unknown one), 
we have to add up all the grades, and then divide by the number of marks. Since we do not have a score for the last test yet, 
we will use a variable to stand for this unknown value: "x". Then the computation to find the desired average is: <br />

the first step : (87 + 95 + 76 + 88 + x) ÷ 5 = 85 <br />
Multiplying through by 5 and simplifying, we get: <br/>
the next step : 87 + 95 + 76 + 88 + x = 425 <br />
the next step : 346 + x = 425 <br />
the final step : x = 79 <br />
so, he needs to get at least a 79 on the last test.
</div>

</body>
</html>
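
The same algebra can be written as a short calculation. This is a minimal sketch; the variable names are our own.

```python
marks = [87, 95, 76, 88]
target_mean = 85
total_tests = len(marks) + 1  # one test is still to come

# All the marks must add up to target_mean * total_tests,
# so the last mark has to make up the difference
needed = target_mean * total_tests - sum(marks)
print(needed)  # 79

# Check: with that mark, the mean is exactly the target
print((sum(marks) + needed) / total_tests)  # 85.0
```

Any mark above 79 would push the mean above 85, so 79 is the minimum.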

## Conclusion

In this notebook, we learned how an outlier affects central tendency. 

When an outlier is added to (or removed from) a data set:
* Mean changes the most 
* Median changes a little bit
* Mode doesn't change unless there are multiple outliers with the same value

Also, not all larger or smaller values should be called outliers and excluded from data. That depends more on context. 

![alt text][bottom-banner]

[bottom-banner]: ./callysto-bottom-banner.jpg