![alt text][top-banner]

[top-banner]: ./callysto-top-banner.jpg

In [41]:
%%html

<script>
  function code_toggle() {
    if (code_shown){
      $('div.input').hide('500');
      $('#toggleButton').val('Show Code')
    } else {
      $('div.input').show('500');
      $('#toggleButton').val('Hide Code')
    }
    code_shown = !code_shown
  }

  $( document ).ready(function(){
    code_shown=false;
    $('div.input').hide()
  });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [42]:
import csv

import IPython
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from astropy.table import Table
from collections import Counter
from ipywidgets import interact, widgets

# Effect of Outliers on Central Tendency

This notebook focuses on what an outlier is and how it affects central tendency. Remember that central tendency is described by the mean, median, and mode of some data. If you need a review of central tendency, check out this previous notebook.

Things you will learn in this notebook:
* What an outlier is
* How an outlier affects mean
* How an outlier affects median
* How an outlier affects mode
* Why outliers can be a problem

<img style="float: right;" src="CC66_materials/PunnyOutlier.jpg" width="400" height="300">

## What is an outlier?

An outlier is a value that "lies outside" (is much smaller or larger than) most of the other values in a set of data.

Let's look at an example: <br>
Here is a data set: $26, 23, 27, 25, 28, 29, 24, 99 $ <br>
Let's put it in order: $23, 24, 25, 26, 27, 28, 29, 99$ <br> 
$99$ is larger than all the other data, by a lot. <br> 
Therefore, we can call $99$ an outlier.

## Central tendency

First let's look at the central tendency of the example above. 

Mean = $\frac{23+24+25+26+27+28+29+99}{8} = \frac{281}{8} = 35.125$. <br>
The median is found between 26 and 27, so the median is 26.5. <br>
There is no mode because no value appears more than once.

Then we will remove the outlier (99) and recalculate the central tendency.

Mean = $\frac{23+24+25+26+27+28+29}{7} = \frac{182}{7} = 26$. <br>
The median is the middle number so the median is 26. <br>
There is no mode because no value appears more than once.
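
As a quick check of the hand calculations above, here is a minimal sketch using Python's standard `statistics` module. The helper name `summarize` is our own (it is not part of this notebook), and it reports an empty list of modes when no value appears more than once.

```python
from collections import Counter
from statistics import mean, median

def summarize(values):
    """Return (mean, median, modes); modes is empty when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    # Only values that appear more than once can count as a mode here
    modes = sorted(v for v, c in counts.items() if c == highest) if highest > 1 else []
    return mean(values), median(values), modes

with_outlier = [23, 24, 25, 26, 27, 28, 29, 99]
without_outlier = [v for v in with_outlier if v != 99]

print("With 99:   ", summarize(with_outlier))
print("Without 99:", summarize(without_outlier))
```

Removing 99 moves the mean from 35.125 to 26, while the median only moves from 26.5 to 26.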

What changes do you notice? What changed the most?

### Try it yourself

Here are the ratings of a new restaurant out of 10:

$4.5, 7, 15, 3.5, 6, 5, 9, 10, 1$

In [56]:
answer622 = widgets.RadioButtons(options=['Yes', 'No'], value = None)
answernext = widgets.RadioButtons(options=['1', '10', '15', '-1'], value = None)

question622 = 'Is there an outlier?'
questionnext = 'Which value is the outlier?'

def checknext(d):
    IPython.display.clear_output(wait=False)
    display(question622, answer622)
    display(questionnext, answernext)
    if answernext.value == '15':
        print("You're right! 15 is an outlier because the ratings are only supposed to be out of 10.")
    else:
        print("Not quite, try again.")


def check622(d):
    IPython.display.clear_output(wait=False)
    display(question622, answer622)
    if answer622.value == 'Yes':
        print("Correct!")
    else:
        print("Actually there is an outlier.")
    display(questionnext, answernext)
    answernext.observe(checknext, 'value')
        
IPython.display.clear_output(wait=False)
display(question622,answer622 )
answer622.observe(check622, 'value')

Is there an outlier?


RadioButtons(options=('Yes', 'No'), value=None)

In [57]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question622">
 <button id = "622"
onclick="toggle('answer622');">Solution</button> 
</div>
<div style="display:none" id="answer622">
An outlier is a value that does not seem to fit with the rest of the data. <br/>
In this case, we are looking at ratings that should be between 0 and 10. <br/>
All the values in this data set are within that range except for 15. <br/>
So 15 must be the outlier. <br/>

-1 would also be an outlier if it were in the data set, but it is not, so it is not the right answer. <br/>

</div>

</body>
</html>
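
The range check described in the solution can be sketched in a few lines. This assumes the only rule is that a valid rating lies between 0 and 10; the names `LOWEST`, `HIGHEST`, and `out_of_range` are illustrative and not part of the notebook.

```python
ratings = [4.5, 7, 15, 3.5, 6, 5, 9, 10, 1]
LOWEST, HIGHEST = 0, 10  # valid rating range (assumed from "out of 10")

# Keep every rating that falls outside the allowed range
out_of_range = [r for r in ratings if not LOWEST <= r <= HIGHEST]
print("Ratings outside the valid range:", out_of_range)
```

A check like this only works when we know the valid range in advance. For data without a fixed range, we need another way to decide what counts as "far away".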

## A bigger set of data

Let's look at some data about the sodium content (amount of salt) in different common foods. Let's also put the data on a number line. Is there an outlier?

In [43]:
df = pd.read_csv('CC66_materials/exampleDataSource.csv')
print(df)

                             Product   Sodium content
0             Tomato sauce (Brand A)              871
1             Tomato sauce (Brand B)             1250
2                          Soy sauce             6458
3                    Tinned tomatoes              270
4                      Four bean mix              250
5                       Corn kernels              205
6                 Mushrooms (tinned)              340
7                       Tomato paste              843
8                 Chunky pasta sauce              482
9                      Chicken stock              521
10                        Beef stock              450
11                           Peanuts              360
12               Beef strogonoff mix              780
13            Dry roasted macadamias              340
14                           Cashews              510
15                            Butter              460
16                  Sunflower spread              380
17                       Bak

In [44]:
data = []
with open('CC66_materials/exampleData.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(int(row[1]))

Now let's calculate the mean, median, and mode using the computer.

In [45]:
def computeCenTendency(dataset):

    # Mean value
    mean = np.mean(dataset)
    print("Mean: ", round(mean, 3))

    # Median value (np.median sorts a copy internally, so the caller's list is not reordered)
    median = np.median(dataset)
    print("Median: ", round(median, 3))

    # Mode: pair each distinct value with its number of hits as (tally, value),
    # then sort so the most frequent value comes first
    hits = sorted(((tally, item) for item, tally in Counter(dataset).items()), reverse=True)
    # A single mode exists only when the top tally beats the runner-up
    if len(hits) == 1 or hits[0][0] > hits[1][0]:
        print("Mode:", round(hits[0][1], 3), "(appeared", hits[0][0], "times.)")
    else:
        print("There is no mode")

    # The third value is the most frequent entry, even when no single mode was reported
    return mean, median, hits[0][1]

centralTendency = computeCenTendency(data)

Mean:  508.857
Median:  459.0
Mode: 340 (appeared 2 times.)


Now let's show this data in a graph. But not just any graph. We will display it in a histogram. This way we can group together foods that have a similar sodium content into "bins". We can change the number of bins to see how it changes the detail shown on the graph.

Can you tell if there's an outlier when there are only a couple of bins? How about when there are a lot of bins?

In [46]:
def plotHistogram(x_values, num_bins, xLabel, yLabel, histTitle):
    n, bins, patches = plt.hist(x_values, num_bins, facecolor='blue', alpha=0.5)

    plt.xlabel(xLabel)
    plt.ylabel(yLabel)
    plt.title(histTitle)
    plt.show()
    
def callPlottingFunction(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(data, num_bins , 'Sodium Content', 'values', 'Histogram of 30 products in Australian supermarkets')

interact(callPlottingFunction, num_bins=widgets.IntSlider(min=1,max=100,step=1,value=50));

interactive(children=(IntSlider(value=50, description='num_bins'), Output()), _dom_classes=('widget-interact',…

When there are a lot of bins, it's really easy to see that there's an outlier. If we look at the data, we know that's soy sauce. So let's remove it and see how the central tendency is affected.
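
To see the same effect as numbers instead of a picture, here is a rough sketch using `np.histogram`, which is the function `plt.hist` relies on to count values per bin. It assumes the data set is the list of sodium values printed in this notebook with soy sauce (6458) included; the variable names are illustrative.

```python
import numpy as np

# Sodium values from this notebook, with soy sauce (6458) included (assumed contents of the data file)
sodium = [871, 1250, 6458, 270, 250, 205, 340, 843, 482, 521, 450, 360, 780,
          340, 510, 460, 380, 290, 458, 335, 355, 590, 600, 595, 547, 1036,
          410, 190, 530]

few_counts, _ = np.histogram(sodium, bins=2)
many_counts, _ = np.histogram(sodium, bins=50)

print("Counts with 2 bins: ", few_counts.tolist())
print("Counts with 50 bins:", many_counts.tolist())
print("Empty bins out of 50:", int((many_counts == 0).sum()))
```

With many bins, a long run of empty bins separates the single far-away value from the rest of the data, which is what makes the outlier stand out.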

In [47]:
dataWithoutOutlier = []
with open('CC66_materials/exampleData_NoOutlier.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        dataWithoutOutlier.append(int(row[1]))
print("Sodium content read from the file: ", dataWithoutOutlier)

centralTendencyWithoutOutlier = []
centralTendencyWithoutOutlier = computeCenTendency(dataWithoutOutlier)

def callPlottingFunctionNoOutlier(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(dataWithoutOutlier, num_bins , 'Sodium Content', 'values', 'Histogram of 30 products in Australian supermarkets')

interact(callPlottingFunctionNoOutlier, num_bins=widgets.IntSlider(min=1,max=100,step=1,value=50));

Sodium content read from the file:  [871, 1250, 270, 250, 205, 340, 843, 482, 521, 450, 360, 780, 340, 510, 460, 380, 290, 458, 335, 355, 590, 600, 595, 547, 1036, 410, 190, 530]
Mean:  508.857
Median:  459.0
Mode: 340 (appeared 2 times.)


interactive(children=(IntSlider(value=50, description='num_bins'), Output()), _dom_classes=('widget-interact',…

Should soy sauce be excluded though? It is a common food that many people own, so it's not out of place in the data. It's just very salty.
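
To put numbers on the comparison, here is a small sketch that starts from the list printed above and adds the soy sauce value (6458) from the product table. It uses the standard `statistics` module rather than the notebook's `computeCenTendency`, and the variable names are our own.

```python
from statistics import mean, median

# Values printed above (soy sauce already removed)
without_soy = [871, 1250, 270, 250, 205, 340, 843, 482, 521, 450, 360, 780,
               340, 510, 460, 380, 290, 458, 335, 355, 590, 600, 595, 547,
               1036, 410, 190, 530]
# Soy sauce value taken from the product table
with_soy = without_soy + [6458]

for label, values in (("Without soy sauce", without_soy), ("With soy sauce", with_soy)):
    print(f"{label}: mean = {mean(values):.3f}, median = {median(values)}")
```

One very salty product is enough to pull the mean up by a couple of hundred units, while the median moves by only one.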

Now try adding your own data points into this set. Try adding really big and/or really small numbers. You can also add many repeating numbers to change the mode. Just press the button below to add a new outlier. You can add as many as you want.

In [48]:
# Work on a copy so that adding outliers does not also change dataWithoutOutlier
datasetWithOutlier = list(dataWithoutOutlier)

add = widgets.Button(description='Add Outlier',disabled=False,button_style='')
outlier = widgets.IntText(value=None,description='Outlier:',disabled=False)

def addOutlier(a):
    global datasetWithOutlier
    IPython.display.clear_output()
    IPython.display.display(add)
    IPython.display.display(outlier)
    datasetWithOutlier.append(outlier.value)
    print(datasetWithOutlier)
    print("You can press the button Add Outlier again to add even more outliers!")

def showOutlier(a):
    IPython.display.clear_output()
    IPython.display.display(add)
    IPython.display.display(outlier)
    outlier.observe(addOutlier, 'value')
    

IPython.display.display(add)
add.on_click(showOutlier)

Button(description='Add Outlier', style=ButtonStyle())

Now we will compute the central tendency and produce a histogram of the data set with the added outliers.

In [49]:
compute = widgets.Button(description='Calculate',disabled=False,button_style='')
global centralTendencyWithOutlier

def callPlottingFunctionWithOutlier(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(datasetWithOutlier, num_bins, 'Sodium Content', 'values', 'Histogram with Outliers')

def calculate(a):
    global centralTendencyWithOutlier
    centralTendencyWithOutlier = computeCenTendency(datasetWithOutlier)
    print(centralTendencyWithOutlier)
    interact(callPlottingFunctionWithOutlier, num_bins=widgets.IntSlider(min=1,max=100,step=1,value=50));
    
IPython.display.display(compute)
compute.on_click(calculate)

Button(description='Calculate', style=ButtonStyle())

**Effects of Outliers:** 

From the above results and histogram, we see that adding outliers can change the mean dramatically.<br>
This is because every value is included in the sum used to calculate the mean.<br>
Outliers are often values that are much larger or smaller than the other values.<br>
When we add these values to the sum, the average can change significantly.

For the median, we only need to consider the middle number(s).<br>
Adding an outlier adds a data point at the far end of our data set, so the middle of the sorted data shifts over by at most one spot.
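
The difference between the two measures can be sketched by making the outlier bigger and bigger. The base values come from the first example in this notebook; the outliers 999 and 9999 are made-up values used only for illustration.

```python
from statistics import mean, median

base = [23, 24, 25, 26, 27, 28, 29]

results = []
for outlier in (99, 999, 9999):
    values = base + [outlier]
    results.append((outlier, mean(values), median(values)))
    print(f"Outlier {outlier}: mean = {mean(values)}, median = {median(values)}")
```

The mean keeps growing with the size of the outlier, but the median stays the same because only the position of the outlier in the sorted list matters, not how large it is.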

The mode is very unlikely to change, because the mode is defined by how often values repeat.<br>
Enter a very large number as an outlier and observe the histogram. <br>
You will notice that all the regular data sits in one area and the value you entered is far away. Consider the following table to get a clear idea of the effect of outliers on the mean, median, and mode.

In [50]:
compare = widgets.Button(description='Compare',disabled=False,button_style='')

def compareTendencies(a):
    centralTendencyWithoutOutlierArray = np.around(np.asarray(centralTendencyWithoutOutlier), 3)
    centralTendencyWithOutlierArray = np.around(np.asarray(centralTendencyWithOutlier), 3) 
    arr = { 'Central Tendency ':  np.array(['Mean','Median','Mode' ]),
            'Before adding outlier ':  np.array([centralTendencyWithoutOutlierArray[0],  centralTendencyWithoutOutlierArray[1], 
                                            centralTendencyWithoutOutlierArray[2] ]),
            'After adding outlier ': np.array([centralTendencyWithOutlierArray[0], centralTendencyWithOutlierArray[1],
                                           centralTendencyWithOutlierArray[2]])}
    print(Table(arr))
    
IPython.display.display(compare)
compare.on_click(compareTendencies)

Button(description='Compare', style=ButtonStyle())

##  <font color='Blue'>  Section 6 </font> 
## Test yourself 

### 6.1 Practice Problem : Easy
Consider the following GPAs of students from two semesters of the Stat-0000 course: 
<br>
Semester $1: 3.1, 3.2, 2.8, 2.9, 3.0, 3.4, 2.3, 3.2, 2.1, 3.5$ <br>
Semester $2: 2.2, 4.1, 2.6, 2.7, 3.8, 2.8, 2.4, 3.2, 2.7, 2.9$


In [51]:
def display(question, answerList):
    print(question)
    IPython.display.display(answerList)

In [52]:
question611 = "What will be the mean of semester 1?"
answer611 = widgets.RadioButtons(options=['2.95', '3', '3.05', '3.10', 'None of the above'],
                             value= None, description='Choices:')


def checkAnswer(a):
    IPython.display.clear_output(wait=False)
    display(question611, answer611)
    if answer611.value == '2.95':
        print("Correct Answer!")
    else:
        print("Wrong answer! Try again.")

display(question611, answer611)

answer611.observe(checkAnswer, 'value')

What will be the mean of semester 1?


RadioButtons(description='Choices:', options=('2.95', '3', '3.05', '3.10', 'None of the above'), value=None)

In [53]:
%%HTML
<html>
<head>
<script type="text/javascript">

<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question611">
 <button id = "611"
onclick="toggle('answer611');">Solution</button> 
</div>
<div style="display:none" id="answer611">
To find the mean of semester 1, we divide the sum of all the values by the number of values.<br/>
Therefore, Mean $=  \frac{3.1 + 3.2 + 2.8 + 2.9 + 3.0 + 3.4 + 2.3 + 3.2 + 2.1 + 3.5}{10} = \frac{29.5}{10} = 2.95$ <br/>


</div>

</body>
</html>

In [54]:
question612 = " What will be the median of semester 2? "

answer612 = widgets.RadioButtons(options=['2.43', '2.5', '2.67', '2.75', 'None of the above'],
                              value = None, description='Choices:')

def check612(b):
    IPython.display.clear_output(wait=False)
    display(question612, answer612)
    if answer612.value == '2.75':
        print("Correct Answer!")
    else:
        print("Wrong answer! Try again.")

IPython.display.clear_output(wait=False)
display(question612, answer612)
answer612.observe(check612, 'value')

 What will be the median of semester 2? 


RadioButtons(description='Choices:', options=('2.43', '2.5', '2.67', '2.75', 'None of the above'), value=None)

In [55]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question612">
 <button id = "612"
onclick="toggle('answer612');">Solution</button> 
</div>
<div style="display:none" id="answer612">
To determine the median at first we need to sort the data. <br/>
Sorted data,<br/>
Semester  $2: 2.2,2.4,2.6,2.7,2.7,2.8,2.9,3.2,3.8,4.1$ <br/>
There are $10$ numbers in the data set, so the average of $5^{th}$ and $6^{th}$ will be the median. <br/>
So, the Median $= \frac{2.7 + 2.8}{2} = 2.75$ <br/>


</div>

</body>
</html>
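
If you want to check answers like these with the computer, a minimal sketch using Python's standard `statistics` module could look like this. The results are rounded to two decimal places because decimal values such as 3.1 are not stored exactly as floating-point numbers.

```python
from statistics import mean, median

semester1 = [3.1, 3.2, 2.8, 2.9, 3.0, 3.4, 2.3, 3.2, 2.1, 3.5]
semester2 = [2.2, 4.1, 2.6, 2.7, 3.8, 2.8, 2.4, 3.2, 2.7, 2.9]

# Round to two decimals to hide tiny floating-point errors
semester1_mean = round(mean(semester1), 2)
semester2_median = round(median(semester2), 2)

print("Semester 1 mean:", semester1_mean)
print("Semester 2 median:", semester2_median)
```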

### 6.2 Practice Problem : Medium


### 6.3 Practice Problem : Hard
A student has gotten the following marks on his tests: 87, 95, 76, and 88. He has one test left and wants an overall average of 85 or better.

In [58]:
answer631 = widgets.RadioButtons(options=['78', '78.5', '79', '80', 'None of the above'],
                             value =None, description='Choices:')

question631 = "6.3.1 What is the minimum mark he must get on the last test in order to achieve that average?"

def check631(e):
    IPython.display.clear_output(wait=False)
    display(question631, answer631)
    if answer631.value == '79':
        print("Correct Answer!")
    else:
        print("Wrong answer! Try again.")
        
IPython.display.clear_output(wait=False)
display(question631, answer631)
answer631.observe(check631, 'value')

6.3.1 What is the minimum mark he must get on the last test in order to achieve that average?


RadioButtons(description='Choices:', options=('78', '78.5', '79', '80', 'None of the above'), value=None)

In [59]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question631">
 <button id = "631"
onclick="toggle('answer631');">Solution</button> 
</div>
<div style="display:none" id="answer631">
We need to find the minimum mark. To find the average of all his marks (the known ones, plus the unknown one), 
we add up all the marks, and then divide by the number of marks. Since we do not have a score for the last test yet, 
we will use a variable to stand for this unknown value: "x". Then the computation to find the desired average is: <br />

First step : (87 + 95 + 76 + 88 + x) ÷ 5 = 85 <br />
Multiplying both sides by 5, we get: <br/>
Next step : 87 + 95 + 76 + 88 + x = 425 <br />
Next step : 346 + x = 425 <br />
Final step : x = 79 <br />
So, he needs to get at least 79 on the last test.
</div>

</body>
</html>
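
The same algebra can be written as a short calculation. This is only a sketch of the steps in the solution; the variable names are our own.

```python
marks = [87, 95, 76, 88]
target_average = 85
total_tests = len(marks) + 1  # the four known tests plus the last one

# The total needed for the target average, minus the marks already earned
required = target_average * total_tests - sum(marks)
print("Minimum mark needed on the last test:", required)
```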

In [60]:
answer641 = widgets.RadioButtons(options=['1.6 and 4.1', '1.7 and 3.9', '2.2 and 3.8', '1.7 and 4.1', 'None of the above'],
                              value=None,description='Choices:')

question641 = "6.4.1 What are the lower and upper limits for outliers in semester 2?"


def checkAnswer641(f):
    IPython.display.clear_output(wait=False)
    display(question641, answer641 )
    if answer641.value == '1.7 and 4.1':
        print("Correct Answer!")
    else:
        print("Wrong answer! Try again.")

IPython.display.clear_output(wait=False)
display(question641, answer641)
answer641.observe(checkAnswer641, 'value')

6.4.1 What are the lower and upper limits for outliers in semester 2?


RadioButtons(description='Choices:', options=('1.6 and 4.1', '1.7 and 3.9', '2.2 and 3.8', '1.7 and 4.1', 'Non…

In [61]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question641">
 <button id = "641"
onclick="toggle('answer641');">Solution</button> 
</div>
<div style="display:none" id="answer641">
First we have to determine the first quartile $(Q_1)$ and third quartile $(Q_3)$ of semester 2.<br/> 
The sorted data is 2.2, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.2, 3.8, 4.1 <br/>
Lower half of the data :  2.2, 2.4, 2.6, 2.7, 2.7 <br/>
    The first quartile is the median of the lower half, so $Q_1$ is 2.6 <br/>
Similarly, $Q_3$ is the median of the upper half (2.8, 2.9, 3.2, 3.8, 4.1), so $Q_3$ is 3.2 <br/>
So, the interquartile range (IQR) $= Q_3 - Q_1 = 3.2 - 2.6 = 0.6$ <br/>
Now we can compute the limits for outliers. <br/>

Lower limit : $Q_1 - 1.5\times IQR = 2.6 - 1.5 \times 0.6 = 1.7$ <br/>
Similarly, we can determine the upper limit, <br/>
Upper limit : $Q_3 + 1.5 \times IQR = 3.2 + 1.5 \times 0.6 = 4.1$
</div>

</body>
</html>

In [62]:
answer642 = widgets.RadioButtons(options=['2.2', '3.8', '1.7', '2.1', 'None of the above'],
                             value = None,  description='Choices:')

question642 = "6.4.2 What is the potential outlier of semester 1?"

def check642(g):
    IPython.display.clear_output(wait=False)
    display(question642, answer642)
    if answer642.value == '2.1':
        print("Correct Answer!")
    else:
        print("Wrong answer! Try again.")

IPython.display.clear_output(wait=False)
display(question642, answer642)
answer642.observe(check642, 'value')

6.4.2 What is the potential outlier of semester 1?


RadioButtons(description='Choices:', options=('2.2', '3.8', '1.7', '2.1', 'None of the above'), value=None)

In [63]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question642">
 <button id = "642"
onclick="toggle('answer642');">Solution</button> 
</div>
<div style="display:none" id="answer642">
In the previous question we determined the lower and upper limits for semester 2. Similarly, if we calculate the limits for semester 1, 
we find that the lower and upper limits are 2.2 and 3.8 respectively. By the definition of outliers, any numbers that are 
lower than the lower limit or higher than the upper limit are potential outliers.<br />

From the given data of semester 1, we find that 2.1 is less than the lower limit (2.2) and all other GPAs are between the lower
and upper limits.  <br/>Hence, 2.1 is a potential outlier.
</div>

</body>
</html>
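
The limit calculations from the last two solutions can be sketched in code. This assumes the same quartile method as the solutions (the median of the lower half and the median of the upper half of the sorted data); library functions such as `np.percentile` use a different default method and may give slightly different quartiles. The helpers `outlier_limits` and `to_decimals` are our own names, and `Decimal` is used so that values like 2.2 are compared exactly.

```python
from decimal import Decimal
from statistics import median

def outlier_limits(values):
    """Return (lower, upper) limits using the median-of-halves quartile method."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the lower half
    q3 = median(ordered[-half:])   # median of the upper half
    iqr = q3 - q1
    return q1 - Decimal("1.5") * iqr, q3 + Decimal("1.5") * iqr

def to_decimals(text):
    """Turn a space-separated string of numbers into exact Decimal values."""
    return [Decimal(x) for x in text.split()]

semester1 = to_decimals("3.1 3.2 2.8 2.9 3.0 3.4 2.3 3.2 2.1 3.5")
semester2 = to_decimals("2.2 4.1 2.6 2.7 3.8 2.8 2.4 3.2 2.7 2.9")

for name, gpas in (("Semester 1", semester1), ("Semester 2", semester2)):
    lower, upper = outlier_limits(gpas)
    outliers = [g for g in gpas if g < lower or g > upper]
    print(name, "limits:", lower, "and", upper, "potential outliers:", outliers)
```

Values that sit exactly on a limit, such as 4.1 in semester 2, are not flagged because the check only looks for values strictly outside the limits.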

## Conclusion

In this notebook, we learned how an outlier affects central tendency. 

When an outlier is added to (or removed from) a data set:
* Mean changes the most 
* Median changes a little bit
* Mode doesn't change unless there are multiple outliers with the same value

Also, not all larger or smaller values should be called outliers and excluded from data. That depends more on context. 

![alt text][bottom-banner]

[bottom-banner]: ./callysto-bottom-banner.jpg