# Effect of Outliers on Central Tendency
This notebook (CC-66) explores the effect of outliers on measures of central tendency (mean, median, and mode).

This notebook is organized as follows. Section 1 covers the preliminaries: the definition of an outlier, followed by two examples. Section 2 computes the basic measures of central tendency for a data set imported from a local folder. In section 3, once we've covered the basics, you'll have the opportunity to use some basic Python tools to adjust a dataset by adding one or more outliers, and then observe the effect on the mean, median, and mode. Section 4 uses a Python library (pandas) to load and summarize a larger data set. Section 5 summarizes the notebook, and section 6 provides exercise problems for students.

## <font color='Blue'> 1 Preliminaries </font> 
For the basic ideas of central tendency, see the notebook [CC-65](CC65-Central_Tendency.ipynb).



### 1.1 Outlier : Out, liar
A value that "lies outside" (is much smaller or larger than) most of the other values in a set of data.

##### 1.1.1 Example 
Data set $= 2, 26, 23, 27, 25, 28, 29, 24, 99$
<br> $2$ is much smaller and $99$ is much larger than the other values.
<br>Therefore, the outliers of the data set are $2$ and $99$.


##### 1.1.2 Example 
In the math-00 course, ten students obtained the following marks (out of $100$):
Marks $= 78, 87, 85, 96, 84, 92, 102, 79, 81, 97$

No value in this data set is much smaller or larger than the others. However, we are told that the exam was marked out of $100$, so the best mark any individual can get is $100$. A mark that exceeds this maximum must be an erroneous (or fraudulent) value in the data set. Therefore, $102$ is the outlier in the data set.
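
A range check like the one in Example 1.1.2 is easy to automate. The sketch below is only an illustration; the names `marks`, `max_mark`, and `out_of_range` are chosen here and do not appear elsewhere in the notebook.

```python
# Marks from Example 1.1.2 (exam marked out of 100)
marks = [78, 87, 85, 96, 84, 92, 102, 79, 81, 97]
max_mark = 100  # the full mark stated in the example

# A mark below 0 or above the full mark cannot be a genuine result
out_of_range = [m for m in marks if not 0 <= m <= max_mark]
print("Marks outside the valid range:", out_of_range)
```

Only $102$ is flagged, which agrees with the reasoning above.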

##   <font color='Blue'> 2 Central Tendency Computation: Sodium content </font>

Now we will calculate the mean, median, and mode for a dataset stored in a local text (csv) file. This particular data happens to be the sodium content per serving in a selection of supermarket items. You'll notice one number in the list below that's far greater than the others: this is the outlier in our dataset. (The product? In case you were wondering, it's soy sauce.)

### 2.1 Import Data
The block of code below imports our data. The first line imports Python's `csv` module. The next lines open the file and read it row by row, keeping the second column of each row as an integer. The last line tells Jupyter to display the information it read from the file.

In [30]:
import csv
data = []
with open('exampleData.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(int(row[1]))
print("Data read from the file: ", data)

Data read from the file:  [871, 1250, 6458, 270, 250, 205, 340, 843, 482, 521, 450, 360, 780, 340, 510, 460, 380, 290, 458, 335, 355, 590, 600, 595, 547, 1036, 410, 190, 530]


Now that we've loaded our data, we need to process it. We're using the Python language, which relies on a number of preconfigured programs called *libraries* to accomplish tasks. The first lines of code below import the tools we'll need. 

### 2.2 Calculation : General function
After that, we define a *function*, named "computeCenTendency", which we will use to calculate the mean, median, and mode of our dataset. (Together, the mean, median, and mode statistics are known as *measures of central tendency* since they are used to measure the extent to which our data is concentrated around a central value.)

In [31]:
import numpy as np
from scipy import stats
from collections import Counter
from array import array
from statistics import mode
from ipywidgets import interact, widgets
import IPython

def computeCenTendency(dataset):
    
    # Mean value
    mean = np.mean(dataset)
    print("Mean: ", mean) 

    # Median value
    # Sort the data in ascending order (this reorders the caller's list in place;
    # np.median itself does not require sorted input)
    dataset.sort()
    median = np.median(dataset)
    print("Median: ", median)
    
    # Mode
    hits = []
    for item in dataset:
        tally = dataset.count(item)
        # Make a tuple that pairs the number of hits with the relevant number
        values = (tally, item)
        # Only add one entry for each number in the set
        if values not in hits:
            hits.append(values)
    # Largest tally first
    hits.sort(reverse=True)
    # A mode is reported only if the top tally strictly beats the runner-up
    if len(hits) == 1 or hits[0][0] > hits[1][0]:
        print("Mode:", hits[0][1], "(appeared", hits[0][0], "times.)")
    else:
        print("There is no mode")


### 2.3 Calculation : Using the function
Now we call the function defined above. Its only required parameter is the list of data.

In [32]:
computeCenTendency(data)

Mean:  714.0
Median:  460.0
Mode: 340 (appeared 2 times.)
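
As a cross-check, the same three statistics can be computed with Python's standard `statistics` module. This is a sketch rather than part of the notebook's workflow: the list `sodium` is copied by hand from the output in section 2.1, and `multimode` requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Sodium data as printed in section 2.1 (29 values, including the soy sauce)
sodium = [871, 1250, 6458, 270, 250, 205, 340, 843, 482, 521, 450, 360,
          780, 340, 510, 460, 380, 290, 458, 335, 355, 590, 600, 595,
          547, 1036, 410, 190, 530]

print("Mean:", mean(sodium))
print("Median:", median(sodium))
# multimode lists every most-frequent value, so ties would be visible
print("Mode(s):", multimode(sodium))
```

The results agree with the output of `computeCenTendency` above: a mean of 714, a median of 460, and a single mode of 340.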


### 2.4 Histogram

Next, we will produce a **histogram** of our data. A histogram is a useful plot for visualizing how the values in our dataset are distributed. Again, we begin by importing the Python libraries needed to accomplish the task. Then we define our histogram plot. The definition depends on a number of parameters, such as the number of 'bins'. When there are a lot of data points, we don't plot each point individually. Instead, we group together nearby values. Each such grouping, given by a range of values, is called a *bin*. The height of each bar in the histogram is given by the number of data points in the corresponding bin.
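
To make the idea of bins concrete, the sketch below counts how many sodium values fall into each of five equal-width bins using `numpy.histogram`. The choice of five bins and the variable names are assumptions made for this illustration; the data is copied from section 2.1.

```python
import numpy as np

# Sodium data as printed in section 2.1
sodium = [871, 1250, 6458, 270, 250, 205, 340, 843, 482, 521, 450, 360,
          780, 340, 510, 460, 380, 290, 458, 335, 355, 590, 600, 595,
          547, 1036, 410, 190, 530]

# Split the range [min, max] into 5 equal-width bins and count values per bin
counts, edges = np.histogram(sodium, bins=5)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:7.1f} to {right:7.1f}: {count} value(s)")
```

With five bins, 28 of the values land in the first bin and the single outlier sits alone in the last one, which is the shape the histogram bars will show.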

#### 2.4.1 General function
First, we will implement a general function for plotting a histogram so that we can use it for every example.

In [33]:
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from ipywidgets import interact, widgets


def plotHistogram(x_values, num_bins, xLabel, yLabel, histTitle):
    n, bins, patches = plt.hist(x_values, num_bins, facecolor='blue', alpha=0.5)

    plt.xlabel(xLabel)
    plt.ylabel(yLabel)
    plt.title(histTitle)
    plt.show()

Now we define another function that takes the number of bins chosen by the user and generates a histogram with that many bins.

In [34]:

def callPlottingFunction(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(data, num_bins , 'Sodium Content', 'values', 'Histogram of 30 products in Australian supermarkets')



#### 2.4.2 Interactive histogram
To present the histogram in an interactive way, we use a Python widget called a slider. We invoke a built-in function named *interact()*, which automatically creates user interface controls for exploring code and data interactively.

In the following slider, we can adjust the number of bins (labeled *num_bins*) and generate a histogram with that value. Students can change the value within the range $1$ to $100$. By default, *num_bins* is set to 50. To generate the histogram for a different value, click anywhere on the slider. 

N.B. The interactivity here is a little rough: the graph disappears and reappears each time the slider value changes!

In [56]:
interact(callPlottingFunction, num_bins=widgets.IntSlider(min=1,max=100,step=1,value=50));

## <font color='Blue'> 3 Exploring the effect of outliers </font> 

In the example above, the sodium content in the soy sauce produced a clear outlier. Next, we will begin with a dataset that does not initially contain any outliers, and compute its mean, median, and mode. You will then have the opportunity to add additional data points to the set, and then compute their effect on these statistics.

### 3.1 Dataset without outlier
We remove the soy sauce entry from the csv file used above. Then we repeat every step that we defined in section 2 on the reduced data set.

In [36]:
dataWithoutOutlier = []
with open('exampleData_NoOutlier.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        dataWithoutOutlier.append(int(row[1]))
print("Data read from the file: ", dataWithoutOutlier)

computeCenTendency(dataWithoutOutlier)

def callPlottingFunctionNoOutlier(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(dataWithoutOutlier, num_bins , 'Sodium Content', 'values', 'Histogram of 30 products in Australian supermarkets')

interact(callPlottingFunctionNoOutlier, num_bins=widgets.IntSlider(min=1,max=100,step=1,value=50));

Data read from the file:  [871, 1250, 270, 250, 205, 340, 843, 482, 521, 450, 360, 780, 340, 510, 460, 380, 290, 458, 335, 355, 590, 600, 595, 547, 1036, 410, 190, 530]
Mean:  508.85714285714283
Median:  459.0
Mode: 340 (appeared 2 times.)


### 3.2 Adding Outlier
Here, students have the opportunity to add outliers of their own choice. First, they enter how many outliers they want to add. Then the program asks for each outlier value in turn.

In [57]:
inputs = []
numbers = int(input("Number(s) of Outlier : "))
inputs = [input("Outlier " + str(i+1) + ": ")  for i in range(numbers)]
inputs = list(map(int, inputs))

datasetWithOutlier = []
datasetWithOutlier = dataWithoutOutlier + inputs

if(numbers == 1):
    print("Dataset with outlier (last", numbers, "entry was added) : ", datasetWithOutlier)
else:
    print("Dataset with outliers (last", numbers, "entries were added) : ", datasetWithOutlier)

Number(s) of Outlier : 2
Outlier 1: 34564
Outlier 2: 567675
Dataset with outliers (last 2 entries were added) :  [190, 205, 250, 270, 290, 335, 340, 340, 355, 360, 380, 410, 450, 458, 460, 482, 510, 521, 530, 547, 590, 595, 600, 780, 843, 871, 1036, 1250, 34564, 567675]


Now we will compute the measures of central tendency and produce a histogram of the dataset with the outlier(s).

In [38]:
computeCenTendency(datasetWithOutlier)

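Mean:  917.0
Median:  460.0
Mode: 340 (appeared 2 times.)

Note that this output does not correspond to the two outliers entered above: it is consistent with a run in which a single outlier, $12345$, was added. With $34564$ and $567675$ added, the mean would be about $20549.57$ and the median $471.0$, while the mode would still be $340$.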


### 3.3 Histogram with Outliers
Here, we plot a histogram of the data that contains the outlier(s). Students can again set the number of bins on the slider and observe how the graph changes.

In [39]:
def callPlottingFunctionWithOutlier(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(datasetWithOutlier, num_bins, 'Sodium Content', 'values', 'Histogram with Outliers')
interact(callPlottingFunctionWithOutlier, num_bins=widgets.IntSlider(min=1,max=100,step=1,value=50));

**Effects of Outliers:** From the above results and histogram, we can see that after adding outliers to the dataset, the mean changes dramatically (depending on the number and size of the outliers). This is because computing the mean requires adding up all the values. Outliers are numbers that do not belong with the regular values; here each one is typically a very large number, so the sum, and hence the mean, becomes much larger. In contrast, the median depends only on the middle value(s) of the sorted data, so adding a large outlier shifts it slightly to the right and it increases only a little. The mode is very unlikely to change, because it is defined by repetition. Enter a very large number as an outlier and observe the histogram: you will notice that all the regular data sit in one particular area while the entered value is far away.
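
The effect described above can be tabulated without any interactive input. The following sketch uses the outlier-free data from section 3.1 and three example outliers; the values $2000$ and $12345$ are picked arbitrarily for illustration, and the helper names are not part of the notebook.

```python
from statistics import mean, median, multimode

# Outlier-free sodium data from section 3.1 (28 values)
base = [871, 1250, 270, 250, 205, 340, 843, 482, 521, 450, 360, 780,
        340, 510, 460, 380, 290, 458, 335, 355, 590, 600, 595, 547,
        1036, 410, 190, 530]

print("no outlier:", round(mean(base), 2), median(base), multimode(base))

rows = []
for outlier in (2000, 12345, 567675):
    data = base + [outlier]  # builds a new list; base is left unchanged
    rows.append((outlier, round(mean(data), 2), median(data), multimode(data)))
    print("outlier", outlier, "->", rows[-1][1:], "(mean, median, mode)")
```

In this run the mean grows with the size of the outlier, the median moves only from 459 to 460, and the mode stays at 340.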

## <font color='Blue'> 4 Example Using Python Library </font>
In this section we work with a pandas DataFrame. We use the file **nba_2013.csv** as our dataset; it contains $31$ columns of data for $481$ NBA players in 2013.

First, we import the Python library (pandas) and then read the rows of the csv file. 

In [40]:
import pandas
nba = pandas.read_csv("nba_2013.csv")
print(nba)

                    player pos  age bref_team_id   g  gs    mp   fg   fga  \
0               Quincy Acy  SF   23          TOT  63   0   847   66   141   
1             Steven Adams   C   20          OKC  81  20  1197   93   185   
2              Jeff Adrien  PF   27          TOT  53  12   961  143   275   
3            Arron Afflalo  SG   28          ORL  73  73  2552  464  1011   
4            Alexis Ajinca   C   25          NOP  56  30   951  136   249   
5             Cole Aldrich   C   25          NYK  46   2   330   33    61   
6        LaMarcus Aldridge  PF   28          POR  69  69  2498  652  1423   
7              Lavoy Allen  PF   24          TOT  65   2  1072  134   300   
8                Ray Allen  SG   38          MIA  73   9  1936  240   543   
9               Tony Allen  SG   32          MEM  55  28  1278  204   413   
10         Al-Farouq Aminu  SF   23          NOP  80  65  2045  234   494   
11          Louis Amundson  PF   31          TOT  19   0   185   16    32   

Next, we determine the dimensions of the NBA dataset.

In [41]:
print("Total number of rows : ", nba.shape[0])
print("Total number of columns : ", nba.shape[1])

Total number of rows :  481
Total number of columns :  31


###  4.1 Mean

In [42]:
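#Mean 
# numeric_only=True skips text columns such as player and pos (needed in pandas 2.0 and later)
nba.mean(numeric_only=True)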

age             26.509356
g               53.253638
gs              25.571726
mp            1237.386694
fg             192.881497
fga            424.463617
fg.              0.436436
x3p             39.613306
x3pa           110.130977
x3p.             0.285111
x2p            153.268191
x2pa           314.332640
x2p.             0.466947
efg.             0.480752
ft              91.205821
fta            120.642412
ft.              0.722419
orb             55.810811
drb            162.817048
trb            218.627859
ast            112.536383
stl             39.280665
blk             24.103950
tov             71.862786
pf             105.869023
pts            516.582121
season_end    2013.000000
dtype: float64

###  4.2 Median 

In [43]:
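#Median
# numeric_only=True skips text columns such as player and pos (needed in pandas 2.0 and later)
nba.median(numeric_only=True)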

age             26.000000
g               61.000000
gs              10.000000
mp            1141.000000
fg             146.000000
fga            332.000000
fg.              0.438000
x3p             16.000000
x3pa            48.000000
x3p.             0.330976
x2p            110.000000
x2pa           227.000000
x2p.             0.474475
efg.             0.488000
ft              53.000000
fta             73.000000
ft.              0.751000
orb             35.000000
drb            135.000000
trb            168.000000
ast             65.000000
stl             32.000000
blk             14.000000
tov             58.000000
pf             104.000000
pts            401.000000
season_end    2013.000000
dtype: float64
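
Comparing the two outputs column by column hints at where a few very large values are pulling the mean upward. The sketch below copies a handful of values by hand from the `mean` and `median` outputs above; the selection of columns and the "relative gap" measure are assumptions made for this illustration, not a standard statistic.

```python
# Values copied from the mean and median outputs above
means = {"age": 26.509356, "gs": 25.571726, "x3p": 39.613306,
         "ast": 112.536383, "pts": 516.582121}
medians = {"age": 26.0, "gs": 10.0, "x3p": 16.0, "ast": 65.0, "pts": 401.0}

# Relative gap: how far the mean sits above the median, as a fraction of the median
gaps = {col: (means[col] - medians[col]) / medians[col] for col in means}

# Print the columns from largest gap to smallest
for col, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:>4}: mean is {gap:+.0%} relative to the median")
```

For `age` the mean and median almost coincide, whereas for `gs` the mean is more than double the median, which suggests that a small number of players with very large values dominate that column's mean.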

###  4.3 Mode

In [44]:
#Mode
nba.mode()

Unnamed: 0,player,pos,age,bref_team_id,g,gs,mp,fg,fga,fg.,...,drb,trb,ast,stl,blk,tov,pf,pts,season,season_end
0,A.J. Price,SG,25.0,TOT,82.0,0.0,15.0,0.0,1.0,0.429,...,0.0,3.0,0.0,0.0,0.0,0.0,1.0,0.0,2013-2014,2013.0
1,Aaron Brooks,,,,,,392.0,,,,...,,,,,,,,,,
2,Aaron Gray,,,,,,1416.0,,,,...,,,,,,,,,,
3,Adonis Thomas,,,,,,,,,,...,,,,,,,,,,
4,Al Harrington,,,,,,,,,,...,,,,,,,,,,
5,Al Horford,,,,,,,,,,...,,,,,,,,,,
6,Al Jefferson,,,,,,,,,,...,,,,,,,,,,
7,Al-Farouq Aminu,,,,,,,,,,...,,,,,,,,,,
8,Alan Anderson,,,,,,,,,,...,,,,,,,,,,
9,Alec Burks,,,,,,,,,,...,,,,,,,,,,
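
The table above has more than one row because `DataFrame.mode()` returns every most-frequent value of each column and pads the shorter columns with `NaN`. The `player` column, for instance, lists many names, presumably because they are all tied. The toy table below is hypothetical and only illustrates this padding behaviour.

```python
import pandas as pd

# Hypothetical toy table: column 'a' has a single mode, column 'b' has three tied values
toy = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

modes = toy.mode()
print(modes)
```

Column `a` reports its single mode in row 0 and `NaN` in the remaining rows, while column `b` lists all three tied values.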


###  4.4 Histogram 

In [45]:
def callPlottingFunctionNBA(num_bins):
    print("Generating... plot for :", num_bins)
    plotHistogram(nba["ast"], num_bins, "Assists (ast)", "Number of players", "Histogram of assists for 2013 NBA players")
interact(callPlottingFunctionNBA, num_bins = widgets.IntSlider(min=1,max=100,step=1,value=10));

## <font color='Blue'> 5 Conclusion </font>
The three most common statistical averages are the (arithmetic) mean, median, and mode. We observed the effect of outliers on each of them: the mean changes quickly when outliers are added, the median changes very slowly, and in most cases the mode remains the same.

## <font color='Blue'>6 Test yourself </font>

### 6.1 Practice Problem : Easy
Consider the following GPAs of students from two semesters of the Stat-0000 course: 
<br>
Semester $1: 3.1, 3.2, 2.8, 2.9, 3.0, 3.4, 2.3, 3.2, 2.1, 3.5$ <br>
Semester $2: 2.2, 4.1, 2.6, 2.7, 3.8, 2.8, 2.4, 3.2, 2.7, 2.9$


In [47]:
answer = widgets.RadioButtons(options=['', '2.95', '3', '3.05', '3.10', 'None of the above'],
                              value='', description='Choices:')

def display():
    print("6.1.1 What is the mean of semester 1?")
    IPython.display.display(answer)

def check(a):
    IPython.display.clear_output(wait=False)
    display()
    if answer.value == '2.95':
        print("Voila!! You are right!")
    else:
        print("Wrong answer! Try again.")

display()
answer.observe(check, 'value')

6.1.1 What is the mean of semester 1?


In [48]:
answer = widgets.RadioButtons(options=['', '2.43', '2.5', '2.6', '2.7', 'None of the above'],
                              value='', description='Choices:')

def display():
    print("6.1.2 What is the median of semester 2?")
    IPython.display.display(answer)

def check(a):
    IPython.display.clear_output(wait=False)
    display()
    # The sorted data has middle values 2.7 and 2.8, so the median is 2.75
    if answer.value == 'None of the above':
        print("Voila!! You are right!")
    else:
        print("Wrong answer! Try again.")

display()
answer.observe(check, 'value')

6.1.2 What is the median of semester 2?


### 6.2 Practice Problem : Medium


#### 6.2.1 Distribution type
We want to know how the values are distributed. As a rule of thumb, if $mean < median < mode$ the distribution is called negatively skewed, and if $mean > median > mode$ it is called positively skewed. 
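
The ordering rule can be written as a small helper. This is a sketch under the rule of thumb stated above, not a formal skewness test; the function name `classify_skew` is made up for this example, and it is applied to the sodium statistics from section 2.3 rather than to the exercise data.

```python
def classify_skew(mean, median, mode):
    """Classify a distribution using the mean/median/mode ordering rule."""
    if mean < median < mode:
        return "negatively skewed"
    if mean > median > mode:
        return "positively skewed"
    # Any other ordering is not covered by the rule
    return "undetermined by this rule"

# Sodium data from section 2.3: mean 714.0, median 460.0, mode 340
print(classify_skew(714.0, 460.0, 340))
```

For the sodium data the mean exceeds the median, which exceeds the mode, so the rule labels it positively skewed.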

In [49]:
answer = widgets.RadioButtons(options=['', 'Negatively Skewed', 'Positively skewed', 'both of them', 'None of the above'],
                              value='', description='Choices:')

def display():
    print("6.2.1 What is the distribution type of semester 1? (Hint: first, calculate mean, median, mode)")
    IPython.display.display(answer)

def check(a):
    IPython.display.clear_output(wait=False)
    display()
    if answer.value == 'Negatively Skewed':
        print("Voila!! You are right!")
    else:
        print("Wrong answer! Try again.")

display()
answer.observe(check, 'value')

6.2.1 What is the distribution type of semester 1? (Hint: first, calculate mean, median, mode)


#### 6.2.2 Median of medians
Now we learn how to calculate the median of medians, which leads to the idea of *quartiles*. When we compute the median (M) of a dataset, it divides the sorted data into two groups: group $1$ holds the values less than or equal to the median, and group $2$ holds the values greater than or equal to the median. (When the number of values is odd, the median itself is left out of both groups, as in the example below.)
<br>
**First quartile $(M_1)$:** the median of the first group. This group is a smaller data set, so we can find its median as discussed in the preliminaries. <br>
**Third quartile $(M_3)$:** the median of the second group, where every value is greater than or equal to the median. <br>
The **second quartile** is our median (M).
<br>
For example, suppose our data set is: 
$S_1 = \{2, 5, 7, 8, 9\}$<br>
So the median (M) is $7$<br>
and now we have two groups:
<br>
Group $1: \{2,5\}$ <br>
Group $2: \{8,9\}$ <br>
So the first quartile $(M_1)$ is: $\frac{2+5}{2} = 3.5$ <br>
Similarly, the third quartile is $(M_3) = \frac{8+9}{2} = 8.5$
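
The median-of-halves procedure can be sketched in a few lines of Python. The function name `quartiles` is an assumption of this example, and the method follows the description in this section; library routines such as `numpy.percentile` interpolate differently and may return slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (M1, M, M3) using the median-of-halves method described above."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)   # work on a sorted copy
    half = len(ordered) // 2   # size of each group
    lower = ordered[:half]     # group 1: values before the middle position
    upper = ordered[-half:]    # group 2: values after the middle position
    return median(lower), median(ordered), median(upper)

print(quartiles([2, 5, 7, 8, 9]))  # the data set S_1 from the example
```

For $S_1$ this reproduces the values worked out by hand: $3.5$, $7$, and $8.5$.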

<br>
    Based on the discussion, answer the following questions. 

In [50]:
answer = widgets.RadioButtons(options=['', '2.5 and 3.1', '2.6 and 3.1', '2.6 and 3.2', '2.8 and 3.2', 'None of the above'],
                              value='', description='Choices:')

def display():
    print("6.2.2 What are the first and third quartiles of semester 2?")
    IPython.display.display(answer)

def check(a):
    IPython.display.clear_output(wait=False)
    display()
    if answer.value == '2.6 and 3.2':
        print("Voila!! You are right!")
    else:
        print("Wrong answer! Try again.")

display()
answer.observe(check, 'value')

6.2.2 What are the first and third quartiles of semester 2?


### 6.3 Practice Problem : Hard
A student has gotten the following grades on his tests: 87, 95, 76, and 88. He wants an 85 or better overall. What is the minimum grade he must get on the last test in order to achieve that average?

In [51]:
answer = widgets.RadioButtons(options=['', '78', '78.5', '79', '80', 'None of the above'],
                              value='', description='Choices:')

def display():
    IPython.display.display(answer)

def check(a):
    IPython.display.clear_output(wait=False)
    display()
    if answer.value == '79':
        print("Voila!! You are right!")
    else:
        print("Wrong answer! Try again.")

display()
answer.observe(check, 'value')

The minimum grade is what I need to find. To find the average of all his grades (the known ones, plus the unknown one), I have to add up all the grades, and then divide by the number of grades. Since I don't have a score for the last test yet, I'll use a variable to stand for this unknown value: "x". Then the computation to find the desired average is:

(87 + 95 + 76 + 88 + x) ÷ 5 = 85

Multiplying through by 5 and simplifying, I get:

87 + 95 + 76 + 88 + x = 425

346 + x = 425

x = 79

He needs to get at least a 79 on the last test.
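
The same algebra can be checked in code. The helper name `minimum_last_score` is invented for this sketch; it simply rearranges the equation above to solve for $x$.

```python
def minimum_last_score(known_scores, target_average):
    """Smallest final score that brings the overall average up to the target."""
    total_tests = len(known_scores) + 1  # the known tests plus the last one
    # target_average * total_tests is the total needed; subtract what is already earned
    return target_average * total_tests - sum(known_scores)

print(minimum_last_score([87, 95, 76, 88], 85))
```

This prints 79, matching the hand calculation.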

### 6.4 Practice Problem: Miscellaneous
In this section we discuss how to identify outliers using the first and third quartiles. First, we find the interquartile range (IQR). Then we compute the lower and upper limits for outliers with the following formulae: <br>
Lower limit: $$M_1 - 1.5 \times \text{IQR}$$
and upper limit: $$M_3 + 1.5 \times \text{IQR}$$
where<br>
IQR $= (M_3 - M_1)$ <br>
So, all data values that fall outside this range are outliers.
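
As an illustration, the limits can be computed for the data set from Example 1.1.1. This sketch assumes the median-of-halves quartiles from section 6.2.2; the function name `outlier_limits` is chosen here for the example.

```python
from statistics import median

def outlier_limits(values):
    """Return (lower limit, upper limit) from the 1.5 x IQR rule."""
    ordered = sorted(values)
    half = len(ordered) // 2
    m1 = median(ordered[:half])   # first quartile: median of the lower group
    m3 = median(ordered[-half:])  # third quartile: median of the upper group
    iqr = m3 - m1
    return m1 - 1.5 * iqr, m3 + 1.5 * iqr

data = [2, 26, 23, 27, 25, 28, 29, 24, 99]  # Example 1.1.1
low, high = outlier_limits(data)
outliers = [x for x in data if x < low or x > high]
print("Limits:", low, high)
print("Outliers:", outliers)
```

Here $M_1 = 23.5$ and $M_3 = 28.5$, so the limits are $16$ and $36$, and the values $2$ and $99$ are flagged, in agreement with Example 1.1.1.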

Now, answer the following question:

In [52]:
answer = widgets.RadioButtons(options=['', '1.6 and 4.1', '1.7 and 3.9', '2.2 and 3.8', '1.7 and 4.1', 'None of the above'],
                              value='', description='Choices:')

def display():
    print("6.4.1 What are the lower and upper limits for outliers of semester 2?")
    IPython.display.display(answer)

def check(a):
    IPython.display.clear_output(wait=False)
    display()
    if answer.value == '1.7 and 4.1':
        print("Voila!! You are right!")
    else:
        print("Wrong answer! Try again.")

display()
answer.observe(check, 'value')

6.4.1 What are the lower and upper limits for outliers of semester 2?


In [53]:
answer = widgets.RadioButtons(options=['', '1.6 and 4.1', '1.7 and 3.9', '2.2 and 3.8', '1.7 and 4.1', 'None of the above'],
                              value='', description='Choices:')

def display():
    print("6.4.2 What are the potential outliers of semester 1?")
    IPython.display.display(answer)

def check(a):
    IPython.display.clear_output(wait=False)
    display()
    # Semester 1 limits are 2.2 and 3.8, so the only value outside them is 2.1
    if answer.value == 'None of the above':
        print("Voila!! You are right!")
    else:
        print("Wrong answer! Try again.")

display()
answer.observe(check, 'value')

6.4.2 What are the potential outliers of semester 1?
