![alt text][top-banner]

[top-banner]: ./callysto-top-banner.jpg

In [3]:
%%html

<script>
 function code_toggle() {
   if (code_shown){
     $('div.input').hide('500');
     $('#toggleButton').val('Show Code')
   } else {
     $('div.input').show('500');
     $('#toggleButton').val('Hide Code')
   }
   code_shown = !code_shown
 }

 $( document ).ready(function(){
   code_shown=false;
   $('div.input').hide()
 });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [4]:
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from ipywidgets import interact, widgets, Button, Layout

from scipy import stats
from collections import Counter
from array import array
from statistics import mode
import IPython
import pandas

# Data Analysis : Mean, Median, and Mode 

In this notebook we will discuss measures of central tendency (mean, median, and mode).
<br>First, we will review the basics of mean, median, mode, and outliers. Further details are readily available on the internet, such as [this website for Australian junior high students](https://www.mathsteacher.com.au/year8/ch17_stat/02_mean/mean.htm). 

We will use the Python programming language to import and process several sets of data. In particular, using Python we can easily calculate the mean, median, and mode for a dataset. 

This notebook is organized as follows. In Section 1, we discuss the preliminaries and the definition of outliers, followed by two primary examples. In Section 2, we compute the basic measures of central tendency for a data set without outliers, imported from a local folder. In Section 3, once we've covered the basics, you'll have the opportunity to use some basic Python tools to adjust a dataset by adding one or more outliers, and then observe the effect on mean, median, and mode. In Section 4, we use a Python library as a data source. Section 5 summarizes the notebook, and Section 6 provides exercise problems for students.

<br>
## <font color='Blue'>  1 Preliminaries </font>
### 1.1 What do we mean by the 'mean'?
The *average* of a set of numbers is called the **mean**. 
<br> To calculate: just add up all the numbers and then divide by how many numbers there are. 
<br> If a set has $n$ elements, $S = \{a_1, a_2, a_3, \ldots, a_n\}$, then the mean of the set is
$$mean=\frac{\sum_{i=1}^{n} a_i}{n}$$

##### Example 1 : easy
Data set $= 2, 6, 3, 7, 5, 3, 9$
<br> number of elements in data set $= 7$
<br>So the mean is, 
$$mean = \frac{2 + 6 + 3 + 7 + 5 + 3 + 9}{7} = 5 $$
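
The same calculation can be sketched in Python. This is a minimal illustration using the data set from Example 1; the variable names are chosen here for illustration only.

```python
from statistics import mean

data = [2, 6, 3, 7, 5, 3, 9]

# Mean by the definition: sum of the elements divided by their count.
manual_mean = sum(data) / len(data)

# The standard library's statistics.mean gives the same value.
library_mean = mean(data)

print(manual_mean, library_mean)
```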

##### Example 2 : Never average averages
**If Tommy averaged $20$ points over $3$ basketball games and $30$ points over the next two, what was his average for all five games?**
###### solution:
Total points for first three games $= 20 \times 3 = 60$
<br>Points for next two games $ = 30 \times 2 = 60 $
<br>Total number of games $ = 3 + 2 = 5$
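<br>Total points for all five games $= 60 + 60 = 120$
<br>Therefore, the mean is $\frac{120}{5} = 24$. Note that simply averaging the two averages would give $\frac{20 + 30}{2} = 25$, which is wrong because the two averages cover different numbers of games.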
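
As a rough sketch of why averaging averages fails, the snippet below compares the correct overall mean with the naive one for Tommy's games. The list `stretches` and the variable names are illustrative assumptions, not part of the original notebook.

```python
# Each entry is (average points per game, number of games in that stretch).
stretches = [(20, 3), (30, 2)]

# Recover the total points and total games, then divide.
total_points = sum(avg * games for avg, games in stretches)
total_games = sum(games for _, games in stretches)
overall_mean = total_points / total_games

# Averaging the two averages ignores how many games each one covers.
naive_mean = sum(avg for avg, _ in stretches) / len(stretches)

print(overall_mean, naive_mean)
```

The two results differ because the first average carries the weight of three games and the second only two.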

As we will soon see, the mean of a dataset is sensitive to extreme values (outliers). This can be useful in situations where we want to take those values into account. For example, if we're including the cost of home heating in a household budget, the average cost of our heating bill will give the best measure of heating's impact on our budget over the course of a year.

### 1.2 Median : don't play in the MEDIAN at home kids!
The median is the middle number in a sequence of numbers.
To compute the median, first reorder the data set from smallest to largest. If the number of elements is odd, then the median is the element in the middle of the sorted data set. Otherwise, the median is the average of the two middle elements.


##### Example 1 : odd number of elements
Data set $= 2, 6, 3, 7, 5, 3, 9$
<br> sorted data set $= 2, 3, 3, 5, 6, 7, 9$
<br>So, median will be $5$.

##### Example 2 : even number of elements 
Data set $= 2, 6, 3, 7, 5, 3, 9, 4 $
<br> sorted data set $= 2, 3, 3, 4, 5, 6, 7, 9$
<br>Therefore, the median will be, $\frac{(4+5)}{2} = 4.5$
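
The procedure for the odd and even cases can be sketched as a small function. The name `median_by_hand` is a hypothetical helper introduced here; Python's built-in `statistics.median` returns the same values.

```python
from statistics import median


def median_by_hand(values):
    """Return the median by sorting and picking the middle element(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle element.
        return ordered[mid]
    # Even count: average of the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_data = [2, 6, 3, 7, 5, 3, 9]       # Example 1
even_data = [2, 6, 3, 7, 5, 3, 9, 4]   # Example 2

print(median_by_hand(odd_data), median_by_hand(even_data))
print(median(odd_data), median(even_data))
```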

Again, as we're about to explore, the median is a useful statistic because it is **not** as sensitive to extreme values. For example, businesses -- or city planners -- are more interested in median income than mean income when making decisions on where to locate a business or service. One very wealthy person moving to an area will increase the mean income in that area, but not the median.


### 1.3 Mode
The **mode** of a data set is the element that occurs most often. It is not uncommon for a data set to have more than one mode. This happens when two or more elements are tied for the highest frequency in the data set. A data set with two modes is called **bimodal**, and a data set with three modes is called **trimodal**.

##### Example 1 : single mode
Data set $= 2, 6, 3, 7, 5, 3, 9$
<br> sorted data set $= 2, 3, 3, 5, 6, 7, 9$
<br> Here, only the value $3$ occurs more than once. 
<br>So, the mode is $3$.

##### Example 2 : bimodal
Data set $= 2, 6, 3, 7, 5, 3, 9, 5, 3, 5 $
<br> sorted data set $= 2, 3, 3, 3,  5, 5, 5,  6, 7, 9$
<br> The values $3$ and $5$ each occur three times, more often than any other value. 
<br>So, the modes are $3$ and $5$.

##### Example 3 : trimodal
Data set $= 2, 6, 3, 7, 5, 3, 9, 5, 7$
<br> sorted data set $= 2, 3, 3, 5, 5, 6, 7, 7, 9$
<br> As we can see from the sorted list of data, $3$, $5$, and $7$ each occur twice.
<br>So, the modes are $3$, $5$, and $7$.

##### Example 4 : no mode 
Data set $= 2, 6, 3, 7, 5, 8, 9, 4 $
<br> sorted data set $= 2, 3, 4, 5, 6, 7, 8, 9$
<br>No value repeats. 
<br>So, there is no mode.
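
The four cases above can be checked with a short sketch built on `collections.Counter`. The helper name `modes` is hypothetical, and returning an empty list when no value repeats is an assumption made here to match Example 4. Note that Python's `statistics.multimode` (3.8+) uses a different convention and returns every value in that case.

```python
from collections import Counter


def modes(values):
    """Return all most frequent values, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value occurs once: treat as "no mode" (Example 4).
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 6, 3, 7, 5, 3, 9]))            # Example 1
print(modes([2, 6, 3, 7, 5, 3, 9, 5, 3, 5]))   # Example 2
print(modes([2, 6, 3, 7, 5, 3, 9, 5, 7]))      # Example 3
print(modes([2, 6, 3, 7, 5, 8, 9, 4]))         # Example 4
```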

In [5]:
from ggb import *
ggb = GGB()
ggb.file('centralTendency.ggb').draw()

## Test Yourself

In [6]:
def display(question, answerList):
    print(question)
    IPython.display.display(answerList)

In [7]:
question611 = "6.1.1 What will be the mean of semester 1?"

answer611 = widgets.RadioButtons(options=['Select best one','2.95', '3', '3.05', '3.10', 'None of the above'],
                             value= 'Select best one', description='Choices:')


def checkAnswer(a):
    IPython.display.clear_output(wait=False)
    display(question611, answer611)
    if answer611.value == '2.95':  # sum of semester 1 GPAs is 29.5, so 29.5 / 10 = 2.95
        print("Correct Answer!")
    else:
        if answer611.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")

display(question611, answer611)

answer611.observe(checkAnswer, 'value')

6.1.1 What will be the mean of semester 1?


In [8]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question611">
 <button id = "611"
onclick="toggle('answer611');">Solution</button> 
</div>
<div style="display:none" id="answer611">
To find the mean of semester 1, we divide the sum of all the data values by the number of values.<br/>
Therefore, Mean $=  \frac{3.1 + 3.2 + 2.8 + 2.9 + 3.0 + 3.4 + 2.3 + 3.2 + 2.1 + 3.5}{10} = \frac{29.5}{10} = 2.95$ <br/>


</div>

</body>
</html>

In [9]:
question612 = "6.1.2 What will be the median of semester 2?"

answer612 = widgets.RadioButtons(options=['Select best one', '2.43', '2.5', '2.67', '2.75', 'None of the above'],
                              value = 'Select best one', description='Choices:')

def check612(b):
    IPython.display.clear_output(wait=False)
    display(question612, answer612)
    if answer612.value == '2.75':
        print("Correct Answer!")
    else:
        if answer612.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")

IPython.display.clear_output(wait=False)
display(question612, answer612)
answer612.observe(check612, 'value')

6.1.2 What will be the median of semester 2?


In [10]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question612">
 <button id = "612"
onclick="toggle('answer612');">Solution</button> 
</div>
<div style="display:none" id="answer612">
To determine the median, we first need to sort the data. <br/>
Sorted data,<br/>
Semester  $2: 2.2,2.4,2.6,2.7,2.7,2.8,2.9,3.2,3.8,4.1$ <br/>
There are $10$ numbers in the data set, so the median is the average of the $5^{th}$ and $6^{th}$ values. <br/>
So, the Median $= \frac{2.7 + 2.8}{2} = 2.75$ <br/>


</div>

</body>
</html>

In [11]:
question621 = "6.2.1 What is the distribution type of semester 1? (Hint: first, calculate the mean, median, and mode)"

answer621 = widgets.RadioButtons(options=['Select best one', 'Negatively Skewed', 'Positively skewed', 'both of them', 'None of the above'],
                              value = 'Select best one', description='Choices:')

def check621(c):
    IPython.display.clear_output(wait=False)
    display(question621, answer621)
    if answer621.value == 'Negatively Skewed':
        print("Correct Answer!")
    else:
        if answer621.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")
        
IPython.display.clear_output(wait=False)
display(question621, answer621)
answer621.observe(check621, 'value')

6.2.1 What is the distribution type of semester 1? (Hint: first, calculate the mean, median, and mode)


In [12]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question621">
 <button id = "621"
onclick="toggle('answer621');">Solution</button> 
</div>
<div style="display:none" id="answer621">
First, we calculate the mean, median, and mode of semester 1. <br/> 
Mean $=  \frac{3.1 + 3.2 + 2.8 + 2.9 + 3.0 + 3.4 + 2.3 + 3.2 + 2.1 + 3.5}{10} = 2.95$ <br/>
To find the median, we need to sort the data. <br/>
Sorted data,<br/>
Semester  $1: 2.1,2.3,2.8,2.9,3.0,3.1,3.2,3.2,3.4,3.5$ <br/>
There are $10$ numbers in the data set, so the median is the average of the $5^{th}$ and $6^{th}$ values. <br/>
Median $= \frac{3.0 + 3.1}{2} = 3.05$ <br/>
From the sorted data above, we can see that only $3.2$ occurs twice. <br/>
So, the mode is $3.2$ <br/>

The mean is less than the median, and the median is less than the mode. That is, mean $<$ median $<$ mode. <br/>
Therefore, the distribution of the semester 1 data is negatively skewed. <br/>

</div>

</body>
</html>

In [13]:
answer622 = widgets.RadioButtons(options=[  'Select best one','2.5 and 3.1', '2.6 and 3.1', '2.6 and 3.2', '2.8 and 3.2', 'None of the above'],
                             value = 'Select best one', description='Choices:')

question622 = "6.2.2 What are the first and third quartiles of semester 2?"


def check622(d):
    IPython.display.clear_output(wait=False)
    display(question622, answer622)
    if answer622.value == '2.6 and 3.2':
        print("Correct Answer!")
    else:
        if answer622.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")
        
IPython.display.clear_output(wait=False)
display(question622,answer622 )
answer622.observe(check622, 'value')

6.2.2 What are the first and third quartiles of semester 2?


In [14]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question622">
 <button id = "622"
onclick="toggle('answer622');">Solution</button> 
</div>
<div style="display:none" id="answer622">
First, we sort the data of semester 2. <br/> 
Sorted data,<br/>
Semester  2: 2.2, 2.4, 2.6, 2.7, 2.7, 2.8, 2.9, 3.2, 3.8, 4.1 <br/>
Lower half of the data: 2.2, 2.4, 2.6, 2.7, 2.7 <br/>
So the median of the lower half, which is the first quartile, is 2.6 <br/>
Upper half of the data: 2.8, 2.9, 3.2, 3.8, 4.1 <br/>
Similarly, the third quartile $M_3$ is 3.2 <br/>

</div>

</body>
</html>

In [15]:
answer631 = widgets.RadioButtons(options=[ 'Select best one','78', '78.5', '79', '80', 'None of the above'],
                             value = 'Select best one', description='Choices:')

question631 = "6.3.1 What is the minimum mark he must get on the last test in order to achieve that average?"

def check631(e):
    IPython.display.clear_output(wait=False)
    display(question631, answer631)
    if answer631.value == '79':
        print("Correct Answer!")
    else:
        if answer631.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")
        
IPython.display.clear_output(wait=False)
display(question631, answer631)
answer631.observe(check631, 'value')

6.3.1 What is the minimum mark he must get on the last test in order to achieve that average?


In [16]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question631">
 <button id = "631"
onclick="toggle('answer631');">Solution</button> 
</div>
<div style="display:none" id="answer631">
The minimum mark is what we need to find. To find the average of all his marks (the known ones, plus the unknown one), 
we have to add up all the grades, and then divide by the number of marks. Since we do not have a score for the last test yet, 
we will use a variable to stand for this unknown value: "x". Then the computation to find the desired average is: <br />

the first step : (87 + 95 + 76 + 88 + x) ÷ 5 = 85 <br />
Multiplying through by 5 and simplifying, we get: <br/>
the next step : 87 + 95 + 76 + 88 + x = 425 <br />
the next step : 346 + x = 425 <br />
the final step : x = 79 <br />
so, he needs to get at least a 79 on the last test.
</div>

</body>
</html>
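
The algebra in the solution above generalizes: if the target average over all tests is fixed, the missing mark equals the target times the total number of tests, minus the sum of the known marks. Here is a minimal sketch; the function name `required_score` is a hypothetical helper, and the marks are the ones used in the solution.

```python
def required_score(known_marks, target_average):
    """Mark needed on one more test to reach target_average overall."""
    total_tests = len(known_marks) + 1
    # (sum(known) + x) / total_tests = target  =>  x = target * total_tests - sum(known)
    return target_average * total_tests - sum(known_marks)


x = required_score([87, 95, 76, 88], 85)
print(x)
```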

In [17]:
answer641 = widgets.RadioButtons(options=['Select best one','1.6 and 4.1', '1.7 and 3.9', '2.2 and 3.8', '1.7 and 4.1', 'None of the above'],
                              value='Select best one',description='Choices:')

question641 = "6.4.1 What are the lower and upper limits for outliers in semester 2?"


def checkAnswer641(f):
    IPython.display.clear_output(wait=False)
    display(question641, answer641 )
    if answer641.value == '1.7 and 4.1':
        print("Correct Answer!")
    else:
        if answer641.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")

IPython.display.clear_output(wait=False)
display(question641, answer641)
answer641.observe(checkAnswer641, 'value')

6.4.1 What are the lower and upper limits for outliers in semester 2?


In [18]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question641">
 <button id = "641"
onclick="toggle('answer641');">Solution</button> 
</div>
<div style="display:none" id="answer641">
First, we determine the first quartile $(M_1)$ and the third quartile $(M_3)$ of semester 2.<br/> 
Lower half of the sorted data: 2.2, 2.4, 2.6, 2.7, 2.7 <br/>
    So the median of the lower half, which is the first quartile, is 2.6 <br/>
Similarly, $M_3$ is 3.2 <br/>
So, the interquartile range (IQR) = 3.2 - 2.6 = 0.6 <br/>
Now we can compute the limits for outliers. <br/>

Lower limit : $M_1 - 1.5\times IQR = 2.6 - 1.5 \times 0.6 = 1.7$ <br/>
Similarly, we can determine the upper limit, <br/>
Upper limit : $M_3 + 1.5 \times IQR = 3.2 + 1.5 \times 0.6 = 4.1$
</div>

</body>
</html>
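
The outlier limits used in the solution above can be sketched in Python. The helper names `quartiles_by_halves` and `outlier_limits` are hypothetical, and the quartiles are assumed to be the medians of the lower and upper halves of the sorted data, as in the solution. Library routines such as `numpy.percentile` interpolate by default and may give slightly different quartiles. The sketch is applied to the even-sized data set from Section 1.2 rather than to the exercise data.

```python
from statistics import median


def quartiles_by_halves(values):
    """First and third quartiles as medians of the lower and upper halves.

    Needs at least two values; the middle value is skipped when the count is odd.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[len(ordered) - half:]
    return median(lower), median(upper)


def outlier_limits(values):
    """Lower and upper limits: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1, q3 = quartiles_by_halves(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [2, 6, 3, 7, 5, 3, 9, 4]
low, high = outlier_limits(data)

# Values outside the limits are potential outliers.
outliers = [v for v in data if v < low or v > high]
print(low, high, outliers)
```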

In [19]:
answer642 = widgets.RadioButtons(options=['Select best one', '2.2', '3.8', '1.7', '2.1', 'None of the above'],
                             value = 'Select best one',  description='Choices:')

question642 = "6.4.2 What is the potential outlier in semester 1?"

def check642(g):
    IPython.display.clear_output(wait=False)
    display(question642, answer642)
    if answer642.value == '2.1':
        print("Correct Answer!")
    else:
        if answer642.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")

IPython.display.clear_output(wait=False)
display(question642, answer642)
answer642.observe(check642, 'value')

6.4.2 What is the potential outlier in semester 1?


In [20]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question642">
 <button id = "642"
onclick="toggle('answer642');">Solution</button> 
</div>
<div style="display:none" id="answer642">
In the previous question we determined the lower and upper limits for semester 2. Similarly, if we calculate the limits for semester 1, 
we find that the lower and upper limits are 2.2 and 3.8 respectively. By the definition of outliers, any number that is 
lower than the lower limit or higher than the upper limit is a potential outlier.<br />

From the semester 1 data we can see that 2.1 is less than the lower limit (2.2), and all other GPAs lie between the lower
and upper limits.  <br/>Hence, 2.1 is a potential outlier.
</div>

</body>
</html>

![alt text][bottom-banner]

[bottom-banner]: ./callysto-bottom-banner.jpg