![alt text][top-banner]

[top-banner]: ./callysto-top-banner.jpg

In [27]:
%%html

<script>
 function code_toggle() {
   if (code_shown){
     $('div.input').hide('500');
     $('#toggleButton').val('Show Code')
   } else {
     $('div.input').show('500');
     $('#toggleButton').val('Hide Code')
   }
   code_shown = !code_shown
 }

 $( document ).ready(function(){
   code_shown=false;
   $('div.input').hide()
 });
</script>
<form action="javascript:code_toggle()"><input type="submit" id="toggleButton" value="Show Code"></form>

In [28]:
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
from ipywidgets import interact, widgets, Button, Layout

from scipy import stats
from collections import Counter
from array import array
from statistics import mode
import IPython
import pandas

# Data Analysis : Mean, Median, and Mode 

In this notebook, we discuss measures of central tendency: the mean, the median, and the mode.<br>
First, we review the basics of the mean, the median, the mode, and outliers.<br>
Further details are readily available on the internet, such as [this website for Australian junior high students](https://www.mathsteacher.com.au/year8/ch17_stat/02_mean/mean.htm). 

We will use the Python programming language to import and process several sets of data.<br>
In particular, using Python, we can quickly calculate the mean, median, and mode for a dataset. 

This notebook is organized as follows.
* In Section 1, we cover the preliminaries: the definitions of mean, median, and mode.
* In Section 2, students can explore these statistics using an interactive bar graph.
* In Section 3, we compute these statistics for a real dataset using a Python library.
* In Section 4, students will find exercises they can use to check their understanding.

##  1 Preliminaries 
### 1.1 What do we mean by the 'mean'?
The *average* of a set of numbers is called the **mean**. 
<br> To calculate: add up all the numbers and then divide by how many numbers there are. 

##### Example 1 : easy
Data set: $2, 6, 3, 7, 5, 3, 9$
<br> Number of elements in data set: $7$
<br>So the mean is
$$\text{mean} = \frac{2 + 6 + 3 + 7 + 5 + 3 + 9}{7} = 5 $$
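
The same calculation can be checked in Python. This is a minimal sketch using only the standard library; the variable names `data` and `manual_mean` are chosen here for illustration.

```python
from statistics import mean

# Data set from Example 1
data = [2, 6, 3, 7, 5, 3, 9]

# Add up all the numbers, then divide by how many numbers there are
manual_mean = sum(data) / len(data)

print(manual_mean)                # 5.0
print(mean(data) == manual_mean)  # True
```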

##### Example 2 : Never average averages
**If Tommy averaged $20$ points over $3$ basketball games and $30$ points over the next two, what was his average for all five games?**
###### Solution:
Total points for first three games: $20 \times 3 = 60$
<br>Points for next two games: $30 \times 2 = 60 $
<br>Total number of games: $3 + 2 = 5$
<br>Therefore the mean is $\dfrac{60+60}{5} = 24$

You might have been tempted to take the average of 20 and 30, but these two averages occur over a different number of games!<br>
In a situation like this, we first "work backward" from the average.<br>
The averages do not tell us how many points Tommy scored in each game.<br>
However, we know that one set of scores that produce our averages is $20, 20, 20, 30, 30$.<br>
Averaging these gives us our result.

**Note:** there is a way to correctly "average averages," but it's beyond the level of this notebook.<br>
If you're curious, you can read online about how to calculate a [weighted mean](https://en.wikipedia.org/wiki/Weighted_arithmetic_mean).
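
For the curious, the weighted mean for Tommy's games comes down to the same arithmetic as the solution above. The sketch below, with illustrative variable names, compares it with the tempting but incorrect "average of averages".

```python
# Tommy's averages and the number of games each average covers
averages = [20, 30]
games = [3, 2]

# Naive "average of averages": ignores how many games each average covers
naive = sum(averages) / len(averages)

# Weighted mean: weight each average by its number of games
total_points = sum(a * g for a, g in zip(averages, games))
weighted = total_points / sum(games)

print(naive)     # 25.0
print(weighted)  # 24.0
```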

As we will see in [our next notebook](CC66-Effect_Of_Outliers_On_Central_Tendency.ipynb), the mean of a dataset is sensitive to extreme values (outliers).<br>
This can be useful in situations where we want to take those values into account.<br>
For example, if we include the cost of home heating in a household budget, the average cost of our heating bill will give the best measure of heating's impact on our budget over the course of a year.

### 1.2 Median : don't play in the MEDIAN at home kids!
The **median** is the middle number in a sequence of numbers.<br>
To compute the median, first, reorder the data set from the smallest to the largest.<br>
If the number of elements is odd, then the median is the element in the middle of the data set.<br>
Otherwise, the median is the average of the two middle terms.


##### Example 1 : odd number of elements
Data set: $2, 6, 3, 7, 5, 3, 9$
<br> Sorted data set: $2, 3, 3, 5, 6, 7, 9$
<br> So, the median is $5$.

##### Example 2 : even number of elements 
Data set: $2, 6, 3, 7, 5, 3, 9, 4 $
<br> Sorted data set: $2, 3, 3, 4, 5, 6, 7, 9$
<br> Therefore, the median will be $\frac{(4+5)}{2} = 4.5$
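
The steps described above can be sketched in Python as follows. The helper `median_by_hand` is a name made up for this illustration; the standard library's `statistics.median` is shown alongside it for comparison.

```python
from statistics import median

def median_by_hand(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)   # reorder from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:             # odd count: take the middle element
        return ordered[middle]
    # even count: average the two middle elements
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_data = [2, 6, 3, 7, 5, 3, 9]
even_data = [2, 6, 3, 7, 5, 3, 9, 4]

print(median_by_hand(odd_data))   # 5
print(median_by_hand(even_data))  # 4.5
print(median(even_data))          # 4.5
```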

The median is a useful statistic because it is **not** as sensitive to extreme values.<br>
(You will see this for yourself in [the sequel to this notebook](CC66-Effect_Of_Outliers_On_Central_Tendency.ipynb).)<br>
For example, businesses -- or city planners -- are more interested in median income than mean income when making decisions on where to locate a business or service.<br>
One very wealthy person moving to an area can raise the mean income in that area considerably, while the median changes little, if at all.
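
To see this effect with numbers, here is a small sketch using made-up incomes (they are not real data). In this particular example the median does not move at all; with other data it may shift slightly.

```python
from statistics import mean, median

# Hypothetical yearly incomes (in dollars) for five households
incomes = [40_000, 45_000, 50_000, 50_000, 60_000]

# The same area after one very wealthy person moves in
incomes_after = incomes + [5_000_000]

print("before:", mean(incomes), median(incomes))
print("after: ", mean(incomes_after), median(incomes_after))
```

The mean jumps from 49,000 to more than 870,000, while the median stays at 50,000.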


### 1.3 Mode
The **mode** for a data set is the element that occurs the most often.<br>
It is not uncommon for a data set to have more than one mode.<br>
This happens when two or more elements occur with equal frequency in the data set.<br>
A data set with two modes is called **bimodal**.<br>
A data set with three modes is called **trimodal**.

##### Example 1 : single mode
Data set: $2, 6, 3, 7, 5, 3, 9$
<br> Sorted data set: $2, 3, 3, 5, 6, 7, 9$
<br> Here, only the value $3$ occurs more than once.
<br>So, the mode is $3$.

##### Example 2 : bimodal
Data set: $2, 6, 3, 7, 5, 3, 9, 5, 3, 5 $
<br> Sorted data set: $2, 3, 3, 3,  5, 5, 5,  6, 7, 9$
<br> The values $3$ and $5$ each occur three times, more often than any other value.
<br>So, the modes are $3$ and $5$.

##### Example 3 : trimodal
Data set: $2, 6, 3, 7, 5, 3, 9, 5, 7$
<br> Sorted data set: $2, 3, 3, 5, 5, 6, 7, 7, 9$
<br> As we can see from the sorted list of data, $3$, $5$, and $7$ each occur twice.
<br> So, the modes are $3$, $5$, and $7$.

##### Example 4 : no mode 
Data set: $2, 6, 3, 7, 5, 8, 9, 4 $
<br> Sorted data set: $2, 3, 4, 5, 6, 7, 8, 9$
<br>No value occurs more than once.
<br> So, there is no mode.
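
The four examples above can be reproduced with a short Python sketch. The helper `modes` is a hypothetical name introduced here; it follows this notebook's convention that a data set in which no value repeats has no mode.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of modes, or [] when no value repeats."""
    counts = Counter(values)        # how many times each value occurs
    highest = max(counts.values())
    if highest == 1:                # no value repeats: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 6, 3, 7, 5, 3, 9]))           # [3]
print(modes([2, 6, 3, 7, 5, 3, 9, 5, 3, 5]))  # [3, 5]
print(modes([2, 6, 3, 7, 5, 3, 9, 5, 7]))     # [3, 5, 7]
print(modes([2, 6, 3, 7, 5, 8, 9, 4]))        # []
```

Note that the standard library's `statistics.multimode` (Python 3.8 and later) uses a different convention: when no value repeats, it returns every value rather than an empty list.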

## 2 Interactive bar graph
Now you can play with a bar graph and examine how to calculate mean, median, and mode by changing values of the dataset. 

### 2.1 Instructions: 
* Step 1: enter how many data points you have in "Enter the numbers of the dataset." <br>
* Step 2: based on your input in step 1, sliders will be available (one for each data point). <br>
* Step 3: change the value of sliders and see the associated changes in the bar graph.<br>
* Step 4: to show the breakdown for calculating mean check the box Mean, for median check Median and so on.<br>

In [29]:
from ggb import *
ggb = GGB()
ggb.file('CC65_materials/centralTendency.ggb').draw()


### 2.2 Drawbacks of bar graph
* Additional information is needed to describe the dataset. 
* Patterns and other critical features of the data cannot always be read directly from the graph.

## 3 Example using a Python library
In this section, we compute measures of central tendency for a pandas DataFrame. We use the file **nba_2013.csv** as our dataset; it contains $31$ columns (the player's name plus $30$ attributes) for $481$ NBA players from the 2013-2014 season.

First, we read the CSV file and display a portion of it. Here, **mp** stands for Minutes Played, **drb** and **trb** stand for Defensive Rebounds and Total Rebounds, and so on. You may visit [Basketball Statistics](https://en.wikipedia.org/wiki/Basketball_statistics) for the meaning of the other column names.

In [30]:
nba = pandas.read_csv("CC65_materials/nba_2013.csv")
print(round(nba.head(10),3))

              player pos  age bref_team_id   g  gs    mp   fg   fga    fg.  \
0         Quincy Acy  SF   23          TOT  63   0   847   66   141  0.468   
1       Steven Adams   C   20          OKC  81  20  1197   93   185  0.503   
2        Jeff Adrien  PF   27          TOT  53  12   961  143   275  0.520   
3      Arron Afflalo  SG   28          ORL  73  73  2552  464  1011  0.459   
4      Alexis Ajinca   C   25          NOP  56  30   951  136   249  0.546   
5       Cole Aldrich   C   25          NYK  46   2   330   33    61  0.541   
6  LaMarcus Aldridge  PF   28          POR  69  69  2498  652  1423  0.458   
7        Lavoy Allen  PF   24          TOT  65   2  1072  134   300  0.447   
8          Ray Allen  SG   38          MIA  73   9  1936  240   543  0.442   
9         Tony Allen  SG   32          MEM  55  28  1278  204   413  0.494   

      ...      drb  trb  ast  stl  blk  tov   pf   pts     season  season_end  
0     ...      144  216   28   23   26   30  122   171  2013-

The dimensions of the nba dataset:

In [31]:
print("Total number of rows : ", nba.shape[0])
print("Total number of columns : ", nba.shape[1])

Total number of rows :  481
Total number of columns :  31


###  3.1 Mean
Here, we compute the mean of each numeric column and display the first ten.

In [32]:
#Mean of every numeric column (text columns such as player and pos are skipped)
fullDataframeMean = nba.mean(numeric_only=True)
print(round(fullDataframeMean.head(10),3))

age       26.509
g         53.254
gs        25.572
mp      1237.387
fg       192.881
fga      424.464
fg.        0.436
x3p       39.613
x3pa     110.131
x3p.       0.285
dtype: float64


###  3.2 Median
Similarly, we compute the median of each numeric column and display the first ten.

In [33]:
#Median of every numeric column (text columns such as player and pos are skipped)
fullDataframeMedian = nba.median(numeric_only=True)
print(round(fullDataframeMedian.head(10),3))

age       26.000
g         61.000
gs        10.000
mp      1141.000
fg       146.000
fga      332.000
fg.        0.438
x3p       16.000
x3pa      48.000
x3p.       0.331
dtype: float64
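
Comparing the two outputs above can already tell us something about the data. The sketch below copies a few of the printed values by hand (the dictionaries `means` and `medians` are illustrative, not part of the original analysis) and prints the gap between mean and median for each column. A mean far above the median, as for **gs** and **x3p**, suggests that a relatively small number of players with very large values pull the mean up.

```python
# Values copied from the mean and median outputs above
means = {"age": 26.509, "g": 53.254, "gs": 25.572, "mp": 1237.387, "x3p": 39.613}
medians = {"age": 26.000, "g": 61.000, "gs": 10.000, "mp": 1141.000, "x3p": 16.000}

for column in means:
    gap = means[column] - medians[column]
    print(f"{column:>4}: mean - median = {gap:8.3f}")
```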


###  3.3 Mode
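Finally, we compute the mode of each column. A column can have more than one mode, so the result is a table with one row per mode; `NaN` marks the columns that have fewer modes than the number of rows shown.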

In [34]:
#Mode of every column; a column with several modes fills several rows
fullDataFrameMode = nba.mode()
print(round(fullDataFrameMode.head(5),3))

          player  pos   age bref_team_id     g   gs      mp   fg  fga    fg.  \
0     A.J. Price   SG  25.0          TOT  82.0  0.0    15.0  0.0  1.0  0.429   
1   Aaron Brooks  NaN   NaN          NaN   NaN  NaN   392.0  NaN  NaN    NaN   
2     Aaron Gray  NaN   NaN          NaN   NaN  NaN  1416.0  NaN  NaN    NaN   
3  Adonis Thomas  NaN   NaN          NaN   NaN  NaN     NaN  NaN  NaN    NaN   
4  Al Harrington  NaN   NaN          NaN   NaN  NaN     NaN  NaN  NaN    NaN   

      ...      drb  trb  ast  stl  blk  tov   pf  pts     season  season_end  
0     ...      0.0  3.0  0.0  0.0  0.0  0.0  1.0  0.0  2013-2014      2013.0  
1     ...      NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN        NaN         NaN  
2     ...      NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN        NaN         NaN  
3     ...      NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN        NaN         NaN  
4     ...      NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN        NaN         NaN  

[5 rows x 31 columns]


## 4 Test yourself

## Easy

In [35]:
def display(question, answerList):
    print(question)
    IPython.display.display(answerList)

In [36]:
question611 = "4.1.1 What number would you divide by to calculate the mean of 3, 4, 5, and 6?"

answer611 = widgets.RadioButtons(options=['Select best one','3', '4', '4.5', '5', 'None of the above'],
                             value= 'Select best one', description='Choices:')


def checkAnswer(a):
    IPython.display.clear_output(wait=False)
    display(question611, answer611)
    if answer611.value == '4':
        print("Correct Answer!")
    else:
        if answer611.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")

display(question611, answer611)

answer611.observe(checkAnswer, 'value')



4.1.1 What number would you divide by to calculate the mean of 3, 4, 5, and 6?


In [37]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question611">
 <button id = "611"
onclick="toggle('answer611');">Solution</button> 
</div>
<div style="display:none" id="answer611">
From the definition of mean, we know that we have to divide by the total number of elements in the data set.<br/>
In this case, that number is 4. <br/>
So, to compute the mean, we have to divide the sum of the data set by 4. <br/>

</div>

</body>
</html>

In [38]:
question612 = "4.1.2 What is the mean of the following numbers? \n 10, 91, 39, 71, 17,  39, 76, 37, 25"

answer612 = widgets.RadioButtons(options=['Select best one','45', '45.11', '43.67', '46', 'None of the above'],
                             value= 'Select best one', description='Choices:')

def check612(b):
    IPython.display.clear_output(wait=False)
    display(question612, answer612)
    if answer612.value == '45':
        print("Correct Answer!")
    else:
        if answer612.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")

IPython.display.clear_output(wait=False)
display(question612, answer612)
answer612.observe(check612, 'value')

4.1.2 What is the mean of the following numbers? 
 10, 91, 39, 71, 17,  39, 76, 37, 25


In [39]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question612">
 <button id = "612"
onclick="toggle('answer612');">Solution</button> 
</div>
<div style="display:none" id="answer612">
To find the mean, we divide the sum of all the data by the number of data points.<br/>
Therefore, Mean $=  \frac{10 + 91 +  39 + 71 + 17 +  39 + 76 + 37 + 25}{9} = 45$ <br/>


</div>

</body>
</html>

## Medium

In [40]:
question621 = "4.2.1 The mean of four numbers is 71.5. If three of the numbers are 58, 76, and 88, what is the value of the fourth number?"

answer621 = widgets.RadioButtons(options=['Select best one', '62', '73.37', '64', '74',  'None of the above'],
                              value = 'Select best one', description='Choices:')

def check621(c):
    IPython.display.clear_output(wait=False)
    display(question621, answer621)
    if answer621.value == '64':
        print("Correct Answer!")
    else:
        if answer621.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")
        
IPython.display.clear_output(wait=False)
display(question621, answer621)
answer621.observe(check621, 'value')

4.2.1 The mean of four numbers is 71.5. If three of the numbers are 58, 76, and 88, what is the value of the fourth number?


In [41]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question621">
 <button id = "621"
onclick="toggle('answer621');">Solution</button> 
</div>
<div style="display:none" id="answer621">
We are given three numbers and told that the mean of four numbers is 71.5. Let us assume the fourth number is x. <br/>
So, by the definition of mean, we can write <br>
$\frac{58 + 76 + 88 + x}{4} = 71.5 $ <br/>

or, $222 + x = 71.5 \times 4 $ <br/>

or, $x = 286 - 222 $ <br/>

or, $x = 64 $ <br/>

Therefore, the fourth number is 64. <br/>


</div>

</body>
</html>

In [42]:
answer622 = widgets.RadioButtons(options=[  'Select best one','1', '23', '11.5', '13', 
                                          'None of the above'],
                             value = 'Select best one', description='Choices:')

question622 = "4.2.2 Suppose, the front row in your classroom has 23 seats. If you were asked to sit in the seat \n that occupied the median position, in which seat would you have to sit?"


def check622(d):
    IPython.display.clear_output(wait=False)
    display(question622, answer622)
    if answer622.value == 'None of the above':
        print("Correct Answer!")
    else:
        if answer622.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")
        
IPython.display.clear_output(wait=False)
display(question622,answer622 )
answer622.observe(check622, 'value')

4.2.2 Suppose, the front row in your classroom has 23 seats. If you were asked to sit in the seat 
 that occupied the median position, in which seat would you have to sit?


In [43]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question622">
 <button id = "622"
onclick="toggle('answer622');">Solution</button> 
</div>
<div style="display:none" id="answer622">
Let us consider the following arrangement of the seats:
<img src="seatsArrangement.jpg" alt="Empty Classrrom" width="500" height="333">
    <br/> 
    The number of seats is odd (23). To find the median position, we add 1 to <br/>
    the total number of seats and divide by 2: (23 + 1) / 2 = 12. So, the 12th seat is in the median position <br/>
    in the front row, with 11 seats on either side. But the given choices do not include 12.<br/>
    <b> Therefore the correct answer is 'None of the above.' </b>

</div>

</body>
</html>

## Hard 
Sara tried to compute the mean of her 8 test scores. She mistakenly divided the correct sum of all her test scores by 7, which gave 96.

In [44]:
answer631 = widgets.RadioButtons(options=[ 'Select best one','78', '83.5', '100', '84', 'None of the above'],
                             value = 'Select best one', description='Choices:')

question631 = "4.3.1 What is Sara's mean test score? (Source: dummies)"

def check631(e):
    IPython.display.clear_output(wait=False)
    display(question631, answer631)
    if answer631.value == '84':
        print("Correct Answer!")
    else:
        if answer631.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")
        
IPython.display.clear_output(wait=False)
display(question631, answer631)
answer631.observe(check631, 'value')

4.3.1 What is Sara's mean test score? (Source: dummies)


In [45]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question631">
 <button id = "631"
onclick="toggle('answer631');">Solution</button> 
</div>
<div style="display:none" id="answer631">
You know the result Sara got when she divided the sum of her scores by 7, so you can determine that sum. This information will then allow you
to determine her mean over eight tests.<br/>

Apply the average formula to her wrong calculation. <br/>

96 = sum of scores ÷ 7 <br/>

96 × 7 = sum of scores <br/>

672 = sum of scores <br/> 

Now that you know her test score sum, you can figure her true mean average.<br/> 

Mean average = 672 ÷ 8 <br/>

Mean average = 84 <br/>

So, the correct answer is 84. <br/>

</div>

</body>
</html>

In [46]:
answer632 = widgets.RadioButtons(options=[ 'Select best one','32, 34, 35, 36', '32, 35, 36, 41', '32, 34, 40, 44', '32, 32, 38, 38', 'None of the above'],
                             value = 'Select best one', description='Choices:')

question632 = "4.3.2 A set of four numbers that begins with the number 32 is arranged from smallest to largest.\n If the median is 35, which of the following could be the set of numbers?"

def check632(e):
    IPython.display.clear_output(wait=False)
    display(question632, answer632)
    if answer632.value == '32, 32, 38, 38':
        print("Correct Answer!")
    else:
        if answer632.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")
        
IPython.display.clear_output(wait=False)
display(question632, answer632)
answer632.observe(check632, 'value')

4.3.2 A set of four numbers that begins with the number 32 is arranged from smallest to largest.
 If the median is 35, which of the following could be the set of numbers?


In [47]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question632">
 <button id = "632"
onclick="toggle('answer632');">Solution</button> 
</div>
<div style="display:none" id="answer632">
We are given that the median of the four numbers is 35. For four numbers arranged from smallest to largest, <br/> 
the median is the average of the two middle numbers, so the two middle numbers must add up to 70. <br/>
Now we check the choices one by one. In the first choice, the middle numbers are 34 and 35, which add up to 69, <br/>
so it is not the correct answer. <br/>
Similarly, the middle numbers of the second, third, and fourth choices add up to 71, 74, and 70, respectively. <br/>
Therefore, the correct set is: 32, 32, 38, 38.<br/>

</div>

</body>
</html>


A student recorded her scores on weekly math quizzes that were marked out of a possible 10 points. Her scores were as follows: <br> 
8, 5, 8, 5, 7, 6, 7, 7, 5, 7, 5, 5, 6, 6, 9, 8, 9, 7, 9, 9, 6, 8, 6, 6, 7

In [48]:
answer641 = widgets.RadioButtons(options=['Select best one','6', '7', '6 and 7', '8 and 9', 'None of the above'],
                              value='Select best one',description='Choices:')

question641 = "4.4.1 What is/are the mode of her records?"


def checkAnswer641(f):
    IPython.display.clear_output(wait=False)
    display(question641, answer641 )
    if answer641.value == '6 and 7':
        print("Correct Answer!")
    else:
        if answer641.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")

IPython.display.clear_output(wait=False)
display(question641, answer641)
answer641.observe(checkAnswer641, 'value')

4.4.1 What is/are the mode of her records?


In [49]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question641">
 <button id = "641"
onclick="toggle('answer641');">Solution</button> 
</div>
<div style="display:none" id="answer641">
First, let us sort the scores on the weekly math quizzes: <br/>
    $ 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9$ <br/>
From the sorted scores, we can see that: <br/>
5 occurs five times, <br/>
    6 and 7 each occur six times, and 8 and 9 each occur four times.<br/> 
    Therefore, the modes of her scores are 6 and 7. <br/>
   
</div>

</body>
</html>

In [50]:
answer642 = widgets.RadioButtons(options=['Select best one', '1', '2', '4', '19', 'None of the above'],
                             value = 'Select best one',  description='Choices:')

question642 = "4.4.2 The average of five positive integers is less than 20. What is the smallest possible median of this set?"

def check642(g):
    IPython.display.clear_output(wait=False)
    display(question642, answer642)
    if answer642.value == '1':
        print("Correct Answer!")
    else:
        if answer642.value == 'Select best one':
            pass
        else:
            print("Wrong answer! Try again.")

IPython.display.clear_output(wait=False)
display(question642, answer642)
answer642.observe(check642, 'value')

4.4.2 The average of five positive integers is less than 20. What is the smallest possible median of this set?


In [51]:
%%html
<html>
<head>
<script type="text/javascript">
<!--
function toggle(id) {
var e = document.getElementById(id);
if(e.style.display == 'none')
e.style.display = 'block';
else
e.style.display = 'none';
}
//-->
</script>
</head>

<body>
<div id="question642">
 <button id = "642"
onclick="toggle('answer642');">Solution</button> 
</div>
<div style="display:none" id="answer642">
We are given that the numbers are positive integers, which means every number must be
<br/>greater than or equal to 1, so the median cannot be smaller than 1. Let us assume the five numbers are 1, 1, 1, 1, 16. The sum is equal
<br/>to 20, and the average is 20/5 = 4, which is less than 20. All the given conditions are satisfied, and the median is 1.
<br/> <b>Therefore, the smallest possible median is 1. </b> <br/>

</div>

</body>
</html>

![alt text][bottom-banner]

[bottom-banner]: ./callysto-bottom-banner.jpg