## Measures of Central Tendency
https://www.youtube.com/watch?v=kn83BA7cRNM

This notebook walks through the basic measures of central tendency in the Crash Course "#3 Mean, Median, and Mode: Measures of Central Tendency" material using Python 3 in Jupyter notebooks.

In [None]:
# Boilerplate Environment. See notebook: 0_Getting_Started_with_Statistics_and_Python

# Determining environment
def at_google_colab():
    try:
        cfg = get_ipython().config 
        if cfg['IPKernelApp']['kernel_class'] == 'google.colab._kernel.Kernel':
            return True
        else:
            return False
    except NameError:
        return False

# where are we?
location = None
if at_google_colab():
    location = 'at Google'
else:
    location = 'locally'

# print prediction
print('I think you are running {}!'.format(location))

# Import packages
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt    
if at_google_colab():  # call the function; the bare function object is always truthy
    %matplotlib inline
else:
    %matplotlib notebook


## Calculating the Measure of Central Tendency (Mean, Median, Mode)


### 1. Sample Data (Small Coffee in Uptown) and DataFrame Tutorial
The price of a small cup of coffee at several shops in uptown Oakland (from my memory).

In [None]:
# Load the data
d = {'shop': ['Donut Savant','Modern Coffee', 'Tierra Mia Coffee', 'Tertulia Coffee', 'Farley\'s East', 'Starbucks', 'Aroma'], 'price': [1.75, 2.00, 2.75, 3.00, 2.50, 1.95, 1.50]}
coffee = pd.DataFrame(data=d)
# let's look at the data
print(coffee)

What we've created is a Pandas DataFrame: a two-dimensional, labeled table, much like a table in a relational database.

In [None]:
# type will tell you the data structure of a variable
print(type(coffee))

The dataframe has rows with index places, and columns with headers. To get a better sense of scale you can see the size or "shape" of the dataframe.

In [None]:
print("Shape of data (rows, columns): {}".format(coffee.shape))
print("Size (count of all values): {}".format(coffee.size))

Print all the columns in the dataframe.

In [None]:
print(coffee.columns)

Or print the data types. 

In [None]:
print("Data Types: {}".format(coffee.dtypes))

When your data is very big, you may only want to print out a couple rows.

In [None]:
# default for head() is 5 rows, but you can specify any number
print(coffee.head(3))

_**Exercise**: Can you guess the command to print the last three rows?_

In [None]:
#             shop  price
# 4  Farley's East   2.50
# 5      Starbucks   1.95
# 6          Aroma   1.50

If you want only the data, you can print a simple numpy array, with no column headers or row index.

In [None]:
print(type(coffee.values))
print(coffee.values)

Or maybe you only want a single column, in which case you can grab it as a Pandas Series.

In [None]:
print(type(coffee['price']))
print(coffee['price'])
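
If a plain Python list is what you are after, a Series can be converted with `tolist()`. A minimal sketch, using a shortened three-shop copy of the data so it runs on its own (`price_list` is just an illustrative name):

```python
import pandas as pd

# Shortened copy of the coffee data, only for illustration
coffee = pd.DataFrame({'shop': ['Donut Savant', 'Modern Coffee', 'Aroma'],
                       'price': [1.75, 2.00, 1.50]})

# Series -> plain Python list (the index is dropped)
price_list = coffee['price'].tolist()
print(type(price_list))
print(price_list)
```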

It may be handy to know you can sort in a dataframe too.

In [None]:
# this returns a new, sorted dataframe; the original 'coffee' keeps its original order.
sorted_coffee = coffee.sort_values(by=['price'])

Finally, you may want to search for a subset. This is a little unintuitive if you are used to SQL. Let's say we want all the coffee less than 2.50. The first step is to create an evaluation of every row on a condition, returning a Pandas Series.

In [None]:
select_criteria = coffee['price'] < 2.50
print(type(select_criteria))
print(select_criteria)

Then you pass that series into the original dataframe. It will only return the rows that are true in the series you passed.

In [None]:
print(coffee[select_criteria])

For shorthand you can combine it into one statement.

In [None]:
print(coffee[coffee['price'] < 2.50])
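
Conditions can also be combined. The sketch below rebuilds the same coffee data so it runs on its own; `mid_priced` is an illustrative name, not something used elsewhere in this notebook. Note that each comparison needs its own parentheses, and that pandas combines boolean Series with `&` and `|` rather than `and` and `or`.

```python
import pandas as pd

coffee = pd.DataFrame({
    'shop': ['Donut Savant', 'Modern Coffee', 'Tierra Mia Coffee', 'Tertulia Coffee',
             "Farley's East", 'Starbucks', 'Aroma'],
    'price': [1.75, 2.00, 2.75, 3.00, 2.50, 1.95, 1.50],
})

# Keep rows where the price is at least 1.75 AND below 2.50
mid_priced = coffee[(coffee['price'] >= 1.75) & (coffee['price'] < 2.50)]
print(mid_priced)
```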

### 2. Mean (aka Average)
"For a data set, the arithmetic mean, also called the mathematical expectation or average, is the central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values." - Wikipedia
#### Calculating the Mean

In [None]:
print('Mean')

# Easy way: with pandas
mean = coffee['price'].mean()
print('Pandas: {}'.format(mean))
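
To connect the number to the definition above, here is the arithmetic on a plain Python list, cross-checked against the standard library's `statistics.mean`. This is only a sketch: the prices are copied from the coffee data and `manual_mean` is an illustrative name.

```python
import statistics

prices = [1.75, 2.00, 2.75, 3.00, 2.50, 1.95, 1.50]

total = sum(prices)          # sum of the values
count = len(prices)          # number of values
manual_mean = total / count  # sum divided by count

print('Manual:     {:.4f}'.format(manual_mean))
print('statistics: {:.4f}'.format(statistics.mean(prices)))
```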

_**Exercise**: Can you finish a method to calculate the mean from scratch?_

In [None]:
def my_mean(df,col):
    count_of_values = len(df)
    sum_of_column_values = df[col].sum()
    # mean = ?
    
print('By scratch: {}'.format(my_mean(coffee,'price')))

### 3. Median
"The median is the value separating the higher half from the lower half of a data sample (a population or a probability distribution). For a data set, it may be thought of as the "middle" value. For example, in the data set {1, 3, 3, 6, 7, 8, 9}, the median is 6, the fourth largest, and also the fourth smallest, number in the sample." - Wikipedia
#### Calculating the Median

In [None]:
print('Median')

# Easy way: with numpy
median = np.median(np.array(coffee['price']))
print('Numpy: {}'.format(median))

# Even easier way: with pandas
median = coffee['price'].median()
print('Pandas: {}'.format(median))

**Modern Coffee** is the middle value, making its coffee price the **median**.
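
One way to check which shop sits in the middle is to sort by price and look at the center row. The sketch below rebuilds the coffee data so it runs on its own; `by_price`, `middle_position` and `middle_row` are illustrative names. It relies on the row count being odd, so there is exactly one center row.

```python
import pandas as pd

coffee = pd.DataFrame({
    'shop': ['Donut Savant', 'Modern Coffee', 'Tierra Mia Coffee', 'Tertulia Coffee',
             "Farley's East", 'Starbucks', 'Aroma'],
    'price': [1.75, 2.00, 2.75, 3.00, 2.50, 1.95, 1.50],
})

# Sort cheapest to priciest and renumber the rows 0..6
by_price = coffee.sort_values(by='price').reset_index(drop=True)

# With 7 rows, position 3 has three rows below it and three above it
middle_position = len(by_price) // 2
middle_row = by_price.iloc[middle_position]
print(middle_row)
```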

_**Exercise**: Finish a method to calculate the median from scratch._

In [None]:
def my_median(df,col):
    sorted_col_series = df.sort_values(col)[col]
    # return center value when series odd
    # or
    # return mean between the two center values when series even
    
print('By scratch: {}'.format(my_median(coffee,'price')))

### 4. Mode
"The mode of a set of data values is the value that appears most often. It is the value x at which its probability mass function takes its maximum value. In other words, it is the value that is most likely to be sampled." - Wikipedia

#### Calculating the Mode

In [None]:
print('Mode')

# Pandas
mode = coffee['price'].mode()
print('Pandas: {}'.format(mode))

**Oh no!** There is no useful mode in the data, since no value occurs more often than any other, so Pandas returns every value. Let's create a pretty-print mode that returns nothing in that case.

In [None]:
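def pp_mode(df,col):
    # when every value is unique, mode() returns all of them;
    # only report a mode when it is a strict subset of the values
    if len(df[col].mode()) < len(df[col]):
        return df[col].mode()
    return None

mode_pp = pp_mode(coffee,'price')
print('Pretty mode: {}'.format(mode_pp))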
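
To see why there is no useful mode here, it helps to count how often each price occurs. A small sketch with the prices copied into a standalone Series (`prices` and `counts` are illustrative names): every price appears exactly once, so `mode()` hands back all of them.

```python
import pandas as pd

prices = pd.Series([1.75, 2.00, 2.75, 3.00, 2.50, 1.95, 1.50], name='price')

# How many times does each distinct price occur?
counts = prices.value_counts()
print(counts)

print('Highest count: {}'.format(counts.max()))
print('Values returned by mode(): {}'.format(len(prices.mode())))
```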

### 5. Updating data to get an interesting mode
Let's see if we can get some more interesting results with a slightly more interesting dataset. We will add Gastropig's pricey cup of joe to the mix.

In [None]:
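# New dataframe with Gastropig added.
# DataFrame.append was removed in pandas 2.0, so wrap the new row in a one-row dataframe and use pd.concat.
# Side note: ignore_index=True renumbers the rows 0..n-1 instead of keeping each frame's own index.
gp_coffee = pd.concat([coffee, pd.DataFrame([{'shop': 'Gastropig', 'price': 3}])], ignore_index=True)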
print("Coffee+Gastropig")
print(gp_coffee)

In [None]:
# calculate with gastropig dataset
print("Coffee+Gastropig")
gp_mean = gp_coffee['price'].mean()
print('Mean: {}'.format(gp_mean)) 
gp_median = gp_coffee['price'].median()
print('Median: {}'.format(gp_median)) 
gp_mode_pp = pp_mode(gp_coffee,'price')
print('Mode: {}'.format(gp_mode_pp))  

**Yuck!** The mode is still returned as a Pandas Series, which does not print nicely. Let's join the values into a comma-separated string instead.

In [None]:
def pp_mode(df,col):
    if len(df[col].mode()) < len(df[col]):
        return ','.join(map(str, df[col].mode())) 
    
gp_mode_pp = pp_mode(gp_coffee,'price')

In [None]:
# Create a comparison dataframe (table)
d = {'dataset': ['Coffee Dataset','Coffee+Gastropig'], 'mean': [mean,gp_mean], 'median': [median,gp_median], 'mode': [mode_pp,gp_mode_pp]}
comparison = pd.DataFrame(data=d)
print('Comparison of Datasets')
print(comparison)

_**Exercise**: If you wrote your own median method, let's see how it handles an even-length series._

In [None]:
print("It is *{}* that my_median method can handle even length series.".format(my_median(gp_coffee,'price')==gp_coffee['price'].median()))

### 6. Updating data to get a double mode
To make the **mode** even more interesting, let's add a new place. Let's say Blue Bottle opens in uptown and is selling 2.50 coffee.


In [None]:
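# DataFrame.append was removed in pandas 2.0; concatenate a one-row dataframe instead
bb_coffee = pd.concat([gp_coffee, pd.DataFrame([{'shop': 'Blue Bottle', 'price': 2.50}])], ignore_index=True)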
print(bb_coffee)

In [None]:
# calculate with blue bottle dataset
bb_mean = bb_coffee['price'].mean() 
bb_median = bb_coffee['price'].median()
bb_mode_pp = pp_mode(bb_coffee,'price')

# Create a comparison dataframe (table)
d = {'dataset': ['Coffee Dataset','Coffee+Gastropig','C+G+BB'], 'mean': [mean,gp_mean,bb_mean], 'median': [median,gp_median,bb_median], 'mode': [mode_pp,gp_mode_pp,bb_mode_pp]}
comparison = pd.DataFrame(data=d)

print('Another Comparison of Datasets')
print(comparison)

Yay, now we have mode displaying the two most common values.

### 7. Updating data to skew the mean
The mean and median are still pretty close. Let's see how skewed we can make the data. What if Donut Savant decides to push the limit and sell coffee for $7 after the former-Uber building is filled with tech workers?

In [None]:
# we will be updating in place, so let's make a new copy for the new analysis
ds_coffee = bb_coffee.copy()
# Reset the price in any row where shop is 'Donut Savant'
# There may be better ways to do this, but I don't know them.
ds_coffee.loc[ds_coffee['shop']=='Donut Savant', 'price'] = 7.00
print(ds_coffee)

In [None]:
# calculate with upscale Donut Savant dataset
ds_mean = ds_coffee['price'].mean() 
ds_median = ds_coffee['price'].median()
# reuse pp_mode so a missing mode yields None instead of failing inside join()
ds_mode_pp = pp_mode(ds_coffee,'price')

# Create a comparison dataframe (table)
d = {'dataset': ['Coffee Dataset','Coffee+Gastropig','C+G+BB', 'Upscale DS'], 'mean': [mean,gp_mean,bb_mean,ds_mean], 'median': [median,gp_median,bb_median,ds_median], 'mode': [mode_pp,gp_mode_pp,bb_mode_pp,ds_mode_pp]}
comparison = pd.DataFrame(data=d)

print('Comparison with major upcharge')
print(comparison)
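
The effect of that single price change can be isolated with two standalone Series. This is a sketch with the nine prices copied by hand from the C+G+BB dataset; `before`, `after`, `mean_shift` and `median_shift` are illustrative names. The outlier pulls the mean up, while the median stays put because the middle of the sorted values has not changed.

```python
import pandas as pd

# Prices before the change (Donut Savant is the first entry)
before = pd.Series([1.75, 2.00, 2.75, 3.00, 2.50, 1.95, 1.50, 3.00, 2.50])

# Same prices, with Donut Savant raised to 7.00
after = before.copy()
after.iloc[0] = 7.00

mean_shift = after.mean() - before.mean()
median_shift = after.median() - before.median()

print('Mean moved by:   {:.4f}'.format(mean_shift))
print('Median moved by: {:.4f}'.format(median_shift))
```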

_**Exercise**: It's getting pricey. Create a 'cheap' coffee dataframe that only includes shops that sell for $2 or less_

In [None]:
cheap_coffee = ds_coffee # minus the coffee over $2
print('Cheap Coffee')
print(cheap_coffee)

## Done
That's it for now! I think next notebook I'll try to put in more interesting data.