In [1]:
import numpy as np
from scipy import stats
from plotly import graph_objs as go
import plotly.offline as py
import plotly.tools as tls
import matplotlib.pylab as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
import ipywidgets as widgets
from IPython.display import display, Math, Latex, clear_output

import plotly.figure_factory as ff
import pandas as pd

# These options allow plots to display.
py.init_notebook_mode(connected=True)
fig = plt.Figure()
ax = fig.gca()
canvas = FigureCanvas(fig)

# Introduction

Jerry Sokoloski is a Canadian actor and former basketball player. At 7'4" (2.24 m) he is also Canada's tallest man! 

https://en.wikipedia.org/wiki/Jerry_Sokoloski

![jerry](jerry1.jpg)
*Jerry Sokoloski walking down the street in Toronto. Photo credit: STAN BEHAL/Toronto Sun.*

Ok, so 7'4" seems really tall, right? What if we want to find out how Jerry - ahem - *measures up* to the rest of the Canadian population of adult men? For that task, we're going to need some more information.

# Background

We can find out just how out*stand*ing Jerry is by comparing his height to thousands of other men's heights. We'll use Python to load some data from the 2016 National Health Interview Survey, which is conducted by the Centers for Disease Control and Prevention (CDC).

In [2]:
# Load in the data.
data = pd.read_csv('samadult.csv')

# We only want men's heights, so we'll remove the women's heights.
data = data[data.SEX != 2]

# We'll only use the variable called 'AHEIGHT' (adult height).
data = data.AHEIGHT

# This dataset uses the codes 96, 97, 98, and 99 for missing entries.
# We'll remove any missing height data.
data = data[data < 96]

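# Note: data.index[-1] is the label of the last remaining row, not the number
# of rows kept; len(data) would give the exact count of heights.
print('Height data loaded with ' + str(data.index[-1]) + ' entries.')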

Height data loaded with 33027 entries.


As the output above suggests, we are going to compare Jerry's height to the heights of thousands of other men. These heights were collected in the United States rather than Canada. We'll assume that adult men in the two countries have similar average heights, so the comparison is still a reasonable one.
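The filtering steps in the loading cell can be tried out on a small table. This is only a sketch: the `sample` DataFrame and its values are made up for illustration and stand in for `samadult.csv`, with the same `SEX` and `AHEIGHT` column names and the same missing-value codes.

```python
import pandas as pd

# Hypothetical stand-in for samadult.csv; every value here is made up.
sample = pd.DataFrame({
    'SEX': [1, 2, 1, 1, 2, 1],
    'AHEIGHT': [70, 64, 96, 68, 99, 73],
})

# Keep only the rows where SEX is not 2 (that is, remove the women).
men = sample[sample.SEX != 2]

# Keep only the height column, then drop the missing-value codes (96-99).
heights = men.AHEIGHT
heights = heights[heights < 96]

print(list(heights))        # the heights that survive the filtering
print(len(heights))         # how many heights are left
print(heights.index[-1])    # the original row label of the last height
```

Filtering keeps the original row labels, so the last label is generally larger than the number of rows that remain. Here `len(heights)` is 3 while the last label is 5.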

Now we'll make a **histogram** of the men's height data. This is done by a process called *binning*. To *bin* the men's heights, we count every height that falls within a predetermined interval. For example, we count every height that is at least 60 inches but less than 62 inches (remember, these are American measurements!). After this, we count every height that is at least 62 inches but less than 64 inches, and so on. We then plot the number of heights in each bin.
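Binning can be sketched on a handful of numbers before we apply it to the real data. The heights below are made up for illustration, and the 2-inch bin width simply follows the example in the text.

```python
import numpy as np

# Hypothetical heights in inches, made up for illustration only.
heights = np.array([61, 63, 63, 64, 66, 67, 67, 67, 68, 70, 71, 73])

# Bin edges every 2 inches: 60, 62, 64, ..., 74.
edges = np.arange(60, 76, 2)

# np.histogram counts how many heights fall into each bin.
# Each bin includes its left edge and excludes its right edge,
# except the last bin, which includes both edges.
counts, edges = np.histogram(heights, bins=edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right} inches: {count}")
```

Plotting these counts as bars, one bar per bin, gives the histogram.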

In [3]:
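# create_distplot expects a list of datasets and a matching list of labels.
hist_data = [data]
group_labels = ["Men's Height"]

# Use 1-inch bins. Note that create_distplot normalizes the histogram by
# default, so the y-axis shows probability density rather than raw counts.
fig = ff.create_distplot(hist_data, group_labels, show_rug=False, bin_size=1)
py.iplot(fig)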

The shape of this histogram may not seem very special, but it is! When a histogram from a dataset has a single large bump in the middle and tapers off to smaller values at either end, we say that the dataset may be **normally** distributed. When a dataset is **normally** distributed, a lot of useful analysis becomes available to us.

We say that the dataset *may be* normally distributed, because there are a few things we need to check first.

Let's leave this example for a few minutes to explore a bit more about normal distributions.

## Normal Distributions

You've probably heard the term 'bell curve' at some point. This is *the* normal distribution! It's an ideal model for any normally distributed dataset. 

Using the NumPy random module, let's draw 5000 random numbers from a normal distribution and look at their histogram.

% Make this interactive.

In [4]:
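# Draw 5000 samples from the standard normal distribution
# (mean 0, standard deviation 1).
norm = np.random.randn(5000)

hist_norm = [norm]
labels = ['Normal data']

# Plot the histogram with bins that are 0.25 wide.
fig = ff.create_distplot(hist_norm, labels, show_rug=False, bin_size=.25)
py.iplot(fig)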

### The Mean of a Distribution

Notice the big bump in the middle of the distribution. This tells us that most of the values are close to 0. We call 0 the **mean** of this distribution.

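The mean is the average of all of the values in the dataset: add the values up, then divide by how many values there are. For a normally distributed dataset, the mean sits in the middle of the distribution, under the peak of the bump. Usually, we use $\bar{x}$ as a symbol for the mean.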
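As a small sketch of that calculation, here is the mean of five made-up heights, computed once by hand and once with `np.mean`. The numbers are for illustration only.

```python
import numpy as np

# Hypothetical heights in inches, made up for illustration only.
values = [66, 68, 70, 71, 75]

# The mean is the sum of the values divided by the number of values.
mean_by_hand = sum(values) / len(values)

# NumPy does the same calculation for us.
mean_numpy = np.mean(values)

print(mean_by_hand, mean_numpy)
```

Both approaches give 70.0 here, since the five values add up to 350.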

### *Exercise*

What do you think the mean is for the above normally distributed dataset?

### The Median and the Mode of a Distribution

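There are two other numbers that can help us describe the distribution of a dataset. The first is the **median**, which is easy to find: it is the **middle** number of the dataset once the values are sorted from smallest to largest. (If there is an even number of values, the median is the average of the two middle values.) The second is the **mode**, which is the value that occurs most often in the dataset.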

% Example finding the median.

% Example finding the mode.
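Here is a minimal sketch of finding both numbers for a short list of made-up heights, taking the mode to be the value that occurs most often. It uses `collections.Counter` from the standard library to count values; this is an illustrative choice, not the approach used on the real dataset.

```python
import numpy as np
from collections import Counter

# Hypothetical heights in inches, made up for illustration only.
heights = [70, 66, 75, 64, 69, 66, 72]

# Median: sort the values and take the middle one.
# With 7 values, the middle one is at position 3 (counting from 0).
sorted_heights = sorted(heights)
median_by_hand = sorted_heights[len(sorted_heights) // 2]

# NumPy gives the same answer (as a float).
median_numpy = np.median(heights)

# Mode: count how often each value occurs and take the most common one.
mode_value, mode_count = Counter(heights).most_common(1)[0]

print('Median:', median_by_hand, median_numpy)
print('Mode:', mode_value, 'which occurs', mode_count, 'times')
```

In this list the median is 69 and the mode is 66, so the two do not have to agree.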

Now we'll find the mean, median, and mode of the male height dataset using Python.

In [5]:
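# Average of all of the heights.
mean = np.mean(data)
# Middle height once the heights are sorted.
median = np.median(data)
# Most frequent height, along with how many times it occurs.
mode = stats.mode(data, axis=None)

print('Mean: ' + str(mean) + '\nMedian: ' + str(median) + '\nMode: ' + str(mode))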

Mean: 69.8717154992
Median: 70.0
Mode: ModeResult(mode=array([70]), count=array([2037]))


So the mean and median are very close in value, and the median and the mode are both 70. The output of the ``stats.mode()`` function tells us that the mode is 70 and that the value 70 occurs 2037 times in our dataset. So, the most frequently occurring height in our dataset is 70", or 5'10".
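Since the dataset is in inches while heights are usually quoted in feet and inches, a small conversion helper makes the comparison easier to read. The function name `inches_to_feet_inches` is hypothetical and introduced here only for illustration.

```python
def inches_to_feet_inches(total_inches):
    """Convert a whole number of inches to a feet-and-inches string."""
    # There are 12 inches in a foot; divmod gives the feet and the remainder.
    feet, inches = divmod(total_inches, 12)
    return f"{feet}'{inches}\""

# The mode of the height dataset, in inches.
print(inches_to_feet_inches(70))

# Jerry's height of 7'4" is 7 * 12 + 4 = 88 inches.
jerry_inches = 7 * 12 + 4
print(inches_to_feet_inches(jerry_inches))
```

Written in inches, Jerry's height is 88, which is 18 inches above the most common height in the dataset.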