## Intro to Histograms

Welcome to Lab 8! In this lab, we'll be reviewing usage of tables, and learning how to create and analyze histograms.

In [1]:
# Run this cell, but please don't change it.

# These lines import the NumPy and datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

# These lines load the tests.
from client.api.notebook import Notebook
ok = Notebook('lab08.ok')
_ = ok.auth(inline=True)

## Review of Tables

To get started, let's review what we know about working with tables. The table `twins` has one row for each pair of twins, containing the heights of both twins in inches. Run the cell below to see the table.

In [2]:
twins = Table().read_table('twins.csv')
twins

#### Question 1
As a warmup, find the average height of all people in the `twins` table.

In [3]:
avg_height = (np.mean(twins.column("Height1")) + np.mean(twins.column("Height2"))) / 2 #SOLUTION
avg_height

In [4]:
_ = ok.grade('q1')
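
Averaging the two column means, as the Question 1 solution does, matches the mean over all people only because both columns have the same number of entries. Here is a minimal sketch of that equivalence. The heights are made up for illustration and are not taken from `twins.csv`.

```python
import numpy as np

# Hypothetical heights in inches; not taken from twins.csv.
height1 = np.array([60.0, 65.0, 70.0])
height2 = np.array([62.0, 66.0, 73.0])

# Average of the two column means.
mean_of_means = (np.mean(height1) + np.mean(height2)) / 2

# Mean over every individual, pooling both columns into one array.
pooled_mean = np.mean(np.concatenate([height1, height2]))

print(mean_of_means, pooled_mean)
```

If the two columns had different lengths, the pooled mean would be the safer choice.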

#### Question 2
We want to figure out the average difference in height between the two twins in a pair, regardless of which twin is taller. Find that value for the people in our `twins` table.

In [5]:
avg_diff = np.mean(np.abs(twins.column("Height1") - twins.column("Height2"))) #SOLUTION
avg_diff

In [6]:
_ = ok.grade('q2')
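
The absolute value matters here: without it, pairs where twin 1 is taller can cancel out pairs where twin 2 is taller. The sketch below shows the effect on made-up heights (hypothetical values, not from `twins.csv`).

```python
import numpy as np

# Hypothetical heights in inches; not taken from twins.csv.
height1 = np.array([64.0, 70.0, 68.0])
height2 = np.array([66.0, 67.0, 69.0])

signed_diff = height1 - height2  # -2, 3, -1

# Positive and negative gaps cancel each other out.
mean_signed = np.mean(signed_diff)

# Taking the absolute value first measures the size of each gap.
mean_absolute = np.mean(np.abs(signed_diff))

print(mean_signed, mean_absolute)
```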

#### Question 3
What was the height of the twin of the shortest person in the first column? What about the twin of the tallest person?

In [7]:
shortest_twin_height = twins.sort("Height1").column("Height2").item(0) #SOLUTION
tallest_twin_height = twins.sort("Height1", descending=True).column("Height2").item(0) #SOLUTION
print("Height of the twin of the shortest person in column 1: {}".format(shortest_twin_height))
print("Height of the twin of the tallest person in column 1: {}".format(tallest_twin_height))

In [8]:
_ = ok.grade('q3')
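
Sorting the whole table is one way to find the matching twin. A possible alternative, sketched here with plain NumPy arrays and made-up heights, is to find the row position of the minimum or maximum and use it to index the other column. When there are ties, `np.argmin` and `np.argmax` return the first matching position.

```python
import numpy as np

# Hypothetical heights in inches; not taken from twins.csv.
height1 = np.array([66.0, 61.0, 72.0, 69.0])
height2 = np.array([67.0, 63.0, 70.0, 69.5])

# Row positions of the shortest and tallest person in the first column.
shortest_row = np.argmin(height1)
tallest_row = np.argmax(height1)

# The twin's height sits at the same position in the second column.
shortest_twin = height2[shortest_row]
tallest_twin = height2[tallest_row]

print(shortest_twin, tallest_twin)
```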

#### Question 4
We have noted the genders of every pair of twins. There are four possible gender pairs - twin 1 is male and so is twin 2, twin 1 is male and twin 2 is female, twin 1 is female and so is twin 2, or twin 1 is female and twin 2 is male. How many pairs of twins of each possible gender pair do we have? Create a bar chart that visualizes this data.

In [10]:
twins.group_barh("Genders") #SOLUTION

#### Question 5
We have created three new tables: one for pairs where both twins are male, one for pairs where both are female, and one for pairs where the twins are of differing genders. Using this data, make a bar chart of average height difference for each of the three possible gender pairs (both male, both female, or mixed).

*Hint: You will need to create a new table to do this. You may want to re-use code from question 2.*

In [11]:
both_male_table = twins.where(twins.column("Genders")=="Male & Male")
both_female_table = twins.where(twins.column("Genders")=="Female & Female")
mixed_gender_table = twins.where((twins.column("Genders")=="Male & Female")
                                      | (twins.column("Genders")=="Female & Male"))

both_male_diff = np.mean(np.abs(both_male_table.column("Height1") - both_male_table.column("Height2"))) #SOLUTION
both_female_diff = np.mean(np.abs(both_female_table.column("Height1")  - both_female_table.column("Height2"))) #SOLUTION
mixed_gender_diff = np.mean(np.abs(mixed_gender_table.column("Height1") - mixed_gender_table.column("Height2"))) #SOLUTION

avg_height_differences = Table().with_columns("Gender", make_array("Both Male", "Both Female", "Mixed"),  "Average Height Difference", make_array(both_male_diff, both_female_diff, mixed_gender_diff)) #SOLUTION

# Now make the chart
avg_height_differences.barh("Gender") #SOLUTION
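
The three near-identical lines that compute the average differences could also be written as one loop over boolean masks. The sketch below shows that idea with plain NumPy arrays. The data is made up, and the `groups` dictionary is a hypothetical helper, not part of the lab's starter code.

```python
import numpy as np

# Hypothetical data; not taken from twins.csv.
genders = np.array(["Male & Male", "Female & Female", "Male & Female",
                    "Female & Male", "Male & Male"])
height1 = np.array([70.0, 64.0, 71.0, 63.0, 68.0])
height2 = np.array([72.0, 65.0, 66.0, 69.0, 68.0])

# Absolute height difference for every pair.
abs_diff = np.abs(height1 - height2)

# One boolean mask per group; the mixed group combines both orderings.
groups = {
    "Both Male": genders == "Male & Male",
    "Both Female": genders == "Female & Female",
    "Mixed": (genders == "Male & Female") | (genders == "Female & Male"),
}

# Average the differences of the pairs selected by each mask.
avg_diff_by_group = {label: float(np.mean(abs_diff[mask]))
                     for label, mask in groups.items()}

print(avg_diff_by_group)
```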

## Interpreting Histograms

In a further experiment, we measure the heights of the members of 200 families that each included 1 mother, 1 father, and a varying number of adult sons. We make the following histograms, with all bins being two inches wide.

![](three_height_histograms.png)

#### Question 6

For each quantity listed below, either calculate its value using the histograms, or write *Unknown* if it is not possible to calculate the value numerically given the information we have.
1. The **percentage** of mothers that are at least 60 inches but less than 64 inches tall.
2. The **percentage** of fathers that are at least 64 inches but less than 67 inches tall.
3. The **number** of mothers that are at least 60 inches tall.
4. The **number** of sons that are at least 70 inches tall.

**SOLUTION:** 1. 40 percent 2. Unknown 3. 192 mothers 4. Unknown
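
The reasoning behind these answers is the area principle: the percentage in a bin is its bar height (percent per inch) times its width (inches). A range whose endpoint falls inside a bin cannot be computed, and a percentage becomes a count only when the group size is known (there are 200 mothers and 200 fathers, but the number of sons is not given). The sketch below illustrates this under stated assumptions. The bin edges and bar heights are made up, not read from the figure, and `percent_between` is a hypothetical helper.

```python
import numpy as np

# Hypothetical histogram: bin edges in inches, bar heights in percent per inch.
bin_edges = np.array([58, 60, 62, 64, 66, 68])
bar_heights = np.array([3.0, 7.0, 15.0, 17.0, 8.0])

def percent_between(low, high):
    """Percent of values in [low, high), or None if a bound falls inside a bin."""
    if low not in bin_edges or high not in bin_edges:
        return None  # "Unknown": the histogram does not show how a bin splits
    widths = np.diff(bin_edges)
    left_edges = bin_edges[:-1]
    in_range = (left_edges >= low) & (left_edges < high)
    # Area of each selected bar = height * width.
    return float(np.sum(bar_heights[in_range] * widths[in_range]))

# Both bounds are bin edges, so the percentage can be computed.
pct_60_to_64 = percent_between(60, 64)

# 67 falls inside the 66-68 bin, so this range is unknown.
pct_64_to_67 = percent_between(64, 67)

# A percentage converts to a count only when the group size is known.
group_size = 200
count_at_least_60 = percent_between(60, 68) / 100 * group_size

print(pct_60_to_64, pct_64_to_67, count_at_least_60)
```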

#### Question 7
If the fathers' histogram were redrawn, replacing the two bins from 72-74 and 74-76 with one bin from 72-76, what would be the height of that bar?

**SOLUTION:** 4 percent per inch
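
Merging bins keeps the total area the same, so the new bar's height is the combined area divided by the combined width. Here is a minimal sketch of that arithmetic. The two bar heights are made up for illustration and are not the values in the fathers' histogram.

```python
# Hypothetical adjacent bars: widths in inches, heights in percent per inch.
width_a, height_a = 2, 5.0
width_b, height_b = 2, 1.0

# Area of each bar is the percentage of people in that bin.
merged_area = width_a * height_a + width_b * height_b

# The merged bar spreads the same area over the combined width.
merged_height = merged_area / (width_a + width_b)

print(merged_area, merged_height)
```

When the two original bins have equal widths, the merged height is the average of the two original heights.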

#### Question 8
Some of the sons in the dataset are taller than all of the mothers, but it isn't possible to tell exactly how many. We can calculate upper and lower bounds on the value using our histograms. What's the lowest possible value for the percentage of sons who are taller than all of the mothers? The highest possible value?

**SOLUTION:** Lowest possible - 20 percent. Highest possible - 48 percent.
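
One way to reason about such bounds, assuming each bin includes its left edge and excludes its right edge: the tallest mother lies somewhere in the highest non-empty mothers' bin. Sons in bins above it are certainly taller than every mother, sons in that same bin may or may not be, and sons in lower bins are certainly not. The sketch below applies that reasoning to made-up histograms. The bar heights are hypothetical and do not come from the figure.

```python
import numpy as np

# Hypothetical histograms sharing the same 2-inch bins (percent per inch).
bin_edges = np.array([60, 62, 64, 66, 68, 70, 72])
mother_heights = np.array([10.0, 20.0, 15.0, 5.0, 0.0, 0.0])
son_heights = np.array([0.0, 5.0, 10.0, 15.0, 12.0, 8.0])

# Percentage of sons in each bin = bar height * bin width.
son_percents = son_heights * np.diff(bin_edges)

# Index of the highest bin that contains at least one mother.
top_mother_bin = np.max(np.nonzero(mother_heights)[0])

# Lower bound: only sons in bins strictly above the tallest mother's bin.
lower_bound = np.sum(son_percents[top_mother_bin + 1:])

# Upper bound: also count every son sharing the tallest mother's bin.
upper_bound = np.sum(son_percents[top_mother_bin:])

print(lower_bound, upper_bound)
```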

## Creating Histograms in Python

Run the following cell to load height data on 100 adult men and women.

In [13]:
height_data = Table().read_table("Height_Data.csv")
male_heights = height_data.column("Male Height")
female_heights = height_data.column("Female Height")
all_heights = np.hstack([male_heights, female_heights])
height_data

#### Question 9
Create a histogram of the heights of the various men in the sample. Then, do the same for women.

In [14]:
height_data.hist("Male Height") #SOLUTION

In [15]:
height_data.hist("Female Height") #SOLUTION

#### Question 10
Create two overlapping histograms of the heights of everyone in the sample, split between men and women but in a single chart. Then, create a single histogram of the heights of everyone in the sample, both men and women. 

*Hint: For the second part, you will need to use the `all_heights` variable, and make a new table*.

In [16]:
height_data.hist(make_array("Male Height", "Female Height")) #SOLUTION

In [17]:
Table().with_columns("All Heights", all_heights).hist() #SOLUTION
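
To see the numbers behind a histogram like the pooled one above, `np.histogram` can compute the bar heights directly. This is a sketch under stated assumptions: the heights are synthetic draws from normal distributions rather than the contents of `Height_Data.csv`, and the 2-inch bins are chosen for illustration. With `density=True`, NumPy returns proportions per unit, so multiplying by 100 gives percent per inch.

```python
import numpy as np

# Synthetic heights in inches; not the values from Height_Data.csv.
rng = np.random.default_rng(0)
male = rng.normal(70, 3, size=100)
female = rng.normal(64.5, 2.5, size=100)

# Pool both groups into one array, as the lab does for all_heights.
pooled = np.hstack([male, female])

# 2-inch bins from 50 to 90 inches.
bins = np.arange(50, 91, 2)
density, edges = np.histogram(pooled, bins=bins, density=True)

# Convert proportion per inch to percent per inch.
percent_per_inch = 100 * density

# The bar areas (height * width) add up to 100 percent.
total_area = np.sum(percent_per_inch * np.diff(edges))

print(total_area)
```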

Nice work - you've finished Lab 8.

**Please remember to submit your lab!**

In [18]:
# For your convenience, you can run this cell to run all the tests at once!
import os
print("Running all tests...")
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]
print("Finished running all tests.")

In [None]:
# Run this cell to submit your work *after* you have passed all of the test cells.
# It's ok to run this cell multiple times. Only your final submission will be scored.

_ = ok.submit()