## Intro to Histograms

Welcome to Lab 8! In this lab, we'll be reviewing usage of tables, and learning how to create and analyze histograms.

In [None]:
# Run this cell, but please don't change it.

# These lines import the Numpy and Datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets

# These lines load the tests.
import otter 
grader = otter.Notebook('lab08-tests')

## Review of Tables

To get started, let's review what we know about working with tables. The table `twins` has a row for each pair of twins, containing the height of the two twins in inches. Run the cell below to see the table.

In [None]:
twins = Table.read_table('twins.csv')
twins

#### Question 1
As a warmup, find the average height of all people in the `twins` table.

In [None]:
avg_height = ...
avg_height

In [None]:
grader.check('q01')

#### Question 2
We want to figure out the average difference in heights among twins. Find that value for the people in our `twins` table.

In [None]:
avg_diff = ...
avg_diff

In [None]:
grader.check('q02')

#### Question 3
What was the height of the twin of the shortest person in the first column? What about the twin of the tallest person?

In [None]:
shortest_twin_height = ...
tallest_twin_height = ...
print("Height of the twin of the shortest person in column 1: {}".format(shortest_twin_height))
print("Height of the twin of the tallest person in column 1: {}".format(tallest_twin_height))

In [None]:
grader.check('q03')

#### Question 4
We have noted the genders of every pair of twins. There are four possible gender pairs: twin 1 is male and so is twin 2, twin 1 is male and twin 2 is female, twin 1 is female and so is twin 2, or twin 1 is female and twin 2 is male. How many pairs of twins of each possible gender pair do we have? Create a bar chart that visualizes this data.

In [None]:
...

#### Question 5
We have created three new tables: one for pairs of twins where both twins are male, one where both are female, and one where the two are of differing genders. Using this data, make a bar chart of the average height difference for each of the three possible gender pairs (both male, both female, or mixed).

*Hint: You will need to create a new table to do this. You may want to re-use code from question 2.*

In [None]:
both_male_table = twins.where(twins.column("Genders")=="Male & Male")
both_female_table = twins.where(twins.column("Genders")=="Female & Female")
mixed_gender_table = twins.where((twins.column("Genders")=="Male & Female")
                                      +(twins.column("Genders")=="Female & Male"))

both_male_diff = ...
both_female_diff = ...
mixed_gender_diff = ...

avg_height_differences = ...

# Now make the chart
...
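
The `mixed_gender_table` line above joins two boolean arrays with `+`. As a minimal sketch of why that works, NumPy treats addition of boolean arrays as an element-wise logical OR, so the result matches the more explicit `|` operator. The labels below are hypothetical and only illustrate the behavior; they are not taken from `twins.csv`.

```python
import numpy as np

# Hypothetical gender labels, invented for illustration only.
genders = np.array(["Male & Male", "Male & Female",
                    "Female & Male", "Female & Female"])

is_mf = genders == "Male & Female"
is_fm = genders == "Female & Male"

# Adding two boolean arrays gives an element-wise logical OR ...
mixed_plus = is_mf + is_fm
# ... which is the same mask the explicit `|` operator produces.
mixed_or = is_mf | is_fm

print(mixed_plus)
print(np.array_equal(mixed_plus, mixed_or))
```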

## Interpreting Histograms

In a further experiment, we measure the heights of the members of 200 families, each of which includes one mother, one father, and a varying number of adult sons. We make the following histograms, in which every bin is two inches wide.

![](three_height_histograms.png)
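
When reading histograms like these, it can help to recall that the area of a bar, not its height, gives the share of the data in that bin. The sketch below assumes the vertical axis is in percent per inch; the bar heights are hypothetical numbers invented for illustration and are not read from the figure.

```python
# Hypothetical bar heights (percent per inch) for two-inch-wide bins.
# These values are invented; they do not come from the lab's figure.
bin_width = 2
heights = {(66, 68): 5.0, (68, 70): 12.5, (70, 72): 17.5}

# Area of a bar = height * width = percent of the data in that bin.
percent_in_bin = {b: h * bin_width for b, h in heights.items()}

# Percent between 66 and 70 inches: add the areas of the bins in that range.
percent_66_to_70 = percent_in_bin[(66, 68)] + percent_in_bin[(68, 70)]

# Merging two adjacent bins keeps the total area but spreads it
# over the combined width, so the new height is area / new width.
merged_height = percent_66_to_70 / (2 * bin_width)

print(percent_66_to_70, merged_height)
```

Because the two merged bins have equal width, the merged height in this sketch is also the plain average of the two original heights.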

#### Question 6

For each quantity listed below, either calculate its value using the histograms, or write *Unknown* if it is not possible to calculate the value numerically given the information we have.
1. The **percentage** of mothers that are at least 60 inches but less than 64 inches tall.
2. The **percentage** of fathers that are at least 64 inches but less than 67 inches tall.
3. The **number** of mothers that are at least 60 inches tall.
4. The **number** of sons that are at least 70 inches tall.

*Write your answer here, replacing this text.*

#### Question 7
If the fathers' histogram were redrawn, replacing the two bins from 72-74 and 74-76 with one bin from 72-76, what would be the height of that bar?

*Write your answer here, replacing this text.*

#### Question 8
Some of the sons in the dataset are taller than all of the mothers, but it isn't possible to tell exactly how many. We can calculate upper and lower bounds on the value using our histograms. What is the lowest possible value for the percentage of sons who are taller than all of the mothers? What is the highest possible value?

*Write your answer here, replacing this text.*

## Creating & Interpreting Histograms in Python

### The Data

For this section, we will be using data from the Office of Environmental Health Hazard Assessment. This dataset includes population and pollution data for several counties in California. First, we must import the dataset into our notebook and create a table. We will call this table `ces_data`.

In [None]:
ces_data = Table.read_table('ces_data.csv')
ces_data.take(np.arange(40, 50)).show(5)

If we scroll to the right, we can see a column called `Pesticides`. Notice that many of its entries are 0. When dealing with large datasets, we will often encounter **missing** values: empty entries that appear when no value is available for a particular record. It is important to clean up these meaningless values before analyzing the dataset. Much of data science consists of **cleaning data**, which includes **renaming columns**, **reducing the table to only the columns of interest**, and **removing missing values**.

There are various methods of dealing with missing values -- for our purposes, it is safe to simply remove these values from our table.
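
As a rough sketch of what "removing missing values" can look like, the example below drops missing entries from a NumPy array with a boolean mask. It assumes missing entries are encoded as `np.nan`; how the real CES file encodes them is not specified here, and the readings are hypothetical.

```python
import numpy as np

# Hypothetical pesticide readings; np.nan marks a missing measurement.
pesticides = np.array([0.0, 2.5, np.nan, 7.1, np.nan, 0.3])

# Boolean mask that is True wherever a value is present.
is_present = ~np.isnan(pesticides)

# Keep only the entries with a recorded value.
cleaned = pesticides[is_present]

print(cleaned)
```

Note that a genuine reading of 0.0 survives this filter; only the entries marked as missing are dropped.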

This has been done for you: simply run the cell below to save a clean version of the data as `clean_ces_data`. From this point forward, we'll use this cleaned CES data to run our analysis.

In [None]:
clean_ces_data = Table.read_table('cleaned_data.csv')
clean_ces_data.show(5)

To understand our data, it is important to understand what **each row** represents. Notice our first two columns: `California County` & `census_tract`. Each row represents a specific census tract (some specific geographic region) for a given county.  

For instance, our first row represents some small region in the county of Fresno. 

#### Question 9

We can see that the first census tract has a population that is 65.3% Hispanic. For this question, we are interested in the **distribution** of the Hispanic and white populations in our dataset. We can visualize this distribution using a **histogram**.

Create a histogram of the `hispanic` column. Then, do the same for `white`.
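
To illustrate what a histogram computes before it is drawn, here is a small sketch that bins a handful of values and converts the counts into bar heights. The percentages and bin edges are hypothetical, chosen only to make the arithmetic easy to follow; this is not the lab's solution.

```python
import numpy as np

# Hypothetical percentages for eight census tracts, invented for illustration.
percent_demo = np.array([3, 12, 18, 25, 31, 47, 52, 88])

# Four equal-width bins covering 0 to 100.
bin_edges = np.array([0, 25, 50, 75, 100])
counts, _ = np.histogram(percent_demo, bins=bin_edges)

# Convert counts to the percent of tracts in each bin, then to bar
# heights (percent per unit) by dividing by each bin's width.
percent_of_tracts = 100 * counts / counts.sum()
bar_heights = percent_of_tracts / np.diff(bin_edges)

print(counts)
print(bar_heights)
```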

In [None]:
# Type your code here
hispanic_population_hist = ...

In [None]:
# Type your code here
white_population_hist = ...

In [None]:
grader.check('q09')

#### Question 10

To practice interpreting these histograms, assign `avg_pop_hispanic` to the **average** percentage of Hispanic citizens living in each census tract. Do the same for white citizens, and assign the result to `avg_pop_white`.

*Hint: The tallest bin in the histogram shows the most common range of values, which is not necessarily the average. Compute the average directly from the column.*
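
A caution when reading an average off a histogram: for skewed data, the tallest bin and the mean can sit far apart. The sketch below uses hypothetical right-skewed values, invented for illustration, to compare the midpoint of the tallest bin with the mean.

```python
import numpy as np

# Hypothetical right-skewed percentages, invented for illustration.
values = np.array([5, 6, 7, 8, 9, 12, 15, 40, 70, 90])

# Count how many values fall in each 10-point-wide bin from 0 to 100.
counts, edges = np.histogram(values, bins=np.arange(0, 101, 10))

# Locate the tallest bin and compute its midpoint.
tallest = np.argmax(counts)
tallest_midpoint = (edges[tallest] + edges[tallest + 1]) / 2

# The mean is computed from the values themselves, not from the bins.
mean_value = np.mean(values)

print(tallest_midpoint, mean_value)
```

Here the tallest bin covers 0 to 10, while the mean is pulled well above it by the few large values.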

In [None]:
avg_pop_hispanic = ...
avg_pop_hispanic

In [None]:
avg_pop_white = ...
avg_pop_white

In [None]:
grader.check('q10')

#### Question 11
Now, create two **overlapping** histograms in a single graph, one showing the distribution of the `hispanic` column and one showing the distribution of the `white` column.

In [None]:
...

Now, create a single histogram of the combined distribution of the `hispanic` and `white` values in the sample.

*Hint: For the second part, you will need to use the `hispanic_and_white` variable that has already been provided for you.*

In [None]:
# Pull each column out of the table as a NumPy array.
white = clean_ces_data.column('white')
hispanic = clean_ces_data.column('hispanic')

hispanic_and_white = np.hstack([white, hispanic])
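
For reference, `np.hstack` joins one-dimensional arrays end to end into a single longer array. A minimal sketch with hypothetical values (the `_demo` names are invented for this example):

```python
import numpy as np

# Hypothetical values standing in for the two columns.
white_demo = np.array([10.0, 20.0, 30.0])
hispanic_demo = np.array([65.3, 40.0])

# The result holds every value from the first array, then every value
# from the second; it no longer records which column a value came from.
combined = np.hstack([white_demo, hispanic_demo])

print(combined.shape)
```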

In [None]:
# Assign this to a new table

hispanic_and_white_data = ...
hispanic_and_white_data.show(2)

In [None]:
# Create a histogram here

...

In [None]:
grader.check('q11')

Nice work! You've finished Lab 8. Remember to submit!