## Intro to Histograms

Welcome to Lab 8! In this lab, we'll be reviewing usage of tables and learning how to create and analyze histograms.

In [None]:
# Run this cell, but please don't change it.

# These lines import the NumPy and datascience modules.
import numpy as np
from datascience import *

# These lines do some fancy plotting magic
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', FutureWarning)
from matplotlib import patches
from ipywidgets import interact, interactive, fixed
import ipywidgets as widgets
Table.interactive_plots()

# These lines load the tests.
import otter 
grader = otter.Notebook('tests')

## Review of Tables: Twin Heights

To get started, let's review what we know about working with tables. The table `twins` has a row for each pair of twins, containing the height of the two twins in inches. Run the cell below to see the table.

In [None]:
## Just run this cell.
twins = Table.read_table('twins.csv')
twins

#### Question 1
As a warmup, find the average height of Twin 1 (designated by `Height1`) in the `twins` table and assign it to `avg_height`.

In [None]:
avg_height = ...
avg_height

In [None]:
grader.check('q01')

#### Question 2

Now, we want to figure out the average of the absolute differences in heights among all pairs of twins. Find that value using arrays from the `twins` table, assigning it to the name `avg_diff`. 

*Hint:* What is the order of steps we need to take here?
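
To see why the order of steps matters, here is a minimal sketch on made-up numbers. The arrays `first` and `second` are hypothetical and are not taken from `twins.csv`; the point is only that taking absolute values before averaging gives a different result than averaging first.

```python
import numpy as np

# Hypothetical toy heights (inches) for three pairs; NOT from twins.csv.
first = np.array([60.0, 65.0, 70.0])
second = np.array([62.0, 64.0, 70.0])

# Step 1: element-wise differences (these can be negative).
diffs = first - second

# Step 2: absolute values, so the sign of each difference no longer matters.
abs_diffs = np.abs(diffs)

# Step 3: average the absolute differences.
mean_abs_diff = np.mean(abs_diffs)

# Wrong order for comparison: averaging first lets positive and
# negative differences cancel before the absolute value is taken.
abs_of_mean = abs(np.mean(diffs))

print(mean_abs_diff, abs_of_mean)
```

The two printed values differ, which is why the absolute value has to be applied to each pair before the average is computed.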

In [None]:
avg_diff = ...
avg_diff

In [None]:
grader.check('q02')

#### Question 3

Now, let's learn more about the various sets of twins. `tbl.where("column", value)` is a table method that lets us "filter" a table by a specific condition. By default, `tbl.where("column", value)` keeps only the rows whose entry in the specified column is *equal to* the specified value and returns a new table containing just those rows. 

What was the height of the sibling for the shortest person in `Height1`? What was the height of the sibling for the tallest person in `Height1`? Assign these values to `shortest_sibling` and `tallest_sibling` respectively. 

*Hint:* We recommend that you use arrays and a function to find the shortest height and the tallest height in `Height1`, and assign these values to the variables `shortest_twin1` and `tallest_twin1`. Notice that we can find the corresponding sibling because each row in the table represents one specific pair of twins.
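
The row-alignment idea can be sketched in plain NumPy. This is only an analogue of equality filtering, not the `datascience` API itself, and the arrays `col_a` and `col_b` are hypothetical toy data in which position `i` of both arrays describes the same pair.

```python
import numpy as np

# Hypothetical toy columns; index i in both arrays refers to the same row.
col_a = np.array([66, 61, 72, 68])
col_b = np.array([67, 62, 70, 69])

# Find the extreme values in the first column.
smallest_a = np.min(col_a)
largest_a = np.max(col_a)

# A boolean mask marks the rows where the first column equals the target;
# applying the same mask to the second column picks the matching partner.
partner_of_smallest = col_b[col_a == smallest_a][0]
partner_of_largest = col_b[col_a == largest_a][0]

print(partner_of_smallest, partner_of_largest)
```

Because the mask is built from one column and applied to the other, the lookup only works when both columns keep their rows in the same order.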

In [None]:
shortest_twin1 = ...
tallest_twin1 = ...

shortest_sibling = ...
tallest_sibling = ...

## This code will print your answers in a readable format.
print("Height of the twin of the shortest person in column 1: {} inches".format(shortest_sibling))
print("Height of the twin of the tallest person in column 1: {} inches".format(tallest_sibling))

In [None]:
grader.check('q03')

#### Question 4
As part of the Human Contexts & Ethics approach to data science, we should be aware of the classification systems that are present within our datasets. How do labels play into larger cultural and societal norms? In this case, we will take a binary approach to the idea of gender (male/female) which may not be the best, but it is what we have from the person/group who collected the data.

In this case, the person who recorded the data has information about the genders of every pair of twins. There are four possible gender pairs: twin 1 is male and so is twin 2; twin 1 is male and twin 2 is female; twin 1 is female and so is twin 2; or twin 1 is female and twin 2 is male. How many pairs of twins of each possible gender pair do we have? Create a bar chart using `tbl.barh()` to visualize this data.

*Hint:* What table method do we know that finds the number of times each unique value appears in a table? We'll need to use it to find the number of each type of pair (Male/Male, Female/Male, Female/Female, Male/Female) in the dataset before we generate the chart.
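
As a rough illustration of what "counting each unique value" means, here is a NumPy sketch on made-up labels. It uses `np.unique` rather than a table method, and the `labels` array is hypothetical.

```python
import numpy as np

# Hypothetical category labels; NOT the gender pairs from the twins table.
labels = np.array(["red", "blue", "red", "green", "blue", "red"])

# np.unique returns the sorted distinct values and, with return_counts=True,
# how many times each one appears.
values, counts = np.unique(labels, return_counts=True)

# Pair each distinct value with its count.
tally = dict(zip(values.tolist(), counts.tolist()))
print(tally)
```

A table of distinct values and their counts like this one is exactly the shape of data a bar chart needs: one bar per category, with the count as its length.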

In [None]:
...

#### Question 5
Using `tbl.where()`, create three new tables: one with pairs of twins where both twins are male; one where both are female; and one where the twins are of differing genders. We have provided the `mixed_gender` array to help you. 


After you create the tables, calculate the average of the absolute height differences for each of the three possible gender pairs (both male, both female, or mixed) and save it in an array called `avg_height_differences`. Then, create a 2 column table with the columns "Pairs" and "Average Height Differences" containing the 3 gender pairs and their corresponding average height differences and use the `tbl.barh()` function to generate a bar chart. 


*Hint:* You may need to use a `Table.where()` *predicate* to find the mixed gender rows. Predicates change how `where` compares values and are written with the `are.` prefix. For example, if you wanted to find all twin 1s that are taller than 65 inches, you would write `twins.where("Height1", are.above(65))`. The table below has some predicates you can use.

| Predicate | Description |
| --- | --- |
| `are.above(x)` | Greater than x |
| `are.below(x)` | Less than x |
| `are.between(x, y)` | Greater than or equal to x and less than y |
| `are.contained_in(A)` | Is a substring of A (if A is a string) or an element of A (if A is a list/array) |
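
To make the descriptions in the table concrete, the sketch below writes each condition as a plain NumPy comparison on a hypothetical `heights` array. These are analogues of the predicates, not calls to the `are` module, so treat them as a way to check your understanding of the boundaries (for example, that `are.between` includes the lower end but not the upper end).

```python
import numpy as np

# Hypothetical toy heights in inches.
heights = np.array([60, 63, 65, 68, 70])

# Analogue of are.above(65): strictly greater than 65.
above_65 = heights[heights > 65]

# Analogue of are.below(65): strictly less than 65.
below_65 = heights[heights < 65]

# Analogue of are.between(63, 68): includes 63, excludes 68.
between_63_68 = heights[(heights >= 63) & (heights < 68)]

# Analogue of are.contained_in(...) with a list: keep elements found in it.
in_list = heights[np.isin(heights, [60, 70, 99])]

print(above_65, below_65, between_63_68, in_list)
```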


In [None]:
mixed_gender = make_array("Male & Female", "Female & Male")

both_male_table = ...
both_female_table = ...
mixed_gender_table = ...

both_male_diff = ...
both_female_diff = ...
mixed_gender_diff = ...

avg_height_differences = ...

# Now complete the table and graph it.
# barh() draws the chart and does not return the table, so build the table first.
pair_heights = Table().with_columns("Pairs", ...,
                                    "Average Height Differences", ...)
pair_heights.barh(0)

## Interpreting Histograms: Family Heights

In a further experiment, we measure the heights of the members of **200 families** that each included 1 mother, 1 father, and some varying number of adult sons. We make the following histograms, with all bins being two inches wide.

Notice the y-axis, which has the units "Percent per inch." This kind of histogram is called a **density histogram**: its y-axis shows *density* (the percentage of values per unit on the x-axis) instead of *frequency* (the count). Usually, we prefer to use density because it has a few useful properties:

1) The **area** of a bar is the **percentage of values in the dataset within that range of values** (the bin). Try canceling the units to see why!

2) The total **area** of all the bars in a histogram is equal to 100% (that is, 1), because all of the data is accounted for by the bars.

3) Changing the width of the bins rescales the heights of the bars, so that the area of each bar still equals the percentage of values in its bin.

![](three_height_histograms.png)
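
The three properties can be checked numerically. Below is a minimal sketch on a hypothetical set of ten heights with hand-picked bin edges (none of these numbers come from the family data). It computes density as percent per inch, confirms that the areas add up to 100%, and shows what happens to the bar height when two neighboring bins are merged.

```python
import numpy as np

# Hypothetical toy data: ten heights in inches (NOT the family data).
heights = np.array([60, 61, 62, 63, 63, 64, 65, 65, 66, 69])
edges = np.array([60, 62, 64, 66, 70])  # bin widths: 2, 2, 2, 4

counts, _ = np.histogram(heights, bins=edges)
widths = np.diff(edges)

# Percentage of values in each bin, then density in percent per inch.
percent = 100 * counts / counts.sum()
density = percent / widths

# Property 1: area of a bar (height * width) is the percentage in that bin.
areas = density * widths

# Property 2: all of the areas together account for 100% of the data.
total_area = areas.sum()

# Property 3: merging the first two bins keeps their combined area
# but spreads it over the combined width, giving a new height.
merged_height = (areas[0] + areas[1]) / (widths[0] + widths[1])

print(density, total_area, merged_height)
```

Notice that the merged bar's height is the combined area divided by the combined width, not the sum of the two original heights.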

#### Question 6

For each quantity listed below, either calculate its value using the histograms, or write *Unknown* if it is not possible to calculate the value numerically given the information we have.
1. The **percentage** of mothers that are at least 60 inches but less than 64 inches tall.
2. The **percentage** of fathers that are at least 64 inches but less than 67 inches tall.
3. The **number** of mothers that are at least 60 inches tall.
4. The **number** of sons that are at least 70 inches tall.

*Write your answer here, replacing this text.*

#### Question 7
If the fathers' histogram was redrawn, replacing the two bins from 72-74 and 74-76 with one bin from 72-76, what would be the height of that bar?

*Write your answer here, replacing this text.*

#### Question 8 (Challenge Question)
Remember that, within a bin [A, B), a single data point may have any value that is at least A and strictly less than B. 

Some of the sons in the dataset are taller than all of the mothers, but it isn't possible to tell exactly how many. We can calculate upper and lower bounds for the percentage of the sons using our histograms. What's the lowest possible value for the percentage of sons who are taller than all of the mothers? The highest possible value?

If you're struggling with this question, ask your peers or a TA!

*Write your answer here, replacing this text.*

## Creating & Interpreting Histograms in Python

### The Data

For this section, we will be using data from the Office of Environmental Health Hazard Assessment. This dataset includes population and pollution data for several counties in California. We will explore this more in depth in the homework, but first, we must import the dataset into our notebook and create a table. We will call this table `ces_data`.

In [None]:
ces_data = Table.read_table('ces_data.csv')
ces_data.take(np.arange(40, 50)).show(5)

If we scroll to the right, we can see a column called `Pesticides`. Notice how a lot of the entries are 0s. When dealing with large datasets, we will often encounter **missing** values. These values are simply empty values that appear when we do not have a value available for a particular record. Some common ways these missing values appear in datasets are blanks, NaNs ("Not a Number"), 0s, or 8888s. It is important to clean these meaningless values to carry out analysis of the dataset. Much of data science consists of **cleaning data** which includes **renaming columns**, **reducing the table size to include only the columns of interest**, and **removing missing values.**  

There are various methods of dealing with missing values; for our purposes, it is safe to simply remove these values from our table.
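
As a rough idea of what "removing missing values" can look like, here is a NumPy sketch on a hypothetical `readings` array. It assumes that `NaN` and the sentinel `8888` both mean "missing" in this toy data; which values count as missing in a real dataset depends on how that dataset was recorded, and this is not the code that produced `clean_ces.csv`.

```python
import numpy as np

# Hypothetical readings; in this toy example NaN and 8888 mean "missing".
readings = np.array([1.5, np.nan, 3.0, 8888.0, 2.5])

# Mark every entry that should be treated as missing.
is_missing = np.isnan(readings) | (readings == 8888.0)

# Keep only the entries that are not missing.
cleaned = readings[~is_missing]

# Without cleaning, the NaN makes the mean NaN; after cleaning it is meaningful.
raw_mean = np.mean(readings)
clean_mean = np.mean(cleaned)

print(cleaned, raw_mean, clean_mean)
```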

This has been done for you: simply run the cell below to save a clean version of the data as `clean_ces_data`. From this point forward, we'll use this cleaned CES data to run our analysis.

In [None]:
clean_ces_data = Table.read_table('clean_ces.csv')
clean_ces_data.show(5)

To understand our data, it is important to understand what **each row** represents. Notice our first two columns: `California County` & `census_tract`. Each row represents a specific census tract (some specific geographic region, similar to a neighborhood) for a given county.  

For instance, our first row represents some small region in the county of Fresno. 

#### Question 9

We can see that the first Census tract has a population that is 65.3% Hispanic. For this question, we are interested in the **distribution** of Hispanic and white populations in our dataset. We can visualize this distribution by using a **histogram**, created by the table method `tbl.hist("column")`. 

Create a histogram of the percent populations of the `hispanic` column. Then, do the same for `white`.

*Note:* Again, "bins" are how we decide to split the data in the dataset, with each bin comprising a range of values and each bar representing a different bin. `tbl.hist()` automatically chooses the bins for us, but to better compare, we should use the same set of bins for both datasets. To manually change the bins, you should call `tbl.hist("column", bins = arr)` where `arr` is an array of bin edges. We defined the bins for you in the variable `bins`.
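
To see what a bin array like this contains and why sharing it matters, here is a small NumPy sketch. The arrays `group_a` and `group_b` are hypothetical percentages, not columns from the CES data, and `np.histogram` is used here only to count values per bin.

```python
import numpy as np

# np.arange excludes its stop value, so stopping at 105 keeps 100 as the last edge.
bins = np.arange(0, 105, 10)  # edges 0, 10, 20, ..., 100

# Hypothetical percentages for two groups (NOT from the CES data).
group_a = np.array([5, 12, 18, 55, 95])
group_b = np.array([40, 45, 52, 58, 100])

# Using the same edges for both groups makes the counts directly comparable.
counts_a, _ = np.histogram(group_a, bins=bins)
counts_b, _ = np.histogram(group_b, bins=bins)

print(bins)
print(counts_a)
print(counts_b)
```

Eleven edges define ten bins. In `np.histogram`, every bin includes its left edge and excludes its right edge, except the last bin, which also includes its right edge; that is why the value 100 is counted in the final bin.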

In [None]:
# Type your code here
bins = np.arange(0, 105, 10)
hispanic_population_hist = ...

In [None]:
# Type your code here
white_population_hist = ...

In [None]:
grader.check('q09')

#### Question 11

To better compare the two distributions together, we can create an overlapping histogram. To do so, use `tbl.hist("column")`, but instead of using only 1 column name, you can replace it with an array of strings containing the column names you would like to overlap. 

Now, create an overlapping histogram that shows the distributions of both the `hispanic` and `white` columns in a single graph. Use the same bins.

In [None]:
...

#### Question 12

To practice our interpretation of these histograms, describe and compare the shape of the distributions you see (are they left skewed? right skewed? symmetric?). Is the average percentage of Hispanic populations living in each Census tract greater than, less than, or the same as the average percentage for white populations? Write your answers in the cell below and discuss with a classmate or TA.

*Type your answer here, replacing this text.* 

## Submission
You're done with this lab!

To submit this notebook, please download your notebook as a .ipynb file and submit to Gradescope. You can do so by navigating to the toolbar at the top of this page, clicking File > Download as... > Notebook (.ipynb). Then, go to our class's Gradescope page [here](https://www.gradescope.com/courses/136698) and upload your file under "Lab 8." 

To check your work for all autograded questions, run the cell below. 

It's fine to submit multiple times, but we will only grade the final notebook you submit for each assignment. Make sure you pass all tests to receive credit.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
grader.check_all()