# Homework 4: Tables and Functions
The tools that we've learned over the last week (for example, function definitions, histograms, and the table methods `where`, `apply`, and `group`) are enough to analyze a wide range of questions and datasets.  

Please complete this notebook by filling in the cells provided. Before you begin, execute the following cell to load the provided tests.

In [None]:
# Don't change this cell; just run it. 
import datascience 
import pandas as pd 
import numpy as np
from datascience import Table
from datascience import *
import matplotlib

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import otter
grader = otter.Notebook()

from ipywidgets import interact

Table.interactive_plots()

Reading:
- Textbook chapters [6](https://data-8r.gitbooks.io/textbook/chapters/06/tables.html) and [7](https://data-8r.gitbooks.io/textbook/chapters/07/functions-and-tables.html)

Deadline:

This assignment is due **Wednesday, August 5 at 11:59 PM**. You will receive an early submission bonus point if you turn in your final submission by **Tuesday, August 4 at 11:59 PM**. Late work will not be accepted unless you have made special arrangements with your (u)GSI or the instructor.

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged. 

You should start early so that you have time to get help if you're stuck. Drop-in office hours will be held at various times in the week; check bCourses for the latest schedule.

## 1. CalEnviroScreen: Race & the Environment

In this problem, we will use the CalEnviroScreen dataset that you may have seen in Chapter 3 of the textbook. From their [website](https://oehha.ca.gov/calenviroscreen/about-calenviroscreen): "CalEnviroScreen is a mapping tool that helps identify California communities that are most affected by many sources of pollution, and where people are often especially vulnerable to pollution's effects."

In general, environmental harms are distributed unequally across communities. Communities of color tend to receive a disproportionate share of the negative consequences of environmental actions and policies. Issues such as increased pollution or dumping often lead to worse health outcomes for those who are exposed. This is the basis of the environmental justice movement; environmental justice is defined as "the fair treatment and meaningful involvement of all people, regardless of race, color, national origin, or income, with respect to the development, implementation, and enforcement of environmental laws, regulations and policies."

The following table, `ces_data`, documents the percentage of various ethnicities, health outcomes, and exposure to negative environmental effects in census tracts (geographic tracts, similar to neighborhoods, for taking the census) in California. 

In [None]:
## Run this cell.
ces_data = Table.read_table('cal_enviroscreen.csv')
ces_data.show(5)

We will start by creating some tables so that we can compare histograms. A census tract's CES score (the column `ces_pollution_score`) represents that tract's degree of burden from factors such as socioeconomic conditions, sensitive populations, and exposure to pollution. A higher score generally means that the community is more burdened.

**Question 1.** We will spend most of our analysis looking at some of the most overburdened communities - those that rank in the **top 5%** of CES scores, called `ces_pollution_score`, in the state. Set `ces_percentile` to the score that is the 95th percentile in the dataset. 

**Note:** The Xth percentile is a value in the data that is *greater than or equal to* X% of the values after sorting them from lowest to highest. For example, the 50th percentile is the median: about half of the values are at or below it, and about half are above it. We can calculate a percentile with the `percentile(X, array)` function, which takes 2 arguments: an integer (X) for the percentile you are looking for and an array of the data to analyze.
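As a minimal sketch of the definition in the note, the helper below sorts a small made-up array and picks the smallest value that is at least as large as X% of the data. The name `nearest_rank_percentile` is our own illustration and is not part of the `datascience` library; we assume it mirrors how `percentile` behaves, so check the textbook if your results differ.

```python
import math

def nearest_rank_percentile(x, values):
    """Return the smallest value that is >= x% of the values (hypothetical helper)."""
    ordered = sorted(values)
    # 1-based position of the element that covers x% of the data.
    rank = max(1, math.ceil(x * len(ordered) / 100))
    return ordered[rank - 1]

toy_scores = [1, 7, 3, 9, 5]  # made-up data, not from ces_data
median_score = nearest_rank_percentile(50, toy_scores)
high_score = nearest_rank_percentile(95, toy_scores)
print(median_score, high_score)
```

With only five values, the 95th percentile is simply the largest value; with thousands of census tracts it will sit well below the maximum.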

In [None]:
ces_percentile = ...
ces_percentile

In [None]:
grader.check('q1_1')

**Question 2.** Given your percentile in Question 1, create a new table called `overburdened` that contains rows for all of the census tracts that have a CES pollution score **higher** than `ces_percentile`. We will use this table in our analysis to compare the ethnic makeup of each of the groups present in these communities.

In [None]:
overburdened = ...
overburdened

In [None]:
grader.check('q1_2')

In [None]:
## This cell will graph histograms for each of the ethnic groups in the overburdened table. 
# Just run this cell, but click on the dropdown menu to view the various communities.

def communities_hist(ethnicity):
    overburdened.hist(ethnicity, bins = np.arange(0, 100, 5))

interact(communities_hist, ethnicity=["african_american", "hispanic", "native_american", "asian_american", "white", "other"])

**Question 3.** We will use the histograms above to calculate the following values. Notice that the bins are determined by `np.arange(0, 100, 5)`, and that 377 census tracts are represented in `overburdened`.

For each quantity listed below, either calculate its value using the histograms, or write *Unknown* if it is not possible to calculate the value numerically given the information we have. You may estimate each bar's height to the nearest tenth; please **show your work** in the Markdown cell below.

1. The **percentage** of census tracts that have Hispanic populations that make up 50-60% of the community.
2. The **percentage** of census tracts that have African American populations that make up 3-5% of the community.
3. The **number** of census tracts that have white populations that make up at least 40% of the community.
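
As a rough sketch of the arithmetic involved, we assume here that the histogram's vertical axis is in percent per unit, so a bar's area (height times width) is the percentage of tracts in that bin. The bar height below is made up for illustration and is not read from the actual histograms.

```python
bin_width = 5          # each bin from np.arange(0, 100, 5) is 5 units wide
total_tracts = 377     # number of tracts in overburdened, as stated in the question
example_height = 2.4   # hypothetical bar height, in percent per unit

# Area of the bar = percentage of tracts that fall in the bin.
percent_in_bin = example_height * bin_width
# Convert the percentage into an approximate number of tracts.
tracts_in_bin = percent_in_bin / 100 * total_tracts

print(round(percent_in_bin, 1), round(tracts_in_bin))
```

Note that this only works for ranges that line up with the bin edges; a histogram does not tell us how values are spread within a single bin.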

*Write your answer here, replacing this text.*
1. ...
2. ...
3. ...

**Question 4.** If the histogram for Hispanic populations was redrawn with bins of width 10, what would be the height of the bar for the bin from 60-70%?
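
One way to reason about merging bins, sketched with made-up heights: the percentage in the wider bin is the sum of the areas of the narrower bars it covers, and its height is that percentage divided by the new width. This again assumes that the vertical axis is in percent per unit.

```python
# Hypothetical heights (percent per unit) of two adjacent bins of width 5.
left_height, right_height = 1.0, 3.0
old_width, new_width = 5, 10

# The areas add up, because each area is a percentage of the tracts.
percent_in_wide_bin = left_height * old_width + right_height * old_width
# Height of the merged bar = its area divided by its width.
wide_height = percent_in_wide_bin / new_width
print(percent_in_wide_bin, wide_height)
```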

*Write your answer here, replacing this text.*

**Question 5.** The [Port of Oakland](https://en.wikipedia.org/wiki/Port_of_Oakland) is located in West Oakland and is one of the busiest ports in the United States. One of the census tracts located nearby is designated with the Census Tract ID (`census_tract`) 6001402200. Using the `ces_data` table and `where`, find the entry for the Port of Oakland.

Then, find the ozone values and toxic release values as **floats** for the Port of Oakland and assign them to `port_ozone` and `port_tox_release`. Do not round the values. 

In [None]:
port_of_oakland = ...
port_ozone = ...
port_tox_release = ... 
port_of_oakland

In [None]:
grader.check('q1_5')

**Question 6.** The Port, as a major shipping center, contributes to gas emissions (via diesel and greenhouse gas use in maritime operations) that may negatively affect air quality in the nearby areas. We will compare the asthma rates for the Port of Oakland and similar communities to the rest of the state in terms of **ozone and toxic release**, and we will find the similar tracts by using the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance). Distance in this case is not necessarily geographic distance, but a measure of **how different or similar** two tracts are.

You may have used this in the past as the Pythagorean theorem to measure the hypotenuse of a triangle. We will use it to find similar census tracts for our 2 traits - in this case, a smaller "distance" means that the two tracts are more similar.

The equation is as follows (similar to a^2 + b^2 = c^2):

((Tract Ozone - Port Ozone) `**` 2 + (Tract tox_release - Port tox_release) `**` 2) `**` (1/2)
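
To see how the formula behaves with arrays, here is a small sketch using made-up numbers (none of these values come from `ces_data`, and the variable names are our own). Array arithmetic applies the formula to every tract at once.

```python
import numpy as np

# Made-up measurements of two traits for three tracts.
tract_x = np.array([3.0, 0.0, 6.0])
tract_y = np.array([4.0, 0.0, 8.0])
# Made-up reference point to compare every tract against.
ref_x, ref_y = 3.0, 4.0

# Same shape as the equation above: square the differences, add, take the square root.
toy_distances = ((tract_x - ref_x) ** 2 + (tract_y - ref_y) ** 2) ** (1/2)
print(toy_distances)
```

The first tract is identical to the reference point, so its distance is 0; the other two are equally "far" from it even though they differ in opposite directions.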


First, use `port_ozone`, `port_tox_release`, and `ces_data` to find the distances for ozone and toxic release rates between the Port of Oakland and the rest of the census tracts in California (`ces_data`) and save it to the variable `distances`. Then, create a table called `ces_with_distances` that is a copy of `ces_data`, but with an extra column called `distance_from_oak` containing the `distances` for each tract. 

**Note:** In a better analysis, we would need to standardize the units so we can better compare these 2 traits. Ozone is in parts per million (ppm) and Toxic Releases from facilities is a weighted measurement of various concentrations of chemicals released into the air. There is a way to do this that you will learn in Data 8 or another statistics class, if you decide to take one!

In [None]:
distances = ...
ces_with_distances = ...
ces_with_distances

In [None]:
grader.check('q1_6')

**Question 7.** Now that we have `ces_with_distances`, sort the table by `distance_from_oak` to find the tracts that are most similar to the Port of Oakland in terms of ozone and toxic release. Take the first 757 rows (10% of the table) and assign it to the variable `port_tracts`.

*Hint:* Sort it so that the most similar (i.e. smallest distances) tracts are first in the table. 

In [None]:
port_tracts = ...
port_tracts

In [None]:
grader.check('q1_7')

**Question 8.** Finally, create 2 histograms showing the distribution of `asthma` rates (measured in age-adjusted visits to emergency departments for asthma); one histogram will graph `ces_data` and the other will graph `port_tracts`. Use appropriate bins and make sure that both histograms have the **same** bins to compare. 
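
To illustrate why shared bins matter, the sketch below counts two made-up groups of values against one set of bin edges using `np.histogram`. The data and variable names are hypothetical.

```python
import numpy as np

# Made-up values: a full group and a subset of it.
all_values = np.array([5, 15, 25, 35, 45, 55])
subset_values = np.array([35, 45, 55])

# One set of bin edges, reused for both groups: 0-20, 20-40, 40-60.
shared_bins = np.arange(0, 61, 20)

all_counts, _ = np.histogram(all_values, bins=shared_bins)
subset_counts, _ = np.histogram(subset_values, bins=shared_bins)
print(all_counts, subset_counts)
```

Because both groups are counted against the same edges, the bars line up and can be compared bin by bin. The same idea applies to the `hist` table method: pass the identical `bins` array to both calls.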

Do areas similar to the Port of Oakland in terms of ozone and toxic release emissions have higher rates of emergency department visits for asthma? Describe what you see in the Markdown cell below.

In [None]:
### Type your code for both histograms here so they appear next to each other. 


*Write your answer here, replacing this text.*


**Question 9.** In the following cell, we create a table called `random_ces_sample` that contains 1000 census tracts selected at random without replacement. Using `random_ces_sample`, create a scatter plot that compares the percent of people in `poverty` and the `ces_pollution_score` for each tract.

Describe what you see and any associations or relationships between the two variables. If you do notice a pattern or relationship, what is a possible cause? Write your findings in the Markdown cell below. 

In [None]:
random_ces_sample = ces_data.sample(1000, with_replacement = False)
...

*Write your answer here, replacing this text.*



## 2. Working with Text in Python: Fixing Misspellings

Computation is not limited to quantitative or numerical data. Oftentimes, social scientists (such as those in linguistics, sociology, legal studies, and communications) need to use Python or other languages to analyze text. This may be in the form of surveys, social media posts, court proceedings, or in this case, essays. In this section, we will do some basic text processing by using strings and functions to make some modifications to text. If you're interested in learning more about advanced text analysis that is used in many fields today, check out [Natural Language Processing (also known as NLP)](https://becominghuman.ai/a-simple-introduction-to-natural-language-processing-ea66a1747b32).

For this example, imagine that you are editing a collection of your essays for publication, and you discover that you've been misspelling the word "misspell" as "mispel" your whole life.  You decide to use Python to correct this embarrassing mistake.

**Question 1.** Write a function called `correct_mispel`.  It should take a single string as its argument, and return the same string, but with all instances of "mispel" replaced with "misspell".

*Hint:* Use the string method `.replace("to_replace", "replace_with")`.  It takes two arguments: the piece of text you want to find, and the piece of text you want to replace it with.

*Note:* When you create a function, it is good practice to make clear three things: what the function does, the purpose and type of each argument, and what the function returns. You can do this in the documentation or within the body of the function, and it allows others to use and understand your code.
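
As an illustration of both the hint and the note, here is a sketch of a documented function that uses `.replace` on a different word. The function name `americanize` and the example sentence are made up for this illustration.

```python
def americanize(text):
    """Replace the British spelling "colour" with "color".

    text: a string that may contain "colour" (possibly inside longer words).
    Returns a new string; the original string is left unchanged.
    """
    return text.replace("colour", "color")

example = americanize("The colour of colourful things")
print(example)
```

Notice that `.replace` also changes longer words that contain the target text, such as "colourful".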

In [None]:
# Write a function called correct_mispel in this cell.
def ...(...):
    """..."""
    ...
    return ...

correct_mispel("This sentence is mispeled.")

In [None]:
grader.check('q2_1')

**Question 2.** Now you need to load your data into Python.  The file `essay_filenames.csv` is a table that contains the *filenames* of your essays.  Each filename is a string that's the name of an essay.  Load it into a table called `essay_filenames`.

In [None]:
essay_filenames = ...
essay_filenames

In [None]:
grader.check('q2_2')

**Question 3.** Below, we've provided a function that takes as its argument the *filename* of an essay and returns the text in that file (as one long string).  Using `apply`, create a table called `essays` with two columns:

1. "Name": The filename of the essay
2. "Text": The whole text of the essay

(The essays are actually books from [Project Gutenberg](https://www.gutenberg.org), modified to misspell "misspell".  Attributions and copyright information are contained in the text of each essay.)

In [None]:
## Just run this cell. Do not change anything in it.
def load_essay(filename):
    """Loads the text in the given file, returning it as one long string."""
    with open(filename, 'r') as essay_file:
        return essay_file.read()

In [None]:
essays = ...
essays

**Question 4.** Using `tbl.apply()` and the `correct_mispel` function you wrote earlier, create a table called `corrected_essays` with two columns:

1. "Name": The filename of the essay
2. "Corrected text": The whole text of the essay, with "mispel" corrected as "misspell".

In [None]:
corrected_essays = ...
corrected_essays

In [None]:
grader.check('q2_4')

Did this do anything?  Were there even misspellings in the original essay?  Let's find out.

**Question 5.** The string method `str.splitlines` produces a list of the lines of a string.  Use it to make a table called `news_writing_lines` with a single column called "Line" that contains the **original** lines from the text called "news_writing.txt". There should be one row in `news_writing_lines` for each line in the text of "News Writing"; there may be some empty rows, but you can ignore those.
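
A quick sketch of what `splitlines` does, using a short made-up string rather than an essay:

```python
toy_text = "first line\nsecond line\n\nfourth line"
toy_lines = toy_text.splitlines()
print(toy_lines)
print(len(toy_lines))
```

The blank line in the middle becomes an empty string, which is why some rows of your table may be empty.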

*Hint:* You don't need to use it, but `news_writing` is available to help break down the steps. How do you access the text of "News Writing"?

In [None]:
news_writing = ...
news_writing_lines = ...
news_writing_lines

In [None]:
grader.check('q2_5')

**Question 6.** Use the table method `where` and the predicate `are.containing` to find all the lines in `news_writing_lines` that include the word "mispel".  Make a table of those lines called `misspelled_lines`.

*Note:* You should also find versions of "mispel" like "mispeled" or "mispeling", and your code probably corrected those, too.  That's okay.
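
The note above follows from how substring matching works. As a sketch in plain Python (using the `in` operator on made-up lines, rather than `are.containing` on a table), any line that contains "mispel" anywhere inside a longer word still counts as a match:

```python
toy_lines = ["I mispeled a word.", "Nothing wrong here.", "Stop mispeling!"]

# Keep only the lines that contain the substring "mispel".
matches = [line for line in toy_lines if "mispel" in line]
print(matches)
```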

In [None]:
misspelled_lines = ...
misspelled_lines

In [None]:
grader.check('q2_6')

**Question 7.** In the cell below, repeat the work you did in questions 5 and 6, but for the corrected version of "News Writing" you produced in `corrected_essays`. 

Did your correction fix the misspellings? If so, set `fixed_typos` to `True`. If not, you may need to check your code.

In [None]:
# Use this cell to check whether your code fixed the misspellings.


In [None]:
fixed_typos = ...

In [None]:
grader.check('q2_7')

## 3. A "data-driven pandemic": COVID-19 in the United States

This exercise is designed to give you practice using `group` and other table methods.

We'll be looking at the COVID-19 Data Repository from Johns Hopkins University. You can find the raw data [here](https://github.com/CSSEGISandData/COVID-19). We've cleaned up the datasets a bit, and we will be investigating the number of confirmed cases and the number of new cases in the United States from March to June.

The following table, `confirmed_cases`, contains the number of confirmed cases at the start of each month for every county in the United States, starting in March and ending in June.

In [None]:
## Just run this cell. 
confirmed_cases = Table().read_table("covid_by_county.csv")
confirmed_cases.show(10)

**Question 1.** First, let's learn about how the number of confirmed cases has affected each state. Let's find the five states with the most COVID cases in June. 

To do that, create a table with two columns, one for the state and the other for the number of confirmed cases in June. There should be one row for each state and territory, containing the total number of confirmed cases for that state in June. Sort it in descending order by count, take the top 5 states, and assign the array of state strings to the variable `june_states`.
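
The logic behind "group, total, sort, take the top few" can be sketched in plain Python on made-up rows. This is only meant to show the bookkeeping; in your answer, table methods such as `group` and `sort` do this work for you.

```python
from collections import defaultdict

# Made-up (region, count) rows; each region appears more than once.
toy_rows = [("North", 4), ("South", 10), ("North", 6), ("East", 3), ("South", 1)]

# Add up the counts for each region (one total per group).
totals = defaultdict(int)
for region, count in toy_rows:
    totals[region] += count

# Sort the totals from largest to smallest and keep the top 2 names.
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
top_two = [region for region, _ in ranked[:2]]
print(top_two)
```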

*Note:* You should only need to use table methods to answer this question. Do not manually assign `june_states` to the strings of the state names. If you are struggling with this question, try breaking it down into its separate steps and use multiple variables.

In [None]:
grouped_by_state = ...
june_states = ...
june_states

In [None]:
grader.check('q5_1')

**Question 2.** Now, let's create a bar chart to compare some states. However, one problem is that these states have different populations, which makes it difficult to compare. Therefore, we will use **confirmed cases per capita** (total number of cases divided by total population) to compare across states.
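
As a small sketch of a per capita calculation with array arithmetic (the numbers and names below are made up):

```python
import numpy as np

toy_cases = np.array([100, 50])         # made-up case counts for two places
toy_population = np.array([1000, 250])  # made-up populations, in the same order

# Element-wise division: each count is divided by the matching population.
toy_per_capita = toy_cases / toy_population
print(toy_per_capita)
```

The place with fewer cases has the higher per capita rate here, which is why raw counts alone can be misleading. Element-wise division only makes sense if both arrays list the places in the same order.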

To help you, we've assigned `pop_by_state` to a 2-column table that contains the name and the total population of each state, and we assigned `west_coast` to an array that contains the names of the 5 West Coast states.

Using these variables, create a single horizontal bar plot that compares the 5 West Coast states using the confirmed cases per capita, with separate bars for March, April, May, and June. (That is, the y-axis should show the states, the x-axis should show the confirmed cases per capita, and each state should have a separate bar for each month.)

*Hint:* This is a long question! You should use `grouped_by_state` if you defined it in the previous question, and array arithmetic to find the confirmed cases per capita for each month. It may also be easier to create a new table with a different column for each separate month. 

In [None]:
# Use this cell to make your plot.
west_coast = make_array("Washington", "California", "Oregon", "Hawaii", "Alaska")
pop_by_state = Table().read_table("pop_by_state.csv")

west_coast_cases = ...
confirmed_per_capita = ...

**Question 3.** Finally, let's look at the number of new cases per day. We can calculate this by subtracting the number of confirmed cases in 1 day from the number of confirmed cases from the day after.

For example: If Day 1 had 10, Day 2 had 25, and Day 3 had 15 cases, then we had an increase of 15 cases (25 - 10) followed by a decrease of 10 cases (15 - 25 = -10).

This new table, `covid_by_state`, contains the number of confirmed cases in each state and territory in the United States, starting on March 1st and ending on June 9th (a total of 100 days). Complete the function `new_state_cases`, which will return a table with 2 columns that has the number of days since 3/1 and the number of new cases each day in that state.

*Hint:* What function in numpy have we used in the past that lets us calculate the difference between subsequent numbers in an array?
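
To make the subtraction concrete, here is a sketch that works through the Day 1, Day 2, Day 3 figures by hand with array slices (the variable names are our own). The numpy function the hint refers to gives the same result in a single call.

```python
import numpy as np

daily_totals = np.array([10, 25, 15])  # the Day 1, Day 2, Day 3 figures

# Each later day minus the day before it.
changes = daily_totals[1:] - daily_totals[:-1]
# One change per pair of consecutive days, so one fewer entry than the input.
days = np.arange(1, len(daily_totals))
print(days, changes)
```

Because the result has one fewer element than the original array, the column of day numbers you pair it with must be shortened to match.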

In [None]:
## Just run this cell.
covid_by_state = Table().read_table("covid_by_state.csv")
covid_by_state.show(5)

In [None]:
## Complete the following function. 
def new_state_cases(state):
    """Describe the function here."""
    new_cases = ...
    num_days = ...
    state_table = Table().with_columns("Days since 3/1/20", ..., 
                                       "New cases in " + state, ...)
    return ...

In [None]:
## Run this cell. If you do not see a graph, or if you receive an error, there is a problem with your function.
new_state_cases("New York").scatter(0)
new_state_cases("California").scatter(0)

In [None]:
## For your interest and exploration, the following function will let you choose 2 states and compare their trends.
# Just run this cell. 
from ipywidgets import interact

def two_state_cases(state1 = "California", state2 = "New York"):
    """Take in 2 states, find their new cases, and graph them together"""
    state1tbl = new_state_cases(state1)
    state2tbl = new_state_cases(state2)
    two_state_cases_tbl = state1tbl.join("Days since 3/1/20", state2tbl).relabel(1, state1).relabel(2, state2)
    two_state_cases_tbl.scatter(0)

interact(two_state_cases)

**Question 4.** Although there are significant amounts of data and reporting surrounding the COVID-19 pandemic, there are many factors that may confound our analysis. One example is the levels of testing; as our levels of testing increase, it is natural to assume that the number of new cases will rise as well. 

What are some factors with our data that may affect our findings? This may include how the data is collected or recorded, how it is analyzed, etc. Discuss them in a few sentences below.

*Note:* If you are interested in why there may be problems with COVID-19 data, check out this [article](https://www.citylab.com/life/2020/05/coronavirus-data-positive-tests-deaths-covid-19-stats/610306/) by CityLab.

*Write your answer here, replacing this text.*

## 4. Submission

To submit your homework, please download your notebook as a .ipynb file and submit to Gradescope. You can do so by navigating to the toolbar at the top of this page, clicking File > Download as... > Notebook (.ipynb). Then, go to our class's Gradescope page [here](https://www.gradescope.com/courses/136698) and upload your file under "Homework 4". 

To check your work, you may run the cell below. Remember that for homework assignments, passing the tests does not necessarily mean your answer is correct.

In [None]:
grader.check_all()