In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw04.ipynb")

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Homework 04: Visualizations and Functions

## Textbook References

* [Chapter 7](https://ccsf-math-108.github.io/textbook/chapters/07/Visualization.html)
* [Sections 8.0 and 8.1](https://ccsf-math-108.github.io/textbook/chapters/08/Functions_and_Tables.html)

---

## Assignment Reminders

- 🚨 Make sure to run the code cell at the top of this notebook that starts with `# Initialize Otter` to load the auto-grader.
- Your Tasks are categorized as auto-graded (📍) and manually graded (📍🔎).
    - For all the auto-graded tasks:
        - Replace the `...` in the provided code cell with your own code.
        - Run the `grader.check` code cell to run some tests on your code.
        - Keep in mind that homework and project assignments sometimes include hidden tests whose results you cannot see. We use those tests to score the correctness of your response. **Passing the auto-grader does not guarantee that your answer is correct.**
    - For all the manually graded tasks:
        - You might need to provide your own response to the provided prompt. Do so by replacing the template text "_Type your answer here, replacing this text._" with your own words.
        - You might need to produce a graphic or something else using code. Do so by replacing the `...` in the code cell to generate the image, table, etc.
        - In either case, review the rubric on the associated <a href="https://ccsf.instructure.com" target="_blank">Canvas</a> Assignment page to understand the scoring criteria.
- Throughout this assignment and all future ones, do not reassign variables! _For example, if you use `max_temperature` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you thought you were passing previously!_
- You may [submit](#Submit-Your-Assignment-to-Canvas) this assignment as many times as you want before the deadline. Your instructor will score the last version you submit once the deadline has passed.
- We encourage you to discuss this assignment with others but make sure to write and submit your own code. Refer to the syllabus to learn more about how to learn cooperatively.
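
The reminder about reassigning variables can be illustrated with a minimal sketch. The values below are hypothetical and are not taken from any task in this notebook. They only show how reusing a name changes the result of a check that passed earlier.

```python
# Hypothetical values, for illustration only.
max_temperature = 98                              # answer to an earlier question
earlier_check_passes = (max_temperature == 98)    # the check passes here

max_temperature = 72                              # the same name reused later on
later_check_passes = (max_temperature == 98)      # the same check now fails

print(earlier_check_passes, later_check_passes)
```

Choosing a new name for the later value (for example, a hypothetical `max_temperature_indoors`) avoids the problem.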

---

## Configure the Notebook

Run the following cell to configure this Notebook.

In [None]:
import numpy as np
from datascience import *
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

---

## Unemployment


The [Federal Reserve Bank of St. Louis](https://fred.stlouisfed.org/categories/33509) publishes data about jobs in the US. We've imported data from this site on unemployment rates in the United States. There are many ways of defining unemployment rates and our dataset includes two.
1. Non-Employment Index (NEI): This is defined as the percentage of people who are able to work and are looking for a full-time job who can't find a job. 
2. NEI+PTER (Non-Employment Index and 'Part-Time for Economic Reasons'): This is defined as the percentage of people who are able to work and are looking for a full-time job who can't find a job or are only working at a part-time job.
The data that you will use in this assignment contains [quarterly average NEI percentages from 1994 until 2024](https://fred.stlouisfed.org/series/NEIM156SFRBRIC) and [quarterly average NEI+PTER percentages from 1994 until 2024](https://fred.stlouisfed.org/series/NEIPTERM156SFRBRIC).

In a previous assignment, you created a table called `unemployment` that contained recent NEI and PTER values. Run the following code cell (which you are not responsible for understanding) to load that data here.

In [None]:
from datetime import datetime, date

def update_date_format(date_string):
    try:
        return datetime.strptime(date_string, '%Y-%m-%d').date()
    except ValueError:
        return None

unemployment = Table.read_table('unemployment.csv')

# Check if the 'DATE' column contains strings, then convert to datetime.date
if not isinstance(unemployment.column('DATE').item(0), date):
    unemployment = unemployment.with_column(
        'DATE', unemployment.apply(update_date_format, 'DATE')
    )

unemployment
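
For the curious, the date conversion used in the cell above can be sketched on its own. This is a standalone illustration with a hypothetical function name (`parse_iso_date`) and made-up input strings. It mirrors the `try`/`except` pattern above: text in `YYYY-MM-DD` form becomes a `datetime.date`, and anything else becomes `None`.

```python
from datetime import datetime

def parse_iso_date(date_string):
    """Return a datetime.date for 'YYYY-MM-DD' text, or None if the text does not match."""
    try:
        return datetime.strptime(date_string, '%Y-%m-%d').date()
    except ValueError:
        # strptime raises ValueError when the text does not fit the format
        return None

good = parse_iso_date('1994-01-01')   # matches the format
bad = parse_iso_date('01/01/1994')    # does not match the format
print(good, bad)
```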

---

### Task 01 📍🔎

<!-- BEGIN QUESTION -->

Using the `unemployment` table, create an overlaid line plot showing two lines for NEI and PTER unemployment percentages over the dates. The dates should be on the horizontal axis and the percentages should be on the vertical axis.

**Note:** Our code below will format the graph for you so that the x-axis shows the year portion of the date for every 3 years starting from 1994.

_Points:_ 2

In [None]:
...

# Leave the following code to improve the readability of the horizontal axis tick marks
start_date = min(unemployment.column("DATE"))
end_date = max(unemployment.column("DATE"))
years = [datetime(year, 1, 1) for year in range(start_date.year, end_date.year + 3, 3)]
plt.gca().set_xticks(years)
plt.gca().set_xticklabels([year.strftime('%Y') for year in years])
plt.xticks(rotation=45)
plt.title("Unemployment Percentages (1994 - 2024)")
plt.show()

<!-- END QUESTION -->

### Task 02 📍🔎

<!-- BEGIN QUESTION -->

As you saw early on in the course, visuals produced from data can reveal a story about the data. For this task, review the line graphs above for patterns, identify 3 major events in US history evident in the graph and hypothesize how the events and the patterns in the unemployment data are related.

In your response:
1. Mention 3 events that occurred between 1994 and 2024.
2. Express how those 3 events show up in the patterns you are observing.
3. Hypothesize how those events are associated with unemployment data.

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

## Birth Rates


The CSV file `census.csv` contains census-based population estimates for each state on both July 1, 2022 and July 1, 2023. The data was taken from [the US Census 2020-2023 national totals data set](https://www2.census.gov/programs-surveys/popest/datasets/2020-2023/state/totals/NST-EST2023-ALLDATA.csv). 

Run the following code cell to load that data into a table called `pop`.

In [None]:
pop = Table.read_table('census.csv')
pop

Here is a brief explanation of the column labels:

* `REGION`: Census Region code
* `NAME`: State name
* `'2022'`: 7/1/2022 resident total population estimate
* `'2023'`: 7/1/2023 resident total population estimate
* `'BIRTHS'`: Births in period 7/1/2022 to 6/30/2023
* `'DEATHS'`: Deaths in period 7/1/2022 to 6/30/2023
* `'MIGRATION'`: Net migration in period 7/1/2022 to 6/30/2023
* `'OTHER'`: Residual for period 7/1/2022 to 6/30/2023

The last four columns describe the components of the estimated change in population during this time interval. 

**Note:** For all questions below, assume that the word "states" refers to all 52 rows including Puerto Rico & the District of Columbia.

---

### Task 03 📍

Assign `us_birth_rate` to the total US annual birth rate during this time interval. The annual birth rate for a year-long period is the total number of births in that period as a proportion of the population size at the start of the time period.

**Hint:** Which year corresponds to the start of the time period?
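
As a quick check of the definition above, here is the arithmetic on made-up numbers. The names and values are hypothetical and do not come from `pop`. The sketch only shows that a rate is births divided by the population at the start of the period.

```python
# Hypothetical numbers, for illustration only.
births_in_period = 12_000
population_at_start = 1_000_000

# Births as a proportion of the starting population
example_birth_rate = births_in_period / population_at_start
print(example_birth_rate)
```

The result is a proportion between 0 and 1, not a percentage.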


_Points:_ 2

In [None]:
us_birth_rate = ...
us_birth_rate

In [None]:
grader.check("task_03")

---

### Task 04 📍

In the next task, you will be creating a visualization to understand the relationship between birth and death rates for the states. The annual death rate for a year-long period in a state is the total number of deaths in that period for that state as a proportion of the population size at the start of the time period for that state.

What visualization is most appropriate to see if there is an association between birth and death rates during a given time interval among the states?

1. Line Graph
<br>
2. Scatter Plot
<br>
3. Bar Chart

Assign `visualization` below to the number corresponding to the correct visualization.


_Points:_ 2

In [None]:
visualization = ...

In [None]:
grader.check("task_04")

---

### Task 05 📍🔎

<!-- BEGIN QUESTION -->

In the code cell below, create a visualization that will help us determine if there is an association between birth rate and death rate during this time interval among the states. 

It may be helpful to create intermediate variables. In our template, we've introduced the names `birth_rates_2022` and `death_rates_2022` as suggestions. We will not test those names, so you do not need to use them. We will only score the visualization you produce.


_Points:_ 2

In [None]:
birth_rates_2022 = ...
death_rates_2022 = ...

# Leave the following code to add a title to your graph
plt.title('2022 Death Rate vs. Birth Rate')
plt.show()

<!-- END QUESTION -->

---

### Task 06 📍

`True` or `False`: There is an association between birth rate and death rate during this time interval among the states. 

Assign `assoc` to `True` or `False` in the cell below. 


_Points:_ 2

In [None]:
assoc = ...

In [None]:
grader.check("task_06")

---

## Marginal Histograms


Consider the following scatter plot: 

<img src="scatter.png" alt="The Scatter plot" width=40%>

The axes of the plot represent values of two variables: $x$ and $y$. 

Suppose we have a table called `t` that has two columns in it:

- `x`: a column containing the x-values of the points in the scatter plot
- `y`: a column containing the y-values of the points in the scatter plot

Below, you are given two histograms &ndash; one corresponds to column `x` and one corresponds to column `y`.

### Histogram A

<img src="histogram_A.png" alt="Histogram A: Asymmetrical histogram with a peak around -0.5 and a right skew" width=40%>

### Histogram B

<img src="histogram_B.png" alt="Histogram B: Symmetrical histogram with two peaks at -1 and 1 but no data around 0" width=40%>

---

### Task 07 📍

Suppose we run `t.hist('y')`. Which histogram does this code produce? Assign `histogram_column_y` to one of the following strings: `'A'` or `'B'`.

_Points:_ 2

In [None]:
histogram_column_y = ...

In [None]:
grader.check("task_07")

---

## Uber Movement

According to the former [Uber](https://www.uber.com) Movement project page:
> Planning great cities requires great data. Uber gathers trip data in more than 10,000 cities across the world. So why not share it? Enter Uber Movement, which gives urban planners access to Uber’s aggregated data to help make informed decisions about our cities.

The Uber Movement project and data access have ended, but we still have data from Boston and Manila. Below we load two tables from the Uber Movement project, each containing 200,000 weekday Uber rides: one for the Manila, Philippines, metropolitan area and one for the Boston, Massachusetts, metropolitan area. The `'sourceid'` and `'dstid'` columns contain codes corresponding to the start and end locations of each ride. The `'hod'` column contains codes corresponding to the hour of the day the ride took place. The ride time column contains the length of the ride in minutes.

Run the following code cell to create the table `uber` which contains the available Uber ride data for Boston and Manila.

In [None]:
boston = Table.read_table("boston.csv").with_column('city', ['Boston']*200_000)
manila = Table.read_table("manila.csv").with_column('city', ['Manila']*200_000)
uber = boston.append(manila)
uber

---

### Task 08 📍🔎

<!-- BEGIN QUESTION -->

Using the `uber` table, produce an overlaid histogram that visualizes the distributions of all ride times in Boston and Manila. Use the `group='city'` argument with the [`hist` table method](https://datascience.readthedocs.io/en/master/_autosummary/datascience.tables.Table.hist.html#datascience.tables.Table.hist) to accomplish this. Additionally, use the given bins in `equal_bins` by utilizing the `bins` argument for the `hist` table method.

_Points:_ 2

In [None]:
equal_bins = np.arange(0, 120, 5)
...

# Leave the following code to add a title to your histogram
plt.title('Distribution of Boston and Manila Ride Times')
plt.show()

<!-- END QUESTION -->

---

### Task 09 📍🔎

<!-- BEGIN QUESTION -->

Why do you think the distributions for Boston and Manila are different? 

For this task:
* Form a hypothesis that identifies external factors of the two cities that may be causing the difference!
* Provide at least one reference (link) to support your claim.

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

---

### Task 10 📍🔎

<!-- BEGIN QUESTION -->

From the histograms, it looks like there are more 20 to 40-minute Uber rides in Manila compared to Boston. Histograms reflect density, so be careful of interpreting the heights of the bars as counts. It is okay to compare the histograms directly because they both represent 200,000 data points. For this task, produce a bar chart showing that Manila does have more 20 to 40-minute Uber rides than Boston. 

**Note:** For this task, it doesn't matter if you include 40-minute rides in your count or not.

**Hint:** Consider using the [`group` table method](https://datascience.readthedocs.io/en/master/_autosummary/datascience.tables.Table.group.html).

_Points:_ 2

In [None]:
...

# Leave the following code to add a title to your bar chart
plt.title('Number of Uber Rides between 20 and 40 Minutes')
plt.show()

<!-- END QUESTION -->
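
Task 10 points out that histograms reflect density rather than counts. The relationship can be sketched with arithmetic. This assumes the vertical axis is in percent per minute and uses a made-up bar height. Only the 200,000 total and the 5-minute bin width come from this notebook.

```python
total_rides = 200_000          # rides per city, as stated above
bin_width = 5                  # minutes, matching the spacing of equal_bins
bar_height = 2.0               # hypothetical height, in percent per minute

# Area of the bar = height * width = percent of rides that fall in the bin
percent_in_bin = bar_height * bin_width

# Convert the percent to a count of rides
rides_in_bin = round(percent_in_bin / 100 * total_rides)
print(percent_in_bin, rides_in_bin)
```

Because both cities have the same number of rides and the same bins, a taller bar in one city does mean more rides in that bin. With unequal totals or unequal bin widths, that comparison would not hold.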

---

## NYC Motor Vehicle Collisions - Crashes

The data in `'nyc_crashes_2024_sample.csv'` contains information from a random sample of 1,000 police-reported motor vehicle collisions in NYC during 2024. The police report ([MV104-AN](https://www.nhtsa.gov/sites/nhtsa.dot.gov/files/documents/ny_overlay_mv-104an_rev05_2004.pdf)) is required to be filled out for collisions where someone is injured or killed, or where there is at least $1000 worth of damage.

You can read more about the data at the [NYC OpenData page: Motor Vehicle Collisions - Crashes](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data).

Run the following code cell to create the table `nyc` from that `CSV` file.

In [None]:
nyc = Table.read_table('nyc_crashes_2024_sample.csv')
nyc

A major part of a data analyst's workload involves prepping the data for analysis. Some refer to this process as cleaning the data. In the next few tasks, you'll use what you've learned about making and applying functions to start to clean up these data.

---

It is very common for data sets to be incomplete. Notice the many `nan` values throughout the table. `nan` refers to "not a number" and is a generic placeholder for missing data. You might see this expressed in a variety of other ways such as `NaN`. For example, in the `'borough'` and `'latitude'` columns, you should see `nan` expressed in the data. A tricky thing about cleaning up the `nan` values in `nyc` is that the data type of the `nan` value depends on the data type of the data in the related column. In general, you should know the data type of a column before working with its values.

### Task 11 📍

What is the data type for the `nan` values in the `'borough'` column? Assign `borough_nan_type` to the integer 1, 2, 3, or 4 corresponding to your response option from below:

1. String
2. Integer
3. Float
4. None of the above

**Hint**: You can use the `type` function to check the data type of an item from an array.

_Points:_ 2

In [None]:
borough_nan_type = ...
borough_nan_type

In [None]:
grader.check("task_11")

---

### Task 12 📍

What is the data type for the `nan` values in the `'latitude'` column? Assign `latitude_nan_type` to the integer 1, 2, 3, or 4 corresponding to your response option from below:

1. String (`str`)
2. Integer (`int`)
3. Float (`float`)
4. None of the above

**Hint**: You can use the `type` function to check the data type of an item from an array.

_Points:_ 2

In [None]:
latitude_nan_type = ...
latitude_nan_type

In [None]:
grader.check("task_12")

---

Aside from the `nan` values, the following code shows that the `'latitude'` and `'longitude'` columns contain `0` values, which do not make sense because [New York City is bounded](https://en.wikipedia.org/wiki/Module:Location_map/data/USA_New_York_City) by the latitude values of 40.49 to 40.92 and the longitude values of -74.27 to -73.68. The center of the city is at (40.705, -73.975).

In [None]:
nyc.sort('latitude').show(5)

One way to handle missing values is to remove the rows for which certain columns have missing values. 

We created the following function for you that determines if a latitude and longitude combination could possibly be in New York City. The function returns a `bool` value depending on whether or not the coordinates are within the NYC boundaries. You'll learn about the code in this function soon!

In [None]:
def in_nyc(lat, long):
    """
    This function checks if the latitude (float) and longitude (float) provided fall within 
    the approximate boundaries of New York City. The function returns True if the latitude is between 
    40.49 and 40.92 and the longitude is between -74.27 and -73.68. Otherwise, it returns False.

    Examples:
    >>> in_nyc(40.705, -73.975)
    True

    >>> in_nyc(27.000, -73.975)
    False
    """
    
    if 40.49 <= lat <= 40.92 and -74.27 <= long <= -73.68:
        return True
    else:
        return False

Notice how the function handles `nan` values:

In [None]:
in_nyc(np.nan, -73.975)
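
The behavior above follows from how floating-point `nan` compares with other numbers: every ordering or equality comparison that involves `nan` evaluates to `False`. A minimal sketch using only the standard library:

```python
import math

missing = math.nan

# Any comparison involving nan is False, so a range check can never succeed
inside_latitude_range = 40.49 <= missing <= 40.92
equals_itself = (missing == missing)

# math.isnan is the reliable way to detect a float nan
detected = math.isnan(missing)

print(inside_latitude_range, equals_itself, detected)
```

This is why a bounds check like the one in `in_nyc` treats a missing coordinate the same way as a coordinate outside the city.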

Let's use this tool to clean up the data regarding the latitude and longitude values.

### Task 13 📍

Apply the `in_nyc` function to `nyc` to create the table `nyc_clean_1` that contains only the rows in `nyc` where the latitude and longitude values are possible. In the end, `nyc_clean_1` should have the same columns as `nyc` but with fewer rows.

**Notes**: 
1. Remember from a recent lecture that you can apply a function with multiple arguments.
2. To remove the unwanted rows, try adding a column of the `bool` values to the table and using `where` to filter based on the flawed latitude and longitude values. Don't forget to remove that added column in your final table.

_Points:_ 4

In [None]:
in_nyc_bools = ...
nyc_clean_1 = ...
nyc_clean_1

In [None]:
grader.check("task_13")

---

### PARKWAY vs. PKWY

A classic data cleaning problem involves handling multiple representations of the same thing. For example, run the following code to see that Grand Central Parkway is expressed 2 times as `'GRAND CENTRAL PARKWAY'` and 9 times as `'GRAND CENTRAL PKWY'` in the `'on_street_name'` column. For reference, the `'on_street_name'` column gives the street on which the collision occurred according to the [NYC OpenData reference](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data).
 

In [None]:
(nyc_clean_1
 .where('on_street_name', are.containing('GRAND CENTRAL'))
 .group('on_street_name'))

`'PKWY'` is an abbreviation for `'PARKWAY'`. Now, we'll work on cleaning up this issue in the data, because in the end, we should be acknowledging that there are a total of 12 incidents on this street.

### Task 14 📍

Write a function called `update_PKWY` that takes in a string as input, replaces any instance (sub-string) of `'PKWY'` with `'PARKWAY'`, and returns the updated string.

**Notes:**
* Remember the `replace` string method that you've used before in this class? It could be helpful here!
* As a test case, running `update_PKWY('GRAND CENTRAL PKWY')` should return `'GRAND CENTRAL PARKWAY'`.
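
As a reminder of how the string method `replace` behaves, here is a small sketch on a different, hypothetical abbreviation (`'AVE'` for `'AVENUE'`). The street names are made up. Note that `replace` substitutes every occurrence of the substring, including occurrences inside longer words.

```python
# Hypothetical street names, for illustration only.
abbreviated = 'OCEAN AVE'
already_spelled_out = 'FIFTH AVENUE'

# The abbreviation is expanded as intended
expanded = abbreviated.replace('AVE', 'AVENUE')

# 'AVE' also appears inside 'AVENUE', so the result here is unintended
over_replaced = already_spelled_out.replace('AVE', 'AVENUE')

print(expanded)
print(over_replaced)
```

Whether this kind of over-replacement matters depends on whether the abbreviation can appear inside its own expansion.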

_Points:_ 2

In [None]:
...

In [None]:
grader.check("task_14")

---

### Task 15 📍

Using `update_PKWY`, replace every instance of `'PKWY'` in the `'on_street_name'`, `'off_street_name'`, and `'cross_street_name'` columns of `nyc_clean_1` with `'PARKWAY'`. Save the updated version of `nyc_clean_1` as `nyc_cleanish`.

**Notes**: 
* We recommend that you break up this task into multiple steps (lines of code).
* Try using the `apply` table method to update the street names. If you have any coding experience, you might be tempted to use a loop of some kind. We don't recommend this because loops are generally inefficient for working with large data sets. If you don't know what a loop is, don't worry! You'll learn about those in this class.
* Also, remember that the `with_columns` table method will replace the information in an existing column with new information if you use the same column label.

_Points:_ 4

In [None]:
nyc_cleanish = ...

In [None]:
grader.check("task_15")

---

There is still a lot to clean up in this data set and a lot of ways we could fill in missing information. For example, it turns out that the entire route of the [Grand Central Parkway](https://en.wikipedia.org/wiki/Grand_Central_Parkway) is contained within the borough of Queens. This means that we can fill in some of the missing `'borough'` values using this observation. However, that is enough for this assignment.
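
The idea of filling in a missing borough from the street name could be sketched as follows. This is only an illustration on made-up rows, not part of the assignment. It assumes that a missing borough appears as the text `'nan'` and that Queens is written as `'QUEENS'`. The function name `fill_borough` is hypothetical.

```python
def fill_borough(street, borough):
    """Return 'QUEENS' when the borough is missing and the street is Grand Central Parkway."""
    if borough == 'nan' and 'GRAND CENTRAL PARKWAY' in street:
        return 'QUEENS'
    return borough

# Made-up (street, borough) pairs, for illustration only
example_rows = [
    ('GRAND CENTRAL PARKWAY', 'nan'),
    ('BROADWAY', 'nan'),
    ('GRAND CENTRAL PARKWAY', 'QUEENS'),
]

filled = [fill_borough(street, borough) for street, borough in example_rows]
print(filled)
```

Rows on other streets keep their missing value, since the street name alone does not identify their borough.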

---

## A Map of Crashes by Borough

To wrap up, here is a map of the crashes by borough using the latitude and longitude values you helped clean up! We just dropped the rows in the table with missing borough information. There is nothing you need to do with this except preview some upcoming content in the code.

In [None]:
nyc_cleanish = nyc_cleanish.where(nyc_cleanish.column('borough') != 'nan')

colors = (nyc_cleanish
          .group('borough')
          .drop('count')
          .with_column('color', 
                       make_array('blue', 'red', 'green', 'orange', 'purple')))

nyc_map = (nyc_cleanish
           .join('borough', colors)
           .select('latitude', 'longitude', 'borough', 'color')
           .relabeled('latitude', 'lat')
           .relabeled('longitude', 'long')
           .relabeled('borough', 'labels'))

Marker.map_table(nyc_map)

---

## Submit Your Assignment to Canvas

Follow these steps to submit your homework assignment:

1. **Review the Rubric:** View the rubric on the associated Canvas Assignment page to understand the scoring criteria.
2. **Run the Auto-Grader:** Ensure you have executed the code cell containing the command `grader.check_all()` to run all tests for auto-graded tasks marked with 📍. This command will execute all auto-grader tests sequentially.
3. **Complete Manually Graded Tasks:** Verify that you have responded to all the manually graded tasks marked with 📍🔎.
4. **Save Your Work:** In the notebook's Toolbar, go to `File -> Save Notebook` to save your work and create a checkpoint.
5. **Download the Notebook:** In the notebook's Toolbar, go to `File -> Download IPYNB` to download the notebook (`.ipynb`) file.
6. **Upload to Canvas:** On the Canvas Assignment page, click "Start Assignment" or "New Attempt" to upload the downloaded `.ipynb` file.

---

## Attribution

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a> and derived from <a href="https://www.data8.org/">Data 8: The Foundations of Data Science</a>, offered by the University of California, Berkeley.

<img src="./by-nc-sa.png" width=100px>

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()