In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw05.ipynb")

# Homework 5

Welcome to Homework 5 of DATA 271: Data Wrangling and Visualization! In this assignment, we will practice importing data from an API and exploring it. Remember that part of working with APIs is learning how to find answers, whether that's through documentation, forums, or tools like ChatGPT. If you get stuck, practice searching for solutions just like a real data scientist would! As always, if you have questions, feel free to discuss problems with your peers and come to office hours.

For this homework and all future ones, please do not reassign variables throughout the notebook! For example, if you use `my_list` in your answer to one question, do not reassign it later on. Otherwise, you will fail tests that you passed previously. **NOTE:** This homework assignment contains hidden tests. Passing all the auto-graded tests in the assignment does *NOT* mean you have answered everything correctly. So be careful and check your work!

### Import libraries

**Question 0:** Import the appropriate modules for this assignment, including NumPy, Pandas, Matplotlib, Seaborn, Regular Expression (re), and Requests. **(1 Point)**

### 1. Importing the data

In this assignment, you will practice using an API to retrieve air quality data from [OpenAQ](https://openaq.org/). You will explore available data, extract air quality measurements for different locations in California, and analyze trends over time.

To acquire the data, you will use the [OpenAQ API](https://docs.openaq.org/about/about). To get started, go to the [OpenAQ API documentation](https://docs.openaq.org/using-the-api/api-key) and sign up for a free account if needed. Once signed in, navigate to your profile and copy your API Key.

Before making requests, read through the documentation to understand the available endpoints and data structure.

NOTE: Creating an account is entirely free. You should NOT enter any payment information.

**Question 1.1:** Assign your copied API Key, as a string, to `api_key`. We will send it as a header with each API request. **(1 Point)**

In [None]:
api_key = ...
headers = {"X-API-Key": api_key}

In [None]:
grader.check("q1_1")

**Question 1.2:** All requests to the OpenAQ API use a common base URL. Refer to the documentation to determine what the base URL is and assign it to `base_url`. **(1 Point)**

In [None]:
base_url = ...

In [None]:
grader.check("q1_2")

**Question 1.3:** We will focus on several locations in California: Davis, Anaheim, Fresno, and San Francisco. The respective location IDs are provided. In each location, we will look at PM2.5 data, which refers to fine particulate matter with a diameter of 2.5 micrometers or smaller. These particles are small enough to be inhaled deep into the lungs and can have significant health effects.

Use the sensors endpoint to get the sensor ID associated with the PM2.5 parameter for each of these locations. Put the sensor IDs in a dictionary called `pm25_sensors`, where the keys are the names of the locations and the values are the PM2.5 sensor IDs for those locations. Your sensor IDs should be of type string. **(5 Points)**
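
As a rough illustration of this kind of lookup (not the actual API response), the sketch below pulls a sensor ID out of a dictionary shaped like a typical JSON payload. The field names `results`, `parameter`, and `name`, the helper `find_sensor_id`, and all ID values are hypothetical; check the documentation and your own response for the real structure.

```python
# Hypothetical payload; the real sensors response may be shaped differently.
sample_payload = {
    "results": [
        {"id": 101, "parameter": {"name": "o3"}},
        {"id": 102, "parameter": {"name": "pm25"}},
    ]
}

def find_sensor_id(payload, parameter_name):
    """Return the ID (as a string) of the first sensor measuring parameter_name."""
    for sensor in payload["results"]:
        if sensor["parameter"]["name"] == parameter_name:
            return str(sensor["id"])
    return None  # no sensor reports this parameter

example_sensors = {"Example City": find_sensor_id(sample_payload, "pm25")}
print(example_sensors)
```

Converting with `str()` matters here because the ID in this made-up payload is an integer, while the question asks for strings.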

In [None]:
location_ids = {'Davis': '878', 'Anaheim': '8875', 'Fresno':'895', 'San Francisco': '2009'}

In [None]:
# Use this cell to find sensor ids

In [None]:
pm25_sensors = ...

In [None]:
grader.check("q1_3")

**Question 1.4:** Pull the daily PM2.5 air quality data for each location from January 1, 2020 to December 31, 2022. Store the data for each location in separate response objects.

*NOTE:* Set the limit to 1000. **(5 Points)**
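
A date range and a limit are usually passed as query parameters. The sketch below only builds a URL, without contacting any server, to show how a parameter dictionary becomes a query string. The host `api.example.com`, the path, and the parameter names `date_start` and `date_end` are placeholders invented for illustration; the real endpoint and parameter names are in the OpenAQ documentation.

```python
from urllib.parse import urlencode

# Placeholder values, not the OpenAQ endpoint or its parameter names.
example_url = "https://api.example.com/measurements"
example_params = {
    "date_start": "2020-01-01",
    "date_end": "2022-12-31",
    "limit": 1000,
}

full_url = f"{example_url}?{urlencode(example_params)}"
print(full_url)

# With Requests, the same dictionary can be passed directly, e.g.
# requests.get(example_url, headers=headers, params=example_params)
```

Passing a dictionary through `params` is generally less error-prone than concatenating the query string by hand.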

In [None]:
# Use this cell to pull the Davis data
davis_pm25 = ...

In [None]:
# Use this cell to pull the Anaheim data
anaheim_pm25 = ...

In [None]:
# Use this cell to pull the Fresno data
fresno_pm25 = ...

In [None]:
# Use this cell to pull the San Francisco data
sf_pm25 = ...

In [None]:
grader.check("q1_4")

### 2. Inspecting the data
We'll start understanding the output by focusing on the data from the Davis location.

**Question 2.1:** Recall that the payload of an API call is the actual data carried by a request or response; here, it is the data returned by our `GET` request. The payload can be sent or received in various formats, including JSON.
JSON is made up of keys and values, much like a Python dictionary. Use the `.json()` method on the Davis response object to get the payload and assign it to `payload`.

Then inspect the keys of `payload`. 

In a markdown cell, explain what information is associated with each key. **(2 Points)**
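
For a feel of what inspecting a parsed JSON payload looks like, here is a minimal sketch on a made-up JSON string. The keys `info` and `items` are hypothetical and are not the keys of the OpenAQ response.

```python
import json

# Hypothetical response body; your payload will have its own keys.
raw_text = '{"info": {"count": 2}, "items": [{"value": 7.5}, {"value": 9.1}]}'
example_payload = json.loads(raw_text)

# Print each top-level key with the Python type of its value.
for key, value in example_payload.items():
    print(key, type(value).__name__)
```

Calling `.json()` on a Requests response performs this same parsing step on the response body.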

In [None]:
payload = ...

In [None]:
grader.check("q2_1")

**Question 2.2:** Take some time to inspect the results of the payload to get a feel for what is stored in it. Then, when you are ready, put the results into a Pandas DataFrame called `davis_df`.
**(5 Points)**
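
One possible approach, sketched on made-up records rather than the real payload, is `pd.json_normalize`, which flattens nested dictionaries into dotted column names. The record fields below (`period`, `label`, `start`) are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up records with nested fields, loosely modeled on a list of JSON results.
records = [
    {"value": 8.2, "period": {"label": "1day", "start": {"utc": "2020-01-01T08:00:00Z"}}},
    {"value": 11.4, "period": {"label": "1day", "start": {"utc": "2020-01-02T08:00:00Z"}}},
]

flat_df = pd.json_normalize(records)
print(flat_df.columns.tolist())
print(flat_df.shape)
```

Each nested level becomes part of the column name, joined by a period, which is why flattened API data often has names with several dots in them.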

In [None]:
davis_df = ...
davis_df.head()

In [None]:
grader.check("q2_2")

**Question 2.3:** Follow the steps above to make similar Pandas DataFrames for Anaheim, Fresno, and San Francisco. **(5 Points)**

In [None]:
anaheim_df = ...
anaheim_df.head()

In [None]:
fresno_df = ...
fresno_df.head()

In [None]:
sf_df = ...
sf_df.head()

In [None]:
grader.check("q2_3")

**Question 2.4:** Create a copy of the dataframe for each location so that we have an original copy of the data after we make changes to it in our analysis. **(2 Points)**
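
The difference between a real copy and a second name for the same object is easy to miss. The toy sketch below, with hypothetical variable names, shows why plain assignment does not protect the original data.

```python
import pandas as pd

original = pd.DataFrame({"value": [1.0, 2.0, 3.0]})

alias = original            # same object under a second name
snapshot = original.copy()  # independent copy (deep by default)

# Modify the original in place.
original["value"] = original["value"] * 10

print(alias["value"].tolist())     # follows the change
print(snapshot["value"].tolist())  # keeps the earlier values
```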

In [None]:
davis_df_copy = ...
anaheim_df_copy = ...
fresno_df_copy = ...
sf_df_copy = ...

In [None]:
grader.check("q2_4")

<!-- BEGIN QUESTION -->

**Question 2.5:** Begin inspecting the dataframes with methods like `.describe()`, `.info()`, etc. Describe what each column means. Which columns have null values? Are there any unexpected data types? Describe what the `value` column represents. **(3 Points)**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### 3. Preprocess/Clean the Data

**Question 3.1:** You may have noticed in the previous problem that some of the columns have many null values. Drop any columns from your dataframes that have more than 950 null values. **(5 Points)**
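
As a sketch of one way to filter columns by null count, the example below uses a four-row toy DataFrame and a toy threshold of 2. The column names and the threshold are made up; the same pattern should carry over with a different cutoff.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "value": [1.0, 2.0, 3.0, 4.0],
    "mostly_empty": [np.nan, np.nan, np.nan, 5.0],
    "some_missing": [1.0, np.nan, 3.0, 4.0],
})

max_nulls = 2  # toy threshold chosen for this 4-row example
null_counts = toy.isna().sum()  # number of nulls per column

# Keep only the columns whose null count does not exceed the threshold.
cleaned = toy.loc[:, null_counts <= max_nulls]
print(cleaned.columns.tolist())
```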

In [None]:
grader.check("q3_1")

**Question 3.2:** Take a look at the `period.datetimeFrom.utc` column in each dataframe. What data type is each element in the column? Convert the column to a Pandas datetime data type. **(5 Points)**
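
A minimal sketch of the conversion on a toy Series of ISO 8601 strings follows. The timestamps are invented, and the assumption that your column holds strings in this format should be checked against your own data.

```python
import pandas as pd

timestamps = pd.Series(["2020-01-01T08:00:00Z", "2021-06-15T08:00:00Z"])
print(type(timestamps.iloc[0]))  # each element starts out as a plain string

converted = pd.to_datetime(timestamps)
print(converted.dtype)
print(converted.dt.year.tolist())
```

Once a column has a datetime dtype, the `.dt` accessor exposes components such as year, month, and day.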

In [None]:
davis_df['period.datetimeFrom.utc'] = ...
anaheim_df['period.datetimeFrom.utc'] = ...
fresno_df['period.datetimeFrom.utc'] = ...
sf_df['period.datetimeFrom.utc'] = ...

In [None]:
grader.check("q3_2")

**Question 3.3:** We want to focus on air quality over time. Create subsets of the dataframe for each location containing `value`, `period.datetimeFrom.utc`, `summary.median`, and `summary.max`. Rename the columns in each to `pm25_avg`, `date`, `pm25_median`, and `pm25_max` respectively. **(5 Points)**
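
Selecting a few columns and renaming them can be chained in one expression. The sketch below uses hypothetical column names (`reading`, `meta.time`, `unused`) on a toy DataFrame.

```python
import pandas as pd

toy = pd.DataFrame({
    "reading": [5.0, 6.0],
    "meta.time": ["2020-01-01", "2020-01-02"],
    "unused": ["a", "b"],
})

# Double brackets select a list of columns; rename maps old names to new ones.
toy_subset = toy[["reading", "meta.time"]].rename(
    columns={"reading": "avg", "meta.time": "day"}
)
print(toy_subset.columns.tolist())
```

Because `rename` returns a new DataFrame by default, `toy` keeps its original column names.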

In [None]:
davis_subset = ...
anaheim_subset = ...
fresno_subset = ...
sf_subset = ...


In [None]:
grader.check("q3_3")

**Question 3.4:** Each DataFrame currently has daily air quality data. Create new DataFrames for each location (based on the subset dataframes) with the average monthly data. **(5 Points)**
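
One way to aggregate daily rows into monthly means is `resample`, sketched here on three made-up daily readings. This is only one option (grouping on a month column is another), and the autograder may expect a particular output shape, so treat it as an illustration rather than the required method.

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-10"]),
    "pm25_avg": [10.0, 20.0, 30.0],
})

# "MS" labels each group by the first day of its month.
monthly = daily.resample("MS", on="date").mean().reset_index()
print(monthly)
```

Note that `resample` also emits a row for any month inside the date range that has no observations, with `NaN` as its mean.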

In [None]:
davis_monthly = ...
anaheim_monthly = ...
fresno_monthly = ...
sf_monthly = ...

In [None]:
grader.check("q3_4")

**Question 3.5:** We want to be able to look at data in each location together. Before we merge our datasets, add a column called `location` to each monthly DataFrame containing the location. For example, `davis_monthly` should have a location column containing `Davis` in every entry. **(5 Points)**
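
Assigning a single value to a new column repeats it in every row, as this toy sketch with a hypothetical DataFrame shows.

```python
import pandas as pd

toy_monthly = pd.DataFrame({"pm25_avg": [12.0, 9.5, 14.1]})
toy_monthly["location"] = "Example City"  # scalar is broadcast to every row
print(toy_monthly)
```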

In [None]:
grader.check("q3_5")

**Question 3.6:** Concatenate all of the monthly DataFrames to create a single DataFrame with monthly data from all four locations. **(5 Points)**
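
A minimal sketch of stacking DataFrames that share the same columns, using two made-up locations `A` and `B`:

```python
import pandas as pd

city_a = pd.DataFrame({"pm25_avg": [10.0, 12.0], "location": "A"})
city_b = pd.DataFrame({"pm25_avg": [7.0, 8.0], "location": "B"})

# ignore_index=True renumbers the rows instead of repeating 0, 1, 0, 1.
combined = pd.concat([city_a, city_b], ignore_index=True)
print(combined)
```

Whether to reset the index is a design choice; without `ignore_index=True`, each input's original row labels are kept and may repeat.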

In [None]:
monthly_df = ...
monthly_df.head()

In [None]:
grader.check("q3_6")

**Question 3.7:** Create separate columns `year` and `month` in `monthly_df` containing the year and month from the `date` column. **(5 Points)**
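
Assuming the `date` column already has a datetime dtype, the `.dt` accessor can extract its components. The sketch below uses two invented dates.

```python
import pandas as pd

toy = pd.DataFrame({"date": pd.to_datetime(["2020-03-01", "2021-11-01"])})
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
print(toy)
```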

In [None]:
...
monthly_df.head()

In [None]:
grader.check("q3_7")

### 4. Explore and visualize the data

**Question 4.1:** What is the mean of the PM2.5 averages at each location? Your solution should be a series with locations as the indices and PM2.5 levels as the values. Which location has the worst PM2.5 levels on average? Is this what you expect based on what you know about these locations? Put your comments in a Markdown cell. **(5 Points)**
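
A grouped mean returns a Series indexed by the grouping column. The sketch below demonstrates this on made-up values for two hypothetical locations.

```python
import pandas as pd

toy = pd.DataFrame({
    "location": ["A", "A", "B", "B"],
    "pm25_avg": [10.0, 20.0, 5.0, 7.0],
})

toy_means = toy.groupby("location")["pm25_avg"].mean()
print(toy_means)
print(toy_means.idxmax())  # label of the group with the highest mean
```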

In [None]:
location_averages = ...

In [None]:
grader.check("q4_1")

**Question 4.2:** Create a violin plot (called `location_pm25`) showing the distribution of PM2.5 averages for each location. Comment on what you notice in a Markdown cell. **(5 Points)**

In [None]:
location_pm25 = ...

In [None]:
grader.check("q4_2")

<!-- BEGIN QUESTION -->

**Question 4.3:** Compute the correlation between PM2.5 levels at different locations. Which locations have the strongest correlation in PM2.5 levels? What could explain high or low correlations between locations? (e.g., proximity, local climate, pollution sources) **(5 Points)**
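
Correlating locations generally requires one column per location, aligned on date. A possible sketch on toy data reshapes from long to wide with `pivot` and then calls `.corr()`; the values are invented so that the two series are perfectly correlated.

```python
import pandas as pd

long_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"] * 2),
    "location": ["A"] * 3 + ["B"] * 3,
    "pm25_avg": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per date, one column per location.
wide = long_df.pivot(index="date", columns="location", values="pm25_avg")
corr_matrix = wide.corr()  # pairwise Pearson correlation between columns
print(corr_matrix)
```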

_Type your answer here, replacing this text._

<!-- END QUESTION -->

**Question 4.4:** Plot PM2.5 over time for each location in a figure called `pm25_over_time`. Facet the plot by location. Comment on what you notice. **(5 Points)**
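
Faceting means drawing one panel per group with shared axes. The sketch below does this by hand with Matplotlib on made-up data; Seaborn's `relplot` with a `col=` argument is an alternative, and the autograder may expect a specific object type for `pm25_over_time`, so treat this only as an illustration of the idea.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-02-01"] * 2),
    "location": ["A", "A", "B", "B"],
    "pm25_avg": [10.0, 12.0, 6.0, 9.0],
})

# One panel per location, sharing the y-axis so levels are comparable.
groups = list(toy.groupby("location"))
fig, axes = plt.subplots(1, len(groups), sharey=True, figsize=(8, 3))
for ax, (name, group) in zip(axes, groups):
    ax.plot(group["date"], group["pm25_avg"])
    ax.set_title(name)
    ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
```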

In [None]:
pm25_over_time = ...

In [None]:
grader.check("q4_4")

<!-- BEGIN QUESTION -->

**Question 4.5:** Explore the distribution of average PM2.5 levels at different locations. Create a separate density plot (KDE) for each location, using different colors for different years to observe trends over time. **(5 Points)**

In [None]:
pm25_distributions = ...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.6:** Use your choice of data visualization to see how PM2.5 changes by season.
Do certain months have higher pollution levels? **(5 Points)**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.7:** Data cleaning decisions can significantly impact the results of an analysis. Were there any assumptions or transformations we made while preparing this dataset that could introduce bias or misrepresent the data? How might these choices affect our interpretation of PM2.5 trends? **(2 Points)**

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 4.8:** This preliminary exploration allowed us to compare the distribution of PM2.5 levels over time across a few different locations. What are some other explorations we could do with this dataset? Use code cells to try a few possibilities.

In what ways could an analysis like this be useful for public health, policy decisions, or environmental justice efforts? What additional data or context might be needed to make this analysis more actionable for communities affected by air pollution? Put your thoughts in a markdown cell. **(3 Points)**

In [None]:
# Additional exploration

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(run_tests=True)