In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab09.ipynb")

# Lab 09: Getting Data from an API and Cleaning the Data

Welcome to Lab 09 of Data Wrangling and Visualization!
## Overview
### APIs
API stands for Application Programming Interface: a defined way for software to access data, server software, or other applications.  In simple terms, it is a software intermediary that allows two applications to "talk" to each other.  APIs are quite versatile and are used in web-based systems, operating systems, database systems, and computer hardware.

As a simplified example, when you sign in to Twitter from your phone, you are telling the Twitter application that you would like to access your account. The mobile application calls an API to retrieve your Twitter account and credentials. Twitter then looks up this information on one of its servers and returns the data to the mobile application.  This is an example of a web API, which is the kind we use in this activity.

How an API can be used depends on the owner of the dataset. The data may be offered for free or at a cost. The owner can also limit the number of requests that a single user can make or the amount of data they can access.

### Data Wrangling
Broadly, data wrangling can be split into 3 tasks (in no particular order and often repeated):
- data cleaning (e.g., renaming columns, reordering, handling duplicates or missing data, filtering to desired subsets)
- data transformation (changing the data's structure to facilitate downstream analysis, such as transposing the data, ensuring there is only one observation per row)
- data enrichment (e.g., merging new data with the original data by appending new rows/columns, or using the original data to create new data).

This activity just scratches the surface of this topic.


## In today's lab, we will
- Explore an API to find and collect temperature data
- Find the city we want
- Begin to understand the frequently employed steps in data wrangling and clean the data
- Do a small EDA

For this activity, we will collect daily temperature data from the National Centers for Environmental Information (NCEI) API.  The API is documented here: https://www.ncdc.noaa.gov/cdo-web/webservices/v2. NCEI is part of the National Oceanic and Atmospheric Administration (NOAA).  NOAA's mission is to understand and predict changes in climate, weather, ocean, and coasts, to share that knowledge and information with others, and to conserve and manage coastal and marine ecosystems and resources.


In [None]:
import numpy as np
import pandas as pd
import requests

## 1. Request a Token and Set Up Our URL

**Question 1.1:** To gain access to the NCDC Web Services, you have to register with your email address and will be sent a unique token.  Registration is here: https://www.ncdc.noaa.gov/cdo-web/token.  Get your token and paste it in `access_token` below.

*NOTE:* This API limits users to 5 requests per second and 10,000 requests per day.  (If requests exceed this amount, you will get a client error with a status code in the 400s.  More about status codes can be found here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.)
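
As a quick orientation to the status code classes mentioned in the note, the following sketch uses Python's standard `http.HTTPStatus` enum to label a few codes. The `describe` helper is a hypothetical name introduced for illustration; it makes no network requests and says nothing about which specific code this API returns.

```python
from http import HTTPStatus

def describe(code):
    """Return a short label for an HTTP status code (illustrative helper)."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {status.phrase} ({kind})"

# Print labels for a handful of common codes.
for code in (200, 401, 429, 503):
    print(describe(code))
```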

In [None]:
access_token = ...

In [None]:
grader.check("q1_1")

**Question 1.2:** Online documentation about how to access specific data from this API can be found here: https://www.ncdc.noaa.gov/cdo-web/webservices/v2. Take some time to read through the documentation and learn about the different endpoint URLs you can access. Each one has the same base URL. Find the base URL in the documentation (without the endpoint), and paste it below.

In [None]:
base_url = ...

In [None]:
grader.check("q1_2")

<!-- BEGIN QUESTION -->

**Question 1.3:** List all the possible endpoints we could select, and give a short description of each. 

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## 2. Exploring the API
We will begin by exploring the API to get the data we need. Our ultimate goal is to look at daily temperature data from San Francisco. Let's figure out how we need to query the data to get the info we want. 

**Question 2.1:** We are going to start exploring the options in the `datasets` endpoint. Concatenate `datasets` with your base url from the previous question.

In [None]:
endpoint_url = ...
endpoint_url

In [None]:
grader.check("q2_1")

**Question 2.2:** Notice that "token" is listed as a header in the documentation. We can pass it to a GET request made with the `requests` module so that the API can identify us as a user. We will prepare to do that here. 

Create a dictionary containing one element. The key should be "token" and its value should be your access token from Question 1.1. 

In [None]:
header_dict = ...
header_dict

In [None]:
grader.check("q2_2")

**Question 2.3:** Take a look at the `datasets` tab in the API documentation. Notice that there are several optional parameters we can use to filter the data when we make the request. We can pass these to a GET request made with the `requests` module so that the API returns only relevant data. We will prepare to do that here. 

Create a dictionary containing one element. The key should be "startdate" and its value should be October 1st, 2018 (in the format specified by the API's documentation). 
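
To see where headers and query parameters end up in a request, here is a minimal sketch using only the standard library, so it runs without any network access. The URL, token, and parameter names are made up for illustration. `requests.get(url, headers=..., params=...)` performs an equivalent encoding of `params` into the URL and sends `headers` separately.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values for illustration only; not the lab's real URL or token.
url = "https://api.example.com/v1/items"
headers = {"token": "FAKE-TOKEN"}
params = {"category": "books", "limit": 5}

# Query parameters are appended to the URL after a "?".
full_url = f"{url}?{urlencode(params)}"

# Headers travel separately from the URL. Building a Request object sends nothing.
request = Request(full_url, headers=headers)

print(request.full_url)
```

The key point is that parameters change the URL being requested, while headers carry information about the request itself, such as who is making it.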

In [None]:
params_dict = ...
params_dict

In [None]:
grader.check("q2_3")

Run the following cell to fetch the information about the different datasets available.

In [None]:
# Get the data
datasets_response = requests.get(endpoint_url,headers=header_dict,params=params_dict)
datasets_response

**Question 2.4:** Check the status code of your response object. 

In [None]:
response_status = ...
response_status

In [None]:
grader.check("q2_4")

**Question 2.5:** As we discussed in class, the payload of an API call is the actual data carried by a request or response, as opposed to metadata such as headers and status codes. For a GET request like ours, the payload of interest is the data the server sends back. A payload can be sent or received in various formats, including JSON. This API returns JSON data.

Put the JSON data from your response in the `payload` variable below. 
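
For a sense of what parsed JSON looks like in Python, here is a small sketch using the standard `json` module on a made-up response body (the keys and values are invented for illustration). The `json()` method of a `requests` response object does this same kind of parsing on the response body.

```python
import json

# Made-up response body for illustration; a real API response will differ.
body = '{"metadata": {"count": 2}, "results": [{"id": "A", "name": "Alpha"}, {"id": "B", "name": "Beta"}]}'

# A JSON object becomes a dict and a JSON array becomes a list.
parsed = json.loads(body)

print(type(parsed).__name__)
print(list(parsed.keys()))
print(parsed["results"][0])
```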

In [None]:
payload = ...
payload

In [None]:
grader.check("q2_5")

**Question 2.6:** Inspect the keys of `payload`. We want to figure out which dataset we should pull data from. The relevant information is in `id` and `name`. Create a list of tuples with "id" and "name" from the `results` key.
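
One common way to pull a couple of fields out of a list of dictionaries is a list comprehension. The sketch below uses made-up records with hypothetical keys (`code`, `label`), not the API's actual fields.

```python
# Made-up records; the keys "code" and "label" are hypothetical.
records = [
    {"code": "X1", "label": "First", "extra": 10},
    {"code": "X2", "label": "Second", "extra": 20},
]

# Build one (code, label) tuple per record, ignoring the other fields.
pairs = [(record["code"], record["label"]) for record in records]
print(pairs)
```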

In [None]:
# Inspect payload here


In [None]:
id_and_name = ...
id_and_name

In [None]:
grader.check("q2_6")

**Question 2.7:** Looking at the results above, we have various options for frequency of data. In this lab, we want to work with daily summaries, so we will ask for the GHCND data. 

Next, we need to figure out what categories of data we can request. Make another API get request below. This time use `datacategories` as your endpoint and `GHCND` as the datasetid parameter. Inspect your response data.

In [None]:
# Make an API request
daily_response = ...
daily_response

In [None]:
# Inspect the response here


In [None]:
grader.check("q2_7")

**Question 2.8:** For this activity, we are interested in air temperature, so we will use the `TEMP` id. 

Now we need to figure out what type of temperature data we can request. Make another API get request below. Use `datatypes` as your endpoint and `TEMP` as the datacategoryid parameter. Also limit your API call to 100 results in the response (read the documentation to figure out how to do this). Ask Dr. Johnson if you need help. 

Once you've successfully made the API request, inspect the payload.

In [None]:
# Make an API request
types_response = ...
types_response

In [None]:
# Inspect the payload here


In [None]:
grader.check("q2_8")

**Question 2.9:** There is a lot going on in this payload. To make it easier to interpret, create a list of tuples with "id" and "name" from the results key of the `types_response` payload.

In [None]:
# Build the list of (id, name) tuples
id_and_name2 = ...
id_and_name2

In [None]:
grader.check("q2_9")

**Question 2.10:** Wow, that's a lot of temperature data types we can pull. In this lab, we'll focus on TMAX and TMIN. We're almost ready to request the data, but we only want data for a specific location. 

Make another API get request below. Use `locationcategories` as your endpoint and `GHCND` as the datasetid parameter. Inspect the payload.

In [None]:
loc_response = ...
loc_response

In [None]:
grader.check("q2_10")

**Question 2.11:** 
For this lab, we will focus on San Francisco, one of the largest cities in Northern California, so we will use the "CITY" location category id.  We need to find the location ID for San Francisco.  However, there are almost 2000 cities to choose from, so we don't want to find the id by scanning the whole list visually.  

Use the `locations` endpoint to find the city ID for San Francisco.

*HINT:* You might find it helpful to use a `sortfield` by name and an `offset` to request the section of the data that contains San Francisco. Read the documentation to learn how to use these and ask Dr. Johnson if you are stuck. 
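
To illustrate the idea behind sorting and offsetting, here is a sketch that mimics it on a local list. The `fetch_page` function and the city names are hypothetical, and the offset is assumed to be zero-based here; check the API documentation for how its own `offset` is counted.

```python
def fetch_page(records, sort_key, offset, limit):
    """Mimic a server that sorts records, skips `offset` of them, and returns up to `limit`."""
    ordered = sorted(records, key=lambda record: record[sort_key])
    return ordered[offset:offset + limit]

# Made-up records standing in for a long list of locations.
cities = [{"name": name} for name in ["Tucson", "Austin", "Reno", "Miami", "Boise", "Salem"]]

# Skip the first two names alphabetically, then take the next three.
page = fetch_page(cities, sort_key="name", offset=2, limit=3)
print([city["name"] for city in page])
```

Because the records are sorted by name, a well-chosen offset lets you jump straight to the part of the alphabet you care about instead of paging through everything.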

In [None]:
pars = ...

In [None]:
city_response = requests.get(base_url+'locations',headers = header_dict, params = pars)
city_response

In [None]:
# Paste the city id for San Francisco here
sf_id = ...

In [None]:
grader.check("q2_11")

**Question 2.12:** Now it's time to actually request the data we want with the `data` endpoint. 

Let's request San Francisco's temperature data (in Celsius) for October 2018.  We will need to use all of the parameters we have obtained previously by exploring the API. That is, the parameters should be
- `datasetid` is GHCND
- `locationid` is sf_id
- `startdate` is 2018-10-01
- `enddate` is 2018-10-31
- `datatypeid` is ['TMAX', 'TMIN'] (maximum and minimum temperatures)
- `units` is metric (if we want Celsius)
- `limit` is 1000 
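
When a parameter takes several values, as `datatypeid` does above, a list value is typically encoded as a repeated key in the query string. The sketch below shows this with the standard library and made-up parameter names. `requests` encodes list values in `params` the same way.

```python
from urllib.parse import urlencode

# Hypothetical parameter names and values, for illustration only.
params = {"color": ["red", "blue"], "size": "large"}

# doseq=True expands a list into one key=value pair per element.
query = urlencode(params, doseq=True)
print(query)
```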

In [None]:
# get SF daily summaries data 
data_response = ...
    ...
        ...
        ...
        ...
        ...
        ...
        ...
        ...
    ...
...
data_response

In [None]:
grader.check("q2_12")

## 3. Create a Pandas DataFrame

**Question 3.1:** Create a pandas dataframe containing the results from the `data_response` payload.
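
As a reminder of how pandas handles a list of dictionaries, here is a minimal sketch with made-up records and hypothetical column names. Each dictionary becomes a row and each key becomes a column.

```python
import pandas as pd

# Made-up records; the column names are hypothetical.
records = [
    {"station": "S1", "kind": "HIGH", "reading": 21.5},
    {"station": "S1", "kind": "LOW", "reading": 12.0},
    {"station": "S2", "kind": "HIGH", "reading": 23.1},
]

# One row per dictionary, one column per key.
df = pd.DataFrame(records)
print(df.shape)
print(list(df.columns))
```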

In [None]:
sf_df = ...
sf_df.head()

In [None]:
grader.check("q3_1")

**Question 3.2:** How many rows are there for each `datatype`? Comment on what you observe. Were you expecting these results? What do you think is going on here?

_Type your answer here, replacing this text._

In [None]:
datatype_counts = ...
datatype_counts

In [None]:
grader.check("q3_2")

**Question 3.3:** Let's update the API request to a single station. Limit the API call results to just the San Francisco Downtown station located near Market Street (on the corner of Hermann and Buchanan Streets). The station ID is GHCND:USW00023272.

In [None]:
query_parameters = ...
        ...
        ...
        ...
        ...
        ...
        ...
        ...
        ...
    ...

In [None]:
downtown_sf_response = ...
downtown_sf_response

In [None]:
grader.check("q3_3")

**Question 3.4:** Put your new response data into a Pandas DataFrame. 

In [None]:
downtown_sf_df = ...
downtown_sf_df.head()

In [None]:
grader.check("q3_4")

**Question 3.5:** Check the number of rows for each datatype in your new data. Comment on what you notice.

In [None]:
new_datatype_counts = ...
new_datatype_counts

In [None]:
grader.check("q3_5")

## 4. Clean the Data

**Question 4.1:** The value column contains temperatures in degrees Celsius.  We should rename this column so it is clear at first glance what it represents.  We can rename columns with the `rename()` method, which takes a dictionary mapping the old column name to the new column name.

Rename the `value` column to `tempC`.
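
The following sketch shows `rename()` on a made-up DataFrame with hypothetical column names. Note that `rename()` returns a new DataFrame by default, so the result needs to be assigned to a variable.

```python
import pandas as pd

# Made-up data with hypothetical column names.
df = pd.DataFrame({"val": [1.5, 2.5], "unit": ["m", "m"]})

# rename() returns a new DataFrame; df itself is left unchanged.
renamed = df.rename(columns={"val": "length_m"})

print(list(df.columns))
print(list(renamed.columns))
```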

In [None]:
downtown_sf_df = ...
downtown_sf_df.head()

In [None]:
grader.check("q4_1")

<!-- BEGIN QUESTION -->

**Question 4.2:** Check the datatypes in the dataframe. 

In [None]:
dtypes = ...
dtypes

<!-- END QUESTION -->

**Question 4.3:** Python has a `datetime` module which supplies classes for manipulating dates and times (e.g., arithmetic of dates and times, efficient attribute extraction for output formatting and manipulation, converting to common time zones). 

The `date` column is not currently being stored as a datetime object, but we can convert it using `pd.to_datetime(your_pandas_series)`. Convert the date column to datetime. 
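
Here is a small sketch of what the conversion does, using made-up date strings. Once a Series holds datetimes, the `.dt` accessor exposes parts of each date such as the day of the month.

```python
import pandas as pd

# Made-up date strings for illustration.
dates = pd.Series(["2020-03-01T00:00:00", "2020-03-02T00:00:00"])
converted = pd.to_datetime(dates)

# The original Series holds strings; the converted one holds datetimes.
print(pd.api.types.is_datetime64_any_dtype(dates))
print(pd.api.types.is_datetime64_any_dtype(converted))
print(converted.dt.day.tolist())
```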

In [None]:
downtown_sf_df['date'] = ...
downtown_sf_df.dtypes

In [None]:
grader.check("q4_3")

**Question 4.4:** In the US, we are accustomed to thinking in degrees Fahrenheit, so we might wish to add a new column with this.  The linear equation that relates these two scales is $$F = \frac{9}{5}C + 32$$.
Add a new column called `tempF` with degrees in Fahrenheit.
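
As a sanity check on the formula, here is a plain-Python sketch with a hypothetical helper name. The same arithmetic applies element-wise when the input is a pandas column.

```python
def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Freezing point, a mild day, and boiling point.
for c in (0, 20, 100):
    print(c, celsius_to_fahrenheit(c))
```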

In [None]:
downtown_sf_df.head()

In [None]:
grader.check("q4_4")

**Question 4.5:** For some use cases, temperatures of type int are sufficient. Create two new columns called `tempC_int` and `tempF_int` of type int. 
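
One detail worth knowing when converting floats to integers: `astype(int)` drops the fractional part rather than rounding. The sketch below contrasts the two on made-up values. Which behavior is appropriate depends on the use case.

```python
import pandas as pd

# Made-up values for illustration.
temps = pd.Series([18.9, 18.2, -0.7])

truncated = temps.astype(int)        # drops the fractional part
rounded = temps.round().astype(int)  # rounds to the nearest whole number first

print(truncated.tolist())
print(rounded.tolist())
```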

In [None]:
downtown_sf_df.head()

In [None]:
grader.check("q4_5")

## 5. Small EDA

**Question 5.1:** Sort the dataframe to find which day in October 2018 had the highest `tempC` value.  
Assign `hottest_day` to the day in October (type int) which was the hottest.

In [None]:
# sort here

In [None]:
# answer which day here
hottest_day = ...
hottest_day

In [None]:
grader.check("q5_1")

**Question 5.2:** When you sorted the data in the problem above, you may have noticed some ties.  For example, 26.1 Celsius appears twice, as does 11.7.  In some cases, the earlier date comes first; in other cases, the opposite is true. Re-sort the data by maximum temperature, but make sure that the earliest date consistently comes first when there is a tie.
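
Ties can be broken by sorting on more than one column. The sketch below uses made-up data with hypothetical column names: `sort_values` accepts a list of columns and a matching list of sort directions.

```python
import pandas as pd

# Made-up data with ties in "score".
df = pd.DataFrame({
    "day": [3, 1, 2, 4],
    "score": [7.0, 9.0, 7.0, 9.0],
})

# Highest score first; among equal scores, smallest day first.
ordered = df.sort_values(by=["score", "day"], ascending=[False, True])
print(ordered["day"].tolist())
```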

In [None]:
# sort here
sorted_df = ...
sorted_df

In [None]:
grader.check("q5_2")

<!-- BEGIN QUESTION -->

**Question 5.3:** Plot the TMAX and TMIN in Fahrenheit by date. 

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## 6. Dessert

- Create a dataframe with temperatures recorded in Eureka in January 2024. How many entries do you have of each datatype? Does this exceed the number of days in the month? Explain why this might be.
- How many unique stations are included in this data set?
- What was the coldest temperature recorded at station GHCND:USW00024213 in January 2024 in degrees Fahrenheit? On what day did this occur?
- Make a plot with dates along the x-axis and Degrees Fahrenheit along the y-axis showing the minimum and maximum daily temperatures in Eureka in January 2024 (facet by station).

<!-- END QUESTION -->

## You're done! 

Congratulations on finishing the lab! Gus is happy you learned about temperature today even though he doesn't like the cold! Run the cell below and submit to Canvas. 

<img src="gus_goes_on_an_adventure.JPG" alt="drawing" width="500"/>

### References
- Hands-On Data Analysis with Pandas by Stefanie Molin
- National Oceanic and Atmospheric Administration (NOAA) National Centers for Environmental Information (NCEI): https://www.ncei.noaa.gov

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)