## Exploring an API to find and collect temperature data

Let's explore the weather API that's provided by the [NCDC](https://www.ncdc.noaa.gov/cdo-web/webservices/v2#gettingStarted).

NCDC offers web services that provide access to current data. This API is for developers looking to create their own scripts that use the CDO database of weather and climate data.


In [1]:
# Import the necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import os
import requests

import datetime as dt

from dotenv import load_dotenv

In [2]:
# Load the secret keys
load_dotenv("../../.env")
token = os.environ.get("Token")
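
`os.environ.get` returns `None` when the variable is missing, so the problem typically only shows up later, when the API rejects the request. A minimal sketch of failing early instead; `require_env` is a hypothetical helper, not part of this notebook or of `python-dotenv`:

```python
import os

def require_env(name, env=None):
    """Return the value of an environment variable, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check the .env file")
    return value

# Demonstrated with a plain dict standing in for os.environ
print(require_env("Token", {"Token": "example-token"}))
```

In the notebook this would be called as `require_env("Token")` right after `load_dotenv(...)`.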

In [3]:
def make_request(endpoint, payload=None):
    """
    Make a request to a specific endpoint on the weather API,
    passing headers and an optional payload.
    
    :param endpoint: The API endpoint to query
    :type endpoint: str
    :param payload: Query parameters to send with the request
    :type payload: dict
    :returns: A response object
    :rtype: requests.Response
    """
    api = "https://www.ncdc.noaa.gov/cdo-web/api/v2"
    
    return requests.get(
        f"{api}/{endpoint}",
        headers={"token":token},
        params=payload
    )

In [7]:
# Define the params
# Note: these dates will change depending on when you run this code
start_date = dt.datetime(2023,3,1)
end_date = start_date + dt.timedelta(days=30)

# # Convert datetime to string
# start_date = start_date.strftime('%Y-%m-%d')
# end_date = end_date.strftime('%Y-%m-%d')

In [8]:
# Get the list of available datasets.
# Use the parameters to filter the result;
# here we filter by `start_date`

response = make_request("datasets", {"startdate": start_date})

In [9]:
response.status_code

200

In [10]:
response.url

'https://www.ncdc.noaa.gov/cdo-web/api/v2/datasets?startdate=2023-03-01+00%3A00%3A00'
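
The `+00%3A00%3A00` suffix in the URL is the time part of the `datetime` object: a value that is not a string is converted with `str()` before being URL-encoded. A small sketch using the standard library's `urlencode` (used here only to illustrate the encoding, with no request made) shows the difference that formatting the date first makes:

```python
import datetime as dt
from urllib.parse import urlencode

start_date = dt.datetime(2023, 3, 1)

# Passing the datetime object as-is keeps the time component
raw = urlencode({"startdate": start_date})

# Formatting it first sends only the date
formatted = urlencode({"startdate": start_date.strftime("%Y-%m-%d")})

print(raw)        # startdate=2023-03-01+00%3A00%3A00
print(formatted)  # startdate=2023-03-01
```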

In [11]:
payload = response.json()

We got 11 rows back. Let's see what fields are in the `results` portion of the payload.

In [12]:
payload["results"][0].keys()

dict_keys(['uid', 'mindate', 'maxdate', 'name', 'datacoverage', 'id'])

We want to look at the `id` and `name` of each dataset:

In [13]:
[ (result["id"], result["name"]) for result in payload["results"]]

[('GHCND', 'Daily Summaries'),
 ('GSOM', 'Global Summary of the Month'),
 ('GSOY', 'Global Summary of the Year'),
 ('NEXRAD2', 'Weather Radar (Level II)'),
 ('NEXRAD3', 'Weather Radar (Level III)'),
 ('NORMAL_ANN', 'Normals Annual/Seasonal'),
 ('NORMAL_DLY', 'Normals Daily'),
 ('NORMAL_HLY', 'Normals Hourly'),
 ('NORMAL_MLY', 'Normals Monthly'),
 ('PRECIP_15', 'Precipitation 15 Minute'),
 ('PRECIP_HLY', 'Precipitation Hourly')]

The first entry in the result, `('GHCND', 'Daily Summaries')`, is what we are looking for.
Now that we have a value for `datasetid` (`GHCND`), we proceed to identify one for `datacategoryid`,
which we need to request temperature data. We do so using the `datacategories` endpoint.
Here, we can print the JSON payload since it isn't that large (only nine entries):

In [14]:
response = make_request("datacategories", payload={"datasetid": "GHCND"})
payload = response.json()
payload

{'metadata': {'resultset': {'offset': 1, 'count': 9, 'limit': 25}},
 'results': [{'name': 'Evaporation', 'id': 'EVAP'},
  {'name': 'Land', 'id': 'LAND'},
  {'name': 'Precipitation', 'id': 'PRCP'},
  {'name': 'Sky cover & clouds', 'id': 'SKY'},
  {'name': 'Sunshine', 'id': 'SUN'},
  {'name': 'Air Temperature', 'id': 'TEMP'},
  {'name': 'Water', 'id': 'WATER'},
  {'name': 'Wind', 'id': 'WIND'},
  {'name': 'Weather Type', 'id': 'WXTYPE'}]}

Based on the previous result, we know that we want a value of `TEMP` for
`datacategoryid`. Next, we use this to identify the data types we want by using the
`datatypes` endpoint. We will use a list comprehension once again to keep only the
IDs and names; this is still a rather large list, so we load it into a DataFrame and show only the last few rows:

In [15]:
response = make_request("datatypes", 
                        payload={"datacategoryid": "TEMP", "limit": 60})
datatypes = response.json()

# Get the id and names of the datatypes
datatypes_output = [(datatype["id"], datatype["name"]) for datatype in datatypes["results"]]

# Turn the output into a dataframe
datatypes_df = pd.DataFrame(datatypes_output, columns=["id", "name"])
datatypes_df.tail()

| | id | name |
|---|---|---|
| 54 | MNTM | Monthly mean temperature |
| 55 | TAVG | Average Temperature. |
| 56 | TMAX | Maximum temperature |
| 57 | TMIN | Minimum temperature |
| 58 | TOBS | Temperature at the time of observation |


In [18]:
response = make_request("locationcategories", payload={"datasetid":"GHCND"})
response.status_code

200

In [19]:
import pprint as pp

pp.pprint(response.json())


{'metadata': {'resultset': {'count': 12, 'limit': 25, 'offset': 1}},
 'results': [{'id': 'CITY', 'name': 'City'},
             {'id': 'CLIM_DIV', 'name': 'Climate Division'},
             {'id': 'CLIM_REG', 'name': 'Climate Region'},
             {'id': 'CNTRY', 'name': 'Country'},
             {'id': 'CNTY', 'name': 'County'},
             {'id': 'HYD_ACC', 'name': 'Hydrologic Accounting Unit'},
             {'id': 'HYD_CAT', 'name': 'Hydrologic Cataloging Unit'},
             {'id': 'HYD_REG', 'name': 'Hydrologic Region'},
             {'id': 'HYD_SUB', 'name': 'Hydrologic Subregion'},
             {'id': 'ST', 'name': 'State'},
             {'id': 'US_TERR', 'name': 'US Territory'},
             {'id': 'ZIP', 'name': 'Zip Code'}]}


We want to look at a city, so for the `locationcategoryid` filter, `CITY` is the proper value.

In [20]:
response = make_request("locations", payload={"locationcategoryid": "CITY"})

In [21]:
response.status_code

200

In [22]:
response.json()["results"][:2]

[{'mindate': '1983-01-01',
  'maxdate': '2024-03-18',
  'name': 'Abu Dhabi, AE',
  'datacoverage': 0.9978,
  'id': 'CITY:AE000001'},
 {'mindate': '1944-03-01',
  'maxdate': '2024-03-18',
  'name': 'Ajman, AE',
  'datacoverage': 1,
  'id': 'CITY:AE000002'}]

In [23]:
response.json()["metadata"]

{'resultset': {'offset': 1, 'count': 1989, 'limit': 25}}
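
With `count` at 1989 and a `limit` of 25, only the first page of cities has come back. A rough sketch of the offsets needed to walk every page; `page_offsets` is a hypothetical helper, and the 1-based offsets are an assumption read off the `'offset': 1` in the metadata above:

```python
def page_offsets(count, limit):
    """Return the 1-based offset of every page needed to cover `count` rows."""
    return list(range(1, count + 1, limit))

offsets = page_offsets(count=1989, limit=25)
print(len(offsets))   # 80
print(offsets[:3])    # [1, 26, 51]
print(offsets[-1])    # 1976
```

Each offset would be sent as the `offset` parameter alongside `limit`. With the `limit` of 1000 used later in this notebook, the same list would need only two requests.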

The result contains 1989 rows, as shown by `count`, of which only the first 25 were returned. From here we want to find a particular city, say Lagos or New York City. The naive way would be to page through the entire list and check each name with a simple for loop. However, this is not the most efficient way to go about it. Since the API can sort the results by name, a better approach is to implement a binary search, which needs at most about 11 requests for 1989 rows.

In [24]:
def get_item(endpoint:str, target:str, params: dict, 
             start:int=1, end:int|None=None)->dict:
    """Grab the JSON payload using binary search
    Parameters:
        - target: The item to look for.
        - params: Dictionary specifying what item `target` is.
        - endpoint: Where to look for the item.
        - start: The position to start at. 
        - end: The last position of the items.
    Returns: Dictionary of information of the item if found
        otherwise an empty dictionary
    """
    # Lowercase the target so the comparison is case-insensitive
    target = target.lower()
    
    # Stop when the search window is empty
    if end is not None and start > end:
        return {}
    
    # Define the offset; on the first call `end` is unknown, so probe `start`
    mid = (start + end)//2 if end is not None else start
    
    # Define the payload to be sent with each request
    payload = {"datasetid": "GHCND",
              "sortfield": "name",
              "offset": mid, # we change this each time
              "limit": 1 # we only want one value
              }
    
    # Make the request, adding additional parameters from `params`
    response = make_request(endpoint, {**payload, **params})
    if not response.ok:
        print(f"Response not ok. {response.status_code}")
        return {}
    
    payload = response.json()
    
    # Nothing at this offset, so the item cannot be found
    if not payload.get("results"):
        return {}
    
    # Grab the end index from the response metadata the first time
    if end is None:
        end = payload["metadata"]["resultset"]["count"]
    
    # Grab the lowercase version of the current name
    current_name = payload["results"][0]["name"].lower()
    
    # If what we are searching for is in the current name, we found the item
    if target in current_name:
        return payload["results"][0]
    if start >= end: # We have reached the end of the list
        return {}
    if target < current_name:
        # Search to the left
        return get_item(endpoint, target, params, start, mid-1)
    # Otherwise search to the right
    return get_item(endpoint, target, params, mid+1, end)


In [25]:
nyc = get_item("locations", 'New York', params={"locationcategoryid": "CITY"})

Alternatively, we can drill down to the ID of the station that is collecting the data.

In [26]:
nyc_central_pack_station = get_item("stations", 
                            'NY City Central Park', 
                            {'locationid': nyc['id']},)

In [27]:
nyc_central_pack_station

{'elevation': 42.7,
 'mindate': '1869-01-01',
 'maxdate': '2024-03-18',
 'latitude': 40.77898,
 'name': 'NY CITY CENTRAL PARK, NY US',
 'datacoverage': 1,
 'id': 'GHCND:USW00094728',
 'elevationUnit': 'METERS',
 'longitude': -73.96925}

Now, let's request NYC Central Park temperature data in Celsius for March 2023, as recorded at that station.
For this, we will use the `data` endpoint and provide all the parameters we
picked up throughout our exploration of the API:

In [32]:
start_date = dt.datetime(2023,3,1)
end_date = start_date + dt.timedelta(days=30)

# Convert datetime to string
start_date = start_date.strftime('%Y-%m-%d')
end_date = end_date.strftime('%Y-%m-%d')


In [33]:
def get_temp_measure(station, city):
    """Get the temperature measurements at a particular station in a city"""
    params = {"datasetid": "GHCND",
                 "stationid":station["id"],
                 "locationid": city["id"],
                 "startdate": start_date,
                 "enddate": end_date,
                 "datatypeid": ["TAVG", "TMAX", "TMIN", "TOBS"], 
                 "units": "metric", 
                 "limit": 1000}
    
    response = make_request("data", payload=params)
    return response
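
Here `datatypeid` is passed as a list. `requests` encodes a list value as a repeated query parameter, which can be illustrated with the standard library's `urlencode` and `doseq=True` (a sketch of the encoding only, with a shortened parameter set and no request made):

```python
from urllib.parse import urlencode

params = {
    "datasetid": "GHCND",
    "datatypeid": ["TAVG", "TMAX", "TMIN", "TOBS"],
    "units": "metric",
}

# doseq=True expands each list element into its own key=value pair
query = urlencode(params, doseq=True)
print(query)
# datasetid=GHCND&datatypeid=TAVG&datatypeid=TMAX&datatypeid=TMIN&datatypeid=TOBS&units=metric
```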


In [34]:
# Get the temperature data for NYC at the Central Park station
response = get_temp_measure(nyc_central_pack_station, nyc)

Lastly, we will create a DataFrame object; since the results portion of the JSON
payload is a list of dictionaries, we can pass it directly to pd.DataFrame():
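
Indexing the payload with `["results"]` raises a `KeyError` if that key is absent. A defensive sketch, assuming (this is not confirmed by the notebook) that a query with no matching rows can come back as an empty JSON object; `extract_results` is a hypothetical helper:

```python
def extract_results(payload):
    """Return the list of result rows from a decoded payload, or an empty list."""
    results = payload.get("results", [])
    if not isinstance(results, list):
        raise TypeError("expected 'results' to be a list of dictionaries")
    return results

sample = {
    "metadata": {"resultset": {"offset": 1, "count": 1, "limit": 25}},
    "results": [{"datatype": "TMAX", "value": 6.7}],
}
print(extract_results(sample))  # [{'datatype': 'TMAX', 'value': 6.7}]
print(extract_results({}))      # []
```

The returned list can be passed to `pd.DataFrame()` in the same way; an empty list simply yields an empty DataFrame.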

In [35]:
df = pd.DataFrame(response.json()["results"])
df.head()

| | date | datatype | station | attributes | value |
|---|---|---|---|---|---|
| 0 | 2023-03-01T00:00:00 | TMAX | GHCND:USW00094728 | ,,W,2400 | 6.7 |
| 1 | 2023-03-01T00:00:00 | TMIN | GHCND:USW00094728 | ,,W,2400 | 0.6 |
| 2 | 2023-03-02T00:00:00 | TMAX | GHCND:USW00094728 | ,,W,2400 | 12.2 |
| 3 | 2023-03-02T00:00:00 | TMIN | GHCND:USW00094728 | ,,W,2400 | 3.9 |
| 4 | 2023-03-03T00:00:00 | TMAX | GHCND:USW00094728 | ,,W,2400 | 8.3 |


In [36]:
df["datatype"].unique()

array(['TMAX', 'TMIN'], dtype=object)

We asked for `TAVG`, `TMAX`, `TMIN`, and `TOBS`, but notice that we got back only `TMAX` and `TMIN`. This is because
the Central Park station isn't recording average temperature, despite being listed in the
API as offering it; real-world data is dirty. `TOBS` is missing from the response as well.
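
A quick way to spot such gaps is to compare the requested data types with the ones that came back. A minimal sketch, with the two lists copied by hand from the cells above:

```python
requested = ["TAVG", "TMAX", "TMIN", "TOBS"]
returned = ["TMAX", "TMIN"]  # the values of df["datatype"].unique() shown above

# Set difference leaves the data types the station did not report
missing = sorted(set(requested) - set(returned))
print(missing)  # ['TAVG', 'TOBS']
```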

Let's use a different station instead of Central Park. In this instance we will use **LaGuardia**.

In [39]:
la_guadia_station = get_item("stations", 'LaGuardia', {'locationid': nyc['id']},)

In [40]:
la_guadia_station

{'elevation': 3,
 'mindate': '1939-10-07',
 'maxdate': '2024-03-19',
 'latitude': 40.77945,
 'name': 'LAGUARDIA AIRPORT, NY US',
 'datacoverage': 1,
 'id': 'GHCND:USW00014732',
 'elevationUnit': 'METERS',
 'longitude': -73.88027}

In [41]:
# Get the temperature data for NYC at the LaGuardia station
response = get_temp_measure(la_guadia_station, nyc)

In [42]:
# Store the result in a DataFrame
df = pd.DataFrame(response.json()["results"])
df.head()

| | date | datatype | station | attributes | value |
|---|---|---|---|---|---|
| 0 | 2023-03-01T00:00:00 | TAVG | GHCND:USW00014732 | H,,S, | 3.9 |
| 1 | 2023-03-01T00:00:00 | TMAX | GHCND:USW00014732 | ,,W,2400 | 6.7 |
| 2 | 2023-03-01T00:00:00 | TMIN | GHCND:USW00014732 | ,,W,2400 | 0.6 |
| 3 | 2023-03-02T00:00:00 | TAVG | GHCND:USW00014732 | H,,S, | 6.3 |
| 4 | 2023-03-02T00:00:00 | TMAX | GHCND:USW00014732 | ,,W,2400 | 11.7 |


In [43]:
df["datatype"].unique()

array(['TAVG', 'TMAX', 'TMIN'], dtype=object)

In [44]:
# Let's save the result to a CSV file
df.to_csv("../data/temp_nyc_laguadia_station_2023_02_20.csv", index=False)