## Exploring an API to find and collect temperature data

Let's explore the weather API that's provided by the [NCDC](https://www.ncdc.noaa.gov/cdo-web/webservices/v2#gettingStarted).

NCDC offers web services that provide access to current data. This API is for developers looking to create their own scripts that use the CDO (Climate Data Online) database of weather and climate data.


In [8]:
# Import the necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import os
import requests

import datetime as dt

from dotenv import load_dotenv

In [52]:
# Load the secret keys
load_dotenv(".env")
token = os.environ.get("Token")
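
If the `.env` file is missing or the variable is named differently, `os.environ.get` quietly returns `None`, and later requests would likely fail with an authentication error. A minimal sketch of failing early instead; the helper name `require_env` is hypothetical and not part of `python-dotenv`:

```python
import os


def require_env(name, environ=None):
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value
```

With this in place, `token = require_env("Token")` could replace the plain `os.environ.get("Token")` call.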

In [58]:
def make_request(endpoint, payload=None):
    """
    Make a request to a specific endpoint on the weather API,
    passing headers and an optional payload.
    
    :param endpoint: API endpoint to query, e.g. "datasets"
    :type endpoint: str
    :param payload: Query parameters to send with the request
    :type payload: dict
    :returns: A response object
    :rtype: requests.Response
    """
    api = "https://www.ncdc.noaa.gov/cdo-web/api/v2"
    
    return requests.get(
        f"{api}/{endpoint}",
        headers={"token":token},
        params=payload
    )

In [59]:
# Define the params
# Note: these dates will change depending on when you run this code
end_date= dt.date.today() - dt.timedelta(days=1)
start_date = end_date - dt.timedelta(days=365)

start_date

datetime.date(2023, 3, 20)

In [130]:
# Get the list of available datasets.
# Pass parameters to filter the result;
# here we filter by `start_date`.

response = make_request("datasets", {"startdate": start_date})

In [131]:
response.status_code

200

In [132]:
response.url

'https://www.ncdc.noaa.gov/cdo-web/api/v2/datasets?startdate=2023-03-20'
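
The URL above shows that the `date` object was turned into an ISO-formatted string in the query string. The sketch below reproduces that step with the standard library's `urlencode` as a stand-in; `requests` does its own encoding internally, so this only illustrates the result, not how `requests` is implemented:

```python
import datetime as dt
from urllib.parse import urlencode

# Same parameter as in the request above, with a fixed date for reproducibility
params = {"startdate": dt.date(2023, 3, 20)}

# urlencode calls str() on each value, and str(date) is the ISO format
query = urlencode(params)
print(query)  # startdate=2023-03-20
```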

In [134]:
payload = response.json()

We got 11 rows back; let's see what fields are in the `results` portion of the payload.

In [135]:
payload["results"][0].keys()

dict_keys(['uid', 'mindate', 'maxdate', 'name', 'datacoverage', 'id'])

We want to look at the `id` and `name` of each dataset.

In [139]:
[ (result["id"], result["name"]) for result in payload["results"]]

[('GHCND', 'Daily Summaries'),
 ('GSOM', 'Global Summary of the Month'),
 ('GSOY', 'Global Summary of the Year'),
 ('NEXRAD2', 'Weather Radar (Level II)'),
 ('NEXRAD3', 'Weather Radar (Level III)'),
 ('NORMAL_ANN', 'Normals Annual/Seasonal'),
 ('NORMAL_DLY', 'Normals Daily'),
 ('NORMAL_HLY', 'Normals Hourly'),
 ('NORMAL_MLY', 'Normals Monthly'),
 ('PRECIP_15', 'Precipitation 15 Minute'),
 ('PRECIP_HLY', 'Precipitation Hourly')]

The first entry in the result, `('GHCND', 'Daily Summaries')`, is what we are looking for.
Now that we have a value for `datasetid` (`GHCND`), we proceed to identify one for `datacategoryid`,
which we need to request temperature data. We do so using the `datacategories` endpoint.
Here, we can print the JSON payload since it isn't that large (only nine entries):

In [142]:
response = make_request("datacategories", payload={"datasetid": "GHCND"})
payload = response.json()
payload

{'metadata': {'resultset': {'offset': 1, 'count': 9, 'limit': 25}},
 'results': [{'name': 'Evaporation', 'id': 'EVAP'},
  {'name': 'Land', 'id': 'LAND'},
  {'name': 'Precipitation', 'id': 'PRCP'},
  {'name': 'Sky cover & clouds', 'id': 'SKY'},
  {'name': 'Sunshine', 'id': 'SUN'},
  {'name': 'Air Temperature', 'id': 'TEMP'},
  {'name': 'Water', 'id': 'WATER'},
  {'name': 'Wind', 'id': 'WIND'},
  {'name': 'Weather Type', 'id': 'WXTYPE'}]}

Based on the previous result, we know that we want a value of `TEMP` for
datacategoryid. Next, we use this to identify the data types we want by using the
datatypes endpoint. We will use a list comprehension once again to only print the
`names` and IDs; this is still a rather large list, so the output has been abbreviated:

In [162]:
response = make_request("datatypes", payload={"datacategoryid": "TEMP", "limit": 70})
datatypes = response.json()
[(datatype["id"], datatype["name"]) for datatype in datatypes["results"]]

[('CDSD', 'Cooling Degree Days Season to Date'),
 ('DATN',
  'Number of days included in the multiday minimum temperature (MDTN)'),
 ('DATX',
  'Number of days included in the multiday maximum temperature (MDTX)'),
 ('DLY-DUTR-NORMAL', 'Long-term averages of daily diurnal temperature range'),
 ('DLY-DUTR-STDDEV',
  'Long-term standard deviations of daily diurnal temperature range'),
 ('DLY-TAVG-NORMAL', 'Long-term averages of daily average temperature'),
 ('DLY-TAVG-STDDEV',
  'Long-term standard deviations of daily average temperature'),
 ('DLY-TMAX-NORMAL', 'Long-term averages of daily maximum temperature'),
 ('DLY-TMAX-STDDEV',
  'Long-term standard deviations of daily maximum temperature'),
 ('DLY-TMIN-NORMAL', 'Long-term averages of daily minimum temperature'),
 ('DLY-TMIN-STDDEV',
  'Long-term standard deviations of daily minimum temperature'),
 ('EMNT', 'Extreme minimum temperature for the period.'),
 ('EMXT', 'Extreme maximum temperature for the period.'),
 ('HDSD', 'Heating De

In [184]:
response = make_request("locationcategories", payload={"datasetid":"GHCND"})
response.status_code

200

In [185]:
import pprint as pp

pp.pprint(response.json())


{'metadata': {'resultset': {'count': 12, 'limit': 25, 'offset': 1}},
 'results': [{'id': 'CITY', 'name': 'City'},
             {'id': 'CLIM_DIV', 'name': 'Climate Division'},
             {'id': 'CLIM_REG', 'name': 'Climate Region'},
             {'id': 'CNTRY', 'name': 'Country'},
             {'id': 'CNTY', 'name': 'County'},
             {'id': 'HYD_ACC', 'name': 'Hydrologic Accounting Unit'},
             {'id': 'HYD_CAT', 'name': 'Hydrologic Cataloging Unit'},
             {'id': 'HYD_REG', 'name': 'Hydrologic Region'},
             {'id': 'HYD_SUB', 'name': 'Hydrologic Subregion'},
             {'id': 'ST', 'name': 'State'},
             {'id': 'US_TERR', 'name': 'US Territory'},
             {'id': 'ZIP', 'name': 'Zip Code'}]}
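
The `metadata` block in each payload reports `offset`, `count`, and `limit`. When `count` is larger than `limit`, the remaining rows have to be fetched page by page. A minimal sketch of computing the offsets for those pages, assuming offsets are 1-based and count rows (as the metadata above suggests); `page_offsets` is a hypothetical helper, not part of the API:

```python
def page_offsets(count, limit, first=1):
    """Return the offset of each page needed to cover `count` rows.

    Assumes row-based offsets starting at `first` (hypothetical helper).
    """
    return list(range(first, first + count, limit))


print(page_offsets(12, 25))  # [1]
print(page_offsets(60, 25))  # [1, 26, 51]
```

Each offset could then be passed as the `offset` parameter in the payload given to `make_request`.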


I want to look at Lagos, so for the `locationcategoryid` filter, `CITY` is the proper value.

In [175]:
response = make_request("locations", payload={"locationcategoryid": "CITY"})

In [176]:
response.status_code

200

In [183]:
response.json()["results"][:2]

[{'mindate': '1983-01-01',
  'maxdate': '2024-03-17',
  'name': 'Abu Dhabi, AE',
  'datacoverage': 0.9978,
  'id': 'CITY:AE000001'},
 {'mindate': '1944-03-01',
  'maxdate': '2024-03-17',
  'name': 'Ajman, AE',
  'datacoverage': 1,
  'id': 'CITY:AE000002'}]

The result contains 1989 rows, and from these we need to find a particular city, say Lagos or New York City. The naive way would be to search the entire list with a simple for loop. However, this is not the most efficient way to go about it. Because the results can be sorted by name, a better approach is to implement a binary search.

In [285]:
def get_item(endpoint: str, target: str, params: dict,
             start: int = 1, end: int | None = None) -> dict:
    """Grab the JSON payload for an item using binary search.

    Parameters:
        - endpoint: Where to look for the item.
        - target: The item to look for.
        - params: Dictionary specifying what item `target` is.
        - start: The position to start at.
        - end: The last position of the items.
    Returns: Dictionary of information on the item if found,
        otherwise an empty dictionary.
    """
    # Lowercase the name so the search is case-insensitive
    target = target.lower()

    # The search window is empty, so the item is not in the list
    if end is not None and start > end:
        return {}

    # Define the offset; `end` is unknown on the first call, so probe `start`
    mid = (start + (end if end is not None else start)) // 2

    # Define the payload to be sent with each request
    payload = {"datasetid": "GHCND",
               "sortfield": "name",
               "offset": mid,  # we change this each time
               "limit": 1}     # we only want one value

    # Make the request, adding additional parameters from `params`
    response = make_request(endpoint, {**payload, **params})
    if not response.ok:
        print(f"Response not ok. {response.status_code}")
        return {}

    payload = response.json()
    results = payload.get("results")
    if not results:  # nothing was returned at this offset
        return {}

    # Grab the end index from the response metadata on the first call
    if end is None:
        end = payload["metadata"]["resultset"]["count"]

    # Grab the lowercase version of the current name
    current_name = results[0]["name"].lower()

    # If what we are searching for is in the current name, we found the item
    if target in current_name:
        return results[0]
    if start >= end:  # the window has shrunk to one item without a match
        return {}
    if target < current_name:
        # Search to the left
        return get_item(endpoint, target, params, start, mid - 1)
    # Search to the right
    return get_item(endpoint, target, params, mid + 1, end)

In [286]:
city = get_item("locations", 'New York', params={"locationcategoryid": "CITY"})
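
The same idea can be tried offline on an in-memory list, without calling the API. The sketch below uses the standard library's `bisect` module on a small, made-up list of names; `find_by_prefix` is a hypothetical helper. Unlike `get_item`, it matches on a prefix rather than a substring, because ordering comparisons only narrow the search reliably when the target is at the start of the name:

```python
from bisect import bisect_left


def find_by_prefix(sorted_names, target):
    """Return the first name that starts with `target`, or None.

    `sorted_names` must already be sorted case-insensitively.
    """
    lowered = [name.lower() for name in sorted_names]
    target = target.lower()
    # bisect_left halves the search window on every step
    idx = bisect_left(lowered, target)
    if idx < len(lowered) and lowered[idx].startswith(target):
        return sorted_names[idx]
    return None


# Illustrative sample only, not real API output
names = ["Abu Dhabi, AE", "Ajman, AE", "Lagos", "New York", "Zurich"]

print(find_by_prefix(names, "new york"))  # New York
print(find_by_prefix(names, "paris"))     # None
```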

Alternatively, we can drill down to the ID of the station that is collecting the data.

In [287]:
station = get_item("stations", 'NY City Central Park', {'locationid': city['id']})

In [288]:
station

{'elevation': 42.7,
 'mindate': '1869-01-01',
 'maxdate': '2024-03-18',
 'latitude': 40.77898,
 'name': 'NY CITY CENTRAL PARK, NY US',
 'datacoverage': 1,
 'id': 'GHCND:USW00094728',
 'elevationUnit': 'METERS',
 'longitude': -73.96925}

Now, let's request temperature data in Celsius for the 30 days starting February 20, 2023,
recorded at the Central Park station.
For this, we will use the `data` endpoint and provide all the parameters we
picked up throughout our exploration of the API:

In [320]:
start_date = dt.datetime(2023,2,20)
end_date = start_date + dt.timedelta(days=30)

# Convert datetime to string
start_date = start_date.strftime('%Y-%m-%d')
end_date = end_date.strftime('%Y-%m-%d')


In [321]:
params = {"datasetid": "GHCND",
         "stationid":station["id"],
         "locationid": city["id"],
         "startdate": start_date,
         "enddate": end_date,
         "datatypeid": ["TAVG", "TMAX", "TMIN", "TOBS"], 
         "units": "metric", 
         "limit": 1000}
response = make_request("data", payload=params)
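
Note that `datatypeid` is a list. `requests` sends a list value as a repeated query key, one per item. The sketch below shows the resulting query string using the standard library's `urlencode` with `doseq=True` as a stand-in, on a shortened version of the parameters above:

```python
from urllib.parse import urlencode

# Shortened version of the params above, for illustration
params = {"datasetid": "GHCND", "datatypeid": ["TAVG", "TMAX"], "limit": 1000}

# doseq=True expands each list item into its own key=value pair
query = urlencode(params, doseq=True)
print(query)
# datasetid=GHCND&datatypeid=TAVG&datatypeid=TMAX&limit=1000
```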

Lastly, we will create a DataFrame object; since the results portion of the JSON
payload is a list of dictionaries, we can pass it directly to pd.DataFrame():

In [325]:
df= pd.DataFrame(response.json()["results"])
df.head()

|   | date | datatype | station | attributes | value |
|---|------|----------|---------|------------|-------|
| 0 | 2023-02-20T00:00:00 | TMAX | GHCND:USW00094728 | ,,W,2400 | 16.1 |
| 1 | 2023-02-20T00:00:00 | TMIN | GHCND:USW00094728 | ,,W,2400 | 8.3 |
| 2 | 2023-02-21T00:00:00 | TMAX | GHCND:USW00094728 | ,,W,2400 | 10.0 |
| 3 | 2023-02-21T00:00:00 | TMIN | GHCND:USW00094728 | ,,W,2400 | 2.8 |
| 4 | 2023-02-22T00:00:00 | TMAX | GHCND:USW00094728 | ,,W,2400 | 6.7 |


In [326]:
df["datatype"].unique()

array(['TMAX', 'TMIN'], dtype=object)
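
A quick way to compare what was requested against what came back is a set difference. A minimal sketch using the values from the cells above, hard-coded here so it runs on its own:

```python
# Data types passed in the `datatypeid` parameter
requested = {"TAVG", "TMAX", "TMIN", "TOBS"}

# Data types present in the response (from df["datatype"].unique())
returned = {"TMAX", "TMIN"}

# Anything requested but not returned
missing = sorted(requested - returned)
print(missing)  # ['TAVG', 'TOBS']
```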

We asked for `TAVG`, `TMAX`, `TMIN`, and `TOBS`, but notice that we didn't get `TAVG` or `TOBS`. This is because
the Central Park station isn't recording them, despite being listed in the
API as offering them. Real-world data is dirty: