# Data Wrangling with Pandas

In the previous [session](./0-Lessons_on_pandas_datastructures.ipynb), I meaned about the main pandas data structures, how to create DataFrame and object sample data and API and various ways to inspect, summarize, ,filter, select, and work with DataFrame objects. Now, let venture into the world of data wrangling

There are three common tasks involved in the data wrangling process
* Data cleaning
* Data transformation
* Data enrichment


## Datasets

I will use some collected data set set for initial exploration of this topic


**1. National Centers for Environmental Information (NCEI)** API for climate data and walking through the process of collecting temperature data from it using the **requests** library.
The data can be found at this [url](https://www.ncdc.noaa.gov/cdo-web/webservices/).

_The NCEI is part of the National Oceanic and Atmospheric Administration (NOAA). As indicated by the URL for the API, this resource was created when
the NCEI was called the NCDC. Should the URL for this resource change in
the future, search for NCEI weather API to find the updated one
To use the NCEI API, you will have to request a token by filling out this [form](https://www.ncdc.noaa.gov/cdo-web/token.) with your email address_.

**2. Global Historical Climatology Network – Daily (GHCND)** The data can be found at this [url](https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf)

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import datetime as dt

import dotenv
import os



import requests

In [3]:
wide_df = pd.read_csv("data/wide_data.csv", parse_dates=["date"])

long_df = pd.read_csv("data/long_data.csv",
                      usecols=["date", "datatype", "value"],
                      parse_dates=["date"])

In [4]:
wide_df.head()

Unnamed: 0,date,TMAX,TMIN,TOBS
0,2018-10-01,21.1,8.9,13.9
1,2018-10-02,23.9,13.9,17.2
2,2018-10-03,25.0,15.6,16.1
3,2018-10-04,22.8,11.7,11.7
4,2018-10-05,23.3,11.7,18.9


In [5]:
long_df.head()

Unnamed: 0,datatype,date,value
0,TMAX,2018-10-01,21.1
1,TMIN,2018-10-01,8.9
2,TOBS,2018-10-01,13.9
3,TMAX,2018-10-02,23.9
4,TMIN,2018-10-02,13.9


In [6]:
wide_df.describe()

Unnamed: 0,date,TMAX,TMIN,TOBS
count,31,31.0,31.0,31.0
mean,2018-10-16 00:00:00,16.829032,7.56129,10.022581
min,2018-10-01 00:00:00,7.8,-1.1,-1.1
25%,2018-10-08 12:00:00,12.75,2.5,5.55
50%,2018-10-16 00:00:00,16.1,6.7,8.3
75%,2018-10-23 12:00:00,21.95,13.6,16.1
max,2018-10-31 00:00:00,26.7,17.8,21.7
std,,5.714962,6.513252,6.59655


In [7]:
long_df.describe()

Unnamed: 0,date,value
count,93,93.0
mean,2018-10-16 00:00:00,11.470968
min,2018-10-01 00:00:00,-1.1
25%,2018-10-08 00:00:00,6.7
50%,2018-10-16 00:00:00,11.7
75%,2018-10-24 00:00:00,17.2
max,2018-10-31 00:00:00,26.7
std,,7.362354


## Exploring an API to find and collect temperature data

Let's explore the weather API that's provided by the NCEI

In [16]:


def make_request(endpoint, payload=None):
    """
    Make a request to a specific endpoint on the
    weather API passing headers and optional payload
    
    Parameters
    ==========
    endpoint: str
        The endpoint of the API you want to make a GET request to.
    payload: dict
        A dictionary of data to pass along with the request
        
    Returns
    =======
        A response object
    """
    api = "https://www.ncdc.noaa.gov/cdo-web/api/v2"
    
    return requests.get(
        f"{api}/{endpoint}",
        headers={"token":TOKEN},
        params=payload
    )

In [17]:
endtime = dt.date.today() - dt.timedelta(days=1)
startdate = endtime - dt.timedelta(days=365)
startdate

datetime.date(2023, 3, 20)

In [18]:
response = make_request("datasets", payload={"startdate": startdate})

In [19]:
response.status_code

200

In [15]:
payload = response.json()
payload.keys()

dict_keys(['metadata', 'results'])

In [56]:
payload["metadata"]["resultset"]["count"]

11

We got back 11 rows, so lets see what field are in the results portion of the JSON payload. The `results` key contaisn a list of dictionaries

In [37]:
payload["results"][0].keys()

dict_keys(['uid', 'mindate', 'maxdate', 'name', 'datacoverage', 'id'])

I want to look at the IDs and names of the datasets

In [31]:
[(data["id"], data["name"]) for data in payload["results"]]

[('GHCND', 'Daily Summaries'),
 ('GSOM', 'Global Summary of the Month'),
 ('GSOY', 'Global Summary of the Year'),
 ('NEXRAD2', 'Weather Radar (Level II)'),
 ('NEXRAD3', 'Weather Radar (Level III)'),
 ('NORMAL_ANN', 'Normals Annual/Seasonal'),
 ('NORMAL_DLY', 'Normals Daily'),
 ('NORMAL_HLY', 'Normals Hourly'),
 ('NORMAL_MLY', 'Normals Monthly'),
 ('PRECIP_15', 'Precipitation 15 Minute'),
 ('PRECIP_HLY', 'Precipitation Hourly')]

The first entry in the result is what we are looking for. 
Now that we have a value for `datasetid (GHCND)`, we proceed to identify one for `datacategoryid`, which we need to request temperature data. We do so using the datacategories endpoint. Here, we can print the JSON payload since it isn't that large (only nine entries):

In [39]:
response = make_request("datacategories", payload={"datasetid": "GHCND"})

In [40]:
response.ok

True

In [42]:
response.json()["results"]

[{'name': 'Evaporation', 'id': 'EVAP'},
 {'name': 'Land', 'id': 'LAND'},
 {'name': 'Precipitation', 'id': 'PRCP'},
 {'name': 'Sky cover & clouds', 'id': 'SKY'},
 {'name': 'Sunshine', 'id': 'SUN'},
 {'name': 'Air Temperature', 'id': 'TEMP'},
 {'name': 'Water', 'id': 'WATER'},
 {'name': 'Wind', 'id': 'WIND'},
 {'name': 'Weather Type', 'id': 'WXTYPE'}]

Based on the previous result, we know that we want a value of `TEMP` for
`datacategoryid`. Next, we use this to identify the data types we want by using the
datatypes endpoint. We will use a list comprehension once again to only print the
names and IDs; this is still a rather large list, so the output has been abbreviated:

In [43]:
response = make_request("datatypes", 
                        payload={"datacategoryid": "TEMP", "limit":100}
                       )

In [52]:
[(data["id"], data["name"]) for data in response.json()["results"]]

[('CDSD', 'Cooling Degree Days Season to Date'),
 ('DATN',
  'Number of days included in the multiday minimum temperature (MDTN)'),
 ('DATX',
  'Number of days included in the multiday maximum temperature (MDTX)'),
 ('DLY-DUTR-NORMAL', 'Long-term averages of daily diurnal temperature range'),
 ('DLY-DUTR-STDDEV',
  'Long-term standard deviations of daily diurnal temperature range'),
 ('DLY-TAVG-NORMAL', 'Long-term averages of daily average temperature'),
 ('DLY-TAVG-STDDEV',
  'Long-term standard deviations of daily average temperature'),
 ('DLY-TMAX-NORMAL', 'Long-term averages of daily maximum temperature'),
 ('DLY-TMAX-STDDEV',
  'Long-term standard deviations of daily maximum temperature'),
 ('DLY-TMIN-NORMAL', 'Long-term averages of daily minimum temperature'),
 ('DLY-TMIN-STDDEV',
  'Long-term standard deviations of daily minimum temperature'),
 ('EMNT', 'Extreme minimum temperature for the period.'),
 ('EMXT', 'Extreme maximum temperature for the period.'),
 ('HDSD', 'Heating De

I am looking for **TAVG, TMAX, TMIN** data types.

Now that we have everything we need to request temperature data foe all locations, I will narrow it down to a spedific location.

In [53]:
response = make_request("locationcategories", payload={"datasetid": "GHCND"})

In [56]:
import pprint as pp
pp.pprint(response.json())

{'metadata': {'resultset': {'count': 12, 'limit': 25, 'offset': 1}},
 'results': [{'id': 'CITY', 'name': 'City'},
             {'id': 'CLIM_DIV', 'name': 'Climate Division'},
             {'id': 'CLIM_REG', 'name': 'Climate Region'},
             {'id': 'CNTRY', 'name': 'Country'},
             {'id': 'CNTY', 'name': 'County'},
             {'id': 'HYD_ACC', 'name': 'Hydrologic Accounting Unit'},
             {'id': 'HYD_CAT', 'name': 'Hydrologic Cataloging Unit'},
             {'id': 'HYD_REG', 'name': 'Hydrologic Region'},
             {'id': 'HYD_SUB', 'name': 'Hydrologic Subregion'},
             {'id': 'ST', 'name': 'State'},
             {'id': 'US_TERR', 'name': 'US Territory'},
             {'id': 'ZIP', 'name': 'Zip Code'}]}


I want to look at Lagos City, so for the location `locationcategoryid` filter, `CITY` is the proper value.

In [94]:
def get_item(name:str, what:dict, endpoint:str, start=1, end: Optional[int]=None):
    """Compute the payload using binary search
    
    Parameters
    ==========
    name: The item to look for
    what: Dictionary specifying what item `name` is.
    endpoint: Where to look for the item.
    start: The position to start at.
    end: The last position of the items. Used to find
    the midpoint
    
    Returns
    =======
    Dictionary of the information for the item if found
        otherwise an empty dictionary
        
    """
    name = name.lower()
    
    # find the midpoint to cut the data into half each time
    mid = (start + (end or 1))//2
    
    # Define the payload we will send with each request
    payload = {
        "datasetid": "GHCND",
        "sortedfield": "name",
        "offset": mid,
        "limit": 1 # We only want one value back
    }
    
    #make request adding additonal filters from `what`
    response = make_request(endpoint, {**payload, **what})
    
    if response.ok:
        payload = response.json()
        print(payload)
        
        # Grab the end index from the response
        # metadata the first time through
        end = end or payload["metadata"]["resultset"]["count"]
        # grab the lowercase version of the current name
        current_name = payload["results"][0]["name"].lower()
        
        # if what we are searching for is in the current name,
        # we have found out item
        if name in current_name:
            return payload["results"][0]

        if start >= end:
            return {}
        elif name < current_name: # Search to the left
            return get_item(name, what, endpoint, start, mid-1)
        elif name > current_name: # search to the right
            return get_item(name, what, endpoint, mid+1, end)
    else:
        print(f"Reponse not OK, status {response.status_code}")
                

In [96]:
nyc = get_item("Texas", {"locationcategoryid": "CITY"}, "locations")

{'metadata': {'resultset': {'offset': 1, 'count': 1983, 'limit': 1}}, 'results': [{'mindate': '1983-01-02', 'maxdate': '2024-03-10', 'name': 'Abu Dhabi, AE', 'datacoverage': 0.9978, 'id': 'CITY:AE000001'}]}
{'metadata': {'resultset': {'offset': 992, 'count': 1983, 'limit': 1}}, 'results': [{'mindate': '1963-01-13', 'maxdate': '2024-03-10', 'name': 'Kirsehir, TU', 'datacoverage': 0.5009, 'id': 'CITY:TU000034'}]}
{'metadata': {'resultset': {'offset': 1488, 'count': 1983, 'limit': 1}}, 'results': [{'mindate': '1891-10-01', 'maxdate': '2018-05-31', 'name': 'Yazoo City, MS US', 'datacoverage': 1, 'id': 'CITY:US280020'}]}
{'metadata': {'resultset': {'offset': 1240, 'count': 1983, 'limit': 1}}, 'results': [{'mindate': '1891-11-05', 'maxdate': '2024-03-10', 'name': 'Albany, GA US', 'datacoverage': 1, 'id': 'CITY:US130001'}]}
{'metadata': {'resultset': {'offset': 1364, 'count': 1983, 'limit': 1}}, 'results': [{'mindate': '1896-03-01', 'maxdate': '2024-03-11', 'name': 'Elizabethtown, KY US', 'da

KeyboardInterrupt: 

In [92]:
nyc

{'mindate': '1983-01-02',
 'maxdate': '2024-03-10',
 'name': 'Abu Dhabi, AE',
 'datacoverage': 0.9978,
 'id': 'CITY:AE000001'}

In [93]:
"James" in "James, In"

True