# Climate Data Cleaning Notebook

Annual climate data for the USA was downloaded from https://www.ncdc.noaa.gov/cag/national/time-series. This website has 10 climate parameters:
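- `Cooling Degree Days`: a measure of how far, and for how many days, the average daily temperature was above a base temperature of 65°F (a proxy for air-conditioning demand)
- `Heating Degree Days`: a measure of how far, and for how many days, the average daily temperature was below a base temperature of 65°F (a proxy for heating demand)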
- `Precipitation`: total rainfall in inches, averaged across all 50 states
- `Palmer Drought Severity Index (PDSI)`: a measure of relative dryness
- `Palmer Hydrological Drought Index (PHDI)`: a measure of water reserves in lakes, reservoirs, etc.
- `Palmer Modified Drought Index (PMDI)`: a measure of relative dryness calculated using the same factors as the PDSI, but with the factors weighted differently
- `Average Temperature`: average temperature across all 50 states 
- `Maximum Temperature`: average maximum temperature across all 50 states
- `Minimum Temperature`: average minimum temperature across all 50 states
- `Palmer Z-Index`: a measure of short-term drought on a monthly scale
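
To make the degree-day definitions concrete, here is a minimal sketch of how they can be computed from daily mean temperatures. The temperature values and the function names are hypothetical and only for illustration; the 65°F base is the conventional one for US degree days.

```python
# Hypothetical daily mean temperatures in degrees Fahrenheit (toy values).
daily_mean_temps_f = [50.0, 60.0, 65.0, 70.0, 80.0]

# Conventional base temperature for US degree days.
BASE_TEMP_F = 65.0


def heating_degree_days(temps, base=BASE_TEMP_F):
    """Sum the degrees by which each day's mean falls below the base."""
    return sum(max(base - t, 0.0) for t in temps)


def cooling_degree_days(temps, base=BASE_TEMP_F):
    """Sum the degrees by which each day's mean rises above the base."""
    return sum(max(t - base, 0.0) for t in temps)


hdd = heating_degree_days(daily_mean_temps_f)  # 15 + 5 = 20
cdd = cooling_degree_days(daily_mean_temps_f)  # 5 + 15 = 20
print(hdd, cdd)
```

A day exactly at the base temperature contributes to neither total, which is why these are measured in degree-days rather than counted as days.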

The csv files containing these features as a function of year are stored in the `Climate Data` folder.

This notebook will be used to import, clean, and combine the climate data from 1950 to 2020 into a single data frame. This new climate data frame will then be exported as a csv to be combined later with the corn data.

This notebook is using the standard Python 3 kernel.

I will only need pandas and os for this notebook.

In [1]:
# Import necessary tools
import pandas as pd
import os

I will now import the climate data as individual data frames. Since there are ten different climate csv files being imported as data frames, I will create a list to hold all of the data frames. This will make it easier to iterate through the individual data frames as I clean them and later combine them into a single data frame.

In [2]:
# First, I will create the list for storing the individual climate data frames
climate_list = []

# I will then read through the csv files in the "Climate Data" folder and add them as data frames to the climate_list

# Iterate through all files in the folder "Climate Data"
for filename in os.listdir("./Climate Data"):

    # import data from the iterated file
    file_data = pd.read_csv(f"./Climate Data/{filename}")
    
    # Rename the first column to "Year" and the third to "Anomaly" (a measure of deviation from the mean).
    # The second column already carries the name of the climate parameter, so it is left unchanged.
    columns = file_data.columns
    file_data.rename(columns={columns[0]: "Year", columns[2]: "Anomaly"}, inplace=True)
    
    # Add the data frame to the list of data frames
    climate_list.append(file_data)
    
    # Display the first five rows of the dataframe to ensure it was created successfully
    display(file_data.head(5))

Unnamed: 0,Year,Cooling Degree Days,Anomaly
0,Units: Fahrenheit Degree-Days,,
1,Base Period: 1901-2000,,
2,Missing: -9999,,
3,Date,Value,Anomaly
4,195007,263,-52


Unnamed: 0,Year,Heating Degree Days,Anomaly
0,Units: Fahrenheit Degree-Days,,
1,Base Period: 1901-2000,,
2,Missing: -9999,,
3,Date,Value,Anomaly
4,195007,14,5


Unnamed: 0,Year,Precipitation,Anomaly
0,Units: Inches,,
1,Base Period: 1901-2000,,
2,Missing: -99,,
3,Date,Value,Anomaly
4,195012,30.87,0.93


Unnamed: 0,Year,Palmer Drought Severity Index (PDSI),Anomaly
0,Base Period: 1901-2000,,
1,Missing: -99,,
2,Date,Value,Anomaly
3,195007,1.93,1.55
4,195107,1.65,1.27


Unnamed: 0,Year,Palmer Hydrological Drought Index (PHDI),Anomaly
0,Base Period: 1901-2000,,
1,Missing: -99,,
2,Date,Value,Anomaly
3,195007,1.93,1.51
4,195107,1.65,1.23


Unnamed: 0,Year,Palmer Modified Drought Index (PMDI),Anomaly
0,Base Period: 1901-2000,,
1,Missing: -99,,
2,Date,Value,Anomaly
3,195007,1.93,1.54
4,195107,1.65,1.26


Unnamed: 0,Year,Average Temperature,Anomaly
0,Units: Degrees Fahrenheit,,
1,Base Period: 1901-2000,,
2,Missing: -99,,
3,Date,Value,Anomaly
4,195012,51.39,-0.63


Unnamed: 0,Year,Maximum Temperature,Anomaly
0,Units: Degrees Fahrenheit,,
1,Base Period: 1901-2000,,
2,Missing: -99,,
3,Date,Value,Anomaly
4,195012,63.61,-0.39


Unnamed: 0,Year,Minimum Temperature,Anomaly
0,Units: Degrees Fahrenheit,,
1,Base Period: 1901-2000,,
2,Missing: -99,,
3,Date,Value,Anomaly
4,195012,39.17,-0.87


Unnamed: 0,Year,Palmer Z-Index,Anomaly
0,Base Period: 1901-2000,,
1,Missing: -99,,
2,Date,Value,Anomaly
3,195007,4.14,3.88
4,195107,0.80,0.54
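
One caveat with `os.listdir` is that it returns every entry in the folder, in arbitrary order, so a stray non-csv file would also be passed to `pd.read_csv`. A possible safeguard, sketched below with a hypothetical helper `load_csv_frames` and toy files in a temporary folder, is to read only `*.csv` files in sorted order.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_frames(folder):
    """Read every *.csv file in `folder`, in sorted filename order."""
    frames = []
    for csv_path in sorted(Path(folder).glob("*.csv")):
        frames.append(pd.read_csv(csv_path))
    return frames


# Toy demonstration: two csv files and one unrelated file (all hypothetical).
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    (tmp / "b_precip.csv").write_text("Date,Value\n195012,30.87\n")
    (tmp / "a_temp.csv").write_text("Date,Value\n195012,51.39\n")
    (tmp / "notes.txt").write_text("not a data file\n")
    toy_frames = load_csv_frames(tmp)

# Only the two csv files are loaded, with a_temp.csv first.
first_values = [frame["Value"].iloc[0] for frame in toy_frames]
print(len(toy_frames), first_values)
```

Sorting also makes the column order of the final data frame reproducible between runs and machines.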


It appears that each of the data frames was created successfully. However, they are still fairly dirty. They all contain:
- An unwanted column, `Anomaly`, which measures the difference between a given year's value and the 1901-2000 base-period average.
- Rows of metadata specific to the given climate measurement (such as the base period and the missing-value code).
- A `Year` column whose values have a two-digit month appended to the year.

I will first remove the `Anomaly` column, as this is the simplest change to make.

In [3]:
# I need to iterate through each data frame listed in the climate_list
for data_frame in climate_list:
    # Drop the "Anomaly" column
    data_frame.drop(columns="Anomaly", inplace=True)

# Check that the column was dropped.
# Since this is a simple change, I will only check 1 data frame for completion.
display(climate_list[1].head())

Unnamed: 0,Year,Heating Degree Days
0,Units: Fahrenheit Degree-Days,
1,Base Period: 1901-2000,
2,Missing: -9999,
3,Date,Value
4,195007,14


The `Anomaly` column was dropped successfully. Next, I need to remove the metadata. Doing this by hand would take too long, and each data frame has a different number of metadata rows. However, the metadata for each data frame ends at the row where the word `Date` appears in the `Year` column. So, I will simply find the index where the word `Date` appears and remove that row and all rows above it.

In [4]:
# drop metadata. Each table has a different amount of metadata. 
# The real data starts after the word 'Date' appears in the 'Year' column.
# first, find the index of the word 'Date', then drop all rows between that index and index 0

# Loop through all of the data frames
for data_frame in climate_list:
    
    # Find the index of where the word "Date" appears.
    ind = data_frame.index[data_frame['Year'] == 'Date'][0]
    
    # Drop the "Date" row and every metadata row above it in a single call.
    data_frame.drop(index=range(ind + 1), inplace=True)
        
    # Confirm the metadata was dropped successfully
    display(data_frame.head(1))

Unnamed: 0,Year,Cooling Degree Days
4,195007,263


Unnamed: 0,Year,Heating Degree Days
4,195007,14


Unnamed: 0,Year,Precipitation
4,195012,30.87


Unnamed: 0,Year,Palmer Drought Severity Index (PDSI)
3,195007,1.93


Unnamed: 0,Year,Palmer Hydrological Drought Index (PHDI)
3,195007,1.93


Unnamed: 0,Year,Palmer Modified Drought Index (PMDI)
3,195007,1.93


Unnamed: 0,Year,Average Temperature
4,195012,51.39


Unnamed: 0,Year,Maximum Temperature
4,195012,63.61


Unnamed: 0,Year,Minimum Temperature
4,195012,39.17


Unnamed: 0,Year,Palmer Z-Index
3,195007,4.14
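
An alternative to dropping rows after import is to locate the header line before parsing and skip everything above it. The sketch below assumes a raw file layout modeled on the metadata rows shown earlier; the file contents, the second data row, and the helper name `read_after_header` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical file contents, modeled on the metadata rows displayed above.
raw_text = """Heating Degree Days
Units: Fahrenheit Degree-Days
Base Period: 1901-2000
Missing: -9999
Date,Value,Anomaly
195007,14,5
195107,9,1
"""


def read_after_header(text, header_token="Date"):
    """Parse csv text, skipping every line above the one starting with `header_token`."""
    lines = text.splitlines()
    header_row = next(
        i for i, line in enumerate(lines) if line.startswith(header_token + ",")
    )
    # skiprows drops the metadata lines, so the "Date" line becomes the header.
    return pd.read_csv(io.StringIO(text), skiprows=header_row)


toy_df = read_after_header(raw_text)
print(toy_df.dtypes)
```

Because the metadata never enters the data frame, pandas can infer numeric dtypes for every column, and the row index already starts at 0.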


All of the metadata appears to have been removed successfully. However, the different data frames now have different indexing schemes. The indices will need to be reset to make manipulating and merging the data frames easier later.

In [5]:
# Iterate through the different data frames
for data_frame in climate_list:
    # Reset the indices
    data_frame.reset_index(drop=True, inplace=True)
    
    # Check that the indices were reset properly.
    display(data_frame.head(1))

Unnamed: 0,Year,Cooling Degree Days
0,195007,263


Unnamed: 0,Year,Heating Degree Days
0,195007,14


Unnamed: 0,Year,Precipitation
0,195012,30.87


Unnamed: 0,Year,Palmer Drought Severity Index (PDSI)
0,195007,1.93


Unnamed: 0,Year,Palmer Hydrological Drought Index (PHDI)
0,195007,1.93


Unnamed: 0,Year,Palmer Modified Drought Index (PMDI)
0,195007,1.93


Unnamed: 0,Year,Average Temperature
0,195012,51.39


Unnamed: 0,Year,Maximum Temperature
0,195012,63.61


Unnamed: 0,Year,Minimum Temperature
0,195012,39.17


Unnamed: 0,Year,Palmer Z-Index
0,195007,4.14


The indices were reset successfully. The last remaining issue is that every value in the `Year` column has a "dummy" month attached as two additional digits at the end of the year. There is no actual meaning to these months. This is simply the way the website exports these data. These dummy months will now be removed.

In [6]:
# trim the month data off the 'Year' column

# iterate through the data frames in the climate_list
for data_frame in climate_list:
    
    # Keep only the first 4 characters of each entry (the year), dropping the 2-digit month.
    # Assigning the whole column at once avoids chained indexing, which can raise a
    # SettingWithCopyWarning or fail to update the data frame at all.
    data_frame['Year'] = data_frame['Year'].str[:4]
    
    # confirm the year is now formatted properly.
    display(data_frame.head(1))

Unnamed: 0,Year,Cooling Degree Days
0,1950,263


Unnamed: 0,Year,Heating Degree Days
0,1950,14


Unnamed: 0,Year,Precipitation
0,1950,30.87


Unnamed: 0,Year,Palmer Drought Severity Index (PDSI)
0,1950,1.93


Unnamed: 0,Year,Palmer Hydrological Drought Index (PHDI)
0,1950,1.93


Unnamed: 0,Year,Palmer Modified Drought Index (PMDI)
0,1950,1.93


Unnamed: 0,Year,Average Temperature
0,1950,51.39


Unnamed: 0,Year,Maximum Temperature
0,1950,63.61


Unnamed: 0,Year,Minimum Temperature
0,1950,39.17


Unnamed: 0,Year,Palmer Z-Index
0,1950,4.14
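
Since the trimming relies on every entry being a six-character `YYYYMM` string, it may be worth validating that format first and then storing the year as an integer. The sketch below uses a toy frame with the first three `Cooling Degree Days` values; the variable names are hypothetical.

```python
import pandas as pd

# Toy frame with string entries, as they appear after the metadata is removed.
toy = pd.DataFrame(
    {
        "Year": ["195007", "195107", "195207"],
        "Cooling Degree Days": ["263", "314", "355"],
    }
)

# Confirm that every entry is exactly six digits before slicing.
all_six_digits = toy["Year"].str.fullmatch(r"\d{6}").all()

# Keep the first four characters and store the year as an integer.
toy["Year"] = toy["Year"].str[:4].astype(int)
print(all_six_digits, toy["Year"].tolist())
```

An integer year sorts and compares numerically, which can be convenient when the climate data is later combined with another table keyed by year.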


The individual data frames are almost ready. To simplify merging, I will now set the year column as the index column for each data frame.

In [7]:
# Iterate through each data frame in the climate_list
for data_frame in climate_list:
    
    # Set the Year column as the index column
    data_frame.set_index('Year', inplace=True)

    # Confirm the year was set as the index
    display(data_frame.head(1))

Unnamed: 0_level_0,Cooling Degree Days
Year,Unnamed: 1_level_1
1950,263


Unnamed: 0_level_0,Heating Degree Days
Year,Unnamed: 1_level_1
1950,14


Unnamed: 0_level_0,Precipitation
Year,Unnamed: 1_level_1
1950,30.87


Unnamed: 0_level_0,Palmer Drought Severity Index (PDSI)
Year,Unnamed: 1_level_1
1950,1.93


Unnamed: 0_level_0,Palmer Hydrological Drought Index (PHDI)
Year,Unnamed: 1_level_1
1950,1.93


Unnamed: 0_level_0,Palmer Modified Drought Index (PMDI)
Year,Unnamed: 1_level_1
1950,1.93


Unnamed: 0_level_0,Average Temperature
Year,Unnamed: 1_level_1
1950,51.39


Unnamed: 0_level_0,Maximum Temperature
Year,Unnamed: 1_level_1
1950,63.61


Unnamed: 0_level_0,Minimum Temperature
Year,Unnamed: 1_level_1
1950,39.17


Unnamed: 0_level_0,Palmer Z-Index
Year,Unnamed: 1_level_1
1950,4.14


The `Year` column was successfully set as the index column. I can now merge the separate data frames.

In [8]:
# Merge all climate data into a single dataframe

# Concatenate all of the individual climate data frames side by side (axis=1)
# into one master data frame, aligning the rows on the shared Year index
climate_df = pd.concat(climate_list, axis=1)

# confirm the master data frame was created successfully.
display(climate_df.head(3))

Unnamed: 0_level_0,Cooling Degree Days,Heating Degree Days,Precipitation,Palmer Drought Severity Index (PDSI),Palmer Hydrological Drought Index (PHDI),Palmer Modified Drought Index (PMDI),Average Temperature,Maximum Temperature,Minimum Temperature,Palmer Z-Index
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1950,263,14,30.87,1.93,1.93,1.93,51.39,63.61,39.17,4.14
1951,314,9,31.25,1.65,1.65,1.65,51.12,63.19,39.04,0.8
1952,355,6,26.34,-1.84,-1.84,-1.84,52.27,64.7,39.85,-2.32
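
The column-wise concatenation works because `pd.concat` with `axis=1` aligns rows on the index labels rather than on row position. A minimal sketch with two toy frames that deliberately cover different years (the frame names are hypothetical) shows what happens when a year is missing from one source:

```python
import pandas as pd

# Two toy frames indexed by Year, covering different (overlapping) years.
temps = pd.DataFrame(
    {"Average Temperature": [51.39, 51.12]},
    index=pd.Index(["1950", "1951"], name="Year"),
)
precip = pd.DataFrame(
    {"Precipitation": [31.25, 26.34]},
    index=pd.Index(["1951", "1952"], name="Year"),
)

# Rows are matched by index label; years missing from one frame become NaN.
combined = pd.concat([temps, precip], axis=1)
print(combined)
```

In the notebook's data every file covers the same years, so no gaps appear, but a quick `climate_df.isna().sum()` would reveal any misalignment.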


The climate data frame appears to have been created successfully and is now ready for export.
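
Before exporting, it may be worth checking the column dtypes. Because the metadata rows were read into the same columns as the data, the values are likely still stored as strings. The sketch below converts them to numbers and turns the missing-value codes listed in the metadata (`-99` and `-9999`) into `NaN`. The toy frame and its missing entry for 1951 are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame holding strings; the -99 entry is a made-up missing value.
toy = pd.DataFrame(
    {"Precipitation": ["30.87", "-99", "26.34"]},
    index=pd.Index(["1950", "1951", "1952"], name="Year"),
)

# Convert text to numbers; anything that cannot be parsed becomes NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# Replace the missing-value codes from the file metadata with NaN.
numeric = numeric.replace([-99, -9999], np.nan)

# Store the year labels as integers as well.
numeric.index = numeric.index.astype(int)
print(numeric.dtypes)
```

This blanket replacement assumes that neither code can occur as a legitimate measurement in any column, which should be verified for the drought indices before applying it to the real data.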

In [9]:
# Export the climate_df to the same folder this notebook is located in.
climate_df.to_csv('./climate.csv')
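
# A quick way to confirm an export like the one above is to read the file back and
# compare it with what was written. The sketch below is a hedged illustration that
# uses a toy frame and a temporary folder (both hypothetical) instead of climate.csv.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame shaped like the master data frame: Year index, one value column.
toy = pd.DataFrame(
    {"Precipitation": [30.87, 31.25]},
    index=pd.Index([1950, 1951], name="Year"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "climate_toy.csv"
    toy.to_csv(out_path)
    # index_col restores the Year column as the index when reading back.
    reloaded = pd.read_csv(out_path, index_col="Year")

print(reloaded)
```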

This is the end of this notebook.