## Fetch weather data from the NOAA API

#### Documentation for the NOAA API:
https://www.ncdc.noaa.gov/cdo-web/webservices/v2

Specification for each dataset:
https://www.ncdc.noaa.gov/cdo-web/datasets

*The GSOM (Global Summary of the Month) dataset is used for this project.

#### Variables of Interest:
- NAME – STATUS – details
- TAVG – <span style="color:green">completed</span> – Average Monthly Temperature. Computed by adding the unrounded monthly maximum and minimum temperatures and dividing by 2. Given in degrees Fahrenheit.
- PRCP – <span style="color:green">completed</span> – Total Monthly Precipitation. Given in inches.
- SNOW – <span style="color:orange">completed</span> – Total Monthly Snowfall. Given in inches.
    - The data source has no 2018-09 data. Judging from previous years, it is likely safe to assume that SNOW in 2018-09 is 0.
- DP10 – <span style="color:green">completed</span> – Number of days with precipitation >= 0.1 inch/2.54 millimeters in the month.
- EVAP – <span style="color:orange">completed</span> – Total Monthly Evaporation. Given in inches.
    - Data is missing for many states in 2015 and early 2016 (for example, only 17 of 51 report a value for 2015-01).
- PSUN – <span style="color:red">unusable</span> – Monthly Average of the daily percents of possible sunshine.
    - Most states do not report any data for this variable.
- AWND – <span style="color:green">completed</span> – Monthly Average Wind Speed. Given in miles per hour.

__Note:__
- <span style="color:green">green:</span> completed with complete data for the given time range
- <span style="color:orange">orange:</span> completed with missing data
- <span style="color:red">red:</span> too much data is missing for the dataset to be usable

__*List of Datasets:*__
- AWND201501-201809.csv
- DP10201501-201809.csv
- EVAP201501-201809.csv
- PRCP201501-201809.csv
- PSUN201501-201809.csv
- SNOW201501-201809.csv
- TAVG201501-201809.csv

In [4]:
import numpy as np
import pandas as pd
import json
import requests
import urllib.parse
import os

In [5]:
cwd = os.getcwd()
datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'
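
The path handling above splits on `/`, which only works with POSIX-style paths. A minimal sketch of the same idea with `pathlib`, assuming the notebook lives one level below the project root and that the data sits in `data/external/` as in the cells below:

```python
from pathlib import Path

# Parent of the current working directory, then the 'data' folder
# (same layout the notebook assumes: <project>/data/external/...).
datadir = Path.cwd().parent / 'data'

# Build the path of one of the output files listed above.
tavg_path = datadir / 'external' / 'TAVG201501-201809.csv'
print(tavg_path)
```

`pd.read_csv` and `DataFrame.to_csv` both accept `Path` objects, so the string concatenation used later could be replaced by paths built this way.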

In [4]:
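# Read the NOAA API token from the environment rather than hard-coding it in the notebook.
# NOAA_TOKEN is an assumed variable name; set it before starting the notebook.
NOAA_token = os.environ.get('NOAA_TOKEN', '')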

BASE_URL = 'https://www.ncdc.noaa.gov/cdo-web/api/v2'
Datasets_BASE_URL = BASE_URL+ '/data'

def get_result(url, param, NOAAToken) -> dict:
    '''Query the NOAA API and return the decoded JSON response as a dict'''
    response = None
    try:
        header = dict(token=NOAAToken)
        response = requests.get(url, param, headers=header)
        json_response = response.content.decode(encoding='utf-8')
        returned_dict = json.loads(json_response)
        return returned_dict
    finally:
        # always release the connection, even if decoding fails
        if response is not None:
            response.close()

def build_NOAA_dataset_param(locationid:str,startdate:str,enddate:str,datatype:str) -> str:
    '''Build the URL-encoded query string for requesting one datatype from the GSOM dataset'''
    query_parameters = [('datasetid','GSOM'),('locationid',locationid),
                        ('startdate',startdate),('enddate',enddate),
                        ('units','standard'),('datatypeid',datatype),('limit','1000')]
    # standard units are used for all the data (as opposed to metric units)
    return urllib.parse.urlencode(query_parameters)
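
To make the query string concrete, here is a small self-contained sketch that repeats the parameter builder and prints its result for one state and month. The state ID `FIPS:06` (California) and the dates are only illustrative inputs.

```python
import urllib.parse

def build_NOAA_dataset_param(locationid, startdate, enddate, datatype):
    # Same parameters as in the notebook: GSOM dataset, standard (non-metric) units.
    query_parameters = [('datasetid', 'GSOM'), ('locationid', locationid),
                        ('startdate', startdate), ('enddate', enddate),
                        ('units', 'standard'), ('datatypeid', datatype),
                        ('limit', '1000')]
    return urllib.parse.urlencode(query_parameters)

param = build_NOAA_dataset_param('FIPS:06', '2015-01-01', '2015-01-02', 'TAVG')
print(param)
# datasetid=GSOM&locationid=FIPS%3A06&startdate=2015-01-01&enddate=2015-01-02&units=standard&datatypeid=TAVG&limit=1000
```

Note that `urlencode` percent-encodes the colon in the state ID, so the ID can be passed exactly as it appears in `US_states.csv`.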

In [174]:
'''
!!!Skip this step if US_states.csv is already downloaded
Script to obtain a dataframe of all the US states, along with their stateID
- resulting csv is saved as US_states.csv
'''
Locations_BASE_URL = BASE_URL+ '/locations'
state_parameters = [('locationcategoryid','ST'),('limit','52')]
state_param = urllib.parse.urlencode(state_parameters)
state_dict = get_result(Locations_BASE_URL,state_param,NOAA_token)
cleaned_state_dict = {}
for i in state_dict['results']:
    cleaned_state_dict[i['name']] = i['id']

df_state = pd.DataFrame.from_dict(cleaned_state_dict,orient='index',columns=['state_id'])
df_state.to_csv('US_states.csv')

In [5]:
'''Building the time frame and location range (time_space) where we want NOAA data'''
time_year = [str(i) for i in range(2015,2019)]
time_month = [str(i) for i in range(1,13)]
time_month = ['0'+i if len(i)==1 else i for i in time_month]
time_space = [i+'-'+j for i in time_year for j in time_month]

time_space = time_space[:-3] #get rid of 2018Q4
# time_space
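
As an alternative sketch, the same list of month labels can be produced directly with `pd.period_range`, which avoids building the full four years and then trimming the last quarter. The start and end months are taken from the file names above.

```python
import pandas as pd

# Monthly periods from 2015-01 through 2018-09, formatted as 'YYYY-MM' strings.
time_space = pd.period_range('2015-01', '2018-09', freq='M').strftime('%Y-%m').tolist()

print(len(time_space), time_space[0], time_space[-1])
# 45 2015-01 2018-09
```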

In [6]:
'''Read US_states.csv as a pandas DataFrame (with blank structure of the data)'''
df_state = pd.read_csv('US_states.csv',index_col=0,names=['','state_id']+time_space).iloc[1:,:]
# df_state.head()
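
The `names=` and `.iloc[1:,:]` combination above yields a frame with `state_id` plus one empty column per month. A minimal sketch of the same "blank structure" using `reindex`, shown on a hypothetical two-state frame and a shortened month list so that it runs without the CSV file:

```python
import pandas as pd

# Hypothetical stand-in for the contents of US_states.csv.
states = pd.DataFrame({'state_id': ['FIPS:01', 'FIPS:02']},
                      index=['Alabama', 'Alaska'])
months = ['2015-01', '2015-02', '2015-03']  # shortened stand-in for time_space

# reindex adds the month columns and fills them with NaN.
blank = states.reindex(columns=['state_id'] + months)
print(blank)
```

Every month cell starts as NaN, which is what `retrieve_df` checks for when deciding which cells still need to be requested.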

In [31]:
# Retrieve data on a specific variable for the given time/space range.
# Only cells that are still NaN are requested, so the function can be run
# multiple times until all the data is acquired.
def retrieve_df(variable:str,df:pd.DataFrame):
    df = df.copy() # work on a copy so the caller's frame (e.g. df_state) stays blank for the next variable
    for month in time_space: # every month, including the last one (range(len(time_space)-1) skipped it)
        for index, row in df.iterrows():
            if pd.isna(df.loc[index,month]): #to fill all the NaNs
                param = build_NOAA_dataset_param(row['state_id'],month+'-01',month+'-02',variable)
                try:
                    data_dict = get_result(Datasets_BASE_URL, param, NOAA_token)
                    print(month, index, data_dict['metadata']['resultset']['count'])#for debugging purposes
                    values = [record['value'] for record in data_dict['results']]
                except KeyError:
                    print(month, index, 'No Data')
                    values = [np.nan]
                except json.JSONDecodeError:
                    print(month, index, 'ERROR')
                    continue # leave the cell as NaN so that a later run retries it

                # state-level value = mean over all records returned for the state
                df.loc[index,month] = np.mean(values)
    return df
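
The core of the retrieval step is reducing one API response to a single number per state and month. The sketch below isolates that reduction so it can be checked offline. The payload shape (`metadata.resultset.count` and a `results` list with `value` fields) follows what the code above reads; the helper name `monthly_mean` and the sample values are hypothetical.

```python
import numpy as np

def monthly_mean(data_dict):
    """Average the 'value' fields of a response; NaN if there are no results."""
    results = data_dict.get('results')
    if not results:
        # An empty response corresponds to the 'No Data' case above.
        return np.nan
    return float(np.mean([record['value'] for record in results]))

sample = {'metadata': {'resultset': {'count': 2}},
          'results': [{'value': 30.0}, {'value': 34.0}]}

print(monthly_mean(sample))  # 32.0
print(monthly_mean({}))      # nan
```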

#### TAVG: Average Temperature

In [None]:
# Average Temperature for the given Month
TAVG_param = 'TAVG'
TAVG_df = retrieve_df(TAVG_param,df_state)

In [7]:
TAVG_df.to_csv(datadir+'external/'+'TAVG201501-201809.csv')

In [8]:
# read the csv file with the average temperature for each month; it is aggregated by quarter in the next cell
# unit = Fahrenheit
TAVG_df = pd.read_csv(datadir+'external/'+'TAVG201501-201809.csv',index_col=0).iloc[:,1:] #this is to get rid of the first column (state_id)
TAVG_df.describe()
# 0 missing data

Unnamed: 0,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,...,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
count,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,...,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0
mean,30.884203,28.187238,41.85747,52.185658,61.821848,69.972817,73.106682,71.89407,67.836481,55.571148,...,33.38942,30.285962,36.201856,40.704044,47.585567,64.993724,70.229447,74.34938,73.123949,67.623165
std,11.998621,14.825078,11.537362,9.57164,7.823726,6.718634,6.35016,5.959807,7.161969,7.9839,...,11.591471,11.504719,14.847127,10.666548,9.615076,8.065945,7.776295,5.409466,6.073042,8.168557
min,11.240541,4.486842,21.322364,32.969206,45.879204,54.399377,56.453086,51.985,41.720189,34.392857,...,13.651705,9.923944,5.108511,20.693,30.470096,42.652649,51.885762,57.74183,51.917,46.54507
25%,23.427848,16.733287,34.024794,44.745497,56.194677,64.442864,68.457392,68.32289,63.853421,50.598669,...,25.005108,23.595806,26.546068,33.465277,40.355543,60.198857,64.900074,70.680441,69.652993,62.121705
50%,30.587097,27.15,39.85,51.244898,62.152597,70.162179,72.478475,71.678195,68.684848,54.383768,...,32.443662,29.226316,34.734351,38.209655,46.54,64.964516,71.728261,74.149412,74.0,68.619429
75%,38.029754,39.627166,48.956124,57.443746,67.64361,75.772059,77.670909,75.914785,72.264216,59.788313,...,39.619131,36.515799,44.907917,46.736474,53.828371,71.735656,75.982181,78.017329,76.837792,73.690909
max,67.44717,69.061111,70.975926,76.21962,77.944586,81.925949,84.548684,83.476574,81.062264,76.085,...,67.068,68.2,70.71,67.547826,71.697163,78.104412,83.775871,84.606667,83.424599,82.807692


In [163]:
aggregated_by_quarter = TAVG_df.groupby(pd.PeriodIndex(TAVG_df.columns, freq='Q'), axis=1).mean()
aggregated_by_quarter.head()

Unnamed: 0,2015Q1,2015Q2,2015Q3,2015Q4,2016Q1,2016Q2,2016Q3,2016Q4,2017Q1,2017Q2,2017Q3,2017Q4,2018Q1,2018Q2,2018Q3
Alabama,46.928231,72.057654,78.270934,60.033338,50.211965,70.910266,80.665539,58.616761,54.934835,70.892207,77.530135,55.778212,50.603421,70.843834,79.866139
Alaska,16.74245,45.331378,50.052759,21.643262,21.74282,46.137857,52.817082,19.550547,9.323255,43.689653,51.921071,22.908455,14.950361,41.669502,52.067967
Arizona,49.305863,64.129426,75.025752,49.261639,47.225312,65.000406,73.744152,52.809886,47.341507,65.698068,74.379399,55.047496,46.992469,67.126249,75.562577
Arkansas,40.942971,69.76822,78.272733,55.596438,46.699964,69.3804,79.159959,54.772944,50.45928,68.890125,76.887424,52.782816,44.336211,69.830287,78.434489
California,53.329779,62.091489,72.3636,52.063341,49.970185,62.65889,71.741923,52.341079,47.705641,62.037937,73.12406,54.603736,48.257196,61.576193,72.956178
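
The quarterly aggregation relies on `groupby(..., axis=1)`, which recent pandas releases deprecate. A minimal sketch of an equivalent that avoids the `axis` argument by transposing, shown on a hypothetical two-state frame:

```python
import pandas as pd

# Hypothetical frame with monthly columns labelled 'YYYY-MM', as in the CSV files.
monthly = pd.DataFrame(
    {'2015-01': [40.0, 10.0], '2015-02': [44.0, 14.0],
     '2015-03': [48.0, 18.0], '2015-04': [60.0, 30.0]},
    index=['StateA', 'StateB'],
)

# Map each monthly column to its calendar quarter.
quarters = pd.PeriodIndex(monthly.columns, freq='Q')

# Transpose so months become rows, group the rows by quarter, then transpose back.
by_quarter = monthly.T.groupby(quarters).mean().T
print(by_quarter)
```

The result has one column per quarter (`2015Q1`, `2015Q2`), matching the layout of the output above.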


#### PRCP: Total Monthly Precipitation

In [None]:
# Total Monthly Precipitation
PRCP_param = 'PRCP'
PRCP_df = retrieve_df(PRCP_param,df_state)

In [None]:
PRCP_df.to_csv(datadir+'external/'+'PRCP201501-201809.csv')

In [46]:
# read the csv file with the total precipitation for each month; it is aggregated by quarter in the next cell
# unit = Inches
PRCP_df = pd.read_csv(datadir+'external/'+'PRCP201501-201809.csv',index_col=0).iloc[:,1:] #this is to get rid of the first column (state_id)
PRCP_df.describe()
# 0 missing data

Unnamed: 0,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,...,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
count,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,...,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0
mean,2.490675,2.13186,2.801954,3.488286,3.837892,4.892309,3.824923,3.044911,3.091166,3.513293,...,2.167983,2.539449,4.030734,3.161621,3.540094,3.877443,3.8436,3.798292,4.266177,4.967247
std,1.520601,1.242446,1.93022,1.949891,2.945913,2.711448,1.952542,1.884274,1.982103,2.131413,...,1.692075,1.709597,2.826356,1.612911,1.952234,2.273829,1.875494,2.184216,2.420324,3.072976
min,0.158817,0.361434,0.253164,0.432415,0.788065,0.208249,0.42899,0.063173,0.351508,0.647038,...,0.13069,0.191845,0.386261,0.288,0.018759,0.120995,0.052735,0.036047,0.032889,0.047429
25%,1.086453,1.120111,0.958079,2.326032,2.029942,3.407823,2.586574,1.87244,1.652909,1.887607,...,0.698966,1.154279,1.789226,1.78562,1.834632,2.113805,2.830482,2.437884,2.318089,2.543948
50%,2.591656,2.124962,2.879052,3.216433,2.716383,4.807601,3.641501,2.862873,2.871475,3.218103,...,1.661713,2.501444,3.515534,3.163923,3.714021,3.539537,3.900274,3.719162,4.803142,5.194316
75%,3.657459,2.8652,4.395407,4.422608,5.010056,6.900269,4.936485,3.653175,3.962624,4.415981,...,3.216466,3.34482,5.431585,4.169941,4.883385,5.33108,5.354221,5.022216,5.648371,7.574635
max,5.719205,5.331172,7.804027,8.715635,15.086935,10.98,9.53841,9.823,11.553985,11.920504,...,6.394634,8.073744,11.360283,7.425242,9.568898,10.094018,7.353064,9.13,12.836147,10.193333


In [47]:
aggregated_by_quarter = PRCP_df.groupby(pd.PeriodIndex(PRCP_df.columns, freq='Q'), axis=1).mean()
aggregated_by_quarter.head()

Unnamed: 0,2015Q1,2015Q2,2015Q3,2015Q4,2016Q1,2016Q2,2016Q3,2016Q4,2017Q1,2017Q2,2017Q3,2017Q4,2018Q1,2018Q2,2018Q3
Alabama,4.292857,5.063683,4.381662,7.053285,5.201336,3.716904,3.654481,2.751229,5.56806,7.403597,5.220556,3.938974,4.821895,5.661132,5.07006
Alaska,3.420439,2.406155,4.959754,4.662441,3.828712,2.774919,5.037917,3.22615,2.677167,2.315169,5.059351,4.360517,2.407222,2.818683,3.589024
Arizona,1.17087,0.725099,1.807447,1.188076,0.704811,0.602894,2.099749,1.190525,1.341123,0.160966,1.918154,0.094061,0.768739,0.230241,2.026463
Arkansas,4.689986,6.955468,3.165727,7.698019,4.190321,4.487992,4.812787,2.794519,3.490672,6.596194,3.950579,2.630518,6.325989,3.768693,5.121797
California,1.37754,0.759852,0.402765,2.533952,4.571983,0.739729,0.059846,3.83722,7.830338,1.164686,0.178725,1.300384,3.548905,0.925199,0.067636


#### SNOW: Total Monthly Snowfall

In [None]:
# Total Monthly Snowfall
SNOW_param = 'SNOW'
SNOW_df1 = retrieve_df(SNOW_param,df_state)

In [None]:
SNOW_df1.to_csv(datadir+'external/'+'SNOW201501-201805.csv')

In [180]:
# read the csv file with Total Monthly Snowfall for each month
# then aggregate it by quarter 
# unit = Inches
SNOW_df1 = pd.read_csv(datadir+'external/'+'SNOW201501-201805.csv',index_col=0,names=['','state_id']+time_space).iloc[1:,:] #keep state_id; drop the original header row
SNOW_df1.head()
# a lot of data is missing (2018-06 onward is all NaN)

Unnamed: 0,state_id,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,...,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
Alabama,FIPS:01,0.0133333333333333,6.490196078431373,0.2249999999999999,0.0,0.0,0.0,0.0,0.0,0.0,...,0.5236842105263158,0.4902439024390244,0.0,0.0,0.0,0.0,,,,
Alaska,FIPS:02,7.808695652173914,5.644680851063828,7.117708333333333,3.894117647058823,0.0227272727272727,0.034090909090909,0.0,0.0175824175824175,4.29,...,8.223809523809523,11.580898876404495,17.791011235955057,11.764634146341464,2.203225806451613,0.5724137931034482,,,,
Arizona,FIPS:04,4.429885057471265,0.7424028268551237,1.0598290598290598,0.1260714285714285,0.1796536796536796,0.0,0.0,0.0,0.0,...,0.0286163522012578,0.8153392330383482,1.897854077253219,0.2980327868852459,0.0308157099697885,0.1178124999999999,,,,
Arkansas,FIPS:05,0.0277777777777777,3.563057324840765,3.4477611940298507,0.0022900763358778,0.0,0.0,0.0,0.0,0.0,...,0.0167701863354037,1.1694610778443115,0.0266666666666666,0.0,0.0537931034482758,0.0,,,,
California,FIPS:06,0.0142857142857142,0.4945086705202312,0.2438547486033519,0.5719444444444445,0.2509677419354839,0.0,0.0,0.0,0.0,...,0.2124378109452736,1.3540084388185656,1.7013089005235602,9.680722891566266,0.7188311688311687,0.0564245810055865,,,,


In [190]:
#re-running to fill in the NaNs
variable = 'SNOW'
SNOW_df2 = pd.read_csv(datadir+'external/'+'SNOW201501-201809.csv',index_col=0)
SNOW_df2 = retrieve_df(variable, SNOW_df2)

In [193]:
SNOW_df2.describe()
# the data source has no 2018-09 data (count is 0 for that month);
# judging from previous years, it is likely safe to assume that SNOW in 2018-09 is 0.

Unnamed: 0,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,...,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
count,51.0,51.0,50.0,50.0,50.0,50.0,50.0,51.0,51.0,49.0,...,51.0,51.0,50.0,51.0,51.0,51.0,51.0,51.0,51.0,0.0
mean,7.3256,11.163849,4.41788,1.168461,0.265401,0.000983,0.000381,0.000425,0.084286,0.113093,...,7.045658,6.228499,7.140944,8.343922,3.406,0.050647,0.001612,0.000413,0.000395,
std,8.722077,11.512115,4.302115,2.104775,0.791193,0.004963,0.001625,0.002516,0.600697,0.372328,...,7.97379,5.723932,7.050644,8.96572,4.733505,0.156273,0.006474,0.002948,0.002181,
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
25%,0.559881,2.506667,0.717557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.715462,1.309861,0.841706,0.434133,0.057522,0.0,0.0,0.0,0.0,
50%,4.846377,6.585185,2.853039,0.127851,0.0,0.0,0.0,0.0,0.0,0.0,...,5.424691,4.321277,5.230148,6.569748,1.788636,0.0,0.0,0.0,0.0,
75%,8.879617,14.602113,7.68663,1.164831,0.065421,0.0,0.0,0.0,0.0,0.065217,...,8.595802,9.48648,11.078042,11.484519,4.443551,0.000962,0.0,0.0,0.0,
max,34.459406,44.695,16.75,8.809828,3.52037,0.034091,0.008403,0.017582,4.29,2.485714,...,28.553846,21.320202,25.437879,32.847222,18.876689,0.804688,0.039286,0.021053,0.014706,


In [186]:
SNOW_df2.to_csv(datadir+'external/'+'SNOW201501-201809.csv')

In [11]:
SNOW_df = pd.read_csv(datadir+'external/'+'SNOW201501-201809.csv',index_col=0).iloc[:,1:]
SNOW_df.describe()
# aggregated_by_quarter = SNOW_df.groupby(pd.PeriodIndex(SNOW_df.columns, freq='Q'), axis=1).mean()
# aggregated_by_quarter.head()

Unnamed: 0,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,...,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
count,51.0,51.0,50.0,50.0,50.0,50.0,50.0,51.0,51.0,49.0,...,51.0,51.0,50.0,51.0,51.0,51.0,51.0,51.0,51.0,0.0
mean,7.3256,11.163849,4.41788,1.168461,0.265401,0.000983,0.000381,0.000425,0.084286,0.113093,...,7.045658,6.228499,7.140944,8.343922,3.406,0.050647,0.001612,0.000413,0.000395,
std,8.722077,11.512115,4.302115,2.104775,0.791193,0.004963,0.001625,0.002516,0.600697,0.372328,...,7.97379,5.723932,7.050644,8.96572,4.733505,0.156273,0.006474,0.002948,0.002181,
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
25%,0.559881,2.506667,0.717557,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.715462,1.309861,0.841706,0.434133,0.057522,0.0,0.0,0.0,0.0,
50%,4.846377,6.585185,2.853039,0.127851,0.0,0.0,0.0,0.0,0.0,0.0,...,5.424691,4.321277,5.230148,6.569748,1.788636,0.0,0.0,0.0,0.0,
75%,8.879617,14.602113,7.68663,1.164831,0.065421,0.0,0.0,0.0,0.0,0.065217,...,8.595802,9.48648,11.078042,11.484519,4.443551,0.000962,0.0,0.0,0.0,
max,34.459406,44.695,16.75,8.809828,3.52037,0.034091,0.008403,0.017582,4.29,2.485714,...,28.553846,21.320202,25.437879,32.847222,18.876689,0.804688,0.039286,0.021053,0.014706,
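
If the assumption noted earlier is accepted (September 2018 snowfall treated as 0), it can be applied explicitly before any aggregation. A minimal sketch on a hypothetical two-state frame; only the column name `2018-09` comes from the data above:

```python
import numpy as np
import pandas as pd

# Hypothetical snowfall frame in which the last month was never reported.
snow = pd.DataFrame(
    {'2018-08': [0.0, 0.01], '2018-09': [np.nan, np.nan]},
    index=['StateA', 'StateB'],
)

# Treat missing 2018-09 snowfall as 0; other months are left untouched.
snow['2018-09'] = snow['2018-09'].fillna(0.0)
print(snow)
```

Restricting `fillna` to the single column keeps genuine gaps in other months visible as NaN.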


#### DP10: Number of days with precipitation >= 0.1 inch/2.54 millimeters in the month.

In [None]:
# Total number of days with precipitation >= 0.1 inch/2.54 millimeters in the month
# unit = days
DP10_param = 'DP10'
DP10_df = retrieve_df(DP10_param,df_state)

In [None]:
DP10_df.to_csv(datadir+'external/'+'DP10201501-201809.csv')

In [9]:
DP10_df = pd.read_csv(datadir+'external/'+'DP10201501-201809.csv',index_col=0).iloc[:,1:] #this is to get rid of the first column (state_id)
DP10_df.describe()
# 0 missing data

Unnamed: 0,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,...,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
count,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,...,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0
mean,4.807604,4.614621,5.956594,6.839012,7.026273,7.728568,6.62862,5.102556,4.705622,5.3362,...,4.410242,4.949015,7.349005,6.012721,6.243693,7.085553,6.488747,5.94596,6.779593,6.753391
std,2.126784,2.006966,3.057218,2.411253,3.493194,3.335835,2.375078,2.21455,2.387474,1.635,...,2.514035,2.812518,3.288656,2.30914,2.900562,2.766769,2.603536,2.524208,2.930016,3.308148
min,0.592532,1.191235,0.76006,1.265854,2.123967,0.692513,1.27593,0.192771,0.862205,1.84058,...,0.379549,0.735043,1.581818,1.073059,0.085253,0.400463,0.171233,0.128155,0.094366,0.15083
25%,3.069444,3.20695,3.075874,5.828757,3.941057,5.5315,5.374615,4.11003,3.197053,4.249971,...,2.031392,3.499235,4.908186,4.765478,4.12832,5.770464,5.277562,4.780627,5.187704,4.799152
50%,5.178,4.591837,6.08125,6.810309,6.603636,7.811224,6.521073,4.755556,4.2,5.275362,...,4.665354,5.213018,7.903361,6.095745,5.781362,7.505085,6.813131,6.073864,7.627907,6.817365
75%,6.259104,5.919584,8.52847,8.147458,9.617252,10.676194,7.79218,5.803182,5.318309,6.465833,...,6.407687,6.07619,9.781754,7.256705,8.012435,8.988215,8.470429,7.630453,8.599208,8.970114
max,9.333333,9.380671,11.163588,11.466667,14.56129,14.356164,12.149284,12.715431,12.226087,9.94686,...,9.431555,15.891705,14.037634,10.823529,12.621359,12.320965,10.843478,12.831395,12.857868,12.333333


In [106]:
aggregated_by_quarter = DP10_df.groupby(pd.PeriodIndex(DP10_df.columns, freq='Q'), axis=1).mean()
aggregated_by_quarter.head()

Unnamed: 0,2015Q1,2015Q2,2015Q3,2015Q4,2016Q1,2016Q2,2016Q3,2016Q4,2017Q1,2017Q2,2017Q3,2017Q4,2018Q1,2018Q2,2018Q3
Alabama,7.010941,7.99817,6.526474,7.267033,6.117246,5.376676,6.323448,3.307723,7.2592,8.502084,6.906492,5.209086,7.244026,7.427125,7.548988
Alaska,6.427,5.513423,10.336559,8.979941,7.283221,6.787782,10.471054,6.288735,5.442729,5.961737,10.59388,8.599639,6.093157,6.725057,8.421027
Arizona,2.657535,1.97161,4.189651,3.316578,1.641814,1.775444,4.559666,2.939756,3.100388,0.612323,4.152277,0.336516,1.951869,0.544096,4.493255
Arkansas,6.699857,8.935607,4.008884,6.86103,5.313871,6.492739,6.583304,4.132718,6.298751,6.842576,5.61411,3.658821,7.031669,5.267324,6.871052
California,2.163413,1.793912,0.848985,4.729417,6.613647,1.931278,0.169361,5.435925,9.446713,2.293451,0.532073,2.102354,5.764785,1.621762,0.178761


#### EVAP: Total Monthly Evaporation.

In [None]:
# Total Monthly Evaporation
# unit = inch
EVAP_param = 'EVAP'
EVAP_df = retrieve_df(EVAP_param,df_state)

In [None]:
EVAP_df.to_csv(datadir+'external/'+'EVAP201501-201809.csv')

In [10]:
EVAP_df = pd.read_csv(datadir+'external/'+'EVAP201501-201809.csv',index_col=0).iloc[:,1:]
EVAP_df.head()
### SOME MISSING DATA in 2015 & early 2016

Unnamed: 0,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,...,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
Alabama,1.87,,3.67,4.19,6.78,6.34,7.34,6.82,4.95,4.05,...,6.907692,5.209964,9.954545,6.567568,6.287719,7.971014,8.022642,7.264151,7.535156,7.847656
Alaska,,,,,5.48,5.13,4.14,2.845,1.45,,...,8.379487,6.430052,6.051546,5.797872,5.209756,8.152284,6.813131,6.235,12.857868,6.170213
Arizona,2.09,3.385,5.96,8.925,9.49,12.46,12.95,11.51,8.625,5.235,...,0.7471,1.353333,3.429213,1.073059,0.085253,0.400463,1.146572,6.149758,5.307506,2.0225
Arkansas,0.59,0.38,2.34,4.45,6.115,7.075,7.4875,6.7825,5.935,5.37,...,4.324723,4.347518,10.693141,6.054348,5.560284,5.553957,4.687732,6.053435,7.820225,6.739496
California,1.844615,2.745385,5.065,7.32,8.195385,11.023077,11.251538,10.818462,8.666154,5.471538,...,0.81586,6.099075,1.802326,9.392954,3.290495,1.403557,0.171233,0.291086,0.094366,0.15083


In [33]:
EVAP_param = 'EVAP'
EVAP_df = pd.read_csv(datadir+'external/'+'EVAP201501-201809.csv',index_col=0)
EVAP_df2 = retrieve_df(EVAP_param,EVAP_df)

2015-01 Alaska No Data
2015-01 Connecticut No Data
2015-01 Delaware No Data
2015-01 District of Columbia No Data
2015-01 Georgia No Data
2015-01 Indiana No Data
2015-01 Iowa No Data
2015-01 Kansas No Data
2015-01 Kentucky No Data
2015-01 Maine No Data
2015-01 Maryland No Data
2015-01 Massachusetts No Data
2015-01 Michigan No Data
2015-01 Minnesota No Data
2015-01 Missouri No Data
2015-01 Montana No Data
2015-01 Nebraska No Data
2015-01 New Jersey No Data
2015-01 New York No Data
2015-01 North Carolina No Data
2015-01 North Dakota No Data
2015-01 Ohio No Data
2015-01 Oklahoma No Data
2015-01 Oregon No Data
2015-01 Pennsylvania No Data
2015-01 Rhode Island No Data
2015-01 South Dakota No Data
2015-01 Utah No Data
2015-01 Vermont No Data
2015-01 Virginia No Data
2015-01 Washington No Data
2015-01 West Virginia No Data
2015-01 Wisconsin No Data
2015-01 Wyoming No Data
2015-02 Alabama No Data
2015-02 Alaska No Data
2015-02 Connecticut No Data
2015-02 Delaware No Data
2015-02 District of Col

2016-01 Washington No Data
2016-01 West Virginia No Data
2016-01 Wisconsin No Data
2016-01 Wyoming No Data
2016-02 Alaska No Data
2016-02 Connecticut No Data
2016-02 Delaware No Data
2016-02 District of Columbia No Data
2016-02 Georgia No Data
2016-02 Hawaii No Data
2016-02 Illinois No Data
2016-02 Indiana No Data
2016-02 Iowa No Data
2016-02 Kansas No Data
2016-02 Kentucky No Data
2016-02 Maine No Data
2016-02 Maryland No Data
2016-02 Massachusetts No Data
2016-02 Michigan ERROR
2016-02 Minnesota No Data
2016-02 Missouri No Data
2016-02 Montana No Data
2016-02 Nebraska No Data
2016-02 New Jersey No Data
2016-02 New York No Data
2016-02 North Carolina No Data
2016-02 North Dakota No Data
2016-02 Ohio No Data
2016-02 Oklahoma No Data
2016-02 Oregon No Data
2016-02 Pennsylvania No Data
2016-02 Rhode Island No Data
2016-02 South Dakota No Data
2016-02 Tennessee No Data
2016-02 Utah No Data
2016-02 Vermont No Data
2016-02 Virginia No Data
2016-02 Washington No Data
2016-02 West Virginia No

In [32]:
EVAP_df2.to_csv(datadir+'external/'+'EVAP201501-201809.csv')

In [34]:
EVAP_df2 = pd.read_csv(datadir+'external/'+'EVAP201501-201809.csv',index_col=0).iloc[:,1:]
EVAP_df2.describe()
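### STILL SOME MISSING DATA in 2015 & early 2016
### Note: the 2017-12 to 2018-09 statistics shown here are identical to the DP10 table above,
### so those columns may hold DP10 values rather than EVAP values and should be re-checked.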

Unnamed: 0,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,...,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
count,17.0,14.0,16.0,27.0,39.0,40.0,38.0,39.0,39.0,33.0,...,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0,51.0
mean,1.699316,2.30226,3.637491,5.658565,8.10191,12.31412,10.997689,9.722893,7.598801,3.993213,...,4.410242,4.949015,7.349005,6.012721,6.243693,7.085553,6.488747,5.94596,6.779593,6.753391
std,1.451848,1.882164,2.846456,2.346378,7.31845,16.361827,12.0078,10.26316,7.126413,1.789782,...,2.514035,2.812518,3.288656,2.30914,2.900562,2.766769,2.603536,2.524208,2.930016,3.308148
min,0.0,0.0,0.0,0.0,3.065,2.175,2.915,2.495,1.45,0.0,...,0.379549,0.735043,1.581818,1.073059,0.085253,0.400463,0.171233,0.128155,0.094366,0.15083
25%,0.35,0.5025,1.599167,4.5,5.51,6.093333,6.774375,5.795,4.7175,3.03,...,2.031392,3.499235,4.908186,4.765478,4.12832,5.770464,5.277562,4.780627,5.187704,4.799152
50%,1.87,2.417692,3.483333,5.074444,6.241667,7.6095,7.685,6.82,5.51,3.907778,...,4.665354,5.213018,7.903361,6.095745,5.781362,7.505085,6.813131,6.073864,7.627907,6.817365
75%,2.43125,4.01125,5.79875,6.522419,6.935,9.51125,10.151978,9.650833,8.391667,5.235,...,6.407687,6.07619,9.781754,7.256705,8.012435,8.988215,8.470429,7.630453,8.599208,8.970114
max,4.86,5.205,8.365,11.95,45.69,74.83,61.5,61.43,45.315,7.99,...,9.431555,15.891705,14.037634,10.823529,12.621359,12.320965,10.843478,12.831395,12.857868,12.333333
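
The `count` row of `describe()` shows how many states report a value per month. A short sketch of a more direct check that lists only the months and states with gaps, using a hypothetical three-state frame in place of the EVAP file:

```python
import numpy as np
import pandas as pd

# Hypothetical evaporation frame with a few gaps.
evap = pd.DataFrame(
    {'2015-01': [1.87, np.nan, 2.09],
     '2015-02': [np.nan, np.nan, 3.385],
     '2018-09': [7.8, 6.2, 2.0]},
    index=['StateA', 'StateB', 'StateC'],
)

# Number of missing states per month, keeping only months with at least one gap.
missing_per_month = evap.isna().sum()
incomplete_months = missing_per_month[missing_per_month > 0]

# States that are missing at least one month.
incomplete_states = evap.index[evap.isna().any(axis=1)].tolist()

print(incomplete_months)
print(incomplete_states)
```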


In [19]:
# freq='Y' groups the monthly columns by calendar year (not by quarter)
aggregated_by_year = EVAP_df.groupby(pd.PeriodIndex(EVAP_df.columns, freq='Y'), axis=1).mean()
aggregated_by_year.head()

Unnamed: 0,2015,2016,2017,2018
Alabama,4.605455,4.776686,6.969216,7.406713
Alaska,3.809,7.84919,7.649496,7.079747
Arizona,7.15625,3.439133,2.050376,2.32974
Arkansas,4.069028,4.912607,5.603565,6.390015
California,6.381611,2.630887,3.593648,2.521769


#### PSUN: Monthly Average of the daily percents of possible sunshine.
unusable – Most states do not report any data for this variable.

In [None]:
# Monthly Average of the daily percents of possible sunshine
# unit = percentage
PSUN_param = 'PSUN'
PSUN_df = retrieve_df(PSUN_param,df_state)

In [None]:
PSUN_df.to_csv(datadir+'external/'+'PSUN201501-201809.csv')

In [38]:
PSUN_df = pd.read_csv(datadir+'external/'+'PSUN201501-201809.csv',index_col=0).iloc[:,1:] #this is to get rid of the first column (state_id)
PSUN_df.describe()
### A LOT OF MISSING DATA

Unnamed: 0,2015-01,2015-02,2015-03,2015-04,2015-05,2015-06,2015-07,2015-08,2015-09,2015-10,...,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
count,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,...,2.0,1.0,2.0,3.0,3.0,3.0,3.0,3.0,3.0,2.0
mean,19.2,33.7,63.6,55.1,44.9,52.8,68.9,52.3,51.95,39.5,...,33.7,37.5,30.4,44.666667,40.266667,55.5,48.8,59.366667,46.866667,43.85
std,,,,,,,,1.414214,12.940054,,...,13.57645,,16.263456,12.038411,14.303962,6.237788,4.635731,14.981433,12.994743,12.374369
min,19.2,33.7,63.6,55.1,44.9,52.8,68.9,51.3,42.8,39.5,...,24.1,37.5,18.9,35.5,25.3,50.9,43.5,42.2,32.5,35.1
25%,19.2,33.7,63.6,55.1,44.9,52.8,68.9,51.8,47.375,39.5,...,28.9,37.5,24.65,37.85,33.5,51.95,47.15,54.15,41.4,39.475
50%,19.2,33.7,63.6,55.1,44.9,52.8,68.9,52.3,51.95,39.5,...,33.7,37.5,30.4,40.2,41.7,53.0,50.8,66.1,50.3,43.85
75%,19.2,33.7,63.6,55.1,44.9,52.8,68.9,52.8,56.525,39.5,...,38.5,37.5,36.15,49.25,47.75,57.8,51.45,67.95,54.05,48.225
max,19.2,33.7,63.6,55.1,44.9,52.8,68.9,53.3,61.1,39.5,...,43.3,37.5,41.9,58.3,53.8,62.6,52.1,69.8,57.8,52.6
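
The green/orange/red status used at the top of this notebook can be derived from the share of non-missing cells. The sketch below is one possible rule, not the one used to assign the labels above: the helper name `classify_coverage` and the 50% cut-off for "red" are assumptions.

```python
import numpy as np
import pandas as pd

def classify_coverage(df, unusable_below=0.5):
    """Return 'green', 'orange' or 'red' from the fraction of non-missing cells.

    unusable_below is an assumed threshold, not a value from the original analysis.
    """
    coverage = df.notna().to_numpy().mean()
    if coverage == 1.0:
        return 'green'   # complete data
    if coverage < unusable_below:
        return 'red'     # too sparse to use
    return 'orange'      # usable, with gaps

complete = pd.DataFrame({'2015-01': [1.0, 2.0], '2015-02': [3.0, 4.0]})
gappy = pd.DataFrame({'2015-01': [1.0, np.nan], '2015-02': [3.0, 4.0]})
sparse = pd.DataFrame({'2015-01': [19.2, np.nan], '2015-02': [np.nan, np.nan]})

print(classify_coverage(complete), classify_coverage(gappy), classify_coverage(sparse))
# green orange red
```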


In [39]:
PSUN_param = 'PSUN'
PSUN_df = pd.read_csv(datadir+'external/'+'PSUN201501-201809.csv',index_col=0)
PSUN_df2 = retrieve_df(PSUN_param,PSUN_df)

2015-01 Alabama No Data
2015-01 Alaska No Data
2015-01 Arizona No Data
2015-01 Arkansas No Data
2015-01 California No Data
2015-01 Colorado No Data
2015-01 Connecticut No Data
2015-01 Delaware No Data
2015-01 District of Columbia No Data
2015-01 Florida No Data
2015-01 Georgia No Data
2015-01 Hawaii No Data
2015-01 Idaho No Data
2015-01 Illinois No Data
2015-01 Indiana No Data
2015-01 Iowa No Data
2015-01 Kansas No Data
2015-01 Kentucky No Data
2015-01 Louisiana No Data
2015-01 Maine No Data
2015-01 Maryland No Data
2015-01 Massachusetts No Data
2015-01 Minnesota No Data
2015-01 Mississippi No Data
2015-01 Missouri No Data
2015-01 Montana No Data
2015-01 Nebraska No Data
2015-01 Nevada No Data
2015-01 New Hampshire No Data
2015-01 New Jersey No Data
2015-01 New Mexico No Data
2015-01 New York No Data
2015-01 North Carolina No Data
2015-01 North Dakota No Data
2015-01 Ohio No Data
2015-01 Oklahoma No Data
2015-01 Oregon No Data
2015-01 Pennsylvania No Data
2015-01 Rhode Island No Data
2

2015-07 Maryland No Data
2015-07 Massachusetts No Data
2015-07 Minnesota No Data
2015-07 Mississippi No Data
2015-07 Missouri No Data
2015-07 Montana No Data
2015-07 Nebraska No Data
2015-07 Nevada No Data
2015-07 New Hampshire No Data
2015-07 New Jersey No Data
2015-07 New Mexico No Data
2015-07 New York No Data
2015-07 North Carolina No Data
2015-07 North Dakota No Data
2015-07 Ohio No Data
2015-07 Oklahoma No Data
2015-07 Oregon No Data
2015-07 Pennsylvania No Data
2015-07 Rhode Island No Data
2015-07 South Carolina No Data
2015-07 South Dakota No Data
2015-07 Tennessee No Data
2015-07 Texas No Data
2015-07 Utah No Data
2015-07 Vermont No Data
2015-07 Virginia No Data
2015-07 Washington No Data
2015-07 West Virginia No Data
2015-07 Wisconsin No Data
2015-07 Wyoming No Data
2015-08 Alabama No Data
2015-08 Alaska No Data
2015-08 Arizona No Data
2015-08 Arkansas No Data
2015-08 California No Data
2015-08 Colorado No Data
2015-08 Connecticut No Data
2015-08 Delaware No Data
2015-08 Dist

2016-01 Utah No Data
2016-01 Vermont No Data
2016-01 Virginia No Data
2016-01 Washington No Data
2016-01 West Virginia No Data
2016-01 Wisconsin No Data
2016-01 Wyoming No Data
2016-02 Alabama No Data
2016-02 Alaska No Data
2016-02 Arizona No Data
2016-02 Arkansas No Data
2016-02 California No Data
2016-02 Colorado No Data
2016-02 Connecticut No Data
2016-02 Delaware No Data
2016-02 District of Columbia No Data
2016-02 Florida No Data
2016-02 Georgia No Data
2016-02 Hawaii No Data
2016-02 Idaho No Data
2016-02 Illinois No Data
2016-02 Indiana No Data
2016-02 Iowa No Data
2016-02 Kansas No Data
2016-02 Kentucky No Data
2016-02 Louisiana No Data
2016-02 Maine No Data
2016-02 Maryland No Data
2016-02 Minnesota No Data
2016-02 Mississippi No Data
2016-02 Missouri No Data
2016-02 Montana No Data
2016-02 Nebraska No Data
2016-02 Nevada No Data
2016-02 New Jersey No Data
2016-02 New Mexico No Data
2016-02 New York No Data
2016-02 North Carolina No Data
2016-02 North Dakota No Data
2016-02 Ohi

2016-08 Ohio No Data
2016-08 Oklahoma No Data
2016-08 Oregon No Data
2016-08 Pennsylvania No Data
2016-08 Rhode Island No Data
2016-08 South Carolina No Data
2016-08 South Dakota No Data
2016-08 Tennessee No Data
2016-08 Texas No Data
2016-08 Utah No Data
2016-08 Vermont No Data
2016-08 Virginia No Data
2016-08 Washington No Data
2016-08 West Virginia No Data
2016-08 Wisconsin No Data
2016-08 Wyoming No Data
2016-09 Alabama No Data
2016-09 Alaska No Data
2016-09 Arizona No Data
2016-09 Arkansas No Data
2016-09 California No Data
2016-09 Colorado No Data
2016-09 Connecticut No Data
2016-09 Delaware No Data
2016-09 District of Columbia No Data
2016-09 Georgia No Data
2016-09 Hawaii No Data
2016-09 Idaho No Data
2016-09 Illinois No Data
2016-09 Indiana No Data
2016-09 Iowa No Data
2016-09 Kansas No Data
2016-09 Kentucky No Data
2016-09 Louisiana No Data
2016-09 Maine No Data
2016-09 Maryland No Data
2016-09 Minnesota No Data
2016-09 Mississippi No Data
2016-09 Missouri No Data
2016-09 Mon

2017-03 Minnesota No Data
2017-03 Mississippi No Data
2017-03 Missouri No Data
2017-03 Montana No Data
2017-03 Nebraska No Data
2017-03 Nevada No Data
2017-03 New Jersey No Data
2017-03 New Mexico No Data
2017-03 New York No Data
2017-03 North Carolina No Data
2017-03 North Dakota No Data
2017-03 Ohio No Data
2017-03 Oklahoma No Data
2017-03 Oregon No Data
2017-03 Pennsylvania No Data
2017-03 Rhode Island No Data
2017-03 South Carolina No Data
2017-03 South Dakota No Data
2017-03 Tennessee No Data
2017-03 Texas No Data
2017-03 Utah No Data
2017-03 Vermont No Data
2017-03 Virginia No Data
2017-03 Washington No Data
2017-03 West Virginia No Data
2017-03 Wisconsin No Data
2017-03 Wyoming No Data
2017-04 Alabama No Data
2017-04 Alaska No Data
2017-04 Arizona No Data
2017-04 Arkansas No Data
2017-04 California No Data
2017-04 Colorado No Data
2017-04 Connecticut No Data
2017-04 Delaware No Data
2017-04 District of Columbia No Data
2017-04 Georgia No Data
2017-04 Hawaii No Data
2017-04 Idaho

2017-10 Georgia No Data
2017-10 Hawaii No Data
2017-10 Idaho No Data
2017-10 Illinois No Data
2017-10 Indiana No Data
2017-10 Iowa No Data
2017-10 Kansas No Data
2017-10 Kentucky No Data
2017-10 Louisiana No Data
2017-10 Maine No Data
2017-10 Maryland No Data
2017-10 Minnesota No Data
2017-10 Mississippi No Data
2017-10 Missouri No Data
2017-10 Montana No Data
2017-10 Nebraska No Data
2017-10 Nevada No Data
2017-10 New Jersey No Data
2017-10 New Mexico No Data
2017-10 New York No Data
2017-10 North Carolina No Data
2017-10 North Dakota No Data
2017-10 Ohio No Data
2017-10 Oklahoma No Data
2017-10 Oregon No Data
2017-10 Pennsylvania No Data
2017-10 Rhode Island No Data
2017-10 South Carolina No Data
2017-10 South Dakota No Data
2017-10 Tennessee No Data
2017-10 Texas No Data
2017-10 Utah No Data
2017-10 Vermont No Data
2017-10 Virginia No Data
2017-10 Washington No Data
2017-10 West Virginia No Data
2017-10 Wisconsin No Data
2017-10 Wyoming No Data
2017-11 Alabama No Data
2017-11 Alaska

2018-04 Texas No Data
2018-04 Utah No Data
2018-04 Vermont No Data
2018-04 Virginia No Data
2018-04 Washington No Data
2018-04 West Virginia No Data
2018-04 Wisconsin No Data
2018-04 Wyoming No Data
2018-05 Alabama No Data
2018-05 Alaska No Data
2018-05 Arizona No Data
2018-05 Arkansas No Data
2018-05 California No Data
2018-05 Colorado No Data
2018-05 Connecticut No Data
2018-05 Delaware No Data
2018-05 District of Columbia No Data
2018-05 Florida No Data
2018-05 Georgia No Data
2018-05 Hawaii No Data
2018-05 Idaho No Data
2018-05 Illinois No Data
2018-05 Indiana No Data
2018-05 Iowa No Data
2018-05 Kansas No Data
2018-05 Kentucky No Data
2018-05 Louisiana No Data
2018-05 Maine No Data
2018-05 Maryland No Data
2018-05 Minnesota No Data
2018-05 Mississippi No Data
2018-05 Missouri No Data
2018-05 Montana No Data
2018-05 Nebraska No Data
2018-05 Nevada No Data
2018-05 New Jersey No Data
2018-05 New Mexico No Data
2018-05 New York No Data
2018-05 North Carolina No Data
2018-05 North Dako

In [42]:
PSUN_df2.to_csv(datadir+'external/'+'PSUN201501-201809.csv')

In [44]:
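# Reload the saved data and drop the first column (state_id)
PSUN_df2 = pd.read_csv(datadir+'external/'+'PSUN201501-201809.csv', index_col=0).iloc[:, 1:]
# Average the monthly columns within each quarter.
# Grouping on the transposed frame avoids groupby(axis=1), which is deprecated in recent pandas versions.
aggregated_by_quarter = PSUN_df2.T.groupby(pd.PeriodIndex(PSUN_df2.columns, freq='Q')).mean().T
aggregated_by_quarter.head()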

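| State | 2015Q1 | 2015Q2 | 2015Q3 | 2015Q4 | 2016Q1 | 2016Q2 | 2016Q3 | 2016Q4 | 2017Q1 | 2017Q2 | 2017Q3 | 2017Q4 | 2018Q1 | 2018Q2 | 2018Q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Alaska | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Arizona | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Arkansas | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| California | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |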
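
The all-empty quarterly averages above suggest that PSUN is too sparse to use. One way to put a number on that is to compute the share of missing months per state. The sketch below is only an illustration: `toy` is a hypothetical stand-in for a state-by-month table such as `PSUN_df2`, and its values are made up, not NOAA data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a state-by-month table such as PSUN_df2.
# The numbers are invented for illustration only.
months = ["2015-01", "2015-02", "2015-03", "2015-04"]
toy = pd.DataFrame(
    [[np.nan, np.nan, np.nan, np.nan],
     [55.0, np.nan, 61.0, 58.0]],
    index=["Alabama", "Alaska"],
    columns=months,
)

# Share of missing months per state (1.0 means nothing was reported)
missing_share = toy.isna().mean(axis=1)

# Overall share of missing cells in the whole table
overall_missing = toy.isna().to_numpy().mean()

print(missing_share)
print(overall_missing)
```

Running the same two expressions on the real table would show whether a variable is fully reported, partly reported, or missing almost everywhere.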


#### AWND – Monthly Average Wind Speed

In [None]:
# Monthly Average Wind Speed
# unit = miles per hour
AWND_param = 'AWND'
AWND_df = retrieve_df(AWND_param,df_state)

In [20]:
AWND_df.to_csv(datadir+'external/'+'AWND201501-201809.csv')
# AWND has no missing data for the given time range

In [64]:
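# Reload the saved data and drop the first column (state_id)
AWND_df = pd.read_csv(datadir+'external/'+'AWND201501-201809.csv', index_col=0).iloc[:, 1:]
# Average the monthly columns within each quarter.
# Grouping on the transposed frame avoids groupby(axis=1), which is deprecated in recent pandas versions.
aggregated_by_quarter = AWND_df.T.groupby(pd.PeriodIndex(AWND_df.columns, freq='Q')).mean().T
aggregated_by_quarter.head()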

| State | 2015Q1 | 2015Q2 | 2015Q3 | 2015Q4 | 2016Q1 | 2016Q2 | 2016Q3 | 2016Q4 | 2017Q1 | 2017Q2 | 2017Q3 | 2017Q4 | 2018Q1 | 2018Q2 | 2018Q3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama | 6.321936 | 4.921201 | 4.415556 | 5.892857 | 6.952083 | 5.170384 | 4.210088 | 5.089474 | 6.552941 | 5.488235 | 4.10875 | 5.356005 | 6.999632 | 5.172549 | 3.998897 |
| Alaska | 8.471976 | 7.950941 | 7.52261 | 8.766213 | 9.035234 | 8.347439 | 7.431746 | 8.22921 | 8.344507 | 8.02667 | 7.300506 | 8.968875 | 8.812165 | 8.834016 | 7.193083 |
| Arizona | 5.731774 | 7.72232 | 6.517702 | 9.704971 | 10.1 | 8.990196 | 11.138889 | 9.97037 | 11.071133 | 9.452941 | 6.641176 | 5.748366 | 6.314815 | 8.038889 | 6.340741 |
| Arkansas | 6.613725 | 6.352587 | 4.635294 | 6.0875 | 7.588235 | 5.772004 | 4.414815 | 5.566667 | 7.65 | 6.42963 | 4.334641 | 5.953159 | 7.406318 | 6.264815 | 4.56122 |
| California | 4.940638 | 7.227901 | 6.446914 | 5.61844 | 6.070329 | 7.246497 | 6.628692 | 5.60459 | 6.661572 | 7.538099 | 6.458974 | 5.115787 | 5.891067 | 7.541675 | 6.442122 |
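
To make the quarterly aggregation step easier to follow, here is a minimal sketch on a toy table. The frame `toy` and the label `StateA` are hypothetical and the values are invented. The sketch groups on the transposed frame rather than passing `axis=1`, which should give the same averages for a state-by-month layout.

```python
import pandas as pd

# Hypothetical one-state table with six monthly columns (invented values)
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]],
    index=["StateA"],
    columns=["2015-01", "2015-02", "2015-03", "2015-04", "2015-05", "2015-06"],
)

# Map each monthly column label to its calendar quarter
quarters = pd.PeriodIndex(toy.columns, freq="Q")

# Transpose so months become rows, average within each quarter, transpose back
by_quarter = toy.T.groupby(quarters).mean().T

print(by_quarter)
```

Each quarterly value is the plain mean of the months available in that quarter, so a quarter with some missing months is averaged over the remaining ones, and a quarter with no data at all stays empty.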
