In [1]:
import pandas as pd

# Weather Dataset

This dataset was obtained by a request using the tool [here](https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND). See [here](https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf) for the datasheet for this dataset.

We downloaded all daily summaries from 2013-01-01 through 2021-10-31, which is the full set of days for which we have crash data. Because of limits on the NCDC website, we had to download the data in 4 chunks.

In [2]:
weather_df_1 = pd.read_csv('../../data/raw_data/weather/weather_data_2013-01-01_2013-06-30.csv')
weather_df_2 = pd.read_csv('../../data/raw_data/weather/weather_data_2013-07-01_2015-12-31.csv')
weather_df_3 = pd.read_csv('../../data/raw_data/weather/weather_data_2016-01-01_2018-06-30.csv')
weather_df_4 = pd.read_csv('../../data/raw_data/weather/weather_data_2018-07-01_2021-10-31.csv')

Our first step is to concatenate these four chunks into one dataframe:

In [3]:
weather_df = pd.concat((weather_df_1, weather_df_2, weather_df_3, weather_df_4))
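Because the chunks were downloaded separately, it can be worth confirming that their date ranges do not overlap before aggregating. The following is a minimal sketch on made-up data; the `STATION` column name is an assumption about the raw CSV layout, and the values are purely illustrative.

```python
import pandas as pd

# Toy stand-ins for two downloaded chunks (all values are made up).
# The STATION column name is an assumption about the raw files.
chunk_a = pd.DataFrame({
    "STATION": ["S1", "S2"],
    "DATE": ["2013-06-29", "2013-06-30"],
    "PRCP": [0.0, 0.1],
})
chunk_b = pd.DataFrame({
    "STATION": ["S1", "S2"],
    "DATE": ["2013-07-01", "2013-07-01"],
    "PRCP": [0.2, 0.0],
})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating each chunk's own row labels.
combined = pd.concat([chunk_a, chunk_b], ignore_index=True)

# A (station, date) pair that appears twice would indicate overlapping chunks.
n_duplicates = int(combined.duplicated(subset=["STATION", "DATE"]).sum())
print(n_duplicates)
```

A nonzero count would suggest dropping the duplicated rows before taking per-day averages, since repeated rows would otherwise be counted twice.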

This dataset has a row for each daily observation from each station. Not all stations include all information that we want. The first thing we want to do is group values by day and then take the average over everything available:

In [4]:
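# numeric_only=True skips non-numeric columns (e.g. station identifiers),
# which pandas >= 2.0 no longer drops silently when averaging.
weather_df_by_date = weather_df.groupby('DATE').mean(numeric_only=True)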
weather_df_by_date

DATE,AWND,DAPR,DASF,MDPR,MDSF,PGTM,PRCP,PSUN,SNOW,SNWD,...,WT09,WT10,WT11,WT13,WT14,WT15,WT16,WT18,WT19,WT22
2013-01-01,11.23875,9.000000,,2.130,,1843.500,0.000741,,0.0,0.722222,...,,,,,,,1.0,1.0,,
2013-01-02,9.70375,8.666667,,1.830,,1201.750,0.000000,,0.0,0.350000,...,,,,,,,,,,
2013-01-03,7.40750,7.000000,,0.300,,941.250,0.000000,,0.0,0.287500,...,,,,,,,,,,
2013-01-04,11.32250,20.000000,,6.680,,1220.750,0.000000,,0.0,0.273333,...,,,,,,,,,,
2013-01-05,8.02625,,,,,1126.500,0.000000,,0.0,0.178571,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-10-27,13.67375,2.000000,,3.775,,848.500,1.602989,,0.0,0.000000,...,,,,,,,,,,
2021-10-28,6.34750,,,,,1180.500,0.008101,,0.0,0.000000,...,,,,,,,,,,
2021-10-29,13.67250,,,,,1912.750,0.043947,,0.0,0.000000,...,,,,,,,,,,
2021-10-30,8.47125,6.000000,,4.170,,119.750,0.648353,,0.0,0.000000,...,,,,,,,,,,
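To illustrate what the per-day average does with missing station values, here is a small sketch on made-up data (the `STATION` column and all numbers are illustrative assumptions). The mean skips `NaN` entries, so a day's value is the average over the stations that reported it, and it stays `NaN` only when no station reported it.

```python
import numpy as np
import pandas as pd

# Made-up observations: three stations on the first day, one on the second.
toy = pd.DataFrame({
    "STATION": ["S1", "S2", "S3", "S1"],
    "DATE": ["2013-01-01", "2013-01-01", "2013-01-01", "2013-01-02"],
    "TMAX": [40.0, 38.0, np.nan, 35.0],
    "WT01": [np.nan, np.nan, np.nan, 1.0],
})

# NaN values are ignored within each group; text columns are skipped.
by_date = toy.groupby("DATE").mean(numeric_only=True)
print(by_date)
```

Here `TMAX` for 2013-01-01 is the mean of the two reporting stations, while `WT01` for that day remains `NaN` because no station recorded it.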


We drop the following columns because they are sparsely populated:
* `MDSF`: Multi-day snowfall total
* `DASF`: Days included in `MDSF`
* `MDPR`: Multi-day precipitation total
* `DAPR`: Days included in `MDPR`
* `PSUN`: Daily percent of possible sunshine
* `TSUN`: Daily total sunshine
* `TAVG`: Average temperature (degrees F)
* `WESD`: Water equivalent of snow on the ground (inches)
* `WESF`: Water equivalent of snowfall (inches)
* `WSF2`: Fastest 2-minute wind speed (mph)
* `WSF5`: Fastest 5-second wind speed (mph)
* `WDF2`: Direction of fastest 2-minute wind (degrees)
* `WDF5`: Direction of fastest 5-second wind (degrees)

In [5]:
weather_df_by_date = weather_df_by_date.drop(columns=['DAPR', 'MDPR', 'DASF', 'MDSF', 'PSUN', 'TSUN', 'TAVG', 'WESD', 'WESF', 'WSF2', 'WSF5', 'WDF2', 'WDF5'])
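One way to quantify "sparsely populated" is to compute the fraction of days on which each column has a value. The sketch below uses made-up data, and the 0.5 cutoff is a hypothetical threshold chosen only for illustration; it is not the criterion used to build the list above.

```python
import numpy as np
import pandas as pd

# Made-up per-day values: PRCP is always present, PSUN rarely is.
toy = pd.DataFrame(
    {
        "PRCP": [0.0, 0.1, 0.0, 0.3],
        "PSUN": [np.nan, np.nan, 50.0, np.nan],
    },
    index=["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"],
)

# Fraction of rows with a non-missing value, per column.
coverage = toy.notna().mean()

# Hypothetical cutoff: flag columns populated on fewer than half the days.
threshold = 0.5
sparse_columns = coverage[coverage < threshold].index.tolist()
print(coverage)
print(sparse_columns)
```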

After doing this, we are left with 8 numerical columns, which are fully populated:

In [6]:
weather_df_by_date[['AWND', 'PGTM', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS']].dropna()

DATE,AWND,PGTM,PRCP,SNOW,SNWD,TMAX,TMIN,TOBS
2013-01-01,11.23875,1843.500,0.000741,0.0,0.722222,39.785714,27.571429,33.333333
2013-01-02,9.70375,1201.750,0.000000,0.0,0.350000,35.500000,22.285714,24.666667
2013-01-03,7.40750,941.250,0.000000,0.0,0.287500,32.714286,22.642857,25.333333
2013-01-04,11.32250,1220.750,0.000000,0.0,0.273333,37.000000,27.071429,33.333333
2013-01-05,8.02625,1126.500,0.000000,0.0,0.178571,41.714286,29.714286,33.333333
...,...,...,...,...,...,...,...,...
2021-10-27,13.67375,848.500,1.602989,0.0,0.000000,62.923077,52.769231,55.800000
2021-10-28,6.34750,1180.500,0.008101,0.0,0.000000,59.615385,46.769231,47.200000
2021-10-29,13.67250,1912.750,0.043947,0.0,0.000000,57.076923,45.769231,47.200000
2021-10-30,8.47125,119.750,0.648353,0.0,0.000000,62.461538,51.615385,54.000000


We also have 17 boolean indicator columns for the day's observed weather types; nine of them are shown below. Note that these columns are not mutually exclusive:

In [7]:
weather_df_by_date[['WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT08', 'WT09', 'WT11']]

DATE,WT01,WT02,WT03,WT04,WT05,WT06,WT08,WT09,WT11
2013-01-01,,,,,,,,,
2013-01-02,,,,1.0,,,1.0,,
2013-01-03,,,,,,,,,
2013-01-04,,,,,,,,,
2013-01-05,,,,,,,,,
...,...,...,...,...,...,...,...,...,...
2021-10-27,1.0,,,,,,,,
2021-10-28,,,,,,,,,
2021-10-29,1.0,,1.0,,,,,,
2021-10-30,1.0,,,,,,,,


We can see that these columns have `NaN` if the corresponding weather type was not observed for that day. We'll fill in those values with `0`:

In [8]:
weather_df_by_date = weather_df_by_date.fillna(0)
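Note that `fillna(0)` on the whole dataframe fills every column, not only the weather-type indicators. A more targeted variant, sketched here on made-up data, limits the fill to columns whose names start with `WT`, so that a missing measurement is never silently turned into `0`. This is an alternative under that assumption, not a change to the cell above.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing measurement and some missing indicators.
toy = pd.DataFrame({
    "TMAX": [40.0, np.nan],
    "WT01": [np.nan, 1.0],
    "WT16": [1.0, np.nan],
})

# Select the weather-type indicator columns by their name prefix.
wt_columns = [col for col in toy.columns if col.startswith("WT")]

# Fill only the indicators; missing measurements stay NaN.
filled = toy.copy()
filled[wt_columns] = filled[wt_columns].fillna(0)
print(filled)
```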

In [9]:
weather_df_by_date

DATE,AWND,PGTM,PRCP,SNOW,SNWD,TMAX,TMIN,TOBS,WT01,WT02,...,WT09,WT10,WT11,WT13,WT14,WT15,WT16,WT18,WT19,WT22
2013-01-01,11.23875,1843.500,0.000741,0.0,0.722222,39.785714,27.571429,33.333333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
2013-01-02,9.70375,1201.750,0.000000,0.0,0.350000,35.500000,22.285714,24.666667,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2013-01-03,7.40750,941.250,0.000000,0.0,0.287500,32.714286,22.642857,25.333333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2013-01-04,11.32250,1220.750,0.000000,0.0,0.273333,37.000000,27.071429,33.333333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2013-01-05,8.02625,1126.500,0.000000,0.0,0.178571,41.714286,29.714286,33.333333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-10-27,13.67375,848.500,1.602989,0.0,0.000000,62.923077,52.769231,55.800000,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-10-28,6.34750,1180.500,0.008101,0.0,0.000000,59.615385,46.769231,47.200000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-10-29,13.67250,1912.750,0.043947,0.0,0.000000,57.076923,45.769231,47.200000,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-10-30,8.47125,119.750,0.648353,0.0,0.000000,62.461538,51.615385,54.000000,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The resulting dataframe has 3226 rows. This matches the number of days from 2013-01-01 through 2021-10-31, so every day in the range has at least one observation.
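The expected row count can be checked directly with `pd.date_range`. The sketch below also shows, on a made-up three-day example, how any missing dates could be listed if the counts did not match.

```python
import pandas as pd

# Every calendar day in the study period, inclusive of both endpoints.
expected_days = pd.date_range("2013-01-01", "2021-10-31", freq="D")
print(len(expected_days))

# Made-up example: a short window where the middle day has no observations.
observed = pd.to_datetime(pd.Index(["2013-01-01", "2013-01-03"]))
window = pd.date_range("2013-01-01", "2013-01-03", freq="D")

# Dates in the window that never appear in the observed index.
missing = window.difference(observed)
print(missing)
```

The same `difference` call, applied to the real date index after converting it with `pd.to_datetime`, would return an empty index when no days are missing.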

In [10]:
weather_df_by_date.columns

Index(['AWND', 'PGTM', 'PRCP', 'SNOW', 'SNWD', 'TMAX', 'TMIN', 'TOBS', 'WT01',
       'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT08', 'WT09', 'WT10', 'WT11',
       'WT13', 'WT14', 'WT15', 'WT16', 'WT18', 'WT19', 'WT22'],
      dtype='object')

Every remaining column is now fully populated, with no missing values. Below are the column definitions.

The following columns are continuous:
* `AWND`: Average wind speed (mph)
* `PGTM`: Peak gust time (hours and minutes, i.e., HHMM)
* `PRCP`: Amount of precipitation (inches)
* `SNOW`: Amount of snowfall (inches)
* `SNWD`: Snow depth on ground (inches)
* `TMAX`: Maximum temperature (degrees F)
* `TMIN`: Minimum temperature (degrees F)
* `TOBS`: Temperature at the time of observation (degrees F)

The following columns contain a `1` if the specified weather condition was observed, and a `0` otherwise:
* `WT01`: Fog, ice fog, or freezing fog (may include heavy fog)
* `WT02`: Heavy fog or heavy freezing fog (not always distinguished from fog)
* `WT03`: Thunder
* `WT04`: Ice pellets, sleet, snow pellets, or small hail
* `WT05`: Hail (may include small hail)
* `WT06`: Glaze or rime
* `WT08`: Smoke or haze
* `WT09`: Blowing or drifting snow
* `WT10`: Tornado, waterspout, or funnel cloud
* `WT11`: High or damaging winds
* `WT13`: Mist
* `WT14`: Drizzle
* `WT15`: Freezing Drizzle
* `WT16`: Rain (may include freezing rain, drizzle, and freezing drizzle)
* `WT18`: Snow, snow pellets, snow grains, or ice crystals
* `WT19`: Unknown source of precipitation
* `WT22`: Ice fog or freezing fog

We now write this normalized dataframe to disk for safe-keeping:

In [11]:
weather_df_by_date.to_csv('../../data/weather_data_normalized_2013-01-01_2021-10-31.csv')
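When the file is read back later, the `DATE` index comes back as an ordinary column unless it is restored explicitly. The following sketch shows the round trip on made-up data, using an in-memory buffer in place of the file on disk.

```python
import io

import pandas as pd

# Made-up frame indexed by date, mimicking the shape of the saved data.
toy = pd.DataFrame(
    {"TMAX": [40.0, 35.5]},
    index=pd.Index(["2013-01-01", "2013-01-02"], name="DATE"),
)

# Write to an in-memory buffer instead of a real file for this example.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col restores DATE as the index; parse_dates=True converts it to datetimes.
restored = pd.read_csv(buffer, index_col="DATE", parse_dates=True)
print(restored.index)
```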