In [1]:
import pandas as pd

# Weather Dataset

This dataset was requested through NOAA's [Climate Data Online search tool](https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND). It contains daily summaries of all observed weather data for stations corresponding to New York City from 2018-07-01 to 2021-10-31. See the [GHCND documentation](https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf) for the datasheet for this dataset.

In [2]:
weather_df = pd.read_csv('../../data/weather_data_2018-07-01_2021-10-31.csv')

This dataset has one row per station per day. Not every station reports every variable, so the first step is to group the rows by date and average each variable over the stations that reported it that day:

In [3]:
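# numeric_only=True skips non-numeric columns, which recent pandas versions refuse to average
weather_df_by_date = weather_df.groupby('DATE').mean(numeric_only=True)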
weather_df_by_date

DATE,AWND,DAPR,MDPR,PGTM,PRCP,PSUN,SNOW,SNWD,TAVG,TMAX,...,WSF5,WT01,WT02,WT03,WT04,WT05,WT06,WT08,WT09,WT11
2018-07-01,4.39000,,,1349.250,0.000000,,0.0,0.0,86.000000,95.666667,...,15.7750,,,,,,,,,
2018-07-02,5.39625,2.0,0.000,1517.250,0.000000,,0.0,0.0,85.333333,95.076923,...,20.1000,,,,,,,1.0,,
2018-07-03,4.14000,4.0,0.000,1376.000,0.060000,,0.0,0.0,81.333333,93.307692,...,22.0000,1.0,1.0,1.0,,,,1.0,,
2018-07-04,4.16625,,,1270.250,0.358795,,0.0,0.0,80.666667,90.250000,...,16.1875,1.0,,1.0,,,,1.0,,
2018-07-05,8.19375,18.0,1.540,1559.250,0.022561,,0.0,0.0,81.000000,89.666667,...,24.0250,1.0,,1.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-10-27,13.67375,2.0,3.775,848.500,1.602989,,0.0,0.0,59.000000,62.923077,...,35.6125,1.0,,,,,,,,
2021-10-28,6.34750,,,1180.500,0.008101,,0.0,0.0,55.000000,59.615385,...,18.4750,,,,,,,,,
2021-10-29,13.67250,,,1912.750,0.043947,,0.0,0.0,54.333333,57.076923,...,40.1500,1.0,,1.0,,,,,,
2021-10-30,8.47125,6.0,4.170,119.750,0.648353,,0.0,0.0,60.000000,62.461538,...,33.2250,1.0,,,,,,,,
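
To make the averaging step concrete, here is a minimal sketch on a made-up two-station frame (the station IDs and values are hypothetical, not taken from the NOAA file). It shows that `mean()` ignores `NaN` within each date group, and that a variable no station reports stays `NaN` after grouping. The `numeric_only=True` argument is included so the sketch also runs on pandas versions that refuse to average string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two stations reporting on two days.
toy_df = pd.DataFrame({
    'STATION': ['A', 'B', 'A', 'B'],
    'DATE': ['2018-07-01', '2018-07-01', '2018-07-02', '2018-07-02'],
    'TMAX': [95.0, 97.0, 94.0, np.nan],         # station B did not report TMAX on day 2
    'PSUN': [np.nan, np.nan, np.nan, np.nan],   # no station reports PSUN
})

# NaN values are skipped within each group, so each mean uses only
# the stations that actually reported that variable on that day.
toy_by_date = toy_df.groupby('DATE').mean(numeric_only=True)
print(toy_by_date)
```

In this toy frame the `TMAX` mean for 2018-07-02 comes from station A alone, while `PSUN` remains `NaN` on both days.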


We drop the `MDPR` (Multi-day precipitation total), `DAPR` (Days included in `MDPR`), `PSUN` (Daily percent of possible sunshine), and `TSUN` (Daily total sunshine) columns, as these are sparsely populated:

In [4]:
weather_df_by_date = weather_df_by_date.drop(columns=['DAPR', 'MDPR', 'PSUN', 'TSUN'])
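
One way to quantify "sparsely populated" is the fraction of days on which each column is missing. The sketch below uses a small hypothetical frame, and the 0.5 cutoff is an assumption chosen purely for illustration; the notebook itself picks the columns to drop by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical daily frame: PSUN is never reported, MDPR only occasionally.
toy_by_date = pd.DataFrame(
    {
        'PRCP': [0.0, 0.06, 0.36, 0.02],
        'MDPR': [np.nan, 0.0, np.nan, np.nan],
        'PSUN': [np.nan, np.nan, np.nan, np.nan],
    },
    index=pd.Index(['2018-07-01', '2018-07-02', '2018-07-03', '2018-07-04'], name='DATE'),
)

# Fraction of days on which each column is missing.
missing_fraction = toy_by_date.isna().mean()
print(missing_fraction)

# Assumed cutoff (illustration only): a column missing on more than
# half of the days is treated as sparse and dropped.
sparse_columns = missing_fraction[missing_fraction > 0.5].index.tolist()
trimmed = toy_by_date.drop(columns=sparse_columns)
```

Running `weather_df_by_date.isna().mean()` on the real frame would give the same per-column summary.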

After doing this we are left with 15 numerical columns which are fully populated:

In [5]:
weather_df_by_date[['AWND', 'PGTM', 'PRCP', 'SNOW', 'SNWD', 'TAVG', 'TMAX', 'TMIN', 'TOBS', 'WDF2', 'WDF5', 'WESD', 'WESF', 'WSF2', 'WSF5']].dropna()

DATE,AWND,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,TOBS,WDF2,WDF5,WESD,WESF,WSF2,WSF5
2018-07-01,4.39000,1349.250,0.000000,0.0,0.0,86.000000,95.666667,73.416667,76.666667,212.50,175.00,0.0,0.0,11.7625,15.7750
2018-07-02,5.39625,1517.250,0.000000,0.0,0.0,85.333333,95.076923,73.153846,73.600000,161.25,155.00,0.0,0.0,14.9375,20.1000
2018-07-03,4.14000,1376.000,0.060000,0.0,0.0,81.333333,93.307692,74.750000,77.500000,223.75,215.00,0.0,0.0,17.1625,22.0000
2018-07-04,4.16625,1270.250,0.358795,0.0,0.0,80.666667,90.250000,74.833333,77.333333,136.25,137.50,0.0,0.0,13.0125,16.1875
2018-07-05,8.19375,1559.250,0.022561,0.0,0.0,81.000000,89.666667,74.153846,75.600000,177.50,188.75,0.0,0.0,18.5125,24.0250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-10-27,13.67375,848.500,1.602989,0.0,0.0,59.000000,62.923077,52.769231,55.800000,142.50,218.75,0.0,0.0,25.6875,35.6125
2021-10-28,6.34750,1180.500,0.008101,0.0,0.0,55.000000,59.615385,46.769231,47.200000,211.25,133.75,0.0,0.0,14.0500,18.4750
2021-10-29,13.67250,1912.750,0.043947,0.0,0.0,54.333333,57.076923,45.769231,47.200000,86.25,86.25,0.0,0.0,28.7500,40.1500
2021-10-30,8.47125,119.750,0.648353,0.0,0.0,60.000000,62.461538,51.615385,54.000000,83.75,82.50,0.0,0.0,24.0625,33.2250


We also have 9 boolean categorical columns indicating the day's observed weather types. Note that these columns are not mutually exclusive:

In [6]:
weather_df_by_date[['WT01', 'WT02', 'WT03', 'WT04', 'WT05', 'WT06', 'WT08', 'WT09', 'WT11']]

DATE,WT01,WT02,WT03,WT04,WT05,WT06,WT08,WT09,WT11
2018-07-01,,,,,,,,,
2018-07-02,,,,,,,1.0,,
2018-07-03,1.0,1.0,1.0,,,,1.0,,
2018-07-04,1.0,,1.0,,,,1.0,,
2018-07-05,1.0,,1.0,,,,,,
...,...,...,...,...,...,...,...,...,...
2021-10-27,1.0,,,,,,,,
2021-10-28,,,,,,,,,
2021-10-29,1.0,,1.0,,,,,,
2021-10-30,1.0,,,,,,,,


We can see that these columns have `NaN` if the corresponding weather type was not observed for that day. We'll fill in those values with `0`:

In [7]:
weather_df_by_date = weather_df_by_date.fillna(0)
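
Calling `fillna(0)` on the whole frame is safe here only because the remaining numeric columns have no gaps. As a more defensive alternative (a sketch, not what this notebook does), the fill can be restricted to the weather-type columns so that a missing measurement is never silently turned into `0`. The toy frame and the prefix-based column selection below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing measurement and some unobserved weather types.
toy_by_date = pd.DataFrame(
    {
        'TMAX': [95.7, np.nan],
        'WT01': [np.nan, 1.0],
        'WT08': [1.0, np.nan],
    },
    index=pd.Index(['2018-07-01', '2018-07-02'], name='DATE'),
)

# Select the weather-type indicator columns by their 'WT' prefix.
wt_columns = [col for col in toy_by_date.columns if col.startswith('WT')]

# Fill only the indicator columns with 0, then store them as integers.
filled = (
    toy_by_date
    .fillna({col: 0 for col in wt_columns})
    .astype({col: int for col in wt_columns})
)
print(filled)
```

The missing `TMAX` value stays `NaN` in `filled`, so it remains visible to later checks.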

In [8]:
weather_df_by_date

DATE,AWND,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,TMIN,TOBS,WDF2,...,WSF5,WT01,WT02,WT03,WT04,WT05,WT06,WT08,WT09,WT11
2018-07-01,4.39000,1349.250,0.000000,0.0,0.0,86.000000,95.666667,73.416667,76.666667,212.50,...,15.7750,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2018-07-02,5.39625,1517.250,0.000000,0.0,0.0,85.333333,95.076923,73.153846,73.600000,161.25,...,20.1000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2018-07-03,4.14000,1376.000,0.060000,0.0,0.0,81.333333,93.307692,74.750000,77.500000,223.75,...,22.0000,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2018-07-04,4.16625,1270.250,0.358795,0.0,0.0,80.666667,90.250000,74.833333,77.333333,136.25,...,16.1875,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2018-07-05,8.19375,1559.250,0.022561,0.0,0.0,81.000000,89.666667,74.153846,75.600000,177.50,...,24.0250,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-10-27,13.67375,848.500,1.602989,0.0,0.0,59.000000,62.923077,52.769231,55.800000,142.50,...,35.6125,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-10-28,6.34750,1180.500,0.008101,0.0,0.0,55.000000,59.615385,46.769231,47.200000,211.25,...,18.4750,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-10-29,13.67250,1912.750,0.043947,0.0,0.0,54.333333,57.076923,45.769231,47.200000,86.25,...,40.1500,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-10-30,8.47125,119.750,0.648353,0.0,0.0,60.000000,62.461538,51.615385,54.000000,83.75,...,33.2250,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The result has 1219 rows, which matches the number of calendar days from 2018-07-01 to 2021-10-31 inclusive, so every day in the range has at least one observation.
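
The expected row count can be checked with `pd.date_range` rather than by hand. The sketch below also shows how a missing day would be detected; `observed_days` is a hypothetical stand-in for the real index with one day deliberately removed.

```python
import pandas as pd

# Every calendar day in the requested range, endpoints included.
expected_days = pd.date_range('2018-07-01', '2021-10-31', freq='D')
print(len(expected_days))  # 1219

# Hypothetical stand-in for the real index: drop the third day to simulate a gap.
observed_days = expected_days.delete(2)

# Days that are expected but not observed.
missing_days = expected_days.difference(observed_days)
print(missing_days)
```

On the real data, `observed_days` would be `pd.to_datetime(weather_df_by_date.index)`, and an empty `missing_days` would confirm full coverage.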

Every column is also now fully populated, with no missing values remaining. Their definitions are given below.

The following columns are continuous:
* `AWND`: Average wind speed (mph)
* `PGTM`: Peak gust time (hours and minutes, i.e., HHMM)
* `PRCP`: Amount of precipitation (inches)
* `SNOW`: Amount of snowfall (inches)
* `SNWD`: Snow depth on ground (inches)
* `TAVG`: Average temperature (degrees F)
* `TMAX`: Maximum temperature (degrees F)
* `TMIN`: Minimum temperature (degrees F)
* `TOBS`: Temperature at the time of observation (degrees F)
* `WESD`: Water equivalent of snow on the ground (inches)
* `WESF`: Water equivalent of snowfall (inches)
* `WSF2`: Fastest 2-minute wind speed (mph)
* `WSF5`: Fastest 5-second wind speed (mph)

The following two columns are also numeric, but they measure a direction in degrees. Direction is circular (0 and 360 degrees point the same way), so their raw magnitude is not directly meaningful as a machine learning feature:
* `WDF2`: Direction of fastest 2-minute wind (degrees)
* `WDF5`: Direction of fastest 5-second wind (degrees)
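
A common way to make a direction usable as a feature is to encode it as a sine/cosine pair, so that angles close on the compass are also close numerically. This is a sketch of that general technique, not a step this notebook performs; the values and the `_sin`/`_cos` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical wind directions in degrees; 350 and 10 are only 20 degrees apart.
toy = pd.DataFrame({'WDF2': [350.0, 10.0, 180.0]})

# Map each angle to a point on the unit circle.
angle_rad = np.deg2rad(toy['WDF2'])
toy['WDF2_sin'] = np.sin(angle_rad)
toy['WDF2_cos'] = np.cos(angle_rad)
print(toy)
```

In the encoded space, 350 and 10 degrees sit next to each other, while 10 and 180 degrees are far apart, which the raw degree values do not reflect.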

The following columns contain a `1` if the specified weather condition was observed and a `0` otherwise:
* `WT01`: Fog, ice fog, or freezing fog (may include heavy fog)
* `WT02`: Heavy fog or heavy freezing fog (not always distinguished from fog)
* `WT03`: Thunder
* `WT04`: Ice pellets, sleet, snow pellets, or small hail
* `WT05`: Hail (may include small hail)
* `WT06`: Glaze or rime
* `WT08`: Smoke or haze
* `WT09`: Blowing or drifting snow
* `WT11`: High or damaging winds



We now write this normalized dataframe to disk for safe-keeping:

In [10]:
weather_df_by_date.to_csv('../../data/weather_data_normalized_2018-07-01_2021-10-31.csv')
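
When the file is read back later, `DATE` comes back as an ordinary column unless it is named as the index again. The sketch below round-trips a small hypothetical frame through an in-memory buffer instead of a file on disk; passing `parse_dates` is an optional extra that turns the index into real timestamps.

```python
import io

import pandas as pd

# Hypothetical stand-in for the frame written above.
toy_by_date = pd.DataFrame(
    {'TMAX': [95.5, 94.0], 'WT01': [0.0, 1.0]},
    index=pd.Index(['2018-07-01', '2018-07-02'], name='DATE'),
)

# Write to an in-memory buffer (a file path works the same way).
buffer = io.StringIO()
toy_by_date.to_csv(buffer)
buffer.seek(0)

# Restore DATE as the index and parse it into timestamps.
restored = pd.read_csv(buffer, index_col='DATE', parse_dates=['DATE'])
print(restored)
```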