In [96]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

In [97]:
weather_path = 'weather_NY_2010_2018Nov.csv'
w = pd.read_csv(weather_path)

In [98]:
w.head()

Unnamed: 0,USAF,WBAN,StationName,State,Latitude,Longitude,MeanTemp,MinTemp,MaxTemp,DewPoint,Percipitation,WindSpeed,MaxSustainedWind,Gust,Rain,SnowDepth,SnowIce,Year,Month,Day
0,726228,94740,ADIRONDACK REGIONAL ARPT,NY,44.385,-74.207,27.6,24.8,30.9,25.0,0.07,1.3,6.0,,0,,1,2010,1,1
1,726228,94740,ADIRONDACK REGIONAL ARPT,NY,44.385,-74.207,-3.2,-20.9,17.1,-9.6,0.0,3.3,9.9,,0,,1,2010,1,10
2,726228,94740,ADIRONDACK REGIONAL ARPT,NY,44.385,-74.207,20.9,17.1,24.1,15.1,0.0,6.8,12.0,19.0,0,,1,2010,1,11
3,726228,94740,ADIRONDACK REGIONAL ARPT,NY,44.385,-74.207,13.8,5.0,19.9,8.5,,4.4,8.0,15.9,0,,1,2010,1,12
4,726228,94740,ADIRONDACK REGIONAL ARPT,NY,44.385,-74.207,6.3,-8.0,19.0,1.9,0.0,3.3,5.1,,0,,1,2010,1,13


In [99]:
len(w)

160775

In [101]:
w.loc[0]

USAF                                  726228
WBAN                                   94740
StationName         ADIRONDACK REGIONAL ARPT
State                                     NY
Latitude                              44.385
Longitude                            -74.207
MeanTemp                                27.6
MinTemp                                 24.8
MaxTemp                                 30.9
DewPoint                                25.0
Percipitation                           0.07
WindSpeed                                1.3
MaxSustainedWind                         6.0
Gust                                     NaN
Rain                                       0
SnowDepth                                NaN
SnowIce                                    1
Year                                    2010
Month                                      1
Day                                        1
Name: 0, dtype: object

Make smaller dataframe (for testing purposes).

In [102]:
w.loc[0:99].to_csv('weather_abridged.csv', index=False)

In [103]:
wab = pd.read_csv('weather_abridged.csv')
#wab = pd.read_csv('weather_abridged.csv', dtype = {'StationName':'string'}) # can specify dtype for individual columns

Convert date information into one datetime64 column (which I can use for convenient comparison).

In [117]:
w.loc[:, 'Date'] = pd.to_datetime(w.loc[:, ['Year', 'Month', 'Day']])
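
A minimal sketch of the same conversion on a toy frame, assuming only that the component columns are named `Year`, `Month`, and `Day` as in the weather data. The frame `toy` and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the weather data.
toy = pd.DataFrame({'Year': [2010, 2016, 2018],
                    'Month': [1, 8, 11],
                    'Day': [1, 5, 12]})

# Assemble one datetime64 column from the three component columns.
toy['Date'] = pd.to_datetime(toy[['Year', 'Month', 'Day']])

# Datetime columns compare directly against timestamps.
recent = toy[toy['Date'] >= pd.Timestamp('2016-01-01')]
print(recent)
```

Assigning with `toy['Date'] = ...` rather than `toy.loc[:, 'Date'] = ...` is just the simpler spelling for adding a new column.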


Create dataframe with date restricted to between 2016 and 2018 (to compare with 311 calls data).

In [167]:
wd = w.loc[(w['Date']>=np.datetime64('2016')) & (w['Date']<np.datetime64('2019'))]
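
The filter above is a half-open interval: the start of 2016 is included and the start of 2019 is excluded, so every day of 2016 through 2018 is kept. A small sketch with hypothetical dates, using `pd.Timestamp` bounds written out in full as one possible alternative to `np.datetime64('2016')`:

```python
import pandas as pd

# Hypothetical dates straddling both ends of the interval.
dates = pd.Series(pd.to_datetime([
    '2015-12-31', '2016-01-01', '2018-11-12', '2018-12-31', '2019-01-01',
]))

start = pd.Timestamp('2016-01-01')  # included
end = pd.Timestamp('2019-01-01')    # excluded

kept = dates[(dates >= start) & (dates < end)]
print(kept)
```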

In [168]:
print('Min Date: ', np.min(wd.Date))
print('Max Date: ', np.max(wd.Date))

Min Date:  2016-01-01 00:00:00
Max Date:  2018-11-12 00:00:00


In [169]:
len(wd)

51329

It looks like our weather data only goes through November 12 of 2018.

Look at limits on latitude and longitude. These fit with the station names, which are from all over the state of New York. I will likely want to limit to just stations in NYC.

In [170]:
print('Min Latitude: ', np.min(wd.Latitude))
print('Max Latitude: ', np.max(wd.Latitude))

Min Latitude:  40.639
Max Latitude:  44.936
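
Both coordinates can be checked in one call. This is a sketch on a toy frame: the three latitudes and the longitude -74.207 appear in the outputs above, but the other two longitudes and the way the values are paired are hypothetical.

```python
import pandas as pd

# Toy rows; the pairing of latitudes with longitudes is illustrative only.
stations = pd.DataFrame({
    'Latitude': [44.385, 40.639, 44.936],
    'Longitude': [-74.207, -73.8, -74.8],
})

# One call returns the min and max of both coordinate columns.
limits = stations[['Latitude', 'Longitude']].agg(['min', 'max'])
print(limits)
```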


From checking the 311 data, I know the minimum and maximum latitude and longitude values for calls to be:

In [171]:
min_311_lat = 40.49804421521046
max_311_lat = 40.91294056699566
min_311_long = -74.25521082506387
max_311_long = -73.70038354802529

Since I am trying to relate weather and 311 calls, I want to restrict the weather data I am looking at to this same region.

In [172]:
wdp = wd.loc[(wd['Latitude'] > min_311_lat) & (wd['Latitude'] < max_311_lat) &
             (wd['Longitude'] > min_311_long) & (wd['Longitude'] < max_311_long)]
# Note: Selecting stations by name would be slightly more precise, but should give largely the same results.
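
The bounding-box restriction can be sketched as a small helper that checks all four bounds. The function name `in_bounding_box` and the test points are hypothetical; the bounds are the 311 values listed above.

```python
import pandas as pd

# Bounds copied from the 311 call data above.
min_311_lat, max_311_lat = 40.49804421521046, 40.91294056699566
min_311_long, max_311_long = -74.25521082506387, -73.70038354802529

def in_bounding_box(df, lat_min, lat_max, long_min, long_max):
    """Return the rows of df strictly inside the latitude/longitude box."""
    mask = ((df['Latitude'] > lat_min) & (df['Latitude'] < lat_max) &
            (df['Longitude'] > long_min) & (df['Longitude'] < long_max))
    return df.loc[mask]

# Hypothetical points: one inside the box, three outside different bounds.
points = pd.DataFrame({
    'Label': ['inside', 'too far north', 'too far east', 'too far west'],
    'Latitude': [40.7, 44.385, 40.8, 40.6],
    'Longitude': [-74.0, -74.207, -73.0, -75.0],
})

inside = in_bounding_box(points, min_311_lat, max_311_lat,
                         min_311_long, max_311_long)
print(inside['Label'].tolist())
```

The "too far west" point is only dropped because the lower longitude bound is tested as well as the upper one.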

In [173]:
len(wdp)

6417

Check the stations that remain in the dataframe with restricted latitude and longitude. These make sense for NYC.

In [175]:
np.unique(wdp.StationName)

array(['BERGEN POINT', 'CENTRAL PARK',
       'JOHN F KENNEDY INTERNATIONAL AIRPORT', 'KINGS POINT',
       'LA GUARDIA AIRPORT', 'PORT AUTH DOWNTN MANHATTAN WALL ST HEL',
       'THE BATTERY'], dtype=object)

I don't expect too much variation in weather across NYC. For my initial modelling purposes, I will look at averages across all NYC weather stations. I could later check whether 311 calls vary with weather on a more granular level (by using the weather at the closest station), but daily averages seem reasonable for now.

In [191]:
wdp.loc[wdp['Date'] == np.datetime64('2016-08-05'), 'MeanTemp']

10098     75.7
21017     74.5
68117     74.1
71336     74.3
74361     77.4
123396    76.2
150563    74.6
Name: MeanTemp, dtype: float64

In [193]:
np.mean(wdp.loc[wdp['Date'] == np.datetime64('2016-11-11'), 'MeanTemp'])

55.900000000000006
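
The single-date checks above generalize to every date with a `groupby`. This is a sketch of one way to build the daily averages, not necessarily the approach used later; the station names come from the list above, while the temperature values are hypothetical.

```python
import pandas as pd

# Hypothetical readings: two stations on each of two days.
readings = pd.DataFrame({
    'StationName': ['CENTRAL PARK', 'LA GUARDIA AIRPORT'] * 2,
    'Date': pd.to_datetime(['2016-08-05', '2016-08-05',
                            '2016-11-11', '2016-11-11']),
    'MeanTemp': [70.0, 74.0, 50.0, 56.0],
})

# Average over stations within each day; the result is indexed by Date.
daily_mean = readings.groupby('Date')['MeanTemp'].mean()
print(daily_mean)
```

Because the result is indexed by `Date`, it can later be joined to daily 311 call counts on that index.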

Look at limits on temperature (here, the range of MinTemp over the full dataset). These numbers must be in Fahrenheit.

In [165]:
print(np.min(w.MinTemp))
print(np.max(w.MinTemp))

-36.9
87.8
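
One way to sanity-check the Fahrenheit reading is to convert the two extremes to Celsius. The helper name `fahrenheit_to_celsius` is hypothetical; the formula is the standard conversion.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0

# The two extremes printed above.
min_c = fahrenheit_to_celsius(-36.9)
max_c = fahrenheit_to_celsius(87.8)
print(round(min_c, 1), round(max_c, 1))
```

Read as Fahrenheit, the extremes are roughly -38 and 31 degrees Celsius. Read as Celsius, a daily minimum of 87.8 would be implausible for air temperature.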
