# Make Dataset

## Process

1. Import daily bike data
2. Import daily weather data
3. Import US holidays
4. Join precipitation and temperature to daily bike data
5. Create holiday variable
6. Save dataset

## NYC Weather Data

Weather data was obtained from [NOAA](https://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/locations/ZIP:10023/detail). It consists of the daily summaries for ZIP code 10023 (Central Park station).

Per the [documentation](https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf):

+ PRCP = Precipitation (mm or inches as per user preference, inches to hundredths on Daily Form pdf file)
+ SNOW = Snowfall (mm or inches as per user preference, inches to tenths on Daily Form pdf file)
+ SNWD = Snow depth (mm or inches as per user preference, inches on Daily Form pdf file)
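+ TMAX = Maximum temperature (Fahrenheit or Celsius as per user preference, Fahrenheit to tenths on Daily Form pdf file)
+ TMIN = Minimum temperature (Fahrenheit or Celsius as per user preference, Fahrenheit to tenths on Daily Form pdf file)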
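
The sample rows later in this notebook (for example, PRCP of 0.21 and TMAX of 48 in early January) suggest the download used standard units (inches and Fahrenheit), though that is an assumption rather than something the file states. If metric values are needed, a minimal sketch of the conversion could look like this; `to_metric` is a hypothetical helper, not part of the original notebook:

```python
import pandas as pd

MM_PER_INCH = 25.4

def to_metric(weather):
    """Return a copy with PRCP/SNOW/SNWD in mm and TMIN/TMAX in Celsius.

    Assumes the input is in inches and Fahrenheit (an assumption about the download).
    """
    out = weather.copy()
    for col in ["PRCP", "SNOW", "SNWD"]:
        if col in out.columns:
            out[col] = out[col] * MM_PER_INCH
    for col in ["TMIN", "TMAX"]:
        if col in out.columns:
            out[col] = (out[col] - 32) * 5 / 9
    return out

# toy values for illustration only
sample = pd.DataFrame({"PRCP": [0.0, 1.0], "TMIN": [32, 41], "TMAX": [50, 212]})
metric = to_metric(sample)
```

The helper returns a copy, so the original frame keeps its units and the conversion cannot be applied twice by accident.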

In [1]:
import os
import pandas as pd
import numpy as np
import holidays

# set directory with data
data_dir = "data"

# read the bike and weather datasets
bikes = pd.read_csv(os.path.join(data_dir, "citi_bike_daily.csv"))
weather = pd.read_csv(os.path.join(data_dir, "nyc_daily_weather.csv"))

# holiday list
us_holidays = holidays.US()

In [2]:
bikes.head()

Unnamed: 0,date,rides
0,2017-01-01,16009
1,2017-01-02,8921
2,2017-01-03,14198
3,2017-01-04,34039
4,2017-01-05,28393


In [3]:
bikes.shape

(907, 2)

In [4]:
weather.head()

Unnamed: 0,STATION,NAME,DATE,AWND,PGTM,PRCP,SNOW,SNWD,TAVG,TMAX,...,WT03,WT04,WT06,WT08,WT13,WT14,WT16,WT18,WT19,WT22
0,USW00094728,"NY CITY CENTRAL PARK, NY US",2013-01-01,6.93,,0.0,0.0,0.0,,40,...,,,,,,,,,,
1,USW00094728,"NY CITY CENTRAL PARK, NY US",2013-01-02,5.82,,0.0,0.0,0.0,,33,...,,,,1.0,,,,,,
2,USW00094728,"NY CITY CENTRAL PARK, NY US",2013-01-03,4.47,,0.0,0.0,0.0,,32,...,,,,,,,,,,
3,USW00094728,"NY CITY CENTRAL PARK, NY US",2013-01-04,8.05,,0.0,0.0,0.0,,37,...,,,,,,,,,,
4,USW00094728,"NY CITY CENTRAL PARK, NY US",2013-01-05,6.71,,0.0,0.0,0.0,,42,...,,,,,,,,,,


In [5]:
weather['TMID'] = weather[["TMIN", "TMAX"]].mean(axis=1)

weather = weather[["DATE", "PRCP", "SNOW", "SNWD", "TMIN", "TMID", "TMAX"]]
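
`DataFrame.mean(axis=1)` skips missing values by default, so on a day where only one of TMIN or TMAX is recorded, TMID would silently equal the other value. Whether the Central Park series has such days is not shown here; the sketch below uses made-up toy values to illustrate the behavior and a stricter alternative that leaves TMID missing instead:

```python
import numpy as np
import pandas as pd

# toy values, not taken from the NOAA file
toy = pd.DataFrame({"TMIN": [26.0, np.nan], "TMAX": [40.0, 33.0]})

# default: NaN is skipped, so row 1 falls back to TMAX alone
lenient = toy[["TMIN", "TMAX"]].mean(axis=1)

# stricter: any missing input yields a missing midpoint
strict = toy[["TMIN", "TMAX"]].mean(axis=1, skipna=False)
```

Which variant is appropriate depends on whether a one-sided temperature reading should count as a valid midpoint.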

In [6]:
weather.head()

Unnamed: 0,DATE,PRCP,SNOW,SNWD,TMIN,TMID,TMAX
0,2013-01-01,0.0,0.0,0.0,26,33.0,40
1,2013-01-02,0.0,0.0,0.0,22,27.5,33
2,2013-01-03,0.0,0.0,0.0,24,28.0,32
3,2013-01-04,0.0,0.0,0.0,30,33.5,37
4,2013-01-05,0.0,0.0,0.0,32,37.0,42


In [7]:
weather.shape

(2392, 7)

In [8]:
# combine bike and weather data
daily_df = bikes.merge(weather, 
                       left_on='date', 
                       right_on='DATE', 
                       how='left').drop('DATE', axis=1)

# lowercase column names
daily_df.columns = daily_df.columns.str.lower()
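
A left join keeps every bike row, but it does not reveal duplicated dates in the weather table or bike dates with no weather match. As a hedged sketch on toy data (the frames below are invented for illustration), the `validate` and `indicator` parameters of `DataFrame.merge` can surface both problems:

```python
import pandas as pd

# toy frames, not the real datasets
bikes_toy = pd.DataFrame({"date": ["2017-01-01", "2017-01-02", "2017-01-03"],
                          "rides": [10, 20, 30]})
weather_toy = pd.DataFrame({"DATE": ["2017-01-01", "2017-01-02"],
                            "PRCP": [0.0, 0.21]})

merged = bikes_toy.merge(
    weather_toy,
    left_on="date",
    right_on="DATE",
    how="left",
    validate="one_to_one",  # raises MergeError if either key column has duplicates
    indicator=True,         # adds a _merge column; "left_only" marks unmatched rows
).drop("DATE", axis=1)

# bike dates that found no weather row
unmatched = merged.loc[merged["_merge"] == "left_only", "date"].tolist()
```

An empty `unmatched` list would indicate that every bike date has weather data; the `_merge` column can be dropped afterwards.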

In [9]:
# function to flag a date as a US holiday
def is_holiday(x):
    """Return True if x (a date or date string) is in us_holidays."""
    return x in us_holidays

In [10]:
# add holiday variable
daily_df['holiday'] = daily_df['date'].apply(is_holiday)
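
For comparison, pandas ships its own `USFederalHolidayCalendar`, which can build a similar flag without a per-row function call. This is a swapped-in alternative, not the method used in this notebook, and it lists federal holidays on their observed dates only, so its flags can differ from `holidays.US()` (for example on 2017-01-01, a Sunday). A minimal sketch on toy dates:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# toy dates for illustration
dates = pd.to_datetime(pd.Series(["2017-01-01", "2017-01-02", "2017-01-03"]))

# federal holidays (observed dates) within the range covered by the data
federal = USFederalHolidayCalendar().holidays(start=dates.min(), end=dates.max())

# vectorized membership test instead of a row-by-row apply
holiday_flag = dates.isin(federal)
```

Because the two holiday definitions differ, the choice between them is a modeling decision rather than a pure performance one.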

In [11]:
daily_df.head()

Unnamed: 0,date,rides,prcp,snow,snwd,tmin,tmid,tmax,holiday
0,2017-01-01,16009,0.0,0.0,0.0,40,44.0,48,True
1,2017-01-02,8921,0.21,0.0,0.0,37,39.0,41,True
2,2017-01-03,14198,0.58,0.0,0.0,39,41.0,43,False
3,2017-01-04,34039,0.0,0.0,0.0,34,43.0,52,False
4,2017-01-05,28393,0.0,0.0,0.0,27,30.5,34,False


In [12]:
# row count matches the original bikes data (907 rows)
daily_df.shape

(907, 9)
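
A matching row count shows the join did not add or drop rows, but not that every calendar day is present. A hedged sketch of a gap check on toy data (the frame below is invented; in the notebook the same expressions would be applied to `daily_df`):

```python
import pandas as pd

# toy frame with one calendar day deliberately missing
toy = pd.DataFrame({
    "date": ["2017-01-01", "2017-01-02", "2017-01-04"],
    "rides": [100, 200, 300],
})

dates = pd.DatetimeIndex(pd.to_datetime(toy["date"]))

# every calendar day between the first and last observed date
expected = pd.date_range(dates.min(), dates.max(), freq="D")

# days in the expected range that have no row
missing_days = expected.difference(dates)
```

An empty `missing_days` index would mean the daily series is continuous over its date range.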

In [13]:
# export finalized data
daily_df.to_csv(os.path.join(data_dir, "daily.csv"), index=False)