 # Data preparation: join both datasets, Taxis and Weather

The purpose of this notebook is to join the following CSV files:

- ``Data_Taxis_[year]_Cleaned.csv`` created in the notebook _Data_Taxis_Clean_Transform.ipynb_  
- ``Data_Weather_Cleaned.csv`` created in the notebook _Data_Weather_Clean_Transform.ipynb_

The result is the following files, whose data is ready to be used by machine learning models:

- ``Data_Cleaned_2017_To_Model.csv``  
- ``Data_Cleaned_2018_To_Model.csv``  
- ``Data_Cleaned_2019_To_Model.csv``  

If you want to skip all the steps, go straight to point 3, [Everything put together and Export CSV](#Everything-put-together-and-Export-CSV).

### 1. [Fill missing LocationIDs in the Taxis dataset](#Fill-missing-LocationIDs-in-the-Taxis-dataset)

It may occur that there were no pickups at all for a specific zone and time period. Those combinations are simply absent from the Taxis dataset, so I need to add the missing time slots and LocationIDs with ``pickups`` set to ``0`` so that the model learns that there were no pickups.  

In order to do this I need to **prepare** a DataFrame with every LocationID for every hourly period and perform a LEFT JOIN with the Taxis dataset.

- [Manhattan LocationIDs: Create DataFrame](#Manhattan-LocationIDs:-Create-DataFrame)<br>
    - There should be 67 unique LocationIDs for each hourly period.
    - Therefore the DataFrame should have 586,920 rows:  
    ``365 days * 24h * 67 LocationIDs = 586,920``

- [Taxis: Import cleaned Dataset & Sanity check](#Taxis:-Import-cleaned-Dataset-&-Sanity-check)<br>
- [Manhattan LocationIDs: Prepare columns for Multi Index](#Manhattan-LocationIDs:-Prepare-columns-for-Multi-Index)<br>

After a lot of trial and error, I have concluded that I need to create a Multi Index based on ``month``, ``day``, ``hour`` and ``LocationID`` in order to perform the JOIN successfully:
- [Check that parameters needed for the multi index are correct before executing the Join](#Check-that-parameters-needed-for-the-multi-index-are-correct-before-executing-the-Join)
- [Create Multi Index with groupby & Sanity check](#Create-Multi-Index-with-groupby-&-Sanity-check)
- [Perform the JOIN & Sanity Check](#Perform-the-JOIN-&-Sanity-Check)

### 2. [Join Taxis and Weather datasets](#Join-Taxis-and-Weather-datasets)

- [Import cleaned Weather dataset](#Import-cleaned-Weather-dataset)
- [Insert ``datetime`` column in Taxis dataset](#Insert-datetime-column-in-Taxis-dataset)  
I lost this column when creating the multi index so I need to put it back in order to perform the MERGE on ``datetime``.
- [Merge Taxis and Weather datasets](#Merge-Taxis-and-Weather-datasets)
- [Manage NaNs](#Manage-NaNs): After filling the LocationID gaps, I need to manage the NaNs

### 3. [Everything put together and Export CSV](#Everything-put-together-and-Export-CSV)

- [Define functions](#Define-functions)
- [Run all and export to CSV](#Run-all-and-export-to-CSV)

- [Sanity check](#Sanity-check): Let's check that the CSV files created are correct by looking at some numbers.

In [1]:
import pandas as pd
import numpy as np
import random
from datetime import datetime as dt
from datetime import date
import matplotlib.pyplot as plt
pd.options.display.max_columns = None
pd.options.display.max_rows = None

# Fill missing LocationIDs in the Taxis dataset

Let's start by joining just one year: 2017

### Manhattan LocationIDs: Create DataFrame

- There should be 67 unique LocationIDs for each hourly period.
- Therefore the DataFrame should have 586,920 rows:
    - 365 days * 24h * 67 LocationIDs = 586,920
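
The expected row count can also be derived from the calendar instead of being hard-coded. The following is a minimal sketch; the helper name `expected_rows` and its `n_zones` parameter are hypothetical and not part of this notebook.

```python
import pandas as pd

def expected_rows(year, n_zones=67):
    """Hourly periods in `year` multiplied by the number of zones (hypothetical helper)."""
    start = pd.Timestamp(year=year, month=1, day=1)
    end = pd.Timestamp(year=year + 1, month=1, day=1)
    n_hours = int((end - start) / pd.Timedelta(hours=1))
    return n_hours * n_zones

print(expected_rows(2017))  # 8760 hours * 67 zones = 586920
```

2017, 2018 and 2019 all have 365 days, so the figure is the same for the three years processed here. A leap year would have 8784 hourly periods instead of 8760.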

In [2]:
# 1. Import Location and Borough columns from NY TAXI ZONES dataset
dfzones = pd.read_csv('../data/NY_taxi_zones.csv', sep=',',
                      usecols=['LocationID', 'borough'])

# 2. Filter Manhattan zones
dfzones = dfzones[dfzones['borough']=='Manhattan']\
                .drop(['borough'], axis=1)\
                .sort_values(by='LocationID')\
                .drop_duplicates('LocationID').reset_index(drop=True)

dfzones = pd.concat([dfzones]*8760).reset_index(drop=True)

print('There should be  586920 rows: ', dfzones.shape)
print('67 UNIQUE LocationIDs',pd.unique(dfzones['LocationID']).shape)

dfzones.head()

There should be  586920 rows:  (586920, 1)
67 UNIQUE LocationIDs (67,)


Unnamed: 0,LocationID
0,4
1,12
2,13
3,24
4,41


### Taxis: Import cleaned Dataset & Sanity check
- Import clean datasets created in the notebook ``Data_Taxis_Clean_Transform.ipynb``
- Confirm that the year is correct.
- Confirm that the count of LocationIDs per hourly period is correct.

In [3]:
year = 2017
dftax = pd.read_csv('../data/Data_Taxis_'+str(year)+'_Cleaned.csv', sep=',',
                        #dtype = {"PULocationID" : "object"},
                        parse_dates=['datetime'])
print('Year should be unique: ', dftax.year.unique())

print('67 UNIQUE LocationIDs: ', pd.unique(dftax['LocationID']).shape)
print(dftax.shape[0], 'A number less than 586920 indicates that there are missing LocationIDs')

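# Count LocationIDs per hourly period. There should be 67 in each period.
# Note: .head() limits this check to the first five hourly periods.
# numeric_only=True drops the 'datetime' column, which cannot be summed.
t = dftax.copy()
t['count'] = 1
tg = t.groupby(['month', 'day', 'hour']).sum(numeric_only=True).head()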
print('Count unique LocationIDs per hourly period. Should be 67: ',pd.unique(tg['count']),\
      'If not, is because there are missing LocationIDs for each hourly period')

display(tg.head())
display(dftax.head())

Year should be unique:  [2017]
67 UNIQUE LocationIDs:  (67,)
536306 A number less than 586920 indicates that there are missing LocationIDs
Count unique LocationIDs per hourly period. Should be 67:  [65 64] If not, is because there are missing LocationIDs for each hourly period


month,day,hour,LocationID,pickups,year,week,dayofweek,isweekend,isholiday,count
1,1,0,9733,18881,131105,3380,390,65,0,65
1,1,1,9539,20186,129088,3328,384,64,0,64
1,1,2,9667,17989,131105,3380,390,65,0,65
1,1,3,9667,14997,131105,3380,390,65,0,65
1,1,4,9547,10700,129088,3328,384,64,0,64


Unnamed: 0,datetime,LocationID,pickups,year,month,day,hour,week,dayofweek,isweekend,isholiday
0,2017-01-01,4,136,2017,1,1,0,52,6,1,0
1,2017-01-01,12,3,2017,1,1,0,52,6,1,0
2,2017-01-01,13,103,2017,1,1,0,52,6,1,0
3,2017-01-01,24,94,2017,1,1,0,52,6,1,0
4,2017-01-01,41,136,2017,1,1,0,52,6,1,0
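
The check in this cell only looks at the first five hourly periods. A minimal sketch of a check that covers every period is shown here on a toy frame with made-up values; the helper name `zones_per_hour` is hypothetical.

```python
import pandas as pd

def zones_per_hour(df):
    """Number of distinct LocationIDs in every (month, day, hour) group."""
    return df.groupby(['month', 'day', 'hour'])['LocationID'].nunique()

# Toy frame (made-up values): hour 0 has two zones, hour 1 has only one.
toy = pd.DataFrame({
    'month': [1, 1, 1],
    'day': [1, 1, 1],
    'hour': [0, 0, 1],
    'LocationID': [4, 12, 4],
})

counts = zones_per_hour(toy)
expected_zones = 2  # would be 67 for the Manhattan data
incomplete = counts[counts < expected_zones]
print(len(incomplete), 'hourly periods have missing LocationIDs')
```

Applied to ``dftax`` with ``expected_zones = 67``, ``incomplete`` would list every hourly period that needs filling.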


### Manhattan LocationIDs: Prepare columns for Multi Index

The Multi Index is composed of ``month``, ``day``, ``hour`` and ``LocationID``.

In [4]:
year = 2017
a = pd.period_range(start=str(year)+'-01-01', end=str(year)+'-12-31T23:00', freq='H')
df_index = pd.DataFrame({'datetime':a})

df_index['month'] = df_index['datetime'].dt.month
df_index['day'] = df_index['datetime'].dt.day
df_index['hour'] = df_index['datetime'].dt.hour
df_index = df_index.drop(columns=['datetime'],inplace=False)
df_index = df_index.iloc[np.arange(len(df_index)).repeat(67)].reset_index(drop=True)
df_index['LocationID'] = dfzones['LocationID']
print(df_index.shape)
df_index.head()

(586920, 4)


Unnamed: 0,month,day,hour,LocationID
0,1,1,0,4
1,1,1,0,12
2,1,1,0,13
3,1,1,0,24
4,1,1,0,41
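
As an alternative to repeating rows and assigning the LocationIDs by position, the same grid can be sketched with ``pd.MultiIndex.from_product``, which builds every combination of hour and zone directly. This is only an illustration: the function name `build_grid` is hypothetical, and three zones stand in for the 67 Manhattan LocationIDs.

```python
import pandas as pd

def build_grid(year, location_ids):
    """One row per (hour of the year, LocationID) combination."""
    hours = pd.date_range(start=f"{year}-01-01 00:00",
                          end=f"{year}-12-31 23:00",
                          freq=pd.offsets.Hour())
    index = pd.MultiIndex.from_product([hours, sorted(location_ids)],
                                       names=['datetime', 'LocationID'])
    grid = index.to_frame(index=False)
    grid['month'] = grid['datetime'].dt.month
    grid['day'] = grid['datetime'].dt.day
    grid['hour'] = grid['datetime'].dt.hour
    return grid

grid = build_grid(2017, [4, 12, 13])
print(grid.shape)  # (26280, 5): 8760 hours * 3 zones
```

A side effect of this variant is that the ``datetime`` column is kept alongside ``month``, ``day`` and ``hour``.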


### Check that parameters needed for the multi index are correct before executing the Join
After a lot of trial and error, I have concluded that I need to create a Multi Index based on ``month``, ``day``, ``hour`` and ``LocationID`` in order to perform the JOIN successfully.<br>
Before the JOIN, I check that the parameters needed for the Multi Index are correct in both DataFrames.

In [5]:
tax_m = dftax.groupby(['month']).count()
tax_d = dftax.groupby(['month', 'day']).count()
tax_h = dftax.groupby(['month', 'day','hour']).count()
ind_m = df_index.groupby(['month']).count()
ind_d = df_index.groupby(['month','day']).count()
ind_h = df_index.groupby(['month','day','hour']).count()
ind_z = df_index.groupby(['month','day','hour','LocationID']).count()


print('12 MONTHS:',tax_m.shape[0],'=>',ind_m.shape[0])
print('365 DAYS:',tax_d.shape[0],'=>', ind_d.shape[0])
print('8760 HOURS:',tax_h.shape[0],'=>', ind_h.shape[0])
print('67 UNIQUE LocationID:',\
      pd.unique(df_index['LocationID']).shape[0],'=>',\
      pd.unique(dftax['LocationID']).shape[0])

12 MONTHS: 12 => 12
365 DAYS: 365 => 365
8760 HOURS: 8760 => 8760
67 UNIQUE LocationID: 67 => 67


### Create Multi Index with groupby & Sanity check

In [6]:
# numeric_only=True drops the 'datetime' column, which cannot be summed
dftax_g = dftax.groupby(['month','day','hour','LocationID']).sum(numeric_only=True)
df_index_g = df_index.groupby(['month','day','hour','LocationID']).sum()
print(dftax_g.shape)
print(df_index_g.shape)

(536306, 6)
(586920, 0)


In [7]:
# Sanity check
print('Taxis dataset BEFORE grouping.')
print(dftax.shape[0])
print('67 UNIQUE LocationIDs: ', pd.unique(dftax['LocationID']).shape[0])
display(dftax.head())

print('Taxis dataset AFTER grouping.')
print('It should be 586920: ', df_index_g.shape[0])
print('67 UNIQUE LocationIDs: ',dftax_g.index.unique(level='LocationID').shape[0])
display(dftax_g.head(100))

Taxis dataset BEFORE grouping.
536306
67 UNIQUE LocationIDs:  67


Unnamed: 0,datetime,LocationID,pickups,year,month,day,hour,week,dayofweek,isweekend,isholiday
0,2017-01-01,4,136,2017,1,1,0,52,6,1,0
1,2017-01-01,12,3,2017,1,1,0,52,6,1,0
2,2017-01-01,13,103,2017,1,1,0,52,6,1,0
3,2017-01-01,24,94,2017,1,1,0,52,6,1,0
4,2017-01-01,41,136,2017,1,1,0,52,6,1,0


Taxis dataset AFTER grouping.
It should be 586920:  586920
67 UNIQUE LocationIDs:  67


month,day,hour,LocationID,pickups,year,week,dayofweek,isweekend,isholiday
1,1,0,4,136,2017,52,6,1,0
1,1,0,12,3,2017,52,6,1,0
1,1,0,13,103,2017,52,6,1,0
1,1,0,24,94,2017,52,6,1,0
1,1,0,41,136,2017,52,6,1,0
1,1,0,42,79,2017,52,6,1,0
1,1,0,43,401,2017,52,6,1,0
1,1,0,45,54,2017,52,6,1,0
1,1,0,48,692,2017,52,6,1,0
1,1,0,50,313,2017,52,6,1,0


### Perform the JOIN & Sanity Check

This Join will fill the missing LocationIDs within the taxis dataset.

In [8]:
taxis_join = dftax_g.join(df_index_g, how='right').reset_index()

In [9]:
# Sanity Check
print('67 Unique LocationIDs in total: ',pd.unique(taxis_join['LocationID']).shape)
print('Shape should be (586920, x): ', taxis_join.shape)
t2 = taxis_join.copy()
t2['count'] = 1
tg2 = t2.groupby(['month', 'day', 'hour']).sum().head()
print('Count unique LocationIDs per hourly period. Should be 67: ',pd.unique(tg2['count']))
display(tg2.head())

display(taxis_join.head())

67 Unique LocationIDs in total:  (67,)
Shape should be (586920, x):  (586920, 10)
Count unique LocationIDs per hourly period. Should be 67:  [67]


month,day,hour,LocationID,pickups,year,week,dayofweek,isweekend,isholiday,count
1,1,0,9964,18881.0,131105.0,3380.0,390.0,65.0,0.0,67
1,1,1,9964,20186.0,129088.0,3328.0,384.0,64.0,0.0,67
1,1,2,9964,17989.0,131105.0,3380.0,390.0,65.0,0.0,67
1,1,3,9964,14997.0,131105.0,3380.0,390.0,65.0,0.0,67
1,1,4,9964,10700.0,129088.0,3328.0,384.0,64.0,0.0,67


Unnamed: 0,month,day,hour,LocationID,pickups,year,week,dayofweek,isweekend,isholiday
0,1,1,0,4,136.0,2017.0,52.0,6.0,1.0,0.0
1,1,1,0,12,3.0,2017.0,52.0,6.0,1.0,0.0
2,1,1,0,13,103.0,2017.0,52.0,6.0,1.0,0.0
3,1,1,0,24,94.0,2017.0,52.0,6.0,1.0,0.0
4,1,1,0,41,136.0,2017.0,52.0,6.0,1.0,0.0
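
To see what the right JOIN does on a small scale, here is a toy sketch with made-up counts for a single hour and three zones. The variable names are illustrative only.

```python
import pandas as pd

# Made-up counts: zone 12 has no pickups at hour 0, so its row is absent.
taxis = pd.DataFrame({'hour': [0, 0],
                      'LocationID': [4, 13],
                      'pickups': [136, 103]})
taxis_g = taxis.set_index(['hour', 'LocationID'])

# Complete grid of (hour, LocationID) combinations, with no data columns.
grid_index = pd.MultiIndex.from_product([[0], [4, 12, 13]],
                                        names=['hour', 'LocationID'])
grid_g = pd.DataFrame(index=grid_index)

# The right JOIN keeps every row of the grid; missing pickups become NaN.
joined = taxis_g.join(grid_g, how='right').reset_index()
filled = joined.fillna({'pickups': 0})
print(filled)
```

The row for zone 12 appears with ``NaN`` in ``pickups`` after the JOIN and with ``0`` after ``fillna``.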


# Join Taxis and Weather datasets

### Import cleaned Weather dataset

In [10]:
# Filter one year data
year = 2017

# Import WEATHER DATASET to dataframe.
dfwea = pd.read_csv('../data/Data_Weather_Cleaned.csv', sep=',',
                        parse_dates=['datetime'])

# Filter one year data
dfwea.drop(dfwea[dfwea['datetime'] < pd.Timestamp(date(year,1,1))].index, inplace=True)
dfwea.drop(dfwea[dfwea['datetime'] >= pd.Timestamp(date(year+1,1,1))].index, inplace=True)

# Sanity check
print('Year should be unique: ', dfwea.datetime.dt.year.unique())
print('There should be 8760 hourly periods in a year: ', dfwea.shape[0])

dfwea.sample(5)

Year should be unique:  [2017]
There should be 8760 hourly periods in a year:  8760


Unnamed: 0,datetime,precipitation
1606,2017-03-08 22:00:00,0.0
7167,2017-10-26 15:00:00,0.0
4883,2017-07-23 11:00:00,0.0
3544,2017-05-28 16:00:00,0.0
5001,2017-07-28 09:00:00,0.0


### Insert ``datetime`` column in Taxis dataset
I lost this column when creating the multi index so I need to put it back in order to perform the MERGE on ``datetime``.

In [11]:
# I will take the 'datetime' sequence from the Weather data frame
datetime_col = dfwea.copy()
datetime_col.drop(columns=['precipitation'], inplace=True)

# repeat values to have one hour per LocationID (67)
datetime_col = pd.DataFrame(np.repeat(datetime_col.values,67))
# rename column
datetime_col = datetime_col.rename(columns={0:'datetime'})

taxis_final = pd.concat([datetime_col,taxis_join], axis=1)
print('Should have 586920 rows: ',taxis_final.shape[0])
taxis_final.head()

Should have 586920 rows:  586920


Unnamed: 0,datetime,month,day,hour,LocationID,pickups,year,week,dayofweek,isweekend,isholiday
0,2017-01-01,1,1,0,4,136.0,2017.0,52.0,6.0,1.0,0.0
1,2017-01-01,1,1,0,12,3.0,2017.0,52.0,6.0,1.0,0.0
2,2017-01-01,1,1,0,13,103.0,2017.0,52.0,6.0,1.0,0.0
3,2017-01-01,1,1,0,24,94.0,2017.0,52.0,6.0,1.0,0.0
4,2017-01-01,1,1,0,41,136.0,2017.0,52.0,6.0,1.0,0.0
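
The ``datetime`` column could also be rebuilt from the ``month``, ``day`` and ``hour`` columns instead of being concatenated by position. The sketch below assumes the year is passed explicitly, because the ``year`` column still contains NaNs for the filled rows at this stage; the helper name `add_datetime` is hypothetical.

```python
import pandas as pd

def add_datetime(df, year):
    """Insert a 'datetime' column assembled from month, day and hour."""
    parts = df[['month', 'day', 'hour']].assign(year=year)
    out = df.copy()
    # pd.to_datetime assembles timestamps from columns named year/month/day/hour
    out.insert(0, 'datetime',
               pd.to_datetime(parts[['year', 'month', 'day', 'hour']]))
    return out

# Toy frame with made-up values
toy = pd.DataFrame({'month': [1, 1], 'day': [1, 1], 'hour': [0, 1],
                    'LocationID': [4, 4], 'pickups': [136.0, None]})
print(add_datetime(toy, 2017))
```

This does not depend on the row order of the two DataFrames matching, which the positional ``pd.concat`` relies on.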


### Merge Taxis and Weather datasets

In [12]:
df_merge = pd.merge(taxis_final, dfwea, on='datetime')
# taxis dataframe and merged dataframe should have same number of rows
print('df_merge shape ({0}) should be equal to taxis_final shape ({1})'.format(df_merge.shape[0], taxis_final.shape[0]))
df_merge.head(100)

df_merge shape (586920) should be equal to taxis_final shape (586920)


Unnamed: 0,datetime,month,day,hour,LocationID,pickups,year,week,dayofweek,isweekend,isholiday,precipitation
0,2017-01-01 00:00:00,1,1,0,4,136.0,2017.0,52.0,6.0,1.0,0.0,0.0
1,2017-01-01 00:00:00,1,1,0,12,3.0,2017.0,52.0,6.0,1.0,0.0,0.0
2,2017-01-01 00:00:00,1,1,0,13,103.0,2017.0,52.0,6.0,1.0,0.0,0.0
3,2017-01-01 00:00:00,1,1,0,24,94.0,2017.0,52.0,6.0,1.0,0.0,0.0
4,2017-01-01 00:00:00,1,1,0,41,136.0,2017.0,52.0,6.0,1.0,0.0,0.0
5,2017-01-01 00:00:00,1,1,0,42,79.0,2017.0,52.0,6.0,1.0,0.0,0.0
6,2017-01-01 00:00:00,1,1,0,43,401.0,2017.0,52.0,6.0,1.0,0.0,0.0
7,2017-01-01 00:00:00,1,1,0,45,54.0,2017.0,52.0,6.0,1.0,0.0,0.0
8,2017-01-01 00:00:00,1,1,0,48,692.0,2017.0,52.0,6.0,1.0,0.0,0.0
9,2017-01-01 00:00:00,1,1,0,50,313.0,2017.0,52.0,6.0,1.0,0.0,0.0
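
The row-count comparison above can be complemented by letting ``pd.merge`` enforce the expected relationship itself. This is a sketch on made-up values, using the ``how`` and ``validate`` parameters of ``pd.merge``; it is not what the cell above runs.

```python
import pandas as pd

# Toy data with made-up values: many taxi rows per hour, one weather row per hour.
taxis = pd.DataFrame({
    'datetime': pd.to_datetime(['2017-01-01 00:00', '2017-01-01 00:00',
                                '2017-01-01 01:00']),
    'LocationID': [4, 12, 4],
    'pickups': [136.0, 3.0, 120.0],
})
weather = pd.DataFrame({
    'datetime': pd.to_datetime(['2017-01-01 00:00', '2017-01-01 01:00']),
    'precipitation': [0.0, 0.2],
})

# how='left' keeps every taxi row; validate raises MergeError if an hour
# appears more than once in the weather data.
merged = pd.merge(taxis, weather, on='datetime', how='left',
                  validate='many_to_one')
assert len(merged) == len(taxis)
print(merged)
```

With the default inner merge, taxi rows whose hour is missing from the weather data would be dropped silently; with ``how='left'`` they are kept with ``NaN`` precipitation.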


### Manage NaNs
After filling the LocationID gaps, I need to manage the NaNs of the following variables:
- ``pickups``: will be 0.
- ``year``: will be calculated from ``datetime``
- ``week``: will be calculated from ``datetime``
- ``dayofweek``: will be calculated from ``datetime``
- ``isweekend``: will be calculated from ``datetime``
- ``isholiday``: will be calculated again with ``USFederalHolidayCalendar``

In [13]:
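df_merge['pickups'] = df_merge['pickups'].fillna(0)
df_merge['year'] = df_merge['datetime'].dt.year
# .dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
df_merge['week'] = df_merge['datetime'].dt.isocalendar().week.astype(int)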
df_merge['dayofweek'] = df_merge['datetime'].dt.dayofweek
# isweekend
mask = (df_merge['dayofweek'] == 5) | (df_merge['dayofweek'] == 6)
df_merge['isweekend'] = np.where(mask, 1, 0)

# isholiday: Create date time index calendar
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

drange = pd.date_range(start=str(year)+'-01-01', end=str(year)+'-12-31')
cal = calendar()
holidays = cal.holidays(start=drange.min(), end=drange.max())

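# isholiday: create a temporary column 'date' (midnight of each day) and match it
# against the holiday dates, so that every hour of a holiday is flagged
df_merge['date'] = pd.to_datetime(df_merge['datetime'].dt.date)
df_merge['isholiday'] = df_merge['date'].isin(holidays).astype(int)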


# isholiday: drop column 'date'
df_merge.drop(columns=['date'], inplace=True)
df_merge.head(20)

Unnamed: 0,datetime,month,day,hour,LocationID,pickups,year,week,dayofweek,isweekend,isholiday,precipitation
0,2017-01-01,1,1,0,4,136.0,2017,52,6,1,0,0.0
1,2017-01-01,1,1,0,12,3.0,2017,52,6,1,0,0.0
2,2017-01-01,1,1,0,13,103.0,2017,52,6,1,0,0.0
3,2017-01-01,1,1,0,24,94.0,2017,52,6,1,0,0.0
4,2017-01-01,1,1,0,41,136.0,2017,52,6,1,0,0.0
5,2017-01-01,1,1,0,42,79.0,2017,52,6,1,0,0.0
6,2017-01-01,1,1,0,43,401.0,2017,52,6,1,0,0.0
7,2017-01-01,1,1,0,45,54.0,2017,52,6,1,0,0.0
8,2017-01-01,1,1,0,48,692.0,2017,52,6,1,0,0.0
9,2017-01-01,1,1,0,50,313.0,2017,52,6,1,0,0.0
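
The holiday dates returned by the calendar are midnight timestamps, so it matters whether the hourly ``datetime`` or the day it belongs to is compared against them. A small sketch with three hand-picked timestamps around 4 July 2017:

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

holidays = USFederalHolidayCalendar().holidays(start='2017-01-01',
                                               end='2017-12-31')

stamps = pd.Series(pd.to_datetime(['2017-07-04 00:00',
                                   '2017-07-04 15:00',
                                   '2017-07-05 15:00']))

# Matching the full timestamp only flags the midnight row of the holiday.
on_datetime = stamps.isin(holidays).astype(int)
# Matching the normalized date flags every hour of the holiday.
on_date = stamps.dt.normalize().isin(holidays).astype(int)

print(on_datetime.tolist())  # [1, 0, 0]
print(on_date.tolist())      # [1, 1, 0]
```

``dt.normalize()`` sets the time part to midnight, which is equivalent to the temporary ``date`` column used in the cell above.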


# Everything put together and Export CSV

### Define functions

- ``fill_locationids()``: Fill missing LocationIDs in the Taxis dataset.
- ``join_taxis_and_weather()``: Join Taxis and Weather datasets.
- ``prepare_data_for_models()``: process all the above for a specific year and export to CSV

In [14]:
# FUNCTION TO FILL MISSING LOCATIONIDs IN THE TAXIS DATASET
def fill_locationids(year):
    # Manhattan LocationIDs: Create DataFrame
    dfzones = pd.read_csv('../data/NY_taxi_zones.csv', sep=',',
                          usecols=['LocationID', 'borough'])

    dfzones = dfzones[dfzones['borough']=='Manhattan']\
                    .drop(['borough'], axis=1)\
                    .sort_values(by='LocationID')\
                    .drop_duplicates('LocationID').reset_index(drop=True)

    dfzones = pd.concat([dfzones]*8760).reset_index(drop=True)

    # Manhattan LocationIDs: Prepare columns for Multi Index
    a = pd.period_range(start=str(year)+'-01-01', end=str(year)+'-12-31T23:00', freq='H')
    df_index = pd.DataFrame({'datetime':a})

    df_index['month'] = df_index['datetime'].dt.month
    df_index['day'] = df_index['datetime'].dt.day
    df_index['hour'] = df_index['datetime'].dt.hour
    df_index = df_index.drop(columns=['datetime'],inplace=False)
    df_index = df_index.iloc[np.arange(len(df_index)).repeat(67)].reset_index(drop=True)
    df_index['LocationID'] = dfzones['LocationID']

    # Taxis: Import cleaned Dataset
    dftax = pd.read_csv('../data/Data_Taxis_'+str(year)+'_Cleaned.csv', sep=',',
                            #dtype = {"PULocationID" : "object"},
                            parse_dates=['datetime'])

    # Create Multi Index with groupby
    # numeric_only=True drops the 'datetime' column, which cannot be summed
    dftax_g = dftax.groupby(['month','day','hour','LocationID']).sum(numeric_only=True)
    df_index_g = df_index.groupby(['month','day','hour','LocationID']).sum()

    # Perform the JOIN
    taxis_join = dftax_g.join(df_index_g, how='right').reset_index()
    return taxis_join

# FUNCTION TO JOIN TAXIS AND WEATHER DATASETS
def join_taxis_and_weather(year, taxis_join):
    # Import WEATHER DATASET to dataframe.
    dfwea = pd.read_csv('../data/Data_Weather_Cleaned.csv', sep=',',
                            parse_dates=['datetime'])

    # Filter one year data
    dfwea.drop(dfwea[dfwea['datetime'] < pd.Timestamp(date(year,1,1))].index, inplace=True)
    dfwea.drop(dfwea[dfwea['datetime'] >= pd.Timestamp(date(year+1,1,1))].index, inplace=True)

    # I will take the 'datetime' sequence from the Weather data frame
    datetime_col = dfwea.copy()
    datetime_col.drop(columns=['precipitation'], inplace=True)

    # repeat values to have one hour per LocationID (67)
    datetime_col = pd.DataFrame(np.repeat(datetime_col.values,67))
    # rename column
    datetime_col = datetime_col.rename(columns={0:'datetime'})

    taxis_final = pd.concat([datetime_col,taxis_join], axis=1)

    df_merge = pd.merge(taxis_final, dfwea, on='datetime')

    # Manage NaNs
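    df_merge['pickups'] = df_merge['pickups'].fillna(0)
    df_merge['year'] = df_merge['datetime'].dt.year
    # .dt.week was removed in pandas 2.0; isocalendar().week gives the same ISO week number
    df_merge['week'] = df_merge['datetime'].dt.isocalendar().week.astype(int)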
    df_merge['dayofweek'] = df_merge['datetime'].dt.dayofweek
    # isweekend
    mask = (df_merge['dayofweek'] == 5) | (df_merge['dayofweek'] == 6)
    df_merge['isweekend'] = np.where(mask, 1, 0)

    # isholiday: Create date time index calendar
    drange = pd.date_range(start=str(year)+'-01-01', end=str(year)+'-12-31')
    cal = calendar()
    holidays = cal.holidays(start=drange.min(), end=drange.max())

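    # isholiday: create a temporary column 'date' (midnight of each day) and match it
    # against the holiday dates, so that every hour of a holiday is flagged
    df_merge['date'] = pd.to_datetime(df_merge['datetime'].dt.date)
    df_merge['isholiday'] = df_merge['date'].isin(holidays).astype(int)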


    # isholiday: drop column 'date'
    df_merge.drop(columns=['date'], inplace=True)
    return df_merge

# FUNCTION TO PROCESS ALL THE ABOVE FOR A SPECIFIC YEAR AND EXPORT TO CSV
def prepare_data_for_models(year):
    filename = 'Data_Cleaned_'+str(year)+'_To_Model.csv'
    print('PROCESSING '+ filename + '...')
    df_fill_locationids = fill_locationids(year)
    df_merged = join_taxis_and_weather(year, df_fill_locationids)
    df_merged.to_csv('../data/'+filename, index = False, header=True)
    print(filename + ' has been SAVED!')

### Run all and export to CSV
NOTE: the whole computation takes about 6 minutes.

In [15]:
import pandas as pd
import numpy as np
import random
from datetime import datetime as dt
from datetime import date
import matplotlib.pyplot as plt
pd.options.display.max_columns = None
pd.options.display.max_rows = None
from pandas.tseries.holiday import USFederalHolidayCalendar as calendar

prepare_data_for_models(2017)
prepare_data_for_models(2018)
prepare_data_for_models(2019)

PROCESSING Data_Cleaned_2017_To_Model.csv...
Data_Cleaned_2017_To_Model.csv has been SAVED!
PROCESSING Data_Cleaned_2018_To_Model.csv...
Data_Cleaned_2018_To_Model.csv has been SAVED!
PROCESSING Data_Cleaned_2019_To_Model.csv...
Data_Cleaned_2019_To_Model.csv has been SAVED!


### Sanity check
Let's check that the CSV files created are correct by looking at some numbers.
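
The same checks can be written as assertions so that a problem stops the notebook instead of relying on reading printed numbers. A minimal sketch follows; the function name `check_model_frame` and its parameters are hypothetical, and the column names are the ones used in the exported files.

```python
import pandas as pd

def check_model_frame(df, year, n_zones=67):
    """Raise AssertionError if the frame is not a complete hour-by-zone grid."""
    start = pd.Timestamp(year=year, month=1, day=1)
    end = pd.Timestamp(year=year + 1, month=1, day=1)
    n_hours = int((end - start) / pd.Timedelta(hours=1))

    assert df['year'].unique().tolist() == [year]
    assert df['LocationID'].nunique() == n_zones

    # Every hourly period must exist and contain every zone.
    per_hour = df.groupby(['month', 'day', 'hour'])['LocationID'].nunique()
    assert len(per_hour) == n_hours
    assert (per_hour == n_zones).all()

    # Filled rows must have pickups = 0, not NaN.
    assert not df['pickups'].isna().any()
    return True
```

It could be called as ``check_model_frame(csvcheck, year)`` once ``csvcheck`` has been loaded in the cell below.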

In [16]:
year=2019 # input desired year

csvcheck = pd.read_csv('../data/Data_Cleaned_'+str(year)+'_To_Model.csv', sep=',',
                       parse_dates=['datetime'])

print('Year should be ' + str(year)+':', csvcheck.year.unique())
print('There should be 12 months:', csvcheck.month.nunique())
print('There should be 31 days:', csvcheck.day.nunique())
print('There should be 24 hours:', csvcheck.hour.nunique())
print('There should be 52 weeks:', csvcheck.week.nunique())
print('There should be 67 UNIQUE LocationIDs',pd.unique(csvcheck['LocationID']).shape)

csvcheck['hourlyperiods'] = 1
h = csvcheck.groupby(['month','day','hour'])['hourlyperiods'].sum()
print('There should be 8760 hourly periods in a year: ', h.shape)
print(csvcheck.describe())

Year should be 2019: [2019]
There should be 12 months: 12
There should be 31 days: 31
There should be 24 hours: 24
There should be 52 weeks: 52
There should be 67 UNIQUE LocationIDs (67,)
There should be 8760 hourly periods in a year:  (8760,)
               month            day           hour     LocationID  \
count  586920.000000  586920.000000  586920.000000  586920.000000   
mean        6.526027      15.720548      11.500000     148.716418   
std         3.447854       8.796254       6.922192      72.681293   
min         1.000000       1.000000       0.000000       4.000000   
25%         4.000000       8.000000       5.750000      90.000000   
50%         7.000000      16.000000      11.500000     148.000000   
75%        10.000000      23.000000      17.250000     229.000000   
max        12.000000      31.000000      23.000000     263.000000   

             pickups      year           week      dayofweek      isweekend  \
count  586920.000000  586920.0  586920.000000  586920.0