# NYC Yellow cab dataset -- Preprocessing

This notebook generates the file `yellow.csv`, which is already provided in the _data/_ folder.

`yellow` contains the number of taxis hired in each `taxi_zone`, hour by hour, for the whole of 2017.

The `yellow` dataset is computed from the ___NYC Taxi and Limousine Commission (TLC)___ [available data](http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).

In [1]:
import pandas as pd
import numpy as np
from time import time
from multiprocessing import Pool

## 1. Download Data

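Read only the `TARGET_COLUMNS` from each monthly CSV in the S3 repository. `usecols` limits which columns are kept in memory; each file is still transferred in full.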

__Warning:__ Even with an excellent connection, it takes around __10 min__ to download the full year.

In [2]:
URL = "https://s3.amazonaws.com/nyc-tlc/trip+data/"

TARGET_COLUMNS = ['tpep_pickup_datetime', 'PULocationID'] 

def get_month_dataset(month):
    url = URL + "yellow_tripdata_2017-{:0>2}.csv".format(month)
    print('--| ' + url)
    now = time()
    month_df = pd.read_csv(url, usecols=TARGET_COLUMNS)
    print(month, '-->', int(time()-now), 'seconds\n')
    return month_df
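
To illustrate what `usecols` does inside `get_month_dataset`, here is a minimal sketch on an in-memory CSV instead of the remote file. The sample rows are made up, and `sample_csv` and `sample_df` are names introduced only for this illustration.

```python
import io
import pandas as pd

# Made-up sample with one extra column, standing in for a monthly file
sample_csv = io.StringIO(
    "VendorID,tpep_pickup_datetime,PULocationID\n"
    "1,2017-01-01 00:12:00,161\n"
    "2,2017-01-01 00:47:00,161\n"
)

TARGET_COLUMNS = ['tpep_pickup_datetime', 'PULocationID']

# Only the listed columns are parsed and kept; VendorID is dropped
sample_df = pd.read_csv(sample_csv, usecols=TARGET_COLUMNS)
print(list(sample_df.columns))
```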

In [3]:
%%time

# Download with 5 worker processes; the context manager releases them afterwards
with Pool(5) as p:
    months = list(p.map(get_month_dataset, range(1,13)))

--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv
--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-03.csv
--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-02.csv
--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-04.csv
--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-05.csv
4 --> 90 seconds

--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-06.csv
3 --> 109 seconds

--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-07.csv
2 --> 130 seconds

--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-08.csv
1 --> 132 seconds

--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-09.csv
5 --> 149 seconds

--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-10.csv
8 --> 93 seconds

--| https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-11.csv
9 --> 100 seconds

--| https://s3.amazonaws.com/nyc-tlc/trip+da

## 2. Data Preprocessing

### 2.1 Build a single DataFrame

Merge `months` into a single DataFrame: `yellow`.

In [4]:
yellow = pd.concat(months, ignore_index=True)

del months # free memory

yellow.columns = ['pickup_datetime', 'taxi_zone']

### 2.2 Create an hourly timestamp

Floor `pickup_datetime` to the hour so that all trips starting from the same `taxi_zone` within the same hour can be grouped together.

In [5]:
# Convert to datetime
yellow.pickup_datetime = pd.to_datetime(yellow.pickup_datetime, utc=True)

# Floor each timestamp to the hour so trips can be grouped by hour
yellow.pickup_datetime = yellow.pickup_datetime.dt.floor('h')

# Add column to count the number of trips in each taxi zone
yellow['trip_counter'] = np.ones(yellow.shape[0])
yellow = yellow.groupby(['pickup_datetime', 'taxi_zone']).sum()
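
The floor-then-group step can be checked on a tiny example. This is a sketch with four made-up trips; `toy` and `toy_counts` are hypothetical names, and the real `yellow` frame is left untouched.

```python
import numpy as np
import pandas as pd

# Four made-up trips: three from zone 161, one from zone 48
toy = pd.DataFrame({
    'pickup_datetime': ['2017-01-01 00:12:00', '2017-01-01 00:47:00',
                        '2017-01-01 01:05:00', '2017-01-01 00:30:00'],
    'taxi_zone': [161, 161, 161, 48],
})

# Same steps as above: parse, floor to the hour, count per (hour, zone)
toy['pickup_datetime'] = pd.to_datetime(toy['pickup_datetime'], utc=True).dt.floor('h')
toy['trip_counter'] = np.ones(toy.shape[0])
toy_counts = toy.groupby(['pickup_datetime', 'taxi_zone']).sum()
print(toy_counts)
```

The two trips from zone 161 at 00:12 and 00:47 fall into the same hourly bucket, so their `trip_counter` is 2, while the 01:05 trip lands in the next hour.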

### 2.3 Index on time with NYC timezone
Set `pickup_datetime` as index.

In [6]:
# Set NYC timezone
yellow.reset_index(inplace=True)
yellow.set_index('pickup_datetime', inplace=True)
yellow.index = yellow.index.tz_convert('America/New_York')

# Add a constant UTC-offset column (-05:00 is EST; summer dates on the index use -04:00)
yellow['timezone'] = '-05:00'
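
As a side note on the conversion above, the following sketch shows how `tz_convert` behaves on one winter and one summer timestamp. The two dates are arbitrary, and `utc_idx`, `ny_idx` and `offsets` are names introduced here only for illustration.

```python
import pandas as pd

# One winter and one summer instant, both expressed in UTC
utc_idx = pd.DatetimeIndex(['2017-01-15 12:00', '2017-07-15 12:00'], tz='UTC')

# Convert to New York local time; the offset follows daylight saving time
ny_idx = utc_idx.tz_convert('America/New_York')
offsets = [ts.strftime('%z') for ts in ny_idx]
print(ny_idx)
print(offsets)
```

The index itself carries the correct offset for each date, so the constant `timezone` column matches it only for the part of the year on standard time.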

### 2.4 Last cleaning step
Keep only rows timestamped within 2017, dropping stray records outside that range.

In [7]:
yellow = yellow.loc['2017-01-01 00:00':'2017-12-31 23:00']
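
The string-based slice relies on a sorted `DatetimeIndex`. A minimal sketch with made-up rows on either side of the year boundaries (`edge_idx`, `edge` and `edge_2017` are hypothetical names):

```python
import pandas as pd

# Made-up rows just before, inside, and just after 2017 (NYC local time)
edge_idx = pd.DatetimeIndex(
    ['2016-12-31 23:00', '2017-01-01 00:00', '2017-12-31 23:00', '2018-01-01 00:00'],
    tz='America/New_York',
)
edge = pd.DataFrame({'trip_counter': [1.0, 2.0, 3.0, 4.0]}, index=edge_idx)

# Both slice bounds are inclusive and are read in the index's own timezone
edge_2017 = edge.loc['2017-01-01 00:00':'2017-12-31 23:00']
print(edge_2017)
```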

## 3. Export Data

In [8]:
PATH = '../data/' # Modify this to fit your data folder

In [9]:
yellow.to_csv(PATH + 'yellow.csv', index=True)

Now `yellow` is ready to be used by the `NYC_Yellow_Cabs_Main.ipynb` notebook.
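
As an optional sanity check, the exported file can be read back with its index restored. The sketch below is one possible way to do this, not necessarily how the main notebook loads the file; it uses an in-memory buffer and a made-up two-row frame (`out`, `buffer` and `back` are hypothetical names) in place of the real `yellow.csv`.

```python
import io
import pandas as pd

# Toy frame with the same layout as the exported `yellow` (values are made up)
idx = pd.DatetimeIndex(['2017-01-15 12:00', '2017-07-15 12:00'], tz='UTC')
out = pd.DataFrame(
    {'taxi_zone': [161, 48], 'trip_counter': [2.0, 1.0], 'timezone': '-05:00'},
    index=idx.tz_convert('America/New_York').rename('pickup_datetime'),
)

buffer = io.StringIO()  # stands in for PATH + 'yellow.csv'
out.to_csv(buffer, index=True)
buffer.seek(0)

back = pd.read_csv(buffer, index_col='pickup_datetime')
# Written offsets differ between winter and summer, so parse via UTC first
back.index = pd.to_datetime(back.index, utc=True).tz_convert('America/New_York')
print(back)
```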