# Preprocessing

This notebook generates the file `yellow_pickups.csv`, which is already provided in the _data/_ folder.

`yellow_pickups` contains, for each `taxi_zone` and each hour of the year 2017, the number of taxis hired as well as the average fare amount and the average trip distance.

The `yellow_pickups` dataset is computed from the ___NYC Taxi and Limousine Commission (TLC)___ [available data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [None]:
import pandas as pd
import numpy as np
from time import time
from multiprocessing import Pool

## 1. Download Data

Download only the `TARGET_COLUMNS` from the S3 repository.

__Warning:__ Even with an excellent connection, it takes around __10 min__ to download the full year.

In [None]:
URL = "https://s3.amazonaws.com/nyc-tlc/trip+data/"

TARGET_COLUMNS = ['tpep_pickup_datetime', 'PULocationID', 'trip_distance', 'fare_amount'] 

def get_month_dataset(month):
    url = URL + "yellow_tripdata_2017-{:0>2}.csv".format(month)
    print('--| ' + url)
    now = time()
    month_df = pd.read_csv(url, usecols=TARGET_COLUMNS)
    print(month, '-->', int(time()-now), 'seconds\n')
    return month_df
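
One detail of `usecols` is worth keeping in mind: `pd.read_csv` returns the selected columns in the order they appear in the file, not in the order of the list. A minimal sketch on a made-up in-memory CSV (the rows below are illustrative, not real TLC records, and `sample_csv`, `wanted` and `sample_df` are names introduced only for this example):

```python
import io
import pandas as pd

# Tiny stand-in for a monthly trip file (made-up rows, illustrative only)
sample_csv = io.StringIO(
    "VendorID,tpep_pickup_datetime,trip_distance,PULocationID,fare_amount\n"
    "1,2017-01-01 00:12:00,1.5,161,7.0\n"
    "2,2017-01-01 00:47:00,3.2,236,12.5\n"
)

wanted = ['tpep_pickup_datetime', 'PULocationID', 'trip_distance', 'fare_amount']
sample_df = pd.read_csv(sample_csv, usecols=wanted)

# Columns come back in file order, so trip_distance precedes PULocationID
print(list(sample_df.columns))
```

This is why any later renaming is safer when done by column name than by position.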

In [None]:
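%%time

# Download the 12 months with 5 worker processes; the context manager shuts the pool down afterwards
with Pool(5) as p:
    months = p.map(get_month_dataset, range(1, 13))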

## 2. Data Preprocessing

### 2.1 Build a single DataFrame

Concatenate the monthly DataFrames in `months` into a single DataFrame (the name `months` is reused for the result).

In [None]:
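months = pd.concat(months, ignore_index=True)
# Rename by name rather than by position: read_csv returns columns in file order, not in usecols order
months = months.rename(columns={'tpep_pickup_datetime': 'pickup_datetime', 'PULocationID': 'pickup_zone'})
months.head()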

### 2.2 Create an hourly timestamp

Truncate `pickup_datetime` to the hour so that all trips starting from the same `pickup_zone` within the same hour can be grouped together.

In [None]:
# Convert to datetime
months.pickup_datetime = pd.to_datetime(months.pickup_datetime)

# Truncate each datetime to the hour so that trips can be grouped hourly
months.pickup_datetime = months.pickup_datetime.dt.floor('h')
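
To make the truncation concrete, here is a small sketch on three made-up timestamps (`stamps` and `floored` are names used only in this example):

```python
import pandas as pd

# Three illustrative pickup times: two within the 00:00 hour, one exactly at 01:00
stamps = pd.Series(pd.to_datetime([
    '2017-01-01 00:12:45',
    '2017-01-01 00:59:59',
    '2017-01-01 01:00:00',
]))

# floor('h') rounds each timestamp down to the start of its hour
floored = stamps.dt.floor('h')
print(floored.tolist())
```

The first two timestamps collapse onto the same hourly value, which is what lets them fall into the same group later on.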

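### 2.3 Group by zone

Aggregate the trips by hour and `pickup_zone`. The result is indexed on the datetime, with one column per zone for each metric.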

In [None]:
# Aggregate by column name: positional access (group.columns[2]) can pick up pickup_zone instead of fare_amount
pickups_by_zone = (
    months.groupby(['pickup_datetime', 'pickup_zone'])
    .agg(pu_counter=('fare_amount', 'size'),              # number of pickups in the group
         avg_trip_distance=('trip_distance', 'mean'),
         avg_fare_amount=('fare_amount', 'mean'))
    .unstack(1)                                           # one column per pickup_zone for each metric
)

pickups_by_zone.index = pd.to_datetime(pickups_by_zone.index).tz_localize('America/New_York', ambiguous=True, nonexistent='shift_forward')
pickups_by_zone = pickups_by_zone.sort_index()
pickups_by_zone = pickups_by_zone.fillna(0)
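
The shape of the resulting table is easier to see on a toy example. The sketch below uses four made-up trips and pandas named aggregation to compute the same three metrics; the frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Four illustrative trips: three in the 00:00 hour (two zones), one in the 01:00 hour
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2017-01-01 00:00', '2017-01-01 00:00',
                                       '2017-01-01 00:00', '2017-01-01 01:00']),
    'pickup_zone': [161, 161, 236, 161],
    'trip_distance': [1.0, 3.0, 2.0, 4.0],
    'fare_amount': [6.0, 10.0, 8.0, 15.0],
})

summary = (
    toy.groupby(['pickup_datetime', 'pickup_zone'])
    .agg(pu_counter=('fare_amount', 'size'),
         avg_trip_distance=('trip_distance', 'mean'),
         avg_fare_amount=('fare_amount', 'mean'))
    .unstack(1)   # move pickup_zone from the row index to the columns
    .fillna(0)    # hours with no pickup in a zone become 0
)
print(summary)
```

Each row is one hour, and the columns form a two-level index of (metric, zone). Zone 236 has no trip at 01:00, so its three entries for that hour are filled with 0.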

## 3. Export Data

In [None]:
PATH = '../data/' # Modify this to fit your data folder

In [None]:
pickups_by_zone.to_csv(PATH + 'yellow_pickups.csv', index=True)

Now `yellow_pickups` is ready to be used by the `NYC_Yellow_Cabs_Main.ipynb` notebook.
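
Because the exported table has two column levels (metric and zone), the CSV carries two header rows. How `NYC_Yellow_Cabs_Main.ipynb` actually loads the file is not shown here; the following is only a sketch of one way to read such a file back, using a small made-up table and an in-memory buffer in place of the real file.

```python
import io
import pandas as pd

# Small stand-in with the same layout: datetime index, (metric, zone) columns
columns = pd.MultiIndex.from_product(
    [['pu_counter', 'avg_fare_amount'], [161, 236]], names=[None, 'pickup_zone'])
index = pd.DatetimeIndex(['2017-01-01 00:00', '2017-01-01 01:00'], name='pickup_datetime')
toy_wide = pd.DataFrame([[2, 1, 8.0, 8.0], [1, 0, 15.0, 0.0]], index=index, columns=columns)

# Write to an in-memory buffer instead of PATH + 'yellow_pickups.csv'
buffer = io.StringIO()
toy_wide.to_csv(buffer, index=True)
buffer.seek(0)

# header=[0, 1] rebuilds the two column levels; index_col=0 restores the datetime index column
restored = pd.read_csv(buffer, header=[0, 1], index_col=0)
print(restored.columns.nlevels)
```

After reading, the index holds strings, so it may need `pd.to_datetime` again before any time-based operation.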