# Predicting NYC Taxi Fares with RAPIDS

[RAPIDS](https://rapids.ai/) is a suite of GPU-accelerated data science libraries with APIs that should be familiar to users of Pandas, Dask, and scikit-learn.

This notebook shows how to use cuDF with Dask and XGBoost to scale GPU DataFrame ETL-style operations and model training out to multiple GPUs on multiple nodes as part of Google Cloud Dataproc.

In [3]:
import numpy as np
import numba, xgboost, socket

import dask, dask_cudf
from dask_cuda import LocalCUDACluster
from dask.delayed import delayed
from dask.distributed import Client, wait

# connect to the Dask cluster started at Dataproc cluster creation time
client = Client(socket.gethostname()+':8786')
# forces workers to restart. useful to ensure GPU memory is clear
client.restart()
client

Client: Scheduler: tcp://test-m:8786, Dashboard: http://test-m:8787/status
Cluster: Workers: 13, Cores: 13, Memory: 51.27 GB


# Inspecting the Data

Now that we have a cluster of GPU workers, we'll use [dask-cudf](https://github.com/rapidsai/dask-cudf/) to load and parse a bunch of CSV files into a distributed DataFrame.

In [14]:
#taxi_df = dask_cudf.read_csv('/data/nyc_taxi/raw/2014/yellow_tripdata_2014-1*')
taxi_df = dask_cudf.read_csv('/data/nyc_taxi/raw/2011/yellow_tripdata_2011*')

taxi_df.head().to_pandas()

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
0,CMT,2011-01-29 02:38:35,2011-01-29 02:47:07,1,1.2,-74.005254,40.729084,1,N,-73.988697,40.727127,CSH,6.1,0.5,0.5,0.0,0.0,7.1
1,CMT,2011-01-28 10:38:19,2011-01-28 10:42:18,1,0.4,-73.968585,40.759171,1,N,-73.964336,40.764665,CSH,4.1,0.0,0.5,0.0,0.0,4.6
2,CMT,2011-01-28 23:49:58,2011-01-28 23:57:44,3,1.2,-73.98071,40.74239,1,N,-73.987028,40.729532,CSH,6.1,0.5,0.5,0.0,0.0,7.1
3,CMT,2011-01-28 23:52:09,2011-01-28 23:59:21,3,0.8,-73.993773,40.747329,1,N,-73.991378,40.75005,CSH,5.3,0.5,0.5,0.0,0.0,6.3
4,CMT,2011-01-28 10:34:39,2011-01-28 11:25:50,1,5.3,-73.991475,40.749936,1,N,-73.950237,40.775626,CSH,25.3,0.0,0.5,0.0,0.0,25.8


# Data Cleanup

As usual, the data needs to be massaged a bit before we can start adding features that are useful to an ML model.

In this case, the taxi data for different years has different column names. We'll do a little string manipulation, column renaming and dropping to fix the problem.

In [15]:
# helper function which takes a DataFrame partition
def clean(df_part, mapper):    
    # some column names include a leading space; strip it and lower-case the name
    tmp = {col:col.strip().lower() for col in list(df_part.columns)}
    df_part = df_part.rename(tmp)
    
    # drop any column without a supplied replacement name
    for col in mapper:
        if mapper[col] is None and col in df_part.columns:
            df_part = df_part.drop(col)
    
    # rename according to supplied mapping
    df_part = df_part.rename(mapper)
        
    # fill all na values for non-object/string columns
    for col in df_part.columns:
        if df_part[col].dtype != 'object':
            df_part[col] = df_part[col].fillna(-1)
    
    return df_part

# create a dict mapping existing names to intended names
# any dict entry with `None` will be dropped
col_map = dict.fromkeys([
    'vendor_id', 'dropoff_datetime', 'payment_type', 'surcharge', 'mta_tax',
    'tip_amount', 'tolls_amount', 'total_amount', 'store_and_fwd_flag'
])

# apply the `clean` function to every partition in the taxi DataFrame
parts = [dask.delayed(clean)(part, col_map) for part in taxi_df.to_delayed()]
taxi_df = dask_cudf.from_delayed(parts)

# apply a list of filter conditions to throw out records with missing or outlier values
query_frags = [
    'fare_amount > 0 and fare_amount < 500',
    'passenger_count > 0 and passenger_count < 6',
    'pickup_longitude > -75 and pickup_longitude < -73',
    'dropoff_longitude > -75 and dropoff_longitude < -73',
    'pickup_latitude > 40 and pickup_latitude < 42',
    'dropoff_latitude > 40 and dropoff_latitude < 42'
]
taxi_df = taxi_df.query(' and '.join(query_frags))

# inspect the results of cleaning
taxi_df.head().to_pandas()

Unnamed: 0,pickup_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,dropoff_longitude,dropoff_latitude,fare_amount
0,2011-01-29 02:38:35,1,1.2,-74.005254,40.729084,1,-73.988697,40.727127,6.1
1,2011-01-28 10:38:19,1,0.4,-73.968585,40.759171,1,-73.964336,40.764665,4.1
2,2011-01-28 23:49:58,3,1.2,-73.98071,40.74239,1,-73.987028,40.729532,6.1
3,2011-01-28 23:52:09,3,0.8,-73.993773,40.747329,1,-73.991378,40.75005,5.3
4,2011-01-28 10:34:39,1,5.3,-73.991475,40.749936,1,-73.950237,40.775626,25.3
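
To make the renaming rules concrete, here is a CPU-only sketch of what `clean` does to column names, written against plain Python lists so it runs without a GPU. The helper `normalize_columns`, the sample header, and the shortened mapping are illustrative assumptions and not part of the notebook. The sample mimics a header whose names carry leading spaces.

```python
def normalize_columns(columns, mapper):
    """Return the column names that survive the rules `clean` applies."""
    result = []
    for col in columns:
        name = col.strip().lower()             # ' Fare_Amount' -> 'fare_amount'
        if name in mapper and mapper[name] is None:
            continue                           # mapped to None: drop the column
        result.append(mapper.get(name, name))  # rename if mapped, else keep as-is
    return result

# hypothetical raw header with leading spaces and mixed case
raw_header = ['vendor_id', ' pickup_datetime', ' dropoff_datetime',
              ' Fare_Amount', ' tip_amount']

# shortened version of the notebook's col_map: every listed name maps to None
sample_map = dict.fromkeys(['vendor_id', 'dropoff_datetime', 'tip_amount'])

kept = normalize_columns(raw_header, sample_map)
print(kept)  # ['pickup_datetime', 'fare_amount']
```

Stripping and lower-casing first matters: the lookup in the mapping only matches once `' tip_amount'` has become `'tip_amount'`.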


# Adding Interesting Features

Dask & cuDF provide standard DataFrame operations, but also let you run "user defined functions" on the underlying data.

cuDF's [apply_rows](https://rapidsai.github.io/projects/cudf/en/0.6.0/api.html#cudf.dataframe.DataFrame.apply_rows) operation is similar to Pandas's [DataFrame.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html), except that for cuDF, custom Python code is [JIT compiled by numba](https://numba.pydata.org/numba-doc/dev/cuda/kernels.html) into GPU kernels.
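
Because these user-defined functions are ordinary Python before numba compiles them, the arithmetic can be prototyped on the CPU first. The sketch below is a pure-Python version of the expression used by `haversine_kernel` in the next cell, applied to the first trip from the data preview. `haversine_reference` is a hypothetical helper name, and `diameter_km` defaults to the constant the kernel uses. The result should land close to the `h_distance` of 1.411159 that the notebook reports for this trip.

```python
import math

def haversine_reference(lat1, lon1, lat2, lon2, diameter_km=12734):
    """Pure-Python version of the expression in `haversine_kernel`."""
    p = math.pi / 180  # degrees to radians
    a = (0.5 - math.cos((lat2 - lat1) * p) / 2
         + math.cos(lat1 * p) * math.cos(lat2 * p)
         * (1 - math.cos((lon2 - lon1) * p)) / 2)
    return diameter_km * math.asin(math.sqrt(a))

# pickup and dropoff coordinates of the first trip in the preview
d = haversine_reference(40.729084, -74.005254, 40.727127, -73.988697)
print(round(d, 3))
```

A check like this is cheap to run on a handful of rows before launching the kernel across every partition.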

In [16]:
import math
import cudf

#Numba Kernel to calculate Haversine distance
@numba.cuda.jit
def haversine_kernel(lat1, lon1, lat2, lon2, outputCol):
    iRow = numba.cuda.grid(1)
    p = 0.017453292519943295 # Pi/180
    if iRow < outputCol.size:
        a = 0.5 - math.cos((lat2[iRow] - lat1[iRow]) * p)/2 + math.cos(lat1[iRow] * p) * \
            math.cos(lat2[iRow] * p) * (1 - math.cos((lon2[iRow] - lon1[iRow]) * p)) / 2                                 
        # the multiplier plays the role of Earth's diameter in km (2 * 6371 km would be 12742)
        outputCol[iRow] = 12734 * math.asin(math.sqrt(a))

#ToDo
# use rmm allocation from outside the kernel
# explain ability to use numba kernels, link to examples
# but still use apply_rows directly nicer API
# use keith's example
def haversine_distance(gdf):
    nRows = gdf.shape[0]
    blockSize = 128
    blockCount = nRows // blockSize + 1
    lat1_arr = gdf['pickup_latitude'].to_gpu_array()
    lon1_arr = gdf['pickup_longitude'].to_gpu_array()
    lat2_arr = gdf['dropoff_latitude'].to_gpu_array()
    lon2_arr = gdf['dropoff_longitude'].to_gpu_array()
                        
    # allocate device memory for the result
    outputCol = cudf.rmm.device_array ( shape=(nRows), dtype=lat1_arr.dtype.name)
    
    haversine_kernel[(blockCount),(blockSize)](lat1_arr, lon1_arr, lat2_arr, lon2_arr, outputCol)
    gdf.add_column(name='h_distance', data = outputCol)
    return gdf

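#Numba Kernel to calculate day of the week from Date (0 = Monday ... 6 = Sunday)
#note: year and month are overwritten in place (Jan/Feb become months 11/12 of the
#previous year); the shifted values are visible in the `year` and `month` columns
#of the output below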
@numba.cuda.jit
def day_of_the_week_kernel(output ,year, month, day):
    iRow = numba.cuda.grid(1)
    if iRow < output.size:
        year[iRow] -= month[iRow] < 3
        month[iRow] = (month[iRow] + 9)%12 + 1
        output[iRow] = (year[iRow] + int(year[iRow]/4) - int(year[iRow]/100) + int(year[iRow]/400) + math.floor(2.6*month[iRow] - 0.2) + day[iRow] -1) % 7

#ToDo:
#use rmm instead of cuda.device_array
# replace day of week with apply_rows kernel
def day_of_week(gdf):
    nRows = gdf.shape[0]
    blockSize = 128
    blockCount = nRows // blockSize + 1
    year_arr = gdf['year'].to_gpu_array()
    month_arr = gdf['month'].to_gpu_array()
    day_arr = gdf['day'].to_gpu_array()
    outputCol = cudf.rmm.device_array ( shape=(nRows), dtype=day_arr.dtype.name)
    
    day_of_the_week_kernel[(blockCount),(blockSize)](outputCol, year_arr, month_arr, day_arr)
    gdf.add_column(name='day_of_week', data = outputCol)
    gdf['day_of_week'] = gdf['day_of_week'].astype('float32')
    return gdf

def add_features(df):
    df['hour'] = df['pickup_datetime'].dt.hour
    df['year'] = df['pickup_datetime'].dt.year
    df['month'] = df['pickup_datetime'].dt.month
    df['day'] = df['pickup_datetime'].dt.day
    
    df = df.drop('pickup_datetime')
    
    df = day_of_week(df)
    # day_of_week runs from 0 (Monday) to 6 (Sunday), so this flags Friday through Sunday
    df['is_weekend'] = (df['day_of_week']/4).floor()
    df = haversine_distance(df)
    return df

# Add features
parts = [dask.delayed(add_features)(part) for part in taxi_df.to_delayed()]
taxi_df = dask_cudf.from_delayed(parts)

taxi_df.head().to_pandas()

Unnamed: 0,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,dropoff_longitude,dropoff_latitude,fare_amount,hour,year,month,day,day_of_week,is_weekend,h_distance
0,1,1.2,-74.005254,40.729084,1,-73.988697,40.727127,6.1,2,2010,11,29,5.0,1.0,1.411159
1,1,0.4,-73.968585,40.759171,1,-73.964336,40.764665,4.1,10,2010,11,28,4.0,1.0,0.707559
2,3,1.2,-73.98071,40.74239,1,-73.987028,40.729532,6.1,23,2010,11,28,4.0,1.0,1.524669
3,3,0.8,-73.993773,40.747329,1,-73.991378,40.75005,5.3,23,2010,11,28,4.0,1.0,0.36343
4,1,5.3,-73.991475,40.749936,1,-73.950237,40.775626,25.3,10,2010,11,28,4.0,1.0,4.494138
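
The `year` and `month` values above read 2010 and 11 for trips taken in January 2011 because `day_of_the_week_kernel` shifts both arrays in place. As a hedged illustration, the sketch below repeats the same arithmetic with local variables, so the inputs are left untouched, and compares it with Python's `date.weekday()`. `day_of_week_reference` is a hypothetical CPU-side helper, not a drop-in replacement for the GPU kernel.

```python
import math
from datetime import date, timedelta

def day_of_week_reference(year, month, day):
    """Same arithmetic as `day_of_the_week_kernel`, using local variables."""
    y = year - (month < 3)      # January and February count as the previous year
    m = (month + 9) % 12 + 1    # March -> 1, ..., February -> 12
    return (y + int(y / 4) - int(y / 100) + int(y / 400)
            + math.floor(2.6 * m - 0.2) + day - 1) % 7

# compare against the standard library for every day of 2011 and 2012
start = date(2011, 1, 1)
mismatches = [
    d for d in (start + timedelta(days=i) for i in range(731))
    if day_of_week_reference(d.year, d.month, d.day) != d.weekday()
]
print(len(mismatches))  # 0

# Friday 2011-01-28 and Saturday 2011-01-29
print(day_of_week_reference(2011, 1, 28), day_of_week_reference(2011, 1, 29))  # 4 5
```

With Monday as 0, dividing by 4 and flooring yields 1 for Friday, Saturday, and Sunday, which matches the `is_weekend` value of 1.0 shown for the Friday trips above.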


# Pick a Training Set

We'll filter on the `day` column to divide the data into roughly 75% for training and 25% for test: trips picked up before the 25th of the month are used for training, and the rest are held out.

In [17]:
# ToDo: use multi-column groupby to compute a more meaningful train/test split boundary
# note you could also use taxi_df.query('day < 25') if you prefer that syntax
X_train = taxi_df[taxi_df['day'] < 25]

# create a Y_train ddf with just the target variable
Y_train = X_train[['fare_amount']]

# drop the target variable from the training ddf
X_train = X_train[X_train.columns.difference(['fare_amount'])]

In [18]:
# display how many records will be used in training
def pretty(val):
    print("{:,}".format(val))

pretty(len(X_train))

134,637,456
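
As a rough check on the size of the split, the sketch below computes the share of calendar days in 2011 that fall before the 25th of their month. It assumes trips are spread evenly across days, which is a simplification. `fraction_before_day` is a hypothetical helper used only for this estimate.

```python
import calendar

def fraction_before_day(year, cutoff_day=25):
    """Share of calendar days in `year` whose day-of-month is below cutoff_day."""
    days_in_month = [calendar.monthrange(year, m)[1] for m in range(1, 13)]
    before = sum(min(cutoff_day - 1, n) for n in days_in_month)
    return before / sum(days_in_month)

train_share = fraction_before_day(2011)
print(f"{train_share:.3f}")  # 0.789
```

Under that assumption the `day < 25` filter keeps a little under four fifths of the records for training, in the same ballpark as the target split.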


# Train the XGBoost Regression Model

In [19]:
%%time

import dask_xgboost as dxgb_gpu

params = {
    'learning_rate': 0.05,
    'max_depth': 8,
    'objective': 'reg:linear',
    'subsample': 0.8,
    'gamma': 1,
    'silent': True,
    'verbose_eval': True,
    'tree_method': 'gpu_hist',
    'n_gpus': 1
}

bst = dxgb_gpu.train(client, params, X_train, Y_train, num_boost_round=100)

CPU times: user 2.75 s, sys: 92 ms, total: 2.84 s
Wall time: 2min 59s


# How Good is Our Model?

Now that we have a trained model, we need to test it with the 25% of records we held out.

Based on the filtering conditions applied to this dataset, many of the DataFrame partitions will wind up having 0 rows.

This is a problem for XGBoost, which doesn't know what to do with zero-length arrays. We'll apply a bit of Dask logic to check for and drop partitions without any rows.

In [20]:
X_test = taxi_df.query('day >= 25')

# count the rows in each partition and keep only the non-empty partitions
lengths = X_test.map_partitions(len).compute()
nonempty = [length > 0 for length in lengths]

X_test = X_test.partitions[nonempty]
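
The masking step above can be pictured with plain Python lists standing in for partitions. This is a minimal sketch of the idea only; the toy `partitions` list is an assumption, and real Dask partitions are DataFrames rather than lists.

```python
# stand-in for a partitioned DataFrame: each inner list holds one partition's rows
partitions = [[25, 31], [], [27], [], []]

lengths = [len(part) for part in partitions]   # plays the role of map_partitions(len)
nonempty = [length > 0 for length in lengths]  # boolean mask, one entry per partition
kept = [part for part, keep in zip(partitions, nonempty) if keep]

print(lengths)  # [2, 0, 1, 0, 0]
print(kept)     # [[25, 31], [27]]
```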

In [21]:
# Create Y_test with just the fare amount
Y_test = X_test[['fare_amount']]

# Drop the fare amount from X_test
X_test = X_test[X_test.columns.difference(['fare_amount'])]

# generate predictions on the test set
Y_test['prediction'] = dxgb_gpu.predict(client, bst, X_test)

# Compute Root Mean Squared Error

In [22]:
Y_test['squared_error'] = (Y_test['prediction'] - Y_test['fare_amount'])**2

# inspect the results to make sure our calculation looks right
Y_test.head().to_pandas()

Unnamed: 0,fare_amount,prediction,squared_error
0,6.1,6.006429,0.008756
1,4.1,4.408784,0.095348
2,6.1,6.053908,0.002124
3,5.3,5.076191,0.05009
4,25.3,16.606407,75.578556


In [23]:
# compute the actual RMSE over the full test set
math.sqrt(Y_test.squared_error.mean().compute())

2.010007212284223
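
To sanity-check the calculation by hand, the sketch below recomputes the squared error for the five rows displayed earlier and takes the RMSE of just those rows. The values are copied from the output above. The five-row RMSE is dominated by the single 25.3 fare, so it is not expected to match the full test-set figure.

```python
import math

# (fare_amount, prediction, squared_error) for the five displayed rows
rows = [
    (6.1, 6.006429, 0.008756),
    (4.1, 4.408784, 0.095348),
    (6.1, 6.053908, 0.002124),
    (5.3, 5.076191, 0.05009),
    (25.3, 16.606407, 75.578556),
]

recomputed = [(pred - fare) ** 2 for fare, pred, _ in rows]
max_gap = max(abs(r - shown) for r, (_, _, shown) in zip(recomputed, rows))
sample_rmse = math.sqrt(sum(recomputed) / len(recomputed))

print(f"largest gap vs. displayed squared_error: {max_gap:.1e}")
print(f"RMSE of these five rows only: {sample_rmse:.3f}")
```

Small gaps are expected because the displayed predictions are rounded to six decimal places.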

# Takeaways

We just demonstrated how to use GPU DataFrames to scale ETL-style operations out to multiple GPUs on multiple nodes.

We also showed how to pass prepared data directly to XGBoost without the data ever leaving GPU memory. As a result, we can run end-to-end data processing _and_ model training faster, using less hardware than with a CPU-based solution.

What now?

[Check out RAPIDS on GitHub](https://github.com/rapidsai) and follow the development, or pitch in by reporting issues, making pull requests or even just requesting the features your workflows need. We look forward to hearing from you!

# Appendix

In [4]:
import os
import pandas as pd

# generate list of all files
base_dir = '/data/nyc_taxi/raw/'
files = []
for year in range(2009, 2019):
    for fn in os.listdir(base_dir+str(year)):
        if 'yellow' in fn:
            files.append(base_dir+str(year)+'/'+fn)

# get list of headers
def get_columns(fn):
    df = pd.DataFrame()
    with open(fn, 'r') as fp:
        df['year'] = [fn.split('-')[-2].split('_')[-1]]
        df['month'] = [fn.split('-')[-1].split('.')[0]]
        df['line'] = [fp.readline()]
    return df

parts = [dask.delayed(get_columns)(fn) for fn in files]
res = dask.dataframe.from_delayed(parts)
res.repartition(npartitions=1).compute().line.drop_duplicates()

0    vendor_name,Trip_Pickup_DateTime,Trip_Dropoff_...
0    vendor_id,pickup_datetime,dropoff_datetime,pas...
0    vendor_id, pickup_datetime, dropoff_datetime, ...
0    VendorID,tpep_pickup_datetime,tpep_dropoff_dat...
0    VendorID,tpep_pickup_datetime,tpep_dropoff_dat...
0    VendorID,tpep_pickup_datetime,tpep_dropoff_dat...
Name: line, dtype: object

In [6]:
# return columns in df2 but not df1
def column_delta(df1, df2):
    return list(set(df2.columns.map(str.lower)) - set(df1.columns.map(str.lower)))

In [6]:
# years >= 2015 data has different column names
# remap to match existing schema
newer_df = dask_cudf.read_csv('/data/nyc-taxi/2015/yellow_tripdata_2015-1*')

# data for 2015+ has more columns
# assume we should drop them
for col in column_delta(taxi_df, newer_df):
    col_map[col] = None

# ratecodeid and tpep_pickup_datetime map to columns we had in years < 2015
col_map['ratecodeid'] = 'rate_code'
col_map['tpep_pickup_datetime'] = 'pickup_datetime'

# reuse the `clean` helper defined in the Data Cleanup section
parts = [dask.delayed(clean)(part, col_map) for part in newer_df.to_delayed()]
newer_df = dask_cudf.from_delayed(parts)

taxi_df = taxi_df.append(newer_df)