# Predicting NYC Taxi Fares with RAPIDS

[RAPIDS](https://rapids.ai/) is a suite of GPU-accelerated data science libraries with APIs that should be familiar to users of pandas, Dask, and scikit-learn.

This notebook shows how to use cuDF with Dask & XGBoost to scale GPU DataFrame ETL-style operations & model training out to multiple GPUs on multiple nodes as part of Google Cloud Dataproc.

Anaconda has graciously made some of the NYC Taxi dataset available in [a public Google Cloud Storage bucket](https://console.cloud.google.com/storage/browser/anaconda-public-data/nyc-taxi/csv/). We'll use our Dataproc Cluster of GPUs to process it and train a model that predicts the fare amount.

In [1]:
import numpy as np
import numba, xgboost, socket

import dask, dask_cudf
from dask_cuda import LocalCUDACluster
from dask.delayed import delayed
from dask.distributed import Client, wait

# connect to the Dask cluster that Dataproc stood up
client = Client(socket.gethostname()+':8786')
# forces workers to restart. useful to ensure GPU memory is clear
client.restart()

# attempt to limit work-stealing as much as possible
dask.config.set({'distributed.scheduler.work-stealing': False})
dask.config.get('distributed.scheduler.work-stealing')
dask.config.set({'distributed.scheduler.bandwidth': 1})
dask.config.get('distributed.scheduler.bandwidth')

client

Client: Scheduler: tcp://test-m:8786, Dashboard: http://test-m:8787/status
Cluster: Workers: 12, Cores: 12, Memory: 0 B


# Inspecting the Data

Now that we have a cluster of GPU workers, we'll use [dask-cudf](https://github.com/rapidsai/dask-cudf/) to load and parse a bunch of CSV files into a distributed DataFrame.

First we'll tell Dask to have all workers use `gsutil cp` to copy data from the GCS bucket to local disk. This is significantly faster than having Dask read from the bucket directly into a DataFrame.
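
Dask starts one worker per GPU, so a machine with several GPUs appears several times in the worker list, while the copy only needs to run once per machine. Here is a minimal sketch of that de-duplication. It assumes made-up worker addresses (the hosts and ports are illustrative, not taken from this cluster) and uses the standard library's `urlsplit` to extract the host; `one_worker_per_host` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def one_worker_per_host(worker_addresses):
    """Return the first worker address seen for each distinct host."""
    chosen = {}
    for address in worker_addresses:
        host = urlsplit(address).hostname
        # setdefault keeps the first address recorded for a host
        chosen.setdefault(host, address)
    return list(chosen.values())

# hypothetical addresses: two machines with two GPU workers each
example_workers = [
    "tcp://10.0.0.2:40001",
    "tcp://10.0.0.2:40002",
    "tcp://10.0.0.3:40001",
    "tcp://10.0.0.3:40002",
]
print(one_worker_per_host(example_workers))
```

The resulting list is the kind of value that can be passed as the `workers=` argument when running a function on selected workers.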

In [2]:
# dask-cuda-worker creates one worker per GPU on Dataproc worker instances
# Get the list of unique IP addresses
machine_ips = []
machines = []
workers = list(client.scheduler_info()['workers'].keys())
for worker in workers:
    ip = worker.split(":")[1].replace("//", "")
    if ip not in machine_ips:
        machine_ips.append(ip)
        machines.append(worker)

In [29]:
%%time
import subprocess

def copy_files(year):
    proc = subprocess.Popen(
        "gsutil cp -r gs://anaconda-public-data/nyc-taxi/csv/"+str(year)+" .",
        shell=True,
        stdout=subprocess.PIPE
    )
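    # communicate() waits for the process to exit and drains stdout,
    # which avoids blocking if the pipe buffer fills up
    return proc.communicate()[0]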

# have each worker instance copy data over to local disk
for year in [2014, 2015, 2016]:
    # pass the function and its argument separately so it runs on the workers
    client.run(copy_files, year, workers=machines)

CPU times: user 80 ms, sys: 48 ms, total: 128 ms
Wall time: 1min 15s


In [30]:
%%time

# have the master copy data as well
for year in [2014, 2015, 2016]:
    copy_files(year)

CPU times: user 180 ms, sys: 68 ms, total: 248 ms
Wall time: 2min 22s


b''

In [2]:
taxi_df = dask_cudf.read_csv('2014/yellow*')
taxi_df.head().to_pandas()

Unnamed: 0,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,surcharge,mta_tax,tip_amount,tolls_amount,total_amount
0,CMT,2014-01-09 20:45:25,2014-01-09 20:52:31,1,0.7,-73.99477,40.736828,1,N,-73.982227,40.73179,CRD,6.5,0.5,0.5,1.4,0.0,8.9
1,CMT,2014-01-09 20:46:12,2014-01-09 20:55:12,1,1.4,-73.982392,40.773382,1,N,-73.960449,40.763995,CRD,8.5,0.5,0.5,1.9,0.0,11.4
2,CMT,2014-01-09 20:44:47,2014-01-09 20:59:46,2,2.3,-73.98857,40.739406,1,N,-73.986626,40.765217,CRD,11.5,0.5,0.5,1.5,0.0,14.0
3,CMT,2014-01-09 20:44:57,2014-01-09 20:51:40,1,1.7,-73.960213,40.770464,1,N,-73.979863,40.77705,CRD,7.5,0.5,0.5,1.7,0.0,10.2
4,CMT,2014-01-09 20:47:09,2014-01-09 20:53:32,1,0.9,-73.995371,40.717248,1,N,-73.984367,40.720524,CRD,6.0,0.5,0.5,1.75,0.0,8.75


# Data Cleanup

As usual, the data needs to be massaged a bit before we can start adding features that are useful to an ML model.

In this case, the taxi data for different years has different column names. We'll do a little string manipulation, column renaming and dropping to fix the problem.
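
To make the normalize, drop, and rename sequence concrete, here is a small sketch that works on plain lists of column names rather than cuDF partitions. The header and mapping below are hypothetical examples in the style of the later files, and `plan_columns` is an illustrative helper, not part of any library.

```python
def plan_columns(columns, mapper):
    """Return the column names left after normalizing, dropping, and renaming."""
    result = []
    for name in columns:
        normalized = name.strip().lower()
        if normalized in mapper:
            new_name = mapper[normalized]
            if new_name is None:
                continue  # no replacement supplied: drop the column
            result.append(new_name)
        else:
            result.append(normalized)
    return result

# hypothetical raw header with stray spaces and mixed case
raw_header = [" VendorID", "tpep_pickup_datetime", " RatecodeID", "fare_amount"]
example_map = {
    "vendorid": None,
    "tpep_pickup_datetime": "pickup_datetime",
    "ratecodeid": "rate_code",
}
print(plan_columns(raw_header, example_map))
```

Columns absent from the mapping keep their normalized name, which is why the mapping only needs to list the names that differ between years.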

In [3]:
import cudf

# helper function which takes a DataFrame partition
def clean_delayed(df_part, mapper, must_haves):    
    # some column names have leading spaces or mixed case; strip and lowercase them
    tmp = {col:col.strip().lower() for col in list(df_part.columns)}
    df_part = df_part.rename(tmp)
    
    # drop any column without a supplied replacement name
    for col in mapper:
        if mapper[col] is None and col in df_part.columns:
            df_part = df_part.drop(col)
    
    # rename according to supplied mapping
    df_part = df_part.rename(mapper)
        
    # fill NA values with -1; cast string columns to float64
    # and narrow numeric columns to 32-bit types
    for col in df_part.columns:
        if df_part[col].dtype == 'object':
            df_part[col] = df_part[col].str.fillna('-1')
            df_part[col] = df_part[col].astype('float64')
        else:
            if 'int' in str(df_part[col].dtype):
                df_part[col] = df_part[col].astype('int32')
            if 'float' in str(df_part[col].dtype):
                df_part[col] = df_part[col].astype('float32')
            df_part[col] = df_part[col].fillna(-1)
    
    # some CSV files are missing columns
    for col, dtype in must_haves.items():
        if col not in df_part.columns or str(df_part[col].dtype) != dtype:
            empty_df = cudf.DataFrame()
            for col_name, str_dtype in must_haves.items():
                # these will be filtered downstream
                empty_df[col_name] = [-1]
                empty_df[col_name] = empty_df[col_name].astype(str_dtype)
            return empty_df
        
    return df_part

# create a dict mapping existing names to intended names
# any dict entry with `None` will be dropped
col_map = dict.fromkeys([
    'vendor_id', 'vendorid', 'payment_type', 'surcharge', 'mta_tax',
    'tip_amount', 'tolls_amount', 'total_amount', 'store_and_fwd_flag', 'pulocationid',
    'dolocationid', 'extra', 'improvement_surcharge'
])
col_map['tpep_pickup_datetime'] = 'pickup_datetime'
col_map['tpep_dropoff_datetime'] = 'dropoff_datetime'
col_map['ratecodeid'] = 'rate_code'

#create a list of columns & dtypes the df must have
must_haves = {
 'pickup_datetime': 'datetime64[ms]',
 'dropoff_datetime': 'datetime64[ms]',
 'passenger_count': 'int32',
 'trip_distance': 'float32',
 'pickup_longitude': 'float32',
 'pickup_latitude': 'float32',
 'rate_code': 'int32',
 'dropoff_longitude': 'float32',
 'dropoff_latitude': 'float32',
 'fare_amount': 'float32'
}

In [4]:
def clean(df):
    parts = [dask.delayed(clean_delayed)(part, col_map, must_haves) for part in df.to_delayed()]
    return dask_cudf.from_delayed(parts)

taxi_df = clean(taxi_df)
taxi_df.head().to_pandas()

Unnamed: 0,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,dropoff_longitude,dropoff_latitude,fare_amount
0,2014-01-09 20:45:25,2014-01-09 20:52:31,1,0.7,-73.994766,40.736828,1,-73.982224,40.731789,6.5
1,2014-01-09 20:46:12,2014-01-09 20:55:12,1,1.4,-73.982391,40.77338,1,-73.960449,40.763996,8.5
2,2014-01-09 20:44:47,2014-01-09 20:59:46,2,2.3,-73.988571,40.739407,1,-73.986626,40.765217,11.5
3,2014-01-09 20:44:57,2014-01-09 20:51:40,1,1.7,-73.960213,40.770466,1,-73.979866,40.77705,7.5
4,2014-01-09 20:47:09,2014-01-09 20:53:32,1,0.9,-73.995369,40.717247,1,-73.984367,40.720524,6.0


# Increasing Our Training Data Size

There are two more years (2015, 2016) worth of data. We'll clean both and add them to the dataset; the split into training and test sets happens later, by day of the month.

In [5]:
# combine and apply a last-step filter:
df_2 = clean(dask_cudf.read_csv('2015/yellow*'))
df_3 = clean(dask_cudf.read_csv('2016/yellow*'))

taxi_df = dask.dataframe.multi.concat([taxi_df, df_2, df_3])

# apply a list of filter conditions to throw out records with missing or outlier values
query_frags = [
    'fare_amount > 0 and fare_amount < 500',
    'passenger_count > 0 and passenger_count < 6',
    'pickup_longitude > -75 and pickup_longitude < -73',
    'dropoff_longitude > -75 and dropoff_longitude < -73',
    'pickup_latitude > 40 and pickup_latitude < 42',
    'dropoff_latitude > 40 and dropoff_latitude < 42'
]
taxi_df = taxi_df.query(' and '.join(query_frags))

# inspect the results of cleaning
taxi_df.head().to_pandas()

Unnamed: 0,pickup_datetime,dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,dropoff_longitude,dropoff_latitude,fare_amount
0,2014-01-09 20:45:25,2014-01-09 20:52:31,1,0.7,-73.994766,40.736828,1,-73.982224,40.731789,6.5
1,2014-01-09 20:46:12,2014-01-09 20:55:12,1,1.4,-73.982391,40.77338,1,-73.960449,40.763996,8.5
2,2014-01-09 20:44:47,2014-01-09 20:59:46,2,2.3,-73.988571,40.739407,1,-73.986626,40.765217,11.5
3,2014-01-09 20:44:57,2014-01-09 20:51:40,1,1.7,-73.960213,40.770466,1,-73.979866,40.77705,7.5
4,2014-01-09 20:47:09,2014-01-09 20:53:32,1,0.9,-73.995369,40.717247,1,-73.984367,40.720524,6.0
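
The placeholder rows that the cleaning step emits for malformed partitions carry -1 in every numeric column, so the range checks above discard them along with genuine outliers. This plain-Python sketch applies the same bounds to dictionaries rather than a GPU DataFrame; the `bounds` table and `keep` helper are illustrative names.

```python
# exclusive (low, high) bounds mirroring the query fragments
bounds = {
    "fare_amount": (0, 500),
    "passenger_count": (0, 6),
    "pickup_longitude": (-75, -73),
    "dropoff_longitude": (-75, -73),
    "pickup_latitude": (40, 42),
    "dropoff_latitude": (40, 42),
}

def keep(row):
    """True when every bounded column lies strictly inside its range."""
    return all(low < row[col] < high for col, (low, high) in bounds.items())

# first row of the sample output
real_trip = {
    "fare_amount": 6.5,
    "passenger_count": 1,
    "pickup_longitude": -73.994766,
    "dropoff_longitude": -73.982224,
    "pickup_latitude": 40.736828,
    "dropoff_latitude": 40.731789,
}
# the sentinel row produced when a partition is missing columns
placeholder = {col: -1 for col in bounds}

print(keep(real_trip), keep(placeholder))
```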


# Adding Interesting Features

Dask & cuDF provide standard DataFrame operations, but also let you run "user defined functions" on the underlying data.

cuDF's [apply_rows](https://rapidsai.github.io/projects/cudf/en/0.6.0/api.html#cudf.dataframe.DataFrame.apply_rows) operation is similar to Pandas's [DataFrame.apply](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html), except that for cuDF, custom Python code is [JIT compiled by numba](https://numba.pydata.org/numba-doc/dev/cuda/kernels.html) into GPU kernels.
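
The distance feature computed in the next cell uses the haversine formula. As a CPU-only sketch of the same arithmetic for a single trip, the following can be handy for spot-checking kernel output. The function name is illustrative, and the coordinates are those of the first trip in the cleaned sample output above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

distance = haversine_km(40.736828, -73.994766, 40.731789, -73.982224)
print(round(distance, 3))
```

A GPU kernel has the same body, with the arithmetic applied to each row of the input columns instead of to scalars.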

In [6]:
import math
from math import cos, sin, asin, sqrt, pi

def haversine_distance_kernel(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude, h_distance):
    for i, (x_1, y_1, x_2, y_2) in enumerate(zip(pickup_latitude, pickup_longitude, dropoff_latitude, dropoff_longitude)):
        x_1 = pi/180 * x_1
        y_1 = pi/180 * y_1
        x_2 = pi/180 * x_2
        y_2 = pi/180 * y_2
        
        dlon = y_2 - y_1
        dlat = x_2 - x_1
        a = sin(dlat/2)**2 + cos(x_1) * cos(x_2) * sin(dlon/2)**2
        
        c = 2 * asin(sqrt(a)) 
        r = 6371 # Radius of earth in kilometers
        
        h_distance[i] = c * r

def day_of_the_week_kernel(day, month, year, day_of_week):
    for i, (d_1, m_1, y_1) in enumerate(zip(day, month, year)):
        if month[i] <3:
            shift = month[i]
        else:
            shift = 0
        Y = year[i] - (month[i] < 3)
        y = Y - 2000
        c = 20
        d = day[i]
        m = month[i] + shift + 1
        day_of_week[i] = (d + math.floor(m*2.6) + y + (y//4) + (c//4) -2*c)%7
        
def add_features_delayed(df):
    df['hour'] = df['pickup_datetime'].dt.hour
    df['year'] = df['pickup_datetime'].dt.year
    df['month'] = df['pickup_datetime'].dt.month
    df['day'] = df['pickup_datetime'].dt.day
    df['diff'] = df['dropoff_datetime'].astype('int64') - df['pickup_datetime'].astype('int64')
    
    df['pickup_latitude_r'] = df['pickup_latitude'].ceil()
    df['pickup_longitude_r'] = df['pickup_longitude'].ceil()
    df['dropoff_latitude_r'] = df['dropoff_latitude'].ceil()
    df['dropoff_longitude_r'] = df['dropoff_longitude'].ceil()
    
    df = df.drop('pickup_datetime')
    df = df.drop('dropoff_datetime')
    
    
    df = df.apply_rows(haversine_distance_kernel,
                   incols=['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude'],
                   outcols=dict(h_distance=np.float64),
                   kwargs=dict())
    
    
    df = df.apply_rows(day_of_the_week_kernel,
                      incols=['day', 'month', 'year'],
                      outcols=dict(day_of_week=np.float64),
                      kwargs=dict())
    
    
    df['is_weekend'] = (df['day_of_week']<2)
    return df

# Dropping Empty Partitions

Based on the filters applied to this dataset, some partitions will be empty. We can use a bit of Dask logic to filter them.

In [7]:
def drop_empty_partitions(df):
    lengths = df.map_partitions(len).compute()
    nonempty = [length > 0 for length in lengths]
    return df.partitions[nonempty]

In [8]:
taxi_df = drop_empty_partitions(taxi_df)

# now add the features
parts = [dask.delayed(add_features_delayed)(part) for part in taxi_df.to_delayed()]
taxi_df = dask_cudf.from_delayed(parts)

# inspect the result
taxi_df.head().to_pandas()

Unnamed: 0,passenger_count,trip_distance,pickup_longitude,pickup_latitude,rate_code,dropoff_longitude,dropoff_latitude,fare_amount,hour,year,month,day,diff,pickup_latitude_r,pickup_longitude_r,dropoff_latitude_r,dropoff_longitude_r,h_distance,day_of_week,is_weekend
0,1,0.7,-73.994766,40.736828,1,-73.982224,40.731789,6.5,20,2014,1,9,426000,41.0,-73.0,41.0,-73.0,1.196175,4.0,False
1,1,1.4,-73.982391,40.77338,1,-73.960449,40.763996,8.5,20,2014,1,9,540000,41.0,-73.0,41.0,-73.0,2.122098,4.0,False
2,2,2.3,-73.988571,40.739407,1,-73.986626,40.765217,11.5,20,2014,1,9,899000,41.0,-73.0,41.0,-73.0,2.874643,4.0,False
3,1,1.7,-73.960213,40.770466,1,-73.979866,40.77705,7.5,20,2014,1,9,403000,41.0,-73.0,41.0,-73.0,1.809662,4.0,False
4,1,0.9,-73.995369,40.717247,1,-73.984367,40.720524,6.0,20,2014,1,9,383000,41.0,-73.0,41.0,-73.0,0.996204,4.0,False
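
The `day_of_week` kernel follows the shape of Zeller's congruence. The sketch below is a textbook CPU version of that congruence; it is an assumption about the intended formula rather than a copy of the kernel. In Zeller's convention 0 is Saturday and 1 is Sunday, which is what makes `day_of_week < 2` a weekend test. For the sample rows above (9 January 2014, a Thursday) the textbook form yields 5 while the table shows 4.0, so January and February results are worth cross-checking.

```python
from datetime import date

def zeller_day_of_week(year, month, day):
    """Zeller's congruence (Gregorian): 0 = Saturday, 1 = Sunday, ..., 6 = Friday."""
    if month < 3:
        # January and February count as months 13 and 14 of the previous year
        month += 12
        year -= 1
    century, year_of_century = divmod(year, 100)
    return (day + (13 * (month + 1)) // 5 + year_of_century
            + year_of_century // 4 + century // 4 - 2 * century) % 7

def is_weekend(year, month, day):
    return zeller_day_of_week(year, month, day) < 2

# cross-check against the standard library (Monday = 0 ... Sunday = 6)
for d in (date(2014, 1, 9), date(2015, 7, 4), date(2016, 2, 29)):
    assert zeller_day_of_week(d.year, d.month, d.day) == (d.weekday() + 2) % 7
```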


# Pick a Training Set

Let's imagine you're making a trip to New York on the 25th and want to build a model to predict what fare prices will be like the last few days of the month based on the first part of the month. We'll use a query expression to identify the `day` of the month to use to divide the data into train and test sets.

The wall-time below represents how long it takes your GPU cluster to run the ETL portion of the workflow.

In [9]:
%%time
X_train = taxi_df.query('day < 25').persist()

# create a Y_train ddf with just the target variable
Y_train = X_train[['fare_amount']]
# drop the target variable from the training ddf
X_train = X_train[X_train.columns.difference(['fare_amount'])]

# this won't return until all data is in GPU memory
done = wait([X_train, Y_train])

CPU times: user 6.07 s, sys: 52 ms, total: 6.12 s
Wall time: 1min 12s


In [10]:
# display how many records will be used in training
def pretty(val):
    print("{:,}".format(val))

pretty(len(X_train))

286,955,292


# Train the XGBoost Regression Model

The wall time output below indicates how long it took your GPU cluster to train an XGBoost model over the training set.

In [11]:
%%time

import dask_xgboost as dxgb_gpu

params = {
 'learning_rate': 0.3,
  'max_depth': 8,
  'objective': 'reg:squarederror', #gpu:reg deprecated
  'subsample': 0.6,
  'gamma': 1,
  'silent': True,
  'verbose_eval': True,
  'tree_method':'gpu_hist',
  'n_gpus': 1
}

bst = dxgb_gpu.train(client, params, X_train, Y_train, num_boost_round=100)

CPU times: user 292 ms, sys: 20 ms, total: 312 ms
Wall time: 1min 34s


# How Good is Our Model?

Now that we have a trained model, we need to test it with the records we held out: trips picked up on or after the 25th of the month, roughly 20% of the data.

Based on the filtering conditions applied to this dataset, many of the DataFrame partitions will wind up having 0 rows.

This is a problem for XGBoost which doesn't know what to do with 0 length arrays. We'll apply a bit of Dask logic to check for and drop partitions without any rows.

In [12]:
X_test = taxi_df.query('day >= 25').persist()
X_test = drop_empty_partitions(X_test)

# Create Y_test with just the fare amount
Y_test = X_test[['fare_amount']]

# Drop the fare amount from X_test
X_test = X_test[X_test.columns.difference(['fare_amount'])]

# display test set size
pretty(len(X_test))

72,909,669
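
The size of the held-out share follows directly from the two record counts printed in this notebook:

```python
train_rows = 286_955_292  # records with day < 25
test_rows = 72_909_669    # records with day >= 25

test_share = test_rows / (train_rows + test_rows)
print(f"{test_share:.1%} of the filtered records are held out for testing")
```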


In [13]:
# generate predictions on the test set
Y_test['prediction'] = dxgb_gpu.predict(client, bst, X_test)

# Compute Root Mean Squared Error

In [14]:
Y_test['squared_error'] = (Y_test['prediction'] - Y_test['fare_amount'])**2

# inspect the results to make sure our calculation looks right
Y_test.head().to_pandas()

Unnamed: 0,fare_amount,prediction,squared_error
205295,13.0,12.75034,0.06233
205431,7.5,7.669405,0.028698
205493,8.0,8.145337,0.021123
205805,8.0,7.944676,0.003061
206044,14.5,14.436069,0.004087
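
As a sanity check on the `squared_error` column, the same arithmetic can be reproduced on the CPU for the five rows shown. This is only a sketch over the displayed sample; the RMSE of five rows says little about the RMSE of the full test set.

```python
from math import sqrt

# (fare_amount, prediction) pairs copied from the sample output
sample = [
    (13.0, 12.75034),
    (7.5, 7.669405),
    (8.0, 8.145337),
    (8.0, 7.944676),
    (14.5, 14.436069),
]

squared_errors = [(prediction - fare) ** 2 for fare, prediction in sample]
sample_rmse = sqrt(sum(squared_errors) / len(squared_errors))

print([round(err, 6) for err in squared_errors])
print(round(sample_rmse, 4))
```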


In [15]:
%%time
# compute the actual RMSE over the full test set
math.sqrt(Y_test.squared_error.mean().compute())

CPU times: user 376 ms, sys: 264 ms, total: 640 ms
Wall time: 3.02 s


1.8511388438225553

Not bad! Our predictions have a root mean squared error of about $1.85 on the held-out trips.

If I'm planning to head to Strata Data in NYC, I can probably fill out my ground transportation expense items ahead of time.

# Takeaways

We just demonstrated how to use GPU DataFrames to scale ETL style operations out to multiple GPUs on multiple nodes.

We also showed how to pass prepared data directly to XGBoost without the data ever leaving GPU memory. As a result, we can run end-to-end data processing _and_ model training faster, using less hardware than with a CPU-based solution.

What now?

[Check out RAPIDS on GitHub](https://github.com/rapidsai) and follow the development, or pitch in by reporting issues, making pull requests or even just requesting the features your workflows need. We look forward to hearing from you!