# New York City Taxi Fare Prediction

The task is to predict the fare amount (inclusive of tolls) for a taxi ride in New York City given the pickup and dropoff locations. A basic estimate based only on the distance between the two points gives an RMSE of $5-$8, depending on the model used. The challenge is to do better than this using machine learning techniques!
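
To make the distance-based baseline concrete, here is a minimal sketch of one common way to measure the distance between the pickup and dropoff points: the haversine (great-circle) formula. The function name `haversine_km` and the Earth radius constant are assumptions made for this illustration, not part of the competition material.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    # Convert all coordinates from degrees to radians
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Coordinates of the first ride in the training sample: roughly a 1 km trip
d = haversine_km(40.721319, -73.844311, 40.712278, -73.841610)
print(round(float(d), 2))
```

Because the function uses NumPy operations, it should also accept whole pandas columns, so a distance feature could be computed for every ride in one call.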

All datasets and the task itself were taken from the Kaggle playground competition; I do not own any of them.

## Taxi fare prediction analysis - cleaning up the training data


First we need to look at the data:

In [1]:
# load some default Python modules that will be used for all the sections of the notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
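# the 'seaborn-whitegrid' style name was removed from recent matplotlib releases;
# seaborn's own whitegrid style gives a similar look and works across versions
sns.set_style('whitegrid')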



In [2]:
# read only the first 10,000 rows of the training data, as the file is big
tax_train = pd.read_csv("train.csv", nrows=10_000, parse_dates=["pickup_datetime"])

# check which columns the dataframe has
tax_train.columns.to_list()

['key',
 'fare_amount',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']

In [3]:
# check what the table looks like and get the shape of the dataframe
print(tax_train.shape)
tax_train.head()

(10000, 8)


| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 26:21.0 | 4.5 | 2009-06-15 17:26:21+00:00 | -73.844311 | 40.721319 | -73.84161 | 40.712278 | 1 |
| 1 | 52:16.0 | 16.9 | 2010-01-05 16:52:16+00:00 | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 |
| 2 | 35:00.0 | 5.7 | 2011-08-18 00:35:00+00:00 | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 2 |
| 3 | 30:42.0 | 7.7 | 2012-04-21 04:30:42+00:00 | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 4 | 51:00.0 | 5.3 | 2010-03-09 07:51:00+00:00 | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |


I check the data types; I will need them later to pick features for training.

In [4]:
# check datatypes
tax_train.dtypes

key                               object
fare_amount                      float64
pickup_datetime      datetime64[ns, UTC]
pickup_longitude                 float64
pickup_latitude                  float64
dropoff_longitude                float64
dropoff_latitude                 float64
passenger_count                    int64
dtype: object
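
The dtypes can drive feature selection directly. The sketch below uses a tiny hypothetical frame with the same kinds of columns (the frame `sample` and the list names are assumptions for illustration) and keeps only the numeric columns, minus the target.

```python
import pandas as pd

# Tiny hypothetical frame mirroring the column types of the training data
sample = pd.DataFrame({
    "key": ["a", "b"],
    "fare_amount": [4.5, 16.9],
    "pickup_datetime": pd.to_datetime(
        ["2009-06-15 17:26:21", "2010-01-05 16:52:16"], utc=True
    ),
    "passenger_count": [1, 1],
})

# Keep numeric columns only; object and datetime columns are excluded
numeric_cols = sample.select_dtypes(include="number").columns.to_list()

# Drop the target so that only candidate features remain
feature_cols = [c for c in numeric_cols if c != "fare_amount"]
print(feature_cols)
```

Applied to `tax_train`, the same two lines would leave the four coordinate columns and `passenger_count` as candidate features.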

I check the summary statistics of the data:

In [5]:
tax_train.describe()


| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 11.235464 | -72.46666 | 39.920448 | -72.474094 | 39.893281 | 1.6447 |
| std | 9.584258 | 10.609729 | 7.318932 | 10.579732 | 6.339919 | 1.271229 |
| min | -2.9 | -74.438233 | -74.006893 | -74.429332 | -73.994392 | 0.0 |
| 25% | 6.0 | -73.992058 | 40.734547 | -73.991112 | 40.73523 | 1.0 |
| 50% | 8.5 | -73.981758 | 40.752693 | -73.980083 | 40.753738 | 1.0 |
| 75% | 12.5 | -73.966925 | 40.767694 | -73.963504 | 40.768186 | 2.0 |
| max | 180.0 | 40.766125 | 401.083332 | 40.802437 | 41.366138 | 6.0 |


I double-check whether there are any missing values (there are none).

In [6]:
# get the number of missing data points per column
missing_values_count = tax_train.isnull().sum()

# look at the # of missing points
print(missing_values_count)

key                  0
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    0
dropoff_latitude     0
passenger_count      0
dtype: int64


The summary statistics show that some `fare_amount` values are negative. A fare cannot be negative, so I drop those rows.

In [9]:
# keep only rows with a positive fare (this also drops zero fares)
tax_train = tax_train[tax_train.fare_amount > 0]

In [10]:
# check what the data looks like now
tax_train.describe()

| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|
| count | 9998.0 | 9998.0 | 9998.0 | 9998.0 | 9998.0 | 9998.0 |
| mean | 11.238252 | -72.466375 | 39.920296 | -72.47381 | 39.893123 | 1.644829 |
| std | 9.583189 | 10.610771 | 7.319656 | 10.580772 | 6.340543 | 1.271324 |
| min | 0.01 | -74.438233 | -74.006893 | -74.429332 | -73.994392 | 0.0 |
| 25% | 6.0 | -73.992056 | 40.734564 | -73.991109 | 40.735235 | 1.0 |
| 50% | 8.5 | -73.981758 | 40.752695 | -73.980083 | 40.75374 | 1.0 |
| 75% | 12.5 | -73.966934 | 40.767696 | -73.963512 | 40.768187 | 2.0 |
| max | 180.0 | 40.766125 | 401.083332 | 40.802437 | 41.366138 | 6.0 |


OK, now there are no negative values in `fare_amount` and no NaN values. All good.

## Taxi fare prediction analysis - cleaning up the test data

Next, I look at the test set to see what its columns look like.

In [11]:
# Read test data

tax_test = pd.read_csv("test.csv", nrows = 2_000)

# check which columns the data table has
tax_test.columns.to_list()

['key',
 'pickup_datetime',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'passenger_count']

In [13]:
# Check what the test data looks like
tax_test.describe()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 1.0 |
| mean | 40.69283 | -73.916841 | 40.69312 | 1.019872 | 1.0 |
| std | 2.565463 | 2.565859 | 2.565693 | 0.888699 | NaN |
| min | -73.97332 | -74.263242 | -73.98143 | 1.0 | 1.0 |
| 25% | 40.735097 | -73.991009 | 40.7347 | 1.0 | 1.0 |
| 50% | 40.752976 | -73.98015 | 40.753934 | 1.0 | 1.0 |
| 75% | 40.767182 | -73.963539 | 40.768612 | 1.0 | 1.0 |
| max | 41.06966 | 40.763805 | 41.051657 | 40.743835 | 1.0 |


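As expected, there is no `fare_amount` column, since that is the value I need to predict.
The `passenger_count` column contains NaN values: its count is 1 out of 2,000 rows. The other columns also look shifted by one position (for example, the `pickup_longitude` mean is about 40.7, a latitude-like value), which suggests the test file may have been parsed with misaligned columns.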
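
The missing-value check used on the training data can be repeated for the test set. The sketch below wraps it in a small helper; `missing_summary` and the `demo` frame are hypothetical names, and the demo values are illustrative rather than taken from the real file.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and share of missing values per column (hypothetical helper)."""
    counts = df.isnull().sum()
    return pd.DataFrame({"missing": counts, "share": counts / len(df)})


# Stand-in for tax_test: illustrative values only, not from the real file
demo = pd.DataFrame({
    "pickup_longitude": [-73.97, -73.98, -73.99],
    "passenger_count": [1.0, np.nan, np.nan],
})

summary = missing_summary(demo)
print(summary)
```

In the notebook, calling `missing_summary(tax_test)` would report how many `passenger_count` values are actually missing.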

After looking at the data, I need to determine what kind of problem this is, so that I know what kind of machine learning model to use.

I start by creating a histogram of `fare_amount`:

In [None]:
# Plot a histogram
tax_train.fare_amount.hist(bins=30, alpha=0.5)
plt.show()

From the histogram I can see that `fare_amount` is a continuous variable, so I am dealing with a regression problem.

In [None]:
from sklearn.linear_model import LinearRegression


# Creating a LinearRegression object
lr = LinearRegression() 


I take some features from the training set and fit a linear regression:

In [None]:
# Fit the model on the train data

lr.fit(X=tax_train[['pickup_longitude',  'pickup_latitude',  'dropoff_longitude',  'dropoff_latitude',  'passenger_count']],
        y=tax_train['fare_amount'])
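
Since the goal is stated in terms of RMSE, a fitted model needs an RMSE check. Below is a minimal sketch of the metric in plain NumPy; the function name `rmse` and the example numbers are assumptions for illustration, not results from this notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up fares and predictions, for illustration only
actual = [4.5, 16.9, 5.7, 7.7]
predicted = [6.0, 14.0, 6.5, 8.0]
print(rmse(actual, predicted))
```

In the notebook, this could be called as `rmse(tax_train['fare_amount'], lr.predict(X_train))`, where `X_train` stands for the same feature columns passed to `lr.fit`. An RMSE computed on the training rows is optimistic, so a held-out split would give a fairer estimate.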