# New York City Taxi Fare Prediction

We are tasked with predicting the fare amount (inclusive of tolls) for a taxi ride in New York City given the pickup and dropoff locations. The evaluation metric for this competition is the root mean squared error (RMSE). See the [Kaggle Competition](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/overview/description) for details. Note that the competition suggests using TensorFlow, but we will use scikit-learn regression models to predict the fare amount.
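For reference, this is the standard definition of RMSE, where $n$ is the number of rides, $y_i$ is the true fare, and $\hat{y}_i$ is the predicted fare (the symbols are our own notation, not taken from the competition page):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
$$

Because the squared errors are averaged before the square root is taken, RMSE is expressed in the same unit as the label (dollars), and lower is better.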

### Features
- `key` - Unique string identifying each row in both the training and test sets. Composed of `pickup_datetime` plus a unique integer. Required for submission but not for training.
- `pickup_datetime` - timestamp value indicating when the taxi ride started.
- `pickup_longitude` - float for longitude coordinate of where the taxi ride started.
- `pickup_latitude` - float for latitude coordinate of where the taxi ride started.
- `dropoff_longitude` - float for longitude coordinate of where the taxi ride ended.
- `dropoff_latitude` - float for latitude coordinate of where the taxi ride ended.
- `passenger_count` - integer indicating the number of passengers in the taxi ride.

### Label
- `fare_amount` - a float dollar amount of the cost of the taxi ride. This is the value we want to predict.
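Later cells rely on three columns that are not in the raw feature list: `year`, `longitude_distance`, and `latitude_distance`. Their derivation is not shown in this section. The block below is a minimal sketch of one way to build them, assuming `year` is taken from `pickup_datetime` and the distances are absolute coordinate differences (which matches the sample output shown further down). The frame `raw` and the timestamp strings are illustrative stand-ins, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical two-row stand-in for the raw data; the timestamp format
# ("YYYY-MM-DD HH:MM:SS UTC") is an assumption about the CSV.
raw = pd.DataFrame({
    "pickup_datetime": ["2009-06-15 17:26:21 UTC", "2010-01-05 16:52:16 UTC"],
    "pickup_longitude": [-73.844311, -74.016048],
    "pickup_latitude": [40.721319, 40.711303],
    "dropoff_longitude": [-73.841610, -73.979268],
    "dropoff_latitude": [40.712278, 40.782004],
})

# Parse the timestamp with an explicit format and keep only the year.
raw["pickup_datetime"] = pd.to_datetime(
    raw["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC", utc=True
)
raw["year"] = raw["pickup_datetime"].dt.year

# Absolute coordinate differences between pickup and dropoff (in degrees).
raw["longitude_distance"] = (raw["pickup_longitude"] - raw["dropoff_longitude"]).abs()
raw["latitude_distance"] = (raw["pickup_latitude"] - raw["dropoff_latitude"]).abs()

print(raw[["year", "longitude_distance", "latitude_distance"]])
```

These differences are in degrees rather than kilometres, so they are only a rough proxy for trip length.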

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import Lasso, LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

### Encoding Data
We want to one-hot encode the categorical columns before we normalize the data. We will create separate indicator columns for each year and each passenger count.

In [25]:
X = df.drop(['key', 'fare_amount', 'pickup_datetime'], axis=1) # Get numerical data only (input for our model)
y = df['fare_amount'] # labels

In [26]:
year_encoded = pd.get_dummies(X['year'])
year_encoded.columns = ['year_' + str(col) for col in year_encoded.columns]
X = pd.concat([X, year_encoded], axis=1).drop('year', axis=1)

passenger_encoded = pd.get_dummies(X['passenger_count'])
passenger_encoded.columns = ['passengers_' + str(col) for col in passenger_encoded.columns]
X = pd.concat([X, passenger_encoded], axis=1).drop('passenger_count', axis=1)

X.head()

| | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | longitude_distance | latitude_distance | year_2009 | year_2010 | year_2011 | year_2012 | year_2013 | year_2014 | year_2015 | passengers_0 | passengers_1 | passengers_2 | passengers_3 | passengers_4 | passengers_5 | passengers_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -73.844311 | 40.721319 | -73.84161 | 40.712278 | 0.002701 | 0.009041 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 0.03678 | 0.070701 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | -73.982738 | 40.76127 | -73.991242 | 40.750562 | 0.008504 | 0.010708 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | -73.98713 | 40.733143 | -73.991567 | 40.758092 | 0.004437 | 0.024949 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 0.01144 | 0.015754 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
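The two `get_dummies` steps above can also be written as a single call. The sketch below is an equivalent formulation, not the notebook's own code. It uses the `columns` and `prefix` parameters of `pd.get_dummies` on a small made-up frame (`X_demo` is a hypothetical stand-in for `X`).

```python
import pandas as pd

# Hypothetical stand-in for X with one numeric and two categorical columns.
X_demo = pd.DataFrame({
    "longitude_distance": [0.002701, 0.036780, 0.008504],
    "year": [2009, 2010, 2011],
    "passenger_count": [1, 1, 2],
})

# Encode both columns at once; the originals are dropped automatically and
# the new columns are named "<prefix>_<value>".
encoded = pd.get_dummies(
    X_demo,
    columns=["year", "passenger_count"],
    prefix=["year", "passengers"],
    dtype=int,  # 0/1 integers rather than booleans
)

print(encoded.columns.tolist())
```

Each row has exactly one `1` among the `year_*` columns and one among the `passengers_*` columns, so each group is perfectly collinear with an intercept. For an unregularized linear model, passing `drop_first=True` is one common way to avoid that redundancy.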


# Initial Modeling

In [38]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
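With a held-out validation set in place, a baseline model can be scored with the competition metric. The block below is a minimal sketch on synthetic data. `X_demo`, `y_demo`, and the linear relationship used to generate them are invented for illustration and say nothing about the real fares. It fits a `LinearRegression` on an 80/20 split and reports the validation RMSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two distance-like features and a noisy linear "fare".
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 0.1, size=(500, 2))
y_demo = 2.5 + 100 * X_demo.sum(axis=1) + rng.normal(0, 1.0, size=500)

# Same split settings as above: 20% of rows held out for validation.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit on the training rows only, then score on the held-out rows.
model = LinearRegression().fit(X_tr, y_tr)
rmse = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))

print(X_tr.shape, X_va.shape)
print(f"validation RMSE: {rmse:.3f}")
```

Taking the square root of `mean_squared_error` keeps the sketch independent of the scikit-learn version, since the helpers that return RMSE directly have changed between releases.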