<h1><center>It’s Time to Make Some Crazy Money!</center></h1>
<h1><center>Predicting a NYC Taxi Trip Duration</center></h1>
*****

## Table of contents
1. [Introduction](#introduction)
    1. [Loading libraries](#libraries)
    2. [Acquiring data](#acquire)
    3. [Data content](#content)
    4. [Missing values](#missing)
    5. [Dummy variables](#dummies)
    6. [Changing data types](#datatypes)
    7. [Adding columns](#columns)
2. [Exploratory Data Analysis](#EDA)
    1. [Feature visualization](#features)
    2. [Relationships](#related)
    3. [GPS Precision](#gps)
    4. [Repeated Trips](#repeated)
    5. [Pickup vs Dropoff activity](#activity)
    6. [Rush Hour](#rush)
3. [Save Data](#save)

## 1. Introduction <a name="introduction"></a>
***

With an estimated 2017 [population of 8,622,698](https://en.wikipedia.org/wiki/New_York_City) distributed over a land area of about 302.6 square miles (784 $km^2$), New York City is the most densely populated major city in the United States. As of 2016, the city's transportation infrastructure encompasses more than [13,000 yellow taxicabs](https://ny.curbed.com/2017/1/17/14296892/yellow-taxi-nyc-uber-lyft-via-numbers) and more than **60,000 black cars** (for-hire vehicles), of which more than **46,000** are connected with Uber, so traffic is unavoidable.

If a dispatcher knew approximately when each taxi in the fleet would finish its current ride, it would be easier to decide which driver to assign to each pickup request and to maximize revenue. Predicting how long a driver will have the taxi occupied is therefore important for improving the efficiency of dispatching systems.

Big companies like Uber, Lyft, Taxify, Didi Chuxing, and other peer-to-peer ride sharing companies can use such a model to predict the duration of a taxi trip in major cities. They can make better estimates of how much to charge per ride as well as better predictions of the duration of the ride. From the customer's point of view, the price paid per ride could decrease as drivers' travel time is optimized, which would improve customer satisfaction.

The data with the total ride duration of taxi trips in New York City has been acquired from [Kaggle](https://www.kaggle.com/c/nyc-taxi-trip-duration/data). The dataset includes pickup time, geo-coordinates, number of passengers, and several other variables. It contains data collected for the year 2016 from two different vendors. The evaluation metric for this project is [Root Mean Squared Logarithmic Error](https://www.kaggle.com/wiki/RootMeanSquaredLogarithmicError). The RMSLE is calculated as:

$$ \epsilon = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(\log(p_i+1) - \log(a_i+1)\right)^2 } $$

Where:

- *ϵ* is the RMSLE value (score)
- *n* is the total number of observations in the (public/private) data set
- $p_i$ is the predicted trip duration for observation *i*
- $a_i$ is the actual trip duration for observation *i*
- *log(x)* is the natural logarithm of *x*
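The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not the competition's official scoring code; the function name `rmsle` and the toy values are assumptions made for this example.

```python
import numpy as np

def rmsle(predicted, actual):
    """Root Mean Squared Logarithmic Error for non-negative durations."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # np.log1p(x) computes log(x + 1), matching the formula above
    return np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2))

# Toy durations in seconds (illustrative only, not taken from the data set)
print(rmsle([600, 900, 1200], [660, 840, 1500]))
```

Because the error is computed on logarithms, being off by a factor of two costs the same for a short trip as for a long one.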

Since the goal is to predict the duration of a taxi trip with labeled data, a supervised regression algorithm is a good fit for the project. This notebook covers Data Wrangling, Exploratory Data Analysis, Feature Engineering, Data Cleaning, and Modeling.


### Loading libraries <a name="libraries"></a>
***
These are some of the libraries used for data wrangling and EDA.

In [None]:
repeated_trips = merged[(merged['pickup_latitude'] != merged['dropoff_latitude']) 
       & (merged['pickup_longitude'] != merged['dropoff_longitude'])]
repeated_trips

In [None]:
s = []
for index, row in repeated_trips[['pickup_longitude','pickup_latitude', 'dropoff_latitude', 'dropoff_longitude']].iterrows():
    s.append([[row['pickup_latitude'], row['pickup_longitude']] , [row['dropoff_latitude'], row['dropoff_longitude']]])
    
folium_map = folium.Map(location=[40.738, -73.98],
                        zoom_start=12,
                        tiles="CartoDB dark_matter")
for i in range(len(s)):
    folium.PolyLine(s[i],color='yellow',weight=1.0,opacity=0.7).add_to(folium_map)

folium_map

Only 6 trips are repeated routes with a travel distance greater than zero. Each of these 6 routes appears twice, and both trips on each route have the same *trip_duration*, which is suspicious.

### Pickup vs Dropoff Activity<a name='activity'></a>
***
Previously, the data showed that most of the activity is in Manhattan, with some at both airports. Now let's plot the activity to get an idea of the flow of trips.

In [None]:
new_style = {'grid': False}  
matplotlib.rc('axes', **new_style)  
# from matplotlib import rcParams  
rcParams['figure.figsize'] = (17.5, 17) #Size of figure  
rcParams['figure.dpi'] = 250

plt.style.use(['dark_background'])
train.plot(kind='scatter', x='pickup_longitude', y='pickup_latitude',color='yellow',xlim=(-74.06,-73.77),ylim=(40.61, 40.91),s=.02,alpha=.6)
plt.title('Pickup Activity')
plt.xlabel('longitude')
plt.ylabel('latitude')

As seen in the pickup graph, most of the activity is in Manhattan, with some activity at JFK and LaGuardia airports, as expected.

In [None]:
train.plot(kind='scatter', x='dropoff_longitude', y='dropoff_latitude',color='yellow',xlim=(-74.06,-73.77),ylim=(40.61, 40.91),s=.02,alpha=.6)
plt.title('Dropoff Activity')
plt.xlabel('longitude')
plt.ylabel('latitude')

The dropoff graph shows that most of the dropoff activity is also in Manhattan and at both airports. However, there is more activity in the surrounding boroughs such as Queens, Brooklyn and the Bronx. Both maps make it clear that Manhattan has the most activity, which was expected since most of the trips last between 3 and 50 minutes.

### Rush Hour<a name='rush'></a>
***
Traffic can easily double the *trip_duration* in NYC for similar routes. Let's look at which hours of the week you would expect to pay more for a trip.

In [None]:
pylab.rcParams.update(params)
plt.style.use(['seaborn-white'])

train['pickup_datetime_day_of_week'] = train['pickup_datetime'].dt.day_name()  # weekday_name was removed in pandas 1.0
days = train['pickup_datetime_day_of_week'].unique()

for day in days:
    r = train[train['pickup_datetime_day_of_week'] == day]['pickup_datetime'].dt.hour.value_counts().sort_index()
    r.plot(legend=True,label=day)

plt.title('Number of pickups per hour')
plt.xlabel('Hours')
plt.ylabel('Count')
plt.margins(0.04)

- There is a really high demand after midnight on Friday, Saturday and Sunday. 
- For the 7 days of the week the demand is really low around 5 AM.
- Rush hour is between 7 and 10 AM from Monday to Friday.
- Taxi demand stabilizes between 10 AM and 4 PM on every day of the week, while remaining fairly high.
- After 4 PM there is another rush hour for every day except Sunday.
- At night, demand starts to slow down except on Fridays and Saturdays.

Interestingly, some people consider Monday the day with the most traffic, but as the graph shows, demand on Mondays is lower than on the other 4 weekdays.

As pointed out, trip durations are expected to increase between 7 and 10 AM and between 5 and 10 PM, since these are the busiest times in NYC, and to be substantially shorter around 5 AM. For trips with the same distance, we can check whether this holds true.
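One way to run that comparison is to restrict the data to trips of similar length and look at the median duration per pickup hour. The sketch below uses a made-up toy frame; the `distance_km` column is hypothetical and would have to be computed from the coordinates first.

```python
import pandas as pd

# Toy trips with made-up values; 'distance_km' is a hypothetical column
# that is not part of the original data set
toy = pd.DataFrame({
    'pickup_hour':   [5, 5, 8, 8, 18, 18],
    'distance_km':   [3.0, 3.1, 3.0, 2.9, 3.1, 3.0],
    'trip_duration': [420, 450, 900, 870, 960, 1020],
})

# Keep trips of similar length, then compare the median duration per hour
similar = toy[toy['distance_km'].between(2.5, 3.5)]
median_by_hour = similar.groupby('pickup_hour')['trip_duration'].median()
print(median_by_hour)
```

The median is used here rather than the mean because the data contains extreme durations that would dominate an average.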

## 3. Save Data <a name="save"></a>
***
Some of the data types were changed, and columns were added to the data. So let's save the data into a new csv file for future reference.
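A csv file does not store data types, so the datetime columns have to be parsed again on reload. The sketch below shows the round trip on a toy frame; the in-memory buffer stands in for a real file, and the file name mentioned in the comment is hypothetical.

```python
import io
import pandas as pd

# Toy frame with one datetime and one float column (values are illustrative)
toy = pd.DataFrame({
    'pickup_datetime': pd.to_datetime(['2016-03-14 17:24:55',
                                       '2016-06-12 00:43:35']),
    'trip_duration': [480.0, 720.0],
})

# An in-memory buffer stands in for a file such as 'train_clean.csv'
# (hypothetical name)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without parse_dates the datetime column would come back as plain strings
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=['pickup_datetime'])
print(restored.dtypes)
```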

The store and fwd flag shows that most taxis sent the trip information immediately (0) and only a few had an issue connecting to the server and had to store it (1). Also, vendor 2 had more trips in the 6-month period in 2016.

### Relationships <a name='related'></a>
***
Now let's look at some of the relationships between the columns.

In [None]:
sns.heatmap(train.select_dtypes(include='number').corr(),annot=True)  # correlate numeric columns only

The response variable *trip_duration* has no linear relationship with any of the features. However, we see a positive linear relationship among the latitudes and longitudes of the trips. There is also a positive correlation between vendor_id and passenger_count.

In [None]:
coordinates = train[(train['pickup_latitude'] >= 40) & (train['pickup_latitude'] <= 42) &
             (train['pickup_longitude'] >= -75) & (train['pickup_longitude'] <= -73) &
             (train['dropoff_latitude'] >= 40) & (train['dropoff_latitude'] <= 42) &
             (train['dropoff_longitude'] >= -75) & (train['dropoff_longitude'] <= -73)]
sns.pairplot(coordinates, vars=['pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude'])

### GPS Precision<a name='gps'></a>
***
If you have visited a dense urban area like Manhattan, you are probably aware that GPS precision is not always the best. Precision in this case is measured by the number of decimal places in each coordinate. With this in mind, let's plot the number of decimals for each latitude and longitude column to check the GPS precision of the trips.

In [None]:
plt.figure(figsize=(18,18))
cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']
for index, each in enumerate(cols):
    s = train[each].astype(str)
    dec = s.apply(lambda x: abs(Decimal(x).as_tuple().exponent))
    plt.subplot(len(cols)//2,2,index+1)  # integer division: subplot needs int grid sizes
    graph = sns.countplot(x=dec)
    graph.set_yscale('log')
    plt.title(each)
    plt.xlabel('number of decimals')

The GPS precision of the taxis varies from 1 up to 15 decimal places. According to the [Degree Precision vs. Length Table](https://en.wikipedia.org/wiki/Decimal_degrees) from Wikipedia, a coordinate with 15 decimals would resolve to about 0.1 nanometer (the size of an atom!), which is unrealistic for a taxi. It seems reasonable to keep coordinates with five or more decimal places, since the fifth decimal place corresponds to about 1.1 m. According to Wikipedia, that is enough to distinguish trees from each other, which is good precision for a big city.
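The distances quoted above follow from simple arithmetic: one degree spans roughly 111.32 km at the equator, and each extra decimal place divides that by ten. The sketch below reproduces the calculation; the constant and function name are assumptions for illustration, and the figure applies to latitude (a degree of longitude is shorter away from the equator).

```python
# One degree spans roughly 111.32 km at the equator (an approximation)
METERS_PER_DEGREE = 111_320.0

def decimal_place_resolution_m(decimals):
    """Approximate ground distance represented by the last decimal place."""
    return METERS_PER_DEGREE * 10.0 ** (-decimals)

for d in (1, 4, 5, 15):
    print('{} decimals: {:.3g} m'.format(d, decimal_place_resolution_m(d)))
```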

Also, one would expect to see similar graphs for the 4 different features. Since this is not the case, a single trip can have different precision at pickup and at dropoff. Let's check which trips have low precision in at least one of the 4 features.

In [None]:
df = pd.DataFrame(train)

cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']
for index, each in enumerate(cols):
    s = df[each].astype(str)
    df[each + '_decimal'] = s.apply(lambda x: abs(Decimal(x).as_tuple().exponent))
    
coor = df[(df['pickup_latitude_decimal'] <= 4) | (df['pickup_longitude_decimal'] <= 4) | 
          (df['dropoff_latitude_decimal'] <= 4) | (df['dropoff_longitude_decimal'] <= 4)]
coor['trip_duration'].count()

There are 642 trips in which at least one latitude or longitude has fewer than 5 decimal places. Let's see which coordinate values are the most common among them.

In [None]:
cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']
plt.figure(figsize=(18,18))
for index, each in enumerate(cols):
    plt.subplot(len(cols)//2,2,index+1)  # integer division: subplot needs int grid sizes
    coor[coor[each + '_decimal'] <= 4][each].value_counts().plot(kind='bar')
    plt.title('Top ' + each + ' with low precision')
    plt.xlabel('Coordinates')
    plt.ylabel('Count')
    plt.xticks(rotation=0)

The graphs show that there are common coordinate values for pickup and dropoff where the precision is lower. They might correspond to a specific location with coordinates $[40.75, -74.0]$. Let's check if this is true.

### Acquiring data <a name="acquire"></a>
***
The data was acquired from a Kaggle competition called [New York City Taxi Trip Duration](https://www.kaggle.com/c/nyc-taxi-trip-duration).

In [None]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

### Data content <a name="content"></a>
***
Let's look at the content of the data such as shape, columns, and summary statistics. 

In [None]:
train.head()

In [None]:
train.shape

In [None]:
train.columns

In [None]:
train.info()

The data consists of a large number of rows, with a memory footprint of 122.4+ MB. To make it lighter, let's convert the categorical column 'store_and_fwd_flag' to the category data type and see how much memory can be saved. We will also convert the pickup and dropoff datetime columns to datetime objects and re-format them into a more useful shape. Converting all integers to floats will also make the data easier to work with. Finally, the maximum value of trip duration seems unrealistic for a taxi trip, which causes the standard deviation to be very high.
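The memory effect of a category conversion can be measured with `memory_usage(deep=True)`. The sketch below uses a toy frame with made-up values in place of the real data; only the column name is taken from the data set.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({'store_and_fwd_flag': ['N', 'Y', 'N', 'N'] * 25_000})

# deep=True counts the Python string objects, not just the pointers to them
before = toy['store_and_fwd_flag'].memory_usage(deep=True)
after = toy['store_and_fwd_flag'].astype('category').memory_usage(deep=True)

print('original dtype: {:,} bytes'.format(before))
print('category dtype: {:,} bytes'.format(after))
```

A category column stores each distinct value once plus a small integer code per row, which is why the saving grows with the number of rows.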

### Missing values <a name='missing'></a>
***
Let's calculate the percentage of missing values per column. This will show whether there is a need to drop columns or rows.

In [None]:
train.isnull().sum()/len(train)*100

There are no missing values so we can proceed using all of the data for the time being.

### Dummy Variables <a name='dummies'></a>
***
Let's make store_and_fwd_flag into a dummy variable.

In [None]:
train = pd.get_dummies(train,columns=['store_and_fwd_flag'],drop_first=True)

### Changing data types <a name="datatypes"></a>
***
As mentioned previously, let's convert the date strings to datetimes and the integers to floats.

In [None]:
# Convert to datetime object
train['pickup_datetime'] = pd.to_datetime(train['pickup_datetime'], format='%Y-%m-%d %H:%M:%S') 
train['dropoff_datetime'] = pd.to_datetime(train['dropoff_datetime'], format='%Y-%m-%d %H:%M:%S')

# Convert the remaining non-object, non-float, non-datetime columns to float
int_feat = train.dtypes[(train.dtypes != 'object') & (train.dtypes != 'float64') & (train.dtypes != 'datetime64[ns]')].index
train[int_feat] = train[int_feat].astype(float)

### Adding columns <a name='columns'></a>
***
To be able to use the date of every trip, it is necessary to split the *pickup_datetime* column into month, day, hour, minute, and second.

In [None]:
# Create Float columns to split datetime objects
train['pickup_month'] = train['pickup_datetime'].dt.month.astype(float)
train['pickup_day'] = train['pickup_datetime'].dt.day.astype(float)
train['pickup_hour'] = train['pickup_datetime'].dt.hour.astype(float)
train['pickup_minute'] = train['pickup_datetime'].dt.minute.astype(float)
train['pickup_second'] = train['pickup_datetime'].dt.second.astype(float)

train.info()

## 2. Exploratory Data Analysis <a name='EDA'></a>
***
Let's do EDA to get a visual understanding of the data. First, let's explore trip_duration, which is the target column.

### Feature Visualization <a name='features'></a>
***
Let's check each feature of the training data and analyze it.

In [None]:
plt.hist(np.log(train['trip_duration']).values,bins=300)
plt.title('trip_duration distribution')
plt.xlabel('log(trip_duration)')

The histogram shows that most taxi trips have a duration of $e^{5}$ to $e^{8}$ seconds, which is roughly 3 to 50 minutes. The graph also shows some unusual values around $e^{11}$ seconds, which is almost 1000 minutes, and $e^{2}$ seconds, which is just a few seconds.
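Since the histogram's x-axis is the natural log of the duration in seconds, the values quoted above can be checked by exponentiating. A minimal sketch of that conversion:

```python
import numpy as np

# Convert points on the log(trip_duration) axis back to seconds and minutes
for log_seconds in (2, 5, 8, 11):
    seconds = np.exp(log_seconds)
    print('e^{} = {:,.0f} s = {:.1f} min'.format(log_seconds, seconds, seconds / 60))
```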

In [None]:
train.sort_values(['trip_duration'],ascending=[False]).set_index('trip_duration').head(5)

The table above shows that some trips lasted more than a few days, and that someone supposedly spent more than one month in a cab, which is impossible.

Now let's take a look at the latitude and longitude of 99.9% of the trips to exclude some outliers outside of this range.

In [None]:
cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']
percentile = [0.05,99.95] ## USING 99.9% of the data
for index, name in enumerate(cols):
    plt.subplot(len(cols)//2,2,index+1)  # integer division: subplot needs int grid sizes
    train[(train[name] >= np.percentile(train[name],percentile[0])) & (train[name] <= np.percentile(train[name],percentile[1]))][name].plot(kind='hist',bins=400)
    plt.xlabel(name)

The latitude and longitude features give important information:

- Most pickup_latitude and dropoff_latitude values lie between 40.70 and 40.80

- Most pickup_longitude and dropoff_longitude values lie between -74.05 and -73.95

- There is a smaller cluster of pickup_latitude and dropoff_latitude values around 40.65

- There is a smaller cluster of pickup_longitude and dropoff_longitude values around -73.80

- There is another cluster of pickup_longitude and dropoff_longitude values around -73.88

Raw coordinates are hard to interpret on their own. To get a better sense of what is happening, let's plot these points on a map.

In [None]:
coor = [[40.7, -74.05], [40.7,-73.95]],[[40.7,-73.95],[40.8,-73.95]], [[40.8,-74.05],[40.8,-73.95]], [[40.7,-74.05],[40.8,-74.05]]
coor2 = [[40.65,-73.80],[40.65,-73.80]]
coor3 = [[40.77,-73.88],[40.77,-73.88]]

folium_map = folium.Map(location=[40.738, -73.98],
                        zoom_start=10,
                        tiles='Stamen Terrain')

folium.PolyLine(coor,color='red',weight=4.5,opacity=0.7).add_to(folium_map)
folium.PolyLine(coor2,color='red',weight=15,opacity=0.7).add_to(folium_map)
folium.PolyLine(coor3,color='red',weight=15,opacity=0.7).add_to(folium_map)
folium_map

The map shows that most of the activity happens in Manhattan. There is also some activity at the John F. Kennedy and LaGuardia airports. After exploring the duration and location of trips, let's focus on the distribution of trips by time.

In [None]:
pickup = df[(df['pickup_latitude'] == 40.75) & (df['pickup_longitude'] == -74.0)]['trip_duration'].count()
dropoff = df[(df['dropoff_latitude'] == 40.75) & (df['dropoff_longitude'] == -74.0)]['trip_duration'].count()

print('Total number of trips with pickup_latitude = 40.75 and pickup_longitude = -74.0: {}'.format(pickup))
print('Total number of trips with dropoff_latitude = 40.75 and dropoff_longitude = -74.0: {}'.format(dropoff))

This means it is not a specific location. However, the same latitude and longitude values are repeated across several trips. To get a better idea of these trips, let's plot them on a map.

In [None]:
s = []
for index, row in coor[['pickup_longitude','pickup_latitude', 'dropoff_latitude', 'dropoff_longitude']].iterrows():
    s.append([[row['pickup_latitude'], row['pickup_longitude']] , [row['dropoff_latitude'], row['dropoff_longitude']]])
    
folium_map = folium.Map(location=[40.738, -73.98],
                        zoom_start=13,
                        tiles="CartoDB dark_matter")
for i in range(len(s)):
    folium.PolyLine(s[i],color='yellow',weight=1.0,opacity=0.7).add_to(folium_map)

folium_map

These values might be an issue because most of these trips have a pickup or dropoff in Manhattan, and imprecise coordinates can place you on a different block. This could easily add seconds, or even minutes, to the trip. These trips are candidates to drop due to their lower precision.

### Repeated trips<a name='repeated'></a>
***
Now that we have an idea of the right precision for the coordinates, let's check whether there are similar trips with high precision (at least 5 decimals).

In [None]:
cols = ['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude']

above_five_decimals = df[(df['pickup_latitude_decimal'] >= 5) & (df['pickup_longitude_decimal'] >= 5) & 
          (df['dropoff_latitude_decimal'] >= 5) & (df['dropoff_longitude_decimal'] >= 5)]
five_decimals = above_five_decimals[['pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude','trip_duration']].round(5)
a = five_decimals.groupby(cols).size().reset_index().rename(columns={0:'count'}).sort_values(by='count',ascending=False)
s = five_decimals.groupby(cols).trip_duration.agg(['min','max','mean']).reset_index()

merged = pd.merge(a,s,on=cols)
merged = merged[merged['count']>=2]
merged.head()

Looking at similar trips rounded to 5 decimal places, the data shows that at least the top 5 most popular trips have the same pickup and dropoff location, yet their *trip_duration* varies, which appears to be an error in the data. Let's find out whether all of the repeated trips have a traveled distance of essentially 0.
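A traveled distance can be approximated from the coordinates alone with the haversine (great-circle) formula. The sketch below is one possible way to compute it, not necessarily the method used elsewhere in this notebook; the function name and the Earth radius constant are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Identical pickup and dropoff coordinates give a distance of zero
print(haversine_km(40.75, -74.0, 40.75, -74.0))
```

Because the function is built from NumPy operations, it also accepts whole DataFrame columns and returns one distance per row.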

In [None]:
train.set_index('pickup_datetime').resample('D').count()['vendor_id'].plot()
plt.title('Pickup Datetime Plot by Day')
plt.xlabel('Month')
plt.ylabel('Count')

The line plot shows that the number of daily pickups between January 2016 and June 2016 was fairly homogeneous. However, there is a drop near the end of January, which a little research explains: from January 22 to 24, 2016, a blizzard hit the East Coast. According to [weather.gov](https://www.weather.gov/okx/Blizzard_Jan2016), Central Park, NY received 27.5" of snow, making it the largest snowstorm since records began in 1869.

In [None]:
train.set_index('dropoff_datetime').resample('D').count()['vendor_id'].plot()
plt.title('Dropoff Datetime Plot by Day')
plt.xlabel('Month')
plt.ylabel('Count')

The dropoff graph looks fairly similar to the pickup graph. There is a drop at the beginning of July, but that is because it is the cutoff date for the data.

Now let's explore the other features that are left.

In [None]:
plt.subplot(1,2,1)
w = train['pickup_datetime'].dt.weekday.value_counts().sort_index()
ax = w.plot(marker='.',linestyle='none',markersize=25,visible=True)
ax.set_xticks(range(7))  # weekday index: Monday=0 ... Sunday=6
ax.set_xticklabels(['Mon','Tue','Wed','Thur','Fri','Sat','Sun'])
plt.title('Number of pickups per day')
plt.xlabel('Days of the week')
plt.ylabel('Count')
plt.margins(0.04)

plt.subplot(1,2,2)
p = train['passenger_count'].value_counts().sort_index().plot(kind='bar')
plt.title('Passenger Count')
plt.xlabel('Number of passengers per ride')
plt.ylabel('Count')
plt.ticklabel_format(axis='y',style='sci',scilimits=(0,0))
labels = train['passenger_count'].value_counts().sort_index().values
rects = p.patches
for rect, label in zip(rects, labels):
    height = rect.get_height()
    p.text(rect.get_x() + rect.get_width()/2, height + 25, label, ha='center', va='bottom')
    

The busiest day of the week for taxi drivers in NYC is Friday and the slowest day is Monday. Also, the passenger count graph shows that most of the taxi trips carry only one passenger. It is important to point out that the data has 60 trips with 0 passengers, which seems to be an error in the data.

In [None]:
h = train['pickup_datetime'].dt.hour.value_counts().sort_index()
ax2 = h.plot(marker='.',linestyle='none',markersize=15,visible=True)
plt.title('Number of pickups per hour')
plt.xlabel('Hours')
plt.ylabel('Count')
plt.margins(0.04)

The number of pickups per hour is higher in the evening, with a peak between 6 and 7 PM, when most people get out of work. At around 5 in the morning, taxi demand is low. Demand then rises throughout the day.

In [None]:
plt.subplot(1,2,1)
sns.countplot(x='store_and_fwd_flag_Y',data=train)
plt.title('Store and Fwd Flag')
plt.xlabel('Store and Fwd Flag')
plt.ylabel('Count')
plt.ticklabel_format(axis='y',style='sci',scilimits=(0,0))

plt.subplot(1,2,2)
sns.countplot(x='vendor_id',data=train)
plt.title('Vendor ID')
plt.xlabel('ID number')
plt.ylabel('Count')
plt.ticklabel_format(axis='y',style='sci',scilimits=(0,0))

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sns
import folium

import operator
from decimal import Decimal
import timeit

from geopy.distance import geodesic as vincenty  # vincenty was removed in geopy 2.0; geodesic is its replacement

%matplotlib inline
%pylab inline

params = {'legend.fontsize': 'x-large',
          'figure.figsize': (17.5, 10),
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}

pylab.rcParams.update(params)
plt.style.use(['seaborn-white'])