# Week 1

You are the lead data scientist on a new project focused on taxi demand. You need to build a model that predicts demand around Manhattan and the airports. Other data science team members will first review your work to check that the model is accurate and likely to generalize. You’ll then present your findings to Mr. Walker and the business team. There is some skepticism that machine learning is the right approach, so you’ll need a compelling final report detailing your findings to get the model implemented.

Mr. Walker has asked that your new analysis do the following:

- Focus on four custom regions in Manhattan and the airports (see map below).
- Predict whether demand for taxis is “low,” “medium,” or “high.”
- Explore revenue (Fare) and cost (Duration and Distance) for each region to see whether certain regions of the city are more profitable.

## Creating the Taxi Regions

### Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import datetime
import scipy.stats

%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
#sns.set() resets the style to its default, so pass the style and font scale in one call
sns.set(style='dark', font_scale=1.2)

import warnings
warnings.filterwarnings('ignore')

#import feature_engine.missing_data_imputers as mdi
#from feature_engine.outlier_removers import Winsorizer
#from feature_engine import categorical_encoders as ce

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)

np.random.seed(0)
np.set_printoptions(suppress=True)

Autosaving every 60 seconds


In [2]:
df = pd.read_csv("taxitrain.csv")

In [3]:
df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,15/1/2015 14:00,15/1/2015 14:13,1,3.00,-73.964,Upper East Side,1,N,-73.965775,40.80466843,2,12.0,0.0,0.5,0.00,0.0,0.3,12.8
1,2,15/1/2015 14:00,15/1/2015 14:05,1,0.67,-73.971,Lower Manhattan,1,N,-73.970169,40.78912354,1,5.0,0.0,0.5,1.00,0.0,0.3,6.8
2,2,7/1/2015 14:58,7/1/2015 15:06,1,0.98,-73.949,40.778,1,N,-73.955284,40.78686905,1,7.0,0.0,0.5,1.40,0.0,0.3,9.2
3,2,7/1/2015 14:58,7/1/2015 15:12,3,4.39,-73.989,40.723,1,N,-73.987221,40.69440842,2,15.5,0.0,0.5,0.00,0.0,0.3,16.3
4,1,20/1/2015 23:08,20/1/2015 23:23,1,3.90,-73.975,40.760,1,N,-74.008461,40.71146774,1,15.0,0.5,0.5,3.25,0.0,0.3,19.55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
254975,1,26/1/2015 10:06,26/1/2015 10:09,1,0.60,-74.007,40.717,1,N,-74.004417,40.72461319,2,4.0,0.0,0.5,0.00,0.0,0.3,4.8
254976,2,10/1/2015 6:06,10/1/2015 6:10,6,1.10,-73.968,40.762,1,N,-73.955055,40.76589584,1,5.5,0.0,0.5,1.10,0.0,0.3,7.4
254977,2,10/1/2015 6:07,10/1/2015 6:26,6,7.62,0.000,0.000,1,N,0.000000,0,1,23.5,0.0,0.5,1.50,0.0,0.3,25.8
254978,1,10/1/2015 18:28,10/1/2015 18:40,3,0.80,-74.006,40.735,1,N,-74.001572,40.72763062,1,8.5,0.0,0.5,1.40,0.0,0.3,10.7
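The timestamps in the preview are day-first strings (for example `15/1/2015 14:00`), and the brief asks about trip duration, which is not a column in the file. A minimal sketch of parsing them and deriving a duration, using three rows copied from the preview; the column name `duration_min` is an assumption introduced here, not part of the dataset:

```python
import pandas as pd

# Three rows copied from the preview above (only the two timestamp columns).
sample = pd.DataFrame({
    "tpep_pickup_datetime": ["15/1/2015 14:00", "7/1/2015 14:58", "20/1/2015 23:08"],
    "tpep_dropoff_datetime": ["15/1/2015 14:13", "7/1/2015 15:06", "20/1/2015 23:23"],
})

# An explicit day-first format keeps 7/1/2015 from being read as July 1.
fmt = "%d/%m/%Y %H:%M"
pickup = pd.to_datetime(sample["tpep_pickup_datetime"], format=fmt)
dropoff = pd.to_datetime(sample["tpep_dropoff_datetime"], format=fmt)

# Hypothetical column name: trip duration in minutes.
sample["duration_min"] = (dropoff - pickup).dt.total_seconds() / 60
print(sample["duration_min"].tolist())  # [13.0, 8.0, 15.0]
```

The same two `pd.to_datetime` calls should carry over to the full `df`, assuming every row follows this format.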


### Taxi Regions

In [7]:
df["pickup_latitude"].value_counts()

Midtown              39799
Upper East Side      11576
LaGuardia Airport     7291
40.750                6013
40.764                5419
                     ...  
40.875                   1
40.617                   1
40.895                   1
40.588                   1
40.905                   1
Name: pickup_latitude, Length: 287, dtype: int64

In [8]:
df.groupby("pickup_latitude").count()

pickup_latitude,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0.000,4898,4898,4898,4898,4898,4898,4898,4898,4898,4898,4898,4898,4898,4898,4898,4898,4898,4898
24.762,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
37.508,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
40.531,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
40.567,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
JFK Airport,4543,4543,4543,4543,4543,4543,4543,4543,4543,4543,4543,4543,4543,4543,4543,4543,4543,4543
LaGuardia Airport,7291,7291,7291,7291,7291,7291,7291,7291,7291,7291,7291,7291,7291,7291,7291,7291,7291,7291
Lower Manhattan,5359,5359,5359,5359,5359,5359,5359,5359,5359,5359,5359,5359,5359,5359,5359,5359,5359,5359
Midtown,39799,39799,39799,39799,39799,39799,39799,39799,39799,39799,39799,39799,39799,39799,39799,39799,39799,39799
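The two outputs above show that `pickup_latitude` mixes region labels with raw latitudes, including 4898 rows at `0.000`. One way to separate the two kinds of value is sketched below on a handful of values taken from those outputs. It assumes the column holds strings, which is what a mixed column read from CSV normally produces:

```python
import pandas as pd

# Values taken from the outputs above: region labels and raw latitudes in one column.
lat = pd.Series(
    ["Upper East Side", "Lower Manhattan", "40.778", "40.723", "0.000", "JFK Airport"],
    name="pickup_latitude",
)

# Non-numeric entries become NaN, so NaN marks the rows that carry a region label.
# (Genuinely missing values would also be NaN and need separate handling.)
as_number = pd.to_numeric(lat, errors="coerce")
is_region = as_number.isna()

print(lat[is_region].tolist())       # the region labels
print(int((as_number == 0).sum()))   # rows with a zero latitude
```

Rows with a zero latitude are presumably missing coordinates rather than real pickups, so they are worth checking before any modelling.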


In [9]:
jfkairport = df[df["pickup_latitude"] == "JFK Airport"]

In [10]:
jfkairport.shape

(4543, 19)

In [11]:
jfkairport.to_csv("jfk.csv",index=False)

In [12]:
lgairport = df[df["pickup_latitude"] == "LaGuardia Airport"]

In [13]:
lgairport.shape

(7291, 19)

In [14]:
lgairport.to_csv("lga.csv",index=False)

In [15]:
lowerman = df[df["pickup_latitude"] == "Lower Manhattan"]

In [16]:
lowerman.shape

(5359, 19)

In [17]:
lowerman.to_csv("lowman.csv",index=False)

In [18]:
midtown = df[df["pickup_latitude"] == "Midtown"]

In [19]:
midtown.shape

(39799, 19)

In [20]:
midtown.to_csv("midtown.csv",index=False)

In [21]:
uppereast = df[df["pickup_latitude"] == "Upper East Side"]

In [22]:
uppereast.shape

(11576, 19)

In [23]:
uppereast.to_csv("uppereast.csv",index=False)

In [24]:
#no rows carry the "Upper West Side" label, so this selection comes back empty
upperwest = df[df["pickup_latitude"] == "Upper West Side"]

In [25]:
upperwest.shape

(0, 19)
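The filter, shape, and save cells above repeat the same three steps for each region. A loop over the labels is one possible way to express this, and it also flags a label such as "Upper West Side" that matches no rows. The sketch below runs on a toy frame with made-up fares; the `region_files` mapping is a hypothetical helper, and the "Upper West Side" file name is invented because the notebook never writes that file:

```python
import pandas as pd

# Toy frame standing in for df; labels come from the outputs above, fares are made up.
toy = pd.DataFrame({
    "pickup_latitude": ["JFK Airport", "Midtown", "Midtown", "40.750", "Upper East Side"],
    "fare_amount": [52.0, 7.0, 8.5, 5.0, 12.0],
})

# Hypothetical mapping from region label to output file name.
region_files = {
    "JFK Airport": "jfk.csv",
    "LaGuardia Airport": "lga.csv",
    "Lower Manhattan": "lowman.csv",
    "Midtown": "midtown.csv",
    "Upper East Side": "uppereast.csv",
    "Upper West Side": "upperwest.csv",  # invented name, not used in the notebook
}

regions = {}
for label, filename in region_files.items():
    subset = toy[toy["pickup_latitude"] == label]
    if subset.empty:
        # Report labels with no rows instead of silently keeping an empty frame.
        print(f"no rows for {label!r}")
        continue
    regions[label] = subset
    # subset.to_csv(filename, index=False)  # uncomment to write the file

print({label: len(frame) for label, frame in regions.items()})
```

On the toy frame only three labels have rows; on the full `df` the loop would be expected to report "Upper West Side" as empty, matching the `(0, 19)` shape above.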

In [26]:
lowerman["fare_amount"].median()

8.5

In [27]:
jfkairport["trip_distance"].median()

17.56

In [28]:
lgairport["trip_distance"].median()

1.53

In [29]:
lowerman["trip_distance"].median()

1.69

In [30]:
midtown["trip_distance"].median()

1.5

In [31]:
uppereast["trip_distance"].median()

2.3

In [32]:
#upperwest is empty, so its median is nan
upperwest["trip_distance"].median()

nan
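The per-region medians above can also be computed in a single `groupby` call, which puts distance and fare side by side for the profitability comparison. A sketch using nine rows copied from the merged preview later in this notebook (five JFK Airport rows and four Upper East Side rows), so the medians printed here describe only that small sample, not the full regions:

```python
import pandas as pd

# Rows copied from the df2 preview in this notebook (label, distance, fare only).
sample = pd.DataFrame({
    "pickup_latitude": ["JFK Airport"] * 5 + ["Upper East Side"] * 4,
    "trip_distance": [12.74, 17.50, 18.50, 14.80, 14.30, 1.69, 1.80, 1.54, 0.78],
    "fare_amount": [47.5, 52.0, 51.0, 41.0, 39.5, 8.0, 12.5, 7.5, 5.5],
})

# One row per region, one column per metric.
summary = sample.groupby("pickup_latitude")[["trip_distance", "fare_amount"]].median()
print(summary)
```

Applied to the full data, the same call on the rows that carry a region label should reproduce the individual medians shown above.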

In [33]:
len(jfkairport)+len(lgairport)+len(lowerman)+len(midtown)+len(uppereast)

68568

In [34]:
len(df)

254980

In [35]:
(len(jfkairport)+len(lgairport)+len(lowerman)+len(midtown)+len(uppereast))/(len(df)) * 100

26.89152090360028
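The percentage above can be checked directly from the row counts reported by the `.shape` outputs. The short sketch below redoes that arithmetic and also prints each region's share of the kept rows, which is useful to know before modelling because the regions are far from balanced:

```python
# Row counts reported by the .shape outputs above.
counts = {
    "JFK Airport": 4543,
    "LaGuardia Airport": 7291,
    "Lower Manhattan": 5359,
    "Midtown": 39799,
    "Upper East Side": 11576,
}
total_rows = 254980  # len(df)

kept = sum(counts.values())
share = kept / total_rows * 100
print(kept, round(share, 2))  # 68568 26.89

# Share of the kept rows contributed by each region.
for label, n in counts.items():
    print(f"{label}: {n / kept:.1%}")
```

Midtown alone accounts for more than half of the kept rows.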

### Merge all datasets

In [36]:
df2 = pd.concat([jfkairport,lgairport,lowerman,midtown,uppereast],axis=0)

In [37]:
df2

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
138,2,29/1/2015 20:30,29/1/2015 20:54,2,12.74,-73.790,JFK Airport,4,N,-73.655937,40.70500565,1,47.5,0.5,0.5,9.60,0.00,0.3,58.4
188,1,1/1/2015 17:21,1/1/2015 17:53,1,17.50,-73.790,JFK Airport,2,N,-73.992577,40.7533493,1,52.0,0.0,0.5,7.00,5.33,0.0,65.13
205,1,18/1/2015 1:08,18/1/2015 1:38,2,18.50,-73.782,JFK Airport,1,N,-73.885918,40.81613922,1,51.0,0.5,0.5,0.00,5.33,0.3,57.63
281,1,13/1/2015 16:08,13/1/2015 16:37,1,14.80,-73.788,JFK Airport,1,N,-73.940056,40.6059494,2,41.0,0.0,0.5,0.00,0.00,0.3,41.8
313,1,29/1/2015 0:42,29/1/2015 1:04,1,14.30,-73.779,JFK Airport,1,N,-73.908669,40.7676506,1,39.5,0.5,0.5,0.00,0.00,0.3,40.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
254881,2,10/1/2015 19:45,10/1/2015 19:53,1,1.69,-73.961,Upper East Side,1,N,-73.973541,40.78978729,2,8.0,0.0,0.5,0.00,0.00,0.3,8.8
254901,1,3/1/2015 16:39,3/1/2015 16:57,1,1.80,-73.982,Upper East Side,1,N,-73.987061,40.75112152,1,12.5,0.0,0.5,2.00,0.00,0.0,15.3
254911,2,15/1/2015 18:53,15/1/2015 19:00,5,1.54,-73.982,Upper East Side,1,N,-73.995934,40.75386429,1,7.5,1.0,0.5,1.70,0.00,0.3,11
254937,2,15/1/2015 18:56,15/1/2015 19:01,1,0.78,-73.979,Upper East Side,1,N,-73.987503,40.76572037,1,5.5,1.0,0.5,1.25,0.00,0.3,8.55


In [38]:
df2.reset_index(inplace=True, drop=True)

In [39]:
df2

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,pickup_longitude,pickup_latitude,RateCodeID,store_and_fwd_flag,dropoff_longitude,dropoff_latitude,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,29/1/2015 20:30,29/1/2015 20:54,2,12.74,-73.790,JFK Airport,4,N,-73.655937,40.70500565,1,47.5,0.5,0.5,9.60,0.00,0.3,58.4
1,1,1/1/2015 17:21,1/1/2015 17:53,1,17.50,-73.790,JFK Airport,2,N,-73.992577,40.7533493,1,52.0,0.0,0.5,7.00,5.33,0.0,65.13
2,1,18/1/2015 1:08,18/1/2015 1:38,2,18.50,-73.782,JFK Airport,1,N,-73.885918,40.81613922,1,51.0,0.5,0.5,0.00,5.33,0.3,57.63
3,1,13/1/2015 16:08,13/1/2015 16:37,1,14.80,-73.788,JFK Airport,1,N,-73.940056,40.6059494,2,41.0,0.0,0.5,0.00,0.00,0.3,41.8
4,1,29/1/2015 0:42,29/1/2015 1:04,1,14.30,-73.779,JFK Airport,1,N,-73.908669,40.7676506,1,39.5,0.5,0.5,0.00,0.00,0.3,40.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68563,2,10/1/2015 19:45,10/1/2015 19:53,1,1.69,-73.961,Upper East Side,1,N,-73.973541,40.78978729,2,8.0,0.0,0.5,0.00,0.00,0.3,8.8
68564,1,3/1/2015 16:39,3/1/2015 16:57,1,1.80,-73.982,Upper East Side,1,N,-73.987061,40.75112152,1,12.5,0.0,0.5,2.00,0.00,0.0,15.3
68565,2,15/1/2015 18:53,15/1/2015 19:00,5,1.54,-73.982,Upper East Side,1,N,-73.995934,40.75386429,1,7.5,1.0,0.5,1.70,0.00,0.3,11
68566,2,15/1/2015 18:56,15/1/2015 19:01,1,0.78,-73.979,Upper East Side,1,N,-73.987503,40.76572037,1,5.5,1.0,0.5,1.25,0.00,0.3,8.55
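As an alternative to concatenating and then calling `reset_index`, `pd.concat` accepts `ignore_index=True`, which renumbers the rows in one step. A minimal sketch on two one-row frames whose index values and distances are taken from the preview above:

```python
import pandas as pd

# Two single-row frames with the original index values seen in the preview.
a = pd.DataFrame({"pickup_latitude": ["JFK Airport"], "trip_distance": [12.74]}, index=[138])
b = pd.DataFrame({"pickup_latitude": ["Upper East Side"], "trip_distance": [1.69]}, index=[254881])

# ignore_index=True discards the old labels and numbers the rows from 0.
merged = pd.concat([a, b], ignore_index=True)

# Sanity check: no rows were lost or duplicated by the merge.
assert len(merged) == len(a) + len(b)
print(merged.index.tolist())  # [0, 1]
```

Both approaches give the same result; the length check is a cheap guard worth keeping either way.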


In [40]:
#df2.to_csv("focus.csv",index=False)