## Homework 1: Introduction
First, download the FHV data from the [NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

In [1]:
import pandas as pd 
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

## Q1. Reading the data

Read the data for January. How many records are there?

* 1054112
* 1154112
* 1254112
* 1354112

In [2]:
df = pd.read_parquet('data/fhv_tripdata_2021-01.parquet')

In [3]:
print('There are {} rows in the dataframe'.format(len(df)))

There are 1154112 rows in the dataframe


## Q2. Computing duration

Now let's compute the `duration` variable. It should contain the duration of a ride in minutes. 

What's the average trip duration in January?

* 15.16
* 19.16
* 24.16
* 29.16

In [26]:
df.head()
# We have pickup_datetime and dropOff_datetime columns to calculate the duration of the trip.

| | dispatching_base_num | pickup_datetime | dropOff_datetime | PUlocationID | DOlocationID | SR_Flag | Affiliated_base_number |
|---|---|---|---|---|---|---|---|
| 0 | B00009 | 2021-01-01 00:27:00 | 2021-01-01 00:44:00 | | | | B00009 |
| 1 | B00009 | 2021-01-01 00:50:00 | 2021-01-01 01:07:00 | | | | B00009 |
| 2 | B00013 | 2021-01-01 00:01:00 | 2021-01-01 01:51:00 | | | | B00013 |
| 3 | B00037 | 2021-01-01 00:13:09 | 2021-01-01 00:21:26 | | 72.0 | | B00037 |
| 4 | B00037 | 2021-01-01 00:38:31 | 2021-01-01 00:53:44 | | 61.0 | | B00037 |


In [27]:
# Convert the pickup and drop-off columns to datetime
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
df['dropOff_datetime'] = pd.to_datetime(df['dropOff_datetime'])

In [28]:
# Create duration column
df['duration'] = df['dropOff_datetime'] - df['pickup_datetime']

In [29]:
# Convert the duration from a timedelta to minutes (vectorized, no per-row lambda)
df['duration'] = df['duration'].dt.total_seconds() / 60

In [30]:
#Calculate the mean duration
print('The mean duration is {} minutes'.format(df['duration'].mean()))

The mean duration is 19.167224093791006 minutes


In [31]:
# Keep records where the duration is between 1 and 60 minutes (inclusive).
# .copy() avoids a SettingWithCopyWarning when columns are modified later.
df = df[(df['duration'] >= 1) & (df['duration'] <= 60)].copy()
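
As a quick sanity check, the duration computation and the 1 to 60 minute filter can be tried on a tiny hand-made frame. This is only a sketch: `toy` and `kept` are hypothetical names, the first two rows reuse timestamps from the `df.head()` output above, and the third row is invented to exercise the lower bound.

```python
import pandas as pd

# Hypothetical three-row frame using the same column names as the FHV data
toy = pd.DataFrame({
    "pickup_datetime": ["2021-01-01 00:27:00", "2021-01-01 00:01:00", "2021-01-01 00:10:00"],
    "dropOff_datetime": ["2021-01-01 00:44:00", "2021-01-01 01:51:00", "2021-01-01 00:10:30"],
})
toy["pickup_datetime"] = pd.to_datetime(toy["pickup_datetime"])
toy["dropOff_datetime"] = pd.to_datetime(toy["dropOff_datetime"])

# Timedelta -> minutes as a float
toy["duration"] = (toy["dropOff_datetime"] - toy["pickup_datetime"]).dt.total_seconds() / 60

# Keep trips between 1 and 60 minutes; between() is inclusive on both ends by default
kept = toy[toy["duration"].between(1, 60)]

print(toy["duration"].tolist())
print(len(kept))
```

Only the 17-minute trip survives the filter: the 110-minute trip is too long and the 30-second trip is too short.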

## Q3. Missing values

The features we'll use for our model are the pickup and dropoff location IDs.

Both columns have many missing values. Let's replace them with "-1".

What fraction of the pickup location IDs is missing? That is, what is the fraction of "-1"s after filling the NAs?

* 53%
* 63%
* 73%
* 83%

In [32]:
#Replace pick up location and drop off location null values with -1
df['PUlocationID'] = df['PUlocationID'].fillna("-1")
df['DOlocationID'] = df['DOlocationID'].fillna("-1")

#Fraction of missing values for pickup location 
print('Fraction of missing values for pickup location: {} %'.format(100*len(df[df['PUlocationID']=="-1"])/len(df)))

Fraction of missing values for pickup location: 83.52732770722618 %
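
The fraction of "-1"s after filling should equal the fraction of NaNs before filling. A minimal sketch on a hypothetical four-row column (`toy`, `filled`, and `share` are illustrative names, not part of the homework):

```python
import numpy as np
import pandas as pd

# Hypothetical column: three missing pickup IDs out of four
toy = pd.DataFrame({"PUlocationID": [np.nan, 72.0, np.nan, np.nan]})

# Share of missing values before filling
share_before = toy["PUlocationID"].isna().mean()

# Fill NaNs with the string "-1", as in the cell above, and count the "-1"s
filled = toy["PUlocationID"].fillna("-1")
share = (filled == "-1").mean()

print(share_before, share)
```

Taking the mean of a boolean Series is a compact way to get a proportion, since `True` counts as 1 and `False` as 0.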


## Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model. 

* Turn the dataframe into a list of dictionaries
* Fit a dictionary vectorizer 
* Get a feature matrix from it

What's the dimensionality of this matrix? (The number of columns).

* 2
* 152
* 352
* 525
* 725

In [42]:
# We'll use PUlocationID and DOlocationID as features
categorical = ['PUlocationID', 'DOlocationID']
numerical = ['duration']
# Cast the location IDs to str so that DictVectorizer one-hot encodes them
df[categorical] = df[categorical].astype(str)
# Turn the dataframe into a list of dictionaries, one per row
train_dicts = df[categorical].to_dict(orient='records')
# Fit the DictVectorizer and transform the dictionaries into a feature matrix
dv = DictVectorizer()
X_train = dv.fit_transform(train_dicts)

In [44]:
# Dimensionality of X_train
print('Dimensionality of X_train: {}'.format(X_train.shape[1]))

Dimensionality of X_train: 525
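
To see where the column count comes from, here is a small sketch on hypothetical records (`records`, `dv_toy`, `X_toy`, and `X_num` are illustrative names). `DictVectorizer` creates one column per distinct `feature=value` pair for string values, while numeric values are passed through as a single column, which is why the IDs are cast to `str` first.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records; the IDs are strings, as they are after astype(str)
records = [
    {"PUlocationID": "-1", "DOlocationID": "72.0"},
    {"PUlocationID": "-1", "DOlocationID": "61.0"},
    {"PUlocationID": "5.0", "DOlocationID": "61.0"},
]

dv_toy = DictVectorizer()
X_toy = dv_toy.fit_transform(records)

# Two distinct pickup values + two distinct dropoff values = four columns
print(X_toy.shape)
print(dv_toy.feature_names_)

# With numeric values the feature is NOT one-hot encoded: it stays one column
X_num = DictVectorizer().fit_transform([{"PUlocationID": 5.0}, {"PUlocationID": 72.0}])
print(X_num.shape)
```

By the same logic, the number of columns in `X_train` is the number of distinct pickup IDs plus the number of distinct dropoff IDs seen in the training data.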


## Q5. Training a model

Now let's use the feature matrix from the previous step to train a model. 

* Train a plain linear regression model with default parameters 
* Calculate the RMSE of the model on the training data

What's the RMSE on train?

* 5.52
* 10.52
* 15.52
* 20.52

In [46]:
#Create the target value
y_train = df[numerical].values
#Create the Linear Regression model
lr = LinearRegression()
#Fit the model
lr.fit(X_train, y_train)

In [49]:
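# Calculate RMSE (squared=False returns the square root of the MSE).
# Note: `squared` is deprecated as of scikit-learn 1.4, which adds root_mean_squared_error.
print('RMSE: {}'.format(mean_squared_error(y_train, lr.predict(X_train), squared=False)))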

RMSE: 10.528519425310185
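
For reference, RMSE is the square root of the mean of the squared prediction errors. The sketch below checks that definition against `mean_squared_error` on hypothetical numbers (`y_true` and `y_pred` are made up for illustration and are not taken from the model above). Taking the square root explicitly avoids relying on the `squared` argument.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical targets and predictions, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# RMSE from the definition: sqrt(mean((y - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The same quantity via scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is in the same units as the target, the training value above can be read as a typical error of roughly ten and a half minutes.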


## Q6. Evaluating the model

Now let's apply this model to the validation dataset (Feb 2021). 

What's the RMSE on validation?

* 6.01
* 11.01
* 16.01
* 21.01

In [51]:
df_test = pd.read_parquet('data/fhv_tripdata_2021-02.parquet')
# Convert the pickup and drop-off columns to datetime
df_test['pickup_datetime'] = pd.to_datetime(df_test['pickup_datetime'])
df_test['dropOff_datetime'] = pd.to_datetime(df_test['dropOff_datetime'])
# Create duration column
df_test['duration'] = df_test['dropOff_datetime'] - df_test['pickup_datetime']
# Convert the duration from a timedelta to minutes
df_test['duration'] = df_test['duration'].dt.total_seconds() / 60
# Keep records where the duration is between 1 and 60 minutes (inclusive);
# .copy() avoids a SettingWithCopyWarning in the assignments below
df_test = df_test[(df_test['duration'] >= 1) & (df_test['duration'] <= 60)].copy()
#Replace pick up location and drop off location null values with "-1"
df_test['PUlocationID'] = df_test['PUlocationID'].fillna("-1")
df_test['DOlocationID'] = df_test['DOlocationID'].fillna("-1")
# Change PUL and DOLocationID to str
df_test[categorical] = df_test[categorical].astype(str)
# Create the dictionary of categorical features
test_dicts = df_test[categorical].to_dict(orient= 'records')
# Transform the dictionary into a matrix
X_test = dv.transform(test_dicts)
# Create the target value
y_test = df_test[numerical].values


In [52]:
#Evaluate the model
print('RMSE: {}'.format(mean_squared_error(y_test, lr.predict(X_test), squared=False)))

RMSE: 11.014285828610237
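
The January and February preprocessing steps are identical, so they could be factored into one function. The following is a sketch only: `prepare_trips` is a hypothetical helper that is not part of the homework, and it is exercised here on a made-up two-row frame rather than on the real parquet files.

```python
import numpy as np
import pandas as pd


def prepare_trips(df):
    """Hypothetical helper: apply the notebook's preprocessing to a raw FHV frame."""
    df = df.copy()
    df["pickup_datetime"] = pd.to_datetime(df["pickup_datetime"])
    df["dropOff_datetime"] = pd.to_datetime(df["dropOff_datetime"])

    # Trip duration in minutes
    df["duration"] = (df["dropOff_datetime"] - df["pickup_datetime"]).dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes, inclusive
    df = df[(df["duration"] >= 1) & (df["duration"] <= 60)].copy()

    # Missing location IDs become "-1"; everything is cast to str for DictVectorizer
    for col in ["PUlocationID", "DOlocationID"]:
        df[col] = df[col].fillna("-1").astype(str)
    return df


# Made-up raw data: one 15-minute trip and one 2-hour trip
raw = pd.DataFrame({
    "pickup_datetime": ["2021-02-01 00:00:00", "2021-02-01 00:00:00"],
    "dropOff_datetime": ["2021-02-01 00:15:00", "2021-02-01 02:00:00"],
    "PUlocationID": [np.nan, np.nan],
    "DOlocationID": [72.0, 61.0],
})
out = prepare_trips(raw)
print(out[["PUlocationID", "DOlocationID", "duration"]])
```

With such a helper, both months would go through exactly the same steps, for example `prepare_trips(pd.read_parquet('data/fhv_tripdata_2021-02.parquet'))`, which reduces the risk of the training and validation pipelines drifting apart.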
