## Homework
The goal of this homework is to train a simple model for predicting the duration of a ride - similar to what we did in this module.

### Q1. Downloading the data
[NYC TLC Trip Record Data](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page)

Read the data for January. How many columns are there?

In [1]:
# import libraries
import pandas as pd
import numpy as np
# import visualization library
import matplotlib.pyplot as plt
import seaborn as sns

In [40]:
# January 2022 yellow taxi trip records from the NYC dataset
nyc_df_train = pd.read_parquet("../data/yellow_tripdata_2022-01.parquet")

In [41]:
print("Number of columns:", nyc_df_train.shape[1])

### Q2. Computing duration
Now let's compute the `duration` variable. It should contain the duration of a ride in minutes.
What's the standard deviation of the trip durations in January?

Let's first check the info of the dataframe to see the types of the `tpep_dropoff_datetime` and `tpep_pickup_datetime` columns.

In [42]:
nyc_df_train.info()

Both columns already have a datetime type, so we can subtract them directly to get the `duration` of each trip in minutes. No conversion to datetime is needed.

In [43]:
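# subtracting two datetime columns gives a timedelta column
nyc_df_train["duration"] = nyc_df_train["tpep_dropoff_datetime"] - nyc_df_train["tpep_pickup_datetime"]
# convert the timedeltas to minutes (vectorized, instead of a row-by-row apply)
nyc_df_train["duration"] = nyc_df_train["duration"].dt.total_seconds() / 60
nyc_df_train.head()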
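
To sanity-check the minutes conversion, here is a minimal sketch on a hand-made frame. The two rows are made up for illustration; only the column names match the taxi data.

```python
import pandas as pd

# hypothetical two-row frame using the same column names as the taxi data
trips = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2022-01-01 00:00:00", "2022-01-01 08:15:00"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2022-01-01 00:10:30", "2022-01-01 09:45:00"]),
})

# datetime - datetime gives a timedelta column
delta = trips["tpep_dropoff_datetime"] - trips["tpep_pickup_datetime"]

# .dt.total_seconds() converts the whole column at once; divide by 60 for minutes
trips["duration"] = delta.dt.total_seconds() / 60

print(trips["duration"].tolist())
```

The first ride lasts 10 minutes 30 seconds and the second 1 hour 30 minutes, so the expected durations are 10.5 and 90.0 minutes.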

In [44]:
import math

# The (population) standard deviation is defined as
#   std = sqrt(mean(x)), where x = abs(a - a.mean())**2
# and mean(x) = x.sum() / N, with N = len(x).

# following that definition, we can compute it by hand
x = abs(nyc_df_train.duration - nyc_df_train.duration.mean())**2
print(math.sqrt(x.sum() / len(x)))

# or we can take the easy way and use the built-in method; note that pandas
# divides by N - 1 by default (ddof=1), so the two values differ very slightly
print(nyc_df_train.duration.std())

# 46.45 is the right answer for this question
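
The hand calculation divides by N, while `Series.std()` in pandas divides by N - 1 unless told otherwise. A small sketch with made-up numbers shows the difference; on millions of rows the gap is negligible.

```python
import pandas as pd

# made-up values: mean is 5 and the squared deviations sum to 32
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# population standard deviation: divide by N (matches the hand calculation)
population_std = values.std(ddof=0)

# sample standard deviation: divide by N - 1 (the pandas default)
sample_std = values.std()

print(population_std, sample_std)
```

Here the population value is sqrt(32 / 8) = 2.0 and the sample value is sqrt(32 / 7), which is slightly larger.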

### Q3. Dropping outliers

Next, we need to check the distribution of the `duration` variable. There are some outliers. Let's remove them and keep only the records where the duration was between 1 and 60 minutes (inclusive).

What fraction of the records is left after you drop the outliers?

In [45]:
# Summary statistics of the duration column:
nyc_df_train.duration.describe()

In [46]:
# Count null values in every column. Does duration have any? No
nyc_df_train.isnull().sum()

In [47]:
# One of the best ways to describe a variable is to report the values that appear in the
# dataset and how many times each value appears. This is called the distribution of the variable.
sns.histplot(nyc_df_train.duration)

In [48]:
before_dropping_outliers = nyc_df_train.shape[0]
# keep rides between 1 and 60 minutes (inclusive); .copy() avoids a SettingWithCopyWarning later
nyc_dataframe = nyc_df_train[(nyc_df_train.duration >= 1) & (nyc_df_train.duration <= 60)].copy()
after_dropping_outliers = nyc_dataframe.shape[0]

fraction_left = after_dropping_outliers / before_dropping_outliers

# Let's look at the duration's histogram after dropping outliers
sns.histplot(nyc_dataframe.duration)

print(f"{fraction_left:.2%} of the records are left after dropping outliers")
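
The same fraction can be read directly off a boolean mask, since the mean of a boolean Series is the share of `True` values. The sketch below uses made-up durations and `Series.between`, which includes both endpoints by default.

```python
import pandas as pd

# made-up durations in minutes; 0.5 and 61.0 fall outside the [1, 60] range
durations = pd.Series([0.5, 1.0, 30.0, 60.0, 61.0])

# True where 1 <= duration <= 60 (both bounds inclusive)
in_range = durations.between(1, 60)

# the mean of a boolean mask is the fraction of rows that are kept
fraction_kept = in_range.mean()

print(f"{fraction_kept:.0%} of the records are left")
```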


### Q4. One-hot encoding

Let's apply one-hot encoding to the pickup and dropoff location IDs. We'll use only these two features for our model.
What's the dimensionality of this matrix (number of columns)?

#### 1. Turn the dataframe into a list of dictionaries

In [49]:
from sklearn.feature_extraction import DictVectorizer

# initialize the Dictionary Vectorizer
dv = DictVectorizer()

# let's choose our categorical fields; DictVectorizer one-hot encodes
# string values only, so we cast the location IDs to str
categorical_columns = ['PULocationID', 'DOLocationID']
nyc_dataframe[categorical_columns] = nyc_dataframe[categorical_columns].astype(str)

# turn selected columns into a list of dictionaries
train_dicts = nyc_dataframe[categorical_columns].to_dict(orient='records')

#### 2. Fit a dictionary vectorizer

In [50]:
X_train = dv.fit_transform(train_dicts)

#### 3. Get a feature matrix from it

In [51]:
print(f"The dimensionality of this matrix is {X_train.shape}")

In [52]:
print(f"The number of columns of this matrix is {X_train.shape[1]}")
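
To see why the IDs are cast to strings first, here is a small sketch with two made-up rides. `DictVectorizer` creates one column per `name=value` pair for string values, but keeps a numeric value as a single column.

```python
from sklearn.feature_extraction import DictVectorizer

# made-up rides: the same records, once with string IDs and once with integer IDs
as_strings = [
    {"PULocationID": "1", "DOLocationID": "2"},
    {"PULocationID": "3", "DOLocationID": "2"},
]
as_numbers = [
    {"PULocationID": 1, "DOLocationID": 2},
    {"PULocationID": 3, "DOLocationID": 2},
]

# string values are one-hot encoded: one column per distinct "name=value" pair
dv_str = DictVectorizer()
X_str = dv_str.fit_transform(as_strings)

# numeric values are passed through: one column per field name
dv_num = DictVectorizer()
X_num = dv_num.fit_transform(as_numbers)

print(dv_str.feature_names_, X_str.shape)
print(dv_num.feature_names_, X_num.shape)
```

With string IDs the matrix has three columns (`DOLocationID=2`, `PULocationID=1`, `PULocationID=3`); with integer IDs it has only two, and the model would treat the IDs as quantities.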

### Q5. Training a model
Now let's use the feature matrix from the previous step to train a model. What's the RMSE on the training data?

#### 1. Train a plain linear regression model with default parameters

In [53]:
# import Linear Regression and mean_squared_error from sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# initialize linear regression
lr = LinearRegression()

# define the name of the target field
target = 'duration'

# select target values from the filtered dataframe to train on
y_train = nyc_dataframe[target].values

lr.fit(X_train, y_train)

#### 2. Calculate the RMSE of the model on the training data

In [54]:
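# to calculate the root mean squared error, we first need predictions
y_predictions = lr.predict(X_train)
# take the square root of the MSE ourselves, because the `squared=False`
# argument is deprecated in newer scikit-learn releases
np.sqrt(mean_squared_error(y_train, y_predictions))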
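
For reference, RMSE is the square root of the mean squared difference between targets and predictions. A minimal sketch with made-up numbers, using only NumPy:

```python
import numpy as np

# made-up targets and predictions, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

# errors are -2, 2, -3; their squares are 4, 4, 9; the mean is 17 / 3
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse)
```

Because the errors are squared before averaging, the result is in the same unit as the target (minutes) and large misses weigh more than small ones.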

### Q6. Evaluating the model

Now let's apply this model to the validation dataset (February 2022).
What's the RMSE on validation?

In [68]:
# February 2022 data from the NYC dataset; we will use it for validation
nyc_df_val = pd.read_parquet('../data/yellow_tripdata_2022-02.parquet')

In [70]:
print("Number of columns:", nyc_df_val.shape[1])

In [71]:
# Let's check info
nyc_df_val.info()

In [72]:
# apply the same preprocessing as for the training dataset
nyc_df_val["duration"] = nyc_df_val.tpep_dropoff_datetime - nyc_df_val.tpep_pickup_datetime
nyc_df_val["duration"] = nyc_df_val["duration"].dt.total_seconds() / 60

# .copy() avoids a SettingWithCopyWarning when we modify columns later
nyc_dataframe_val = nyc_df_val[(nyc_df_val.duration >= 1) & (nyc_df_val.duration <= 60)].copy()

In [73]:
nyc_dataframe_val.head()

In [74]:
# turn selected columns into a list of dictionaries
nyc_dataframe_val[categorical_columns] = nyc_dataframe_val[categorical_columns].astype(str)
val_dicts = nyc_dataframe_val[categorical_columns].to_dict(orient='records')

In [75]:
# This time we just do the transform without fitting to get X_val
X_val = dv.transform(val_dicts)

# select target values for validation
y_val = nyc_dataframe_val['duration'].values
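
Calling `transform` without refitting keeps the validation matrix aligned with the training columns. The sketch below uses made-up IDs to illustrate what happens to a location that was not seen during fitting: `DictVectorizer` silently ignores it, so that row contains only zeros.

```python
from sklearn.feature_extraction import DictVectorizer

# hypothetical vectorizer fitted on two pickup locations only
dv_demo = DictVectorizer()
dv_demo.fit([{"PULocationID": "1"}, {"PULocationID": "2"}])

# "2" was seen during fit; "99" was not and is dropped without an error
X_new = dv_demo.transform([{"PULocationID": "2"}, {"PULocationID": "99"}])

# same number of columns as at fit time; the second row is all zeros
print(X_new.toarray())
```

For such rows the linear model falls back to its intercept, which is one reason the validation RMSE can be higher than the training RMSE.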

In [76]:
# Let's make predictions
y_val_predictions = lr.predict(X_val)

In [77]:
# ...and lastly, we check the score (square root of the MSE, because
# `squared=False` is deprecated in newer scikit-learn releases)
np.sqrt(mean_squared_error(y_val, y_val_predictions))