# Part 0 - A typical Data Science project

<font size=3>
In this section, we walk through a typical data science (DS) project without applying any of the MLOps principles covered in class. The objective is to get familiar with the use case and the data.
</font>

## 1 - Use Case Introduction

Our project is to predict *New York City taxi trip duration*. 
The goal is to use open-source data to train a simple machine learning model that predicts taxi trip durations. We will assume that our final goal is to have this algorithm **running in production**.

The ultimate goal for this use case could be to predict the duration of a trip in real time (as Google Maps or Waze do). However, for simplicity, this module assumes that batch prediction is enough.

The data used for prediction will be stored in a file, which we will later load and feed to the trained model.

The machine learning phase consists mainly of the following steps:
- data processing
- model training
- model evaluation
- prediction

The data to use for this module can be downloaded from the [TLC Trip Record Data page](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).
To complete this module, you will need three data samples:
- `sample 1 example`: yellow taxi trip data for 2021-01 (to train the model)
- `sample 2 example`: yellow taxi trip data for 2021-02 (to evaluate the model)
- `sample 3 example`: yellow taxi trip data for 2021-03 (for prediction)

## 2 - Setup 

### Imports

In [1]:
from typing import List, Tuple
from pathlib import Path

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
import seaborn as sns

### Globals

In [2]:
ROOT_PATH = Path("../")

TRAIN_PATH = ROOT_PATH / "data/yellow_tripdata_2021-01.parquet"
TEST_PATH = ROOT_PATH / "data/yellow_tripdata_2021-02.parquet"
PREDICT_PATH = ROOT_PATH / "data/yellow_tripdata_2021-03.parquet"

## 3 - Download & Load Data

In [3]:
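import urllib.request

# Create the data folder if it does not exist yet; `urlretrieve` will not create it.
TRAIN_PATH.parent.mkdir(parents=True, exist_ok=True)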

urllib.request.urlretrieve(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet",
    TRAIN_PATH,
)
urllib.request.urlretrieve(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-02.parquet",
    TEST_PATH,
)
urllib.request.urlretrieve(
    "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-03.parquet",
    PREDICT_PATH,
)

(PosixPath('../data/yellow_tripdata_2021-03.parquet'),
 <http.client.HTTPMessage at 0x280b36220>)
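Re-running the download cell fetches the three files again every time. A minimal sketch of a helper that skips files already on disk is shown below; the name `download_if_missing` is hypothetical and not part of the original notebook.

```python
from pathlib import Path
import urllib.request


def download_if_missing(url: str, destination: Path) -> Path:
    """Download `url` to `destination` unless the file already exists."""
    destination = Path(destination)
    # Create the parent folder (for example ../data) if it is not there yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    if not destination.exists():
        urllib.request.urlretrieve(url, destination)
    return destination
```

It could be called once per sample, for example `download_if_missing(url, TRAIN_PATH)` with the corresponding URL from the cell above.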

In [4]:
train_df = pd.read_parquet(TRAIN_PATH)
train_df.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee
0,1,2021-01-01 00:30:10,2021-01-01 00:36:12,1.0,2.1,1.0,N,142,43,2,8.0,3.0,0.5,0.0,0.0,0.3,11.8,2.5,
1,1,2021-01-01 00:51:20,2021-01-01 00:52:19,1.0,0.2,1.0,N,238,151,2,3.0,0.5,0.5,0.0,0.0,0.3,4.3,0.0,
2,1,2021-01-01 00:43:30,2021-01-01 01:11:06,1.0,14.7,1.0,N,132,165,1,42.0,0.5,0.5,8.65,0.0,0.3,51.95,0.0,
3,1,2021-01-01 00:15:48,2021-01-01 00:31:01,0.0,10.6,1.0,N,138,132,1,29.0,0.5,0.5,6.05,0.0,0.3,36.35,0.0,
4,2,2021-01-01 00:31:49,2021-01-01 00:48:21,1.0,4.94,1.0,N,68,33,1,16.5,0.5,0.5,4.06,0.0,0.3,24.36,2.5,


## 4 - Prepare the data

Let's prepare the data to make it ready for machine learning.

For this, we need to clean it, compute the target (what we want to predict), and compute some features to help the model understand the data better.

### 4-1 Compute the target

We want to predict the duration of a taxi trip in minutes. Let's compute it as the difference between the drop-off time and the pick-up time of each trip.

In [5]:
def compute_target(
    taxi_rides: pd.DataFrame,
    pickup_column: str = "tpep_pickup_datetime",
    dropoff_column: str = "tpep_dropoff_datetime",
) -> pd.DataFrame:
    # Subtracting two datetime columns yields a timedelta, which we convert to minutes.
    taxi_rides["duration"] = taxi_rides[dropoff_column] - taxi_rides[pickup_column]
    taxi_rides["duration"] = taxi_rides["duration"].dt.total_seconds() / 60
    return taxi_rides


train_df = compute_target(train_df)
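To make the computation concrete, here is a small self-contained sketch that applies the same arithmetic to the first two rides displayed earlier. The variable names are illustrative only.

```python
import pandas as pd

# Pickup and drop-off times of the first two rides shown in the table above.
rides = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2021-01-01 00:30:10", "2021-01-01 00:51:20"]
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2021-01-01 00:36:12", "2021-01-01 00:52:19"]
        ),
    }
)

# Timedelta -> seconds -> minutes.
delta = rides["tpep_dropoff_datetime"] - rides["tpep_pickup_datetime"]
rides["duration"] = delta.dt.total_seconds() / 60

print(rides["duration"].round(2).tolist())  # [6.03, 0.98]
```

The first ride lasts 6 minutes and 2 seconds (362 s), the second only 59 seconds.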

In [6]:
train_df["duration"].describe()

count    1.369769e+06
mean     1.391168e+01
std      1.312006e+02
min     -1.350846e+05
25%      5.566667e+00
50%      9.066667e+00
75%      1.461667e+01
max      2.881770e+04
Name: duration, dtype: float64

Let's remove outliers and restrict the scope to trips lasting between 1 minute and 1 hour.

In [7]:
MIN_DURATION = 1
MAX_DURATION = 60


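def filter_outliers(
    taxi_rides: pd.DataFrame, min_duration: int = 1, max_duration: int = 60
) -> pd.DataFrame:
    """Keep rides whose duration (in minutes) lies within [min_duration, max_duration]."""
    # `.copy()` returns an independent DataFrame, so later column assignments
    # do not trigger pandas' SettingWithCopyWarning.
    return taxi_rides[taxi_rides["duration"].between(min_duration, max_duration)].copy()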


train_df = filter_outliers(train_df, MIN_DURATION, MAX_DURATION)
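Note that `Series.between` includes both bounds by default, so rides of exactly 1 or 60 minutes are kept. A quick check on made-up durations (not taken from the dataset):

```python
import pandas as pd

# Made-up durations in minutes, including values on and around the bounds.
toy = pd.DataFrame({"duration": [-5.0, 0.98, 1.0, 27.6, 60.0, 75.0]})

# Same filter as in `filter_outliers`, with both bounds included.
kept = toy[toy["duration"].between(1, 60)]

print(kept["duration"].tolist())  # [1.0, 27.6, 60.0]
```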

### 4-2 Prepare features

Most machine learning models don't work directly with categorical features. Because of this, they must be encoded so that the ML model can consume them.

In [8]:
CATEGORICAL_COLS = ["PULocationID", "DOLocationID", "passenger_count"]


def encode_categorical_cols(
    taxi_rides: pd.DataFrame, categorical_cols: List[str] = None
) -> pd.DataFrame:
    """Encode categorical columns in `taxi_rides` as strings (missing values become "-1")."""
    if categorical_cols is None:
        categorical_cols = CATEGORICAL_COLS
    # Cast to int before str so that 1.0 becomes "1" rather than "1.0".
    taxi_rides[categorical_cols] = (
        taxi_rides[categorical_cols].fillna(-1).astype("int").astype("str")
    )
    return taxi_rides


train_df = encode_categorical_cols(train_df)
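The cast to strings matters for the vectorization step that follows: scikit-learn's `DictVectorizer` one-hot encodes string values but passes numeric values through unchanged. A minimal sketch with two made-up records illustrates the difference:

```python
from sklearn.feature_extraction import DictVectorizer

# The same two location IDs, once as integers and once as strings.
numeric_records = [{"PULocationID": 142}, {"PULocationID": 132}]
string_records = [{"PULocationID": "142"}, {"PULocationID": "132"}]

numeric_dv = DictVectorizer(sparse=False)
string_dv = DictVectorizer(sparse=False)

# Numeric values are kept as a single numeric feature.
numeric_matrix = numeric_dv.fit_transform(numeric_records)
# String values are expanded into one indicator column per category.
string_matrix = string_dv.fit_transform(string_records)

print(numeric_dv.feature_names_, numeric_matrix.tolist())
print(string_dv.feature_names_, string_matrix.tolist())
```

Left as integers, the location IDs would be treated as quantities, which is meaningless for zone identifiers.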

In [9]:
train_df

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount,congestion_surcharge,airport_fee,duration
0,1,2021-01-01 00:30:10,2021-01-01 00:36:12,1,2.10,1.0,N,142,43,2,8.00,3.00,0.5,0.00,0.00,0.3,11.80,2.5,,6.033333
2,1,2021-01-01 00:43:30,2021-01-01 01:11:06,1,14.70,1.0,N,132,165,1,42.00,0.50,0.5,8.65,0.00,0.3,51.95,0.0,,27.600000
3,1,2021-01-01 00:15:48,2021-01-01 00:31:01,0,10.60,1.0,N,138,132,1,29.00,0.50,0.5,6.05,0.00,0.3,36.35,0.0,,15.216667
4,2,2021-01-01 00:31:49,2021-01-01 00:48:21,1,4.94,1.0,N,68,33,1,16.50,0.50,0.5,4.06,0.00,0.3,24.36,2.5,,16.533333
5,1,2021-01-01 00:16:29,2021-01-01 00:24:30,1,1.60,1.0,N,224,68,1,8.00,3.00,0.5,2.35,0.00,0.3,14.15,2.5,,8.016667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1369763,2,2021-01-31 23:04:00,2021-01-31 23:18:00,-1,7.74,,,159,259,0,22.15,0.00,0.5,0.00,0.00,0.3,22.95,,,14.000000
1369764,2,2021-01-31 23:03:00,2021-01-31 23:33:00,-1,8.89,,,229,181,0,27.78,0.00,0.5,7.46,0.00,0.3,38.54,,,30.000000
1369765,2,2021-01-31 23:29:00,2021-01-31 23:51:00,-1,7.43,,,41,70,0,32.58,0.00,0.5,0.00,6.12,0.3,39.50,,,22.000000
1369766,2,2021-01-31 23:25:00,2021-01-31 23:38:00,-1,6.26,,,74,137,0,16.85,0.00,0.5,3.90,0.00,0.3,24.05,,,13.000000


In [None]:
def vectorize_dataframe(
    taxi_rides: pd.DataFrame,
    categorical_cols: List[str] = None,
    dict_vectorizer: DictVectorizer = None,
    with_target: bool = True,
) -> Tuple:
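    """Convert a DataFrame into a sparse feature matrix and a target array.

    Args:
        taxi_rides (pd.DataFrame): DataFrame to be converted.
        categorical_cols (List[str], optional): Columns to one-hot encode. Defaults to the
            pickup location, drop-off location and passenger count columns.
        dict_vectorizer (DictVectorizer, optional): A pre-fit DictVectorizer to reuse. If None,
            a new one is fit on `taxi_rides`. Defaults to None.
        with_target (bool): Whether to extract the `duration` column as target. Defaults to True.

    Returns:
        Tuple[csr_matrix, np.ndarray, DictVectorizer]: Tuple containing the sparse feature matrix,
        the target array (None if `with_target` is False), and the DictVectorizer used to perform
        the conversion.
    """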
    if categorical_cols is None:
        categorical_cols = ["PULocationID", "DOLocationID", "passenger_count"]
    dicts = taxi_rides[categorical_cols].to_dict(orient="records")

    target = None
    if with_target:
        target = taxi_rides["duration"].values

    if dict_vectorizer is None:
        dict_vectorizer = DictVectorizer()
        dict_vectorizer.fit(dicts)

    features = dict_vectorizer.transform(dicts)
    return features, target, dict_vectorizer


X_train, y_train, dict_vectorizer = vectorize_dataframe(train_df)
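Reusing the fitted `dict_vectorizer` on later datasets keeps the feature columns identical to those seen during training. The sketch below, on made-up records, shows the resulting layout and what happens to a category that was absent when fitting:

```python
from sklearn.feature_extraction import DictVectorizer

train_records = [
    {"PULocationID": "142", "DOLocationID": "43"},
    {"PULocationID": "132", "DOLocationID": "165"},
]
toy_dv = DictVectorizer()
train_matrix = toy_dv.fit_transform(train_records)

# Pickup zone "999" was not seen while fitting, so it is silently ignored
# and the row only activates the known drop-off column.
new_records = [{"PULocationID": "999", "DOLocationID": "43"}]
new_matrix = toy_dv.transform(new_records)

print(toy_dv.feature_names_)
print(train_matrix.toarray())
print(new_matrix.toarray())
```

The transformed matrix always has one column per category learned during fitting, which is what lets the trained model consume new data.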

## 5 - Train model

We train a basic linear regression model to get a baseline performance.

In [None]:
def train_model(x_train: csr_matrix, y_train: np.ndarray):
    lr_model = LinearRegression()
    lr_model.fit(x_train, y_train)
    return lr_model


model = train_model(X_train, y_train)
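With only one-hot encoded categorical features, a linear regression learns an additive offset per category. In the simplest case of a single categorical feature, its predictions reduce to the mean duration of each category. A toy illustration with made-up location labels and durations (not taken from the dataset):

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Made-up rides: one categorical feature and a duration in minutes.
records = [
    {"PULocationID": "A"},
    {"PULocationID": "A"},
    {"PULocationID": "B"},
    {"PULocationID": "B"},
]
durations = np.array([10.0, 14.0, 20.0, 30.0])

# Dense output keeps the toy example easy to inspect.
toy_vectorizer = DictVectorizer(sparse=False)
features = toy_vectorizer.fit_transform(records)

toy_model = LinearRegression().fit(features, durations)
predictions = toy_model.predict(features)

# Each ride is predicted at the mean duration of its category (12 and 25 minutes).
print(predictions.round(2))
```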

## 6 - Evaluate model

We evaluate the model on both the training data and the test data.

### 6-1 On train data

In [None]:
def predict_duration(input_data: csr_matrix, model: LinearRegression):
    return model.predict(input_data)


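def evaluate_model(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Return the root mean squared error (RMSE), in minutes."""
    # `squared=False` was deprecated in scikit-learn 1.4 and later removed;
    # taking the square root explicitly works across versions.
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


prediction = predict_duration(X_train, model)
train_rmse = evaluate_model(y_train, prediction)

In [None]:
print(f"Train RMSE: {train_rmse:.2f}")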
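The metric computed by `evaluate_model` is the root mean squared error (RMSE), which is expressed in minutes like the target. A small worked example with made-up values shows how it is obtained:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true and predicted durations, in minutes.
y_true_toy = np.array([10.0, 20.0, 30.0])
y_pred_toy = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2 and -3, so the squared errors are 4, 4 and 9.
rmse_manual = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))

print(round(float(rmse_manual), 2))  # 2.38
```

The mean squared error is 17/3 ≈ 5.67 squared minutes, and its square root is about 2.38 minutes.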

### 6-2 On test data

In [None]:
test_df = pd.read_parquet(TEST_PATH)

In [None]:
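# Apply the same duration filter as for the training data,
# so that both scores cover the same scope of trips.
X_test, y_test, _ = (
    test_df.pipe(compute_target)
    .pipe(filter_outliers, MIN_DURATION, MAX_DURATION)
    .pipe(encode_categorical_cols)
    .pipe(vectorize_dataframe, dict_vectorizer=dict_vectorizer)
)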

In [None]:
y_pred_test = predict_duration(X_test, model)
test_rmse = evaluate_model(y_test, y_pred_test)

In [None]:
print(f"Test RMSE: {test_rmse:.2f}")

## 7 - Predict

We can now use our model on fresh, unseen data to predict the duration of a taxi trip from its characteristics.

In [None]:
predict_df = pd.read_parquet(PREDICT_PATH)

In [None]:
x_pred, _, _ = predict_df.pipe(encode_categorical_cols).pipe(
    vectorize_dataframe, dict_vectorizer=dict_vectorizer, with_target=False
)

y_pred = predict_duration(x_pred, model)
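In a batch setting, the predictions would typically be written back to a file for downstream use. The sketch below assumes a CSV output for simplicity; the helper `save_predictions`, the column name `predicted_duration` and the output location are hypothetical and not part of the original notebook.

```python
from pathlib import Path
import tempfile

import numpy as np
import pandas as pd


def save_predictions(
    rides: pd.DataFrame, predictions: np.ndarray, output_path: Path
) -> Path:
    """Write the rides with a `predicted_duration` column (in minutes) to a CSV file."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # `assign` returns a new DataFrame, so the input is left unchanged.
    rides.assign(predicted_duration=predictions).to_csv(output_path, index=False)
    return output_path


# Toy usage with made-up rides and predictions, written to a temporary folder.
toy_rides = pd.DataFrame(
    {"PULocationID": ["142", "132"], "DOLocationID": ["43", "165"]}
)
toy_predictions = np.array([6.5, 27.1])
saved_path = save_predictions(
    toy_rides, toy_predictions, Path(tempfile.mkdtemp()) / "predictions.csv"
)
print(pd.read_csv(saved_path))
```

In this notebook, the call could look like `save_predictions(predict_df, y_pred, ROOT_PATH / "data/predictions.csv")`, where the file name is only an example.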