# Trip Duration Model
This notebook uses the Yellow taxi dataset for January and February 2022 from the [NYC Taxis Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

This notebook is a copy of the one I used for the intro section.

In [1]:
# standard data analysis
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# sklearn imports
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error

In [3]:
# saving models
import pickle

In [4]:
# experiment tracking 
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("nyc_taxi_exp")

2023/05/19 20:15:24 INFO mlflow.tracking.fluent: Experiment with name 'nyc_taxi_exp' does not exist. Creating a new experiment.


<Experiment: artifact_location='/home/dan/mlops-zoomcamp/02-experiment-tracking/mlruns/1', creation_time=1684527324082, experiment_id='1', last_update_time=1684527324082, lifecycle_stage='active', name='nyc_taxi_exp', tags={}>

In [9]:
# create a processing function to reformat the input data
def process_taxis(filename):
    df = pd.read_parquet(f'../data/{filename}')

    # trip duration in minutes
    df['duration'] = df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']
    df['duration'] = df['duration'].dt.total_seconds() / 60

    # filter outliers: keep trips between 1 and 60 minutes (inclusive)
    # .copy() avoids a SettingWithCopyWarning on the assignment below
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # cast location IDs to strings so they are treated as categories
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    df = df[['duration','PULocationID', 'DOLocationID']]
    
    return df
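
To make the duration and outlier logic concrete, here is a minimal sketch on a hypothetical three-row frame. The rows are invented for illustration; only the column names come from the yellow-taxi schema used above.

```python
import pandas as pd

# Hypothetical toy trips (not real data); column names match the function above.
toy = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2022-01-01 00:00:00", "2022-01-01 00:10:00", "2022-01-01 01:00:00"]
    ),
    "tpep_dropoff_datetime": pd.to_datetime(
        ["2022-01-01 00:00:30", "2022-01-01 00:25:00", "2022-01-01 03:00:00"]
    ),
    "PULocationID": [142, 236, 166],
    "DOLocationID": [236, 42, 166],
})

# Timedelta -> minutes
toy["duration"] = (
    toy["tpep_dropoff_datetime"] - toy["tpep_pickup_datetime"]
).dt.total_seconds() / 60

# Keep trips between 1 and 60 minutes, then cast the IDs to strings
kept = toy[(toy["duration"] >= 1) & (toy["duration"] <= 60)].copy()
kept[["PULocationID", "DOLocationID"]] = kept[["PULocationID", "DOLocationID"]].astype(str)

print(toy["duration"].tolist())  # [0.5, 15.0, 120.0]
print(len(kept))                 # 1
```

The 30-second and 2-hour trips fall outside the 1 to 60 minute window, so only the 15-minute trip survives.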

In [10]:
yellow_jan = process_taxis("yellow_tripdata_2022-01.parquet")

In [11]:
yellow_jan

|  | duration | PULocationID | DOLocationID |
|---|---|---|---|
| 0 | 17.816667 | 142 | 236 |
| 1 | 8.400000 | 236 | 42 |
| 2 | 8.966667 | 166 | 166 |
| 3 | 10.033333 | 114 | 68 |
| 4 | 37.533333 | 68 | 163 |
| ... | ... | ... | ... |
| 2463926 | 5.966667 | 90 | 170 |
| 2463927 | 10.650000 | 107 | 75 |
| 2463928 | 11.000000 | 113 | 246 |
| 2463929 | 12.050000 | 148 | 164 |


In [18]:
# name of the target column
target = 'duration'

In [19]:
# turns the df into a giant list of dictionaries
train_dicts = yellow_jan.drop([target], axis=1).to_dict(orient='records')

In [16]:
# using the Dictionary Vectorizer from sklearn
dv = DictVectorizer()

# training feature matrix
X_train = dv.fit_transform(train_dicts)

In [20]:
# target vector
y_train = yellow_jan[target].values

In [25]:
# bring in the february data for validation
yellow_feb = process_taxis('yellow_tripdata_2022-02.parquet')


In [26]:
# turn categorical columns into list of dicts
val_dicts = yellow_feb.drop([target], axis=1).to_dict(orient='records')

In [27]:
# validation feature matrix
X_val = dv.transform(val_dicts)

# validation target vector
y_val = yellow_feb[target].values
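
One consequence of calling `transform` (rather than `fit_transform`) on the validation data is that categories never seen during fitting are silently dropped. A small sketch with hypothetical records, using made-up names (`demo_dv`, `X_demo_val`) to keep it separate from the notebook's variables:

```python
from sklearn.feature_extraction import DictVectorizer

train_records = [{"PULocationID": "142"}, {"PULocationID": "236"}]
# "999" is a hypothetical zone that does not appear in the training records
val_records = [{"PULocationID": "236"}, {"PULocationID": "999"}]

demo_dv = DictVectorizer()
demo_dv.fit(train_records)

# transform reuses the columns learned in fit; unseen key=value pairs are ignored
X_demo_val = demo_dv.transform(val_records)
print(X_demo_val.toarray())
# [[0. 1.]
#  [0. 0.]]
```

The second row is all zeros, so a trip from an unseen zone would be predicted from the model's intercept alone.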

## MLflow & Experiment tracking
MLflow is a Python package for tracking experiments in a machine learning project. It records each run's parameters, metrics, and other details, so the developer can trace every result back to the model configuration that produced it.

A sample MLflow run with a Lasso regression:

In [28]:
with mlflow.start_run():
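    # data paths
    mlflow.log_param("train_data_path", "data/yellow_tripdata_2022-01.parquet")
    mlflow.log_param("val_data_path", "data/yellow_tripdata_2022-02.parquet")

    # regularization strength; pass it to the model so the logged value is the one actually used
    alpha = 0.1
    mlflow.log_param("alpha", alpha)
    lr = Lasso(alpha=alpha)
    lr.fit(X_train, y_train)

    y_pred = lr.predict(X_val)
    # RMSE in minutes; the square root of the MSE works whether or not
    # the installed scikit-learn still accepts squared=False
    rmse = mean_squared_error(y_val, y_pred) ** 0.5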
    mlflow.log_metric("rmse", rmse)
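
The `rmse` metric logged above is the root mean squared error, expressed in the same unit as the target (minutes). As a sanity check on what that number means, here is a hand computation with NumPy on three hypothetical durations; the values are invented for illustration.

```python
import numpy as np

# Hypothetical true and predicted durations, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Errors are -2, 2, -3; squared errors are 4, 4, 9; their mean is 17/3
rmse_manual = float(np.sqrt(np.mean((y_true - y_hat) ** 2)))
print(round(rmse_manual, 4))
```

Because the errors are squared before averaging, the single 3-minute miss contributes more to the result than either 2-minute miss.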