# Trip Duration Model
<a id='top'></a>

[Experiment Tracking with MLflow](#exp-tracking)

This notebook uses the Yellow Taxi trip records for January and February 2022 from the [NYC Taxis Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page).

This notebook is a copy of the one I used for the intro section.

In [1]:
# standard data analysis
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# sklearn imports
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error

In [3]:
# saving models
import pickle

In [4]:
# experiment tracking 
import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")
mlflow.set_experiment("nyc_taxi_exp")

<Experiment: artifact_location='/home/dan/mlops-zoomcamp/02-experiment-tracking/mlruns/1', creation_time=1684527324082, experiment_id='1', last_update_time=1684527324082, lifecycle_stage='active', name='nyc_taxi_exp', tags={}>

In [5]:
# create a processing function to reformat the input data
def process_taxis(filename):
    df = pd.read_parquet(f'../data/{filename}')

    # set duration feature: trip length in minutes
    df['duration'] = (df['tpep_dropoff_datetime'] - df['tpep_pickup_datetime']).dt.total_seconds() / 60

    # filter outliers: keep trips between 1 and 60 minutes
    # .copy() avoids a SettingWithCopyWarning when columns are reassigned below
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # change cols to categorical
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    df = df[['duration','PULocationID', 'DOLocationID']]
    
    return df
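
To make the duration and outlier steps concrete, here is a minimal standard-library sketch of the same arithmetic. The three trips are made-up values for illustration, not rows from the dataset, and `duration_minutes` is a hypothetical helper rather than part of the notebook.

```python
from datetime import datetime

# Hypothetical (pickup, dropoff) pairs, invented for illustration
trips = [
    (datetime(2022, 1, 1, 0, 35, 40), datetime(2022, 1, 1, 0, 53, 29)),  # about 18 minutes
    (datetime(2022, 1, 1, 1, 0, 0), datetime(2022, 1, 1, 1, 0, 30)),     # 30 seconds: too short
    (datetime(2022, 1, 1, 2, 0, 0), datetime(2022, 1, 1, 4, 0, 0)),      # 2 hours: too long
]


def duration_minutes(pickup, dropoff):
    """Trip length in minutes, computed from a timedelta."""
    return (dropoff - pickup).total_seconds() / 60


durations = [duration_minutes(pickup, dropoff) for pickup, dropoff in trips]

# Same bounds as the filter in process_taxis: keep 1 to 60 minutes inclusive
kept = [minutes for minutes in durations if 1 <= minutes <= 60]
print(kept)
```

Only the first trip survives the filter; the 30-second and 2-hour trips fall outside the 1 to 60 minute window.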

In [6]:
yellow_jan = process_taxis("yellow_tripdata_2022-01.parquet")

In [7]:
yellow_jan

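|         | duration  | PULocationID | DOLocationID |
|---------|-----------|--------------|--------------|
| 0       | 17.816667 | 142          | 236          |
| 1       | 8.400000  | 236          | 42           |
| 2       | 8.966667  | 166          | 166          |
| 3       | 10.033333 | 114          | 68           |
| 4       | 37.533333 | 68           | 163          |
| ...     | ...       | ...          | ...          |
| 2463926 | 5.966667  | 90           | 170          |
| 2463927 | 10.650000 | 107          | 75           |
| 2463928 | 11.000000 | 113          | 246          |
| 2463929 | 12.050000 | 148          | 164          |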


In [8]:
# target vector
target = 'duration'

In [9]:
# turns the df into a giant list of dictionaries
train_dicts = yellow_jan.drop([target], axis=1).to_dict(orient='records')

In [10]:
# using the Dictionary Vectorizer from sklearn
dv = DictVectorizer()

# training feature matrix
X_train = dv.fit_transform(train_dicts)
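
As a rough picture of what the vectorizer does with string-valued features, the sketch below one-hot encodes a couple of records in plain Python. It is not scikit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical helpers, the records are made up, and the real `DictVectorizer` returns a sparse matrix by default rather than nested lists.

```python
def fit_vocabulary(records):
    """Assign one column to each distinct 'name=value' pair, in sorted order."""
    names = sorted({f"{key}={value}" for rec in records for key, value in rec.items()})
    return {name: idx for idx, name in enumerate(names)}


def transform(records, vocabulary):
    """One-hot encode records; pairs not seen during fitting are skipped."""
    rows = []
    for rec in records:
        row = [0.0] * len(vocabulary)
        for key, value in rec.items():
            idx = vocabulary.get(f"{key}={value}")
            if idx is not None:
                row[idx] = 1.0
        rows.append(row)
    return rows


# Made-up training records in the same shape as train_dicts
train_records = [
    {"PULocationID": "142", "DOLocationID": "236"},
    {"PULocationID": "236", "DOLocationID": "42"},
]

vocab = fit_vocabulary(train_records)
X = transform(train_records, vocab)

# A pickup zone that never appeared in training contributes no column
X_unseen = transform([{"PULocationID": "999", "DOLocationID": "42"}], vocab)
print(vocab)
print(X)
print(X_unseen)
```

This is why the validation data later goes through `transform` only: the columns are fixed by the training data, and location IDs that appear only in February are ignored instead of adding new features.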

In [11]:
# target vector
y_train = yellow_jan[target].values

In [12]:
# bring in the february data for validation
yellow_feb = process_taxis('yellow_tripdata_2022-02.parquet')


In [13]:
# turn categorical columns into list of dicts
val_dicts = yellow_feb.drop([target], axis=1).to_dict(orient='records')

In [14]:
# validation feature matrix
X_val = dv.transform(val_dicts)

# validation target vector
y_val = yellow_feb[target].values

## MLflow & Experiment tracking
<a id='exp-tracking'></a>
[Top](#top)

MLflow is a Python package for tracking experiments in a machine learning project. It records each run's parameters, metrics, and other details of the model, so that any result can be traced back to the settings that produced it.

Sample MLflow run with a Lasso regression:

In [16]:
with mlflow.start_run():
    # data pathways
    mlflow.log_param("train_data_path", "data/yellow_tripdata_2022-01.parquet")
    mlflow.log_param("val_data_path", "data/yellow_tripdata_2022-02.parquet")

    alpha = 0.1
    mlflow.log_param("alpha", alpha)
    # pass alpha to the model so the logged parameter matches what is actually trained
    lr = Lasso(alpha=alpha)
    lr.fit(X_train, y_train)

    y_pred = lr.predict(X_val)
    rmse = mean_squared_error(y_val, y_pred, squared=False)
    mlflow.log_metric("rmse", rmse)
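
The metric logged above is the root mean squared error, which is in the same units as the target (minutes). As a minimal sketch of the calculation, assuming made-up values and a hypothetical `rmse` helper rather than scikit-learn's own code:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up durations in minutes, for illustration only
y_true_example = [10.0, 20.0, 30.0]
y_pred_example = [12.0, 18.0, 33.0]

score = rmse(y_true_example, y_pred_example)
print(score)
```

The squared errors here are 4, 4 and 9, so the result is the square root of 17/3. Large misses are penalised more heavily than small ones because each error is squared before averaging.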

### Using a more complicated model
Optimizing an xgboost model with hyperopt

In [17]:
import xgboost as xgb

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from hyperopt.pyll import scope

In [20]:
train = xgb.DMatrix(X_train, label=y_train)
valid = xgb.DMatrix(X_val, label=y_val)

In [18]:
def objective(params):
    """
    Trains an xgboost model and evaluates it on the validation data.
    Logs the parameters and the validation RMSE to MLflow.
    """
    with mlflow.start_run():
        mlflow.set_tag("model", "xgboost")
        mlflow.log_params(params)
        booster = xgb.train(
            params=params,
            dtrain=train,
            num_boost_round=1000, # maximum rounds of boosting
            evals=[(valid, "validation")], # validation set
            early_stopping_rounds=50 # stop training if there's 50 rounds without an improvement
        )
        y_pred = booster.predict(valid)
        rmse = mean_squared_error(y_val, y_pred, squared=False)
        mlflow.log_metric("rmse", rmse)
    
    return {'loss': rmse, 'status': STATUS_OK}
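
The `early_stopping_rounds` argument is what keeps most runs well short of the 1000-round ceiling. The sketch below shows the general patience rule on a list of made-up validation scores. It illustrates the idea only and is not xgboost's implementation; `early_stopping_round` is a hypothetical helper.

```python
def early_stopping_round(scores, patience):
    """Return (best_round, stopped_round) for scores where lower is better.

    Training stops once `patience` rounds have passed without a new best score.
    """
    best_score = float("inf")
    best_round = -1
    for round_idx, score in enumerate(scores):
        if score < best_score:
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # Ran out of rounds before the patience limit was reached
    return best_round, len(scores) - 1


# Made-up validation RMSE values, one per boosting round
scores = [13.6, 11.5, 10.1, 10.3, 10.2, 10.4, 9.9]

best, stopped = early_stopping_round(scores, patience=3)
print(best, stopped)
```

With a patience of 3, the loop stops at round 5 because rounds 3, 4 and 5 all fail to beat round 2. The better score at round 6 is never reached, which is the trade-off a larger patience value (such as the 50 used here) is meant to soften.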

In [21]:
# range in which hyperopt should be exploring the hyperparameters
search_space = {
    # quniform returns a float, which scope.int converts to an integer - explores depths from 4 to 100
    'max_depth': scope.int(hp.quniform('max_depth',4,100,1)), 

    # these have logarithmic search spaces (the log of the returned value is uniformly distributed)
    # so each value is constrained to the interval [exp(low), exp(high)]
    'learning_rate': hp.loguniform('learning_rate', -3, 0),
    'reg_alpha': hp.loguniform('reg_alpha', -5, -1),
    'reg_lambda': hp.loguniform('reg_lambda', -6, -1),
    'min_child_weight': hp.loguniform('min_child_weight',-1,3),
    'objective': 'reg:squarederror',  # replaces the deprecated 'reg:linear' alias
    'seed': 42
}

best_result = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,
    max_evals=50,
    trials=Trials()
)

[0]	validation-rmse:13.59296                          
[1]	validation-rmse:11.54505                          
[2]	validation-rmse:10.06411                          
[3]	validation-rmse:8.69966                           
[4]	validation-rmse:7.99097                           
[5]	validation-rmse:7.20874                           
[6]	validation-rmse:6.67176                           
[7]	validation-rmse:6.42908                           
[8]	validation-rmse:6.12969                           
[9]	validation-rmse:6.01087                           
[10]	validation-rmse:5.83981                          
[11]	validation-rmse:5.77922                          
[12]	validation-rmse:5.68904                          
[13]	validation-rmse:5.65564                          
[14]	validation-rmse:5.63110                          
[15]	validation-rmse:5.57407                          
[16]	validation-rmse:5.56068                          
[17]	validation-rmse:5.53972                          
[18]	valid

KeyboardInterrupt: 
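
To get a feel for the ranges in the search space above: hyperopt documents `hp.loguniform(label, low, high)` as `exp(uniform(low, high))` and `hp.quniform(label, low, high, q)` as `round(uniform(low, high) / q) * q`. The sketch below reproduces those two formulas with the standard library only. The helper names and the seed are arbitrary choices for illustration, and this does not use hyperopt's own sampler.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """exp(uniform(low, high)): the log of the result is uniform on [low, high]."""
    return math.exp(rng.uniform(low, high))


def sample_quniform(rng, low, high, q):
    """round(uniform(low, high) / q) * q: a uniform draw snapped to multiples of q."""
    return round(rng.uniform(low, high) / q) * q


rng = random.Random(42)  # arbitrary seed so the sketch is repeatable

# Same bounds as 'learning_rate' and 'max_depth' in the search space
learning_rates = [sample_loguniform(rng, -3, 0) for _ in range(1000)]
depths = [int(sample_quniform(rng, 4, 100, 1)) for _ in range(1000)]

print(min(learning_rates), max(learning_rates))
print(min(depths), max(depths))
```

So `learning_rate` is searched between roughly 0.05 (that is, `exp(-3)`) and 1, with as many draws between 0.05 and 0.14 as between 0.37 and 1, since each of those spans covers one unit on the log scale.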

I'm not currently working with a GPU, so I'm just going to steal the winning parameters from Christian's video, [MLOps Zoomcamp 2.3](https://www.youtube.com/watch?v=iaJz-T7VWec&list=PL3MmuxUbc_hIUISrluw_A7wDSmfOhErJK&index=13).

In [22]:
params = {
    'learning_rate': 0.20472169880371677,
    'max_depth': 17,
    'min_child_weight': 1.2402611720043835,
    'objective': 'reg:squarederror',  # replaces the deprecated 'reg:linear' alias
    'reg_alpha': 0.2567896734700793,
    'reg_lambda': 0.004264404814393109,
    'seed': 42
}

mlflow.xgboost.autolog()

booster = xgb.train(
    params=params,
    dtrain=train,
    num_boost_round=1000, # maximum rounds of boosting
    evals=[(valid, "validation")], # validation set
    early_stopping_rounds=50 # stop training if there's 50 rounds without an improvement
)



2023/05/21 14:59:05 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '16157ba0a7694057863848511b4f5f37', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current xgboost workflow


[0]	validation-rmse:14.13654
[1]	validation-rmse:12.41407
[2]	validation-rmse:11.13752
[3]	validation-rmse:10.20585
[4]	validation-rmse:9.52644
[5]	validation-rmse:9.03517
[6]	validation-rmse:8.61399
[7]	validation-rmse:8.35407
[8]	validation-rmse:8.16316
[9]	validation-rmse:8.02544
[10]	validation-rmse:7.86176
[11]	validation-rmse:7.78026
[12]	validation-rmse:7.70959
[13]	validation-rmse:7.65227
[14]	validation-rmse:7.61065
[15]	validation-rmse:7.56120
[16]	validation-rmse:7.52321
[17]	validation-rmse:7.45013
[18]	validation-rmse:7.42742
[19]	validation-rmse:7.40429
[20]	validation-rmse:7.38648
[21]	validation-rmse:7.36524
[22]	validation-rmse:7.34339
[23]	validation-rmse:7.32905
[24]	validation-rmse:7.31197
[25]	validation-rmse:7.24365
[26]	validation-rmse:7.23123
[27]	validation-rmse:7.20749
[28]	validation-rmse:7.19439
[29]	validation-rmse:7.17239
[30]	validation-rmse:7.16097
[31]	validation-rmse:7.08732
[32]	validation-rmse:7.03359
[33]	validation-rmse:7.02734
[34]	validation-rmse

