___


# <font color= #8FC3FA> **NYC Taxi Predictions 2025 - Model Experiments 2** </font>
#### <font color= #2E9AFE> `Data Science Project - Homework 5`</font>
- <Strong> Viviana Toledo </Strong>
- <Strong> Date: </Strong> 28/10/2025

___

In [1]:
# General Libraries
import pandas as pd
from datetime import datetime

# Databricks Env
from dotenv import load_dotenv
import pickle
import pathlib

# Feature Engineering
from sklearn.feature_extraction import DictVectorizer

# Optimization
import math
import optuna

# Modeling
import mlflow
import xgboost as xgb
import mlflow.pyfunc
from optuna.samplers import TPESampler
from mlflow.models.signature import infer_signature
from mlflow import MlflowClient

# Evaluation
from sklearn.metrics import root_mean_squared_error

# Autolog function
mlflow.sklearn.autolog()

# <font color= #8FC3FA> **1. Data Loading** </font>

We start by loading the data:

In [2]:
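def read_dataframe(path):
    df = pd.read_parquet(path)
    # Trip duration in minutes
    df["duration"] = (df.lpep_dropoff_datetime - df.lpep_pickup_datetime).dt.total_seconds() / 60
    # Keep trips between 1 and 60 minutes; copy() avoids SettingWithCopyWarning below
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
    # Cast location IDs to strings so they are encoded as categories
    df[["PULocationID", "DOLocationID"]] = df[["PULocationID", "DOLocationID"]].astype(str)
    return df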

df_train = read_dataframe("../data/green_tripdata_2025-01.parquet")
df_val = read_dataframe("../data/green_tripdata_2025-02.parquet")

df_train["PU_DO"] = df_train["PULocationID"] + "_" + df_train["DOLocationID"]
df_val["PU_DO"] = df_val["PULocationID"] + "_" + df_val["DOLocationID"]

categorical = ["PU_DO"]
numerical = ["trip_distance"]

dv = DictVectorizer()
X_train = dv.fit_transform(df_train[categorical + numerical].to_dict(orient="records"))
X_val = dv.transform(df_val[categorical + numerical].to_dict(orient="records"))

y_train = df_train["duration"].values
y_val = df_val["duration"].values

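The data is stored in Parquet files. The `read_dataframe` function defined above reads one file, computes each trip's duration in minutes, keeps only trips lasting between 1 and 60 minutes, and casts the pickup and drop-off location IDs to strings.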
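To make the duration filter concrete, here is a minimal sketch on a toy DataFrame. The column names match the ones used in `read_dataframe`, but the three trips and the variable names `trips` and `kept` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy trips with the same column names used by read_dataframe (values are made up)
trips = pd.DataFrame({
    "lpep_pickup_datetime": pd.to_datetime([
        "2025-01-01 08:00:00", "2025-01-01 09:00:00", "2025-01-01 10:00:00",
    ]),
    "lpep_dropoff_datetime": pd.to_datetime([
        "2025-01-01 08:00:30", "2025-01-01 09:15:00", "2025-01-01 12:00:00",
    ]),
    "PULocationID": [74, 75, 166],
    "DOLocationID": [75, 236, 41],
})

# Duration in minutes: 0.5, 15.0 and 120.0 for the three toy trips
trips["duration"] = (
    trips["lpep_dropoff_datetime"] - trips["lpep_pickup_datetime"]
).dt.total_seconds() / 60

# Keep trips between 1 and 60 minutes; copy() avoids chained-assignment warnings
kept = trips[(trips["duration"] >= 1) & (trips["duration"] <= 60)].copy()

# Cast location IDs to strings so they are treated as categories later
kept[["PULocationID", "DOLocationID"]] = kept[["PULocationID", "DOLocationID"]].astype(str)

print(kept[["PULocationID", "DOLocationID", "duration"]])
```

Only the 15-minute trip survives: the 30-second trip and the two-hour trip fall outside the 1 to 60 minute window.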

In [3]:
df_train = read_dataframe('../data/green_tripdata_2025-01.parquet')
df_val = read_dataframe('../data/green_tripdata_2025-02.parquet')

# <font color= #8FC3FA> **2. Feature Engineering** </font>

Next, we apply feature engineering: we split the features into categorical and numerical types and encode them with a `DictVectorizer`:

In [4]:
def preprocess(df, dv):
    # Combine pickup and drop-off IDs into a single categorical feature
    df['PU_DO'] = df['PULocationID'] + '_' + df['DOLocationID']
    categorical = ['PU_DO']
    numerical = ['trip_distance']
    # One dictionary per row, as expected by DictVectorizer
    dicts = df[categorical + numerical].to_dict(orient='records')
    return dv.transform(dicts)

In [None]:
# Define categorical and numerical features (PU_DO must match what preprocess() builds)
df_train['PU_DO'] = df_train['PULocationID'] + '_' + df_train['DOLocationID']
categorical = ['PU_DO']
numerical = ['trip_distance']

# Dictionaries for preprocessing
dv = DictVectorizer()
train_dicts = df_train[categorical + numerical].to_dict(orient='records')

# Train and Evaluation
X_train = dv.fit_transform(train_dicts)
X_val = preprocess(df_val, dv)

# Define targets
target = 'duration'
y_train = df_train[target].values
y_val = df_val[target].values

Finally, we wrap the training and validation data as MLflow datasets:

In [6]:
training_dataset = mlflow.data.from_numpy(X_train.data, targets=y_train, name="green_tripdata_2025-01")
validation_dataset = mlflow.data.from_numpy(X_val.data, targets=y_val, name="green_tripdata_2025-02")
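Since `DictVectorizer` returns a SciPy sparse matrix by default, `X_train.data` in the cell above is the flat array of stored values rather than a matrix with one row per trip. The sketch below, using a made-up 3x3 matrix and the hypothetical names `dense_demo` and `sparse_demo`, shows the difference between `.data`, `.shape`, and `.toarray()`. It is meant only as a reference for reading that cell; converting the full feature matrix with `.toarray()` could be memory-heavy.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy 3x3 feature matrix (made-up values) with mostly zeros
dense_demo = np.array([
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 5.0],
    [0.0, 1.0, 0.5],
])
sparse_demo = csr_matrix(dense_demo)

print(sparse_demo.shape)      # rows x columns of the feature matrix
print(sparse_demo.data)       # 1-D array holding only the stored values
print(sparse_demo.toarray())  # dense 2-D array with one row per record
```

Here `.data` has five entries while the matrix has three rows, so the length of `.data` generally does not match the number of target values.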

# <font color= #8FC3FA> **3. Hyperparameter Optimization** </font>