## Homework

The goal of this homework is to create a simple training pipeline that uses MLflow to track experiments and register the best model, with Mage as the orchestrator.

We'll use [the same NYC taxi dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page), the **Yellow** taxi data for March, 2023. 

## Question 1. Select the Tool

You can use the same tool you used when completing the module,
or choose a different one for your homework.

## Question 2. Version

What's the version of the orchestrator? 

## Question 3. Creating a pipeline

Let's read the March 2023 Yellow taxi trips data.

How many records did we load? 

- 3,003,766
- 3,203,766
- 3,403,766    <--
- 3,603,766

(Include a print statement in your code)

In [28]:
import pandas as pd

In [29]:
data = pd.read_parquet('https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-03.parquet')
print(f"Loaded {len(data):,} records")

Loaded 3,403,766 records


## Question 4. Data preparation

Let's continue with pipeline creation.

We will use the same logic for preparing the data we used previously. 

This is what we used (adjusted for yellow dataset):

```python
def read_dataframe(filename):
    df = pd.read_parquet(filename)

    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60

    df = df[(df.duration >= 1) & (df.duration <= 60)]

    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    
    return df
```
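To make the filtering step concrete, here is a minimal sketch that applies the same logic to a tiny, made-up DataFrame. The four toy trips and the helper name `prepare` are assumptions for illustration only; they are not part of the dataset or the course code. The sketch also takes an explicit `.copy()` after filtering, which is one common way to avoid pandas' `SettingWithCopyWarning`.

```python
import pandas as pd

# Hypothetical toy trips lasting 0.5, 10, 60 and 90 minutes
toy = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(['2023-03-01 00:00:00'] * 4),
    'tpep_dropoff_datetime': pd.to_datetime([
        '2023-03-01 00:00:30',
        '2023-03-01 00:10:00',
        '2023-03-01 01:00:00',
        '2023-03-01 01:30:00',
    ]),
    'PULocationID': [1, 2, 3, 4],
    'DOLocationID': [5, 6, 7, 8],
})


def prepare(df):
    """Same preparation logic as above, applied to an in-memory frame."""
    df = df.copy()  # leave the caller's frame untouched

    # Trip duration in minutes
    delta = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df['duration'] = delta.dt.total_seconds() / 60

    # Keep trips between 1 and 60 minutes, both bounds inclusive
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # Location IDs are treated as categories, so cast them to strings
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)
    return df


prepared = prepare(toy)
print(f"Kept {len(prepared)} of {len(toy)} toy trips")
```

Only the 10-minute and 60-minute trips survive, because both bounds of the filter are inclusive.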

Let's apply it to the data we loaded in question 3.

What's the size of the result? 

- 2,903,766
- 3,103,766
- 3,316,216    <--
- 3,503,766

In [30]:
def prepare_dataframe(df):
    # Trip duration in minutes
    df['duration'] = df.tpep_dropoff_datetime - df.tpep_pickup_datetime
    df.duration = df.duration.dt.total_seconds() / 60
    # Keep trips between 1 and 60 minutes; the explicit copy avoids
    # pandas' SettingWithCopyWarning on the assignment below
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()
    categorical = ['PULocationID', 'DOLocationID']
    df[categorical] = df[categorical].astype(str)

    return df

In [31]:
data = prepare_dataframe(data)


In [32]:
print(f"Loaded {len(data):,} records")

Loaded 3,316,216 records


## Question 5. Train a model

We will now train a linear regression model using the same code as in homework 1.

* Fit a dict vectorizer.
* Train a linear regression with default parameters.
* Use pickup and dropoff locations separately; don't create a combination feature.

Let's now use it in the pipeline. We will need to create another transformation block, and return both the dict vectorizer and the model.
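As a rough sketch of what such a block's body could look like, the function below fits a dict vectorizer and a linear regression and returns both. The name `train_model` and the toy DataFrame are assumptions made up for illustration; in an orchestrator the function would typically be wrapped in whatever block or task decorator that tool provides.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression


def train_model(df):
    """Fit a DictVectorizer and a LinearRegression; return both objects."""
    categorical = ['PULocationID', 'DOLocationID']

    # One dict per row; pickup and dropoff stay separate features
    dicts = df[categorical].to_dict(orient='records')

    dv = DictVectorizer()
    X = dv.fit_transform(dicts)
    y = df['duration'].values

    lr = LinearRegression()  # default parameters
    lr.fit(X, y)

    # Print the intercept so it shows up in the block's output
    print(f"Intercept: {lr.intercept_:.2f}")
    return dv, lr


# Hypothetical toy data; IDs are already strings, as after data preparation
toy = pd.DataFrame({
    'PULocationID': ['1', '1', '2', '2'],
    'DOLocationID': ['10', '20', '10', '20'],
    'duration': [5.0, 15.0, 10.0, 20.0],
})

dv, lr = train_model(toy)
print(sorted(dv.feature_names_))
```

Returning the vectorizer together with the model matters because the same fitted vectorizer is needed later to transform new data before prediction.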

What's the intercept of the model? 

Hint: print the `intercept_` field in the code block

- 21.77
- 24.77      <--
- 27.77
- 31.77

In [33]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import train_test_split

import pickle

In [34]:
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("homework3")

<Experiment: artifact_location='mlflow-artifacts:/2', creation_time=1749432753134, experiment_id='2', last_update_time=1749432753134, lifecycle_stage='active', name='homework3', tags={}>

In [35]:
categorical = ['PULocationID', 'DOLocationID']
target = 'duration'

def df_to_dict(df):
    # One dict per row, holding only the categorical features
    return df[categorical].to_dict(orient='records')

# Hold out 20% of the trips for validation
train_df, val_df = train_test_split(data, test_size=0.2, random_state=42)

# Fit the vectorizer on the training split only, then reuse it for validation
dv = DictVectorizer()
X_train = dv.fit_transform(df_to_dict(train_df))
X_val = dv.transform(df_to_dict(val_df))

y_train = train_df[target].values
y_val = val_df[target].values

In [36]:
with mlflow.start_run():
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_val)
    rmse = root_mean_squared_error(y_val, y_pred)
    mlflow.log_param("model_type", "LinearRegression")
    mlflow.log_metric("rmse", rmse)
    mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="models",
        input_example=X_val[:1],
        signature=mlflow.models.signature.infer_signature(X_val, y_pred)
    )

# The model is already fitted inside the run above, so no refit is needed
print('y-Intercept is:', lr.intercept_)

🏃 View run dashing-crow-418 at: http://127.0.0.1:5000/#/experiments/2/runs/be0ae907142e40cbb088c344048cd82d
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/2
y-Intercept is: 24.75145745943043


## Question 6. Register the model 

The model is trained, so let's save it with MLflow.

Find the logged model, and find the `MLmodel` file. What's the size of the model? (`model_size_bytes` field):

* 14,534
* 9,534
* 4,534   <--
* 1,534
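The `MLmodel` file is plain YAML, so the field can be read from the MLflow UI or pulled out of the file's text. Below is a minimal, standard-library sketch that scans for the top-level `model_size_bytes` key. The helper name `read_model_size_bytes` and the abbreviated sample text are assumptions for illustration, and the value `1234` is a placeholder, not the answer to this question.

```python
def read_model_size_bytes(mlmodel_text):
    """Return the top-level model_size_bytes value, or None if it is absent."""
    for line in mlmodel_text.splitlines():
        key, sep, value = line.partition(':')
        # Match only an unindented key, i.e. a top-level YAML entry
        if sep and key == 'model_size_bytes':
            return int(value.strip())
    return None


# Hypothetical, abbreviated MLmodel contents with a placeholder size
sample = """\
artifact_path: models
flavors:
  sklearn:
    pickled_model: model.pkl
model_size_bytes: 1234
"""

print(read_model_size_bytes(sample))
```

For a real run, the text would come from the `MLmodel` file stored with the logged model's artifacts; a YAML parser is the more robust choice if one is available.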

In [39]:
import os

from mlflow.tracking import MlflowClient

# Uses the tracking URI set earlier (http://127.0.0.1:5000)
mlflow.set_experiment("LR-model")
mlflow.sklearn.autolog()

with mlflow.start_run() as run:
    # Save and log the fitted dict vectorizer as an artifact
    os.makedirs("models", exist_ok=True)
    with open("models/preprocessor.b", "wb") as f_out:
        pickle.dump(dv, f_out)

    mlflow.log_artifact("models/preprocessor.b", artifact_path="preprocessor")

    # Log the linear regression model and register it as version 1
    mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="sklearn-model",
        registered_model_name="linear-reg-model",
    )

Successfully registered model 'linear-reg-model'.
2025/06/08 20:18:32 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: linear-reg-model, version 1


🏃 View run caring-dove-859 at: http://127.0.0.1:5000/#/experiments/3/runs/533df48171b443e09e517fa84807e85a
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/3


Created version '1' of model 'linear-reg-model'.
