Q1) Install MLflow

Which version of MLflow is installed?

Answer:

mlflow, version 1.26.1

In [None]:
!mlflow --version
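
The same check can also be done from Python instead of the shell. The sketch below uses the standard library's importlib.metadata; the helper name installed_version is made up for illustration, and it returns None rather than raising when the package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if MLflow is installed in this environment, otherwise None
print(installed_version("mlflow"))
```

This reads the package metadata directly, so it works even when the mlflow executable is not on PATH.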

Q2) Download and Preprocess the Data

How many files were saved to OUTPUT_FOLDER?

Answer:

Four .pkl files were created: dv.pkl, train.pkl, valid.pkl and test.pkl.

In [None]:
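import os

from sklearn.feature_extraction import DictVectorizer


# read_dataframe, preprocess and dump_pickle are helper functions not shown here
def run(raw_data_path: str, dest_path: str, dataset: str = "green"):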
    # load parquet files
    df_train = read_dataframe(
        os.path.join(raw_data_path, f"{dataset}_tripdata_2021-01.parquet")
    )
    df_valid = read_dataframe(
        os.path.join(raw_data_path, f"{dataset}_tripdata_2021-02.parquet")
    )
    df_test = read_dataframe(
        os.path.join(raw_data_path, f"{dataset}_tripdata_2021-03.parquet")
    )

    # extract the target
    target = 'duration'
    y_train = df_train[target].values
    y_valid = df_valid[target].values
    y_test = df_test[target].values

    # fit the dictvectorizer and preprocess data
    dv = DictVectorizer()
    X_train, dv = preprocess(df_train, dv, fit_dv=True)
    X_valid, _ = preprocess(df_valid, dv, fit_dv=False)
    X_test, _ = preprocess(df_test, dv, fit_dv=False)

    # create dest_path folder unless it already exists
    os.makedirs(dest_path, exist_ok=True)

    # save dictvectorizer and datasets (this creates the four .pkl files)
    dump_pickle(dv, os.path.join(dest_path, "dv.pkl"))
    dump_pickle((X_train, y_train), os.path.join(dest_path, "train.pkl"))
    dump_pickle((X_valid, y_valid), os.path.join(dest_path, "valid.pkl"))
    dump_pickle((X_test, y_test), os.path.join(dest_path, "test.pkl"))
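
To double-check the answer rather than count by eye, the output folder can be inspected after the script has run. This is a minimal sketch using only the standard library; count_pickles is a hypothetical helper name, and it only looks at files directly inside the folder.

```python
from pathlib import Path


def count_pickles(folder: str) -> int:
    """Count the files ending in .pkl directly inside `folder`."""
    return sum(1 for path in Path(folder).glob("*.pkl") if path.is_file())


# Example call, assuming OUTPUT_FOLDER is the dest_path passed to the script:
# count_pickles(OUTPUT_FOLDER)
```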



Q3) Train a Model with Autolog

How many parameters are automatically logged by MLflow?

Answer:

17 parameters were logged using mlflow.autolog

In [None]:
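import os

import mlflow
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


# load_pickle is a helper function not shown here
def run(data_path):
    # enable autologging once, before the run starts;
    # mlflow.autolog() already covers scikit-learn
    mlflow.autolog()
    with mlflow.start_run():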

        X_train, y_train = load_pickle(os.path.join(data_path, "train.pkl"))
        X_valid, y_valid = load_pickle(os.path.join(data_path, "valid.pkl"))

        rf = RandomForestRegressor(max_depth=10, random_state=0)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_valid)

        rmse = mean_squared_error(y_valid, y_pred, squared=False)
        # metric name without a trailing space, so it can be queried as metrics.rmse
        mlflow.log_metric('rmse', rmse)
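
With squared=False, mean_squared_error returns the root mean squared error, i.e. the square root of the mean of the squared differences between targets and predictions. The stdlib-only sketch below spells that formula out as a sanity check; the function name rmse and the toy values are made up for illustration. Note that newer scikit-learn releases deprecate the squared argument in favour of a separate root_mean_squared_error function, so the call above may need adjusting depending on the installed version.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: sqrt(mean((y_true - y_pred) ** 2))."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy trip durations in minutes, made up for illustration
print(rmse([10.0, 20.0, 30.0], [12.0, 18.0, 33.0]))
```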


Q4) Launch the Tracking Server Locally

In addition to --backend-store-uri, what else do you need to pass to properly configure the server?

Answer:

--default-artifact-root

Directory in which to store artifacts for any new experiments created. For tracking server backends that rely on SQL, this option is required in order to store artifacts. Note that this flag does not impact already-created experiments with any previous configuration of an MLflow server instance. By default, data will be logged to the mlflow-artifacts:/ URI proxy if the --serve-artifacts option is enabled. Otherwise, the default location will be ./mlruns.

In [None]:
!mlflow server --backend-store-uri sqlite:///backend.db --default-artifact-root ./artifacts
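
If the server needs to be started from a script rather than a notebook cell, the same command can be assembled as an argument list and handed to subprocess. This is only a sketch: build_server_command and launch_server are hypothetical helper names, and launch_server assumes the mlflow executable is on PATH.

```python
import subprocess
from typing import List


def build_server_command(backend_store_uri: str, artifact_root: str) -> List[str]:
    """Assemble the argv list for `mlflow server` with the two flags used above."""
    return [
        "mlflow", "server",
        "--backend-store-uri", backend_store_uri,
        "--default-artifact-root", artifact_root,
    ]


def launch_server(backend_store_uri: str, artifact_root: str) -> subprocess.Popen:
    """Start the tracking server as a child process (requires mlflow on PATH)."""
    return subprocess.Popen(build_server_command(backend_store_uri, artifact_root))


command = build_server_command("sqlite:///backend.db", "./artifacts")
print(" ".join(command))
# prints: mlflow server --backend-store-uri sqlite:///backend.db --default-artifact-root ./artifacts
```

Passing the arguments as a list avoids shell quoting problems if the paths contain spaces.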

Q5) Tune the Hyperparameters of the Model

The idea is to log only the information needed to answer the question below, including:

    the list of hyperparameters that are passed to the objective function during the optimization

    the RMSE obtained on the validation set (February 2021 data)



In [None]:
    def objective(params):

        with mlflow.start_run():
            # each of the 50 trials is logged as its own run
            # optional "model" tag, left disabled:
            # mlflow.set_tag("model", "random-forest")

            # log all the hyperparameters that got passed into this function
            mlflow.log_params(params)

            rf = RandomForestRegressor(**params)
            rf.fit(X_train, y_train)
            y_pred = rf.predict(X_valid)

            rmse = mean_squared_error(y_valid, y_pred, squared=False)
            # log rmse
            mlflow.log_metric("rmse", rmse)

        return {'loss': rmse, 'status': STATUS_OK}

What's the best validation RMSE that you got?

Duration

10.1 s to train

Metrics

RMSE: 6.628

Parameters

max_depth: 19

min_samples_leaf: 3

min_samples_split: 5

n_estimators: 28

random_state: 42

Q6) Promote the Best Model to the Model Registry

In [None]:
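import mlflow
from mlflow.entities import ViewType
from mlflow.tracking import MlflowClient


# HPO_EXPERIMENT_NAME, EXPERIMENT_NAME and train_and_log_model are defined elsewhere (not shown here)
def run(data_path, log_top):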

    client = MlflowClient()

    # retrieve the top_n model runs and log the models to MLflow
    experiment = client.get_experiment_by_name(HPO_EXPERIMENT_NAME)
    runs = client.search_runs(
        experiment_ids=experiment.experiment_id,
        run_view_type=ViewType.ACTIVE_ONLY,
        max_results=log_top,
        order_by=["metrics.rmse ASC"]
    )
    for run in runs:
        train_and_log_model(data_path=data_path, params=run.data.params)

    # select the model with the lowest test RMSE
    experiment = client.get_experiment_by_name(EXPERIMENT_NAME)
    best_run = client.search_runs(
        experiment_ids=experiment.experiment_id,  # ID of the experiment to search
        run_view_type=ViewType.ACTIVE_ONLY,  # skip deleted runs
        max_results=1,  # only one model is promoted
        order_by=["metrics.test_rmse ASC"]  # lowest test RMSE first
    )[0]

    # register the best model
    run_id = best_run.info.run_id
    model_uri = f"runs:/{run_id}/model"
    mlflow.register_model(model_uri=model_uri, name="greentaxi_regressor")
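
One detail worth keeping in mind when passing run.data.params on to a training function: MLflow returns parameter values as strings, so they generally need converting back to integers before they reach RandomForestRegressor. The helper below is a sketch of that conversion; the name parse_rf_params and the fixed key list are assumptions based on the parameters listed in Q5, and it assumes the values were logged as whole numbers such as "19".

```python
from typing import Dict, Mapping

# Parameter names taken from the Q5 results; all are integer-valued
RF_INT_PARAMS = (
    "max_depth",
    "min_samples_leaf",
    "min_samples_split",
    "n_estimators",
    "random_state",
)


def parse_rf_params(params: Mapping[str, str]) -> Dict[str, int]:
    """Convert string-valued parameters back to ints, ignoring unknown keys."""
    return {name: int(params[name]) for name in RF_INT_PARAMS if name in params}


# String values as they would come back from run.data.params (values from the Q5 run)
logged = {
    "max_depth": "19",
    "min_samples_leaf": "3",
    "min_samples_split": "5",
    "n_estimators": "28",
    "random_state": "42",
}
print(parse_rf_params(logged))
```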

What is the test RMSE of the best model?

Best RMSE:

6.548
