## MLflow 5 minute Tracking Quickstart

This notebook demonstrates using a local MLflow Tracking Server to log, register, and then load a model as a generic Python Function (pyfunc) to perform inference on a Pandas DataFrame.

Throughout this notebook, we'll be using the MLflow fluent API to perform all interactions with the MLflow Tracking Server.

In [3]:
!pip install mlflow

Defaulting to user installation because normal site-packages is not writeable
Collecting mlflow
  Obtaining dependency information for mlflow from https://files.pythonhosted.org/packages/3b/bb/28dedc7ca2a16bdf825c2bc1aff15721dfba06d1dea07b3c1686fbf29d37/mlflow-2.9.1-py3-none-any.whl.metadata
  Downloading mlflow-2.9.1-py3-none-any.whl.metadata (13 kB)
Collecting cloudpickle<4 (from mlflow)
  Obtaining dependency information for cloudpickle<4 from https://files.pythonhosted.org/packages/96/43/dae06432d0c4b1dc9e9149ad37b4ca8384cf6eb7700cd9215b177b914f0a/cloudpickle-3.0.0-py3-none-any.whl.metadata
  Downloading cloudpickle-3.0.0-py3-none-any.whl.metadata (7.0 kB)
Collecting databricks-cli<1,>=0.8.7 (from mlflow)
  Obtaining dependency information for databricks-cli<1,>=0.8.7 from https://files.pythonhosted.org/packages/ae/a3/d56f8382c40899301f327d1c881278b09c9b8bc301c2c111633a0346d06e/databricks_cli-0.18.0-py2.py3-none-any.whl.metadata
  Downloading databricks_cli-0.18.0-py2.py3-none-any.

In [4]:
!pip install pandas
!pip install scikit-learn


Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable

In [5]:
!pip show mlflow

Name: mlflow
Version: 2.9.1
Summary: MLflow: A Platform for ML Development and Productionization
Home-page: https://mlflow.org/
Author: Databricks
Author-email: 
License: Apache License 2.0
Location: /home/tarik/.local/lib/python3.10/site-packages
Requires: alembic, click, cloudpickle, databricks-cli, docker, entrypoints, Flask, gitpython, gunicorn, importlib-metadata, Jinja2, markdown, matplotlib, numpy, packaging, pandas, protobuf, pyarrow, pytz, pyyaml, querystring-parser, requests, scikit-learn, scipy, sqlalchemy, sqlparse
Required-by: 


In [6]:
!pip show seaborn

Name: seaborn
Version: 0.12.2
Summary: Statistical data visualization
Home-page: 
Author: 
Author-email: Michael Waskom <mwaskom@gmail.com>
License: 
Location: /home/tarik/.local/lib/python3.10/site-packages
Requires: matplotlib, numpy, pandas
Required-by: 


In [7]:
import mlflow
from mlflow.models import infer_signature
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.pipeline import Pipeline  # Import Pipeline class
import joblib



### Preprocessing the CSV

In [11]:
# Load your dataset
df = pd.read_csv('data/2020_Building_Energy_Benchmarking.csv', sep=',')

# df.dropna(axis=0, inplace=True)
df.columns = df.columns.str.lower()
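# Strip literal parentheses from the column names (regex=False makes the intent explicit)
df.columns = df.columns.str.replace("(", "", regex=False)
df.columns = df.columns.str.replace(")", "", regex=False)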
df.columns


if 'yearsenergystarcertified' in df.columns and 'outlier' in df.columns:
    df.drop(['yearsenergystarcertified', 'outlier'], axis=1, inplace=True)

if 'compliancestatus' in df.columns:
    # Keep only the rows whose compliance status is 'Compliant'
    df = df[df["compliancestatus"] == 'Compliant']
    # The column is constant after filtering, so drop it
    df = df.drop(columns=['compliancestatus'])


# Filter the DataFrame to keep only rows where siteenergyusekbtu is not null
df = df[df["siteenergyusekbtu"].notnull()]
# Replace the placeholder string "NA" with np.nan
# (pd.read_csv already parses "NA" and "NULL" as missing by default, so this is a safeguard)
df = df.replace("NA", np.nan)

# Add binary flag columns (1 or 0) for each energy source the building reports using
df['is_using_steamusekWh'] = np.where(df['steamusekbtu'] > 0, 1, 0)
df['is_using_electricitykWh'] = np.where(df['electricitykbtu'] > 0, 1, 0)
df['is_using_naturalgaskWh'] = np.where(df['naturalgaskbtu'] > 0, 1, 0)

# filter column
selected_columns = ["siteenergyusekbtu", 'totalghgemissions','yearbuilt','is_using_electricitykWh', 'is_using_naturalgaskWh', 'is_using_steamusekWh', 'largestpropertyusetypegfa', 'numberofbuildings', 'numberoffloors', 'propertygfabuildings','buildingtype', 'primarypropertytype']

# Filter the DataFrame to select only the desired columns
df = df[selected_columns]



# save result as csv
df.to_csv("data/dataset_2020.csv", sep=",", index=False)
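
To see what the column cleanup and the energy-source flags do, here is a minimal sketch on a three-row toy frame. The headers are only styled like raw CSV headers and the values are made up for illustration; the flag column names are shortened, hypothetical variants of the ones used above.

```python
import numpy as np
import pandas as pd

# Toy data: made-up values, headers styled like raw CSV headers
toy = pd.DataFrame({
    "SiteEnergyUse(kBtu)": [1200.0, 0.0, 530.5],
    "SteamUse(kBtu)": [0.0, 0.0, 42.0],
    "NaturalGas(kBtu)": [300.0, np.nan, 0.0],
})

# Lowercase the headers and strip literal parentheses
toy.columns = (
    toy.columns.str.lower()
    .str.replace("(", "", regex=False)
    .str.replace(")", "", regex=False)
)

# 1 if the building reports any use of the energy source, else 0.
# A missing value compares as False, so NaN becomes 0.
toy["is_using_steam"] = np.where(toy["steamusekbtu"] > 0, 1, 0)
toy["is_using_naturalgas"] = np.where(toy["naturalgaskbtu"] > 0, 1, 0)

print(toy.columns.tolist())
print(toy[["is_using_steam", "is_using_naturalgas"]])
```

Note that a missing reading and a reading of zero both map to 0 with this approach, which may or may not be what you want for your data.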

    

## Load our saved model as a Python Function

Although we can load our model back as a native scikit-learn format with `mlflow.sklearn.load_model()`, below we are loading the model as a generic Python Function, which is how this model would be loaded for online model serving. We can still use the `pyfunc` representation for batch use cases, though, as is shown below.

### Set the MLflow Tracking URI 
In this step, we point the MLflow client at the local MLflow Tracking Server that we started, for example with `mlflow server --host 127.0.0.1 --port 8090`.

If you chose to define a different port when starting the server, apply that port to the following cell. 

In [None]:
mlflow.set_tracking_uri(uri="http://127.0.0.1:8090")
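
If nothing is listening on that host and port, later logging calls fail with connection errors. A minimal sketch of a pre-flight check using only the standard library follows; the helper name `tracking_server_reachable` is hypothetical, and the check only confirms that a TCP port is open, not that the service behind it is MLflow.

```python
import socket
from urllib.parse import urlparse


def tracking_server_reachable(uri: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URI's host and port succeeds."""
    parsed = urlparse(uri)
    host = parsed.hostname or "127.0.0.1"
    # Fall back to the scheme's default port when the URI does not name one
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts
        return False


tracking_uri = "http://127.0.0.1:8090"  # same URI as in the cell above
print(tracking_uri, "reachable:", tracking_server_reachable(tracking_uri))
```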

## Load training data and train a simple model

For this notebook, we use the building energy benchmarking data loaded from `data/dataset_complete.csv` rather than a dataset bundled with scikit-learn. After splitting the data, we train a gradient boosting regressor, wrapped in a `MultiOutputRegressor` so that it predicts both `siteenergyusekbtu` and `totalghgemissions`, and calculate R2 and MAE on our holdout test data.

Note that the only MLflow-related activity in this portion is that we use a `model_params` dictionary to supply our model's hyperparameters; this makes logging these settings easier when we're ready to log our model and its associated metadata.
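
The keys of the hyperparameter dictionary used in the next cell follow scikit-learn's double-underscore convention for nested estimators: pipeline step name, then the wrapper's `estimator` attribute, then the parameter itself. A minimal sketch of how such keys are resolved, with a reduced pipeline and arbitrary values chosen only for illustration:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

demo_pipeline = Pipeline(steps=[
    ("scaler", RobustScaler()),
    ("model", MultiOutputRegressor(GradientBoostingRegressor())),
])

# Key layout: "<pipeline step>__<wrapper attribute>__<estimator parameter>"
demo_params = {
    "model__estimator__n_estimators": 50,
    "model__estimator__max_depth": 3,
}
demo_pipeline.set_params(**demo_params)

# The values end up on the GradientBoostingRegressor inside the wrapper
inner = demo_pipeline.named_steps["model"].estimator
print(inner.n_estimators, inner.max_depth)
```

Because the same flat dictionary is accepted by both `set_params()` and `mlflow.log_params()`, the logged hyperparameters match what the pipeline was actually configured with.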

In [None]:
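# Load your dataset
df = pd.read_csv('data/dataset_complete.csv', sep=',')
df.columns = df.columns.str.lower()
print(df.columns)

# Keep only the targets and the features the model uses, then drop incomplete rows.
# Calling dropna() across every column can leave zero rows when some columns are
# mostly empty, which is what the ValueError recorded in this cell's output reports.
target_columns = ["siteenergyusekbtu", 'totalghgemissions']
feature_columns = ['yearbuilt', 'largestpropertyusetypegfa', 'numberofbuildings',
                   'numberoffloors', 'propertygfabuildings', 'buildingtype', 'primarypropertytype']
df = df[target_columns + feature_columns].dropna(axis=0)

# Separate your target variables from features
X = df.drop(columns=target_columns)
Y = df[target_columns]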

# Define column transformers for numeric and categorical features
numeric_features = ['yearbuilt', 'largestpropertyusetypegfa', 'numberofbuildings', 'numberoffloors', 'propertygfabuildings']
categorical_features = ['buildingtype', 'primarypropertytype']

numeric_transformer = Pipeline(steps=[
    ('scaler', RobustScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Combine preprocessing and modeling into a single pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', MultiOutputRegressor(GradientBoostingRegressor()))
])

# Split your dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Define your model hyperparameters
model_params = {
    'model__estimator__loss': 'huber',
    'model__estimator__n_estimators': 500,
    'model__estimator__max_depth': 5,
    'model__estimator__learning_rate': 0.01
}

# Train your model
pipeline.set_params(**model_params)
pipeline.fit(X_train, Y_train)  # Fit the model

# Predict on the test set
Y_pred = pipeline.predict(X_test)

# Calculate evaluation metrics
r2_score_test = r2_score(Y_test, Y_pred)
mae_test_score = mean_absolute_error(Y_test, Y_pred)


Index(['unnamed: 0', 'osebuildingid', 'datayear', 'buildingtype',
       'primarypropertytype', 'propertyname', 'address', 'city', 'state',
       'zipcode', 'taxparcelidentificationnumber', 'councildistrictcode',
       'neighborhood', 'latitude', 'longitude', 'yearbuilt',
       'numberofbuildings', 'numberoffloors', 'propertygfatotal',
       'propertygfaparking', 'propertygfabuildings',
       'listofallpropertyusetypes', 'largestpropertyusetype',
       'largestpropertyusetypegfa', 'secondlargestpropertyusetype',
       'secondlargestpropertyusetypegfa', 'thirdlargestpropertyusetype',
       'thirdlargestpropertyusetypegfa', 'energystarscore', 'siteeuikbtu/sf',
       'siteeuiwnkbtu/sf', 'sourceeuikbtu/sf', 'sourceeuiwnkbtu/sf',
       'siteenergyusekbtu', 'siteenergyusewnkbtu', 'steamusekbtu',
       'electricitykwh', 'electricitykbtu', 'naturalgastherms',
       'naturalgaskbtu', 'defaultdata', 'comments', 'totalghgemissions',
       'ghgemissionsintensity'],
      dtype='object')

ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
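
The `ValueError` above reports `n_samples=0`, meaning no rows were left when `train_test_split` was called. One plausible cause, assumed here rather than confirmed against the data, is that `dropna(axis=0)` over the full set of columns removes every row because some columns are empty for all buildings. A minimal sketch on a toy frame with made-up values shows how to diagnose this and how the `subset` argument limits the check to the columns that matter:

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): 'comments' is empty for every row
toy = pd.DataFrame({
    "yearbuilt": [1990, 2005, 1978],
    "siteenergyusekbtu": [1.2e6, 3.4e6, np.nan],
    "comments": [np.nan, np.nan, np.nan],
})

# Dropping on all columns removes every row
print(len(toy.dropna(axis=0)))

# Count missing values per column to find the culprit
print(toy.isna().sum().sort_values(ascending=False))

# Dropping only on the columns the model needs keeps the usable rows
needed = ["yearbuilt", "siteenergyusekbtu"]
kept = toy.dropna(axis=0, subset=needed)
print(len(kept))
```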

## Define an MLflow Experiment

To keep the runs of a particular project or idea together, we can define an Experiment that groups each iteration (run).
Giving it a unique name that is relevant to what we're working on helps with organization and makes our runs easier to find later on.

In [None]:
# Log the experiment in MLflow
mlflow.set_experiment("Seatle_co2_pred_maud_tarik")


<Experiment: artifact_location='mlflow-artifacts:/5', creation_time=1702373653964, experiment_id='5', last_update_time=1702373653964, lifecycle_stage='active', name='Seatle_co2_pred_maud_tarik', tags={}>

## Log the model, hyperparameters, and loss metrics to MLflow.

To record our model, the hyperparameters used when fitting it, and the metrics from validating the fitted model on holdout data, we initiate a run context, as shown below. Within the scope of that context, any fluent API that we call (such as `mlflow.log_params()` or `mlflow.sklearn.log_model()`) is associated with, and logged to, the same run.

In [None]:
with mlflow.start_run():
    # Log model hyperparameters
    mlflow.log_params(model_params)

    # Log evaluation metrics
    mlflow.log_metric("R2_score_test", r2_score_test)
    mlflow.log_metric("MAE_test_score", mae_test_score)

    # Set tags for additional information
    mlflow.set_tag("Training Info", "GradientBoostingRegressor for your use case")

    # Infer the model signature from matching inputs and outputs
    # (Y_pred was computed from X_test, so pair it with X_test)
    signature = infer_signature(X_test, Y_pred)

    # Log the model
    mlflow.sklearn.log_model(
        sk_model=pipeline,
        artifact_path="data/best_model_GradientBoostingRegressor.pkl",
        signature=signature,
        input_example=X_train.head(5),  # a few rows are enough for an input example
        registered_model_name="data/model_tracking",
    )

  inputs = _infer_schema(model_input) if model_input is not None else None


MlflowException: API request to http://127.0.0.1:8090/api/2.0/mlflow-artifacts/artifacts/5/653578e0d1554d7689fa919a05463d88/artifacts/data/best_model_GradientBoostingRegressor.pkl/model.pkl failed with exception HTTPConnectionPool(host='127.0.0.1', port=8090): Max retries exceeded with url: /api/2.0/mlflow-artifacts/artifacts/5/653578e0d1554d7689fa919a05463d88/artifacts/data/best_model_GradientBoostingRegressor.pkl/model.pkl (Caused by ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')))
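
The run above logs a single R2 and a single MAE, each averaged over both targets. Because `siteenergyusekbtu` and `totalghgemissions` are on very different scales, the averaged MAE is dominated by the energy target. A minimal sketch, using made-up numbers, of computing the metrics per target with `multioutput="raw_values"`; the metric names in the dictionary are hypothetical:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values: column 0 ~ energy use (kBtu), column 1 ~ GHG emissions
y_true = np.array([[1.0e6, 10.0], [2.0e6, 20.0], [4.0e6, 40.0]])
y_pred = np.array([[1.1e6, 12.0], [1.9e6, 18.0], [4.2e6, 41.0]])

targets = ["siteenergyusekbtu", "totalghgemissions"]
mae_per_target = mean_absolute_error(y_true, y_pred, multioutput="raw_values")
r2_per_target = r2_score(y_true, y_pred, multioutput="raw_values")

# One entry per target and metric, in a dict that mlflow.log_metrics() accepts
metrics = {}
for name, mae, r2 in zip(targets, mae_per_target, r2_per_target):
    metrics[f"MAE_{name}"] = float(mae)
    metrics[f"R2_{name}"] = float(r2)

print(metrics)
```

Inside a run context, the resulting dictionary could be logged in one call with `mlflow.log_metrics(metrics)`.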

In [None]:
# Load the fitted pipeline from a local pickle file
# (assumes it was saved to this path beforehand, e.g. with joblib.dump)
loaded_model = joblib.load("data/best_model_GradientBoostingRegressor.pkl")

## Use our model to predict energy use and GHG emissions on a Pandas DataFrame

In [None]:
predictions = loaded_model.predict(X_test)

# Display the results
print("R2 score on test data:", r2_score_test)
print("MAE on test data:", mae_test_score)
print("Predictions:", predictions)

R2 score on test data: 0.6937168820537403
MAE on test data: 1873212.4523843962
Predictions: [[6.76617689e+06 1.41303500e+02]
 [2.64620362e+06 2.21110077e+01]
 [2.30553299e+08 8.78657086e+03]
 [1.70886790e+06 5.14269277e+01]
 [3.31067735e+06 7.41053507e+01]
 [7.84384320e+06 1.27132367e+02]
 [7.64786433e+05 3.32197892e+00]
 [1.71311628e+06 4.29914112e+01]
 [5.17426594e+06 9.25306120e+01]
 [1.10120487e+06 3.88854515e+01]
 [8.94055037e+05 7.55164899e+00]
 [2.95780814e+06 2.38872624e+01]
 [7.64786433e+05 3.97786058e+00]
 [1.27059409e+07 3.86971600e+02]
 [1.00480775e+07 1.89785586e+02]
 [3.89209907e+06 2.79438254e+01]
 [3.16562292e+06 6.93879788e+01]
 [4.95628097e+06 8.89008129e+01]
 [3.97206028e+06 2.86078854e+01]
 [4.85625566e+06 3.64226429e+01]
 [2.02296276e+07 2.94176295e+02]
 [1.13526791e+07 3.18264148e+02]
 [4.48482466e+06 3.82318718e+01]
 [3.15595992e+06 1.88816753e+01]
 [1.07343215e+07 3.08906522e+02]
 [4.55572917e+06 7.29508351e+01]
 [6.75445481e+07 6.29920295e+03]
 [4.29893405e+07 