<a href="https://colab.research.google.com/github/eyeonkarim/SDS2022-MLFlow/blob/main/LC3_MLFlow_Models_Regression_SwissHousing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center><a target="_blank" href="https://www.sds2022.ch/"><img src="https://drive.google.com/uc?id=1S7k7kTXs9qIylw3C7LA9rHkLycjlY8te" width="500" style="background:none; border:none; box-shadow:none;" /></a> </center>

<center><a target="_blank" href="http://www.sit.academy"><img src="https://drive.google.com/uc?id=1x9_jQgLhozCSWDSaOdVxKmxOEAe_OLgV" width="250" style="background:none; border:none; box-shadow:none;" /></a> </center>

_____

<center> <h1> Live Coding  </h1> </center>

<p style="margin-bottom:1cm;"></p>

_____

<center>SIT Academy, 2022</center>




# MLFlow Workshop - Sequence 3 - Models API

MLflow Models offer a convention for packaging machine learning models in multiple `flavors`, and a variety of tools to help you deploy them. 

Each Model is saved as a directory containing arbitrary files and a descriptor file that lists several “flavors” the model can be used in. 

For example, a TensorFlow model can be loaded as a `TensorFlow DAG`, or as a `Python function` to apply to input data. MLflow provides tools to deploy many common model types to diverse platforms: for example, any model supporting the `Python function` flavor can be deployed to a Docker-based REST server, to cloud platforms such as Azure ML and AWS SageMaker, and as a user-defined function in Apache Spark for batch and streaming inference. 

If you output `MLflow Models` using the `Tracking API`, MLflow also automatically remembers which Project and run they came from.
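
As a rough illustration of the `Python function` flavor, the sketch below builds a `runs:/` model URI and loads the model through `mlflow.pyfunc`. The helper names `model_uri_for_run` and `predict_with_pyfunc` are hypothetical, and the default artifact path `"model"` is an assumption that matches the name passed to `log_model` later in this notebook.

```python
def model_uri_for_run(run_id, artifact_path="model"):
    """Build a runs:/ URI pointing at a model logged under a tracking run."""
    return f"runs:/{run_id}/{artifact_path}"


def predict_with_pyfunc(run_id, features):
    """Load a logged model through its generic pyfunc flavor and score `features`.

    `features` is expected to be a pandas DataFrame with the training columns.
    """
    # Imported lazily so the URI helper above works without MLflow installed.
    import mlflow.pyfunc

    model = mlflow.pyfunc.load_model(model_uri_for_run(run_id))
    return model.predict(features)
```

Because the model is loaded through the generic flavor, the calling code does not need to know that the underlying model is a scikit-learn pipeline.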



# Install Dependencies

In [None]:
!pip install mlflow --quiet


# Load Dependencies

In [None]:
import os
import warnings
import sys

import pandas as pd
import numpy as np

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

In [None]:
import mlflow
import mlflow.sklearn

In [None]:
import logging
logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

# Utilities for Data and Metrics

In [None]:
def prepare_data():
    #id = 1eNTyJc4jXJMkLPXW0eY6LL7_P9YN1GWO
    warnings.filterwarnings("ignore")
    np.random.seed(42)

    # Read the home price csv file from the URL
    orig_url = "https://drive.google.com/file/d/1eNTyJc4jXJMkLPXW0eY6LL7_P9YN1GWO/view"
    file_id = orig_url.split('/')[-2]
    data_path='https://drive.google.com/uc?export=download&id=' + file_id
    
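    try:
        data = pd.read_csv(data_path)
    except Exception as e:
        logger.exception(
            "Unable to download training & test CSV, check your internet connection. Error: %s", e)
        # Re-raise: without the data the rest of the function cannot run
        raise

    # Prices are stored as strings with thousands separators (e.g. "1,235,000");
    # strip the commas and convert them to numbers
    data["price"] = data["price"].str.replace(',', '')
    data["price"] = pd.to_numeric(data["price"])
    # Drop the leftover index column and the zip code
    # (keyword form: the positional axis argument was removed in pandas 2.0)
    data = data.drop(columns=["Unnamed: 0", "zip"])
    data = data.dropna()

    y = data["price"]
    X = data.drop(columns="price")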
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    return X_train, X_test, y_train, y_test


def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2
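
To make the three metrics concrete, here is a small plain-Python cross-check of the formulas that `eval_metrics` delegates to scikit-learn. The function name `manual_metrics` and the toy numbers are made up for illustration; they are not taken from the housing data.

```python
import math


def manual_metrics(actual, pred):
    """Compute RMSE, MAE and R2 by hand for two equal-length sequences."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, pred)]

    rmse = math.sqrt(sum(e * e for e in errors) / n)  # root of the mean squared error
    mae = sum(abs(e) for e in errors) / n             # mean absolute error

    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2


rmse, mae, r2 = manual_metrics([100, 200, 300], [110, 190, 330])
print(round(rmse, 3), round(mae, 3), round(r2, 3))  # 19.149 16.667 0.945
```

RMSE and MAE are in the same unit as the target, so for this dataset they can be read directly as errors in the price.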

# Load Dataset

In [None]:
X_train, X_test, y_train, y_test = prepare_data()

data = {
    'X_train': X_train,
    'X_test': X_test,
    'y_train': y_train,
    'y_test': y_test
}

data['X_train'].head()

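|      | type | room_num | floor | area_m2 | floors_num | year_built | last_refurbishment | city | lat | lon | canton |
|------|------|----------|-------|---------|------------|------------|--------------------|------|-----|-----|--------|
| 276  | Apartment | 4.5 | 2 | 145.0 | 3.0 | 2020.0 | 2020.0 | Mendrisio | 45.8862 | 8.988967 | Ticino |
| 1171 | Row house | 4.5 | 4 | 140.0 | 4.0 | 1984.0 | 2017.0 | Agno | 46.0005 | 8.9028 | Ticino |
| 1894 | Single house | 7.5 | GF | 143.0 | 1.0 | 1971.0 | 1971.0 | St-Maurice | 46.1988 | 6.99565 | Canton du Valais |
| 117  | Apartment | 5.5 | 1 | 174.0 | 1.0 | 2014.0 | 2014.0 | Cheseaux-sur-Lausanne | 46.5822 | 6.5958 | Canton de Vaud |
| 2028 | Villa | 8.5 | GF | 400.0 | 3.0 | 1972.0 | 2005.0 | Aigle | 46.3147 | 6.9716 | Canton de Vaud |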


In [None]:
data['y_train'].head()

276     1060000
1171     900000
1894     870000
117     1450000
2028    2150000
Name: price, dtype: int64

# Utilities for Modeling and Tracking Experiments

In [None]:
def train_random_forest(data, n_trees=100, max_depth=None):

    # Train and track experiment   
    with mlflow.start_run():

        categorical_features = ['type', 'floor', 'city', 'canton']
        continuous_features = ['room_num', 'area_m2', 'floors_num', 'year_built', 'last_refurbishment', 'lat', 'lon']

        # Scale numeric columns; one-hot encode categorical columns
        numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])

        # handle_unknown="ignore" avoids errors on categories not seen during training
        categorical_transformer = Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))])

        preprocessor = ColumnTransformer(transformers=[("num", numeric_transformer, continuous_features),
                    ("cat", categorical_transformer, categorical_features)])
        
        # Execute RF
        rf = RandomForestRegressor(n_estimators=n_trees, max_depth=max_depth, random_state=42)
        pipeline_rf = Pipeline([("col_transformer", preprocessor), 
                            ("estimator", rf)])
        pipeline_rf.fit(data['X_train'], data['y_train'])

        # Evaluate metrics on the held-out test set
        predicted_prices = pipeline_rf.predict(data['X_test'])
        (rmse, mae, r2) = eval_metrics(data['y_test'], predicted_prices)

        # Print out metrics
        print("Random Forest model (n_estimators={}, max_depth={}):".format(n_trees, max_depth))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        # Log parameter, metrics, and model to MLflow
        mlflow.log_param('Model', 'Random Forest')  
        mlflow.log_param("n_estimators", n_trees)
        mlflow.log_param("max_depth", max_depth)
        
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        mlflow.sklearn.log_model(pipeline_rf, "model")

## Experiments

In [None]:
train_random_forest(data)

Random Forest model (n_estimators=100, max_depth=None):
  RMSE: 946901.6327767096
  MAE: 427173.6942019544
  R2: 0.7379139162588411


In [None]:
#train_random_forest(data, n_trees=500, max_depth=None)

In [None]:
#train_random_forest(data, n_trees=1000, max_depth=None)

In [None]:
#train_random_forest(data, n_trees=500, max_depth=5)
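
The commented-out calls above vary `n_trees` and `max_depth` by hand. One way to sketch the same idea as a loop is shown below; `run_grid` is a hypothetical helper, and `train_fn` stands for any function that accepts the same keyword arguments as `train_random_forest`.

```python
from itertools import product


def run_grid(train_fn, data, n_trees_options, max_depth_options):
    """Call train_fn once per (n_trees, max_depth) combination."""
    combos = list(product(n_trees_options, max_depth_options))
    for n_trees, max_depth in combos:
        train_fn(data, n_trees=n_trees, max_depth=max_depth)
    return combos


# Stand-in for train_random_forest so the sketch runs on its own.
def fake_train(data, n_trees=100, max_depth=None):
    print(f"training with n_trees={n_trees}, max_depth={max_depth}")


combos = run_grid(fake_train, data=None,
                  n_trees_options=[500, 1000], max_depth_options=[None, 5])
```

In the notebook, `run_grid(train_random_forest, data, [500, 1000], [None, 5])` would start one MLflow run per combination, since `train_random_forest` opens its own run.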

These runs write their files to a folder named `mlruns`, which MLflow uses as the data source for the UI.
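
To see what was written, one option is to walk the `mlruns` folder and look for `MLmodel` descriptor files. The sketch below assumes the default local layout `mlruns/<experiment_id>/<run_id>/artifacts/<artifact_path>/MLmodel`, which matches the path used in the serving cell of this notebook; `find_logged_models` is a hypothetical helper.

```python
from pathlib import Path


def find_logged_models(root="mlruns"):
    """Return the directories under `root` that contain an MLmodel descriptor file."""
    root = Path(root)
    if not root.is_dir():
        # Nothing has been logged yet (or the folder lives elsewhere)
        return []
    return sorted(p.parent for p in root.rglob("MLmodel"))


for model_dir in find_logged_models():
    print(model_dir)
```

Each printed directory is a candidate value for the `-m` option of `mlflow models serve`.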

## MLflow UI

Run `mlflow ui` in a terminal
<br>and open http://localhost:5000 when running locally from Jupyter.
<br>When running in Colab, use an ngrok tunnel instead.

In [None]:
!pip install pyngrok --quiet

In [None]:
from pyngrok import ngrok
from getpass import getpass

# Terminate any open tunnels
ngrok.kill()

# Set the authtoken
# Get your authtoken from https://dashboard.ngrok.com/auth
NGROK_AUTH_TOKEN = getpass('Enter the ngrok authtoken: ')
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# Open an HTTPS tunnel to http://localhost:5000
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Model UI:", ngrok_tunnel.public_url)

MLflow Model UI: https://ad1a-35-185-96-26.ngrok.io


## Serve the Model
Serve a model saved with MLflow by launching a web server on the specified host and port. The command supports models with the `python_function` or `crate` (R Function) flavor. For the input data formats accepted by the web server, see the documentation: https://www.mlflow.org/docs/latest/models.html#built-in-deployment-tools.

You can make requests to `POST /invocations` in pandas split- or record-oriented formats.

**Example**:

```bash
$ mlflow models serve -m mlruns/my-run-id/model-path 

$ curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{
    "columns": ["a", "b", "c"],
    "data": [[1, 2, 3], [4, 5, 6]]
}' 
```

Further information can be found in the [`mlflow models serve` CLI documentation](https://www.mlflow.org/docs/latest/cli.html#mlflow-models-serve).
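
The same request can be sketched in Python with only the standard library. `build_split_payload` and `score` are hypothetical helper names, and the URL is a placeholder. The payload mirrors the split-oriented body of the `curl` example above; newer MLflow releases (2.0 and later) expect that body wrapped under a `dataframe_split` key instead, so check the documentation for your installed version.

```python
import json
from urllib import request


def build_split_payload(columns, rows):
    """Return a JSON string in pandas split orientation."""
    return json.dumps({"columns": list(columns), "data": [list(r) for r in rows]})


def score(url, payload):
    """POST the payload to a running `mlflow models serve` endpoint."""
    req = request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


payload = build_split_payload(["a", "b", "c"], [[1, 2, 3], [4, 5, 6]])
print(payload)

# With a server running locally (not executed here):
# predictions = score("http://127.0.0.1:5000/invocations", payload)
```

For the housing model, the columns would be the feature names of `X_train` and each row one property to price.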

In [None]:
# MLFlow Tracking in case we need to kill the process 
# get_ipython().system_raw("mlflow ui --port 5000 &")
# get_ipython().system_raw("killall mlflow")
#!mlflow ui --port 5000

In [None]:
# MLflow Models
# Serve the model from a specific run; the last path component is the name passed to log_model()
!mlflow models serve --env-manager=local -m mlruns/0/812a7a4a42824f6bb0c88be1ec4066e4/artifacts/model

2022/06/19 15:34:04 INFO mlflow.models.cli: Selected backend for flavor 'python_function'
2022/06/19 15:34:04 INFO mlflow.pyfunc.backend: === Running command 'exec gunicorn --timeout=60 -b 127.0.0.1:5000 -w 1 ${GUNICORN_CMD_ARGS} -- mlflow.pyfunc.scoring_server.wsgi:app'
[2022-06-19 15:34:05 +0000] [273] [INFO] Starting gunicorn 20.1.0
[2022-06-19 15:34:05 +0000] [273] [INFO] Listening at: http://127.0.0.1:5000 (273)
[2022-06-19 15:34:05 +0000] [273] [INFO] Using worker: sync
[2022-06-19 15:34:05 +0000] [276] [INFO] Booting worker with pid: 276
[2022-06-19 22:01:46 +0000] [273] [INFO] Handling signal: int

Aborted!
[2022-06-19 22:01:46 +0000] [276] [INFO] Worker exiting (pid: 276)
[2022-06-19 22:01:46 +0000] [273] [INFO] Shutting down: Master


In [None]:
# To predict, send a request from a terminal or from another Colab notebook.
# Replace <columns> and <data> with the column names and the rows to score.
#!curl https://6c74-35-196-27-68.ngrok.io/invocations -H 'Content-Type: application/json' -d '{"columns": <columns>, "data": <data>}'

# Assignments

1. Add more runs and try to serve different models.
2. Change the test data dictionary to predict housing prices (use a terminal or another Colab notebook).
3. Convert your own ML model code from work into MLflow-compatible code, then use the MLflow API to track your experiments and deploy your model.
4. Explore the MLflow [GitHub examples](https://github.com/amesar/mlflow-examples).