# ADAC Lab 3 - MLFlow

Your tasks are to:
1. run an MLflow server
2. build and run a feature engineering pipeline for Microsoft Security Incident Prediction using Apache Spark
3. train an ML model and register it in MLflow

![](https://raw.githubusercontent.com/aaubs/ds-master/main/data/Images/mlflow.jpg)

## 1. Run MLFlow server and expose it using ngrok

We mount the Google Drive filesystem in Colab and change the current working directory to a specific directory inside the mounted drive.



In [None]:
# mount to Google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
# first create the folder path learn/mlflow in your Google Drive
os.chdir('/content/drive/My Drive/learn/mlflow')
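
`os.chdir` fails if the target folder does not exist yet. As a minimal sketch, the folder could be created from the notebook instead of by hand; `enter_workdir` is a hypothetical helper name introduced here for illustration, not something from the lab material.

```python
import os

def enter_workdir(path: str) -> str:
    """Create `path` (and any missing parents) if needed, then switch to it."""
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    os.chdir(path)
    return os.getcwd()

# In Colab, after drive.mount('/content/drive'):
# enter_workdir('/content/drive/My Drive/learn/mlflow')
```

Because of `exist_ok=True`, the cell can be re-run safely.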

The next cell installs the MLflow Python package if it is not already installed and then imports it. It also imports other packages used later in the notebook, such as os and pandas. Finally, it prints the installed MLflow version.



In [None]:
## Step 1 - Installing MLflow and checking the version

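# install mlflow if it is missing, then import it
# (importlib.util must be imported explicitly; `import importlib` alone does not guarantee it)
import importlib.util

if importlib.util.find_spec('mlflow') is None:
  !pip install mlflow --quiet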


import os
import pandas as pd

import mlflow

print(mlflow.__version__)


We create a directory called artefacts_mlflow if it doesn't already exist, and then create an MLflow experiment named "Iris Classification". The following cell retrieves the ID of the newly created experiment.



In [None]:
## Step 1 - Setting mlflow artefacts
artefacts_temp_dir = 'artefacts_mlflow'
if not os.path.exists(artefacts_temp_dir):
    os.makedirs(artefacts_temp_dir)

mlflow.create_experiment('Iris Classification')

'421999058781628427'

In [None]:
# Get the experiment ID for the experiment with the specified name
experiment_id = mlflow.get_experiment_by_name('Iris Classification').experiment_id
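
`mlflow.create_experiment` raises an error when an experiment with the same name already exists, so re-running the cell above fails. A minimal sketch of a get-or-create pattern follows. `get_or_create_experiment` is a hypothetical helper name, and the `tracking` argument is assumed to be the imported `mlflow` module (it only needs `get_experiment_by_name` and `create_experiment`).

```python
def get_or_create_experiment(tracking, name: str) -> str:
    """Return the ID of experiment `name`, creating it only if it is missing.

    `tracking` is assumed to be the imported `mlflow` module, or any object
    exposing get_experiment_by_name() and create_experiment().
    """
    existing = tracking.get_experiment_by_name(name)  # None when not found
    if existing is not None:
        return existing.experiment_id
    return tracking.create_experiment(name)  # returns the new experiment ID

# Usage in the notebook (assumes `import mlflow` has already run):
# experiment_id = get_or_create_experiment(mlflow, 'Iris Classification')
```

`mlflow.set_experiment`, used later in this notebook, is another option: it activates the named experiment and creates it if needed.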


In [None]:
## Step 2 - Starting an MLflow run and logging metrics

# Start an MLflow run
with mlflow.start_run(run_name="my-run", nested=True, experiment_id=experiment_id):
    # Log some metrics
    mlflow.log_metric("accuracy", 0.85)
    mlflow.log_metric("precision", 0.75)

The cell above starts a new MLflow run named "my-run" in the experiment created earlier and logs two metrics, "accuracy" (0.85) and "precision" (0.75). The MLflow tracking UI itself is started in the background by the cell below, using a system command.

The cell below also installs and imports the pyngrok package and prompts you for your ngrok authentication token. It then sets the token in pyngrok, opens an HTTP tunnel to the MLflow tracking UI running on port 5000, and prints the tunnel's public URL, so that the UI can be reached from outside Colab.





![](https://hackernoon.com/hn-images/1*OBNbvLxAESaQTEqWdqBCGw.png)

In [None]:
# run tracking UI in the background
get_ipython().system_raw("mlflow ui --port 5000 &")
## Step 3 - Installing pyngrok for remote tunnel access using ngrok.com
!pip install pyngrok --quiet
from pyngrok import ngrok
from getpass import getpass
# Terminate open tunnels if any exist
ngrok.kill()
## Step 4 - Log in at ngrok.com and get your authtoken from https://dashboard.ngrok.com
# Enter your authtoken when prompted
NGROK_AUTH_TOKEN = getpass('Enter the ngrok authtoken: ')
ngrok.set_auth_token(NGROK_AUTH_TOKEN)
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)

Enter the ngrok authtoken: ··········
MLflow Tracking UI: https://316f-34-19-8-184.ngrok-free.app
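
Since `mlflow ui` is launched in the background, it may not be listening yet when the tunnel is opened. Below is a minimal sketch of a readiness check, assuming the UI listens on localhost port 5000 as in the cell above. `wait_for_port` is a hypothetical helper, not part of MLflow or pyngrok.

```python
import socket
import time

def wait_for_port(port: int, host: str = "127.0.0.1", timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True   # something is accepting connections
        except OSError:
            time.sleep(0.5)   # not ready yet; retry shortly
    return False

# Possible use before ngrok.connect(...):
# if not wait_for_port(5000):
#     print("MLflow UI did not start in time")
```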


## 2. EXAMPLE - Feature engineering and model training using Apache Spark, with registration in MLflow

### Install packages

In [None]:
!pip3.8 install -U  google.cloud "pandas<2.0.0" google-cloud-storage==2.9.0 mlflow==2.3.1

### Set env variables

In [None]:
%env IAP_CLIENT_ID="389410459067-mltiuc7631od8mhp9aokhb03qdlj81qp.apps.googleusercontent.com"

### Get an OIDC token to authenticate to the MLflow instance behind Identity-Aware Proxy (IAP)

In [None]:
import os
import subprocess
mlflow_token=subprocess.getoutput("""curl -s -X POST -H "content-type: application/json" -H "Authorization: Bearer $(gcloud auth print-access-token)" -d "{\"audience\": \"${IAP_CLIENT_ID}\", \"includeEmail\": true }" "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/$(gcloud auth list --filter=status:ACTIVE --format='value(account)'):generateIdToken"  | jq -r '.token'""")
os.environ['MLFLOW_TRACKING_TOKEN'] = mlflow_token
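
The one-line `curl` command above is dense, so here is a sketch of the same HTTP request built with the Python standard library. The endpoint and JSON body are taken from that command. The function names are hypothetical, and the service account e-mail and access token are assumed to have been obtained separately (for example with the `gcloud` commands used above).

```python
import json
import urllib.request

# Endpoint used by the curl command above; {account} is the active service account.
IAM_ENDPOINT = (
    "https://iamcredentials.googleapis.com/v1/projects/-/"
    "serviceAccounts/{account}:generateIdToken"
)

def build_id_token_request(account: str, audience: str, access_token: str):
    """Build the POST request that asks for an OIDC ID token for `audience`."""
    body = json.dumps({"audience": audience, "includeEmail": True}).encode("utf-8")
    return urllib.request.Request(
        IAM_ENDPOINT.format(account=account),
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {access_token}",
        },
        method="POST",
    )

def fetch_id_token(request) -> str:
    """Send the request and return the `token` field of the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["token"]
```

The returned token would then be stored in `MLFLOW_TRACKING_TOKEN`, as in the cell above.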

In [None]:
%env MLFLOW_TRACKING_URI=https://mlflow-dot-tbd-2023l-mlops.ew.r.appspot.com/

In [None]:
%env PYSPARK_PYTHON=/usr/bin/python3.8
%env PYSPARK_DRIVER_PYTHON=/usr/bin/python3.8

### Please specify your student id

In [None]:
%env STUDENT_ID=2003

### Test connectivity with MLflow tracking server

In [None]:
%%bash 
mlflow experiments search

### Prepare training data

In [None]:
%%bash
gsutil mb -l europe-west1 gs://tbd-2023l-${STUDENT_ID}-data

In [None]:
%%bash
curl -L https://github.com/datascienceverse/stack-overflow-dataset-2022/raw/master/survey_results_public.csv | gsutil cp - gs://tbd-2023l-${STUDENT_ID}-data/survey_results_public.csv

In [None]:
%%bash
gsutil du -h gs://tbd-2023l-${STUDENT_ID}-data/survey_results_public.csv

### GCS connector

In [None]:
%%bash
wget https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop3-2.2.9/gcs-connector-hadoop3-2.2.9-shaded.jar

### Spark session

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master("local[1]") \
    .config("spark.driver.memory", "4g") \
    .config("spark.jars", "gcs-connector-hadoop3-2.2.9-shaded.jar") \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \
    .config("spark.hadoop.fs.gs.auth.service.account.enable", "false") \
    .config("spark.hadoop.fs.gs.auth.null.enable", "true") \
    .getOrCreate()

In [None]:
import os
db_name = "tbd"
student_id = os.environ['STUDENT_ID']
gs_path = "gs://gdl-workshops-bd-public/survey_results_public.csv"
spark.sql(f'DROP DATABASE IF EXISTS {db_name} CASCADE')
spark.sql(f'CREATE DATABASE {db_name}')
spark.sql(f'USE {db_name}')
table_name = "survey_2020" 

spark.sql(f'DROP TABLE IF EXISTS {table_name}')

spark.sql(f'CREATE TABLE IF NOT EXISTS {table_name} \
          USING csv \
          OPTIONS (HEADER true, INFERSCHEMA true, NULLVALUE "NA") \
          LOCATION "{gs_path}"')

spark_df= spark.sql(f'SELECT *, CAST((ConvertedCompYearly > 60000) AS STRING) AS compAboveAvg \
                    FROM {table_name} WHERE ConvertedCompYearly IS NOT NULL ')

In [None]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml import Pipeline
y = 'compAboveAvg' 
feature_columns = ['OpSys', 'EdLevel', 'MainBranch' , 'Country', 'YearsCode']

stringindexer_stages = [StringIndexer(inputCol=c, outputCol='strindexed_' + c).setHandleInvalid("keep") for c in feature_columns]
stringindexer_stages += [StringIndexer(inputCol=y, outputCol='label').setHandleInvalid("keep")]

onehotencoder_stages = [OneHotEncoder(inputCol='strindexed_' + c, outputCol='onehot_' + c) for c in feature_columns]
extracted_columns = ['onehot_' + c for c in feature_columns]
vectorassembler_stage = VectorAssembler(inputCols=extracted_columns, outputCol='features') 

final_columns = [y] + feature_columns + extracted_columns + ['features', 'label']

transformed_df = Pipeline(stages=stringindexer_stages + \
                          onehotencoder_stages + \
                          [vectorassembler_stage]).fit(spark_df).transform(spark_df).select(final_columns)
training, test = transformed_df.randomSplit([0.8, 0.2], seed=1234) # split into training and test sets
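
To make the pipeline above easier to follow, here is a plain-Python sketch of what the indexing and one-hot stages do to a single column. It is an illustration under stated assumptions, not Spark's implementation: indices are assigned by descending frequency (ties alphabetical), unseen values get an extra bucket as with `setHandleInvalid("keep")`, and the last category is dropped from the vector as with the encoder's default `dropLast` behaviour. The function names are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: position for position, value in enumerate(ordered)}

def one_hot(value, index, keep_invalid=True):
    """One-hot encode `value`; the last category is represented by all zeros."""
    if value not in index and not keep_invalid:
        raise ValueError(f"unseen category: {value!r}")
    num_categories = len(index) + (1 if keep_invalid else 0)  # extra bucket for unseen values
    position = index.get(value, len(index))                   # unseen -> extra bucket
    vector = [0.0] * (num_categories - 1)                     # last slot is dropped
    if position < len(vector):
        vector[position] = 1.0
    return vector

op_sys = ["Linux", "Windows", "Linux", "macOS", "Windows", "Linux"]
index = fit_string_index(op_sys)   # {'Linux': 0, 'Windows': 1, 'macOS': 2}
print(one_hot("Windows", index))   # [0.0, 1.0, 0.0]
print(one_hot("BSD", index))       # unseen value -> [0.0, 0.0, 0.0]
```

The `VectorAssembler` stage then concatenates the per-column vectors into the single `features` column.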

### Set the experiment that we would like to use for tracking training runs

In [None]:
import mlflow   
import mlflow.spark

ename = f"tbd-2023l-{student_id}"
artifacts_location= "artifacts"
mlflow.set_experiment(experiment_name=ename)
experiment = mlflow.get_experiment_by_name(ename)
experiment

### Prepare metrics that we would like to log

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
evaluator_acc = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
evaluator_recall = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedRecall")
evaluator_prec = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedPrecision")
evaluator_f = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="weightedFMeasure")
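
The weighted metrics requested above average the per-class values, weighting each class by its share of the true labels. The following plain-Python sketch shows that calculation on a toy example. It illustrates the definitions and is not Spark's code; `weighted_metrics` is a hypothetical helper.

```python
from collections import Counter

def weighted_metrics(labels, predictions):
    """Accuracy plus class-frequency-weighted precision, recall and F1."""
    total = len(labels)
    support = Counter(labels)        # true count per class
    predicted = Counter(predictions) # predicted count per class
    correct = Counter(l for l, p in zip(labels, predictions) if l == p)

    precision = recall = f_measure = 0.0
    for cls, n in support.items():
        p = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        r = correct[cls] / n
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = n / total           # share of true labels in this class
        precision += weight * p
        recall += weight * r
        f_measure += weight * f

    return {
        "accuracy": sum(correct.values()) / total,
        "weightedPrecision": precision,
        "weightedRecall": recall,
        "weightedFMeasure": f_measure,
    }

metrics = weighted_metrics([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
print(metrics)
```

Note that weighted recall computed this way always equals accuracy, so the two evaluators report the same number.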

### Start a training using the decision tree model

In [None]:
from pyspark.ml.classification import DecisionTreeClassifier

with mlflow.start_run(experiment_id = experiment.experiment_id):
    mlflow.log_param('model_type', 'DecisionTreeClassifier')
    dt = DecisionTreeClassifier(featuresCol='features', labelCol='label')
    mlflow.set_tag("classifier", "decision_tree")  # set tags
    mlflow.log_param("depth", dt.getMaxDepth())    # log metadata (hyperparameters)

    dt_model = Pipeline(stages=[dt]).fit(training)
    # score the test set once and reuse the result for every metric
    res = dt_model.transform(test)

    test_metric_acc = evaluator_acc.evaluate(res)
    test_metric_recall = evaluator_recall.evaluate(res)
    test_metric_prec = evaluator_prec.evaluate(res)
    test_metric_f = evaluator_f.evaluate(res)

    mlflow.log_metric(evaluator_acc.getMetricName(), test_metric_acc) 
    mlflow.log_metric(evaluator_recall.getMetricName(), test_metric_recall) 
    mlflow.log_metric(evaluator_prec.getMetricName(), test_metric_prec)     
    mlflow.log_metric(evaluator_f.getMetricName(), test_metric_f)
    mlflow.spark.log_model(dt_model, artifact_path=artifacts_location)

### Start a training using the gradient-boosted trees model

In [None]:
from pyspark.ml.classification import GBTClassifier
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)
gbt_model = gbt.fit(training)

with mlflow.start_run(experiment_id = experiment.experiment_id):
    mlflow.log_param('model_type', 'GBTClassifier')
    mlflow.log_param("depth", gbt.getMaxDepth())
    res = gbt_model.transform(test)
    test_metric_acc = evaluator_acc.evaluate(res)
    test_metric_recall = evaluator_recall.evaluate(res)
    test_metric_prec = evaluator_prec.evaluate(res)
    test_metric_f = evaluator_f.evaluate(res)

    mlflow.log_metric(evaluator_acc.getMetricName(), test_metric_acc) 
    mlflow.log_metric(evaluator_recall.getMetricName(), test_metric_recall) 
    mlflow.log_metric(evaluator_prec.getMetricName(), test_metric_prec)     
    mlflow.log_metric(evaluator_f.getMetricName(), test_metric_f) 
  
    mlflow.spark.log_model(spark_model=gbt_model, artifact_path='gbt_classifier') 

### Run predictions

In [None]:
import mlflow
# replace the run ID below with the ID of one of your own runs (shown in the MLflow UI)
logged_model = 'runs:/83b4e502895840719d976337812b0d3b/artifacts'

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)

# Predict on a Pandas DataFrame (toPandas() already returns one).
loaded_model.predict(test.limit(10).toPandas())

In [None]:
spark.stop()

## 3. Student implementation
- Please prepare a machine learning pipeline (feature engineering and training with an appropriate model) for Microsoft Security Incident Prediction from Kaggle, using Apache Spark and following the example above.
- The model should be registered in your MLflow instance together with appropriate metrics.
- Please provide screenshots of the MLflow dashboard, export this notebook as a PDF, and submit both with the assignment.

### 3.A. Data Preprocessing

Please perform data preprocessing using Apache Spark, including the following steps:
- Handle missing and incomplete data: Identify and appropriately address null or missing values, either by removing rows/columns, imputing values, or other relevant techniques.
- Select relevant features for prediction: Remove columns that do not contribute to predicting the label, such as unique identifiers or irrelevant metadata.
- Split the dataset: Divide the dataset into training, test, and optionally validation sets in appropriate proportions (e.g., 70% training, 20% test, 10% validation).
- Normalize the data: Apply feature scaling to ensure the input features are on a similar scale, which is crucial for many machine learning models.
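
The last two steps can be illustrated with a plain-Python sketch. It only demonstrates the idea: in the lab the same steps would be done in Spark, for example with `DataFrame.randomSplit` and `pyspark.ml.feature.StandardScaler`. The helper names are hypothetical. The point to notice is that the scaling statistics are fitted on the training split only and then reused for the other splits.

```python
import random
import statistics

def split_rows(rows, weights=(0.7, 0.2, 0.1), seed=1234):
    """Randomly assign each row to one split; split sizes are approximate."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        target = rng.choices(range(len(weights)), weights=weights)[0]
        splits[target].append(row)
    return splits

def fit_scaler(train_values):
    """Learn mean and sample standard deviation from the training data only."""
    return statistics.mean(train_values), statistics.stdev(train_values)

def scale(values, mean, std):
    """Standardise values with statistics learned on the training split."""
    return [(v - mean) / std if std else 0.0 for v in values]

train, test, validation = split_rows(range(1000))
mean, std = fit_scaler([1.0, 2.0, 3.0])
print(scale([1.0, 2.0, 3.0, 5.0], mean, std))  # [-1.0, 0.0, 1.0, 3.0]
```

Fitting the scaler on the full dataset would leak information from the test and validation sets into training.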

### 3.B. Model Training, Testing, and Deployment Using Spark and MLflow

Please perform the training of machine learning models using Apache Spark and MLflow, with the following three classifiers:
- SVC (Support Vector Classifier)
- MLP (Multi-Layer Perceptron)
- KNN (k-nearest neighbors).

The target column to classify is `IncidentGrade`.

Tasks:
1. Train the models using the training dataset.
2. Test the models using the test dataset.
3. Evaluate and compare model performance using multiple metrics such as:
   - Accuracy
   - Precision
   - Recall
   - F1 Score
4. Track and log experiments using MLflow, including:
   - Parameters
   - Metrics
   - Model artifacts