# Model auditing with MLflow on SherlockML

After installing MLflow (`pip install mlflow`), import it like any other Python library.

In [1]:
import mlflow
from mlflow import log_artifact, log_metric, log_param

MLflow can log various objects (models, parameters, runs, entire projects) and keep track of them through the UI. When something is logged, MLflow creates an `mlruns` folder where it stores the information. You can then run the MLflow UI, as described in this project's README, to see all the logged information.

## Load the data (car evaluation dataset)

Data downloaded from: http://mlr.cs.umass.edu/ml/datasets/Car+Evaluation

In [2]:
import pandas as pd

In [5]:
data = pd.read_csv("../data/car.data", header=None)
data.columns = ['buying','maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

In [6]:
data.head()

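|   | buying | maint | doors | persons | lug_boot | safety | class |
|---|--------|-------|-------|---------|----------|--------|-------|
| 0 | vhigh  | vhigh | 2     | 2       | small    | low    | unacc |
| 1 | vhigh  | vhigh | 2     | 2       | small    | med    | unacc |
| 2 | vhigh  | vhigh | 2     | 2       | small    | high   | unacc |
| 3 | vhigh  | vhigh | 2     | 2       | med      | low    | unacc |
| 4 | vhigh  | vhigh | 2     | 2       | med      | med    | unacc |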


## Data exploration

The dataset is completely categorical. Let's get some simple information about it.

In [12]:
data.describe()

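|        | buying | maint | doors | persons | lug_boot | safety | class |
|--------|--------|-------|-------|---------|----------|--------|-------|
| count  | 1728   | 1728  | 1728  | 1728    | 1728     | 1728   | 1728  |
| unique | 4      | 4     | 4     | 3       | 3        | 3      | 4     |
| top    | low    | low   | 3     | 2       | small    | low    | unacc |
| freq   | 432    | 432   | 432   | 576     | 576      | 576    | 1210  |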


Let's have a closer look at the classes to which the datapoints belong.

In [28]:
data.groupby(['class']).size()

class
acc       384
good       69
unacc    1210
vgood      65
dtype: int64
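
The counts above show that the classes are far from balanced, which matters when reading accuracy figures. As a rough sketch (the counts are retyped by hand here rather than taken from `data`, and the variable names are only illustrative), the share of each class and the accuracy of always predicting the most frequent class can be computed like this:

```python
import pandas as pd

# Class counts copied from the groupby output above
class_counts = pd.Series({"acc": 384, "good": 69, "unacc": 1210, "vgood": 65})

# Share of each class in the dataset
class_shares = class_counts / class_counts.sum()

# Accuracy of a trivial model that always predicts the most frequent class
majority_baseline = class_shares.max()

print(class_shares.round(3))
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Since `unacc` accounts for about 70% of the rows, an accuracy close to 0.70 would be no better than this trivial baseline.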

## Training and selecting a model

Since the dataset is completely categorical, we choose to use a random forest to predict the class of a datapoint given its features. In particular, we want to find the value of the `min_samples_leaf` parameter that gives the best accuracy, and to log all the results with MLflow.

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

First we one-hot-encode the categorical features.

In [30]:
X = pd.get_dummies(data.drop('class', axis=1))
Y = data['class']
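
To make the encoding step concrete, here is a small sketch on a toy frame. The frame `toy` is hypothetical and only reuses two column names and a few values that appear in the rows shown earlier; it is not the real dataset.

```python
import pandas as pd

# Hypothetical toy frame with two of the dataset's categorical columns
toy = pd.DataFrame({
    "lug_boot": ["small", "small", "med"],
    "safety": ["low", "med", "high"],
})

# Each category becomes its own indicator column named <column>_<value>
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
print(encoded)
```

Every row has exactly one active indicator per original column, so the tree-based model sees each category as a separate binary feature.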

Hold-out validation: we split the data into a training set and a test set. The model is trained on the training set and its accuracy is evaluated on the test set.

In [32]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
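
As a quick sanity check on what `test_size=0.33` means for a dataset of this size, the sketch below applies the same call to a placeholder array with 1728 entries (the row count reported by `describe()`). The array `rows` is an assumption made for illustration and stands in for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder array with as many entries as the dataset has rows
rows = np.arange(1728)

# Same split parameters as in the notebook cell above
rows_train, rows_test = train_test_split(rows, test_size=0.33, random_state=42)

print(len(rows_train), len(rows_test))
```

With these settings, roughly a third of the rows are held out for evaluation and the rest are used for training. Fixing `random_state` makes the split reproducible between executions.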

We choose a range of values for the `min_samples_leaf` parameter and use MLflow to log the accuracy obtained with each value. The information is stored in the `mlruns` folder. We can then run the MLflow UI as described in the README (start it from the parent folder of `mlruns`) and visualise the information about our runs.

In [33]:
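# Train one forest per candidate value and log each fit as a separate MLflow run
for min_samples_leaf in range(1, 11):
    with mlflow.start_run():
        # Fix the seed so that runs differ only in min_samples_leaf
        rf_random_state = 42
        rf = RandomForestClassifier(min_samples_leaf=min_samples_leaf, random_state=rf_random_state)
        rf.fit(X_train, Y_train)
        # Mean accuracy on the held-out test set
        accuracy = rf.score(X_test, Y_test)

        # Record the hyperparameters and the resulting metric for this run
        log_param('min_samples_leaf', min_samples_leaf)
        log_param('rf_random_state', rf_random_state)
        log_metric('accuracy', accuracy)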