<header style="background: white; border-left: 8px solid #602663; padding: 1em;">
<div>
<span style="color: black; font-size: small; font-weight: 700; text-transform: uppercase;">Level 6 Data Science / Software Engineering</span><br><span style="color: #602663; font-size: xx-large; font-weight: 900;">Topic 8: AutoML</span>
</div>
</header>

First, let's install the necessary packages.

> **Note**:
The packages and their dependencies (listed in `requirements.txt`) are the versions that we used when building this notebook. We use `pip install -q` to install them quietly, with minimal output.

In [None]:
%pip install -q -r ../requirements.txt

In [None]:
# Ignore warnings - they are mostly about deprecation of certain features
import warnings
warnings.filterwarnings("ignore")

# Ignore matplotlib font manager logging - which is not relevant for this notebook
import logging
logging.getLogger("matplotlib.font_manager").setLevel(logging.ERROR)

We will use the *IBM HR Analytics Employee Attrition & Performance* data set from Kaggle.

> *Uncover the factors that lead to employee attrition and explore important questions such as ‘show me a breakdown of distance from home by job role and attrition’ or ‘compare average monthly income by education and attrition’. This is a fictional data set created by IBM data scientists*: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

In [None]:
from pandas import read_csv

data = read_csv('https://raw.githubusercontent.com/BPP-Digital-Advanced-Data-Analytics/public_datasets/main/WA_Fn-UseC_-HR-Employee-Attrition.csv')
data.head()

In [None]:
data.info()

Most AutoML packages expect a binary target to be encoded as an integer. Let's recode the `Attrition` column as `1` for `Yes` and `0` for `No`.

In [None]:

data['Attrition'] = data['Attrition'].replace({'Yes': 1, 'No': 0})
data['Attrition'] = data['Attrition'].astype(int)

scikit-learn is a machine learning library that we will use to build our models. Here we configure it to enable metadata routing, so that the Area Under the ROC Curve (AUC) is calculated correctly for each model.

In [None]:
from sklearn import set_config

set_config(enable_metadata_routing=True)

In [None]:
from pycaret.classification import setup

s = setup(
    data,  # our dataframe
    target = "Attrition",  # the feature that we want to predict
    ignore_features = [  # features we want to exclude because they are not useful
        "EmployeeCount",
        "EmployeeNumber",
        "Over18",
        "StandardHours",
    ],
    session_id = 123,  # random seed, so that results are reproducible
)

In [None]:
from pycaret.classification import compare_models

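# Cross-validate (5 folds) every available model except LightGBM,
# rank them by accuracy and keep the best one
best_accuracy = compare_models(sort = 'Accuracy', fold = 5, exclude = ['lightgbm'])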


In [None]:
from pycaret.classification import plot_model

plot_model(best_accuracy, plot = 'confusion_matrix')

In [None]:
from pycaret.classification import predict_model

# With no new data supplied, predict_model scores the hold-out set reserved by setup()
pred_holdout = predict_model(best_accuracy)

In [None]:
from pycaret.classification import create_model

thresh = 0.5  # Try values above and below 0.5; it must be strictly between 0 and 1

lda = create_model('lda',
                   probability_threshold = thresh,
                   fold = 5)

plot_model(lda, plot = 'confusion_matrix')

holdout_pred = predict_model(lda)
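
To see why moving `thresh` changes the confusion matrix, here is a minimal sketch in plain Python. The labels and probabilities are made-up illustrative values, not taken from the attrition data. The helper `confusion_counts` is hypothetical, and its `>=` comparison is an assumption for this sketch rather than a statement of how PyCaret applies the threshold internally.

```python
# Hypothetical true labels and predicted probabilities (illustrative values,
# not taken from the attrition data).
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_prob = [0.10, 0.20, 0.35, 0.45, 0.40, 0.55, 0.60, 0.90]


def confusion_counts(y_true, y_prob, thresh):
    """Return (tn, fp, fn, tp), predicting 1 when probability >= thresh.

    The >= convention is an assumption made for this sketch.
    """
    y_pred = [1 if p >= thresh else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


# Recompute the counts at a low, default and high threshold
results = {t: confusion_counts(y_true, y_prob, t) for t in (0.3, 0.5, 0.7)}
for t, (tn, fp, fn, tp) in results.items():
    print(f"thresh={t}: TN={tn} FP={fp} FN={fn} TP={tp}")
```

In this toy example, lowering the threshold to 0.3 removes all false negatives but adds false positives, while raising it to 0.7 does the opposite. The same trade-off applies when you change `thresh` in the cell above.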

In [None]:
# Precision-recall curve
plot_model(lda, plot = 'pr')

In [None]:
# Feature importance
plot_model(lda, plot = 'feature')

In [None]:
from pycaret.classification import check_fairness

check_fairness(lda, sensitive_features = ['Gender'])
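
A fairness check of this kind compares model metrics across the groups of a sensitive feature. The sketch below illustrates the idea in plain Python with made-up records. The helper `metrics_by_group` and the two metrics it reports are assumptions chosen for illustration, not PyCaret's implementation, and the exact metrics shown by `check_fairness` may differ.

```python
from collections import defaultdict

# Hypothetical records: (group, true label, predicted label).
# Illustrative values only, not taken from the attrition data.
records = [
    ("Female", 1, 1), ("Female", 0, 0), ("Female", 0, 1), ("Female", 1, 0),
    ("Male", 1, 1), ("Male", 0, 0), ("Male", 0, 0), ("Male", 1, 1),
]


def metrics_by_group(records):
    """Return {group: {"accuracy": ..., "selection_rate": ...}}."""
    grouped = defaultdict(list)
    for group, y_true, y_pred in records:
        grouped[group].append((y_true, y_pred))

    result = {}
    for group, pairs in grouped.items():
        n = len(pairs)
        result[group] = {
            # Share of predictions that match the true label
            "accuracy": sum(1 for t, p in pairs if t == p) / n,
            # Share of the group predicted as positive (1)
            "selection_rate": sum(p for _, p in pairs) / n,
        }
    return result


by_group = metrics_by_group(records)
for group, metrics in by_group.items():
    print(group, metrics)
```

Here both groups are predicted positive at the same rate, but the predictions are much less accurate for one group than the other. Gaps like this are what to look for in the `check_fairness` output.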