# Module 2: Model evaluation and validation

In this module, we cover

- Training and test datasets
- Evaluation metrics (e.g. accuracy, recall, precision, F1 score, AUC)
- Cross-validation
- Over and underfitting
- Hyperparameters
- Validation datasets
- Model tuning
- Hands-on example of evaluating and tuning an ML model

The [notebooks](https://github.com/decisionmechanics/lt541v) for the course are available on GitHub. Clone or download them to follow along.

In this notebook, we make use of the following third-party packages.

```bash
pip install jupyterlab altair feature-engine hyperopt matplotlib polars pyarrow pyjanitor scikit-learn scikit-plot scipy xgboost yellowbrick
```

Examples will make use of data from Kaggle's [2018 Machine Learning and Data Science Survey](https://www.kaggle.com/datasets/kaggle/kaggle-survey-2018).

We will use this data to try to predict whether someone is a data scientist or a software engineer.

Some of the packages have deprecation warnings. These can be disabled.

In [None]:
import warnings

warnings.filterwarnings("ignore")
warnings.simplefilter("ignore")

## Data preparation using pipelines

Import the survey data.

In [None]:
import janitor.polars
import polars as pl

survey_raw_df = pl.read_csv("data/ml-ds-survey.csv").clean_names()

survey_raw_df.head(n=3)

Looks like we have two header rows.

In [None]:
survey_raw_df = pl.read_csv(
    "data/ml-ds-survey.csv", skip_rows_after_header=1, infer_schema_length=None
).clean_names()

survey_raw_df.head()

Create a function that reads the raw data, cleans it up and selects the desired features.

In [None]:
def load_data():
    survey_raw_df = pl.read_csv(
        "data/ml-ds-survey.csv", skip_rows_after_header=1, infer_schema_length=None
    ).clean_names()

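    # Take the three most frequent majors (not the majors in the first three rows)
    top_majors = (
        survey_raw_df.get_column("q5")
        .drop_nulls()
        .value_counts(sort=True)
        .head(n=3)
        .get_column("q5")
        .to_list()
    )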

    return (
        survey_raw_df.rename(
            {
                "q1": "gender",
                "q3": "country",
                "q2": "age",
                "q4": "education",
                "q8": "experience",
                "q9": "compensation",
                "q5": "major",
                "q16_part_1": "python",
                "q16_part_2": "r",
                "q16_part_3": "sql",
                "q6": "title",
            }
        )
        .select(
            "gender",
            "country",
            pl.col("age").str.slice(0, 2).cast(pl.Int8),
            pl.col("education")
            .replace(
                {
                    "No formal education past high school": 12,
                    "Some college/university study without earning a bachelor’s degree": 13,
                    "Bachelor’s degree": 16,
                    "Master’s degree": 18,
                    "Professional degree": 19,
                    "Doctoral degree": 20,
                    "I prefer not to answer": None,
                }
            )
            .cast(pl.Int8),
            pl.when(pl.col("major").is_in(top_majors))
            .then(pl.col("major"))
            .otherwise(pl.lit("Other")),
            pl.col("experience")
            .str.replace(r"\s*\+", "")
            .str.split("-")
            .list.first()
            .cast(pl.Int8),
            pl.col("compensation")
            .fill_null(0)
            .replace(
                {
                    "500,000+": "500",
                    "I do not wish to disclose my approximate yearly compensation": None,
                }
            )
            .str.split("-")
            .list.first()
            .cast(pl.Int32)
            * 1000,
            (pl.col("python").fill_null("") == "Python").cast(pl.Int8),
            (pl.col("r").fill_null("") == "R").cast(pl.Int8),
            (pl.col("sql").fill_null("") == "SQL").cast(pl.Int8),
            "title",
        )
        .filter(
            pl.col("country").is_in(["China", "India", "United States of America"]),
            pl.col("title").is_in(["Data Scientist", "Software Engineer"]),
        )
    )

If we haven't already loaded the dataset and stored it in parquet format, do that now. Otherwise, load the data from the parquet file.

In [None]:
import os

if not os.path.isfile("temp/ml-ds-survey.parquet"):
    survey_df = load_data()

    survey_df.write_parquet("temp/ml-ds-survey.parquet")
else:
    survey_df = pl.read_parquet("temp/ml-ds-survey.parquet")

We've filtered the data to only include the three most populous countries (India, China and the USA). This ensures we have a reasonable number of samples for each country.

We've also simplified our task to performing a binary classification, by including only data scientists and software engineers.

In [None]:
from feature_engine import encoding, imputation
from sklearn import pipeline
from sklearn.compose import ColumnTransformer

categorical_pipeline = pipeline.Pipeline(
    [
        (
            "encode_categoricals",
            encoding.OneHotEncoder(
                top_categories=5,
                drop_last=True,
            ),
        )
    ]
)

numeric_pipeline = pipeline.Pipeline(
    [
        (
            "imputate",
            imputation.MeanMedianImputer(imputation_method="median"),
        ),
    ]
)

column_transformer = ColumnTransformer(
    transformers=[
        ("categorical_pipeline", categorical_pipeline, ["gender", "country", "major"]),
        ("numeric_pipeline", numeric_pipeline, ["education", "experience"]),
    ],
    remainder="passthrough"
)

column_transformer.set_output(transform="polars")

data_preperation_pipeline = pipeline.Pipeline(
    [
        ("column_transformer", column_transformer),
    ]
)

In [None]:
from sklearn import set_config

set_config(display="diagram")
display(data_preperation_pipeline)

In [None]:
X_df = survey_df.select(
    [
        "gender",
        "country",
        "age",
        "education",
        "major",
        "experience",
        "compensation",
        "python",
        "r",
        "sql",
    ]
)

y_df = survey_df.select("title")

In [None]:
from sklearn import model_selection

SEED = 123

X_train_df, X_test_df, y_train_df_, y_test_df_ = model_selection.train_test_split(
    X_df, y_df, test_size=0.3, random_state=SEED, stratify=y_df
)

In [None]:
X_train_df_ = data_preperation_pipeline.fit_transform(X_train_df.to_pandas())

X_train_df_

In [None]:
# Only transform the test data; the pipeline was already fitted on the training data
X_test_df_ = data_preperation_pipeline.transform(X_test_df.to_pandas())

X_test_df_

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

y_train = label_encoder.fit_transform(y_train_df_.get_column("title"))

y_train

In [None]:
# Reuse the encoder fitted on the training labels so both sets share one mapping
y_test = label_encoder.transform(y_test_df_.get_column("title"))

y_test

## Classification using pipelines

In [None]:
from xgboost import XGBClassifier

xgb_pipeline = pipeline.Pipeline(
    [
        ("column_transformer", column_transformer),
        ("classifier", XGBClassifier(random_state=SEED)),
    ]
)

In [None]:
xgb_pipeline.fit(X_train_df.to_pandas(), y_train)

In [None]:
xgb_pipeline.predict(X_test_df.to_pandas())[:10]

## Model quality

Before we can use a model we have to ask ourselves, "Is it any good?"

We need some objective way of measuring the models. There are many different metrics for assessing model quality.

We also need some idea of the baseline. Is our model better than the alternative? The alternative might be a naive approach (e.g. assuming the weather next week will be the same as today), or it might be another model.

## Training and test datasets

We've been splitting our dataset into training and test partitions. This is to allow us to evaluate our models.

When we train our models, we don't train them on the entire dataset. We keep some data aside, not letting the model see it, so we can test the performance of the model.

The proportion of the observations set aside depends on factors like the amount of data available, the ML algorithm being used, and the problem under study. Typical proportions are 50%, 30%, 25% and 10%.

It's important that we don't have duplicate values that appear in both the training and test datasets. If we do, this means that the training phase gets to sneak a peek at the "exam answers", so the model's quality will be overstated when we apply it to our test data.
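
As a rough illustration, the check for shared observations can be sketched in plain Python. The rows and the `find_leaked_rows` helper are made up for this example. In survey data, two different respondents can legitimately give identical answers, so treat any matches as something to investigate rather than as proof of leakage.

```python
# Toy rows standing in for observations; each row is a tuple of feature values.
train_rows = [(25, "India", 1), (31, "China", 0), (40, "India", 1)]
test_rows = [(31, "China", 0), (22, "India", 0)]


def find_leaked_rows(train, test):
    """Return the rows that appear in both the training and test data."""
    return sorted(set(train) & set(test))


leaked = find_leaked_rows(train_rows, test_rows)
print(leaked)
```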

We also need to make sure that the test and training datasets are representative of the full dataset.

Using stratified sampling (via the `stratify` parameter) when we split the data helps to ensure this, especially when we have very unbalanced classes.
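
To make the idea concrete, here is a minimal sketch of stratified splitting in plain Python. It is not how scikit-learn implements `stratify`, and the `stratified_split` helper is hypothetical. It simply samples the same fraction from each class so the class mix is preserved.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_indices, test_indices), taking test_size of each class."""
    rng = random.Random(seed)

    # Group the row indices by class label
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_indices, test_indices = [], []
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)
        n_test = round(len(class_indices) * test_size)
        test_indices.extend(class_indices[:n_test])
        train_indices.extend(class_indices[n_test:])

    return sorted(train_indices), sorted(test_indices)


labels = [0] * 90 + [1] * 10  # 10% positive class
train_indices, test_indices = stratified_split(labels, test_size=0.3, seed=123)

# The test set keeps the same 10% share of positives as the full dataset
test_positive_share = sum(labels[i] for i in test_indices) / len(test_indices)
print(test_positive_share)
```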

Do the training and test datasets have a similar target composition to the original dataset?

In [None]:
(survey_df.get_column("title").value_counts(normalize=True, sort=True))

In [None]:
(
    pl.DataFrame(
        {
            "title": y_train,
        }
    )
    .with_columns(
        pl.col("title").replace_strict(
            {
                0: "Data Scientist",
                1: "Software Engineer",
            },
            return_dtype=pl.String,
        )
    )
    .get_column("title")
    .value_counts(normalize=True, sort=True)
)

In [None]:
(
    pl.DataFrame(
        {
            "title": y_test,
        }
    )
    .with_columns(
        pl.col("title").replace_strict(
            {
                0: "Data Scientist",
                1: "Software Engineer",
            },
            return_dtype=pl.String,
        )
    )
    .get_column("title")
    .value_counts(normalize=True, sort=True)
)

Validation datasets will be introduced when we discuss hyperparameters in more depth.

## Evaluation metrics

### Accuracy

Accuracy is simply the proportion of correct predictions.

In [None]:
predicted = xgb_pipeline.predict(X_test_df.to_pandas())

(
    pl.DataFrame(
        {
            "actual": y_test,
            "predicted": predicted,
        }
    )
    .select(
        (pl.col("actual") == pl.col("predicted")).alias("agreement"),
    )
    .mean()
    .item()
)

It's available from the scikit-learn pipeline.

In [None]:
xgb_pipeline.score(X_test_df.to_pandas(), y_test)

#### Limitations of accuracy

Accuracy is an attractive evaluation metric as it's very simple. However, it has serious limitations.

It assumes that the target classes are fairly well balanced.

In [None]:
(
    survey_df.get_column("title")
    .value_counts()
    .plot.arc(
        theta="count",
        color="title",
    )
)

In the case of our survey data, this assumption holds---we have relatively equal numbers of data scientists and software engineers.

However, consider a rare disease---a disease that only occurs in 1% of the population.

We'll create a dataset with four features (x1, x2, x3, x4) and a target (whether the person has the disease). 

Generate 1000 observations. Use random numbers for the four features and set 99% of the targets to 0 (doesn’t have the disease) and 1% to 1 (does have the disease).

In [None]:
import numpy as np

np.random.seed(SEED)

data = np.random.randint(0, 100, size=(1000, 4))

rare_disease_df = pl.DataFrame(data, schema=["x1", "x2", "x3", "x4"]).with_columns(
    pl.Series(name="target", values=[0] * 990 + [1] * 10)
)

rare_disease_df.head()

In [None]:
(
    rare_disease_df.get_column("target")
    .value_counts()
    .plot.arc(
        theta="count",
        color="target",
    )
)

There is _no_ predictive information in the four features---they are random values. They are completely unrelated to the target.

Let's use them to predict the target.

In [None]:
(
    rare_disease_X_train_df,
    rare_disease_X_test_df,
    rare_disease_y_train_df,
    rare_disease_y_test_df,
) = model_selection.train_test_split(
    rare_disease_df.select(["x1", "x2", "x3", "x4"]),
    rare_disease_df.select("target"),
    test_size=0.3,
    random_state=SEED,
    stratify=rare_disease_df.select("target"),
)

rare_disease_classifier = XGBClassifier().fit(
    rare_disease_X_train_df.to_pandas(), rare_disease_y_train_df.get_column("target")
)

rare_disease_classifier.score(
    rare_disease_X_test_df.to_pandas(), rare_disease_y_test_df.get_column("target")
)

That junk model is 99% accurate! Great! But...wait...how can it be?

It's because we have unbalanced target classes. We can get the same results from a simple (baseline) model---i.e. no-one has the disease.

In [None]:
(
    rare_disease_y_test_df.rename(
        {
            "target": "actual",
        }
    )
    .with_columns(
        pl.lit(0).alias("predicted"),
    )
    .select(
        (pl.col("actual") == pl.col("predicted")).alias("agreement"),
    )
    .mean()
    .item()
)

Accuracy can also be a poor metric when the classes are balanced. In classification tasks, we convert a predicted probability into a class. As accuracy is based on the predicted _class_, it doesn't consider the predicted _probability_.

This means that a predicted probability of 51% carries the same weight as one of 99%. Quality metrics that can make use of the probability values provide a more nuanced assessment. 
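
A small sketch in plain Python shows the effect. The probabilities are invented for illustration, and the 0.5 threshold is an assumption. Both sets of predictions produce identical classes, so accuracy cannot tell the confident model from the hesitant one.

```python
actual = [1, 1, 0, 0]

# Predicted probability of the positive class from two hypothetical models
confident_probabilities = [0.99, 0.95, 0.02, 0.05]
hesitant_probabilities = [0.51, 0.55, 0.49, 0.45]


def to_classes(probabilities, threshold=0.5):
    """Convert probabilities to classes using a decision threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy_from_classes(actual, predicted):
    """Proportion of predictions that match the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


confident_accuracy = accuracy_from_classes(actual, to_classes(confident_probabilities))
hesitant_accuracy = accuracy_from_classes(actual, to_classes(hesitant_probabilities))

print(confident_accuracy, hesitant_accuracy)
```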

Accuracy isn't a _useless_ metric---we use it throughout the course. It's easy for people to interpret. But you have to be careful when interpreting it, especially when you have unbalanced target classes.

Always use a variety of quality metrics when assessing your models.

### Confusion matrices

Confusion matrices are used to evaluate the performance of a classification model by displaying the counts of true positives, true negatives, false positives, and false negatives. They help visualise where the model is making accurate predictions and where it may be misclassifying, aiding in identifying patterns of errors.

<img src="images/module2-confusion-matrix.png" alt="Confusion matrix" width="500" />

The terms "positive" and "negative" don't have pejorative interpretations in ML. For example, a cancer diagnosis may be labelled as a positive outcome.
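
The four cells can be counted by hand, as in this plain-Python sketch. The labels are made up and `confusion_counts` is a hypothetical helper, with class 1 treated as the positive class.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN, treating class 1 as the positive class."""
    pairs = Counter(zip(actual, predicted))

    return {
        "tp": pairs[(1, 1)],  # actual positive, predicted positive
        "tn": pairs[(0, 0)],  # actual negative, predicted negative
        "fp": pairs[(0, 1)],  # actual negative, predicted positive
        "fn": pairs[(1, 0)],  # actual positive, predicted negative
    }


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
print(counts)
```

For binary labels, scikit-learn's `confusion_matrix` arranges the same four counts as `[[tn, fp], [fn, tp]]`, with rows for the actual class and columns for the predicted class.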

We can compute the count in each cell of the table using the `confusion_matrix` function.

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, predicted)

In this matrix,

- the top-left is the count of Data Scientists (0) correctly predicted as Data Scientists
- the bottom-right is the count of Software Engineers (1) correctly predicted as Software Engineers
- the top-right is the count of Data Scientists incorrectly predicted as Software Engineers
- the bottom-left is the count of Software Engineers incorrectly predicted as Data Scientists


In [None]:
(
    pl.DataFrame(
        {
            "actual": y_test,
            "predicted": predicted,
        }
    )
    .with_columns(
        pl.all().replace_strict(
            {
                0: "Data Scientist",
                1: "Software Engineer",
            },
            return_dtype=pl.String,
        )
    )
    .group_by(pl.all())
    .agg(pl.len().alias("count"))
    .sort("count", descending=True)
)

We can normalise the counts in the confusion matrix, which can make it easier to interpret.

In [None]:
confusion_matrix(y_test, predicted, normalize="true")

It can be hard to remember what the columns and rows in a confusion matrix represent, so visualising it, with sensible labels, eases interpretation.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_test, predicted, normalize="true"),
    display_labels=["DS", "SE"],
).plot();

The confusion matrix can provide insights into the _type_ of errors the model is making. There are problems in which false negatives are worse than false positives and vice versa.

For example, if we define testing positive for cancer as a "positive" outcome, then we want to tune the model to accept more false positives in order to reduce false negatives, as a missed diagnosis is more costly than a false alarm. This will generally reduce the overall accuracy of the model, but makes sense in the context of the domain under study.

Are false positives or false negatives more desirable in your industry?

As with accuracy, confusion matrices are calculated using predicted classes, rather than probabilities, so the same concerns apply.

### Precision score

Precision is the measure of the accuracy of positive predictions made by a model, calculated as the ratio of true positives to the sum of true positives and false positives. It reflects how often the model's positive predictions are actually correct, making it particularly important in scenarios where false positives need to be minimized.

<img src="images/module2-precision-score.png" alt="Precision score calculation" width="500" />

We can obtain a precision score using `precision_score`.

In [None]:
from sklearn.metrics import precision_score

precision_score(y_test, predicted)

Precision answers "When we predict X, how often are we right?" It's a useful metric for investment models, where acting on a false positive is costly. As with buying a used car, you are willing to pass on a few "too good to be true" deals to reduce your chances of ending up with a lemon.

Is precision a useful metric in your industry?

As with accuracy, precision scores are calculated using predicted classes, rather than probabilities, so the same concerns apply.

### Recall score

Recall (sensitivity) is the measure of a model's ability to identify all relevant positive instances, calculated as the ratio of true positives to the sum of true positives and false negatives. It indicates how well the model captures actual positives, making it crucial in contexts where missing positive cases is costly.

<img src="images/module2-recall-score.png" alt="Recall score calculation" width="500" />

We can obtain a recall score using `recall_score`.

In [None]:
from sklearn.metrics import recall_score

recall_score(y_test, predicted)

Recall answers "Of all the actual X, how many did we identify?" It's a useful metric for a customer churn model. You are willing to overplease a few satisfied customers to make sure you don’t miss any who are about to leave.
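
As a worked example, precision and recall can be computed side by side from the same confusion matrix counts. The churn figures and helper names below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def precision_from_counts(tp, fp):
    """Share of positive predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall_from_counts(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical churn model: 40 customers actually churn, 50 are flagged
tp, fp, fn = 30, 20, 10

churn_precision = precision_from_counts(tp, fp)  # 30 of the 50 flagged really churn
churn_recall = recall_from_counts(tp, fn)  # 30 of the 40 churners were flagged

print(churn_precision, churn_recall)
```

Flagging more customers would tend to raise recall (fewer churners missed) at the expense of precision (more satisfied customers flagged).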

Is recall a useful metric in your industry?

As with accuracy, recall scores are calculated using predicted classes, rather than probabilities, so the same concerns apply.

### F1 score

The F1 score combines precision and recall into a single metric. It is more robust than accuracy when the classes are unbalanced.

It's calculated by taking the harmonic mean of the precision and recall scores.

$$
F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}
$$
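
To see why the harmonic mean is used, compare it with the ordinary (arithmetic) mean for a lopsided model. The scores below are invented for illustration. The harmonic mean stays close to the weaker of the two scores, so a model cannot earn a good F1 score by excelling at only one of them.

```python
def f1_from_scores(precision_value, recall_value):
    """Harmonic mean of precision and recall."""
    if precision_value + recall_value == 0:
        return 0.0

    return 2 * precision_value * recall_value / (precision_value + recall_value)


# A hypothetical model that is always right when it predicts positive,
# but finds only 10% of the actual positives
lopsided_precision, lopsided_recall = 1.0, 0.1

arithmetic_mean = (lopsided_precision + lopsided_recall) / 2
harmonic_mean = f1_from_scores(lopsided_precision, lopsided_recall)

print(arithmetic_mean, harmonic_mean)
```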

In [None]:
precision = precision_score(y_test, predicted)
recall = recall_score(y_test, predicted)

2 * (precision * recall) / (precision + recall)

We can calculate this directly using `f1_score`.

In [None]:
from sklearn.metrics import f1_score

f1_score(y_test, predicted)

As with recall and precision, F1 scores are calculated using predicted classes, rather than probabilities, so the same concerns apply.

### Classification report

You can get accuracy, precision score, recall score, F1 score and other information from a classification report.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, predicted, target_names=["DS", "SE"]))

The macro average calculates the unweighted mean of metrics like precision, recall, and F1 score across all classes. This means each class contributes equally to the final average, regardless of its frequency in the dataset, making macro averages especially useful for understanding overall model performance in imbalanced datasets.

The weighted average calculates the mean of metrics like precision, recall, and F1 score, but it weights each class's contribution according to its support (i.e. the number of true instances of each class). This approach provides a more balanced view that accounts for class distribution, making it helpful in assessing predictive accuracy in imbalanced datasets where more frequent classes might dominate the overall metrics.
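
The difference between the two averages is easiest to see with numbers. The per-class F1 scores and supports below are made up (they are not taken from the survey model) and describe a deliberately imbalanced dataset.

```python
# Hypothetical per-class F1 scores and supports (number of true instances)
per_class_f1 = {"DS": 0.9, "SE": 0.5}
support = {"DS": 900, "SE": 100}

# Macro average: every class counts equally
macro_average = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted average: every class counts in proportion to its support
weighted_average = sum(
    per_class_f1[label] * support[label] for label in per_class_f1
) / sum(support.values())

print(macro_average, weighted_average)
```

The weak score on the rare class pulls the macro average down much further than the weighted average, which is dominated by the frequent class.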

### Receiver Operating Characteristic and Area Under Curve

Receiver Operating Characteristic (ROC) curves were used in the Second World War to assess radar systems. They allowed the study of tradeoffs between actual contacts and ghost images.

Unlike the other metrics we've looked at so far, ROC curves consider the probability of predictions via the decision threshold (the point at which a probability is considered to move from negative to positive).

An ROC curve is a graphical representation that shows the performance of a binary classifier across various threshold settings. It plots the true positive rate (recall) against the false positive rate, illustrating the trade-off between correctly identifying positives and avoiding false positives as the decision threshold changes.

The true positive rate (TPR) is calculated as follows.

$$
TPR = \frac{TP}{TP + FN}
$$

The false positive rate (FPR) is calculated as follows.

$$
FPR = \frac{FP}{FP + TN}
$$

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

y_scores = xgb_pipeline.predict_proba(X_test_df.to_pandas())[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_scores)

plt.plot(fpr, tpr, label="ROC curve")
plt.plot(
    [0, 1], [0, 1], "k--", label="Random guess"
)  # Dashed diagonal for random guessing
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="best")
plt.show()

The ROC curve illustrates how the model trades off identifying true positives against raising false positives. Good models pull the curve towards the top-left of the chart.

The chart shows how many true positives we give up as we raise the decision threshold to reduce the number of false positives.

We can calculate the ROC curve manually to see how it's constructed.

In [None]:
probability_df = pl.DataFrame(
    {
        "actual": y_test,
        "score": y_scores,  # Predicted probability of the positive class
    }
)

points = []

for i in range(0, 1001):
    decision_threshold = i / 1000

    # Classify as positive when the score reaches the decision threshold
    threshold_df = probability_df.with_columns(
        pl.when(pl.col("score") < decision_threshold)
        .then(pl.lit(0))
        .otherwise(pl.lit(1))
        .alias("predicted")
    )

    tn, fp, fn, tp = confusion_matrix(
        y_test, threshold_df.get_column("predicted"), labels=[0, 1]
    ).ravel()

    # Use distinct names so the arrays returned by roc_curve are not overwritten
    point_fpr = fp / (fp + tn)
    point_tpr = tp / (tp + fn)

    points.append((point_fpr, point_tpr))

(
    pl.DataFrame(
        {
            "fpr": [point[0] for point in points],
            "tpr": [point[1] for point in points],
        }
    ).plot.point(
        x="fpr",
        y="tpr",
    )
)

We can compute a single index from the ROC curve: the Area Under Curve (AUC). It's pretty much as described, the area under the curve. As both axes of the chart run from 0 to 1, the AUC falls between 0 and 1. A model that guesses at random scores about 0.5 (the dashed diagonal), and a perfect model scores 1.
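
The area can be approximated with the trapezoid rule. This is a minimal sketch using hand-picked ROC points rather than the survey model's curve, and `area_under_curve` is a hypothetical helper.

```python
def area_under_curve(fpr_points, tpr_points):
    """Approximate the area under a curve using the trapezoid rule.

    The points must be ordered by increasing false positive rate.
    """
    area = 0.0

    for i in range(1, len(fpr_points)):
        width = fpr_points[i] - fpr_points[i - 1]
        mean_height = (tpr_points[i] + tpr_points[i - 1]) / 2
        area += width * mean_height

    return area


# A hand-picked ROC curve for illustration
example_fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
example_tpr = [0.0, 0.5, 0.5, 1.0, 1.0]

example_auc = area_under_curve(example_fpr, example_tpr)

# The diagonal "random guess" line has an area of 0.5
random_guess_auc = area_under_curve([0.0, 1.0], [0.0, 1.0])

print(example_auc, random_guess_auc)
```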

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_scores)

### Precision-Recall curve

We can also plot the precision score against the recall score to examine the tradeoffs between these competing metrics.

In [None]:
from yellowbrick.classifier import precision_recall_curve

precision_recall_curve(XGBClassifier(), X_train_df_, y_train, X_test_df_, y_test);

If we want to avoid false positives, we can try to improve the precision score, but we see that this will have a negative impact on our recall performance.

### Log loss

Log loss quantifies the accuracy of a classifier by comparing the predicted probability $p_{i}$ for each instance with the actual label $y_{i}$. For a binary classification problem, the formula for log loss is:

$$
\text{Log Loss} = - \frac{1}{N} \sum_{i=1}^{N} \left( y_{i} \log(p_{i}) + (1 - y_{i}) \log(1 - p_{i}) \right)
$$

Where

- $N$ is the number of observations
- $y_{i}$ is the class label for observation $i$
- $p_{i}$ is the predicted probability of observation $i$ being in the positive class

In [None]:
from sklearn.metrics import log_loss

# Log loss needs predicted probabilities (y_scores), not predicted classes
log_loss(y_test, y_scores)

The value of log loss isn't very instructive on its own. It's a relative metric. Better models have lower log loss values. A perfect classifier would have a log loss of 0.

As log loss uses the _probabilities_ of the predictions, it utilises more information than class-based metrics. However, it is much less intuitive than most of the other metrics.
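
The formula can be followed by hand in plain Python. The probabilities are invented for illustration, and this sketch does no clipping, so every probability must lie strictly between 0 and 1. Both models below classify every observation correctly at a 0.5 threshold, yet the hesitant model has the higher (worse) log loss.

```python
import math


def binary_log_loss(actual, probabilities):
    """Mean negative log-likelihood of the actual labels."""
    total = 0.0

    for y, p in zip(actual, probabilities):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)

    return -total / len(actual)


actual = [1, 0, 1, 0]

# Predicted probability of the positive class from two hypothetical models
confident_loss = binary_log_loss(actual, [0.9, 0.1, 0.8, 0.2])
hesitant_loss = binary_log_loss(actual, [0.6, 0.4, 0.55, 0.45])

print(confident_loss, hesitant_loss)
```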

### Cumulative gains curve

Cumulative gains curves are used to assess how well a model distinguishes between positive and negative instances.

In the context of marketing, for example, a cumulative gains curve helps determine how many customers are likely to respond to a campaign if targeted based on model predictions.

The curve plots the cumulative (ordered) proportion of positive outcomes (like responders) on the y-axis against the cumulative proportion of the population on the x-axis. A steeper curve indicates a stronger model, showing that a higher proportion of positives are captured within a smaller portion of the population.

The random baseline reflects what would happen without the model, serving as a benchmark. An ideal model would have a high initial slope, capturing many positives early, and then flattening as the remaining instances become less likely to be positive.
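
The construction of the curve can be sketched in plain Python. The labels and scores are made up, and `cumulative_gains` is a hypothetical helper. It ranks observations from the highest to the lowest predicted score and tracks the share of positives captured so far.

```python
def cumulative_gains(actual, scores):
    """Return (population_share, positives_captured_share) points."""
    # Rank observations from the highest to the lowest predicted score
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(actual)

    points = []
    captured = 0

    for rank, (_, label) in enumerate(ranked, start=1):
        captured += label
        points.append((rank / len(ranked), captured / total_positives))

    return points


actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.05, 0.35]

gains = cumulative_gains(actual, scores)

for population_share, captured_share in gains:
    print(population_share, captured_share)
```

In this toy data, all of the positives fall within the top 40% of ranked observations, so the curve reaches 1.0 at a population share of 0.4.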

In [None]:
from scikitplot.metrics import plot_cumulative_gain

plot_cumulative_gain(y_test, xgb_pipeline.predict_proba(X_test_df.to_pandas()));

Looking at the chart, the 40% of the population who are "most obviously" software engineers represent 60% of all instances of software engineers.

### Lift curves

Lift curves show the improvement (or "lift") that a model provides over random selection.

The curve plots the lift factor on the y-axis against the proportion of the population or sample on the x-axis (as with the cumulative gains curve). At any given point on the curve, the lift factor represents how much better the model is at capturing positive instances compared to a random baseline. 

For example, a lift of 1.8 at 5% of the population means that, in the top 5% of ranked predictions, the model captures 1.8 times as many positive cases as would be expected by random chance. A strong model will have a high lift at the beginning of the curve, showing it can concentrate positive outcomes in the top ranks, and the curve will typically decline as more of the population is included, eventually converging to a lift of 1, where it performs no better than random selection.
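
As a sketch of the calculation, lift at a given population share can be computed as the positive rate among the top-ranked observations divided by the overall positive rate. The data and the `lift_at` helper are hypothetical.

```python
def lift_at(actual, scores, population_share):
    """Lift within the top `population_share` of observations ranked by score."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(population_share * len(ranked)))

    top_positive_rate = sum(label for _, label in ranked[:n_top]) / n_top
    overall_positive_rate = sum(actual) / len(actual)

    return top_positive_rate / overall_positive_rate


actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.05, 0.35]

lift_top_20 = lift_at(actual, scores, 0.2)
lift_everyone = lift_at(actual, scores, 1.0)

print(lift_top_20, lift_everyone)
```

Once the whole population is included, the top-ranked group is the population itself, so the lift converges to 1.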

In [None]:
from scikitplot.metrics import plot_lift_curve

plot_lift_curve(y_test, xgb_pipeline.predict_proba(X_test_df.to_pandas()));

## Cross-validation

Cross-validation allows us to assess how well a model generalises to an independent dataset. It helps to detect overfitting. In cross-validation, the data is split into multiple subsets or "folds," and the model is trained and evaluated on these folds.

It works as follows.

1. The data is divided into $k$ subsets (folds) of equal or near-equal size. Commonly, $k$ is set to 5 or 10.

2. The model is trained on $k - 1$ of the folds and tested on the remaining fold. This process is repeated $k$ times, each time using a different fold for testing and the rest for training.

3. After all $k$ iterations, the evaluation metrics (like accuracy) are averaged to get an overall estimate of the model's performance.
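
The fold bookkeeping in steps 1 and 2 can be sketched in plain Python. This is an illustration only: the `k_fold_indices` helper is hypothetical, it uses contiguous folds without shuffling, and it is not scikit-learn's implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // k + (1 if fold < n_samples % k else 0) for fold in range(k)
    ]

    start = 0

    for fold_size in fold_sizes:
        stop = start + fold_size
        test_indices = list(range(start, stop))
        train_indices = [i for i in range(n_samples) if i < start or i >= stop]

        yield train_indices, test_indices

        start = stop


folds = list(k_fold_indices(n_samples=10, k=3))

for train_indices, test_indices in folds:
    print(train_indices, test_indices)
```

Every observation appears in exactly one test fold, and no observation is in both the training and test indices of the same iteration.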

<img src="images/module2-cross-validation.png" alt="Cross validation" width="800" />

There are different forms of cross-validation.

- $k$-fold cross-validation is the standard approach, where $k$ is the number of folds.
- Leave-One-Out Cross-Validation (LOOCV) is a special case of $k$-fold where $k$ equals the number of data points. Each iteration uses all data points except one for training, with the single remaining point for testing.
- Stratified $k$-fold cross-validation is $k$-fold that maintains the distribution of target variable classes across folds, which is useful for imbalanced datasets.

Cross-validation provides a more robust measure of model performance compared to a single training-test split, as it ensures the model is evaluated across multiple subsets of the data. This leads to better insights into how the model will perform in real-world scenarios.

We can conduct a 5-fold cross-validation on the survey pipeline, using F1 scores as the evaluation metric.

In [None]:
from sklearn.model_selection import cross_val_score

y = LabelEncoder().fit_transform(y_df.get_column("title"))

scores = cross_val_score(xgb_pipeline, X_df.to_pandas(), y, cv=5, scoring="f1")

We have $k = 5$ metrics, one for each fold.

In [None]:
scores

These can be combined as required.

In [None]:
np.mean(scores)

## Over and underfitting

A learning curve visualisation is a graphical representation that shows the model's performance as a function of the training data size. It helps track how well a model is learning and generalising by plotting metrics against the number of training samples.

<img src="images/module2-learning-curve.png" alt="Model fitting" width="800" />

When both the training and test curves have low accuracy, this indicates that the model is too simple (underfitting). If the training curve shows high accuracy, but the test curve has low accuracy, this indicates overfitting.

Good models tend to have both curves converging to a high accuracy.

In [None]:
from yellowbrick.model_selection import learning_curve

learning_curve(xgb_pipeline, X_df.to_pandas(), y);

If you are using a white box modelling approach, then a very complicated model can also be a sign of overfitting.

## Hyperparameters

Hyperparameters are settings that define how the model is trained rather than the values learned by the model itself. Unlike parameters (weights and biases) that are adjusted during the learning process, hyperparameters are set _before_ the training starts and remain fixed unless manually tuned.

Hyperparameters are a key tool for controlling over and underfitting.

We can view all the hyperparameters for a scikit-learn-compatible estimator.

In [None]:
XGBClassifier().get_params()

Hyperparameters commonly used to tune XGBoost classifiers include

- `early_stopping_rounds`: Stop if we go this number of rounds without an improvement
- `learning_rate`: After each boosting round, multiply the weights by this value. Falls in the (0, 1] range. Lower is more conservative and generally requires more estimators.
- `max_depth`: Maximum depth of each tree
- `n_estimators`: Number of trees
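
The early stopping rule can be illustrated with a short plain-Python sketch. This is not XGBoost's code: the `rounds_before_early_stop` helper and the validation scores are hypothetical, and the sketch assumes that higher scores are better.

```python
def rounds_before_early_stop(validation_scores, early_stopping_rounds):
    """Return the number of rounds run before stopping (higher is better)."""
    best_score = float("-inf")
    rounds_without_improvement = 0

    for round_number, score in enumerate(validation_scores, start=1):
        if score > best_score:
            best_score = score
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1

        # Stop once we have gone too many rounds without a new best score
        if rounds_without_improvement >= early_stopping_rounds:
            return round_number

    return len(validation_scores)


# Hypothetical validation score after each boosting round
scores_per_round = [0.70, 0.74, 0.76, 0.75, 0.76, 0.75, 0.74, 0.73]

stopped_at = rounds_before_early_stop(scores_per_round, early_stopping_rounds=3)
print(stopped_at)
```

The best score is reached in round 3. Rounds 4, 5 and 6 fail to beat it, so training stops after round 6 rather than running all 8 rounds.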

We saw earlier that our model appeared to be overfitting. Let's restrict the depth of the trees to see if that helps.

Storing hyperparameters in a dictionary makes it easy to group and reuse them.

In [None]:
params = {
    "max_depth": 2,
}

classifier = XGBClassifier(**params)
classifier.fit(X_train_df_, y_train)

classifier.score(X_test_df_, y_test)

That's an improvement over the default settings.

Review the learning curve for this model.

In [None]:
X_df_ = data_preperation_pipeline.fit_transform(X_df.to_pandas())

learning_curve(classifier, X_df_.to_pandas(), y);

Again, an improvement over the default settings.

We can also look at a baseline model as a background to our experiments.

In [None]:
params = {
    "max_depth": 1,
    "n_estimators": 1,
}

classifier = XGBClassifier(**params)
classifier.fit(X_train_df_, y_train)

classifier.score(X_test_df_, y_test)

Which feature is the classifier using?

In [None]:
np.array(X_test_df_.columns)[classifier.feature_importances_ == 1]

## Validation datasets

When applying different hyperparameters, there's a risk of "hyperparameter hacking", i.e. changing the hyperparameters until you find a set that happens to work well with the specific test data you have. This will result in poor performance when applied to real-world data.

To prevent this, we can split our data into train, test and _validation_ sets. Tuning is done against the validation data, with the final parameters being checked against the test data.

In [None]:
def train_validate_test_split(
    X_df, y_df, validate_size=0.25, test_size=0.25, random_state=None, stratify=None
):
    X_remainder_df, X_test_df, y_remainder_df, y_test_df = (
        model_selection.train_test_split(
            X_df,
            y_df,
            test_size=test_size,
            random_state=random_state,
            stratify=stratify,
        )
    )

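    # The original `stratify` array describes the full dataset, so it cannot be
    # reused here. Assuming stratification is on the target, stratify the second
    # split on the remaining target values instead.
    X_train_df, X_validate_df, y_train_df, y_validate_df = (
        model_selection.train_test_split(
            X_remainder_df,
            y_remainder_df,
            test_size=validate_size / (1 - test_size),
            random_state=random_state,
            stratify=y_remainder_df if stratify is not None else None,
        )
    )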

    return X_train_df, X_validate_df, X_test_df, y_train_df, y_validate_df, y_test_df
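
The `validate_size / (1 - test_size)` expression is worth a closer look. The sketch below, which uses a hypothetical list of 1,000 row indices in place of real data, walks through the arithmetic with scikit-learn's `train_test_split`.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 row indices standing in for a real dataset.
rows = list(range(1000))

test_size = 0.25
validate_size = 0.25

# First split: hold out 25% of all rows for testing.
remainder, test = train_test_split(rows, test_size=test_size, random_state=0)

# Second split: the remainder is only 75% of the data, so taking 25% of the
# original rows means taking 0.25 / 0.75 (one third) of the remainder.
train, validate = train_test_split(
    remainder, test_size=validate_size / (1 - test_size), random_state=0
)

split_sizes = (len(train), len(validate), len(test))
print(split_sizes)
```

Without the rescaling, the validation set would be 25% of the remainder, which is under 19% of the original data.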

When using cross-validation, we don't need a validation dataset. If you have a small dataset, cross-validation will make better use of the data when performing model tuning.
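
As a rough illustration, the sketch below keeps only a train/test split and uses `cross_val_score` on the training data in place of a separate validation set. The synthetic data and the `DecisionTreeClassifier` are stand-ins (assumptions) so that the example is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each of the 5 folds takes a turn as the validation data, so every training
# row is used for both fitting and validation.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=2, random_state=0), X_train, y_train, cv=5
)

print(cv_scores.mean(), cv_scores.std())
```

The test data is still held back and is only used once tuning is complete.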

## Model tuning

Looking for signs of over or underfitting, and guesstimating a better hyperparameter, is very "hit and miss". It would help to have a more formal approach.

We can _search_ the hyperparameter space using a number of different approaches.

Note that we still need to use our judgement and intuition. Searching for hyperparameters is expensive, and we need to decide what, where and how to conduct the search.

### Exhaustive grid search

What is the best value for `max_depth`? Setting it to 2 seemed to improve the model. But would 3 have been even better? Or 1? Or 5?

We can have the computer try all the different values and pick the best one.

In [None]:
scores = {}

for max_depth in range(1, 11):
    params = {
        "max_depth": max_depth,
    }

    classifier = XGBClassifier(**params, random_state=SEED)
    classifier.fit(X_train_df_, y_train)

    scores[max_depth] = classifier.score(X_test_df_, y_test)

scores

In [None]:
(
    pl.DataFrame(
        {
            "max_depth": scores.keys(),
            "score": scores.values(),
        }
    ).plot.line(
        x="max_depth",
        y="score",
    )
)

We can see that the optimal maximum depth is 3, and we start to see overfitting beyond that.

Note that I should strictly be using a validation set here, but we don't need it to demonstrate the concepts.
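
For reference, a stricter version of this sweep might look like the sketch below: pick `max_depth` using a validation set, then score the chosen model once on the test set. It assumes synthetic data and a `DecisionTreeClassifier` stand-in so that it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# 50% train, 25% validate, 25% test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=1 / 3, random_state=0, stratify=y_rest
)

# Tune against the validation data only.
validation_scores = {}

for max_depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(X_train, y_train)
    validation_scores[max_depth] = model.score(X_val, y_val)

best_depth = max(validation_scores, key=validation_scores.get)

# Touch the test data once, with the chosen hyperparameter.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
test_score = final_model.fit(X_train, y_train).score(X_test, y_test)

print(best_depth, test_score)
```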

But this is only one hyperparameter. What about the others?

We can search through multiple parameter combinations using an exhaustive grid search.

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.3, 0.5],
    "max_depth": [1, 2, 3, 5, 10],
    "n_estimators": [100, 500],
}

xgb_classifer = XGBClassifier(early_stopping_rounds=5, random_state=SEED)

cv = GridSearchCV(xgb_classifer, param_grid, cv=5).fit(
    X_train_df_, y_train, eval_set=[(X_test_df_, y_test)], verbose=None
)

In [None]:
cv.best_params_, cv.best_score_

We can plug these values back into a classifier.

In [None]:
XGBClassifier(**cv.best_params_, random_state=SEED).fit(X_train_df_, y_train).score(
    X_test_df_, y_test
)

The scores may not match: `best_score_` is the mean cross-validated score over folds of the training data, whereas this score is calculated on the held-out test data.
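
To see where the best score comes from, we can inspect the `cv_results_` attribute of a fitted search. This is a minimal sketch on synthetic data with a `DecisionTreeClassifier` stand-in (both assumptions), not the survey model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3, 5, 10]},
    cv=5,
).fit(X, y)

results = search.cv_results_

# One row per candidate: the mean and spread of its scores across the folds.
for params, mean, std in zip(
    results["params"], results["mean_test_score"], results["std_test_score"]
):
    print(params, round(mean, 3), round(std, 3))

# best_score_ is the highest of the mean cross-validated scores.
print(search.best_params_, search.best_score_)
```

The standard deviations are useful context. If two candidates differ by less than the spread across folds, the "winner" may be down to chance.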

### Randomized parameter optimization

Exhaustive grid search is straightforward, but it can be computationally expensive when searching over a large parameter space.

Randomized parameter optimization randomly samples hyperparameter candidates from distributions. This means that you aren't testing every combination, but it lets you explore a wider parameter space efficiently.
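
To get a feel for the difference in cost, the sketch below counts the candidates in a small grid using scikit-learn's `ParameterGrid`, and draws a fixed number of candidates from much wider distributions using `ParameterSampler`. The five folds are an assumption matching `cv=5`.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

grid = {
    "learning_rate": [0.3, 0.5],
    "max_depth": [1, 2, 3, 5, 10],
    "n_estimators": [100, 500],
}

# An exhaustive search fits every combination on every fold.
n_grid_candidates = len(ParameterGrid(grid))
n_grid_fits = n_grid_candidates * 5

distributions = {
    "learning_rate": uniform(),  # continuous, so no finite grid exists
    "max_depth": list(range(1, 11)),
    "n_estimators": list(range(50, 1000, 50)),
}

# A randomized search fits a fixed budget of sampled candidates.
sampled = list(ParameterSampler(distributions, n_iter=10, random_state=0))

print(n_grid_candidates, n_grid_fits, len(sampled))
print(sampled[0])
```

The cost of the randomized search is set by `n_iter`, regardless of how wide the distributions are.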

In [None]:
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    "learning_rate": uniform(),
    "max_depth": list(range(1, 11)),
    "n_estimators": list(range(50, 1000, 50)),
}

cv = RandomizedSearchCV(
    xgb_classifer,
    param_grid,
    cv=5,
    n_iter=10,
    random_state=SEED,
).fit(X_train_df_, y_train, eval_set=[(X_test_df_, y_test)], verbose=None)

In [None]:
cv.best_params_, cv.best_score_

### Successive halving

Successive halving evaluates candidate parameter combinations using a limited number of resources (e.g. observations). It then takes a number of the winners from that round, and gives them more resources. This process continues until we have a single winner.

The resource increase should be large enough that genuine improvements in scores outweigh differences due to random variation.

scikit-learn has two successive halving estimators: [`HalvingGridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV) and [`HalvingRandomSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingRandomSearchCV.html#sklearn.model_selection.HalvingRandomSearchCV).
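
The round-by-round schedule can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: `halving_schedule` is a hypothetical helper, and it ignores details such as a cap on the total resources available.

```python
def halving_schedule(n_candidates, min_resources, factor=3):
    """Return (candidates, resources per candidate) for each round."""
    schedule = []
    resources = min_resources

    while True:
        schedule.append((n_candidates, resources))

        if n_candidates <= 1:
            break

        # Keep the best 1/factor of the candidates (rounding up)...
        n_candidates = -(-n_candidates // factor)
        # ...and give each survivor factor times as many resources.
        resources *= factor

    return schedule


schedule = halving_schedule(n_candidates=27, min_resources=100, factor=3)

for round_number, (candidates, resources) in enumerate(schedule):
    print(round_number, candidates, resources)
```

Most candidates are discarded cheaply in the early rounds, and only the final few are evaluated on a large share of the data.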

In [None]:
from scipy.stats import uniform
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV

param_grid = {
    "learning_rate": uniform(),
    "max_depth": list(range(1, 11)),
    "n_estimators": list(range(50, 1000, 50)),
}

cv = HalvingRandomSearchCV(
    xgb_classifer,
    param_grid,
    cv=5,
    random_state=SEED,
).fit(X_train_df_, y_train, eval_set=[(X_test_df_, y_test)], verbose=None)

In [None]:
cv.best_params_, cv.best_score_

### Hyperopt

Hyperopt uses Bayesian optimisation to search the parameter space, concentrating its trials on the regions that have looked most promising so far. This often finds good parameters in fewer evaluations than grid or random search, but it can get stuck in suboptimal areas.

In [None]:
from hyperopt import Trials, fmin, hp, tpe
from sklearn.model_selection import cross_val_score


def objective(space):
    classifier = XGBClassifier(
        learning_rate=space["learning_rate"],
        n_estimators=space["n_estimators"],
        max_depth=int(space["max_depth"]),
    )

    # cross_val_score clones and fits the estimator on each fold, so there is
    # no need to fit it beforehand.
    accuracies = cross_val_score(estimator=classifier, X=X_train_df_, y=y_train, cv=5)

    # fmin minimises the objective, so return the error rate.
    return 1 - accuracies.mean()


space = {
    "learning_rate": hp.quniform("learning_rate", 0.01, 0.5, 0.01),
    "max_depth": hp.choice("max_depth", range(2, 10, 1)),
    "n_estimators": hp.choice("n_estimators", range(20, 205, 5)),
}

trials = Trials()

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10, trials=trials)

In [None]:
best

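In [None]:
from hyperopt import space_eval

# fmin returns hp.choice parameters as indices; space_eval maps them to values.
XGBClassifier(**space_eval(space, best)).fit(X_train_df_, y_train).score(
    X_test_df_, y_test
)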

## Hands-on example of evaluating and tuning an ML model

In this hands-on section you will work with the Lending Club dataset.

In [None]:
lending_club_df = pl.read_parquet("data/lending-club-sample-preprocessed.parquet")

You will

- Review the different evaluation scores
- Display a learning curve to see if the model is under or overfitting
- Try different hyperparameters
- Tune the model by searching the parameter space

In [None]:
from sklearn.preprocessing import StandardScaler

(
    lending_club_feature_train_df,
    lending_club_feature_validate_df,
    lending_club_feature_test_df,
    lending_club_target_train_df,
    lending_club_target_validate_df,
    lending_club_target_test_df,
) = train_validate_test_split(
    lending_club_df.drop("fully_paid"),
    lending_club_df.select("fully_paid"),
    validate_size=0.10,
    test_size=0.10,
    random_state=SEED,
)

scaler = StandardScaler()

# Fit the scaler on the training data only, then reuse the same scaling for
# the validation and test data to avoid leaking information from them.
lc_X_train = scaler.fit_transform(lending_club_feature_train_df)

lc_y_train = lending_club_target_train_df.get_column("fully_paid")

lc_X_validate = scaler.transform(lending_club_feature_validate_df)

lc_y_validate = lending_club_target_validate_df.get_column("fully_paid")

lc_X_test = scaler.transform(lending_club_feature_test_df)

lc_y_test = lending_club_target_test_df.get_column("fully_paid")

In [None]:
lending_club_classifier = XGBClassifier(random_state=SEED)

lending_club_classifier.fit(
    lc_X_train,
    lc_y_train,
)

In [None]:
lending_club_classifier.score(lc_X_validate, lc_y_validate)

Examine some of the other evaluation metrics. How do they compare to the accuracy score?

In [None]:
# Examine evaluation metrics

What is a better metric for this model---precision or recall? Why?

Is the model focusing on the best metric?

In [None]:
# Display the precision and recall metrics

Review the ROC curve.

In [None]:
# Display the ROC curve

Review the learning curve for this model.

In [None]:
# Display the learning curve

Is the model overfitting or underfitting?

Review the XGBoost hyperparameters.

In [None]:
# Display the estimator's parameters

Train a new model, with a new hyperparameter, to improve the accuracy.

In [None]:
# Train a new model, updating a hyperparameter

Did it improve the accuracy score?

In [None]:
# Display the accuracy score

Perform an exhaustive grid search and obtain the optimal parameters.

In [None]:
# Perform an exhaustive grid search

Did your search improve the accuracy score?

Create a new model with these optimal parameters.

In [None]:
# Create a new model with the optimal parameters 

Evaluate this model using the _test_ data.

In [None]:
# Fit and evaluate the new model

Did it perform well on the test data?