<a href="https://colab.research.google.com/github/dxcim/Business-Analytics-Foundations/blob/main/m4_4_partitioning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Module 4 | Pattern discovery**

`m4_4_partitioning.ipynb` | 2025-03-25 11:36

# Evaluation fundamentals with Python

PyCaret handles a lot of the complexity of machine learning for you. It can automatically partition your data into training and test sets, and use the correct partition for each step of the modelling process.

Let's do another classification task: predicting customer churn. We'll use the same dataset as before, but this time we'll tell PyCaret how to partition the data, build and tune a model with cross-validation, and evaluate the final model on the test set and on holdout data.

## How to use this notebook


To run this notebook in Colab, choose **Runtime** from the top menu and then **Run all**. This will set up the notebook and then run all the cells.

The first run may take a few minutes to install the required libraries and download the data. Subsequent runs will be faster.

You can also run the cells one by one using the play button next to each cell.

---

This section of the notebook contains code to set up the notebook environment. It installs the required libraries, downloads the data, and sets the display style for charts.

After this section of the notebook runs successfully, you can hide the cells in this section. To do this in Colab, choose **View** from the top menu, then **Collapse sections**, or click the downward chevron ⌄ next to the section title.

You do not need to understand the code in the "How to use this notebook" section to follow the rest of the notebook.


In [1]:
# install PyCaret
# if prompted by Colab, restart the runtime after installing: Runtime -> Restart session

%pip install --upgrade --quiet pycaret


In [2]:
# download data (skipped if the file is already present)

from urllib.request import urlretrieve
from pathlib import Path

# use one path for both the existence check and the download target
data_path = Path("/content/customer_churn_numeric.csv")

if not data_path.exists():
    urlretrieve("https://canvas.uts.edu.au/files/8948624/download?download_frd=1&verifier=9uzGTenHhCcY6WWUfonlnFDwCpeqpmAigJzKsp4U", str(data_path))

In [5]:
!pip install --upgrade --no-cache-dir numpy pandas seaborn


In [8]:
# format figures for display in Canvas

import seaborn as sns

sns.set_theme(style="white", rc={"figure.figsize": (12, 6)})
sns.set_context("notebook", font_scale=1.25, rc={"lines.linewidth": 2.5})

# in Colab, if this cell fails with
# "ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject"
# first try restarting the runtime.
# Runtime menu -> Restart session, or press Ctrl-M then . (period) and confirm. Then re-run this cell.
# For Mac-using friends, the keyboard shortcut is ⌘-M then . (period).

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

## Load the data and set up the PyCaret session

In [9]:
import pandas as pd

# load the data
df = pd.read_csv('/content/customer_churn_numeric.csv')


For this example there is no separate file containing holdout data, so we'll set aside 20% of the rows ourselves before PyCaret sees the data.

In [None]:
# manually keep 20% holdout data

holdout = df.sample(frac=0.2, random_state=22804)
df.drop(holdout.index, inplace=True)
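
As a quick sanity check on the pattern above, here is a minimal sketch on a made-up 10-row table (the `toy` names and values are hypothetical, not the churn data). It confirms that sampling and then dropping by index leaves two parts that share no rows.

```python
import pandas as pd

# Stand-in for the churn data: 10 rows with a default integer index.
toy = pd.DataFrame({"x": range(10), "Churn": [0, 1] * 5})

# Same pattern as the cell above: sample 20%, then drop those rows.
toy_holdout = toy.sample(frac=0.2, random_state=22804)
toy_rest = toy.drop(toy_holdout.index)

# The two parts should share no rows and should add up to the original.
overlap = toy_holdout.index.intersection(toy_rest.index)
print(len(toy_holdout), len(toy_rest), len(overlap))
```

This relies on the index labels being unique. If the real data had duplicate index labels, `drop` would remove every row carrying a sampled label.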

This time we will specify how much of the data to use for training, and how to partition the data for cross-validation. We'll use 60% of the remaining data for training (PyCaret keeps the other 40% as its test set), and 5-fold cross-validation on the training data.

In [None]:
from pycaret.classification import *

# set up a PyCaret classification session to predict customer churn
# specify 60% training data and 5-fold cross-validation
setup(data=df,
      target='Churn',    # target variable
      train_size=0.6,    # 60% training
      fold=5,            # 5-fold cross-validation
      session_id=22804)  # For reproducibility
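
To make the partition sizes concrete, here is a small arithmetic sketch. The row count of 1000 is a hypothetical round number, not the size of the churn dataset, and the exact rounding PyCaret applies may differ by a row.

```python
# Hypothetical row count, chosen only to make the arithmetic easy to follow.
n_rows = 1000

holdout_frac = 0.2   # set aside manually with df.sample(frac=0.2)
train_frac = 0.6     # train_size passed to setup()
n_folds = 5          # fold passed to setup()

n_holdout = round(n_rows * holdout_frac)
n_remaining = n_rows - n_holdout            # what PyCaret receives
n_train = round(n_remaining * train_frac)
n_test = n_remaining - n_train
n_validation_per_fold = n_train // n_folds  # rows held back in each CV fold

print(f"holdout: {n_holdout}, train: {n_train}, test: {n_test}")
print(f"validation rows per fold: {n_validation_per_fold}")
print(f"training share of the original data: {n_train / n_rows:.0%}")
```

Note that the training partition is 60% of what is left after the holdout, which works out to 48% of the original rows.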

## Build and evaluate the model

To continue exploring, we'll choose a specific model. We'll use a Random Forest model this time.

The model performance table has a row for each fold of cross-validation, followed by summary rows giving the mean and standard deviation across all folds. Area under curve (AUC) is a common metric for classification tasks. It ranges from 0 to 1, with higher values indicating better performance; a value of 0.5 is what you would expect from random guessing.

In [None]:
# Create specific model

model = create_model('rf')  # Random Forest
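
To show where the per-fold rows come from, here is a minimal sketch that runs 5-fold cross-validation directly with scikit-learn. It assumes stratified folds, uses synthetic data from `make_classification` rather than the churn dataset, and is not PyCaret's internal code, so the numbers will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the training data (not the real churn dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One AUC per fold: fit on four folds, score on the remaining one.
fold_auc = cross_val_score(rf, X, y, cv=folds, scoring="roc_auc")

for i, auc in enumerate(fold_auc):
    print(f"Fold {i}: AUC = {auc:.3f}")
print(f"Mean: {fold_auc.mean():.3f}  Std: {fold_auc.std():.3f}")
```

Each row uses a different fifth of the training data for scoring, which is why the per-fold values vary and why the mean is the number usually quoted.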

_Area under curve_ hints that there's an actual curve to look at. Let's plot it.

The ROC curve is a plot of the true positive rate (sensitivity) against the false positive rate (1 - specificity) for different threshold values. The AUC is the area under the ROC curve, and it's a measure of how well the model can distinguish between classes.

ROC is short for "receiver operating characteristic". Knowing this will not help you understand ROC curves, but it might come up in pub trivia someday.

In [None]:
# show ROC curve

plot_model(model, plot='auc')
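
To see how the points on a ROC curve are produced, here is a minimal sketch in plain Python. The labels and scores are made-up values, and the helper functions `roc_points` and `area_under` are written for this illustration only; they are not part of PyCaret.

```python
# Toy labels (1 = churned) and predicted churn probabilities; made-up values.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]


def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs, one per threshold, from strictest to loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(y_score), reverse=True):
        predicted = [s >= threshold for s in y_score]
        tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


points = roc_points(y_true, y_score)
auc = area_under(points)
print(points)
print(auc)
```

Lowering the threshold can only add positive predictions, so both rates rise from (0, 0) to (1, 1). A model that ranks every churner above every non-churner would reach a true positive rate of 1 before any false positives appear, giving an area of 1.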

## Using cross-validation to improve the model

The cross-validation results show that the Random Forest model has a mean AUC of 0.83. This is a good starting point, but we can try to improve it by tuning hyperparameters.

In [None]:
# Tune hyperparameters using 5-fold cross-validation
# This uses the 60% training data sample made by PyCaret

tuned_model = tune_model(model)
# technical note: `tune_model` uses X_train for tuning (verified in the source code)
# by default it optimises Accuracy; pass optimize='AUC' to tune for AUC instead

In this case, tuning the model improved the AUC to 0.84. This is a small improvement, but it's better than nothing.
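
For a sense of what hyperparameter tuning with cross-validation involves, here is a rough scikit-learn sketch of a random search. It assumes a random search over a small grid; the search space, the synthetic data, and the variable names are illustrative choices, not PyCaret's own settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the churn data, split 60/40 like the session above.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0
)

# Illustrative search space (an assumption, not PyCaret's grid).
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # try 5 random combinations
    cv=5,                # score each one with 5-fold cross-validation
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_train, y_train)  # only the training rows are used for tuning

test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_)
print(f"Best cross-validated AUC: {search.best_score_:.3f}")
print(f"Test-set AUC: {test_auc:.3f}")
```

The test rows play no part in choosing the hyperparameters, which is what makes the test-set score a fair check on the tuned model.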

## Evaluating the model on test data

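It's time to evaluate the final model on the test set. When `predict_model` is called without a `data` argument, PyCaret scores the model we pass in (here, the tuned Random Forest) on the test set it set aside during `setup`.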

In [None]:
# Final evaluation on test set

test_predictions = predict_model(tuned_model)

The tuned model performs about as well on the test set as it did in cross-validation. This is a good result: it suggests the model is generalising well to data it was not trained on.

## Evaluating the model on holdout data

PyCaret has evaluated the model using the test set that it made. Now it's our turn to evaluate PyCaret using the holdout data that we made!

The model has never seen this data before, so it's a good test of how well it generalises to new data.

In [None]:
# evaluate the model on holdout data

_ = predict_model(tuned_model, data=holdout)
# sidenote: assigning to _ says "I don't care about this value"
# this is a common convention in Python
# we use it here so the predictions dataframe isn't printed to the output

## Exercises

1. How does the model performance on holdout data compare to the test set? What does this tell you about the model?
2. Try a different percentage split for training and test data. Does it make any difference?
3. Build a model using a different algorithm. How does it compare to the Random Forest model?