# Module 1: Supervised learning algorithms

In this module, we cover

- Overview of classification algorithms (e.g. logistic regression, decision trees, ensemble learning algorithms)
- Demonstration of how decision tree classifiers work
- Hands-on example of using XGBoost to classify observations 

The [notebooks](https://github.com/decisionmechanics/lt541v) for the course are available on GitHub. Clone or download them to follow along.

In this notebook, we make use of the following third-party packages.

```bash
pip install jupyterlab matplotlib numpy plotly 'polars[all]' scikit-learn xgboost
```

## Overview of classification algorithms

Classification algorithms attempt to assign each observation to one of a number of classes. Two-class (binary) classifiers are a special case.

For example, a facial recognition classifier may be used to classify people as employees or members of the public. A spam filter classifies e-mail messages as legitimate or spam.

Choosing a suitable classification algorithm is as much art as it is science. There are attempts to formalise the process (e.g. Microsoft's [Machine Learning Algorithm Cheat Sheet](https://docs.microsoft.com/en-us/azure/machine-learning/media/algorithm-cheat-sheet/machine-learning-algorithm-cheat-sheet.png)), but the correct choice depends on things like

- the task
- the data
- the skill of the team
- the budget
- the timeframe
                                                                                                                           
It's usually worth trying a range of approaches, if time and budget allow it.

Prepare the (scaled) training and test datasets. We'll use this data to explore the different algorithms.

The data will be partitioned into training and test datasets. The latter is used to evaluate the model.

In [None]:
import polars as pl
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

loan_approval_df = pl.read_csv("data/loan-approval-dataset.csv")

loan_approval_with_dummies_df = loan_approval_df.to_dummies(
    ["education", "self_employed"], drop_first=True
)

loan_approval_features_df = loan_approval_with_dummies_df.drop("loan_id", "loan_status")

loan_approval_target_df = loan_approval_df.select("loan_status")

SEED = 123

(
    loan_approval_feature_train_df,
    loan_approval_feature_test_df,
    loan_approval_target_train_df,
    loan_approval_target_test_df,
) = train_test_split(
    loan_approval_features_df,
    loan_approval_target_df,
    test_size=0.30,
    random_state=SEED,
)

scaler = StandardScaler()

loan_approval_feature_train_scaled_df = scaler.fit_transform(
    loan_approval_feature_train_df
)

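# Reuse the scaling parameters learned from the training data, so that
# no information from the test data leaks into the preprocessing
loan_approval_feature_test_scaled_df = scaler.transform(
    loan_approval_feature_test_df
)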

Do we have balanced target classes?

In [None]:
loan_approval_df.get_column("loan_status").value_counts(normalize=True)

## Baseline model

The distribution of the target classes gives us a baseline for the performance of our model. If we assume that loans are always approved, we have an accuracy of over 62%.

We may already have experts in place analysing loans. How accurate are they? More than 62%? If so, that might be our baseline.
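
The majority-class baseline can be sketched with scikit-learn's `DummyClassifier`. The labels below are made up (a 62/38 split chosen to mirror the figure quoted above) and are not read from the loan dataset, so treat this as an illustration rather than the course's own method.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 62 approvals and 38 rejections (illustrative only)
labels = np.array(["Approved"] * 62 + ["Rejected"] * 38)

# A dummy classifier ignores the features, so a column of zeros will do
features = np.zeros((len(labels), 1))

baseline_classifier = DummyClassifier(strategy="most_frequent")
baseline_classifier.fit(features, labels)

# Accuracy of always predicting the most frequent class
baseline_accuracy = baseline_classifier.score(features, labels)
print(baseline_accuracy)
```

Any model we train should beat this figure before we consider it useful.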

What about if we use credit score as an indicator?

In [None]:
(
    loan_approval_df
    .with_columns(
        (pl.col("cibil_score") >= 550).alias("good_credit_score")
    )
    .select(["good_credit_score", "loan_status"])
    .group_by(["good_credit_score", "loan_status"])
    .agg(
        pl.len().alias("count")
    )
)

In [None]:
(1600 + 2471) / len(loan_approval_df)

That gives us an accuracy of 95%.

### Logistic regression

Logistic regression is a statistical method used for binary classification tasks, where the goal is to predict a binary outcome (often coded as 0 or 1). Despite its name, logistic regression is used for classification problems, not regression problems.

It begins by calculating a weighted sum of the (input) features. This is similar to how linear regression works.

$$
    z = b + w_{1}x_{1} + w_{2}x_{2} + \cdots + w_{n}x_{n}
$$

$z$ is then passed through a sigmoid (logistic) function to map it to a number in the $(0, 1)$ range.

$$
    P(y=1|X) = \frac{1}{1 + e^{-z}}
$$

This represents the probability that the observation belongs to class 1 (e.g. "yes").

We can visualise this.

<img src="images/module1-logistic-regression.svg" alt="Logistic regression" width="200px" />
                                                                         
A threshold (often 0.5) is then chosen to classify the outcome.

- If $P(y=1|X) > 0.5$, the prediction is class 1
- If $P(y=1|X) \le 0.5$, the prediction is class 0
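
The three steps above (weighted sum, sigmoid, threshold) can be sketched in a few lines of NumPy. The bias, weights and observations are hypothetical values chosen for illustration. They are not fitted to the loan data.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical bias and weights for a two-feature model (illustrative only)
b = -1.0
w = np.array([2.0, -0.5])

# Three made-up observations, one per row
X = np.array(
    [
        [1.0, 0.0],
        [0.5, 0.0],
        [0.0, 2.0],
    ]
)

z = b + X @ w  # weighted sum of the features
probabilities = sigmoid(z)  # P(y=1|X)
predictions = (probabilities > 0.5).astype(int)  # apply the 0.5 threshold

print(z)
print(probabilities)
print(predictions)
```

The second observation has $z = 0$, so its probability is exactly 0.5 and it is assigned to class 0 under the rule above.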

The two classes are separated by a linear decision boundary (a line, a plane or, in higher dimensions, a hyperplane). This means that logistic regression is only suitable for simpler classification tasks.

Logistic regression can be extended to multiclass classification problems using techniques such as [softmax regression](http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/).
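
As a rough illustration of the multiclass extension, the softmax function turns a vector of class scores into probabilities that sum to 1. The scores below are made up, and subtracting the maximum score is a common numerical-stability step rather than part of the definition.

```python
import numpy as np


def softmax(scores):
    """Convert a vector of class scores into probabilities."""
    shifted = scores - np.max(scores)  # avoids overflow in np.exp
    exponentials = np.exp(shifted)
    return exponentials / exponentials.sum()


# Hypothetical scores (one weighted sum per class) for a single observation
class_scores = np.array([2.0, 1.0, 0.1])

class_probabilities = softmax(class_scores)
predicted_class = int(np.argmax(class_probabilities))

print(class_probabilities)
print(predicted_class)
```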

#### Logistic regression example

Fit the logistic regression model to the training data.

In [None]:
from sklearn.linear_model import LogisticRegression

logistic_regression_classifier = LogisticRegression(penalty=None)

logistic_regression_classifier.fit(
    loan_approval_feature_train_scaled_df,
    loan_approval_target_train_df.get_column("loan_status"),
)

Review the classes being predicted by the model.

In [None]:
logistic_regression_classifier.classes_

Which features have the most impact on each of the two classes (positive coefficients support "Rejected")?

In [None]:
import altair as alt

(
    pl.DataFrame(
        {
            "feature": loan_approval_feature_train_df.columns,
            "coefficient": logistic_regression_classifier.coef_[0],
        }
    )
    .sort("coefficient", descending=True)
    .plot.bar(
        x="coefficient",
        y=alt.Y("feature", sort="-x"),
    )
    .properties(
        width=500,
        height=300,
    )
)

What is the predicted probability of each class for each observation in the test dataset?

In [None]:
logistic_regression_classifier.predict_proba(loan_approval_feature_test_scaled_df)

Show the predicted (most probable) class for each observation.

In [None]:
logistic_regression_classifier.predict(loan_approval_feature_test_scaled_df)

How does the model perform on the test data?

In [None]:
from sklearn.metrics import classification_report

print(
    classification_report(
        loan_approval_target_test_df.get_column("loan_status"),
        logistic_regression_classifier.predict(loan_approval_feature_test_scaled_df),
    )
)

The accuracy of this model is (a lowly) 91%, which is below the 95% achieved by the simple credit score rule above.

#### Benefits of logistic regression

- **Simplicity**: easy to implement and interpret
- **Efficiency**: fast to train, even on large datasets
- **Probabilistic output**: provides class probabilities, not just binary predictions
- **Works with smaller datasets**: can be used when there’s a limited number of data points
- **No strict assumptions**: does not require normally-distributed features or residuals

#### Weaknesses of logistic regression

- **Linear decision boundary**: struggles with complex, non-linear relationships
- **Sensitive to outliers**: outliers can distort the model
- **Requires feature scaling**: performs better when features are scaled or normalised
- **Sensitive to multicollinearity**: highly correlated features make the coefficients unstable and harder to interpret
- **Limited to binary classification**: extensions are needed for multiclass problems
- **Not suitable for high-dimensional data**: tends to overfit if there are too many features, unless regularisation is used

### k-Nearest Neighbours

k-Nearest Neighbours (k-NN) is another simple classification algorithm.

Unlike most supervised learning techniques, it doesn't have an explicit training phase.

When asked to classify a new observation, k-NN works as follows.

1. Compute the distance between the observation and all the observations in the training data. A number of distance metrics can be used, but Euclidean distance is the most common.

$$
d = \sqrt{(x_{1} - x_{2})^{2} + (y_{1} - y_{2})^{2} + (z_{1} - z_{2})^{2} + \cdots}
$$

2. Find the $k$ nearest neighbours---i.e. the $k$ points nearest to the observation. $k$ is an input to the algorithm and is critical to its performance.

3. Vote on the class. Each of the $k$ neighbours votes on the class and the new observation is classified by the majority vote.
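
The three steps above can be sketched from scratch as follows. The training points and labels are made up, and the helper `knn_predict` is a hypothetical name used only for this illustration (scikit-learn's `KNeighborsClassifier` is the practical choice).

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, new_point, k):
    """Classify new_point by majority vote among its k nearest neighbours."""
    # Step 1: Euclidean distance from the new point to every training point
    distances = np.sqrt(((train_points - new_point) ** 2).sum(axis=1))

    # Step 2: indices of the k smallest distances
    nearest_indices = np.argsort(distances)[:k]

    # Step 3: majority vote among the neighbours' labels
    votes = Counter(train_labels[i] for i in nearest_indices)
    return votes.most_common(1)[0][0]


# Made-up two-dimensional training data (illustrative only)
train_points = np.array(
    [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]]
)
train_labels = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(train_points, train_labels, np.array([2.0, 2.0]), k=3))
print(knn_predict(train_points, train_labels, np.array([6.0, 7.0]), k=3))
```

Note that all of the work happens at prediction time, which is why k-NN has no explicit training phase.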

#### k-NN example

Fit the k-NN model to the training data.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

k = 5

knn_classifier = KNeighborsClassifier(n_neighbors=k)

knn_classifier.fit(
    loan_approval_feature_train_scaled_df,
    loan_approval_target_train_df.get_column("loan_status"),
)

Show the predicted (majority vote) class for each observation.

In [None]:
knn_classifier.predict(loan_approval_feature_test_scaled_df)

How does the model perform on the test data?

In [None]:
print(
    classification_report(
        loan_approval_target_test_df.get_column("loan_status"),
        knn_classifier.predict(loan_approval_feature_test_scaled_df),
    )
)

The accuracy of this model is 88%. Tweaking $k$ would probably result in improvements, but k-NN is a relatively unsophisticated technique.
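
One possible way to tweak $k$ is to compare cross-validated accuracy across a handful of candidate values. The sketch below uses synthetic data from `make_classification` as a stand-in for the scaled training set, and the candidate values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real (scaled) training set
features, target = make_classification(n_samples=300, n_features=8, random_state=123)

candidate_ks = [1, 3, 5, 7, 9, 15]

# Mean 5-fold cross-validated accuracy for each candidate k
mean_accuracy_by_k = {
    k: cross_val_score(
        KNeighborsClassifier(n_neighbors=k), features, target, cv=5
    ).mean()
    for k in candidate_ks
}

best_k = max(mean_accuracy_by_k, key=mean_accuracy_by_k.get)

print(mean_accuracy_by_k)
print(best_k)
```

Using cross-validation on the training data keeps the test dataset untouched for the final evaluation.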

#### Benefits of k-NN

- **Simplicity**: easy to understand and implement
- **No assumptions**: no assumptions about the underlying data distribution, which makes it suitable for non-linear and complex data distributions

#### Weaknesses of k-NN

- **Computationally expensive**: since it stores all the data and computes distances during prediction, it can be slow and memory-intensive for large datasets
- **Curse of dimensionality**: as the number of features increases, the distance between points becomes less meaningful, requiring dimensionality reduction techniques or modifications to the algorithm
- **Sensitive to imbalanced data**: if certain classes are overrepresented in the dataset, they might dominate the voting process, leading to biased predictions

### Support Vector Machine

[Support Vector Machine (SVM) classifiers](https://www.ibm.com/topics/support-vector-machine) attempt to find a linear decision boundary that separates the classes. They differ from logistic regression classifiers in how they optimise their objectives.

SVM classifiers maximise the margin, i.e. the distance between the decision boundary and the nearest data points (support vectors) for each class.

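Unlike logistic regression classifiers, SVM classifiers can also handle data that is not linearly separable. They achieve this by implicitly mapping the original data into a higher-dimensional space where it becomes linearly separable. Kernel functions make this practical by computing similarities in that space without explicitly transforming the data. This is known as the kernel trick.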
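
To illustrate the effect of a kernel, the sketch below compares a linear kernel with an RBF kernel on two concentric rings generated by `make_circles`. The data is synthetic and the parameter values are arbitrary choices for the demonstration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them
features, target = make_circles(
    n_samples=400, noise=0.05, factor=0.3, random_state=123
)

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.30, random_state=123
)

# Same algorithm, two different kernels
linear_accuracy = (
    SVC(kernel="linear").fit(features_train, target_train).score(features_test, target_test)
)
rbf_accuracy = (
    SVC(kernel="rbf").fit(features_train, target_train).score(features_test, target_test)
)

print(linear_accuracy)
print(rbf_accuracy)
```

The linear kernel should do little better than guessing here, while the RBF kernel should separate the rings almost perfectly.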

#### SVM example

Fit the SVM model to the training data.

In [None]:
from sklearn import svm

svm_classifier = svm.SVC()

svm_classifier.fit(
    loan_approval_feature_train_scaled_df,
    loan_approval_target_train_df.get_column("loan_status"),
)

Show the predicted class for each observation.

In [None]:
svm_classifier.predict(loan_approval_feature_test_scaled_df)

How does the model perform on the test data?

In [None]:
print(
    classification_report(
        loan_approval_target_test_df.get_column("loan_status"),
        svm_classifier.predict(loan_approval_feature_test_scaled_df),
    )
)

The accuracy of this model is 93%. We can tune the following hyperparameters in an attempt to increase this.

- **Kernel**: linear, polynomial, radial basis function (RBF)
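- **Regularisation**: the penalty parameter $C$ controls error tolerance (smaller values tolerate more misclassified training points)
- **Gamma**: controls how far the influence of a single training point reaches in non-linear kernels such as RBF (high values mean more overfitting to the training data)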

Note: In ML, [regularisation](https://www.ibm.com/topics/regularization) is used to prevent overfitting by adding a penalty term to the model's loss function, which discourages complex models by constraining the magnitude of the model's parameters. This helps improve generalisation to unseen data (i.e. reduce overfitting).
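
A grid search is one common way to tune these hyperparameters. The sketch below uses synthetic data and an arbitrary, deliberately small grid, so the values shown are assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for a real (scaled) training set
features, target = make_classification(n_samples=200, n_features=6, random_state=123)

parameter_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1.0, 10.0],
    "gamma": ["scale", 0.1, 1.0],  # ignored by the linear kernel
}

# Evaluate every combination using 5-fold cross-validation
grid_search = GridSearchCV(SVC(), parameter_grid, cv=5, scoring="accuracy")
grid_search.fit(features, target)

print(grid_search.best_params_)
print(grid_search.best_score_)
```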

#### Benefits of SVM classifiers

- **Effective in high-dimensional spaces**: works for cases where the number of dimensions is greater than the number of samples
- **Works well with non-linear boundaries**: kernel functions can transform the data to allow it to be separated

#### Weaknesses of SVM classifiers

- **Computationally expensive**: especially for non-linear kernels
- **Complicated**: choosing the right kernel and tuning parameters can be challenging

### Naive Bayes

[Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) is a classification algorithm based on [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes'_theorem).

Bayes' Theorem updates prior beliefs (probability distributions) based on new evidence (observations).

So, the probability that a vector of values $X$ belongs to a class $C_{k}$ is

$$
P(C_{k}|X) = \frac{P(X|C_{k})P(C_{k})}{P(X)}
$$

Or, to put it in words

$$
\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}
$$

When comparing two classes, the evidence $P(X)$ is the same for both, so we only need to compare the numerators.

$$
P(C_{k}|X) \propto P(X|C_{k})P(C_{k})
$$

The "naive" part of a Naive Bayes classifier refers to the assumption that the features are independent of each other, given the class. This results in the following simplification.

$$
P(X|C_{k}) = P(x_{1}|C_{k}) P(x_{2}|C_{k}) \cdots P(x_{n}|C_{k})
$$

This results in a _significant_ computational saving when working with the high-dimensional text data that Naive Bayes is often used to classify.

Let's review how Naive Bayes might classify an e-mail as spam or ham.

1. Use the training data to calculate the priors $P(\text{spam})$ and $P(\text{ham})$.
2. For each feature (e.g. each significant word in the corpus of e-mails), calculate $P(w_{i}|C_{k})$ for each class $k$.
3. Calculate the likelihood $P(X|C_{k}) = \prod_{i=1}^{n}{P(w_{i}|C_{k})}$.
4. Apply Bayes' Theorem by combining the prior and likelihood to calculate the posterior probabilities $P(C_{k}|X)$.
5. Choose the class with the highest posterior probability.
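
The steps above can be sketched from scratch on a handful of made-up e-mails. Two details are assumptions added for the illustration: add-one (Laplace) smoothing, so that unseen words don't produce zero probabilities, and working with logarithms, so that multiplying many small probabilities doesn't underflow.

```python
import math
from collections import Counter

# Made-up training e-mails as (words, label) pairs (illustrative only)
training_emails = [
    (["win", "money", "now"], "spam"),
    (["win", "prize", "money"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "meeting", "now"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]

labels = [label for _, label in training_emails]
vocabulary = {word for words, _ in training_emails for word in words}

# Step 1: class priors P(C_k)
priors = {label: count / len(labels) for label, count in Counter(labels).items()}

# Step 2: word counts per class, used to estimate P(w_i|C_k)
word_counts = {label: Counter() for label in priors}
for words, label in training_emails:
    word_counts[label].update(words)


def word_likelihood(word, label):
    """P(word|label) with add-one (Laplace) smoothing."""
    total_words = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total_words + len(vocabulary))


def classify(words):
    """Return the class with the highest (unnormalised) log posterior."""
    scores = {}
    for label in priors:
        # Steps 3 and 4: log prior plus the summed log likelihoods
        scores[label] = math.log(priors[label]) + sum(
            math.log(word_likelihood(word, label)) for word in words
        )
    # Step 5: choose the most probable class
    return max(scores, key=scores.get)


print(classify(["win", "money"]))
print(classify(["meeting", "notes"]))
```

The evidence $P(X)$ is never computed because it is the same for both classes.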

#### Naive Bayes example

The Gaussian Naive Bayes algorithm assumes the likelihood of the features to be normally distributed.

Fit the Naive Bayes model to the training data.

In [None]:
from sklearn.naive_bayes import GaussianNB

naive_bayes_classifier = GaussianNB()

naive_bayes_classifier.fit(
    loan_approval_feature_train_scaled_df,
    loan_approval_target_train_df.get_column("loan_status"),
)

Show the predicted (most probable) class for each observation.

In [None]:
naive_bayes_classifier.predict(loan_approval_feature_test_scaled_df)

How does the model perform on the test data?

In [None]:
print(
    classification_report(
        loan_approval_target_test_df.get_column("loan_status"),
        naive_bayes_classifier.predict(loan_approval_feature_test_scaled_df),
    )
)

The accuracy of this model is 92%.

#### Benefits of Naive Bayes classifiers

- **Simple and fast**: easy to implement and computationally efficient, even with large datasets
- **Works well with small datasets**: can perform well with relatively small amounts of training data
- **Handles categorical data well**: especially effective for text classification and natural language processing tasks
- **Less sensitive to irrelevant features**: even with non-relevant features, the algorithm can still perform well
- **Robust to noise**: handles noisy data reasonably well in many practical applications

#### Weaknesses of Naive Bayes classifiers

- **Strong independence assumption**: assumes all features are independent, which is often not true in real-world data
- **Poor performance with correlated features**: fails when features are highly dependent or correlated
- **Zero probability issue**: if a feature value was never observed alongside a class in the training set, that class is assigned zero probability (can be handled with techniques like smoothing)
- **Limited to simple decision boundaries**: tends to work less well with complex datasets and decision boundaries compared to other algorithms like decision trees or SVMs

### Decision trees

Decision tree classifiers work by splitting classification tasks up into simpler subtasks (two at a time, in the case of binary trees). They use a recursive divide-and-conquer approach to continually subdivide tasks until they become trivial.

The output of training a decision tree classifier is a decision tree that is then used to classify observations.
    
The training steps are as follows.

1. A root node is created that represents the entire dataset.
2. The dataset is split on a feature value that provides the most information. Information can be measured in a number of ways, but [Gini impurity](https://victorzhou.com/blog/gini-impurity/) is a popular metric. Optimising based on information gain identifies the best place to split the data into two subgroups.
3. The algorithm then recursively partitions each of the subgroups, looking for the optimal splits in each.
4. The recursive partitioning stops when one of the following is true.
    - All the observations in the partition are of the same class.
    - The branch of the tree has reached a predefined depth limit (usually defined as a hyperparameter).
    - There are fewer than a predefined number of points in the partition.
    - There is no partitioning strategy that will improve the information gain.
5. When a partition cannot be split further, the leaf node of the branch is assigned the majority label of all the points in the partition.
6. To make a prediction, observations are subjected to the decisions (i.e. splits) defined by the generated tree.
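
As a minimal sketch of step 2, the code below computes the Gini impurity and searches for the best threshold on a single feature. The credit scores and outcomes are made up, and `gini_impurity` and `best_split` are hypothetical helper names (scikit-learn performs this search internally across all features).

```python
import numpy as np


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions**2)


def best_split(feature_values, labels):
    """Find the threshold on one feature that minimises the weighted Gini impurity."""
    best_threshold, best_impurity = None, gini_impurity(labels)
    distinct_values = np.unique(feature_values)  # sorted distinct values
    # Candidate thresholds are midpoints between consecutive distinct values
    for threshold in (distinct_values[:-1] + distinct_values[1:]) / 2:
        left = labels[feature_values <= threshold]
        right = labels[feature_values > threshold]
        weighted_impurity = (
            len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
        ) / len(labels)
        if weighted_impurity < best_impurity:
            best_threshold, best_impurity = threshold, weighted_impurity
    return best_threshold, best_impurity


# Made-up credit scores and loan outcomes (illustrative only)
scores = np.array([300, 420, 510, 560, 640, 720, 800])
outcomes = np.array(
    ["Rejected", "Rejected", "Rejected", "Approved", "Approved", "Approved", "Approved"]
)

print(gini_impurity(outcomes))
print(best_split(scores, outcomes))
```

In this toy data a threshold between 510 and 560 separates the classes perfectly, so both partitions become leaves and the recursion stops.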

#### Decision tree example

Fit the decision tree model to the training data.

In [None]:
from sklearn.tree import DecisionTreeClassifier

decision_tree_classifier = DecisionTreeClassifier(max_depth=3, random_state=SEED)

decision_tree_classifier.fit(
    loan_approval_feature_train_scaled_df,
    loan_approval_target_train_df.get_column("loan_status"),
)

Review the generated decision tree.

In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

loan_approval_class_names = sorted(loan_approval_df["loan_status"].unique())

plt.figure(figsize=(10, 10), dpi=300)
plot_tree(
    decision_tree_classifier,
    feature_names=loan_approval_features_df.columns,
    class_names=loan_approval_class_names,
    filled=True,
)
plt.show()

We can see which features are most important to the classification process as they are closer to the root of the tree (e.g. CIBIL Score).

This information is available from the fitted classifier.

In [None]:
pl.DataFrame(
    {
        "feature": loan_approval_feature_train_df.columns,
        "importance": decision_tree_classifier.feature_importances_,
    }
).sort("importance", descending=True)

Show the predicted class for each observation.

In [None]:
loan_approval_feature_test_df

In [None]:
decision_tree_classifier.predict(loan_approval_feature_test_scaled_df)

How does the model perform on the test data?

In [None]:
print(
    classification_report(
        loan_approval_target_test_df.get_column("loan_status"),
        decision_tree_classifier.predict(loan_approval_feature_test_scaled_df),
    )
)

The accuracy of this model is 95%.

#### Benefits of decision tree classifiers

- **Interpretability**: decision trees are easy to visualise and interpret
- **Non-parametric**: no assumptions about the underlying data distribution
- **Handling of non-linear relationships**: can model non-linear relationships between features and classes

#### Weaknesses of decision tree classifiers

- **Overfitting**: can easily become too complex and overfit the training data
- **Instability**: small changes in the data can result in entirely different trees
- **Bias**: can be biased towards features with more levels or categories

### Random forests

Random forest classifiers combine many decision trees to improve classification accuracy and reduce overfitting.

They are a type of ensemble model as they make use of multiple component models (the individual decision trees). Ensemble models have been shown to be effective in ML competitions.

Each tree in the forest makes a prediction and the overall prediction is determined by majority vote.

There might be thousands of trees in a random forest.
    
<img src="images/module1-random-forest.png" alt="Random forest" width="600px" />

The random forest produces the individual trees using a subset of the data.

- Each tree is trained on a (bootstrapped) subset of the observations
- At each split, a random subset of the features is considered

This results in the trees seeing the data from different "perspectives", thus producing more robust predictions.

Having multiple trees making predictions allows errors in individual trees to be averaged out.
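
The two sources of randomness and the majority vote can be sketched by hand using individual decision trees. This is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented internally, and the number of trees and depth are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (labels are 0 and 1)
features, target = make_classification(n_samples=300, n_features=10, random_state=123)

random_generator = np.random.default_rng(123)
number_of_trees = 25  # an odd number avoids tied votes
trees = []

for tree_index in range(number_of_trees):
    # Bootstrap: sample observations with replacement
    sample_indices = random_generator.integers(0, len(features), size=len(features))
    tree = DecisionTreeClassifier(
        max_depth=3,
        max_features="sqrt",  # random subset of features considered at each split
        random_state=tree_index,
    )
    tree.fit(features[sample_indices], target[sample_indices])
    trees.append(tree)

# Each tree votes; with 0/1 labels the majority is a mean vote above 0.5
votes = np.array([tree.predict(features) for tree in trees])
forest_predictions = (votes.mean(axis=0) > 0.5).astype(int)

print((forest_predictions == target).mean())
```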

#### Random forest example

Fit the random forest model to the training data.

In [None]:
from sklearn.ensemble import RandomForestClassifier

random_forest_classifier = RandomForestClassifier(max_depth=3, random_state=SEED)

random_forest_classifier.fit(
    loan_approval_feature_train_scaled_df,
    loan_approval_target_train_df.get_column("loan_status"),
)

We can see which features are most important to the classification process.

In [None]:
pl.DataFrame(
    {
        "feature": loan_approval_feature_train_df.columns,
        "importance": random_forest_classifier.feature_importances_,
    }
).sort("importance", descending=True)

As with the decision tree, the CIBIL Score dominates.

Show the predicted (majority vote) class for each observation.

In [None]:
random_forest_classifier.predict(loan_approval_feature_test_scaled_df)

How does the model perform on the test data?

In [None]:
print(
    classification_report(
        loan_approval_target_test_df.get_column("loan_status"),
        random_forest_classifier.predict(loan_approval_feature_test_scaled_df),
    )
)

The accuracy of this model is 94%.

Interestingly, the accuracy of the decision tree model was 95%.

Remember that accuracy is a pretty rough metric, and a difference of 95% vs 94% may well be down to rounding or chance. The classification of this dataset also appears to be significantly influenced by a single feature (i.e. CIBIL Score), which even a single shallow tree can exploit.
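
To see why accuracy alone can hide detail, the sketch below computes a confusion matrix, precision and recall for a set of made-up labels (1 is the positive class). These are the quantities that `classification_report` summarises.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 1 = positive class, 0 = negative class (illustrative only)
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Rows are actual classes, columns are predicted classes
matrix = confusion_matrix(actual, predicted)

accuracy = accuracy_score(actual, predicted)
precision = precision_score(actual, predicted)  # of predicted positives, how many are right
recall = recall_score(actual, predicted)  # of actual positives, how many were found

print(matrix)
print(accuracy, precision, recall)
```

Here the accuracy is 70%, yet the model finds only half of the positive observations.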

#### Benefits of random forest classifiers

- **Accuracy**: random forest classifiers tend to be more accurate because they reduce overfitting by averaging the predictions from many trees
- **Robustness**: less sensitive to noise in the training data compared to individual decision trees
- **Feature importance**: provide a measure of the importance of each feature in making predictions

#### Weaknesses of random forest classifiers

- **Interpretability**: the ensemble of trees can make it hard to explain the final decision
- **Training time**: training many trees can be computationally expensive and time-consuming

### Boosted trees

Boosted trees are another ensemble learning technique. They have been found to be very effective in ML competitions and represent the current state-of-the-art in tabular ML.

The basic idea behind boosted trees is to combine multiple weak learners (typically decision trees) to form a strong predictive model. They can be used for both classification and regression.

There are numerous boosted tree algorithms. Popular ones are

- AdaBoost
- Gradient Boosting
    - XGBoost
    - LightGBM
    - CatBoost

Weak learners are designed to perform slightly better than random guessing. They may be implemented with shallow trees.

Boosting involves sequentially training a series of weak learners. Each learner attempts to correct the mistakes made by the previous learners. The overall idea is that, by combining multiple weak learners, you get much better predictions.

Having each individual tree be a _weak_ learner ensures that the insights produced by the model are distributed over a large number of trees, rather than being concentrated in a few.

Boosted trees are trained as follows.

1. The first decision tree is trained on the original dataset. Its predictions are used to generate the first set of residuals (errors), which are the differences between the predicted and actual target values.
2. In each subsequent round, a new decision tree is trained. However, this new tree is not trained on the original targets but on the residuals (the errors) of the previous model, so it focuses on correcting the errors made so far. (Some algorithms, such as AdaBoost, instead give more weight to the instances that the previous models got wrong.)
3. The final prediction is obtained by combining the predictions from all the individual trees. The trees are added together, often weighted by their importance in reducing the error. 
    
A learning rate can be provided, which controls how much each tree contributes to the overall prediction. Small learning rates slow down the learning, but allow for more fine tuning.

AdaBoost adjusts the weights of misclassified instances in each round. Gradient Boosting focuses on reducing the residuals.
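
The residual-fitting idea behind gradient boosting can be sketched by hand. To keep the arithmetic simple, this illustration makes two simplifying assumptions: it uses a regression problem (for classification, the residuals are replaced by gradients of a classification loss), and it starts from a constant prediction rather than a first tree. The data is a made-up noisy sine wave.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

random_generator = np.random.default_rng(123)

# Made-up regression data: a noisy sine wave (illustrative only)
x = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(x).ravel() + random_generator.normal(0, 0.1, size=200)

learning_rate = 0.1
number_of_rounds = 50

# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y, y.mean())
training_errors = [np.mean((y - prediction) ** 2)]

for _ in range(number_of_rounds):
    residuals = y - prediction  # errors made by the ensemble so far
    weak_learner = DecisionTreeRegressor(max_depth=2, random_state=123)
    weak_learner.fit(x, residuals)  # the new tree is fitted to the residuals
    prediction += learning_rate * weak_learner.predict(x)  # damped contribution
    training_errors.append(np.mean((y - prediction) ** 2))

print(training_errors[0], training_errors[-1])
```

Each shallow tree removes a little of the remaining error, and the learning rate limits how much any single tree can contribute.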

#### Boosted tree example

Fit the boosted tree model to the training data.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

boosted_tree_classifier = GradientBoostingClassifier(random_state=SEED)

boosted_tree_classifier.fit(
    loan_approval_feature_train_scaled_df,
    loan_approval_target_train_df.get_column("loan_status"),
)

As with decision trees, we can see which features are most important to the classification process.

In [None]:
pl.DataFrame(
    {
        "feature": loan_approval_feature_train_df.columns,
        "importance": boosted_tree_classifier.feature_importances_,
    }
).sort("importance", descending=True)

As with the decision tree, the CIBIL Score dominates.

Show the predicted class for each observation.

In [None]:
boosted_tree_classifier.predict(loan_approval_feature_test_scaled_df)

How does the model perform on the test data?

In [None]:
print(
    classification_report(
        loan_approval_target_test_df.get_column("loan_status"),
        boosted_tree_classifier.predict(loan_approval_feature_test_scaled_df),
    )
)

The accuracy of this model is 96%.

#### Benefits of boosted tree classifiers

- **High predictive accuracy**: often outperform other models due to their ability to combine weak learners
- **Handles mixed data types**: some implementations (e.g. CatBoost, LightGBM) process both numerical and categorical data without much preprocessing
- **Captures complex relationships**: excel in modelling non-linear interactions between features
- **Robust to overfitting**: with proper regularisation techniques, boosting algorithms effectively prevent overfitting
- **Versatile for tasks**: can be applied to both classification and regression problems

#### Weaknesses of boosted tree classifiers

- **Sensitive to hyperparameters**: performance heavily depends on fine-tuning parameters like learning rate and tree depth
- **Risk of overfitting**: without sufficient regularisation or when using too many trees, boosted models may overfit the training data
- **Less interpretable**: understanding how boosted trees make decisions is complex
- **Computationally expensive**: training, especially with many trees, requires significant time and resources, and takes longer than simpler approaches

### Neural networks

Neural Networks (NNs) are computational models inspired by the structure and function of the human brain. They consist of layers of interconnected "neurons" or nodes, which process and transmit information. Here’s a breakdown of how NNs work:

1. **Neurons and layers**
    - Neurons: The basic units in a neural network, akin to brain neurons. Each neuron receives input, processes it, and sends an output.
    - Layers: NNs are organised into layers:
        - Input layer: Receives the raw data (e.g., images, text).
        - Hidden layers: Intermediate layers where computations are performed. The more hidden layers, the deeper the network (hence the term "deep learning").
        - Output layer: Produces the final prediction or classification (e.g., cat vs. dog in an image classification task).
2. **Connections and weights**
    - Neurons are connected by "edges", each of which has a "weight" that determines the strength of the connection. These weights are adjustable and are crucial in learning patterns from data.
        - Weights: When information passes through a connection, it is multiplied by a weight. Initially, weights are random, but they are fine-tuned during training.
        - Bias: An additional parameter added to each neuron’s weighted sum of inputs, helping the network make better predictions.
3. **Activation function**
    - Each neuron has an activation function that determines whether it should "fire" (send a signal). Common activation functions include:
        - Step: Uses a threshold to map the input to 0 or 1, often used for binary classification.
        - Sigmoid: Squashes the input to be between 0 and 1, often used for binary classification.
        - ReLU (Rectified Linear Unit): Outputs zero if the input is negative, otherwise passes the input as is.
        - Softmax: Converts outputs into probabilities for multi-class classification.
4. **Forward propagation**
    - Data flows forward through the network:
        - Inputs are passed to the neurons in the input layer.
        - Neurons in each layer process the inputs by multiplying them by weights, summing them, adding biases, and applying the activation function.
        - This process repeats across all layers until the output layer produces a result (e.g., a prediction).
5. **Loss function**
    - The output is compared to the actual result (e.g., the correct label in supervised learning). The loss function measures how far the prediction is from the truth.
        - Common loss functions include mean squared error (for regression tasks) and cross-entropy loss (for classification tasks).
6. **Backpropagation and learning**
    - After computing the loss, the network adjusts its weights to reduce errors. This is done through backpropagation, which involves:
        - Calculating the gradient (the direction and magnitude of change needed) of the loss function with respect to each weight.
        - Gradient descent is used to update the weights in small steps to minimise the loss.
        - The learning rate determines how large these steps are; too high and the network won’t converge, too low and learning is slow.
7. **Training process**
    - The network undergoes a training phase where:
        - Input data is passed through the network (forward propagation).
        - The loss is calculated.
        - Weights are adjusted via backpropagation. This process repeats over many iterations (called epochs), gradually improving the network's performance.
8. **Generalisation**
    - Once trained, the NN should be able to generalise, meaning it can correctly predict outputs for unseen data (e.g., recognising new images it wasn’t explicitly trained on).
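
The forward propagation, loss, backpropagation and gradient descent steps above can be sketched in plain NumPy. The data, layer size, learning rate and number of epochs are all arbitrary choices for illustration. In practice, a library such as scikit-learn handles these details.

```python
import numpy as np

random_generator = np.random.default_rng(123)

# Made-up data: two features, label 1 when their sum is positive (illustrative only)
inputs = random_generator.normal(size=(200, 2))
targets = (inputs.sum(axis=1) > 0).astype(float).reshape(-1, 1)

# One hidden layer with 4 neurons; weights start random, biases start at zero
hidden_weights = random_generator.normal(scale=0.5, size=(2, 4))
hidden_biases = np.zeros((1, 4))
output_weights = random_generator.normal(scale=0.5, size=(4, 1))
output_biases = np.zeros((1, 1))

learning_rate = 0.5
losses = []

for epoch in range(200):
    # Forward propagation: weighted sums, biases, activation functions
    hidden_sums = inputs @ hidden_weights + hidden_biases
    hidden_outputs = np.maximum(0.0, hidden_sums)  # ReLU
    output_sums = hidden_outputs @ output_weights + output_biases
    probabilities = 1.0 / (1.0 + np.exp(-output_sums))  # sigmoid

    # Loss function: binary cross-entropy
    clipped = np.clip(probabilities, 1e-12, 1 - 1e-12)
    loss = -np.mean(targets * np.log(clipped) + (1 - targets) * np.log(1 - clipped))
    losses.append(loss)

    # Backpropagation: gradients of the loss with respect to each parameter
    output_error = (probabilities - targets) / len(inputs)
    output_weights_gradient = hidden_outputs.T @ output_error
    output_biases_gradient = output_error.sum(axis=0, keepdims=True)
    hidden_error = (output_error @ output_weights.T) * (hidden_sums > 0)
    hidden_weights_gradient = inputs.T @ hidden_error
    hidden_biases_gradient = hidden_error.sum(axis=0, keepdims=True)

    # Gradient descent: step each parameter against its gradient
    output_weights -= learning_rate * output_weights_gradient
    output_biases -= learning_rate * output_biases_gradient
    hidden_weights -= learning_rate * hidden_weights_gradient
    hidden_biases -= learning_rate * hidden_biases_gradient

print(losses[0], losses[-1])
```

The loss recorded at the last epoch should be lower than at the first, which is the "learning" that training refers to.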

<img src="images/module1-ann.png" alt="Artificial Neural Network" width="600px" />

There are multiple _types_ of NN. The type described above is a **feedforward neural network**, also known as a multilayer perceptron (MLP). The term artificial neural network (ANN) is sometimes used for this type, but, strictly, it applies to NNs in general.

There are also:

- **Convolutional neural networks (CNNs)**, which are particularly effective in image recognition tasks. The ubiquity of these kinds of tasks, coupled with the difficulty of solving them using traditional analysis approaches, makes CNNs one of the most popular types of NN.
- **Generative adversarial networks (GANs)**, which use two neural networks, pitting one against the other to improve their accuracy.
- **Recurrent neural networks (RNNs)**, which are able to process both past and current input data, i.e. they have memory. Feedforward networks, such as the one described above, operate only on the current input.
- **Transformers**, which, like RNNs, are used with sequential data. However, transformers have greater flexibility in how they traverse the sequence.

#### NN example

Fit the NN model to the training data.

In [None]:
from sklearn.neural_network import MLPClassifier

nn_classifier = MLPClassifier(
    hidden_layer_sizes=(11, 50, 2), max_iter=1_000, random_state=SEED
)

nn_classifier.fit(
    loan_approval_feature_train_scaled_df,
    loan_approval_target_train_df.get_column("loan_status"),
)

Show the predicted class for each observation.

In [None]:
nn_classifier.predict(loan_approval_feature_test_scaled_df)

How does the model perform on the test data?

In [None]:
print(
    classification_report(
        loan_approval_target_test_df.get_column("loan_status"),
        nn_classifier.predict(loan_approval_feature_test_scaled_df),
    )
)

The accuracy of this model is 96%.

#### Benefits of NN classifiers

- **Learn complex patterns**: can identify intricate, non-linear relationships in data, making them suitable for tasks like image recognition and speech processing
- **Adaptability**: can be trained for a wide range of tasks and data types, from structured to unstructured, across various industries like healthcare, finance, and AI
- **Feature learning**: automatically learn relevant features from raw data, reducing the need for manual feature engineering in fields like image and text processing
- **Scalable**: performance improves with larger datasets, making them highly effective for big data applications
- **Parallelisation**: training can be parallelised using GPUs, accelerating the process, especially for large-scale networks
- **Handle high-dimensional data**: effective with complex, high-dimensional data such as images, videos, or large datasets with many features
- **Good generalisation**: can generalise to unseen data and make accurate predictions in real-world applications after training

#### Weaknesses of NN classifiers

- **Data hungry**: large datasets are required for optimal performance, and results suffer if data is limited or imbalanced
- **Black boxes**: difficult to interpret, making it hard to understand or explain the decision-making process
- **Computationally intensive**: training is resource-heavy, requiring significant computational power and time
- **Prone to overfitting**: can memorise training data, performing poorly on unseen data
- **Require hyperparameter tuning**: choosing the right architecture and tuning parameters like learning rate, layers, and neurons often requires expertise and trial-and-error
- **Sensitive to input quality**: rely on high-quality, well-prepared data; noisy or biased data can severely affect performance
- **Limited transfer learning**: often need retraining when applied to different tasks
- **Challenging debugging**: debugging and fine-tuning neural networks can be difficult and time-consuming
- **Risk of bias**: can inadvertently learn and propagate biases present in the training data

## Demonstration of how decision tree classifiers work

Decision trees use a divide-and-conquer strategy to recursively subdivide the dataset into simpler sub-problems.

The dataset is split at the point that maximises the information gained (e.g. the split that minimises the Gini impurity).

For a dataset with $k$ classes, the Gini impurity is calculated by

$$
    G = 1 - \sum_{i=1}^{k}p_{i}^{2}
$$

$p_{i}$ is the probability of selecting an item of class $i$ (i.e. the proportion of that class in the dataset).
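
As a quick sanity check of the formula, here is a minimal sketch that computes the Gini impurity of a single collection of labels. The `gini_impurity` helper and the label lists are made up for illustration and are not part of the course notebooks.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a single collection of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


# Hypothetical label collections
pure = ["Adelie"] * 10
balanced = ["Adelie"] * 5 + ["Gentoo"] * 5
three_way = ["Adelie"] * 4 + ["Chinstrap"] * 4 + ["Gentoo"] * 4

print(gini_impurity(pure))       # a single class: no impurity
print(gini_impurity(balanced))   # two equally likely classes
print(gini_impurity(three_way))  # three equally likely classes
```

A pure collection scores 0, and the score rises as the classes become more evenly mixed. For $k$ equally likely classes the impurity reaches its maximum of $1 - 1/k$.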

The penguin dataset has three classes (i.e. species): Adelie, Chinstrap and Gentoo.

We'll simplify the example by taking the two most significant features: `flipper_length_mm` and `bill_length_mm`.

In [None]:
penguin_df = (
    pl.read_csv("data/penguins.csv")
    .select(["flipper_length_mm", "bill_length_mm", "species"])
    .drop_nulls()
)

Define a function to calculate Gini impurity given groups (partitions) and classes (species).

In [None]:
def calculate_gini_impurity(groups, classes):
    # Number of samples at split point
    total_samples = sum([len(group) for group in groups])

    # Initialize Gini score
    gini_impurity = 0.0

    # Calculate Gini index for each group (left and right)
    for group in groups:
        size = len(group)

        # Avoid division by zero
        if size == 0:
            continue

        score = 0.0
        
        # Calculate the score for each class
        for class_ in classes:
            proportion = [row[-1] for row in group].count(class_) / size
            score += proportion ** 2

        # Weight the group Gini score by its size
        gini_impurity += (1.0 - score) * (size / total_samples)

    return gini_impurity

Define a function to partition a dataset based on feature and split value.

In [None]:
def partition_data(df, feature_name, split_value):
    left_df = df.filter(pl.col(feature_name) <= split_value)
    right_df = df.filter(pl.col(feature_name) > split_value)

    return left_df, right_df

Visualise the Gini impurity at various splits along `flipper_length_mm`.

In [None]:
import altair as alt

classes = penguin_df.get_column("species").unique().to_list()

feature_values = sorted(penguin_df.get_column("flipper_length_mm").unique().to_list())

gini_impurities = {}

for split_value in feature_values:
    partitions = partition_data(penguin_df, "flipper_length_mm", split_value)

    gini_impurities[split_value] = calculate_gini_impurity(
        [partition.rows() for partition in partitions], classes
    )

(
    pl.DataFrame(
        {
            "flipper_length_mm": gini_impurities.keys(),
            "gini_impurity": gini_impurities.values(),
        }
    )
    .plot.line(
        x=alt.X("flipper_length_mm", title="Flipper length (mm)"),
        y=alt.Y("gini_impurity", title="Gini impurity"),
    )
    .properties(width=500)
)

We can see the Gini impurity is minimised at $\approx$ 206mm.
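
The same exhaustive scan can be verified by hand on a dataset small enough to inspect. The sketch below uses eight made-up (flipper length, species) pairs rather than the real penguin data. The `gini` and `weighted_gini` helpers are hypothetical, compact stand-ins for the functions defined earlier.

```python
from collections import Counter

# Hypothetical (flipper length, species) pairs, not taken from the penguin data
samples = [
    (181, "Adelie"), (186, "Adelie"), (190, "Adelie"), (195, "Chinstrap"),
    (198, "Adelie"), (210, "Gentoo"), (215, "Gentoo"), (220, "Gentoo"),
]


def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_gini(threshold):
    left = [label for value, label in samples if value <= threshold]
    right = [label for value, label in samples if value > threshold]
    # Weight each side's impurity by its share of the samples; skip empty sides
    return sum(
        len(side) / len(samples) * gini(side) for side in (left, right) if side
    )


# Try every observed value as a candidate threshold and keep the best
scores = {value: weighted_gini(value) for value, _ in samples}
best_threshold = min(scores, key=scores.get)

print(best_threshold, round(scores[best_threshold], 3))
```

Splitting at 198 leaves a pure right-hand group (all Gentoo) and only one Chinstrap mixed in with the Adelies on the left, which is why it gives the lowest weighted impurity of the candidates.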

Define a function to calculate the optimal split (across all features).

In [None]:
def calculate_split(df, target_column_name):
    feature_column_names = df.drop(target_column_name).columns
    classes = df.get_column(target_column_name).unique().to_list()

    split = (None, None, 1)

    for feature_column_name in feature_column_names:
        feature_values = sorted(df.get_column(feature_column_name).unique().to_list())

        for split_value in feature_values:
            partitions = partition_data(df, feature_column_name, split_value)

            gini_impurity = calculate_gini_impurity(
                [partition.rows() for partition in partitions], classes
            )

            if gini_impurity < split[2]:
                split = feature_column_name, split_value, gini_impurity

    return split

Define a function to recursively split the dataset (i.e. build the decision tree).

In [None]:
def divide_and_conquer(df, target_column_name, maximum_depth, depth=0):
    split = calculate_split(df, target_column_name)

    if split[2] < 1e-6 or depth == maximum_depth:
        print(f"{'  ' *  depth}Terminated")

        return

    print(f"{'  ' *  depth}{split[0]} <= {split[1]}")

    left_df = df.filter(pl.col(split[0]) <= split[1])
    right_df = df.filter(pl.col(split[0]) > split[1])

    divide_and_conquer(left_df, target_column_name, maximum_depth, depth + 1)
    divide_and_conquer(right_df, target_column_name, maximum_depth, depth + 1)

Create a decision tree with a maximum depth of 4.

In [None]:
divide_and_conquer(penguin_df, "species", maximum_depth=4)
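
One way to sanity-check a hand-rolled tree is to compare it with a library implementation. The sketch below fits scikit-learn's `DecisionTreeClassifier` with the Gini criterion and prints the learned rules with `export_text`. The eight measurements are made up for illustration. scikit-learn places thresholds midway between adjacent observed values, so its split points will not match the hand-rolled version exactly.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical measurements: [flipper_length_mm, bill_length_mm]
features = [
    [181, 39.1], [186, 39.5], [190, 38.0], [195, 49.0],
    [198, 50.5], [210, 46.0], [215, 47.5], [220, 48.2],
]
species = [
    "Adelie", "Adelie", "Adelie", "Chinstrap",
    "Chinstrap", "Gentoo", "Gentoo", "Gentoo",
]

# Same stopping rule as the demonstration: Gini impurity, maximum depth of 4
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(features, species)

print(export_text(tree, feature_names=["flipper_length_mm", "bill_length_mm"]))
```

The printed rules have the same nested "feature <= value" shape as the output of `divide_and_conquer`, with a class label at each leaf.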

Plot the penguins and the splits for the first (left-most) branch of the decision tree.

In [None]:
import plotly.express as px

fig = px.scatter(
    x=penguin_df.select("flipper_length_mm").to_series(),
    y=penguin_df.select("bill_length_mm").to_series(),
    color=penguin_df.select("species").to_series(),
    labels={
        "x": "flipper length (mm)",
        "y": "bill length (mm)",
        "color": "species",
    },
    color_discrete_sequence=px.colors.qualitative.Safe,
    width=1000,
    height=800,
    template="simple_white",
)

fig.update_traces(marker=dict(size=12), selector=dict(mode="markers"))

fig.add_vline(x=206, line_width=3, line_dash="solid")
fig.add_hline(y=43.2, line_width=3, line_dash="dash")
fig.add_hline(y=42.3, line_width=3, line_dash="dot")
fig.add_hline(y=40.8, line_width=3, line_dash="dashdot")

fig.add_shape(
    type="rect",
    x0=penguin_df.get_column("flipper_length_mm").min(),
    x1=206,
    y0=penguin_df.get_column("bill_length_mm").min(),
    y1=40.8,
    opacity=0.5,
    line_width=0,
)

fig.show()

## Hands-on example of using XGBoost to classify observations

The Kaggle Lending Club Loan Dataset contains a variety of fields related to loans, borrowers, and their financial behaviour.

The original dataset has $\approx$ 2.5m loans in it. We've extracted a random sample of 250,000 loans to make it more manageable.

Below is a breakdown of the key fields in this dataset, which are commonly used for analysis.

1. Loan and Credit Information
    - `loan_amnt`: The loan amount requested by the borrower.
    - `funded_amnt`: The total amount that was actually funded to the borrower.
    - `funded_amnt_inv`: The total amount that was funded by investors.
    - `term`: The term of the loan, typically 36 or 60 months.
    - `int_rate`: The interest rate on the loan.
    - `installment`: The monthly payment owed by the borrower.
    - `grade`: Lending Club assigned loan grade (A, B, C, etc.).
    - `sub_grade`: Lending Club assigned sub-grade for finer granularity within a grade.
    - `emp_length`: Length of employment in years, provided by the borrower.
    - `home_ownership`: The borrower’s home ownership status (e.g., RENT, OWN, MORTGAGE).
    - `annual_inc`: The borrower’s annual income.
    - `verification_status`: Whether the borrower's income was verified (e.g. Verified, Source Verified, Not Verified).
    - `fico_range_low`: Lower bound of the borrower’s FICO score range at the time of loan issuance.
    - `fico_range_high`: Upper bound of the borrower’s FICO score range at the time of loan issuance.

2. Loan Status and Performance
    - `loan_status`: The current status of the loan (e.g., Fully Paid, Charged Off, Late).
    - `pymnt_plan`: Indicates if there is a payment plan for the loan (Y/N).
    - `purpose`: The reason the borrower is taking out the loan (e.g., debt_consolidation, credit_card, home_improvement).
    - `title`: The loan title provided by the borrower.
    - `dti`: Debt-to-income ratio, which is the borrower’s monthly debt payments divided by their monthly income.
    - `delinq_2yrs`: The number of delinquencies in the past two years.
    - `earliest_cr_line`: The date of the borrower’s earliest reported credit line.
    - `inq_last_6mths`: The number of inquiries in the past 6 months.
    - `mths_since_last_delinq`: Months since the borrower’s last delinquency.
    - `open_acc`: The number of open credit lines in the borrower’s credit file.
    - `pub_rec`: The number of derogatory public records.
    - `revol_bal`: The borrower’s revolving balance (total credit card debt).
    - `revol_util`: Revolving line utilisation rate, or the amount of credit used relative to credit available.
    - `total_acc`: The total number of credit lines in the borrower’s credit file.

3. Payment and Financial Metrics
    - `out_prncp`: The remaining outstanding principal on the loan.
    - `out_prncp_inv`: The remaining outstanding principal for investors.
    - `total_pymnt`: The total amount paid by the borrower so far.
    - `total_pymnt_inv`: The total payment received by investors.
    - `total_rec_prncp`: The total principal received to date.
    - `total_rec_int`: The total interest received to date.
    - `total_rec_late_fee`: The total late fees received.
    - `recoveries`: The total amount recovered from the borrower after the loan was charged off.
    - `collection_recovery_fee`: Fees incurred while recovering charged-off loans.
    - `last_pymnt_d`: The date of the borrower’s last payment.
    - `last_pymnt_amnt`: The amount of the borrower’s last payment.

4. Borrower Demographics and Other Information
    - `addr_state`: The state where the borrower resides.
    - `zip_code`: The first 3 digits of the borrower’s postal code.
    - `policy_code`: Indicates the public policy governing the loan.
    - `application_type`: Whether the loan application is individual or joint.
    - `acc_now_delinq`: The number of accounts that are currently delinquent.
    - `tot_coll_amt`: The total collection amounts ever owed by the borrower.
    - `tot_cur_bal`: The borrower’s total current balance for all accounts.

We are going to build a model to predict whether a borrower will fully repay their loan. There's a `loan_status` column that we'll use to create a new target column called `fully_paid`.

Ideally, we'd consult some experts to determine the best features to use to predict loan repayment. We'll cover feature engineering in a subsequent module. For now, we'll pick some plausible-sounding variables.

- `loan_amnt`
- `term`
- `int_rate`
- `grade` (mapped to an ordinal value)
- `emp_length`
- `home_owner` (derived from `home_ownership`)
- `annual_inc`
- `fico_average` (derived from `fico_range_low` and `fico_range_high`)

In [None]:
lending_club_raw_df = (
    pl.read_parquet("data/lending-club-sample.parquet")
    .with_columns(
        pl.col("home_ownership").is_in(["OWN", "MORTGAGE"]).alias("home_owner"),
        (pl.col("loan_status") == "Fully Paid").alias("fully_paid"),
        ((pl.col("fico_range_low") + pl.col("fico_range_high")) / 2).alias(
            "fico_average"
        ),
    )
    .select(
        [
            "loan_amnt",
            "term",
            "int_rate",
            "grade",
            "emp_length",
            "home_owner",
            "annual_inc",
            "fico_average",
            "fully_paid",
        ]
    )
)

lending_club_df = lending_club_raw_df

lending_club_df

Check if there are any null values in the data.

In [None]:
lending_club_df.null_count()

Drop any observations with null `emp_length`.

In [None]:
lending_club_df = lending_club_raw_df.filter(pl.col("emp_length").is_not_null())

lending_club_df.null_count()

For the ML model to be effective, we need a reasonable number of examples of each class we are predicting.

In [None]:
lending_club_df.get_column("fully_paid").value_counts()
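
Class counts also give a useful reference point: a model that always predicts the most common class achieves an accuracy equal to that class's share of the data. The sketch below shows the calculation on made-up labels. The 60/40 split is an assumption for illustration, not the real Lending Club counts.

```python
from collections import Counter

# Hypothetical target values, not the real Lending Club counts
fully_paid = [True] * 60 + [False] * 40

counts = Counter(fully_paid)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the majority class
baseline_accuracy = majority_count / len(fully_paid)

print(f"Always predicting {majority_class} gives {baseline_accuracy:.0%} accuracy")
```

Any trained classifier should be judged against this baseline rather than against 0%.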

`grade` is a categorical field. We could encode it using dummy variables, but, as it's ordinal, let's map it to a linear scale instead. This involves a number of implicit assumptions (e.g. that the gap between each pair of adjacent grades is the same).

In [None]:
lending_club_df = lending_club_raw_df.filter(
    pl.col("emp_length").is_not_null()
).with_columns(
    pl.col("grade")
    .replace(
        {
            "A": 7,
            "B": 6,
            "C": 5,
            "D": 4,
            "E": 3,
            "F": 2,
            "G": 1,
        }
    )
    .str.to_integer()
)

lending_club_df

Split the data into training and test datasets, then scale the features.

In [None]:
(
    lending_club_feature_train_df,
    lending_club_feature_test_df,
    lending_club_target_train_df,
    lending_club_target_test_df,
) = train_test_split(
    lending_club_df.drop("fully_paid"),
    lending_club_df.select("fully_paid"),
    test_size=0.30,
    random_state=SEED,
)

lending_club_feature_train_scaled = scaler.fit_transform(lending_club_feature_train_df)

lending_club_feature_test_scaled = scaler.transform(lending_club_feature_test_df)
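
As a general principle, scaling parameters should be learned from the training data only and then reused on the test data. Otherwise information about the test set leaks into preprocessing and the two sets end up on different scales. The sketch below shows the mechanics with the standard library, using made-up income values. In scikit-learn terms, it corresponds to calling `fit_transform` on the training data and `transform` on the test data.

```python
import statistics

# Hypothetical annual incomes (in thousands)
train_incomes = [40.0, 50.0, 60.0, 70.0]
test_incomes = [55.0, 90.0]

# Learn the scaling parameters from the training data only
train_mean = statistics.fmean(train_incomes)
train_std = statistics.pstdev(train_incomes)  # population std, as StandardScaler uses


def standardise(values):
    return [(value - train_mean) / train_std for value in values]


train_scaled = standardise(train_incomes)
test_scaled = standardise(test_incomes)  # reuses the training mean and std

print(train_scaled)
print(test_scaled)
```

The test value of 90 lands more than three standard deviations above the training mean, which is exactly the signal a model should see. Refitting the scaler on the test data would hide it.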

Fit a "stump" classifier.

In [None]:
import xgboost as xgb

stump_classifier = xgb.XGBClassifier(max_depth=1, n_estimators=1, random_state=SEED)

stump_classifier.fit(
    lending_club_feature_train_scaled,
    lending_club_target_train_df.get_column("fully_paid"),
)

How accurate is the stump classifier?

In [None]:
stump_classifier.score(
    lending_club_feature_test_scaled,
    lending_club_target_test_df.get_column("fully_paid"),
)

So we'd hope to be able to beat this when we tune our model.

We can examine the tree. There's only one, as we set `n_estimators` to 1.

In [None]:
stump_classifier.get_booster().feature_names = lending_club_feature_train_df.columns

xgb.plot_tree(stump_classifier);

The leaf values are on the logit (log-odds) scale. We can convert them to probabilities.

In [None]:
import math

def inverse_logit(log_odds):
    return math.exp(log_odds) / (1 + math.exp(log_odds))

In [None]:
(inverse_logit(0.0732165724), inverse_logit(-0.169875145))
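
The direct formula works for small values like these, but `math.exp` raises an `OverflowError` for large positive inputs. A minimal sketch of a numerically stable alternative is shown below. The `stable_inverse_logit` name is made up for illustration.

```python
import math


def stable_inverse_logit(log_odds):
    """Map a log-odds value to a probability without overflowing."""
    if log_odds >= 0:
        # exp(-log_odds) is at most 1 here, so it cannot overflow
        return 1.0 / (1.0 + math.exp(-log_odds))

    # For negative inputs, exp(log_odds) is below 1, so it cannot overflow
    exp_value = math.exp(log_odds)
    return exp_value / (1.0 + exp_value)


print(stable_inverse_logit(0.0))      # even odds
print(stable_inverse_logit(1000.0))   # saturates towards 1
print(stable_inverse_logit(-1000.0))  # saturates towards 0
```

Both branches compute the same function. The split simply ensures the exponential is only ever taken of a non-positive number.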

Loans with shorter terms are predicted to be more likely to be paid back.

Create and train a more sophisticated XGBoost classifier.

In [None]:
lending_club_classifier = xgb.XGBClassifier(max_depth=4, random_state=SEED)

lending_club_classifier.fit(
    lending_club_feature_train_scaled,
    lending_club_target_train_df.get_column("fully_paid"),
)

Evaluate the classifier.

In [None]:
print(
    classification_report(
        lending_club_target_test_df.get_column("fully_paid"),
        lending_club_classifier.predict(lending_club_feature_test_scaled),
    )
)

60% accuracy. Still pretty poor. As we get into real-world data, the answers don't just fall into our lap. When we look at worked examples, they often end up with high levels of accuracy, making it look like ML is a panacea. Building good models is an iterative process, with many blind alleys.

Try using different features. Does changing the `max_depth` hyperparameter make any difference? What other changes might we consider to try to improve the quality of the model?