<img src="../images/cover.jpg" width="1920"/>

# Supervised Learning for Classification Problems

## Introduction to Classification

Classification is a supervised learning task where the model learns to predict discrete class labels from labeled training data. Unlike regression, which predicts continuous values, classification predicts categorical outcomes.

Consider an email spam classifier. Each email (input) has specific features, such as:

- Number of recipients
- Presence of specific keywords
- Time sent
- Links within the email
- Sender’s domain

The model's task is to classify each email into one of two categories (binary classification):

- Spam (1)
- Not Spam (0)

## 1. Logistic Regression

### Introduction
Logistic Regression is a classification algorithm used to predict the probability of a binary outcome, where the target variable can be one of two classes, often labeled as 0 or 1. Despite its name, logistic regression is actually a linear model for classification rather than regression. It works by estimating the probability that an input (or instance) belongs to a specific class.

<img src="../images/linear_vs_logistic.png" width="1920"/>

### How Logistic Regression Works

1. **Linear Combination of Inputs**: 
   Logistic regression starts by calculating a weighted sum of the input features. Given a set of input features $X = (x_1, x_2, \ldots, x_n)$, it computes a linear combination as follows:
   
   $$z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$
   
   where $w_0$ is the intercept (bias term), and $w_1, w_2, \ldots, w_n$ are the weights or coefficients associated with each feature.

2. **Sigmoid (Logistic) Function**: 
   The output of the linear combination, $z$, is then passed through the sigmoid function to convert it into a probability value between 0 and 1. The sigmoid function is defined as:
   $$
   \sigma(z) = \frac{1}{1 + e^{-z}}
   $$
   where $e$ is the base of the natural logarithm. The sigmoid function "squashes" the output of $z$ to a range between 0 and 1, representing the probability that the input belongs to the positive class (often labeled as 1).

3. **Probability Interpretation and Thresholding**:
   The output $\sigma(z)$ can be interpreted as the probability of the instance belonging to the positive class. By default, if this probability is greater than or equal to 0.5, the model assigns the class label 1; otherwise, it assigns 0. However, this threshold can be adjusted based on the problem requirements.

4. **Training the Model**:
   During training, logistic regression adjusts the weights $w_0, w_1, \ldots, w_n$ to minimize the difference between the predicted probabilities and the actual class labels. This is done by maximum likelihood estimation, which is equivalent to minimizing the **binary cross-entropy loss** (log loss), typically with an iterative optimizer such as gradient descent.
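The four steps above can be sketched from scratch in NumPy. This is a minimal illustration, not how scikit-learn implements its solver: the function names, learning rate, iteration count, and the one-feature toy data are all assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y_true, p, eps=1e-12):
    """Average log loss; clipping keeps log() away from 0."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))


def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Batch gradient descent on the log loss. Returns (w0, w)."""
    n_samples, n_features = X.shape
    w0, w = 0.0, np.zeros(n_features)
    for _ in range(n_iter):
        p = sigmoid(w0 + X @ w)   # steps 1 and 2: linear combination, then sigmoid
        error = p - y             # gradient of the log loss with respect to z
        w0 -= lr * error.mean()
        w -= lr * (X.T @ error) / n_samples
    return w0, w


# Hypothetical one-feature toy data: larger x means class 1
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

w0, w = fit_logistic(X, y)
proba = sigmoid(w0 + X @ w)
labels = (proba >= 0.5).astype(int)  # step 3: threshold at 0.5
loss = binary_cross_entropy(y, proba)
print(labels, round(loss, 4))
```

The gradient `p - y` is what makes the update so compact: it falls out of combining the sigmoid with the cross-entropy loss.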

### Mathematical Equation for Logistic Regression

The equation for predicting the probability of the positive class $P(y=1|X)$ given input $X$ is:
$$
P(y=1|X) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n)}}
$$

In summary, logistic regression is a linear model with a sigmoid transformation applied to make predictions between 0 and 1, representing the probability of belonging to the positive class. This makes it especially useful for binary classification problems where a clear threshold decision is required.

### When to Use
- Binary classification problems
- When you need probabilistic outcomes
- When you want interpretable results
- When relationships between features and outcomes are roughly linear
- Base model for comparing more complex algorithms

### Implementation

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

In [None]:
# Sample email spam dataset
data = {
    "recipients": [1, 45, 3, 2, 100, 1, 2, 15, 25, 1],
    "contains_urgent": [0, 1, 0, 0, 1, 0, 0, 1, 1, 0],
    "links_count": [1, 15, 2, 1, 20, 0, 2, 10, 12, 1],
    "is_spam": [0, 1, 0, 0, 1, 0, 0, 1, 1, 0],
}

# Create DataFrame
df = pd.DataFrame(data)

df.head()

In [None]:
# Separate features and target
X = df[["recipients", "contains_urgent", "links_count"]]
y = df["is_spam"]

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
for i, j in zip(X_train_scaled, y_train):
    print(f"{i}\t->\t{j}")
    print("_" * 50)

In [None]:
# Create and train the model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

In [None]:
# Make predictions
y_pred = log_reg.predict(X_test_scaled)

# Compare predictions and actual values
for i, j, k in zip(X_test_scaled, y_pred, y_test):
    print(f"Input: {i}\nPredicted: {j}\nActual: {k}")
    print("_" * 50)
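The default 0.5 threshold mentioned earlier can be changed by working with predicted probabilities instead of hard labels. The sketch below uses `predict_proba` on a small hypothetical dataset (`X_demo`, `y_demo` and the 0.8 / 0.2 cut-offs are illustrative choices, not values from this notebook).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, larger values tend to belong to class 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0]])
y_demo = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression().fit(X_demo, y_demo)

# Column 1 of predict_proba holds P(y = 1 | x)
proba_pos = model.predict_proba(X_demo)[:, 1]

default_labels = model.predict(X_demo)           # behaves like a 0.5 cut-off
strict_labels = (proba_pos >= 0.8).astype(int)   # fewer positives, fewer false alarms
lenient_labels = (proba_pos >= 0.2).astype(int)  # more positives, fewer misses

print(np.round(proba_pos, 3))
print(default_labels, strict_labels, lenient_labels)
```

A stricter threshold suits cases where false positives are costly (for example, sending legitimate email to the spam folder), while a more lenient one suits cases where missing a positive is worse.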

In [None]:
import numpy as np

In [None]:
# For a large positive z the sigmoid saturates: the result is 1.0 in floating point
z = 100
1 / (1 + np.exp(-z))

## 2. Decision Trees

### Introduction
Decision Trees are a type of supervised learning algorithm used for both classification and regression tasks. They work by splitting data into subsets based on feature values, creating a tree-like model of decisions. Each internal node of the tree represents a decision based on a specific feature, each branch represents the outcome of that decision, and each leaf node represents a final prediction or outcome. Decision trees are popular because they are easy to understand, interpret, and visualize, and they work well with both numerical and categorical data.

<img src="../images/decision_tree_classifier_fruit.jpg" width="500"/>

### How Decision Trees Work

1. **Choosing a Feature to Split**:
   The decision tree algorithm starts at the root node and selects the feature that best separates the data based on a criterion (often **Gini impurity** or **information gain**). It splits the data into subsets such that each subset contains similar instances in terms of the target variable.

2. **Splitting the Data**:
   The algorithm evaluates possible splits at each node and selects the one that maximizes the separation between classes (for classification) or minimizes prediction error (for regression). This splitting process is recursive, creating branches in the tree.

3. **Stopping Criteria**:
   The tree continues splitting the data until one of several stopping criteria is met:
   - All data in a node belong to the same class (for classification).
   - A maximum depth is reached.
   - Further splits do not improve the model significantly.

4. **Making Predictions**:
   Once the tree is built, predictions are made by traversing the tree from the root node to a leaf node based on the feature values of the input. The label or value in the leaf node is the final prediction for that input.

### Mathematical Criteria for Splitting

To determine the best splits, decision trees use criteria like **Gini impurity** and **information gain**.

#### Gini Impurity (for Classification)

Gini impurity is a measure of how often a randomly chosen element would be incorrectly classified if it were randomly labeled according to the distribution of labels in the subset. For a node $t$ with $C$ classes (for binary classification, $C = 2$), the Gini impurity is calculated as:
$$
\text{Gini}(t) = 1 - \sum_{i=1}^C p_i^2
$$
where $p_i$ is the proportion of elements in node $t$ that belong to class $i$, and $C$ is the number of classes. A lower Gini impurity indicates a "purer" node.
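As a quick check of the formula, the Gini impurity of a node can be computed directly from its labels. The helper name `gini_impurity` and the two example nodes are illustrative assumptions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


pure_node = ["Grape", "Grape", "Grape"]
mixed_node = ["Apple", "Apple", "Lemon", "Lemon"]

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```

A node holding a single class scores 0, and an even two-class mix scores 0.5, which is the maximum for two classes.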

#### Information Gain (using Entropy)

Another criterion is information gain, which measures the reduction in entropy from a split. Entropy, a measure of disorder, is calculated for a node $t$ as:
$$
\text{Entropy}(t) = - \sum_{i=1}^C p_i \log_2(p_i)
$$
where $p_i$ is the probability of class $i$ in node $t$. Information gain for a split $S$ is then calculated as:
$$
\text{Information Gain}(S) = \text{Entropy}(t_{\text{parent}}) - \sum_{k=1}^K \frac{|t_k|}{|t_{\text{parent}}|} \text{Entropy}(t_k)
$$
where $t_{\text{parent}}$ is the original node before the split, $t_k$ are the $K$ resulting child nodes from the split, and $|t_k|$ is the number of instances in child node $t_k$.

The split with the highest information gain (or the lowest weighted Gini impurity across the child nodes) is chosen at each step to build the tree.

In summary, decision trees build a model by recursively splitting the dataset based on features that best separate the data, according to criteria like Gini impurity or information gain. The final tree structure can be used for making predictions by following decision paths from the root to the leaf nodes. This structure makes decision trees intuitive and interpretable, though they can be prone to overfitting, especially if they are allowed to grow too deep.
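The entropy and information gain formulas can be worked through on a tiny example. The function names and the two hand-made splits below are assumptions for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of the class labels in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = [0, 0, 0, 0, 1, 1, 1, 1]
perfect_split = [[0, 0, 0, 0], [1, 1, 1, 1]]  # each child is pure
useless_split = [[0, 0, 1, 1], [0, 0, 1, 1]]  # children look like the parent

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

The parent has an entropy of 1 bit, so a split that yields pure children recovers the full bit, while a split that leaves the class mix unchanged gains nothing.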

### When to Use
- When you need interpretable results
- When you have mixed data types (numerical and categorical)
- When you don't need to scale features
- When relationships between features and target are non-linear
- When you want to capture feature interactions

### Implementation

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
# Toy fruit dataset: each row is [color, diameter, label]
training_data = [
    ["Green", 3, "Apple"],
    ["Yellow", 3, "Apple"],
    ["Red", 1, "Grape"],
    ["Red", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
    ["Red", 3, "Apple"],
    ["Green", 3, "Pear"],
    ["Yellow", 2, "Pear"],
    ["Purple", 1, "Grape"],
    ["Green", 1, "Grape"],
    ["Yellow", 3, "Lemon"],
    ["Green", 2, "Lime"],
    ["Yellow", 2, "Lemon"],
    ["Red", 2, "Plum"],
    ["Purple", 2, "Plum"],
]

# Create DataFrame
df = pd.DataFrame(training_data, columns=["color", "diameter", "label"])

df.head()

In [None]:
# Initialize one encoder for the color feature and one for the target label
color_encoder = LabelEncoder()
label_encoder = LabelEncoder()

In [None]:
# Encode features and labels from strings to integers to be used in the model
df["color"] = color_encoder.fit_transform(df["color"])
df["label"] = label_encoder.fit_transform(df["label"])

df.head()

In [None]:
# Separate features and target
X = df[["color", "diameter"]]
y = df["label"]

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [None]:
# Create and train the model
dt_clf = DecisionTreeClassifier(random_state=42, max_depth=3)
dt_clf.fit(X_train, y_train)

In [None]:
# Make predictions
y_pred_dt = dt_clf.predict(X_test)

In [None]:
# Feature importance
importance = pd.DataFrame(
    {"feature": X.columns, "importance": dt_clf.feature_importances_}
)

importance
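Besides feature importances, the learned rules themselves can be printed, which makes the root-to-leaf traversal described earlier concrete. The sketch below uses scikit-learn's `export_text` on a small hypothetical table (`toy`, with an invented `is_red` column), not the fruit data above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical numeric toy data; the label depends only on diameter
toy = pd.DataFrame(
    {
        "diameter": [1, 1, 1, 2, 2, 3, 3, 3],
        "is_red": [1, 0, 1, 0, 1, 0, 1, 0],
        "label": ["Grape", "Grape", "Grape", "Plum", "Plum", "Apple", "Apple", "Apple"],
    }
)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(toy[["diameter", "is_red"]], toy["label"])

# Each indented line is a test on a feature; each "class:" line is a leaf
rules = export_text(tree, feature_names=["diameter", "is_red"])
print(rules)

# Predicting means following those tests from the root down to a leaf
new_fruit = pd.DataFrame({"diameter": [3], "is_red": [0]})
print(tree.predict(new_fruit)[0])
```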

## 3. Random Forests

### Introduction
Random Forests is an ensemble learning method used for classification and regression tasks. It operates by building a collection (or "forest") of decision trees and combining their predictions to produce a more robust, accurate model. Random forests improve on individual decision trees by reducing variance, making the model less prone to overfitting and more capable of handling complex datasets.

<img src="images/random_forest_classifier_fruit.jpg" alt="Random forest classifier diagram" width="1920"/>

### How Random Forests Work

1. **Creating Multiple Decision Trees**:
   A random forest creates multiple decision trees using different subsets of the training data. For each tree, a random sample of the dataset is selected with replacement, a process called **bootstrap sampling**. This means some instances may appear more than once in a tree's training set, while others may be left out.

2. **Feature Randomness**:
   In addition to sampling data, random forests also introduce randomness in feature selection. When splitting nodes, each tree considers only a random subset of features rather than all features. This "feature randomness" helps to make the individual trees less correlated with each other, reducing overfitting.

3. **Building the Trees**:
   Each decision tree is built independently using the selected subset of data and features. The trees are grown to their maximum depth (or another stopping criterion) without pruning, which allows each tree to capture patterns in the data.

4. **Combining the Predictions**:
   Once all trees are built, the random forest combines their predictions. For **classification tasks**, it uses **majority voting**: each tree votes for a class, and the class with the most votes is the final prediction. For **regression tasks**, it averages the predictions of all trees.

5. **Out-of-Bag (OOB) Error**:
   An additional advantage of random forests is that they can estimate the error without needing a separate validation set. Since each tree is trained on a bootstrap sample, roughly 37% of the instances (a little over one-third) are left out of any given tree's sample (called "out-of-bag" data). These OOB instances are used to evaluate the model's accuracy, providing a useful estimate of generalization error.
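The share of instances left out of a bootstrap sample can be checked with a short simulation. The sample size and seed are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# Draw a bootstrap sample: n_samples row indices, with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Rows that were drawn at least once are "in bag"; the rest are out-of-bag
in_bag = np.unique(bootstrap_idx)
oob_fraction = 1 - in_bag.size / n_samples

# Theory: a given row is missed with probability (1 - 1/N)^N, which tends to 1/e
expected = (1 - 1 / n_samples) ** n_samples
print(f"observed OOB fraction: {oob_fraction:.3f}, expected: {expected:.3f}")
```

Both numbers land close to $1/e \approx 0.368$, which is where the "about one-third" rule of thumb comes from.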

### Mathematical Formulation of Random Forests

1. **Bootstrap Sampling**:
   Let the dataset have $N$ instances. For each tree $T_i$, draw a random sample of size $N$ with replacement. This bootstrap sample is used to train tree $T_i$.

2. **Random Feature Selection**:
   At each node in the tree, a subset of $m$ features is randomly selected from the total $p$ features (where typically $m \approx \sqrt{p}$ for classification and $m \approx \frac{p}{3}$ for regression). The feature that best splits the data is chosen from this subset.

3. **Combining Predictions**:
   - **For Classification**: If there are $B$ trees, each tree $T_i$ provides a class prediction $y_i$. The random forest's final prediction $\hat{y}$ is determined by majority voting:
     $$
     \hat{y} = \text{mode}(y_1, y_2, \ldots, y_B)
     $$
   - **For Regression**: Each tree $T_i$ produces a predicted value $\hat{y}_i$. The final prediction $\hat{y}$ of the random forest is the average of all tree predictions:
     $$
     \hat{y} = \frac{1}{B} \sum_{i=1}^B \hat{y}_i
     $$

4. **Out-of-Bag Error (OOB Error)**:
   For each instance in the dataset, predictions are made only by the trees that did not use it in their bootstrap sample. The OOB error is calculated as the average error on these OOB predictions, which serves as an estimate of the model's performance on unseen data without a separate validation set.
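The formulation above can be sketched by hand with a loop over decision trees. This is a simplified illustration on synthetic data, assuming binary labels and hard majority voting; scikit-learn's `RandomForestClassifier` instead averages the trees' predicted class probabilities, so results can differ slightly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for this sketch)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)
rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary vote cannot tie

trees = []
for b in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample of size N
    # max_features="sqrt" considers about sqrt(p) random features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    trees.append(tree.fit(X[idx], y[idx]))

# Shape (n_trees, n_samples): one row of votes per tree
votes = np.stack([t.predict(X) for t in trees])

# Majority vote for labels 0/1: predict 1 when more than half the trees say 1
y_hat = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the vote:", (y_hat == y).mean())
```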

### Summary

Random Forests combine the predictions of multiple decision trees trained on different subsets of data and features. By introducing randomness in data sampling and feature selection, random forests produce a more generalized and robust model than individual decision trees, reducing the risk of overfitting. This ensemble method works well with large datasets and is highly effective for both classification and regression tasks.

### Implementation

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Create and train the model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

In [None]:
import numpy as np

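# Fixing the seed makes the pseudo-random draw reproducible, which is the role random_state plays above
np.random.seed(42)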
np.random.randint(0, 100)

In [None]:
# Make predictions
y_pred = rf_clf.predict(X_test)

# Compare predictions and actual values
for i, j, k in zip(X_test.values, y_pred, y_test):
    print(
        f"Input: Color: {i[0]} ({color_encoder.inverse_transform([i[0]])[0]}), Diameter: {i[1]}"
    )
    print(f"Predicted: {j} ({label_encoder.inverse_transform([j])[0]})")
    print(f"Actual: {k} ({label_encoder.inverse_transform([k])[0]})")

    print("_" * 50)
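The out-of-bag estimate described earlier is available in scikit-learn by passing `oob_score=True`. The sketch below uses synthetic data (`X_syn`, `y_syn`) because the fruit dataset is too small for a meaningful OOB estimate; the sample size and number of trees are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for this sketch)
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=42)

# oob_score=True scores each row using only the trees that did not train on it
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X_syn, y_syn)

print(f"OOB accuracy: {forest.oob_score_:.3f}")
```

The `oob_score_` attribute reports accuracy for classifiers, so the OOB error is one minus that value.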

## Model Comparison

Each model has its strengths:

1. **Logistic Regression**:
   - Simple and interpretable
   - Provides probability scores
   - Works well with linear relationships
   - Fast to train and predict
   - Usually benefits from feature scaling

2. **Decision Trees**:
   - Highly interpretable
   - Handles non-linear relationships
   - No scaling required
   - Can capture feature interactions
   - Prone to overfitting

3. **Random Forests**:
   - Generally better accuracy than single trees
   - Less prone to overfitting
   - Provides reliable feature importance
   - Handles non-linear relationships
   - Less interpretable than single trees

# Practical: Real-World Example

# Heart Disease Dataset

**About the Dataset**

**Context**  
This [dataset](https://www.kaggle.com/code/megoooo/heart-disease-logistecregression) was created in 1988 and combines information from four sources: Cleveland, Hungary, Switzerland, and Long Beach V. It includes 76 attributes, but studies typically use only 14 of them. The "target" field shows if the patient has heart disease, with 0 meaning no disease and 1 meaning disease.

**Content**  
The 14 key attributes are the "target" field plus the following 13 features:

- Age
- Sex
- Chest pain type (4 types)
- Resting blood pressure
- Serum cholesterol level (mg/dl)
- Fasting blood sugar > 120 mg/dl
- Resting electrocardiographic results (0, 1, or 2)
- Maximum heart rate achieved
- Exercise-induced angina
- Oldpeak (ST depression caused by exercise vs. rest)
- Slope of the peak exercise ST segment
- Number of major vessels (0-3) colored by fluoroscopy
- Thalassemia type (0 = normal, 1 = fixed defect, 2 = reversible defect)

Patient names and social security numbers have been replaced with anonymous values.

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
)
import matplotlib.pyplot as plt
import seaborn as sns

### Load the Dataset

In [None]:
data = pd.read_csv("data/heart.csv")

data.head(5)

### Data Exploration

In [None]:
# Summary statistics
data.describe()

In [None]:
# Dataset information
data.info()

In [None]:
# Check for missing values in each column
data.isnull().sum()

In [None]:
# Plot correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Matrix")
plt.show()

In [None]:
# Plot Target variable distribution
sns.countplot(x="target", data=data)
plt.title("Distribution of Target (0 = No Heart Disease, 1 = Heart Disease)")
plt.show()

### Feature Selection

In [None]:
high_corr_features = ["cp", "thalach", "exang", "oldpeak", "slope", "ca", "thal"]

# Create a new DataFrame with only the selected features
X_selected = data[high_corr_features]
y = data["target"]

In [None]:
# X_selected = data
# X_selected = X_selected.drop("target", axis=1)
# y = data["target"]

In [None]:
X_selected

### Feature Scaling

In [None]:
# Initialize the scaler
scaler = StandardScaler()

In [None]:
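# Scale the features. Caveat: fitting on the full dataset before the split leaks
# test-set statistics; fit on the training split only to avoid this.
X_scaled = scaler.fit_transform(X_selected)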

In [None]:
pd.DataFrame(X_scaled, columns=X_selected.columns).head()

### Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0
)
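An alternative ordering is to split first and let a pipeline fit the scaler on the training rows only, which matches the logistic regression example earlier in this notebook. The sketch below is a hedged illustration on synthetic data (`X_syn`, `y_syn`); it does not reproduce the heart disease results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the heart disease CSV is not assumed to be available here
X_syn, y_syn = make_classification(n_samples=400, n_features=7, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# fit() learns the scaling statistics from X_tr only;
# score() reuses those statistics to transform X_te before predicting
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

print(f"test accuracy: {pipe.score(X_te, y_te):.3f}")
```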

In [None]:
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print("")
print(f"X_test shape: {X_test.shape}")
print(f"y_test shape: {y_test.shape}")

### Model Training (Logistic Regression)

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

### Model Prediction & Evaluation

In [None]:
# Predict on the test set
y_pred = logreg.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

In [None]:
# Classification Report
class_report = classification_report(y_test, y_pred)
print(class_report)

In [None]:
plt.figure(figsize=(8, 6))
sns.heatmap(
    confusion_matrix(y_test, y_pred),
    annot=True,
    fmt="d",
    cmap="Blues",
    cbar=False,
    xticklabels=["Predicted Negative", "Predicted Positive"],
    yticklabels=["Actual Negative", "Actual Positive"],
)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix Heatmap")
plt.show()

This confusion matrix visualizes the performance of the binary classification (Logistic Regression) model. The matrix is divided into four cells that represent different prediction outcomes:

- **True Negatives (Top-left)**: 74 instances were correctly identified as negative (i.e., the model predicted "negative," and the actual label was also "negative").
- **False Positives (Top-right)**: 28 instances were incorrectly predicted as positive when they were actually negative.
- **False Negatives (Bottom-left)**: 13 instances were incorrectly predicted as negative when they were actually positive.
- **True Positives (Bottom-right)**: 90 instances were correctly identified as positive.

This matrix shows that the model correctly identified 74 negatives and 90 positives, while it misclassified 28 negatives as positives and 13 positives as negatives. 

This information can be used to calculate key metrics like accuracy, precision, recall, and F1-score, helping to assess the model's effectiveness in identifying positive and negative cases.
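Using the four counts quoted above, the metrics can be computed by hand as a quick arithmetic check. The variable names are illustrative, and the counts are copied from the description of the matrix rather than recomputed from the model.

```python
# Counts taken from the confusion matrix described above
tn, fp, fn, tp = 74, 28, 13, 90

total = tn + fp + fn + tp            # 205 test instances
accuracy = (tp + tn) / total         # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are real
recall = tp / (tp + fn)              # share of real positives that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")   # 0.8000
print(f"Precision: {precision:.4f}")  # 0.7627
print(f"Recall:    {recall:.4f}")     # 0.8738
print(f"F1-score:  {f1:.4f}")         # 0.8145
```

Recall is higher than precision here, meaning the model misses relatively few patients with heart disease but raises more false alarms.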