<a href="https://colab.research.google.com/github/chenethanxd/data/blob/main/classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification

Classification is a supervised machine learning method where the model predicts the correct label or category from a set of input data. The model is trained using labeled training data, and then it performs predictions on unseen data.

For example:

* Customer Churn Prediction: Detecting whether a customer is likely to churn (leave the service) is a classification problem.
* COVID-19 Detection: During the COVID-19 pandemic, classification models were used to predict whether a person had COVID-19 based on their symptoms and test results.

It's important not to confuse classification with regression, another type of supervised learning. While both involve predicting outcomes based on input data, they differ in the type of output they predict:

* Classification: Predicts discrete labels or categories (e.g., spam vs. non-spam emails, whether a tumor is malignant or benign).
* Regression: Predicts continuous values (e.g., predicting house prices, estimating a person's salary based on their experience).
    
Understanding the distinction between these methods is crucial for selecting the appropriate model and evaluation metrics for your specific problem.

* In classification, the label comes from a finite set of categories that the model sees during training. These are called discrete labels. For example, predicting whether an email is spam or not spam.
* In regression, the label is a real number that can take any value in a continuous range. These are called continuous labels. For example, predicting house prices based on various features like size, location, and number of bedrooms.

## Introduction

Classification is a fundamental task in machine learning where the goal is to assign labels to instances based on their features. This process involves training a model on a labeled dataset so that it can predict the correct label for new, unseen instances. Classification is widely used in various applications, including spam detection, medical diagnosis, and image recognition.

Several popular methods are employed in classification:

* Logistic Regression: A statistical model used to predict a binary outcome based on one or more predictors (input values). It works best when the classes are roughly linearly separable.
* Support Vector Machine (SVM): This method finds a line or hyperplane that best separates the data into different classes and is usually used in high-dimensional spaces.
* Decision Tree: These models use a tree-like graph of decisions and their possible consequences. They are easy to interpret and can handle both categorical and numerical data.
* K-Nearest Neighbors (KNN): A non-parametric method where the class of a sample is determined by the majority class among its k-nearest neighbors in the feature space. It is simple and effective for small datasets.

## Understanding classification

### Definition

Classification is a supervised learning task in machine learning where the goal is to categorize instances into predefined classes based on their features. It is a crucial process because it enables the automation of decision-making and prediction processes across various fields. Classification helps in organizing and interpreting complex data, making it easier to derive actionable insights.

### Types of Classification

There are 3 main types of classification:

* Binary Classification: Involves two classes. Examples include spam detection (spam or not spam) and disease diagnosis (disease or no disease).
* Multiclass Classification: Involves more than two classes. Examples include digit recognition (0-9) and categorizing types of animals in images.
* Multilabel Classification: Each instance can belong to multiple classes simultaneously. Examples include tagging multiple objects in an image or classifying news articles into multiple topics.
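
To make the three types concrete, here is a minimal sketch of how the labels are commonly stored as arrays. The values and the tag names are made up for illustration; they do not come from any dataset used in this notebook.

```python
import numpy as np

# Binary: one label per instance, drawn from two classes (0 = not spam, 1 = spam)
y_binary = np.array([0, 1, 1, 0])

# Multiclass: one label per instance, drawn from more than two classes (digits 0-9)
y_multiclass = np.array([3, 7, 0, 9])

# Multilabel: one row per instance, one column per possible tag;
# a 1 means the tag applies, and a row may contain several 1s
tags = ["cat", "dog", "tree"]  # hypothetical tag names
y_multilabel = np.array([
    [1, 0, 1],  # cat and tree
    [0, 1, 0],  # dog only
    [1, 1, 1],  # all three tags
    [0, 0, 0],  # no tags
])

print(y_binary.shape, y_multiclass.shape, y_multilabel.shape)
```

The binary and multiclass targets are one-dimensional, while the multilabel target needs one column per tag.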

## Packages

In this training material, we'll mostly use `numpy` to perform operations. We'll also cover some extra packages that simplify the workload. It's important that we understand how the operations work under the hood. In a nutshell, here's what you need:

* numpy: A Python library used for working with arrays.
* pandas: A data manipulation and analysis library.
* scikit-learn: A library for machine learning that includes numerous classification algorithms.
* matplotlib and seaborn: Libraries for data visualization.
* math: A standard Python library for mathematical operations (bundled with Python, so no installation is needed).

Some models will include a NumPy version to help us understand how they work. These models are often based on mathematical formulas. A basic understanding of linear algebra, probability, and statistics is necessary to fully grasp the concepts. You are encouraged to try your best to complete every exercise.

## Models

### K-nearest Neighbors (KNN)

K-Nearest Neighbors (KNN) is a supervised learning algorithm that classifies new data points based on their similarity to known data points. Unlike most models, KNN doesn't learn parameters during training; instead, it stores the entire training dataset. At prediction time, the algorithm calculates the distance between the new data point and every stored point, then assigns the new point to the class most common among its K nearest neighbors, where K is a user-defined parameter. This simplicity makes KNN applicable to many kinds of data, though it can be computationally intensive with large datasets, is sensitive to the choice of distance metric and to feature scaling, and can be thrown off by noisy points when K is small.

In practice, KNN is effective for small to moderate-sized datasets where the underlying data distribution is not explicitly known. Its ability to handle both numerical and categorical data without assumptions about their distributions makes it versatile. However, its lazy-learning approach, where predictions are made only when needed, can lead to slower prediction times compared to eager-learning algorithms. Understanding these trade-offs helps in choosing KNN appropriately for tasks like image recognition, recommendation systems, and medical diagnostics where similarity-based classification is beneficial.
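
The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `knn_predict`, the toy arrays `demo_X` and `demo_y`, and the default `k=3` are all assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of one query point by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Tiny hypothetical dataset: two well-separated groups in 2D
demo_X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                   [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
demo_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(demo_X, demo_y, np.array([1.2, 1.4])))  # query near the first group
print(knn_predict(demo_X, demo_y, np.array([6.4, 6.2])))  # query near the second group
```

Note that there is no training step: all the work happens at prediction time, which is why KNN is called a lazy learner.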

#### Distance metrics

Distance metrics measure how close two data points are in the feature space, which is how KNN decides which training points count as neighbors. Two commonly used distance metrics are Euclidean and Manhattan distances.

#### Euclidean distance

This is the straight-line distance between two points in Euclidean space. It measures the shortest path between two points, which is the geometric interpretation of distance we are most familiar with in everyday space. Here's a general formula:

$$d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

where $\mathbf{p}$ and $\mathbf{q}$ are two points in Euclidean space and $n$ is the number of dimensions.

#### Manhattan distance (Taxicab distance)

Also known as city block distance or L1 distance, Manhattan distance calculates the distance between two points by summing the absolute differences of their Cartesian coordinates. It resembles the distance a car would travel along city blocks to get from one point to another. The general formula is:

$$d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} |p_i-q_i|$$

where $\mathbf{p}$ and $\mathbf{q}$ are two points in Euclidean space and $n$ is the number of dimensions.
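
Both formulas translate directly into vectorized NumPy. The sketch below uses two example points chosen for illustration and cross-checks the hand-written expressions against `np.linalg.norm`.

```python
import numpy as np

p = np.array([3, 5, 1])
q = np.array([0, 2, 9])

# Euclidean (L2) distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))

# Manhattan (L1) distance: sum of the absolute differences
manhattan = np.sum(np.abs(p - q))

# np.linalg.norm computes the same quantities via its `ord` argument
assert np.isclose(euclidean, np.linalg.norm(p - q, ord=2))
assert np.isclose(manhattan, np.linalg.norm(p - q, ord=1))

print(round(float(euclidean), 3), int(manhattan))
```

The coordinate differences are 3, 3, and -8, so the Euclidean distance is the square root of 82 and the Manhattan distance is 14.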

Install the packages required to execute the notebook:

In [None]:
%pip install --upgrade --quiet matplotlib numpy pandas scikit-learn scipy

Import the necessary libraries

In [None]:
import math
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

from scipy.spatial.distance import cdist
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.tree import DecisionTreeClassifier

from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, auc

from sklearn import datasets
from sklearn import tree

seed = 520

In [None]:
def euclidean_dist(pA, pB):
    """
    Task: calculate the Euclidean distance between two points

    Args:
    - pA (list): coordinates of point A, one value per dimension
    - pB (list): coordinates of point B, one value per dimension

    Return:
    - dist (float): Euclidean distance between two points
    """
    # Accumulate the squared difference along each dimension
    n_dimension = len(pA)
    squared_sum = 0
    for i in range(n_dimension):
        squared_sum += (pA[i] - pB[i]) ** 2

    # The distance is the square root of the accumulated sum
    dist = math.sqrt(squared_sum)

    return dist

def manhattan_dist(pA, pB):
    """
    Task: calculate the Manhattan distance between two points

    Args:
    - pA (list): coordinates of point A, one value per dimension
    - pB (list): coordinates of point B, one value per dimension

    Return:
    - dist (float): Manhattan distance between two points
    """
    n_dimension = len(pA)
    dist = 0
    for i in range(0, n_dimension):
        temp = abs(pA[i] - pB[i])
        dist += temp

    return dist

In [None]:
assert euclidean_dist([0, 0], [6, 8]) == 10
assert round(euclidean_dist([9, 9], [8, 8]), 3) == 1.414
assert round(euclidean_dist([3, 5, 1], [0, 2, 9]), 3) == 9.055

assert manhattan_dist([0, 0], [6, 8]) == 14
assert manhattan_dist([9, 9], [8, 8]) == 2
assert manhattan_dist([3, 5, 1], [0, 2, 9]) == 14

In K-Nearest Neighbors (KNN), the parameter 𝐾 determines the number of nearest neighbors considered when making predictions. Choosing 𝐾 impacts how the model interprets the data's local structure. A small 𝐾 makes predictions sensitive to noise in individual neighbors (higher variance), while a larger 𝐾 smooths out decisions but risks oversimplifying the data and increasing bias.

Advantages:

* Requires minimal parameters for training: 𝐾 and the distance metric.
* Adapts dynamically to new data points, adjusting predictions as the dataset evolves.

Disadvantages:

* Slower predictions as the dataset grows, because each prediction computes the distance to every stored training point.
* High storage requirements as the model stores the entire dataset, impacting memory usage and execution time.
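
The effect of 𝐾 can be observed empirically. The sketch below is an illustration under assumptions: it uses a synthetic dataset from `make_blobs` (not a dataset from this notebook) and an arbitrary set of 𝐾 values, and compares training and test accuracy for each.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data: three overlapping clusters
X_demo, y_demo = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

scores = {}
for k in (1, 5, 15, 51):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Record (train accuracy, test accuracy) for this K
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"K={k:>2}  train={scores[k][0]:.3f}  test={scores[k][1]:.3f}")
```

With 𝐾 = 1 every training point is its own nearest neighbor, so the training accuracy is perfect even though the model may generalize poorly. Comparing the test column across 𝐾 values is a simple way to pick a reasonable setting.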

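We will create a synthetic dataset from scratch using NumPy. In this exercise, you will implement K-means on it. Note that K-means is an unsupervised clustering algorithm rather than a classifier, so it is distinct from KNN, but it relies on the same distance computations: each point is assigned to its nearest center.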

In [None]:
def generate_data(means, cov, N, K):
    """
    Generate synthetic dataset

    Args:
    - means (list of lists): Containing K mean vectors, each of length d
    - cov (list of lists): The covariance matrix of shape (d, d)
    - N (int): The number of samples to generate for each cluster
    - K (int): The number of clusters

    Returns:
    - X (np.ndarray): An array of shape (N*K, d) containing the generated data points
    - original_label (np.ndarray): An array of shape (N*K,) containing the true labels for the generated data points
    """
    np.random.seed(seed)
    # Draw N points around each of the K means, then stack them into one array
    X = np.concatenate(
        [np.random.multivariate_normal(means[k], cov, N) for k in range(K)], axis=0
    )
    # Label k is repeated N times, in the same order as the stacked clusters
    original_label = np.repeat(np.arange(K), N)

    return X, original_label

def kmeans_display(X, label):
    """
    Visualize the dataset

    Args:
    - X (np.ndarray): An array of shape (N*K, d) containing the generated data points
    - label (np.ndarray): An array of shape (N*K,) containing the true labels for the generated data points
    """
    K = np.amax(label) + 1
    X0 = X[label == 0, :]
    X1 = X[label == 1, :]
    X2 = X[label == 2, :]

    plt.plot(X0[:, 0], X0[:, 1], 'b^', markersize = 4, alpha = .8)
    plt.plot(X1[:, 0], X1[:, 1], 'go', markersize = 4, alpha = .8)
    plt.plot(X2[:, 0], X2[:, 1], 'rs', markersize = 4, alpha = .8)

    plt.axis('equal')
    plt.plot()
    plt.show()

In [None]:
N, K = 500, 3
means = [[2, 2], [7, 3], [3, 6]]
cov = [[1, 0], [0, 1]]

# Generate and visualize data
X, original_label = generate_data(means, cov, N, K)
kmeans_display(X, original_label)

In [None]:
def kmeans_init_centers(X, nc):
    """
    Initialize the centers

    Args:
    - X (np.ndarray): An array of shape (N*K, d) containing the generated data points
    - nc (int): number of centers to be initialized

    Returns:
    - np.ndarray: An array of shape (nc, d) containing the initialized centers.
    """
    # randomly pick k rows of X as initial centers using np.random.choice function
    np.random.seed(seed)

    row = np.random.choice(X.shape[0], nc, replace=False)
    return X[row]

def kmeans_assign_labels(X, centers):
    """
    Assign labels to each data point based on the nearest center.

    Args:
    - X (np.ndarray): An array of shape (N*K, d) containing the generated data points
    - centers (np.ndarray): An array of shape (k, d) containing the current center coordinates.

    Returns:
    np.ndarray: An array of shape (N*K,) containing the index of the closest center for each data point.
    """

    pair_distance = cdist(X, centers)
    return np.argmin(pair_distance, axis=1)


def kmeans_update_centers(X, labels, K):
    """
    Update the center coordinates based on the assigned labels.

    Args:
    - X (np.ndarray): An array of shape (N*K, d) containing the generated data points
    - labels (np.ndarray): An array of shape (N*K,) containing the labels for each data point.
    - K (int): Number of clusters.

    Returns:
    np.ndarray: An array of shape (K, d) containing the updated center coordinates.
    """
    centers = np.zeros((K, X.shape[1]))
    for i in range(K):
        # collect all points assigned to the i-th cluster
        # take average

        index = np.where(labels == i)[0]
        centers[i, :] = np.mean(X[index], axis=0)

    return centers

def has_converged(centers, new_centers):
    """
    Check if the centers have converged.

    Args:
    - centers (np.ndarray): An array of shape (k, d) containing the old center coordinates.
    - new_centers (np.ndarray): An array of shape (k, d) containing the new center coordinates.

    Returns:
    - bool: True if the centers have converged, False otherwise.
    """

    og_set = set([tuple(center) for center in centers])
    new_set = set([tuple(center) for center in new_centers])
    return og_set == new_set

def kmeans(X, K):
    """
    Perform K-means clustering.

    Args:
    - X (np.ndarray): An array of shape (N*K, d) containing the generated data points
    - K (int): Number of clusters.

    Returns:
    - tuple: A tuple containing:
        + centers (list of np.ndarray): A list containing the center coordinates at each iteration.
        + labels (list of np.ndarray): A list containing the labels at each iteration.
        + it (int): The number of iterations until convergence.
    """
    # save the center coordinates of each iteration
    centers = [kmeans_init_centers(X, K)]
    # save the labels of each iteration
    labels = []
    it = 0
    while True:
        # at each iteration:
        # 1. assign label for each points and append to labels
        # 2. update the centers
        # 3. check the convergence condition
        #    and append NEW center coordinates to centers
        # 4. update iteration

        labels.append(kmeans_assign_labels(X, centers[-1]))
        new_centers = kmeans_update_centers(X, labels[-1], K)
        if has_converged(centers[-1], new_centers):
            break
        centers.append(new_centers)
        it += 1

    return (centers, labels, it)

In [None]:
K = 3
(centers, labels, it) = kmeans(X, K)
kmeans_display(X, labels[-1])

assert centers[-1].shape == (3, 2)

centers = kmeans_init_centers(X, K)
assert centers.shape == (K, 2)

assigned_labels = kmeans_assign_labels(X, centers)
assert assigned_labels.shape == (N*3, )

We'll be using `sklearn` to create a KNN model to classify types of flowers.

Using the sklearn library to perform each step:
- Step 1: Split the dataset into 2 sets: a train set and a test set (70% train, 30% test)
- Step 2: Create a KNN model with 5 neighbors (`n_neighbors=5`)
- Step 3: Test how accurate the model is on the test set

In [None]:
iris = pd.read_csv("https://raw.githubusercontent.com/chenethanxd/data/main/iris.csv")
X = iris.drop('target', axis=1)
y = iris['target']

# Step 1
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=seed)
# Step 2
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_X, train_y)
# Step 3
pred_y = knn.predict(test_X)
accuracy = metrics.accuracy_score(test_y, pred_y)  # arguments are (y_true, y_pred)
print('The accuracy of the KNN is', accuracy)

## Logistic Regression

Logistic regression is a binary classification method that uses the sigmoid function to predict a probability between 0 and 1 for each input. Inputs are assigned to one of two classes, Class 0 or Class 1, depending on whether the sigmoid output exceeds a threshold (typically 0.5). Despite its name, logistic regression is used for classification rather than regression: it extends linear regression by passing the linear output through the sigmoid. It estimates the probability of an input belonging to a particular class based on its explanatory variables, making it a versatile tool in predictive modeling.

### Sigmoid function

The sigmoid function is an activation function commonly used in machine learning and neural networks. It takes an input value and outputs a probability score between 0 and 1. Here's the formula:

$$\sigma(z)=\dfrac{1}{1+e^{-z}}$$

where $z$ is the weighted sum of the input features, $z = w^Tx + b$.

The value of the sigmoid function always lies between 0 and 1, with 0.5 typically used as the threshold for determining the binary label.

### Hypothesis function

The hypothesis function predicts the output value from the input features using the learned weights and bias. Here's the formula:

$$H=\sigma(w^TX+b)$$

where $w^T$ is the transpose of the weight vector, $X$ is the input matrix, and $b$ is the bias term.
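
As a small worked example, the sketch below evaluates the hypothesis function on three made-up samples. The arrays `X_demo`, `w_demo`, and `b_demo` are hypothetical values chosen for illustration. Because the samples are stored as rows, the weighted sum is written as `X_demo @ w_demo`, which is the row-wise form of $w^TX$.

```python
import numpy as np


def sigmoid(z):
    # Squash any real value into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical values: 3 samples with 2 features each
X_demo = np.array([[0.0, 0.0],
                   [2.0, 1.0],
                   [-1.0, -3.0]])
w_demo = np.array([1.0, 0.5])  # one weight per feature
b_demo = 0.0                   # bias term

# Hypothesis: weighted sum followed by the sigmoid
z = X_demo @ w_demo + b_demo
H = sigmoid(z)

# Threshold at 0.5 to turn probabilities into class labels
labels = (H >= 0.5).astype(int)
print(z, H.round(3), labels)
```

The weighted sums are 0, 2.5, and -2.5. A weighted sum of exactly 0 maps to a probability of 0.5, so thresholding the probability at 0.5 is the same as checking the sign of $z$.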

### Cost Function

A cost function quantifies the error or discrepancy between predicted and actual values in a machine learning model, guiding optimization towards minimizing this error. Here's the formula:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_{\theta}(x^{(i)})) \right]$$

where $m$ is the number of training examples, $y^{(i)}$ is the true label for the i-th training example, $x^{(i)}$ is the feature vector for the i-th training example, $\theta$ is the vector of parameters (weights), and $h_\theta(x^{(i)})$ is the predicted probability for the i-th example, that is, the output of the hypothesis function.
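
Gradient descent on this cost uses the gradient $\frac{1}{m}X^T(h - y)$, where $h$ is the vector of predicted probabilities. The sketch below is a sanity check on made-up data, not part of the original exercise: it computes the cost, the analytic gradient, and a finite-difference approximation, and confirms that the two gradients agree. The names `bce_cost` and `bce_gradient` and the toy arrays are assumptions for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_cost(theta, X, y):
    # Binary cross-entropy averaged over the m training examples
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


def bce_gradient(theta, X, y):
    # Analytic gradient of the cost: (1/m) * X^T (h - y)
    h = sigmoid(X @ theta)
    return X.T @ (h - y) / len(y)


# Hypothetical toy data; the first column of ones plays the role of the bias
X_demo = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]])
y_demo = np.array([1.0, 0.0, 1.0, 0.0])
theta = np.array([0.1, -0.2])

# Central finite differences as an independent check on the analytic gradient
eps = 1e-6
numeric = np.array([
    (bce_cost(theta + eps * e, X_demo, y_demo) - bce_cost(theta - eps * e, X_demo, y_demo)) / (2 * eps)
    for e in np.eye(len(theta))
])

print(bce_cost(theta, X_demo, y_demo))
print(np.allclose(numeric, bce_gradient(theta, X_demo, y_demo), atol=1e-6))
```

A check like this is a cheap way to catch sign errors in a hand-written cost or gradient before training.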

We'll first implement logistic regression from scratch with NumPy, and then repeat the exercise with Sklearn. Since logistic regression is a binary classifier, we'll use a different dataset to detect diabetes.

In [None]:
# Define the sigmoid function
def sigmoid(z):
    """
    Compute the sigmoid of z.

    Parameters:
    z (numpy.ndarray): Input array.

    Returns:
    numpy.ndarray: Sigmoid of the input.
    """
    result = 1. / (1 + np.exp(-z))
    return result

# Predict function using the sigmoid
def predict(x, weights):
    """
    Predict the probability using logistic regression model.

    Parameters:
    x (numpy.ndarray): Feature matrix with shape (n_samples, n_features).
    weights (numpy.ndarray): Weights array with shape (n_features,).

    Returns:
    numpy.ndarray: Predicted probabilities.
    """
    result = sigmoid(np.dot(x, weights))
    return result

def compute_cost(weights, x, y):
    """
    Compute the binary cross-entropy cost.

    Parameters:
    weights (numpy.ndarray): Weights array with shape (n_features,).
    x (numpy.ndarray): Feature matrix with shape (n_samples, n_features).
    y (numpy.ndarray): True labels with shape (n_samples,).

    Returns:
    float: The binary cross-entropy cost.
    """
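    h = predict(x, weights)
    # Contribution of the positive examples: -y * log(h)
    one_case = np.dot(-y, np.log(h))
    # Contribution of the negative examples: -(1 - y) * log(1 - h)
    zero_case = np.dot(-(1 - y), np.log(1 - h))
    return (one_case + zero_case) / len(y)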

# Gradient descent function to update weights
def gradient_descent(weights, x, y, learning_rate):
    """
    Perform one step of gradient descent on the weights.

    Parameters:
    weights (numpy.ndarray): Current weights with shape (n_features,).
    x (numpy.ndarray): Feature matrix with shape (n_samples, n_features).
    y (numpy.ndarray): True labels with shape (n_samples,).
    learning_rate (float): Learning rate for gradient descent.

    Returns:
    numpy.ndarray: Updated weights.
    """
    error = predict(x, weights) - y
    gradient = np.dot(x.T, error) / len(x)
    weights -= learning_rate * gradient
    return weights

# Minimize function to train the model
def minimize(weights, x, y, iterations, learning_rate):
    """
    Minimize the cost function using gradient descent.

    Parameters:
    weights (numpy.ndarray): Initial weights with shape (n_features,).
    x (numpy.ndarray): Feature matrix with shape (n_samples, n_features).
    y (numpy.ndarray): True labels with shape (n_samples,).
    iterations (int): Number of iterations for gradient descent.
    learning_rate (float): Learning rate for gradient descent.

    Returns:
    tuple: Updated weights and list of costs over iterations.
    """
    costs = []
    for i in range(iterations):
        weights = gradient_descent(weights, x, y, learning_rate)
        cost = compute_cost(weights, x, y)
        costs.append(cost)
        if i % 100 == 0:
            print(f'Epoch {i}, Loss: {cost}')
    return weights, costs

# Evaluate function to calculate accuracy
def evaluate(X, y, weights):
    """
    Evaluate the logistic regression model using accuracy.

    Parameters:
    X (numpy.ndarray): Feature matrix with shape (n_samples, n_features).
    y (numpy.ndarray): True labels with shape (n_samples,).
    weights (numpy.ndarray): Weights array with shape (n_features,).

    Returns:
    float: Accuracy of the model.
    """
    y_pred = predict(X, weights) >= 0.5
    accuracy = np.mean(y_pred == y)
    return accuracy

# Main code
learning_rate = 0.01
epochs = 1000

diabetes = pd.read_csv("https://raw.githubusercontent.com/chenethanxd/data/main/diabetes.csv")
X = diabetes.drop('target', axis=1)
y = diabetes.target

# Convert target to binary
y = (y >= np.median(y)).astype(int)

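# Split the dataset into training and test sets
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=seed)

# Standardize features (fit the scaler on the training set only)
scaler = StandardScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)

# Add the intercept term after scaling: standardizing a constant column
# of ones would turn it into zeros and remove the intercept
train_X = np.hstack((np.ones((train_X.shape[0], 1)), train_X))
test_X = np.hstack((np.ones((test_X.shape[0], 1)), test_X))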

# Initialize weights
np.random.seed(seed)
weights = np.random.randn(train_X.shape[1])

# Train the model
weights, costs = minimize(weights, train_X, train_y, epochs, learning_rate)

# Evaluate the model
train_accuracy = evaluate(train_X, train_y, weights)
test_accuracy = evaluate(test_X, test_y, weights)

print(f'Training Accuracy: {train_accuracy * 100:.2f}%')
print(f'Test Accuracy: {test_accuracy * 100:.2f}%')

Using the sklearn library to perform each step:
- Step 1: Split the dataset into 2 sets: a train set and a test set (70% train, 30% test, random_state=seed)
- Step 2: Standardize `train_X` and `test_X`
- Step 3: Create a Logistic Regression model without regularization (`penalty=None`)
- Step 4: Test how accurate the model is on the test set

In [None]:
X = diabetes.drop('target', axis=1)
y = diabetes.target
y = (y >= np.median(y)).astype(int)

# Step 1
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=seed)
# Step 2
scaler = StandardScaler()
train_X = scaler.fit_transform(train_X)
test_X = scaler.transform(test_X)
# Step 3
log_reg = LogisticRegression(penalty=None)
log_reg.fit(train_X, train_y)
# Step 4
pred_y = log_reg.predict(test_X)
# Equivalent to log_reg.score(test_X, test_y)
accuracy = metrics.accuracy_score(test_y, pred_y)
print(f'The accuracy of the Logistic Regression is {round(accuracy * 100, 2)}%')

## Support Vector Machine (SVM)

In machine learning, support vector machines (SVMs) are models used for supervised learning tasks like classification and regression. SVM algorithms analyze data by creating a model from a set of training examples, each labeled as belonging to one of two categories. This model then categorizes new examples into one of these categories, effectively creating a binary linear classifier that is nonprobabilistic in nature.

In practical applications, SVMs work well when the number of features is large relative to the number of samples. Training cost, however, grows quickly with the number of training examples, so fitting a kernel SVM on millions of examples can be computationally intensive.

For instance, imagine predicting whether someone enjoyed a film based on how much popcorn they ate in the cinema. A linear classifier such as logistic regression can only draw a single straight boundary, which leads to oversimplified conclusions, such as assuming that everyone who eats a lot of popcorn enjoyed the film and everyone else did not. SVMs can address this by transforming the data, often by adding a new dimension (axis), so that a clear boundary (hyperplane) separating the classes can be found in the transformed space.

### Kernel

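A kernel is a function that measures the similarity between two data points as an inner product in a transformed, usually higher-dimensional, feature space. Learning the hyperplane in an SVM only requires these inner products, so the kernel lets the model work in the transformed space without ever computing the transformation explicitly (the kernel trick). Common choices are the linear, polynomial, and radial basis function (RBF) kernels.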
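
As an illustration, the RBF kernel between two points is $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$. The sketch below computes it by hand for three made-up points and compares the result with scikit-learn's `rbf_kernel`. The array `A` and the value of `gamma` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Hypothetical points in 2D
A = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
gamma = 0.5

# RBF kernel by hand: K(x, y) = exp(-gamma * ||x - y||^2)
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
K_manual = np.exp(-gamma * sq_dists)

# scikit-learn's implementation of the same formula
K_sklearn = rbf_kernel(A, A, gamma=gamma)

print(np.allclose(K_manual, K_sklearn))
```

Each entry of the kernel matrix lies between 0 and 1, and the diagonal is 1 because every point has zero distance to itself.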

### Regularization

The regularization parameter (C) in SVM controls the trade-off between a low training error and a wide margin, which tends to give a lower testing error. For large values of C, the optimization will prioritize a smaller-margin hyperplane that classifies all training points correctly. Conversely, a small value of C will result in a larger-margin hyperplane that may misclassify more points but aims to generalize better by focusing less on the individual points.

### Gamma

The gamma parameter defines how far the influence of a single training example reaches. Low $\gamma$ values mean that the influence reaches far, considering points far away from the separation line in the calculation. High $\gamma$ values mean the influence is close, considering only points near the separation line in the calculation.
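
To get a feel for how C and gamma interact, the sketch below fits an RBF-kernel `SVC` for a few combinations and reports the training accuracy and the number of support vectors. It is an illustration under assumptions: the data comes from `make_moons` rather than from this notebook, and the parameter grid is arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Hypothetical non-linearly separable data
X_demo, y_demo = make_moons(n_samples=200, noise=0.25, random_state=0)

results = {}
for C in (0.1, 10.0):
    for gamma in (0.1, 10.0):
        model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_demo, y_demo)
        # Training accuracy and total number of support vectors for this setting
        results[(C, gamma)] = (model.score(X_demo, y_demo), int(model.n_support_.sum()))
        print(f"C={C:<4} gamma={gamma:<4} train_acc={results[(C, gamma)][0]:.3f} "
              f"support_vectors={results[(C, gamma)][1]}")
```

Training accuracy alone does not reveal overfitting, so in practice these combinations would be compared on held-out data or with cross-validation.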

### Margin

The margin is the distance between the separating line (hyperplane) and the closest points of each class.

A good margin is one where this distance is large for both classes, so that points stay within their respective classes without crossing into the other class. A bad margin is one where the separating line is close to the points or biased towards one of the classes.

### Types of SVM

* Linear SVM: This is used when the data is perfectly linearly separable, meaning the data points can be classified into two classes using a single straight line.
* Non-linear SVM: This is used when the data is not linearly separable. In such cases, advanced techniques like kernel tricks are used to classify the data points into two classes that cannot be separated by a straight line.

Here's general pseudocode for training a linear SVM with (sub-)gradient descent on the hinge loss:
```
1. Input: Training data (X, y), where X is the set of features and y is the set of labels, encoded as -1 or +1.
2. Initialize the weights (w) and bias (b) to small random values.
3. Set the learning rate (α) and the regularization parameter (λ).
4. Repeat until convergence or for a fixed number of iterations:
    a. For each training example (x_i, y_i):
        1. Compute the prediction: y_pred = w · x_i + b
        2. If y_i * y_pred < 1 (misclassification or within margin):
            - Update the weights: w = w - α * (λ * w - y_i * x_i)
            - Update the bias: b = b - α * (-y_i)
        3. Else (correct classification outside the margin):
            - Update the weights: w = w - α * (λ * w)
5. The resulting w and b define the hyperplane.
6. To make predictions on new data points x:
    a. Compute the prediction: y_pred = w · x + b
    b. Assign the class based on the sign of y_pred
```
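
The pseudocode above can be turned into a short NumPy sketch. This is an illustrative version under stated assumptions, not how scikit-learn's `SVC` is implemented: the function names, the default hyperparameters, and the toy data are made up for the example, the bias starts at zero for simplicity, and the labels must be encoded as -1 and +1.

```python
import numpy as np


def train_linear_svm(X, y, learning_rate=0.01, lam=0.01, epochs=50, seed=0):
    """Sub-gradient descent on the regularized hinge loss; y must contain -1 and +1."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small random initial weights
    b = 0.0                                      # bias starts at zero for simplicity
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, x_i) + b) < 1:
                # Misclassified or inside the margin: hinge-loss and regularization terms
                w -= learning_rate * (lam * w - y_i * x_i)
                b -= learning_rate * (-y_i)
            else:
                # Correct with enough margin: only the regularization term
                w -= learning_rate * (lam * w)
    return w, b


def predict_linear_svm(X, w, b):
    # The class is the sign of the decision function
    return np.where(X @ w + b >= 0, 1, -1)


# Hypothetical, well-separated toy data with labels in {-1, +1}
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-3, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
y_demo = np.array([-1] * 50 + [1] * 50)

w_demo, b_demo = train_linear_svm(X_demo, y_demo)
demo_accuracy = np.mean(predict_linear_svm(X_demo, w_demo, b_demo) == y_demo)
print(w_demo, b_demo, demo_accuracy)
```

On data this cleanly separated, the learned hyperplane should classify nearly every training point correctly. Real datasets usually need feature scaling and tuning of the learning rate and λ.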

We'll be using SVC to perform Support Vector Machine Classification. We'll use a different dataset to detect cancer.

Using the sklearn library to perform each step:
- Step 1: Split the dataset into 2 sets: a train set and a test set (70% train, 30% test, random_state=seed)
- Step 2: Create an SVM model with kernel "rbf", gamma 0.5, and C 1.0. Assign it to the variable `svm`
- Step 3: Test how accurate the model is with test set

In [None]:
cancer = pd.read_csv("https://raw.githubusercontent.com/chenethanxd/data/main/cancer.csv")
X = cancer.iloc[:, :2]
y = cancer.target

# Step 1
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=seed)
# Step 2
svm = SVC(kernel="rbf", gamma=0.5, C=1.0)
svm.fit(train_X, train_y)
# Step 3
accuracy = svm.score(test_X, test_y)
# Alternatively, call svm.predict(test_X) and pass the result to metrics.accuracy_score
print("Accuracy:", accuracy)

In [None]:
DecisionBoundaryDisplay.from_estimator(
    svm,
    X,
    response_method="predict",
    cmap=plt.cm.Spectral,
    alpha=0.8,
    )

# Scatter plot
plt.scatter(
    X.iloc[:, 0],
    X.iloc[:, 1],
    c=y,s=20,
    edgecolors="k"
    )
plt.show()

## Decision Tree

A decision tree is a non-parametric supervised learning algorithm with a hierarchical, tree structure consisting of a root node, branches, internal nodes, and leaf nodes.

The logic behind a decision tree mimics human decision-making: if A happens, then B is the result; if A doesn't happen, then something different from B is the result. Essentially, it's an if-else tree structure.

In a decision tree, the final result comes from the leaf nodes, while the best variable to start predicting with is the root node.

### Entropy

Entropy is a metric used to measure the impurity or randomness in a given attribute. It quantifies the unpredictability in the data. Higher entropy indicates more disorder and more information content, meaning the data is more mixed or less uniform. Lower entropy suggests uniformity or consistency in the information content across all data points. The formula is:

$$H(X)=-\sum^n_{i=1}p_i \log_2 p_i$$

where $X$ is the set of labels, $n$ is the number of distinct classes, and $p_i$ is the proportion of samples belonging to class $i$. If the result is 0, all rows have the same label. With two classes, the maximum is 1, reached when the labels are split equally. For instance, if the label column contains 10 dogs, the result will be 0. On the other hand, 5 dogs and 5 cats will result in 1.
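
The dog and cat example can be checked with a few lines of NumPy. The helper name `entropy` is an assumption made for this sketch.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # proportion of each class
    # abs() only turns the -0.0 produced by a pure set into 0.0
    return float(abs(-(p * np.log2(p)).sum()))


print(entropy(["dog"] * 10))               # one class only
print(entropy(["dog"] * 5 + ["cat"] * 5))  # two classes, evenly split
```

The first call returns 0 because there is no uncertainty, and the second returns 1 because the two classes are equally likely.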

### Gini Index

The Gini index is a metric used to measure the impurity or purity of a dataset while creating a decision tree. It quantifies the likelihood of a random sample being incorrectly classified if it were randomly labeled according to the distribution of labels in the dataset. A Gini index of 0 indicates perfect purity, where all elements belong to a single class, while higher values indicate greater impurity. Here's the formula:

$$Gini=1-\sum^n_{i=1}p_i^2$$

where $p_i$ is the proportion of samples belonging to class $i$ and $n$ is the number of classes.
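
A minimal sketch of the same formula in NumPy, using the hypothetical helper name `gini_index`:

```python
import numpy as np


def gini_index(labels):
    """Gini index of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # proportion of each class
    return float(1.0 - np.sum(p ** 2))


print(gini_index(["dog"] * 10))               # pure: one class only
print(gini_index(["dog"] * 5 + ["cat"] * 5))  # two classes, evenly split
```

A pure set gives 0, and an even split between two classes gives 0.5, which is the maximum for two classes. In general the maximum for $n$ classes is $1 - 1/n$.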

### Decision Tree Components

* Root Node: The top node that represents the entire dataset and is the starting point for the tree's splits.
* Decision Nodes: Intermediate nodes where the data gets split based on specific features.
* Leaf Nodes: Terminal nodes that represent the final class labels or regression values.
* Branches: The paths that connect decision nodes to leaf nodes or other decision nodes, representing the flow of the dataset through the tree.

The tree is built with the following steps:

1. Select the best attribute to split the records. Initially, every attribute is a candidate for the root node.
2. Calculate the entropy of the label column, $H(label)$.
3. For each distinct value of an attribute column, calculate the entropy of the labels in the rows with that value. For instance, if the attribute has 2 values, with 12 rain rows and 15 sunny rows, we calculate $H(rain)$ from the labels of the 12 rain rows, and likewise $H(sunny)$ from the 15 sunny rows.
4. Calculate the weighted entropy of that attribute column. The general formula is: $\text{Weighted H(attr)}=\sum^{|values|}_{i=1}\dfrac{\text{Num. of rows with value}_i}{\text{Total rows}} \cdot H(value_i)$. This weighted value is the conditional entropy $H(label|attr)$.
5. Calculate the information gain, $InfoGain = H(label) - H(label|attr)$, of each attribute and choose the attribute with the highest gain as the root.
6. Make that attribute a decision node and break the dataset into smaller subsets, one per attribute value.
7. Build the tree by repeating this process recursively for each child until one of the following conditions is met:
    * All the tuples belong to the same class.
    * There are no more remaining attributes.
    * There are no more instances.
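
Steps 2 to 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the toy weather data and the helper names `entropy` and `information_gain` are made up for this example and do not come from the Iris notebook.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(values, labels):
    """H(label) minus the weighted entropy of the labels within each attribute value."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical toy dataset: two attribute columns and one label column
outlook = ["rain", "rain", "sunny", "sunny"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

gains = {
    "outlook": information_gain(outlook, play),
    "windy": information_gain(windy, play),
}
best = max(gains, key=gains.get)  # attribute with the highest gain becomes the root
print(gains, best)
```

Here `outlook` separates the labels perfectly, so its gain equals $H(label)$, while `windy` leaves each subset evenly mixed and gains nothing.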

An example decision tree after choosing the root node and recursively building the decision nodes and branches.

We'll use scikit-learn to build a decision tree that classifies the samples in the Iris dataset based on their features.

In [None]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = pd.read_csv("https://raw.githubusercontent.com/chenethanxd/data/main/iris.csv")
X = iris.iloc[:, :-1]  # every column except the last one is a feature
y = iris.target        # label column
"""
Using sklearn library to perform each step
Step 1: Split the dataset into 2 sets: a train set and a test set (70% train, 30% test)
Step 2: Create a Decision Tree with Gini Index as criteria to classify
Step 3: Test how accurate the model is with test set
"""
# Step 1 (seed is assumed to be defined earlier in the notebook)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=seed)
# Step 2
gini = DecisionTreeClassifier(criterion='gini')
gini.fit(train_X, train_y)
# Step 3
pred_y = gini.predict(test_X)

# accuracy_score expects (y_true, y_pred); gini.score(test_X, test_y) returns the same value
accuracy = metrics.accuracy_score(test_y, pred_y)
print('The accuracy of the Decision Tree using Gini Index is', accuracy)

In [None]:
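import matplotlib.pyplot as plt
from sklearn import tree

plt.figure(figsize=(20,5))
# Plot the tree that was already fitted and evaluated; refitting here could produce a different tree
tree.plot_tree(gini,
               feature_names=list(iris.columns[:-1]),
               class_names=["setosa", "versicolor", "virginica"],
               rounded=True)
plt.show()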

## Evaluation metrics

### Confusion matrix

A confusion matrix is a table that summarizes the classification results of a binary classifier. It provides a detailed analysis of a model's performance by showing the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The definitions of these four components are as follows:

* True Positive (TP): The number of positive instances correctly classified as positive by the model.
* False Positive (FP): The number of negative instances incorrectly classified as positive by the model.
* True Negative (TN): The number of negative instances correctly classified as negative by the model.
* False Negative (FN): The number of positive instances incorrectly classified as negative by the model.
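
To make the four definitions concrete, here is a small from-scratch sketch that counts them for binary labels. The function name `confusion_counts` and the example labels are hypothetical, made up for illustration; scikit-learn offers `sklearn.metrics.confusion_matrix` for real use.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for binary labels, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, fp, tn, fn

# Hypothetical labels, made up for this example
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(actual_labels, predicted_labels)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")  # TP=3 FP=1 TN=3 FN=1
```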

### Accuracy

Accuracy is the most commonly used performance metric for evaluating a binary classification model. It measures the proportion of correct predictions made by the model out of all the predictions. A high accuracy score indicates that the model is making a large proportion of correct predictions whereas a low score indicates the model is making too many incorrect predictions. Accuracy is calculated using the following formula:

$$\text{Accuracy}=\dfrac{TP+TN}{TP+TN+FP+FN}$$

### Precision

Precision is a metric that measures the accuracy of positive predictions made by the model. A high precision indicates that the model is making a low number of false positive predictions, whereas a low score indicates that the model is making too many. Precision is calculated using the following formula:

$$\text{Precision}=\dfrac{TP}{TP+FP}$$

### Recall

Recall, also known as sensitivity or true positive rate (TPR), measures the proportion of true positives among all the actual positive instances. In other words, recall measures the model's ability to correctly identify positive instances. A high recall score indicates that the model identifies a large proportion of positive instances, while a low recall score indicates that the model is missing many positive instances. Recall is calculated using the following formula:

$$\text{Recall}=\dfrac{TP}{TP+FN}$$

### F1-score

The F1-score is a metric that combines precision and recall into a single measure to evaluate the overall performance of a binary classification model. It calculates the harmonic mean of precision and recall, providing a balanced assessment of both metrics. A high score indicates that the model has both high precision and high recall, reflecting strong performance in correctly identifying positive instances and minimizing false positives. A low F1-score indicates that the model is lacking in precision (making too many false positives), in recall (missing many positive instances), or in both. F1-score is particularly useful when you want to balance the trade-off between precision and recall, especially in scenarios where both metrics are equally important. F1-score is calculated using the following formula:

$$\text{F1-score}=\dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
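
The four formulas above can be tied together in one short sketch. The function name `classification_metrics` and the counts passed to it are hypothetical, chosen only for illustration, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, made up for this example
results = classification_metrics(tp=40, fp=10, tn=30, fn=20)
for name, value in results.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.70
# precision: 0.80
# recall: 0.67
# f1: 0.73
```

Note how the F1-score (0.73) sits between precision and recall but closer to the lower of the two, which is the effect of the harmonic mean.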

### AUC-ROC curve

One of the most widely used performance metrics is the area under the ROC curve (AUC-ROC). AUC stands for Area Under the Curve, and ROC stands for Receiver Operating Characteristic. The ROC curve is a graphical representation of the performance of a binary classifier: it plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds, showing the tradeoff between the two. The AUC is the area under this curve, which ranges from 0 to 1, with a higher AUC indicating better model performance. The TPR and FPR are defined as follows:

$$TPR = \dfrac{TP}{TP + FN}$$

$$FPR = \dfrac{FP}{FP+TN}$$

Each threshold yields one (FPR, TPR) point, and connecting these points gives the ROC curve. A perfect classifier has a ROC curve that passes through the top-left corner (TPR = 1 and FPR = 0), while a random classifier has a ROC curve that lies along the diagonal (TPR = FPR), which corresponds to an AUC of 0.5.
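
To show where the points on the curve come from, here is a from-scratch sketch that sweeps the threshold over every distinct score and integrates the resulting points with the trapezoidal rule. The helper names `roc_point` and `roc_auc_sketch` and the four-sample dataset are assumptions made for illustration; it also assumes both classes are present in the labels.

```python
def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) when every score >= threshold is predicted positive."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    positives = sum(1 for t in y_true if t == 1)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

def roc_auc_sketch(y_true, y_score):
    """Return the ROC points and the trapezoidal area under them."""
    # Start above every score (nothing predicted positive), then lower the threshold
    thresholds = [float("inf")] + sorted(set(y_score), reverse=True)
    points = [roc_point(y_true, y_score, th) for th in thresholds]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return points, area

# Hypothetical labels and scores, made up for this example
demo_labels = [0, 0, 1, 1]
demo_scores = [0.1, 0.4, 0.35, 0.8]

points, area = roc_auc_sketch(demo_labels, demo_scores)
print(points)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(area)    # 0.75
```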

Usually we don't need to implement these metrics from scratch. Instead, we'll use existing packages, such as scikit-learn's `sklearn.metrics` module, to do the work.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1]

# Please calculate the accuracy, precision, recall and f1-score

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(accuracy, precision, recall, f1)

In [None]:
assert round(accuracy, 2) == 0.75
assert round(precision, 2) == 0.7
assert round(recall, 2) == 0.78
assert round(f1, 2) == 0.74

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# True labels
y_true = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Predicted probabilities (for ROC curve, we need probabilities rather than binary predictions)
# For the purpose of this example, we will create some example probabilities
# In a real scenario, these probabilities would be obtained from the model's predict_proba method
y_pred_prob = [0.1, 0.9, 0.2, 0.3, 0.8, 0.6, 0.4, 0.3, 0.7, 0.8, 0.2, 0.5, 0.9, 0.4, 0.3, 0.2, 0.8, 0.1, 0.7, 0.6]

# Compute ROC curve and ROC area
fpr, tpr, _ = roc_curve(y_true, y_pred_prob)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()
