## **1. Classification**
---
### Types of Supervised Learning
- With regression, the output is a continuous numerical value
- With classification, the output is a category (often derived from a predicted probability between 0 and 1)

### What is needed for classification
- Model data with features that can be quantified
- Labels that are known
- A method to measure how closely the predicted class matches the desired (true) label

### Examples of models used for classification
- Logistic Regression: extension of linear regression
- K-Nearest neighbours: simple non-linear approach that categorises a point according to the past examples nearest to it in feature space
- Support vector machines: linear classifier that can use the "kernel trick" to learn non-linear boundaries
- Neural Networks: combines non-linear and linear intermediate steps to come up with a complex final decision boundary
- Decision trees, random forests, gradient boosting and other ensemble models: tree-based classifiers, where ensembles such as random forests and gradient boosting combine many simpler classifiers into a stronger one

## **2. Mathematics Behind Logistic Regression**
---
### Introduction
- We could predict two possible outcomes with a linear regression model by fitting a straight line to the data (<0.5 predicts one outcome, >0.5 predicts the other)
- However, a straight line is sensitive to the spread of the data: points far from the boundary (i.e. a larger spread of points) can skew the line and shift where it crosses the 0.5 threshold
- So instead of fitting a linear function, we fit a **logistic function**

### The Sigmoid Function
- Now instead of fitting a simple linear function, we fit the sigmoid of our linear function, defined by:
$$\frac{1}{1+e^{-x}}$$
- Here, the input to the sigmoid is our linear regression function, so:
$$y_{\beta}(x) = \frac{1}{1+e^{-(\beta_0 + \beta_{1}x + \epsilon)}} = P(x)$$
- This resulting function describes logistic regression, and is used as a classification algorithm

- A key property of the sigmoid is that it maps every input into the 0 to 1 range, and its "middle" point is at p=0.5 (where the linear function equals zero)
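
The sigmoid's squashing behaviour can be checked numerically. The sketch below is a minimal pure-Python illustration, not a fitted model: the coefficients `beta_0` and `beta_1` are hypothetical values chosen only for the example, and the noise term is omitted.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x: float, beta_0: float, beta_1: float) -> float:
    """P(x) for a one-feature logistic model (noise term omitted)."""
    return sigmoid(beta_0 + beta_1 * x)


# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -3.0, 1.5

# The linear term is zero at x = -beta_0 / beta_1, so P(x) is 0.5 there
midpoint = -beta_0 / beta_1

probabilities = [predict_proba(x, beta_0, beta_1) for x in (0.0, midpoint, 4.0)]
```

With a positive `beta_1`, inputs below `midpoint` give probabilities under 0.5 and inputs above it give probabilities over 0.5.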

### Probability Link
- This function can also be seen as the "probability of being in one class over the other", with P(x) = sigmoid above
    - From the above, the "odds" (the probability of one class relative to the other) are:
$$\frac{P(x)}{1-P(x)} = e^{(\beta_0 + \beta_{1}x + \epsilon)}$$
$$\log\left[\frac{P(x)}{1-P(x)}\right] = \beta_0 + \beta_{1}x + \epsilon$$

- Looking at the log-odds, we can see that they are linear in $x$: a unit increase in $x$ changes the log-odds by $\beta_1$ (and $\beta_0$ shifts them by a constant)
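
The linearity of the log-odds can be verified directly. This is a small sketch with hypothetical coefficients (`beta_0`, `beta_1` are made up for the example, and the noise term is left out): moving `x` up by one unit should shift the log-odds by `beta_1` and multiply the odds by `exp(beta_1)`.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def odds(p: float) -> float:
    """Probability of the class relative to the probability of the other class."""
    return p / (1.0 - p)


# Hypothetical coefficients, chosen only for illustration
beta_0, beta_1 = -1.0, 0.8

p_at_2 = sigmoid(beta_0 + beta_1 * 2.0)
p_at_3 = sigmoid(beta_0 + beta_1 * 3.0)  # x increased by one unit

# Difference in log-odds between x=3 and x=2 (expected to equal beta_1)
log_odds_change = math.log(odds(p_at_3)) - math.log(odds(p_at_2))

# Ratio of the odds between x=3 and x=2 (expected to equal exp(beta_1))
odds_multiplier = odds(p_at_3) / odds(p_at_2)
```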



## **3. Logistic Regression with Multi-Classes**
---
### One vs All
- To deal with multiple classes using a function (like the sigmoid) that predicts binary outcomes, we use one vs all
- For each class, we run through the data and fit a binary logistic regression, treating all other classes as "the other option"
- We do the same for every class, and end up with one decision boundary per class (e.g. 3 decision boundaries for 3 classes)
- The final decision boundaries are given by the regions where each class has the highest probability of occurring out of all the classes
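
The final step, picking the class whose binary model gives the highest probability, can be sketched as follows. The three weight tuples are hypothetical stand-ins for already-fitted "this class vs the rest" models on two features; they are not learned from data.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical (beta_0, beta_1, beta_2) weights, one binary model per class
one_vs_all_models = {
    "A": (0.0, 2.0, 0.0),
    "B": (0.0, -1.0, 2.0),
    "C": (0.0, -1.0, -2.0),
}


def class_probabilities(point):
    """Probability of 'this class vs the rest' from each binary model."""
    x1, x2 = point
    return {
        label: sigmoid(b0 + b1 * x1 + b2 * x2)
        for label, (b0, b1, b2) in one_vs_all_models.items()
    }


def predict(point):
    """Assign the class whose binary model reports the highest probability."""
    probabilities = class_probabilities(point)
    return max(probabilities, key=probabilities.get)


predictions = [predict(p) for p in [(2.0, 0.0), (-1.0, 2.0), (-1.0, -2.0)]]
```

Note that the per-class probabilities come from separate binary models, so they need not sum to 1.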

## **4. Implementing Logistic Regression**
---

In [None]:
# Import class containing the classification method
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# Create an instance of the class
lr = LogisticRegression(penalty='l2', C=10.0)

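# Fit the instance on the training data, then predict on the test data
# (X_train, y_train and X_test are assumed to come from an earlier train/test split)
lr = lr.fit(X_train, y_train)       # type: ignore
y_predicted = lr.predict(X_test)    # type: ignore

# View the fitted coefficients
lr.coef_

# Tune the regularisation strength C with cross-validation
# (Cs=10 and cv=5 are illustrative choices: 10 candidate values of C, 5 folds)
lr_cv = LogisticRegressionCV(Cs=10, cv=5).fit(X_train, y_train)  # type: ignore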

- In addition to prediction, we may also want to evaluate the importance of each factor in influencing outcomes (interpretability)

## **5. Classification Error Metrics**
---
- If we have 1% of patients with leukemia and 99% healthy, a simple model could be built that is 99% accurate just by predicting that no one ever has leukemia
- So, accuracy is not always the best measurement of performance, and we need others!
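
The leukemia example can be reproduced with a few lines of plain Python. The labels below are synthetic (10 sick patients out of 1000, with 1 meaning leukemia) and the "model" simply predicts healthy for everyone.

```python
# Synthetic labels: 1% of patients have leukemia (1), 99% are healthy (0)
y_true = [1] * 10 + [0] * 990

# A trivial "model" that never predicts leukemia
y_pred = [0] * len(y_true)

# Fraction of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of the actual leukemia patients that the model detects
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fraction_of_sick_detected = detected / sum(y_true)
```

The accuracy comes out at 99% even though not a single sick patient is detected.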
### Confusion Matrix
|                 | Predicted Positive | Predicted Negative |
|-----------------|--------------------|--------------------|
| **Actual Positive** | True Positive (TP)  | False Negative (FN) ***(Type II error)***|
| **Actual Negative** | False Positive (FP) ***(Type I error)***| True Negative (TN)  |
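
As a rough sketch of how the four cells are counted, the helper below (a hypothetical function, not part of any library) tallies them from binary labels where 1 is the positive class. The label lists are made up for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, with 1 as the positive class."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # correctly flagged positive
        elif actual == 1:
            fn += 1  # missed positive (Type II error)
        elif predicted == 1:
            fp += 1  # false alarm (Type I error)
        else:
            tn += 1  # correctly left as negative
    return tp, fn, fp, tn


# Made-up labels for illustration
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
```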

### Accuracy
- Accuracy is the most common error measure, and is calculated as the sum of both kinds of correct prediction (TP and TN) over the total number of samples:
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$

### Recall (sensitivity)
- Our ability to identify all the actual positive instances, otherwise known as the capture rate of true instances
- Note we can easily achieve 100% recall, just by predicting everything to be positive
$$\text{Recall} = \frac{TP}{TP+FN}$$

### Precision
- Precision asks: out of all our positive predictions, how many did we get right?
$$\text{Precision} = \frac{TP}{TP+FP}$$
- Note the trade-off between precision and recall: predicting positive more readily usually raises recall but lowers precision, and vice versa

### Specificity
- Avoiding false alarms, i.e. how correctly the actual negative class is predicted (recall for class zero)

$$\text{Specificity} = \frac{TN}{FP+TN}$$

### F1 Score
- The harmonic mean of precision and recall
- Tries to optimise the trade-off between recall and precision
- This score is pulled down heavily if either precision or recall is too low
$$F1 = 2\frac{\text{Precision} * \text{Recall}}{\text{Precision}+\text{Recall}}$$
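
The formulas in this section can be combined into one small helper. This is a sketch with a hypothetical function name and made-up counts; it assumes none of the denominators is zero. The example uses the "predict everything positive" case to show how F1 falls well below the simple average of precision and recall.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the metrics above from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


# Made-up counts: 10 positives, 90 negatives, and every sample predicted positive
metrics = classification_metrics(tp=10, fn=0, fp=90, tn=0)

# Simple average of precision and recall, for comparison with F1
arithmetic_mean = (metrics["precision"] + metrics["recall"]) / 2
```

Here recall is perfect, yet precision, specificity and F1 are all poor, which accuracy or recall alone would not reveal.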

### Receiver Operating Characteristic Curve (ROC)
- Plots the True Positive Rate (Recall) against the False Positive Rate (1-Specificity)
- If we have all of our negatives correctly identified, we have a False Positive Rate of zero
- If we have all of our positives correctly identified, then we have a True Positive Rate of 1
- A point (FPR, TPR) is plotted for many different thresholds between 0 and 1 (the threshold is the line above/below which we assign a value to a certain class)
- A diagonal line from (0, 0) to (1, 1) represents what the outcome would be for random guessing; if our model curve is above this then we are doing better (the ideal point is the top left corner, with a TPR of 1 and an FPR of 0)
- If our model is below this diagonal line then we are doing worse than random
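
The threshold sweep behind the curve can be sketched in plain Python. The function name, labels, scores and thresholds below are all made up for illustration; a sample is predicted positive when its score is at or above the threshold, and both classes are assumed to be present.

```python
def roc_points(y_true, scores, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in thresholds:
        predicted = [1 if score >= threshold else 0 for score in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points


# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# From "predict nothing positive" down to "predict everything positive"
thresholds = [1.1, 0.8, 0.5, 0.2, 0.0]

points = roc_points(y_true, scores, thresholds)
```

The strictest threshold gives the point (0, 0) and the loosest gives (1, 1); the points in between trace the curve.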

### ROC-AUC (area under the curve)
- Gives a measure of how well we are separating the two classes (just the area under the ROC curve)
- For a perfect model this area is 1: the curve goes straight up from (0, 0) to a TPR of 1 at an FPR of 0, then straight across to the right
- If we have a ROC-AUC of 0.5, this is essentially as good as random
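
These two reference values can be checked with a trapezoidal-rule sketch. The helper name is hypothetical, and the two curves are the idealised shapes described above rather than output from a real model.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points."""
    ordered = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Perfect model: straight up to TPR = 1 at FPR = 0, then straight to the right
perfect_curve = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Random guessing: the diagonal from (0, 0) to (1, 1)
random_curve = [(0.0, 0.0), (1.0, 1.0)]

perfect_auc = area_under_curve(perfect_curve)
random_auc = area_under_curve(random_curve)
```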

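### Precision-Recall Curve
- Plots precision against recall, measuring the trade-off between the two
- The area under the curve summarises performance across thresholds; its baseline depends on how unbalanced the dataset is, since a random classifier's precision equals the fraction of positive samples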
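
A minimal sketch of how the curve's points are produced, using made-up labels and scores with 20% positives. The function name is hypothetical, and thresholds at which nothing is predicted positive are skipped because precision is undefined there.

```python
def precision_recall_points(y_true, scores, thresholds):
    """Return one (recall, precision) pair per usable threshold."""
    positives = sum(y_true)
    points = []
    for threshold in thresholds:
        predicted = [1 if score >= threshold else 0 for score in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
        if tp + fp == 0:
            continue  # precision is undefined with no positive predictions
        points.append((tp / positives, tp / (tp + fp)))
    return points


# Made-up imbalanced data: 2 positives out of 10 samples
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
scores = [0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.9]
thresholds = [0.8, 0.55, 0.0]

points = precision_recall_points(y_true, scores, thresholds)

# Fraction of positive samples in the data
positive_fraction = sum(y_true) / len(y_true)
```

At the loosest threshold everything is predicted positive, so recall is 1 and precision drops to `positive_fraction`.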


### Choosing the right metric
1. ROC Curve - better for data with balanced classes
2. Precision-Recall Curve - better for data with imbalanced classes

The right curve depends on tying results to outcomes (ie. relative cost of False Positive vs False Negative)

### Multi-class error metrics
$$\text{Accuracy} = \frac{TP_1 + TP_2 + \dots}{\text{Total}}$$

- The remaining metrics and curves above can still be used in a one vs all way, treating each class in turn as the positive class and all others as negative
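
A short sketch of both ideas, using made-up labels and hypothetical helper names: multi-class accuracy counts every correct prediction, while a one vs all metric (recall here) treats a single class as positive and everything else as negative.

```python
def multiclass_accuracy(y_true, y_pred):
    """Correct predictions across all classes, over the total number of samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def one_vs_all_recall(y_true, y_pred, positive_class):
    """Recall with `positive_class` as positive and all other classes as negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive_class and p == positive_class)
    fn = sum(1 for t, p in zip(y_true, y_pred)
             if t == positive_class and p != positive_class)
    return tp / (tp + fn)


# Made-up labels for three classes
y_true = ["cat", "cat", "dog", "dog", "bird", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "bird", "cat"]

accuracy = multiclass_accuracy(y_true, y_pred)
recall_per_class = {
    label: one_vs_all_recall(y_true, y_pred, label)
    for label in ("cat", "dog", "bird")
}
```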

In [None]:
# Syntax
from sklearn.metrics import accuracy_score

# Calculate accuracy
accuracy_score(y_test, y_predicted) # type: ignore

# Import other metrics and diagnostic tools
from sklearn.metrics import (
    precision_score, recall_score, f1_score, roc_auc_score,
    roc_curve, precision_recall_curve,
    confusion_matrix,
    )