# Project - Application to Breast Cancer Diagnosis

## Conceptual Development of Binary Classification Modeling

### Basic Concept of Binary Classification

Binary classification is the task of classifying the elements of a set into two groups on the basis of a classification rule. Given a population whose members each belong to one of two classes, each element of the population is predicted to belong to one or the other of the classes.

The most popular algorithms used in binary classification are:

1. Logistic Regression
2. k-Nearest Neighbors
3. Decision Trees
4. Support Vector Machine
5. Naive Bayes

In this project, the k-Nearest Neighbors (KNN) algorithm will be used. Note that the Logistic Regression and Support Vector Machine algorithms do not natively support more than two classes.
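As a rough illustration of how k-Nearest Neighbors assigns a class, the sketch below uses only the Python standard library. The function name `knn_predict`, the toy data points, the Euclidean distance, and `k=3` are all assumptions made for this example; they are not the project's actual implementation or data.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature samples: 1 = positive class, 0 = negative class
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # 0
print(knn_predict(train_points, train_labels, (4.0, 4.0)))  # 1
```

An odd `k` is a common choice in binary problems because it avoids tied votes.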

### Confusion Matrix

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one, by summarizing the prediction results on a classification problem (<a href="https://www.sciencedirect.com/topics/engineering/confusion-matrix">Kulkarni et al. [2020]</a>).

* `Each row` represents the instances in an actual class - `Actual` 
* `Each column` represents the instances in a predicted class - `Predicted`

A confusion matrix is an `N*N` matrix used for evaluating the performance of a classification model, where `N` is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. For binary classification, `N = 2`.

### True Positive vs. False Negative

In sample data drawn from a population, a True Positive (`TP`) is an outcome where the model `correctly` predicts the positive class, while a False Negative (`FN`) is an outcome where the model `incorrectly` predicts the negative class, that is, the actual class is positive.


<table>
<thead>
<tr>
<td></td>
<td colspan="2">Predicted Values</td>
</tr>
</thead>
<tbody>
<tr>
<td>Actual Values</td>
<td>Positive</td>
<td>Negative</td>
</tr>
<tr>
<td>Positive</td>
<td>Number of True Positives (TP)</td>
<td>Number of False Negatives (FN)</td>
</tr>
<tr>
<td>Negative</td>
<td>Number of False Positives (FP)</td>
<td>Number of True Negatives (TN)</td>
</tr>
</tbody>
</table>

### Terminologies

* `True Positives (TP)`: when the actual value is Positive and the prediction is also Positive.
* `True Negatives (TN)`: when the actual value is Negative and the prediction is also Negative.
* `False Positives (FP)`: also called `Type 1 error`, when the actual value is Negative but the prediction is Positive.
* `False Negatives (FN)`: also called `Type 2 error`, when the actual value is Positive but the prediction is Negative.
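The four terms above can be counted directly from paired actual and predicted labels. The following is a minimal sketch; the function name `confusion_counts` and the label lists are hypothetical and only serve to illustrate the definitions.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FN, FP, TN) for two equally long lists of binary labels."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # actual Positive, predicted Positive
        elif a == positive:
            fn += 1          # actual Positive, predicted Negative (Type 2 error)
        elif p == positive:
            fp += 1          # actual Negative, predicted Positive (Type 1 error)
        else:
            tn += 1          # actual Negative, predicted Negative
    return tp, fn, fp, tn

# Hypothetical labels: 1 = positive, 0 = negative
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(actual, predicted)
print(f"TP={tp} FN={fn}")  # TP=3 FN=1
print(f"FP={fp} TN={tn}")  # FP=1 TN=5
```

The two printed rows follow the same layout as the table above: actual classes in rows, predicted classes in columns.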

### Accuracy

Accuracy measures how often the classifier makes the correct prediction. It is the ratio between the number of correct predictions and the total number of predictions (<a href="https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5">Suresh [2020]</a>). In simple words, it tells us what fraction of all predictions, positive and negative, are correct.

Mathematically,
$$
Accuracy = \frac {TP + TN}{TP + TN + FP + FN}
$$
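As a small worked example of the accuracy formula, the sketch below evaluates it for two sets of hypothetical counts. The function name `accuracy` and all numbers are assumptions chosen for illustration. The second case shows that a model that never predicts the positive class can still reach a high accuracy when positives are rare.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct; 0.0 if there are no predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Hypothetical counts: 3 TP, 5 TN, 1 FP, 1 FN
print(accuracy(tp=3, tn=5, fp=1, fn=1))   # 0.8

# Hypothetical imbalanced case: every sample is predicted negative,
# so all 5 actual positives are missed, yet accuracy is still high
print(accuracy(tp=0, tn=95, fp=0, fn=5))  # 0.95
```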

### Recall

Recall is the ratio of correctly predicted positive values (`TP`) to the total number of actual positive values (<a href="https://en.wikipedia.org/wiki/Precision_and_recall#:~:text=In%20pattern%20recognition%2C%20information%20retrieval%20and%20Classification%20%28machine,amount%20of%20relevant%20instances%20that%20were%20actually%20retrieved.">Wikipedia</a>). It tells us how many of the actual positive cases the model recovers.

Mathematically, recall is defined as:

$$ 
Recall = \frac{TP}{TP+FN}
$$

### Precision
Precision is a measure of the correctness of the positive predictions. It tells us how many of the predicted positives are actually positive (<a href="https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5">Suresh [2020]</a>, <a href="https://www.analyticsvidhya.com/blog/2020/09/precision-recall-machine-learning/">Huilgol [2020]</a>).

Mathematically, 

$$
Precision = \frac {TP}{TP + FP}
$$
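Precision and recall share the same numerator but divide by different totals, which the sketch below makes concrete. The function names and the counts are hypothetical. Returning 0.0 when the denominator is zero is a convention assumed here, not something the formulas themselves define.

```python
def precision(tp, fp):
    """TP divided by all predicted positives (TP + FP)."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

def recall(tp, fn):
    """TP divided by all actual positives (TP + FN)."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0

# Hypothetical counts: the model flags 10 samples as positive, 8 of them correctly,
# but misses 8 of the 16 actual positives
tp, fp, fn = 8, 2, 8
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 0.5
```

In this example the model is usually right when it predicts positive, yet it finds only half of the actual positives.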

### F1-score

Also called `F-score`, the F1-score is a measure of a model's accuracy on a dataset (<a href="https://deepai.org/machine-learning-glossary-and-terms/f-score">Wood</a>). It combines the precision and recall of the model by taking their harmonic mean. Its value ranges between 0 and 1 `(0 ≤ F1-score ≤ 1)`. The harmonic mean is useful when there is no reason to treat precision as more important than recall, or the reverse, so the two are combined into a single number.

$$
F1\text{-}Score = 2 \times \frac{Recall \times Precision}{Recall + Precision}
$$
*For a well-balanced model, the F-score should be high (ideally 1).*
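To see why the harmonic mean is used, the sketch below evaluates the F1 formula for a few hypothetical precision and recall pairs. The function name `f1_score` and the zero-denominator convention are assumptions made for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.8, 0.5), 3))  # 0.615
print(f1_score(1.0, 0.0))            # 0.0
print(f1_score(1.0, 1.0))            # 1.0
```

An arithmetic mean would give 0.5 for the second pair, while the harmonic mean gives 0: the F1-score stays low whenever either precision or recall is low.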

### ROC (Receiver Operating Characteristic) Curve

The ROC (`Receiver Operating Characteristic`) curve is a graphical plot that shows the performance of a machine learning model (<a href="https://medium.com/analytics-vidhya/what-is-roc-curve-1f776103c998">Jadhav, 2020</a>). The ROC curve is needed to evaluate whether the model truly represents the dataset or not. For example, a model may reach a high accuracy and still fail to represent real-world samples. It is therefore advisable to evaluate the model further, and this is where the ROC curve is needed.

The ROC curve is an evaluation tool for binary classification problems (<a href="https://neptune.ai/blog/f1-score-accuracy-roc-auc-pr-auc">Czakan, 2021</a>). It is constructed by plotting the True Positive Rate (`TPR`) against the False Positive Rate (`FPR`) (<a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">Wikipedia</a>).

Mathematically, the two rates are defined as:
$$
TPR = \frac {TP}{TP + FN}
$$
$$
FPR = \frac {FP}{TN + FP}
$$
The ROC curve plots TPR (y-axis) against FPR (x-axis) at different classification thresholds (<a href="https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/">Bhandari, 2020</a>). The classification threshold is inversely related to both TPR and FPR: lowering the threshold classifies more samples as positive, which increases both TPR and FPR.
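The effect of the threshold can be checked on a small example. The sketch below computes one (FPR, TPR) point per threshold; the function name `roc_point`, the labels, and the scores are hypothetical. It assumes a sample is predicted positive when its score is greater than or equal to the threshold.

```python
def roc_point(actual, scores, threshold):
    """Return (FPR, TPR) when samples with score >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(actual, scores):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return fpr, tpr

# Hypothetical labels (1 = positive) and model scores
actual = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

for threshold in (0.75, 0.5, 0.25):
    fpr, tpr = roc_point(actual, scores, threshold)
    print(f"threshold={threshold}: FPR={fpr}, TPR={tpr}")
# threshold=0.75: FPR=0.0, TPR=0.5
# threshold=0.5: FPR=0.25, TPR=0.75
# threshold=0.25: FPR=0.5, TPR=1.0
```

Each printed pair is one point on the ROC curve; as the threshold drops, both rates rise.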

These curves are important aids in evaluating and fine-tuning classification models (<a href="https://towardsdatascience.com/demystifying-roc-curves-df809474529a">Toshniwal, 2020</a>). Historically, the ROC curve was developed in the 1940s by the US Army to measure the ability of radar receivers to detect incoming signals.

##### The main purposes of the ROC Curve
* Analysing the strength/predictive power of a model
* Determining the optimal threshold
* Comparing two models

For a more detailed explanation of these purposes, see the article by <a href="https://towardsdatascience.com/demystifying-roc-curves-df809474529a">Toshniwal (2020)</a>.

### AUC(Area under the ROC Curve)

To compute the points on an ROC curve, the model could be evaluated many times with different classification thresholds. However, doing so is inefficient. An efficient sorting-based algorithm can provide the same information instead, and it is called AUC ("<a href="https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc">ROC Curve and AUC</a>").

AUC is the area under the ROC curve. It provides an aggregate measure of performance across all possible classification thresholds (<a href="https://towardsdatascience.com/an-understandable-guide-to-roc-curves-and-auc-and-why-and-when-to-use-them-92020bc4c5c1">Agarwal, 2021</a>). Its value ranges between `0 and 1`. The objective is to maximize AUC, so that thresholds with a high TPR and a low FPR can be found. A model whose predictions are 100% wrong has an `AUC` of 0.0, and one whose predictions are 100% correct has an AUC of 1.0. The choice of threshold then depends on how the model should balance `False Positive` and `False Negative` errors.
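One way to compute AUC without drawing the curve is to use its equivalent pairwise interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below follows that formulation. It is a simple quadratic-time illustration, not the sorting-based algorithm mentioned above, and the function name `auc_pairwise` and the data are hypothetical.

```python
def auc_pairwise(actual, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = positive) and model scores
actual = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
print(auc_pairwise(actual, scores))                          # 0.875

# Perfectly ordered and perfectly reversed scores give the two extremes
print(auc_pairwise([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))      # 1.0
print(auc_pairwise([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9]))      # 0.0
```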