### 4- Examples of well-known machine learning algorithms used to solve classification problems

Here are some well-known machine learning algorithms commonly used to solve classification problems:

- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Naive Bayes
- Neural Networks (Deep Learning)
- AdaBoost
- Gradient Boosting Machines (GBM)
- XGBoost
- CatBoost
- LightGBM

### 14- How to evaluate a Classification model?

- Many metrics are commonly used to evaluate the performance of classification models in machine learning.
- The choice of metrics depends on the specific goals and characteristics of the classification problem.
- Here are some common classification metrics:
    - Confusion matrix
    - Accuracy
    - Precision
    - F1 Score
    - Recall (Sensitivity or True Positive Rate)
    - Specificity or True Negative Rate
    - Area Under the Receiver Operating Characteristic (ROC) Curve (AUC-ROC)
    - Area Under the Precision-Recall Curve (AUC-PR) 
- The problem type (binary classification or multiclass classification) also influences which metrics apply.
- For example, on imbalanced datasets, where one class has significantly more samples than the other, precision, recall, and F1 score are often more informative than accuracy.

#### 14. 1- What is confusion matrix in classification problems?

- A confusion matrix is a table used to measure the performance of a classification model.
- It details the number of instances that were correctly or incorrectly classified for each class.
- The confusion matrix is a valuable tool for assessing the strengths and weaknesses of a classification model and guiding further optimization efforts.
- Here is an example of a confusion matrix for a binary classification problem:
![title](images/confusion-matrix1.jpeg)
##### a. True Positive:
- Samples that are from the positive class and were correctly predicted as positive by the model.
##### b. True Negative:
- Samples that are from the negative class and were correctly predicted as negative by the model.
##### c. False Positive:
- Samples that are from the negative class but were incorrectly predicted as positive by the model.
##### d. False Negative:
- Samples that are from the positive class but were incorrectly predicted as negative by the model.
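
The four cells can be counted directly from the true and predicted labels. The sketch below is a minimal illustration in plain Python; the helper name `confusion_counts` and the label lists are hypothetical, chosen only for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # positive sample predicted as positive
            else:
                fp += 1  # negative sample predicted as positive
        else:
            if actual == positive:
                fn += 1  # positive sample predicted as negative
            else:
                tn += 1  # negative sample predicted as negative
    return tp, tn, fp, fn


# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=4 FP=1 FN=2
```

In practice, `confusion_matrix(y_true, y_pred)` from `sklearn.metrics` returns the same counts as a 2x2 array laid out as `[[TN, FP], [FN, TP]]` for labels 0 and 1.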

#### 14. 2- How to define Accuracy?

- An evaluation metric that measures the overall performance of a classification model.

- It divides the number of correctly classified observations by the total number of samples.

- **Formula:** $$\text{Accuracy} = {\text{Number of correct predictions} \over \text{Total number of predictions}}$$


- Equivalently, in terms of the confusion matrix: $$\text{Accuracy} = {TP + TN \over TP + TN + FP + FN}$$
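
As a worked example of the confusion-matrix form of the formula, here is a small sketch in plain Python. The function name `accuracy` and the counts are hypothetical. The second call illustrates why accuracy can mislead on imbalanced data: a model that never predicts the positive class still scores 95%.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


# Hypothetical balanced example: 7 correct predictions out of 10
acc_balanced = accuracy(tp=3, tn=4, fp=1, fn=2)
print(acc_balanced)  # 0.7

# Hypothetical imbalanced example: 95 negatives, 5 positives,
# and a model that predicts "negative" for every sample
acc_imbalanced = accuracy(tp=0, tn=95, fp=0, fn=5)
print(acc_imbalanced)  # 0.95
```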

#### 14. 3- How to define Precision?
- An evaluation metric that measures how many of the model's positive predictions are actually correct.
- It divides the number of true positive predictions by the sum of true positives and false positives.
- It lies in the interval [0, 1]: 0 corresponds to no precision and 1 corresponds to perfect precision.
- Precision = Positive Predictive Value (PPV)
- **Formula:** $$\text{Precision} = {\text{True Positives} \over \text{True Positives} + \text{False Positives}}$$

#### 14. 4- How to define Recall, Sensitivity or True Positive Rate?
- An evaluation metric that measures the ability of the model to capture all the positive samples.
- It divides the number of true positive samples by the sum of true positives and false negatives.
- Recall = Sensitivity = True Positive Rate.
- **Formula:** $$\text{Recall} = {\text{True Positives} \over \text{True Positives} + \text{False Negatives}}$$
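
Precision and recall differ only in their denominator, which the following plain-Python sketch makes explicit. The function names and counts are hypothetical, and returning `0.0` when the denominator is zero is a convention assumed here, not a universal rule.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): how many predicted positives are correct."""
    denominator = tp + fp
    # Assumed convention: 0.0 when the model made no positive predictions
    return tp / denominator if denominator else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN): how many actual positives were found."""
    denominator = tp + fn
    # Assumed convention: 0.0 when there are no positive samples
    return tp / denominator if denominator else 0.0


# Hypothetical counts: 3 true positives, 1 false positive, 2 false negatives
p = precision(tp=3, fp=1)
r = recall(tp=3, fn=2)
print(p, r)  # 0.75 0.6
```

The equivalent scikit-learn functions are `precision_score()` and `recall_score()`, which take the label lists rather than the counts.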
#### 14. 5- How to define F1-score?
- An evaluation metric that combines both Precision and Recall.
- It is the harmonic mean of Precision and Recall, so it is high only when both are high.
- It can be calculated using the `f1_score()` function of `scikit-learn`.
- F1 lies in the interval [0, 1]: 0 is the worst case and 1 is the best.
- **Formula:** $$F1 = {2 \times \text{Precision} \times \text{Recall} \over \text{Precision} + \text{Recall}}$$
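
A short sketch, assuming scikit-learn is installed, that compares `f1_score()` with the formula applied by hand. The label lists are hypothetical.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 3 / (3 + 1)
r = recall_score(y_true, y_pred)     # 3 / (3 + 2)

# Harmonic mean of precision and recall, computed by hand
f1_manual = 2 * p * r / (p + r)

# Same value computed directly by scikit-learn
f1 = f1_score(y_true, y_pred)

print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, F1 stays low when either precision or recall is low.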

#### 14. 5- How to define Specificity or True Negative Rate?
- Specificity measures the ability of the model to correctly identify negative instances.
- It divides the number of true negatives by the sum of true negatives and false positives.
- True Negative Rate = Specificity
- **Formula:** $$\text{Specificity} = {\text{True Negatives} \over \text{True Negatives} + \text{False Positives}}$$
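
scikit-learn has no dedicated specificity function, so one possible approach is to derive it from the confusion matrix. The sketch below assumes scikit-learn is installed and uses hypothetical labels; it also shows that specificity is the recall of the negative class.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For labels 0 and 1 the matrix is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

specificity = tn / (tn + fp)

# Specificity is the recall computed with the negative class as "positive"
specificity_via_recall = recall_score(y_true, y_pred, pos_label=0)

print(f"specificity={specificity:.2f}")
```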
#### 14. 6- What is Receiver Operating Characteristic (ROC) and Area under-ROC curve (AUC-ROC)?
- The ROC curve is a graphical representation of the model's performance across different classification thresholds.
- It plots the True Positive Rate (y-axis) against the False Positive Rate (x-axis); the closer the curve is to the top-left corner, the better the model.
- Area under the ROC curve: AUC-ROC provides a single metric indicating the model's ability to distinguish between classes.
- Here is an illustration of the ROC curve and AUC-ROC:

![title](images/roc-curve-original.png)

- A higher AUC-ROC indicates a better model: 1.0 corresponds to a perfect classifier and 0.5 to random guessing.
- Smaller values on the x-axis of the curve indicate fewer false positives and more true negatives.
- Larger values on the y-axis of the curve indicate more true positives and fewer false negatives.
- We can compute the points of the ROC curve using the `roc_curve()` function of `scikit-learn`.
- To calculate the AUC, we use the `roc_auc_score()` function of `scikit-learn`.
- Note: False Positive Rate = 1 - Specificity



*source: https://sefiks.com/2020/12/10/a-gentle-introduction-to-roc-curve-and-auc/
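
A minimal sketch of the two scikit-learn calls named in this section, assuming scikit-learn is installed. The four labels and scores are hypothetical; in a real project `y_score` would come from something like `model.predict_proba(X)[:, 1]`.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per threshold; plotting tpr against fpr gives the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under that curve as a single number
auc_value = roc_auc_score(y_true, y_score)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f} TPR={t:.2f}")
print(f"AUC-ROC={auc_value:.2f}")  # AUC-ROC=0.75
```

The value 0.75 can be checked by hand: of the four (positive, negative) pairs, the positive sample gets the higher score in three.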

#### 14. 7- What is Area Under the Precision-Recall Curve (AUC-PR)?
- Similar to AUC-ROC, AUC-PR represents the area under the precision-recall curve.
- It provides a summary measure of a model's performance across various levels of precision and recall.
- The points of the precision-recall curve can be computed using the `precision_recall_curve()` function of `scikit-learn`.
- The area under the curve can then be calculated using the `auc()` function of `scikit-learn`, taking the recall and precision values as input.

![title](images/precision_recall_curve.png)

*source: https://analyticsindiamag.com/complete-guide-to-understanding-precision-and-recall-curves/

- As with AUC-ROC, a higher AUC-PR indicates a better model and a lower value indicates poorer performance.
- Recall is plotted on the x-axis and precision on the y-axis.
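
The two steps described above (curve points, then area) can be sketched as follows, assuming scikit-learn is installed. The labels and scores are hypothetical.

```python
from sklearn.metrics import auc, precision_recall_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Precision and recall at each threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Area under the curve: recall is the x-axis, precision is the y-axis
pr_auc = auc(recall, precision)

for p, r in zip(precision, recall):
    print(f"recall={r:.2f} precision={p:.2f}")
print(f"AUC-PR={pr_auc:.3f}")
```

Note that `precision_recall_curve()` returns one more precision/recall point than thresholds, because it appends a final point with recall 0 and precision 1.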
#### a. When to Use ROC vs. Precision-Recall Curves?
- Choosing between ROC curves and precision-recall curves depends on your class distribution:
    - ROC curves are preferable when there are roughly equal numbers of observations for each class.
    - ROC curves can present an overly optimistic picture of the model when the dataset has a large class imbalance.
    - Precision-Recall curves should be used when there is a moderate to large class imbalance.

#### 14. 8 - What is the scikit-learn classification report?
- The `classification_report` function of `scikit-learn` provides a detailed summary of classification metrics for each class in a classification problem.
- The report contains the following metrics:
    - Precision
    - Recall (sensitivity)
    - F1-score
    - Support
- Support: the number of actual instances of each class in the dataset.
- Specificity is not part of the report; for a binary problem it equals the recall of the negative class.
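
A brief sketch of calling `classification_report`, assuming scikit-learn is installed. The labels and the class names passed to `target_names` are hypothetical.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 1 = positive class, 0 = negative class
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
class_names = ["negative", "positive"]  # display names for labels 0 and 1

# Human-readable text table with one row per class
print(classification_report(y_true, y_pred, target_names=class_names))

# Same numbers as a nested dict, convenient for programmatic checks
report = classification_report(
    y_true, y_pred, target_names=class_names, output_dict=True
)
print(report["positive"]["precision"], report["positive"]["recall"])
```

Besides the per-class rows, the report also includes the overall accuracy and the macro and weighted averages of each metric.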
#### 14. 9- How do we evaluate a classification report?
- High recall + high precision ==> the class is handled very well by the model.
- Low recall + high precision ==> the model misses many samples of the class, but is highly trustworthy when it does predict it.
- High recall + low precision ==> the class is well detected, but the model also assigns samples of other classes to it.
- Low recall + low precision ==> the class is poorly handled by the model.
#### 14. 10- What is the log loss function?
- It is an evaluation metric for classifiers that output probabilities, such as logistic regression.
- It is also called logistic loss or cross-entropy loss.
- The input of this loss function is a predicted probability in the interval [0, 1].
- It measures the uncertainty of our predictions based on how far the predicted probability is from the actual label: lower is better, and confident wrong predictions are penalized heavily.
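
For binary labels, log loss is commonly written as $-\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1-y_i)\log(1-p_i)\right]$. The plain-Python sketch below implements that formula; the function name `binary_log_loss`, the clipping constant `eps`, and the sample probabilities are assumptions made for this example.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy between labels (0/1) and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities so that log(0) is never evaluated
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical predictions for one positive and one negative sample
confident_right = binary_log_loss([1, 0], [0.9, 0.1])
unsure = binary_log_loss([1, 0], [0.5, 0.5])
confident_wrong = binary_log_loss([1, 0], [0.1, 0.9])

print(f"{confident_right:.3f} {unsure:.3f} {confident_wrong:.3f}")
```

The loss grows as the predicted probability moves away from the actual label, which is why the confident wrong prediction scores worst. scikit-learn provides the same metric as `log_loss()` in `sklearn.metrics`.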

### 16. How to choose a classifier based on training dataset size?
- If the training set is small ==> simple models with high bias and low variance tend to work better because they are less likely to overfit. Example: Naive Bayes.
- If the training set is large ==> models with low bias and high variance tend to perform better because they can capture complex relationships.
- Balancing variance and bias is essential for developing models that perform well on both training and unseen data.

#### 16. 1- What is data bias?
- It occurs when the data used in the training phase is not representative of the real-world population or phenomenon under study.
- For example: the training data used to create an ML model contains unfair discrepancies or inaccuracies.
- The information provided by the data does not truly represent the situation.
- Biased data can lead to undesired and often unfair outcomes (discriminatory results) when the model is applied to new data, because the model learns these biases too.
- Several types of bias exist: selection bias, measurement bias and confirmation bias.
- Addressing data bias is an ongoing challenge in the field of machine learning, and researchers and practitioners are actively working to develop methods and tools to identify, measure, and mitigate bias in models.
- To mitigate data bias in machine learning, it is crucial to follow well-established steps: collecting diverse and representative data, processing it carefully, and regularly checking model predictions to ensure fairness.
- Example: a biased facial recognition model may perform poorly for certain demographic groups.

#### 16. 2- What is variance?

- Understanding variance is crucial in assessing the stability and generalization capability of models.
- It refers to the degree of spread or dispersion in a set of values.
- It measures how far the individual data points (observations) are from the mean (average) of the dataset:
    - Higher variance: data points are more spread out from the mean ==> more dispersed distribution.
    - Lower variance: data points are closer to the mean ==> more concentrated distribution.
- Formula: $\sigma^2 = { \sum \limits _{i=1} ^{n}(X_{i} - \overline{X})^2 \over {n-1}}$
- The standard deviation ($\sigma$) is the square root of the variance.
- If the variance of a model's predictions is:
    - Low: the predictions vary little from each other.
    - High: the model overfits by reading too deeply into the noise ==> good performance on training data but poor performance on testing data.
- Do not forget the bias-variance trade-off.
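
As a quick numeric check of the sample variance (sum of squared deviations from the mean, divided by n - 1), the sketch below uses Python's standard `statistics` module. The data values are hypothetical.

```python
import statistics

# Hypothetical observations
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)

# Sample variance by hand: squared deviations from the mean, divided by n - 1
squared_deviations = sum((x - mean) ** 2 for x in values)
manual_variance = squared_deviations / (len(values) - 1)

# Same quantity from the standard library (also uses n - 1)
sample_variance = statistics.variance(values)

# Standard deviation is the square root of the variance
sample_std = statistics.stdev(values)

print(f"mean={mean} variance={sample_variance:.3f} std={sample_std:.3f}")
```

`statistics.pvariance()` divides by n instead of n - 1 and is the appropriate choice when the values are the whole population rather than a sample.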

### How does logistic regression work?
- It is a classification algorithm used to predict a discrete output.
- Types of outputs:
    - Binary (2 classes)
    - Multiple (>2 classes)
    - Ordinal (Low, Medium, High)
- It first computes a linear output: $z = mx + b$
- It then uses the sigmoid activation function to map this output to a probability between 0 and 1.
- Sigmoid function formula: $$S(z)={1\over 1+ e^{-z}}$$
<div>
<img src="images/sigmoid-function.png" width="500"/>
</div>
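
The two steps above (linear output `mx + b`, then sigmoid) can be sketched in plain Python. The weight `m`, intercept `b`, and the 0.5 decision threshold are hypothetical values chosen for illustration; in practice `m` and `b` are learned from training data.

```python
import math


def sigmoid(z):
    """Map any real number z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, m, b):
    """Probability of the positive class for a single feature x."""
    z = m * x + b  # linear output
    return sigmoid(z)


# Hypothetical parameters (normally learned during training)
m, b = 2.0, -1.0

for x in (-2.0, 0.5, 2.0):
    prob = predict_proba(x, m, b)
    label = int(prob >= 0.5)  # assumed decision threshold of 0.5
    print(f"x={x:+.1f} probability={prob:.3f} predicted class={label}")
```

At `x = 0.5` the linear output is exactly 0, so the sigmoid returns 0.5, which is the decision boundary for this hypothetical model.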

### What is 'naive' in the Naive Bayes classifier?
- The classifier is called 'naive' because it makes an assumption that may or may not turn out to be correct.
- The algorithm assumes that features are completely independent of each other given the class ==> the presence of one feature of a class is not related to the presence of any other feature.
- Example: any fruit that is red and round is classified as a cherry ==> this can be true or false.
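
Under the independence assumption, the score of each class is its prior probability multiplied by the probability of every feature taken separately. The sketch below illustrates this with the fruit example; all probabilities and the function name `naive_bayes_posterior` are made up for illustration and are not estimated from real data.

```python
# Hypothetical prior probability of each class
priors = {"cherry": 0.3, "apple": 0.7}

# Hypothetical probability of each feature given the class
likelihoods = {
    "cherry": {"red": 0.9, "round": 0.95},
    "apple": {"red": 0.6, "round": 0.9},
}


def naive_bayes_posterior(features, priors, likelihoods):
    """Posterior of each class, treating the features as independent."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in features:
            # 'Naive' step: multiply per-feature probabilities independently
            score *= likelihoods[label][feature]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


posterior = naive_bayes_posterior(["red", "round"], priors, likelihoods)
best = max(posterior, key=posterior.get)
print(posterior, best)
```

With these made-up numbers a red, round fruit is more likely an apple than a cherry, because apples are more common overall.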
### How to know which ML algorithm to use for your classification problem?
- There is no fixed rule for choosing. However, you can follow these guidelines:
    - If accuracy is a concern ==> test different algorithms and cross-validate them
    - If the training dataset is small ==> use models that have low variance and high bias
    - If the training dataset is large ==> use models that have high variance and little bias
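
One possible way to "test different algorithms and cross-validate them" is sketched below, assuming scikit-learn is installed. The Iris dataset, the three candidate models, and 5-fold cross-validation are choices made only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset, used here only as a stand-in for your own data
X, y = load_iris(return_X_y=True)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in candidates.items():
    # 5-fold cross-validation: train on 4 folds, score on the held-out fold
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean score together with its spread across folds gives a fairer picture than a single train/test split, especially on small datasets.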
### How to choose which ML algorithm to use given a dataset?
- There is no master algorithm; it all depends on the situation.
- Answer the following questions:
    - How much data is available?
    - Is the output continuous or categorical?
    - Is it a classification, regression or clustering problem?
    - Are all output variables labeled, or only some of them?