# Data Scientist Interview Questions
## Part 3 : ML : modelling and evaluation

This Jupyter notebook serves as a focused resource for individuals gearing up for technical interviews in the fields of machine learning engineering and data science. It specifically delves into questions related to all phases of machine learning model evaluation and deployment. Whether you're a candidate looking to sharpen your interview skills or an interviewer seeking insightful questions, this notebook provides valuable content for honing your understanding of machine learning evaluation and deployment.

### 0- What does Machine Learning mean ?
Machine Learning (ML) is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform tasks without explicit programming. The core idea behind machine learning is to allow machines to learn patterns, make predictions, or optimize decisions based on data.

Key concepts in machine learning include:
- Types of Machine Learning: supervised, unsupervised and semi-supervised
- Types of machine learning problems: classification, regression and clustering
- Splitting data into training, validation and testing sets (in the supervised case)
- Choosing the right algorithm, which depends on the problem you want to solve
- Model Evaluation
- Hyperparameter Tuning
- Deployment

### 1- What are three stages of building a machine learning model ? 
- The process of building a machine learning model includes three main stages:
    - **Training phase:** after splitting the data into training and testing sets, the training data is used to train our model on a labeled dataset. During the training phase, the model tries to learn relationships between input data and the corresponding output target values while adjusting its internal parameters. The goal is a model that makes accurate predictions or classifications when exposed to unseen data.
    - **Validation phase:** once the model is trained, we evaluate it on a separate dataset known as the validation set (typically 10-15% of our data). This dataset is not used to fit the model. The validation stage helps detect overfitting (the model performs well on training data but poorly on new data) or underfitting (the model is too simple, or too little trained, to capture the underlying patterns in the data).
    - **Testing (Inference) phase:** during this phase, the trained and validated model is applied to unseen dataset, called test dataset. This phase aims to evaluate the model's performance and provides a measure regarding the model's effectiveness and its ability to make accurate predictions in a production environment.
    
#### 1. 1- How to split your data while building a machine learning model ?    
- During the model building phase, it is required to split the data into three main sets to evaluate the model's performance and effectiveness. The three sets are: 
    - Training: used to train the model and learn the relationship between inputs and outputs. It contains 70-80% of the total data.
    - Validation: used to validate the model, fine-tune its hyperparameters and assess its performance during training. It helps to detect overfitting and underfitting. It contains 10-15% of the total data.
    - Testing: used after the validation phase to evaluate the model's performance against unseen data. It measures how effective the model is likely to be in a production environment. It contains 10-15% of the total data.

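- Split the data before fitting any preprocessing step that learns from the data (imputing missing values, encoding categorical features, scaling features, etc.): fit these steps on the training set only, then apply them to the validation and testing sets. Otherwise, information from the held-out sets leaks into training.
- It is important to ensure that the split is representative of the overall distribution of the data to avoid biased results.
- When the dataset is small, cross-validation is preferable to a single fixed split.
- There is no fixed rule for splitting data between training, validation and testing: the proportions vary with the size of the dataset and the problem.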
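
As a minimal sketch of a 70/15/15 split, assuming `scikit-learn` is available: the toy arrays `X` and `y`, the integer set sizes and `random_state=42` are illustrative choices, not values from this notebook. The scaler is fitted on the training set only, so nothing from the validation or testing sets leaks into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): 100 samples, 2 features, balanced binary target.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0, 1] * 50)

# Step 1: hold out 15 samples (15%) as the testing set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=15, stratify=y, random_state=42
)

# Step 2: split the remaining 85 samples into training (70) and validation (15).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=15, stratify=y_rest, random_state=42
)

# Fit preprocessing on the training set only, then reuse it on the other sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
```

The `stratify` argument keeps the class proportions similar in each subset, which is one way to make the split representative of the overall distribution.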

### 2- What are the types of ML algorithms ? 
Machine learning algorithms can be categorized into several types based on their learning styles and the nature of the task they are designed to solve.

Here are some common types of machine learning algorithms:
- **Supervised Learning** 
- **Unsupervised Learning**
- **Semi-Supervised Learning**
- **Deep Learning** 
- **Reinforcement Learning** 
- **Ensemble learning**  
- **Ranking**
- **Recommendation system** 


#### 2.1 - What do supervised, unsupervised and semi-supervised mean in ML?

In machine learning, the terms "supervised learning," "unsupervised learning," and "semi-supervised learning" refer to different approaches based on the type of training data available and the learning task at hand:

- **Supervised Learning :** training a model on a labeled dataset, where the algorithm learns the relationship between input features and corresponding target labels. Can be used for Regression (continuous output) or Classification (discrete output).
- **Unsupervised Learning :** deals with unlabeled data and aims to find patterns, structures, or relationships within the data. Can be used for Clustering (grouping similar data points together) or Association.
- **Semi-Supervised Learning:** Utilizes a combination of labeled and unlabeled data to improve learning performance, often in situations where obtaining labeled data is challenging and expensive.

#### 2.2 - What are Unsupervised Learning techniques ?
We have two techniques, Clustering and Association:
- Clustering : involves grouping similar data points together based on inherent patterns or similarities. Example: grouping customers with similar purchasing behavior for targeted marketing.
- Association : identifying patterns of association between different variables or items. Example: an e-commerce website suggests other items for you to buy based on prior purchases.
#### 2.3 - What are Supervised Learning techniques ? 
We have two techniques: classification and regression:
- Regression : involves predicting a continuous output or numerical value based on input features. Examples : predicting house prices, temperature, stock prices etc.
- Classification : is the task of assigning predefined labels or categories to input data. We have two types of classification algorithms: 
    - Binary classification (two classes). Example: identifying whether an email is spam or not.
    - Multiclass classification (multiple classes). Example: classifying images of animals into different species.

### 3- Examples of well-known machine learning algorithms used to solve regression problems

Here are some well-known machine learning algorithms commonly used to solve regression problems:

- Linear Regression
- Lasso Regression
- Ridge Regression
- Decision Trees
- Random Forest
- Gradient Boosting Algorithms (e.g., XGBoost, LightGBM)
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Bayesian Regression
- Neural Networks (Deep Learning)

### 4- Examples of well-known machine learning algorithms used to solve classification problems

Here are some well-known machine learning algorithms commonly used to solve classification problems:

- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Naive Bayes
- Neural Networks (Deep Learning)
- AdaBoost
- Gradient Boosting Machines (GBM)
- XGBoost
- CatBoost
- LightGBM


### 5- Examples of well-known machine learning algorithms used to solve clustering problems

Several well-known machine learning algorithms are commonly used for solving clustering problems. Here are some examples:

- K-Means Clustering 
- Hierarchical Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Mean Shift
- Gaussian Mixture Model (GMM)
- Agglomerative Clustering

These algorithms address different types of clustering scenarios and have varying strengths depending on the nature of the data and the desired outcomes. The choice of clustering algorithm often depends on factors such as the shape of clusters, noise in the data, and the number of clusters expected.

#### 5.1- K-Means 
- The best-known and most widely used clustering algorithm
- Has two versions : Hard K-means and Soft K-means
- Steps of Hard K-means:
    - Choose the number of clusters K
    - Start with an initial guess for the centroids : $x_k(0)$, $k=1, \dots, K$
    - etc.
- Formula : $y(n)=x(n)+v(n)$
    - $y(n)$: data point
    - $x(n)$: centroid
    - $v(n)$: Gaussian noise (statistically independent, mean = 0 and variance = ${\sigma_{k}^2}$)
- This algorithm has two disadvantages (problem with linear complexity):
    - The number of clusters K must be chosen in advance
    - The initial centroids are chosen at random : different choices can lead to different results
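
The list above elides the remaining steps with "etc."; the sketch below assumes they are the standard alternation of an assignment step and a centroid update step, repeated until the centroids stop moving. The function name `hard_kmeans`, its parameters and the toy data are illustrative, not part of this notebook.

```python
import numpy as np

def hard_kmeans(Y, init_centroids, n_iter=100):
    """Hard K-means: alternate nearest-centroid assignment and mean update."""
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(n_iter):
        # Assignment step: each point y(n) joins the cluster of its nearest centroid.
        dists = np.linalg.norm(Y[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the points assigned to it
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            Y[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return centroids, labels

# Toy data (assumption): two well-separated groups of three points each.
Y = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = hard_kmeans(Y, init_centroids=Y[[0, 3]])
print(labels)
print(centroids)
```

Passing the initial centroids explicitly makes the dependence on the initial guess visible: a different `init_centroids` can lead to a different final partition.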
#### a- What does soft K-means mean?
- It does not require a hard decision in which y(n) belongs to one and only one decision region
- y(n) has a probability of belonging to each of the K clusters
- Each centroid xk is calculated as a function of the weights assigned to each point
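
One iteration of soft K-means can be sketched as follows. The exponential weighting with a stiffness parameter `beta` is one common choice and is an assumption here, as are the function name `soft_kmeans_step` and the toy data.

```python
import numpy as np

def soft_kmeans_step(Y, centroids, beta=1.0):
    """One soft K-means iteration: soft assignment, then weighted centroid update."""
    # Squared distance from every point to every centroid, shape (N, K).
    sq_dists = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # Soft assignment: closer centroids get larger weights; each row sums to 1.
    # Subtracting the row minimum only improves numerical stability.
    weights = np.exp(-beta * (sq_dists - sq_dists.min(axis=1, keepdims=True)))
    weights /= weights.sum(axis=1, keepdims=True)
    # Update: each centroid is the weighted mean of all points.
    new_centroids = (weights.T @ Y) / weights.sum(axis=0)[:, None]
    return weights, new_centroids

# Toy data (assumption): three points on a line, two centroids at the ends.
Y = np.array([[0.0], [1.0], [2.0]])
weights, new_centroids = soft_kmeans_step(Y, centroids=np.array([[0.0], [2.0]]))
print(weights)
print(new_centroids)
```

The middle point is equidistant from both centroids, so it receives a weight of 0.5 for each cluster instead of being forced into one of them.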
#### b- What is K-median?
- It is a clustering algorithm that uses the median to calculate the updated center/centroid of each group
- Why? The median is less affected by outliers, but this method is much slower for large datasets because sorting is required on each iteration to compute the median

#### 5.2- What is Hierarchical Clustering ? 
- It is a clustering technique used in data analysis and machine learning.
- It builds a hierarchy of clusters by iteratively merging clusters (starting from one cluster per data point) or splitting them (starting from a single cluster), based on their similarity. The result is represented as a dendrogram.
- A dendrogram is a tree-like representation of the cluster hierarchy, with the vertical lines indicating the merging or splitting points.

<div>
<img src="images/Dendrogram.png" width="500"/>
</div>

- Here are the key characteristics and steps used in hierarchical clustering:

    1. Each data point starts as a singleton cluster
    2. The similarity or dissimilarity between data points is calculated using a chosen distance metric. 
    3. Based on the similarity/dissimilarity values, the algorithm either merges clusters or splits data points into new clusters.
    4. After merging or splitting clusters, a dendrogram is constructed.
    5. The algorithm continues until all data points belong to a single cluster or until a predefined stopping criterion is met.

- Hierarchical clustering can be classified into two main types:
    - Agglomerative (Bottom-Up) Hierarchical Clustering: it starts with individual data points as separate clusters and merges them iteratively to form larger clusters.
    - Divisive (Top-Down) Hierarchical Clustering: it starts with all data points in a single cluster and recursively splits them into smaller clusters.
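
A minimal sketch of agglomerative clustering, assuming SciPy is available; the toy data, the `"average"` linkage and the cut into two clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data (assumption): two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)

# Bottom-up merging: every row of Z records one merge
# (the two merged clusters, their distance, the size of the new cluster).
Z = linkage(X, method="average", metric="euclidean")

# Cut the hierarchy so that at most two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(Z)
print(labels)
```

The matrix `Z` is also the input expected by `scipy.cluster.hierarchy.dendrogram`, which draws the dendrogram when a plotting backend is available.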

#### a. Advantages : 
- Does not require number of clusters in advance.
- Easy to implement.
- Produces a dendrogram which helps with understanding the data.
- Intuitive visualization with dendrograms
- It is possible to cut dendrogram if we want to change the number of clusters. 
- Captures the hierarchical structure of the data
#### b. Disadvantages:
- Computationally more intensive, especially for large datasets. 
- Need to define a distance between two observations and between two clusters.
- Sometimes it is difficult to identify the number of clusters.
- Lack of flexibility when dealing with non-globular shapes.

#### 5.3. What distance metrics are used as similarity measures between two samples in clustering? 
- Several distance metrics are commonly used as similarity measures between two samples in data analysis and clustering.
- The choice of distance metric depends on the nature of the data and the characteristics of the analysis.
- Here are some commonly used distance metrics:
    - Euclidean Distance
    - Manhattan Distance
    - Minkowski Distance
    - Chebyshev Distance (also called Maximum Distance)
    - Hamming Distance
    - Cosine Similarity
    - Correlation Distance
    - Jaccard Similarity Coefficient
    - Mahalanobis Distance
    
Here are more details regarding each distance metric : 
- **Euclidean Distance:**
    - Measures the straight-line distance between two points in Euclidean space.
    - Suitable for continuous numerical data.
    - Formula: $d(x,y)=\sqrt{\sum \limits_{i=1}^{n}(x_{i}-y_{i})^2}$
- **Manhattan Distance:**
    - Calculates the sum of the absolute differences between the coordinates of two points.
    - Suitable for data with attributes that have different units or scales.
    - Formula: $d(x,y)=\sum \limits_{i=1}^{n}|x_{i}-y_{i}|$
- **Minkowski Distance:** 
    - Generalizes both Euclidean and Manhattan distances.
    - Parameterized by a parameter 'p', where p = 1 corresponds to Manhattan distance, and p = 2 corresponds to Euclidean distance.
    - Formula: $d(x,y)=\left(\sum \limits_{i=1}^{n}|x_{i}-y_{i}|^{p}\right)^{1/p}$
- **Chebyshev Distance:**
    - Measures the maximum absolute difference between coordinates.
    - Suitable for data where outliers might have a significant impact.
    - Formula: $d(x,y)=\max \limits_{i}|x_{i}-y_{i}|$
- **Hamming Distance:**
    - Measures the number of positions at which corresponding symbols differ in two binary strings.
    - Suitable for categorical data or binary data.
    - Formula: $d(x,y)=\sum \limits_{i=1}^{n}\mathbb{1}[x_{i}\neq y_{i}]$
- **Cosine Similarity:**
    - Measures the cosine of the angle between two vectors.
    - Suitable for text data, document clustering, and cases where the magnitude of the data points is less important.
- **Correlation Distance:**
    - Measures the similarity in shape between two vectors, taking into account the correlation between variables.
    - Suitable for datasets where the relative changes in variables are more important than their absolute values.
- **Jaccard Similarity Coefficient:**
    - Measures the similarity between two sets by comparing the size of their intersection to the size of their union.
    - Suitable for binary or categorical data.
- **Mahalanobis Distance:**
    - Takes into account the correlation between variables and the scales of the variables.
    - Suitable for datasets with correlated variables and different variances.
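
The metrics above can be computed with `scipy.spatial.distance`, as in this small sketch. The vectors and the diagonal covariance matrix are made-up values. Note that SciPy returns distances, so the cosine and Jaccard similarities are obtained as one minus the distance, and `hamming` returns the fraction of differing positions rather than their count.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclid = distance.euclidean(a, b)            # sqrt(3^2 + 4^2 + 0^2)
manhattan = distance.cityblock(a, b)         # 3 + 4 + 0
chebyshev = distance.chebyshev(a, b)         # max(3, 4, 0)
minkowski3 = distance.minkowski(a, b, p=3)   # (3^3 + 4^3 + 0^3)^(1/3)
cosine_sim = 1.0 - distance.cosine(a, b)
corr_dist = distance.correlation(a, b)

# Mahalanobis needs the inverse covariance matrix of the data; a diagonal
# covariance with variances 9, 16 and 1 is assumed here for illustration.
VI = np.linalg.inv(np.diag([9.0, 16.0, 1.0]))
mahalanobis = distance.mahalanobis(a, b, VI)

# Binary vectors for Hamming and Jaccard.
u = np.array([1, 0, 1, 1], dtype=bool)
v = np.array([1, 1, 0, 1], dtype=bool)
hamming_count = distance.hamming(u, v) * len(u)
jaccard_sim = 1.0 - distance.jaccard(u, v)

print(euclid, manhattan, chebyshev, minkowski3)
print(cosine_sim, corr_dist, mahalanobis, hamming_count, jaccard_sim)
```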

#### 5.4. How to calculate the distance between two clusters ?
- The distance between two clusters is often referred to as linkage or proximity.
- These distance measures are used during the agglomeration process in hierarchical clustering.
- The algorithm iteratively merges clusters based on the chosen linkage method until all data points belong to a single cluster.
- The choice of linkage method can impact the resulting dendrogram and the interpretation of the cluster structure.
- Different linkage methods may be suitable for different types of data and clustering objectives.
- Here are some commonly used linkage methods:
    - Single Linkage (Minimum Linkage)
    - Complete Linkage (Maximum Linkage)
    - Average Linkage (UPGMA - Unweighted Pair Group Method with Arithmetic Mean)
    - Centroid Linkage (UPGMC - Unweighted Pair Group Method with Centroid)
    - Ward's Method

Here are more details regarding each method:
- **Single Linkage (Minimum Linkage):**
    - The distance between two clusters is the minimum distance between any two points in the two clusters.
    - Formula: $d(C_{1},C_{2})=\min \limits_{i \in C_{1}, j \in C_{2}} \text{distance}(i,j)$

- **Complete Linkage (Maximum Linkage):**
    - The distance between two clusters is the maximum distance between any two points in the two clusters.
    - Formula: $d(C_{1},C_{2})=\max \limits_{i \in C_{1}, j \in C_{2}} \text{distance}(i,j)$
- **Average Linkage (UPGMA - Unweighted Pair Group Method with Arithmetic Mean):**
    - The distance between two clusters is the average distance between all pairs of points from the two clusters.
    - Formula: $d(C_{1},C_{2})={1 \over |C_{1}|\times|C_{2}|}{\sum \limits _{i\in C_{1}} \sum \limits _{j\in C_{2}} \text{distance}(i,j)}$
- **Centroid Linkage (UPGMC - Unweighted Pair Group Method with Centroid):**
    - The distance between two clusters is the distance between their centroids (mean vectors).
    - Formula: $d(C_{1},C_{2})=\text{distance}(\text{centroid}(C_{1}), \text{centroid}(C_{2}))$
- **Ward's Method:**
    - Minimizes the increase in within-cluster variance when merging.
    - It considers the sum of squared differences within each cluster and the sum of squared differences between the centroids of the clusters.
    - Formula: $d(C_{1},C_{2})= \sqrt{{|C_{1}|\times|C_{2}|}\over {|C_{1}|+|C_{2}|}}\,\text{distance}(\text{centroid}(C_{1}), \text{centroid}(C_{2}))$
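
The linkage formulas can be checked numerically on two small clusters. This sketch assumes Euclidean distance between points; the clusters `C1` and `C2` are made-up values.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two toy clusters (assumption), two points each.
C1 = np.array([[0.0, 0.0], [0.0, 2.0]])
C2 = np.array([[3.0, 0.0], [3.0, 2.0]])

# All pairwise Euclidean distances between points of C1 and points of C2.
pairwise = cdist(C1, C2)

single = pairwise.min()      # minimum pairwise distance
complete = pairwise.max()    # maximum pairwise distance
average = pairwise.mean()    # mean over all |C1| x |C2| pairs

# Distance between the mean vectors of the two clusters.
centroid = np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0))

# Ward: centroid distance scaled by sqrt(|C1||C2| / (|C1| + |C2|)).
n1, n2 = len(C1), len(C2)
ward = np.sqrt(n1 * n2 / (n1 + n2)) * centroid

print(single, complete, average, centroid, ward)
```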

### 6 -What is Ensemble learning?

Ensemble learning is a machine learning technique that involves combining the predictions of multiple individual models to improve overall performance and accuracy. Instead of relying on a single model, ensemble methods leverage the strengths of diverse models to compensate for each other's weaknesses. The idea is that by aggregating the predictions of multiple models, the ensemble can achieve better generalization and make more robust predictions than any individual model.

There are several ensemble learning methods, with two primary types being:
- **Bagging (Bootstrap Aggregating) :** 
    - Involves training multiple instances of the same model on different subsets of the training data, typically sampled with replacement. 
    - Examples : Random Forest, Bagged Decision Trees, Bagged SVM (Support Vector Machines), Bagged K-Nearest Neighbors, Bagged Neural Networks
- **Boosting :**
    - Focuses on sequentially training models, with each subsequent model giving more attention to the instances that the previous models misclassified. 
    - Examples: AdaBoost (Adaptive Boosting), Gradient Boosting, XGBoost (Extreme Gradient Boosting), LightGBM (Light Gradient Boosting Machine), CatBoost, GBM (Gradient Boosting Machine)
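
A minimal comparison of one bagging model and one boosting model, assuming `scikit-learn`. The synthetic dataset and all hyperparameters are illustrative, and the scores it prints say nothing about which method is better in general.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Bagging: 25 decision trees, each trained on a bootstrap sample of the training set.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)

# Boosting: 50 shallow trees trained sequentially, each correcting its predecessors.
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0)

scores = {}
for name, model in {"bagging": bagging, "boosting": boosting}.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out set
print(scores)
```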


### 8- What are Recommender Systems ?
Recommender systems, also known as recommendation systems or engines, are applications or algorithms designed to suggest items or content to users based on their preferences and behavior. These systems leverage data about users and items to make personalized recommendations, aiming to enhance user experience and satisfaction. There are two main types of recommender systems:

- Content-Based Recommender Systems
- Collaborative Filtering Recommender Systems

Recommender systems are widely used in various industries, including e-commerce, streaming services, social media, and more. They help users discover new items, increase user engagement, and contribute to business success by promoting relevant content and products.

#### 8. 1- What are Content-Based Recommender Systems ? 
#### 8. 2- What are Collaborative Filtering Recommender Systems ?

### 9 -What is Reinforcement Learning?

Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties based on its actions, allowing it to learn optimal strategies over time to maximize cumulative rewards. It is inspired by the way humans and animals learn from trial and error.

Here are some applications of Reinforcement Learning : 
- Automated Robots
- Natural Language Processing
- Marketing and Advertising 
- Image Processing
- Recommendation Systems
- Traffic Control 
- Healthcare 
- Etc.

### 10- What is Ranking ? 

Ranking in machine learning refers to the process of assigning a meaningful order or ranking to a set of items based on their relevance or importance. This is often used in scenarios where the goal is to prioritize or sort items based on their predicted or observed characteristics.

Ranking problems are common in various applications, including information retrieval, recommendation systems, and search engines.

### 11- Overfitting and Underfitting

Overfitting and underfitting are common challenges in machine learning that relate to the performance of a model on unseen data.
- Overfitting : occurs when a machine learning model learns the training data too well, capturing noise and random fluctuations in addition to the underlying patterns. The error is low on the training dataset but high on the testing dataset.

- Underfitting : happens when a model is too simplistic and cannot capture the underlying patterns in the training data. High error rate on both training and testing datasets.

#### 11. 1 - Overfitting Causes and Mitigation:
- Causes:
    - Too many features or parameters.
    - Complex model architectures.
    - Limited training data.
- Consequences:
    - Low error on the training data but high error on new, unseen data (poor generalization).
- Mitigation:
    - Regularization techniques (e.g., L1 or L2 regularization).
    - Feature selection or dimensionality reduction.
    - Increasing the amount of training data.
    - Using simpler model architectures: fewer variables and parameters reduce the variance.
    - Using cross-validation to detect overfitting during model selection.
    
#### 11. 2 - Underfitting Causes and Mitigation:
- Causes:
    - Too few features or parameters.
    - Insufficient model complexity.
    - Inadequate training time or data.
- Mitigation:
    - Increasing the complexity of the model.
    - Adding relevant features.
    - Training for a longer duration.
    - Considering more sophisticated model architectures.

Achieving a balance between overfitting and underfitting is crucial. This balance, often referred to as the model's "sweet spot," results in a model that generalizes well to new, unseen data. Techniques like cross-validation, hyperparameter tuning, and monitoring learning curves can help strike this balance during the model development process.
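
Both behaviours can be observed by varying model complexity on the same data. This sketch assumes `scikit-learn` and uses a noisy sine curve with decision trees of increasing depth; the data, the depths and the noise level are illustrative choices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve (assumption): the noise is what an overfitted model memorizes.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

errors = {}
for depth in (1, 4, None):  # None lets the tree grow until it fits the training set
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    errors[depth] = (
        mean_squared_error(y_train, tree.predict(X_train)),  # training error
        mean_squared_error(y_test, tree.predict(X_test)),    # testing error
    )
    print(depth, errors[depth])
```

A depth of 1 is too simple for a sine curve, so both errors stay high (underfitting). The unrestricted tree reproduces the training targets exactly, so its training error drops to zero while its testing error typically does not follow (overfitting).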

### 12- What are the types of Regularization in Machine Learning

- Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the objective/loss function.
- It consists of adding a cost term that penalizes large model weights.
- There are mainly two types of regularization commonly used: L1 regularization (Lasso) and L2 regularization (Ridge).
- Additionally, Elastic Net is a combination of both L1 and L2 regularization.

Here are the techniques covered in this section :
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
- Elastic Net

#### 12. 1 - L1 Regularization (Lasso) : 
L1 regularization tends to shrink some coefficients exactly to zero, effectively excluding the corresponding features from the model. It is often used when there is a belief that some features are irrelevant. The penalty term is the sum of the absolute values of the regression coefficients.

#### 12. 2 - L2 Regularization (Ridge) : 

L2 regularization tends to shrink coefficients toward zero without eliminating them entirely. It is effective in dealing with multicollinearity (high correlation between predictors) and preventing overfitting. The penalty term is the sum of the squared values of the regression coefficients.


#### 12. 3 - Elastic Net: 

Elastic Net combines both L1 and L2 penalties in the objective function. It has two control parameters, alpha (which controls the overall strength of regularization) and the mixing parameter, which determines the ratio between L1 and L2 penalties. It is useful when there are many correlated features, and it provides a balance between Lasso and Ridge.


These regularization techniques help improve the generalization performance of machine learning models by preventing them from becoming too complex and fitting noise in the training data. The choice between L1, L2, or Elastic Net depends on the specific characteristics of the dataset and the modeling goals.
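
The difference between the three penalties shows up in the fitted coefficients. This sketch assumes `scikit-learn`, where `alpha` sets the overall strength and `l1_ratio` is the mixing parameter of Elastic Net. The synthetic data (only the first two of ten features influence the target) and the penalty strengths are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic regression data (assumption): only features 0 and 1 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)                    # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

for name, model in [("Lasso", lasso), ("Ridge", ridge), ("ElasticNet", enet)]:
    print(name, "non-zero coefficients:", np.count_nonzero(model.coef_))
```

Lasso sets some coefficients exactly to zero, whereas Ridge shrinks every coefficient but keeps all of them in the model.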


### 13- What are Model Validation Techniques ?

Validation techniques in machine learning are essential for assessing the performance of a model and ensuring its ability to generalize well to unseen data. 

Here are some common validation techniques:
- Train-Test Split 
- K-Fold Cross-Validation 
- Stratified K-Fold Cross-Validation
- Leave-One-Out Cross-Validation (LOOCV)
- Holdout Validation
- Time Series Cross-Validation

#### 13. 1 - What is train-test-validation split?
- It is an important step to indicate how well a model will perform with real-world, unseen data.
- A good train-test-validation split helps mitigate overfitting and ensures that evaluation is not biased.
- It consists of dividing the input dataset into three subsets:
    - Training: 70-80% of the data
    - Validation: 10-15% of the data
    - Testing: 10-15% of the data
- This split aims to ensure that the model is trained on a sufficiently large dataset, validated on a separate set to fine-tune parameters, and tested on a completely independent set to provide an unbiased evaluation of its performance.
#### 13. 2 - What is K-Fold Cross-Validation?
- It is a technique used to assess the performance and generalization ability of a model. 
- The input dataset is divided into K equally sized folds/groups.
- (K-1) folds are used for training and the remaining fold is used for testing. Then, we evaluate the model.
- The training and evaluation are repeated K times.
- Each time, a different fold is taken as the test set while the remaining data is used for training.
- Here are the steps of the process :
    - Data Splitting
    - Model Training and Evaluation : iteration
    - Performance Metrics : error, accuracy, recall, precision, etc. are evaluated for each iteration.
    - Average Performance : the performance (error) is averaged across all K iterations, which provides a more reliable estimate of the model's performance.
- Error formula : $e(n)={y(n)-\hat y(n)}$ is calculated for each test sample $n$ in each iteration, where $y(n)$ is the true value and $\hat y(n)$ is the predicted value.
- In practice, K is usually 5 or 10. The optimal value may depend on the size and nature of the dataset.
- A higher K value can result in a more reliable performance estimate but increases computational costs.
- K-fold limits issues related to the variability of a single train-test split: it provides a more robust evaluation of a model's performance by ensuring that every data point is used for testing exactly once.
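
The procedure can be sketched with `scikit-learn` as follows. The synthetic dataset, the logistic regression model and K = 5 are illustrative choices; `test_counts` is a helper introduced here to check that every sample is tested exactly once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic binary classification data (assumption).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
test_counts = np.zeros(len(X), dtype=int)

for train_idx, test_idx in kf.split(X):
    # Train on K-1 folds, evaluate on the remaining fold.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    test_counts[test_idx] += 1

print("accuracy per fold:", scores)
print("mean accuracy:", np.mean(scores))
```

The same loop is available in one call as `cross_val_score(model, X, y, cv=kf)`.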

#### 13. 3 - What is Stratified K-Fold Cross-Validation? 
- It is an extension of K-Fold Cross-Validation that ensures the distribution of the target variable's classes is approximately the same in each fold as it is in the entire dataset.
- In case of imbalanced datasets, this technique is preferred because some classes may be underrepresented.
- It helps address issues related to overrepresented or underrepresented classes in specific folds, which could lead to biased model evaluations.
- Here are the main steps for Stratified K-Fold Cross-Validation :
    - Data Splitting : ensuring that each fold keeps approximately the same class proportions as the entire dataset.
    - Model Training and Evaluation: the same steps as K-fold cross-validation, repeated K times.
    - Average Performance : the average performance is calculated at the end of all K iterations to provide a robust performance estimate.
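
The effect of stratification is easiest to see on an imbalanced target. In this sketch (assuming `scikit-learn`), the made-up labels contain 90 negatives followed by 10 positives, and we count the positives that land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced labels (assumption): 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the features play no role in how the folds are built

# Number of positive samples in each of the 5 test folds.
plain = [int(y[test].sum()) for _, test in KFold(n_splits=5).split(X)]
strat = [int(y[test].sum()) for _, test in StratifiedKFold(n_splits=5).split(X, y)]

print("positives per test fold, KFold:          ", plain)
print("positives per test fold, StratifiedKFold:", strat)
```

Without shuffling, plain K-fold places all the positives in the last fold, so four of the five models are evaluated on negatives only. Stratification gives every fold the same 10% share of positives.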
#### 13. 4 - What is Leave-One-Out Cross-Validation (LOOCV)?
- It is a specific case of k-fold cross-validation where the number of folds (K) is set equal to the number of data points in the dataset. 
- In each iteration, one data point is used for testing while the remaining samples are used for training.
- As in k-fold, we calculate the performance metric for each iteration, then we compute the average.
- The process is repeated until each data point has been used as a test set exactly once.
- It has the following advantages:
    - It minimizes the bias introduced by the choice of a specific train-test split.
    - It provides a robust estimate of a model's performance since each data point serves as both training and test data.
- It has the following disadvantages:
    - It is computationally expensive, especially for large datasets, as one model is trained per sample.
    - For this reason, it is best suited to small datasets.
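
A minimal LOOCV sketch, assuming `scikit-learn`; the 30-sample synthetic dataset and the logistic regression model are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (assumption): LOOCV trains one model per sample.
X, y = make_classification(n_samples=30, n_features=4, random_state=0)

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

print("number of models trained:", loo.get_n_splits(X))
print("mean accuracy:", scores.mean())
```

Each individual score is either 0 or 1 because every test set contains a single sample; only the average is informative.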
#### 13. 5 - What is Holdout Validation ?
- It is also known as a train-test split.
- The input dataset will be divided into two subsets: a training set (70-80%) and a testing set (20-30%).
- The exact split ratio depends on factors such as the size of the dataset and the nature of the machine learning task.
- The testing set is also called the holdout set, and it provides an initial estimate of a model's performance.
- The performance metrics are accuracy, precision, recall, error, etc.
- This technique is suitable if the input dataset is large enough to provide sufficient data for both training and testing, and when computational resources are too limited for more computationally intensive methods like cross-validation.
- This technique can be unreliable because the measured performance depends on the specific random split of the data into training and testing sets.
- To address this variability, multiple iterations of the holdout process can be performed, and the results can be averaged.

### 14- How to evaluate a Classification model?

- Many metrics are commonly used to evaluate the performance of classification models in machine learning.
- The choice of metrics depends on the specific goals and characteristics of the classification problem.
- Here are some classification metrics:
    - Confusion matrix
    - Accuracy
    - Precision
    - F1 Score
    - Recall (Sensitivity or True Positive Rate)
    - Specificity or True Negative Rate
    - Area Under the Receiver Operating Characteristic (ROC) Curve (AUC-ROC)
    - Area Under the Precision-Recall Curve (AUC-PR) 
- The choice of metrics depends on the specific requirements of the classification problem (binary classification or multiclass classification).
- For example, in imbalanced datasets, where one class has significantly more samples than the other, precision, recall, and F1 score are often more informative than accuracy.

#### 14. 1- What is confusion matrix in classification problems?

- A confusion matrix is a table used to measure the performance of a classification model.
- It gives more details regarding the number of instances that were correctly or incorrectly classified for each class.
- The confusion matrix is a valuable tool for assessing the strengths and weaknesses of a classification model and guiding further optimization efforts.
- Here is an example of confusion matrix for a binary classification problem : 
![title](images/confusion-matrix1.jpeg)
##### a. True Positive : 
- samples that are from the positive class and were correctly classified or predicted as positive by the model.
##### b. True Negative :  
- samples that are from the negative class and were correctly classified or predicted as negative by the model.
##### c. False Positive : 
- samples that are from the negative class but were incorrectly classified or predicted as positive by the model.
##### d. False Negative : 
- samples that are from the positive class but were incorrectly classified or predicted as negative by the model.
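
The four counts can be read directly from `confusion_matrix()` in `scikit-learn`. The labels below are made up. Note that `scikit-learn` orders rows by true class and columns by predicted class, which may differ from the layout of the figure above.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels (assumption): 1 is the positive class, 0 the negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For the labels [0, 1], the matrix is laid out as:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = (int(v) for v in cm.ravel())

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```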

#### 14. 2- How to define Accuracy?

- An evaluation metric used to evaluate the performance of classification model.

- Divides the number of correctly classified observations by the total number of samples.

- **Formula:** $$\text{Accuracy} ={ \text{Number of correct predictions} \over \text{Total number of predictions} }$$


- Here is a second formula, expressed with the confusion matrix counts: $$\text{Accuracy} ={ TP + TN \over TP + TN + FP + FN }$$

#### 14. 3- How to define Precision ?
- An evaluation metric that measures the accuracy of the positive predictions made by the model. 
- It divides the number of true positive predictions by the sum of true positives and false positives.
- It belongs to [0,1] interval, 0 corresponds to no precision and 1 corresponds to perfect precision.
- Precision is also called Positive Predictive Value (PPV).
- **Formula:** $$\text{Precision} = {\text{True Positives} \over \text{True Positives} + \text{False Positives}}$$ 

#### 14. 4- How to define Recall, Sensitivity or True Positive Rate?
- An evaluation metric that measures the ability of the model to capture all the positive samples.
- It divides the number of true positive samples by the sum of true positives and false negatives.
- Recall = Sensitivity = True Positive Rate. 
- **Formula:** $$\text{Recall}= {\text{True Positives} \over \text{True Positives} + \text{False Negatives}}$$
#### 14. 5- How to define F1-score? 
- An evaluation metric that combines both Precision and Recall.
- It is the harmonic mean of Precision and Recall.
- It can be calculated using the `f1_score()` function of `scikit-learn`.
- F1 belongs to [0,1]: 0 is the worst case and 1 is the best.
- **Formula :** $$F1= {2 \times \text{Precision} \times \text{Recall} \over \text{Precision}+\text{Recall}}$$

#### 14. 5- How to define Specificity or True Negative Rate ?
- Specificity measures the ability of the model to correctly identify negative instances.
- It divides the number of true negatives by the sum of true negatives and false positives.
- True Negative Rate = Specificity
- **Formula:** $$\text{Specificity} = {\text{True Negatives} \over \text{True Negatives} + \text{False Positives}}$$ 
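
The formulas for accuracy, precision, recall, F1 and specificity can be checked by hand from the four confusion-matrix counts. The sketch below is only illustrative: the function name `classification_metrics`, the example counts, and the choice to return 0.0 when a denominator is zero are assumptions, not something defined in this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Assumption: return 0.0 when a denominator is zero (nothing to measure).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for a binary classifier evaluated on 100 samples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, precision (40/45) is higher than recall (40/50), and F1 falls between the two.
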
#### 14. 6- What is Receiver Operating Characteristic (ROC) and Area under-ROC curve (AUC-ROC)?
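- ROC curve is a graphical representation of the model's performance across different classification thresholds: it plots the True Positive Rate (y-axis) against the False Positive Rate (x-axis).
- The shape of the curve is informative: the closer it bends toward the top-left corner, the better the model separates the classes, while the diagonal corresponds to random guessing.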
- Area under the ROC curve : AUC-ROC provides a single metric indicating the model's ability to distinguish between classes.
- Here is ROC and AUC-ROC illustration:

![title](images/roc-curve-original.png)

- A higher AUC-ROC indicates a better model: 1.0 corresponds to a perfect separation of the classes and 0.5 to random guessing.
- Smaller values on the x-axis of the curve indicate fewer false positives and more true negatives.
- Larger values on the y-axis of the plot indicate more true positives and fewer false negatives.
- We can compute the points of the ROC curve using the `roc_curve()` scikit-learn function.
- To calculate the AUC-ROC, we use the `roc_auc_score()` function of `scikit-learn`.
- Note: False Positive Rate = 1 - Specificity



*source: https://sefiks.com/2020/12/10/a-gentle-introduction-to-roc-curve-and-auc/
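
To make the ROC ideas concrete, here is a minimal pure-Python sketch. It computes one ROC point (FPR, TPR) for a given threshold, and the AUC through its ranking interpretation: the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, with ties counted as one half. The function names and the toy labels and scores are assumptions made for illustration; in practice `roc_curve()` and `roc_auc_score()` from `scikit-learn` would be used.

```python
def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)


def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_point(y_true, y_score, threshold=0.5)
auc_value = roc_auc(y_true, y_score)
print(fpr, tpr)    # 0.0 0.5
print(auc_value)   # 0.75
```

The pairwise loop is quadratic in the number of samples, so it is only meant for understanding, not for large datasets.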

#### 14. 7- What is Area Under the Precision-Recall Curve (AUC-PR)?
- Similar to AUC-ROC, AUC-PR represents the area under the precision-recall curve.
- It provides a summary measure of a model's performance across various levels of precision and recall.
- The points of the precision-recall curve can be computed using the `precision_recall_curve()` function of `scikit-learn`.
- The area under the precision-recall curve can be calculated using the `auc()` function of `scikit-learn` taking the recall and precision as input.

![title](images/precision_recall_curve.png)

*source: https://analyticsindiamag.com/complete-guide-to-understanding-precision-and-recall-curves/

- As with AUC-ROC, a higher AUC-PR indicates a better model and a lower value indicates poor model performance.
- The recall is provided as the x-axis and precision is provided as the y-axis.
#### a. When to Use ROC vs. Precision-Recall Curves?
- Choosing between ROC curves and precision-recall curves depends on your data distribution:
    - ROC curves: preferable when there are roughly equal numbers of observations for each class.
    - With a large class imbalance, ROC curves can give an overly optimistic picture of the model, because the false positive rate stays low when true negatives dominate.
    - Precision-Recall curves should be used when there is a moderate to large class imbalance.

#### 14. 8 - Classification Report Scikit-learn? 
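- The `classification_report` function of `scikit-learn` provides a detailed summary of classification metrics for each class in a classification problem. 
- The report contains the following metrics for each class:
    - Precision
    - Recall (sensitivity)
    - F1-score
    - Support
- It also reports the overall accuracy, plus macro and weighted averages of the per-class metrics.
- Specificity is not reported directly; in a binary problem it equals the recall of the negative class.
- Support: the number of actual instances of each class in the dataset.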
#### 14. 9- How do we evaluate a classification report?
- High recall + high precision ==> the class is well handled by the model. 
- Low recall + high precision ==> the model cannot detect the class well but is highly trustworthy when it does.
- High recall + low precision ==> the class is well detected but the model also includes points of other classes in it. 
- Low recall + low precision ==> the class is poorly handled by the model.
#### 14. 10- What is the log loss function?
- It is an evaluation metric for classifiers that output probabilities, such as logistic regression.
- It is also called logistic regression loss or cross-entropy loss.
- The input of this loss function is a probability value that belongs to [0,1].
- It measures the uncertainty of our prediction based on how much it varies from the actual label: confident but wrong predictions are penalized the most.
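
As a rough illustration, binary log loss can be computed by hand as the negative mean of $y \log(p) + (1-y)\log(1-p)$. The sketch below assumes labels in {0, 1} and predicted probabilities for the positive class; the name `binary_log_loss` and the clipping constant `eps` are illustrative choices. For real use, `scikit-learn` provides a `log_loss()` function.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy between labels (0/1) and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip probabilities away from 0 and 1 so that log() stays finite.
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical predictions: confident and right vs. confident and wrong.
good = binary_log_loss([1, 0], [0.9, 0.1])
bad = binary_log_loss([1, 0], [0.1, 0.9])
print(f"good predictions: {good:.4f}")
print(f"bad predictions:  {bad:.4f}")
```

The second call yields a much larger loss, which reflects how heavily confident mistakes are penalized.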

### 15- What are the performance metrics for Regression? 
- Several performance metrics are commonly used to evaluate the accuracy and goodness of fit of regression models.
- Here are some common performance metrics for regression:
    - **Mean Absolute Error (MAE)**
    - **Mean Squared Error (MSE)**
    - **Root Mean Squared Error (RMSE)**
    - **Mean Absolute Percentage Error (MAPE)**
    - **R-squared (R2)**
- The choice of metric depends on the goals and characteristics of the regression problem to solve.
- Each metric captures a different aspect of the fit: the typical size of the errors (MAE, RMSE), the relative error (MAPE), or the ability to explain variance (R2).
- Considering multiple metrics gives a more comprehensive understanding of the model performance.
- Almost all regression tasks use an error measure to evaluate the model: if the error is high ==> we need either to change the model or retrain it with more data.

#### 15. 1- What is Mean Absolute Error (MAE) ? 

- As its name indicates, it represents the average absolute difference between the predicted values and the actual values.
- **Formula :** $$MAE = {1\over n} {\sum \limits _{i=1} ^{n}|y_{i}-\hat{y}_{i}|}$$

#### 15. 2- What is Mean Squared Error (MSE) ?
- It represents the average squared difference between the predicted values and the actual values.
- It penalizes larger errors more heavily than MAE.
- **Formula:** $$MSE = {1\over n} {\sum \limits _{i=1} ^{n}(y_{i}-\hat{y}_{i})^2}$$ 
#### 15. 3- What is Root Mean Squared Error (RMSE) ? 
- It represents the square root of the MSE
- It provides a measure of the average magnitude of errors in the same units as the target variable.
- **Formula:** $$RMSE = \sqrt{MSE}$$

#### 15. 4- What is Mean Absolute Percentage Error (MAPE) ? 
- It calculates the average percentage difference between the predicted and actual values.
- It is a relative error metric, so it is undefined when an actual value $y_{i}$ is zero.
- **Formula:** $$MAPE={1\over n} {\sum \limits _{i=1} ^{n}({|y_{i}-\hat{y}_{i}| \over |y_{i}|})} \times 100$$
#### 15. 5- What is R-squared (R2)?
- It measures the proportion of the variance in the target variable that is predictable from the independent variables.
- For a linear regression with an intercept fitted by least squares, it equals the squared correlation between the true values and the predicted values.
- **Formula:** $$ R^2= 1 - {MSE \over Var(y) }$$
- $$ R^2= 1- {{\sum \limits _{i=1} ^{n}(y_{i}-\hat{y}_{i})^2} \over {\sum \limits _{i=1} ^{n}(y_{i}-\overline{y})^2}}$$
- $\overline{y}$ is the mean of the target variable.
- It is possible to use **Adjusted R-squared**, a penalized version of R-squared that adjusts for model complexity (the number of predictors).
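
The regression formulas above can be sketched in a few lines of plain Python. The function name `regression_metrics` and the toy values are assumptions made for illustration; `scikit-learn` offers equivalent functions such as `mean_absolute_error()`, `mean_squared_error()` and `r2_score()`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, MAPE (in %) and R-squared."""
    n = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the actual values, so they must all be non-zero.
    mape = 100 * sum(abs(e) / abs(yt) for e, yt in zip(errors, y_true)) / n
    mean_y = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)              # residual sum of squares
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "mape": mape, "r2": r2}


# Hypothetical actual and predicted values.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 380.0]
scores = regression_metrics(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Here the single larger error (20) pulls RMSE above MAE, which shows how squaring penalizes large errors more heavily.
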
#### a. Correlation :
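- It is a measure of the linear relationship between two quantitative variables. 
- It belongs to [-1,1]: -1 is a perfect negative relationship, 0 is no linear relationship and 1 is a perfect positive relationship.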

### 16. How to choose a classifier based on training dataset size?
- If the training set is small ==> simple models with high bias and low variance tend to work better because they are less likely to overfit. Example: Naive Bayes.
- If the training set is large ==> models with low bias and high variance tend to perform better because they can capture complex relationships. Examples: decision trees, k-nearest neighbors.
- Balancing variance and bias is essential for developing models that perform well on both training and unseen data.

#### 16. 1-  What is data bias ?
- It occurs when the data used in the training phase is not representative of the real-world population or phenomenon under study.
- For example: the training data used to create an ML model contains unfair discrepancies or inaccuracies. 
- The information provided by the data does not truly represent the situation.
- Biased data can lead to undesired and often unfair outcomes (discriminatory results) when the model is applied to new data, because the model learns these biases too. 
- Various types of bias exist: selection bias, measurement bias and confirmation bias.
- Addressing data bias is an ongoing challenge in the field of machine learning, and researchers and practitioners are actively working to develop methods and tools to identify, measure, and mitigate bias in models.
- To mitigate data bias in machine learning, it is crucial to follow well-studied steps: collecting diverse and representative data, processing it thoroughly, and regularly checking model predictions to ensure fairness.
- Example: a biased facial recognition model may perform poorly for certain demographic groups.

#### 16. 2-  What is variance? 

- Understanding variance is crucial in assessing the stability and generalization capability of models.
- It refers to the degree of spread or dispersion in a set of values.
- It measures how far the individual data points (observations) are from the mean (average) of the dataset:
    - Higher variance: data points are more spread out from the mean ==> more dispersed distribution.
    - Lower variance:  data points are closer to the mean ==> more concentrated distribution.
- Formula:  $\sigma^2 = { \sum \limits _{i=1} ^{n}(X_{i} - \overline{X})^2 \over {n-1}}$
- The standard deviation ( $\sigma$) is the square root of the variance.
- If the variance of the predictions is:
    - Low: predictions vary little from each other. 
    - High: the model overfits by reading too deeply into the noise ==> good performance on training data but poor performance on testing data.
- Do not forget the bias-variance trade-off.
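
As a quick check of the variance formula, the sketch below computes the sample variance (dividing by $n-1$) by hand and compares it with `statistics.variance` from the Python standard library. The function name `sample_variance` and the data values are made up for illustration.

```python
import math
import statistics


def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Hypothetical observations with mean 5.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
variance = sample_variance(values)
std_dev = math.sqrt(variance)  # standard deviation is the square root

print(f"variance: {variance:.4f}")
print(f"std dev:  {std_dev:.4f}")
print(f"stdlib:   {statistics.variance(values):.4f}")
```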

### 17- What are the performance metrics for Clustering ?
- Evaluating the performance of clustering algorithms is less straightforward compared to supervised learning tasks like classification.
- Clustering is often exploratory, and there may not be explicit labels for assessing correctness.
- Several metrics and methods are commonly used to assess the quality of clustering results. Here are some performance metrics for clustering:
    - Silhouette Score
    - Davies-Bouldin Index
    - Calinski-Harabasz Index (Variance Ratio Criterion)
    - Inertia (Within-Cluster Sum of Squares)
    - Normalized Mutual Information (NMI)
    - Cluster Purity
#### 17.1 How to compare two different clusterings?
- We can use the SSE: Sum of Squared Errors 
- Formula: $SSE={\sum \limits _{k=1} ^{K} \sum \limits _{y_{i} \in C_{k}} ||y_{i}-x_{k}||^2}$
- $y_{i}$ : is the ith vector belonging to cluster $C_{k}$ and $x_{k}$ is the centroid 
- Formula of centroid: $$x_{k}={1\over N_{k}}{\sum \limits _{y_{i} \in C_{k}} y_{i}}$$
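- If SSE is small, the clusters are compact: points lie close to their centroids.
- For the same number of clusters $K$, the clustering with the smallest SSE is the best one. SSE tends to decrease as $K$ grows, so it should not be used alone to compare clusterings with different $K$.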
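
The SSE comparison can be sketched directly from the two formulas above. In this minimal example each clustering is represented as a list of clusters and each cluster as a list of points; that representation, the function names and the four toy points are assumptions made for illustration.

```python
def centroid(points):
    """Mean of the points in one cluster, computed per coordinate."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def sse(clusters):
    """Sum of squared distances between each point and its cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total


# Two hypothetical ways of splitting the same four points into K=2 clusters.
clustering_a = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
clustering_b = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(sse(clustering_a))  # 4.0
print(sse(clustering_b))  # 100.0
```

Both clusterings use the same number of clusters, so the lower SSE of `clustering_a` indicates the more compact grouping.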

### 18 - Hyperparameters tuning or hyperparameter optimization
#### 18. 1- What does hyperparameter mean?
- Hyperparameters are external configuration settings that are not learned from the data but are set before the training process begins.
- These settings influence the learning process and the overall behavior of the model.
- Examples of hyperparameters :
    - Learning rates
    - Regularization parameters
    - Hidden layers number
    - Nodes number
    - Decision tree depth
- The choice of hyperparameters strongly influences the performance of a machine learning model; the process of choosing them is called hyperparameter tuning. 
- It is crucial to find good values in order to achieve the best possible predictive performance.

#### 18. 2- What does hyperparameter tuning mean? 
- It is also called hyperparameter optimization, and it is closely related to model selection.
- It corresponds to finding the best set of hyperparameters for a machine learning model.
- Here are common steps of Hyperparameter tuning :
    - Define a Search Space
    - Choose a Search Method
    - Choose the right Objective Function
    - Search for Optimal Hyperparameters
    - Evaluate Performance
    - Select Best Hyperparameters
    - Final Model Training
- **Define a Search Space :** select the set of hyperparameters to be tuned and define a range of possible values for each.
- **Choose a Search Method:** common options are Grid Search, Random Search, and more advanced techniques like Bayesian optimization.
- **Choose the right Objective Function:** select an objective function that evaluates the performance of the model for a given set of hyperparameters. Examples: accuracy, precision, recall, or any other relevant measure.
- **Select Best Hyperparameters:** it involves training and evaluating the model with various hyperparameter combinations. Then, choose the optimal values.

- Hyperparameter tuning is essential for improving the generalization performance of a machine learning model.
- It helps to avoid overfitting and ensures that the model is well-configured to handle new, unseen data effectively.

#### 18. 3- What is Grid Search? 

- Grid search:
    - Performed using `GridSearchCV` of `scikit-learn`.
    - It consists of performing an exhaustive search over a predefined hyperparameter grid in order to select a model.
    - The data scientist sets up a grid of hyperparameter values; for each combination, a model is trained and its performance is evaluated with cross-validation on the training data ==> to select, at the end, the optimal hyperparameters.
    - It explores the entire search space by following a grid pattern. 
    - The search space is defined by specifying discrete values or ranges for each hyperparameter.
    - It is exhaustive, as it guarantees that every combination is evaluated.
    - However, it is computationally intensive, especially when dealing with a large number of hyperparameters or a broad range of values.
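
The exhaustive nature of grid search can be illustrated with a small pure-Python sketch. Everything here is hypothetical: `grid_search` is not the `GridSearchCV` API, and `toy_score` is a stand-in for what would normally be a cross-validated model score.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in the grid and keep the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


def toy_score(params):
    """Hypothetical score that peaks at learning_rate=0.1 and max_depth=5."""
    return -(params["learning_rate"] - 0.1) ** 2 - (params["max_depth"] - 5) ** 2


param_grid = {"learning_rate": [0.01, 0.1, 1.0], "max_depth": [3, 5, 7]}
best_params, best_score, n_evaluated = grid_search(param_grid, toy_score)
print(best_params)   # {'learning_rate': 0.1, 'max_depth': 5}
print(n_evaluated)   # 9
```

The number of evaluations is the product of the list lengths (3 x 3 here), which is why the cost grows quickly as hyperparameters or candidate values are added.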
    
#### 18. 4- What is Random search?
    
- Random search:
    - Performed using `RandomizedSearchCV` of `scikit-learn`.
    - Defines a search space of hyperparameter values and selects random combinations to train and score the model.

- Method: Random search randomly samples a specified number of hyperparameter combinations from the defined search space.
- Exploration: It explores the hyperparameter space randomly, which can be more efficient in some cases.
- Search Space: The search space is defined similarly to grid search but does not require discretization; it can handle continuous and discrete hyperparameters.
- Computational Efficiency: Random search is often more computationally efficient than grid search because it does not exhaustively evaluate every combination.
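
A minimal sketch of random search, under the same caveats as any toy example: `random_search` is a hypothetical helper (not the `RandomizedSearchCV` API), the samplers in `param_space` are illustrative, and `toy_score` stands in for a cross-validated model score.

```python
import random


def random_search(param_space, score_fn, n_iter, seed=0):
    """Sample n_iter random combinations and keep the best one."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: sampler(rng) for name, sampler in param_space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical score that peaks at learning_rate=0.1 and max_depth=5."""
    return -(params["learning_rate"] - 0.1) ** 2 - (params["max_depth"] - 5) ** 2


# Each hyperparameter is described by a sampling function, so continuous
# ranges need no discretization.
param_space = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
    "max_depth": lambda rng: rng.randint(2, 10),            # integer in [2, 10]
}

best_params, best_score = random_search(param_space, toy_score, n_iter=20)
print(best_params, best_score)
```

The budget is set by `n_iter` rather than by the size of the search space, which is the main practical difference from grid search.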
    
    
#### 18. 5- How to choose between Random Search and Grid Search  ?

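- Comprehensive but computationally intensive: grid search is thorough and guarantees that every combination is evaluated, but it can be computationally intensive, so it is best suited to small search spaces.
- Random search is usually the better choice when the search space is large or contains continuous hyperparameters, because it evaluates only a fixed number of sampled combinations.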

### 19- What is the difference between a parameter and a hyperparameter?
- Each machine learning model has : 
    - Parameters
    - Hyperparameters
- **Model parameters:**
    - They are configuration variables that are internal to the model
    - They are estimated or learned by the model from the data and not set manually
    - They are required by the model to make predictions
    - Examples:
        - $y=mx+c$ : m and c are parameters
        - $y=ax^2+bx+c$ : a, b,c are parameters
        - Support vectors in SVM 
        - Weights in ANN and Linear regression
- **Model hyperparameters:**
    - They are set before training the model ==> hyperparameters tuning
    - They are external to the model 
    - Can be found using (optimal solution):
        - GridSearch 
        - RandomSearch
        - Copy from previous problems
    - Or they can be set manually
    - Examples:
        - Learning rate of NN
        - C and *sigma* in SVM
        - K in KNN
    

### 20- What do interpolation and extrapolation mean?
- **Interpolation :** 
    - It is a mathematical and statistical technique used to estimate values that fall between known, observed, or measured data points.
    - The goal is to predict values within the range of the existing data.
- **Extrapolation :**
    - It estimates values that fall outside the range of the known data points.
    - Extrapolation comes with more uncertainty compared to interpolation, as it relies on the assumption that the underlying pattern persists outside the known range. 
    - Extrapolation can be risky, especially when the data may exhibit behavior that deviates from the observed pattern.
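
The difference can be shown with a straight line through two known points. The helper names and the two data points below are hypothetical; the same formula is used in both cases, and only the position of `x` relative to the known range changes.

```python
def linear_estimate(x, x0, y0, x1, y1):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)


def is_interpolation(x, known_xs):
    """True when x lies inside the range of the known data points."""
    return min(known_xs) <= x <= max(known_xs)


# Hypothetical known points: (1, 10) and (3, 30).
known_xs = [1.0, 3.0]

inside = linear_estimate(2.0, 1.0, 10.0, 3.0, 30.0)
outside = linear_estimate(5.0, 1.0, 10.0, 3.0, 30.0)

print(inside, is_interpolation(2.0, known_xs))    # 20.0 True
print(outside, is_interpolation(5.0, known_xs))   # 50.0 False
```

The estimate at `x = 5` is only trustworthy if the linear pattern really continues beyond the observed range, which the data alone cannot confirm.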

### 21- Correlation matrix versus Covariance matrix ? 
- Correlation:
    - It is a normalized form of covariance.
    - It measures the linear relationship between variables.
    - Correlation values belong to [-1,1]: negative and positive relations.
    - It indicates whether a change in one variable is associated with a change in another.
    - It tells us how strongly two random variables are related to each other.

- Covariance : 
    - Tells us the direction of the linear relationship between two random variables
    - It is used to determine how much two random variables vary together
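
The sketch below shows why correlation is described as a normalized covariance: rescaling one variable changes the covariance but leaves the correlation unchanged. The function names and the toy data are assumptions made for illustration.

```python
import math


def sample_covariance(x, y):
    """Sample covariance: mean product of deviations, divided by n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson_correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)


# Hypothetical data: y is exactly 2 * x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
y_scaled = [100 * v for v in y]  # same variable expressed in other units

print(sample_covariance(x, y), pearson_correlation(x, y))
print(sample_covariance(x, y_scaled), pearson_correlation(x, y_scaled))
```

The covariance grows by a factor of 100 after rescaling, while the correlation stays at 1 (up to floating-point rounding).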

 

### 22- Distributed computing versus parallel computing ? 

- **Parallel computing:**
    - Allows breaking down a large computational task into smaller subtasks that can be executed simultaneously.
    - Subtasks execution is done simultaneously in parallel using multiple processors or cores within a single machine.
    - Characteristics: Shared Memory, Data Sharing, Single Machine (with multiple processors), lower communication overhead due to accessing shared memory directly. 
    - Applications : problems that can be divided into independent subtasks such as image processing, numerical simulations, scientific computing.
- **Distributed computing:**
    - Distributed computing divides a single task between multiple computers (nodes) to achieve a common goal.
    - Each computer used in distributed computing has its own processor.
    - Machines are often connected over a network, to work together on a computational task
    - Characteristics: multiple machines, communication over network, data distribution, designed with fault tolerance mechanisms since individual nodes may fail
    - Applications :
        - Large-scale data processing (e.g., big data analytics).
        - Web services, cloud computing, and distributed databases.
        - Solving problems that require the coordination of multiple machines.
        
- The choice between them depends on the nature of the problem, scale requirements, and communication considerations.

### 23- What does multicollinearity mean?
- It is a statistical concept where several independent variables in a model are correlated with each other.
- If the correlation coefficient between two variables is +/- 1 ==> those two variables are "perfectly collinear".

### What does the bias-variance trade-off mean?
- The bias is an error that comes from erroneous assumptions in the learning algorithm.
- High bias ==> underfitting: the algorithm misses the relevant relations between features and target outputs.
- The variance is an error that comes from sensitivity to small fluctuations in the training set. 
- High variance ==> overfitting: the algorithm also learns the noise from the training data. 

![title](images/bias_variance_tradeoff.jpeg)

### What does cardinality mean?
- The number of unique values in a column

### 10- What does Instance-Based Learning mean?
Instance-based learning, also known as instance-based reasoning or memory-based learning, is a type of machine learning approach that makes predictions based on the similarity between new instances and instances in the training dataset. Instead of learning an explicit model during training, instance-based learning stores the entire training dataset and uses it to make predictions for new, unseen instances. K-Nearest Neighbors (KNN) is a classic example. 

It is suited for tasks where the relationships between input features and output labels are not easily captured by a simple model. It can be robust in the presence of noise and is capable of handling complex decision boundaries. However, it may be computationally expensive, especially when dealing with large datasets.

### Why should we create an ML pipeline?
- An ML pipeline is an end-to-end construct that orchestrates the flow of data into, and output from, an ML model (or a set of multiple models).
- It is a way to codify and automate the workflow needed to produce an ML model.
- It consists of multiple sequential steps, from data extraction and preprocessing to model training and deployment.
### What is the difference between Inductive ML and Deductive ML?
- **Inductive Learning:**
    - Generalizes from observed instances to draw a conclusion.
    - Example: explain to a child to stay away from fire and show them a video.
- **Deductive Learning:**
    - Draws the conclusion from direct experience.
    - Example: allow a child to play with fire and learn from the outcome.
### What is the Maximal Information Coefficient (MIC)?
- It is used to identify relationships between pairs of variables.
- It measures the strength of linear and non-linear association between two variables x and y.
- It captures a wide range of associations, both functional and non-functional.
- For functional relationships, its value is roughly equal to the coefficient of determination $R^2$.

### How does logistic regression work?
- It is a classification algorithm used to predict a discrete output.
- Types of outputs: 
    - Binary (2 classes)
    - Multiple (>2 classes)
    - Ordinal (Low, Medium, High)
- It first computes a linear combination of the inputs: $z=mx+b$
- It then uses the sigmoid activation function to map $z$ to a probability.
- Sigmoid function formula: $$S(z)={1\over 1+ e^{-z}}$$
<div>
<img src="images/sigmoid-function.png" width="500"/>
</div>
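
A minimal sketch of how the sigmoid turns the linear output $z=mx+b$ into a probability and then into a class. The coefficients `m` and `b`, the 0.5 threshold and the function names are hypothetical choices for illustration; in a real model the coefficients are learned from data.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid: S(z) = 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form to avoid overflow in exp().
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x, m, b):
    """Probability of the positive class for a single feature x."""
    return sigmoid(m * x + b)


def predict_class(x, m, b, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold."""
    return 1 if predict_proba(x, m, b) >= threshold else 0


# Hypothetical coefficients, not learned from any data.
m, b = 2.0, -4.0
for x in (1.0, 2.0, 3.0):
    print(x, round(predict_proba(x, m, b), 3), predict_class(x, m, b))
```

With these coefficients the decision boundary sits at `x = 2`, where `z = 0` and the probability is exactly 0.5.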

### What is 'naive' in the Naive Bayes classifier?
- The classifier is called 'naive' because it makes assumptions that may or may not turn out to be correct.
- The algorithm assumes the absolute independence of features ==> the presence of one feature of a class is not related to the presence of any other feature.
- Example: a fruit is classified as a cherry because it is red and round, with each feature contributing independently ==> the conclusion can be true or false.
### How to know which ML algorithm to use for your classification problem ?
- There is no fixed rule to choose. However, you can follow these guidelines: 
    - If accuracy is a concern ==> test different algorithms and cross-validate them
    - If the training dataset is small ==> use models that have low variance and high bias
    - If the training dataset is large ==> use models that have high variance and little bias
### How to choose which ML algorithm to use given a dataset?
- There is no master algorithm; it all depends on the situation.
- Answer the following questions: 
    - How much data is available?
    - Is the output continuous or categorical?
    - Is it classification, regression or clustering?
    - Are all output variables labeled, or only some of them?

### What does decision tree mean?
- A decision tree can be used for:
    - Classification
    - Regression 
- The tree is built by repeatedly splitting the dataset into smaller subsets.
- It can handle both categorical and numerical data 

### What does pruning a decision tree mean?
- Pruning is a technique in ML that reduces the size of a decision tree ==> to reduce the complexity of the final classifier. 
- Pruning helps improve the predictive accuracy by reducing overfitting. 
- Pruning can occur in:
    - Top-down fashion 
    - Bottom-up fashion
- Top-down fashion: it traverses the nodes and trims subtrees starting at the root.
- Bottom-up fashion: it begins at the leaf nodes. 
### What are the popular pruning algorithms?
- Reduced error pruning:
    - Starting at the leaves, each node is replaced with its most popular class.
    - If the prediction accuracy is not affected, the change is kept. 
    - It has the advantage of simplicity and speed. 


### What is the bias term ?
- It is the constant $b$ in $y=ax+b$: it lets the model represent patterns that do not pass through the origin.
