# Data Scientist Interview Questions
## Part 3 : ML : modelling and evaluation

This Jupyter notebook serves as a focused resource for individuals gearing up for technical interviews in the fields of machine learning engineering and data science. It specifically delves into questions related to all phases of machine learning model evaluation and deployment. Whether you're a candidate looking to sharpen your interview skills or an interviewer seeking insightful questions, this notebook provides valuable content for honing your understanding of machine learning evaluation and deployment.

## 0- What does Machine Learning mean ? 
Machine Learning (ML) is a subset of artificial intelligence (AI) that focuses on the development of algorithms and statistical models that enable computers to perform tasks without explicit programming. The core idea behind machine learning is to allow machines to learn patterns, make predictions, or optimize decisions based on data.

Key concepts in machine learning include:
- Types of Machine Learning: supervised, unsupervised and semi-supervised
- Types of machine learning problems: classification, regression and clustering
- Split data into Training, validation and testing sets (case of supervised)
- Choosing the right algorithm, which depends on the problem you want to solve
- Model Evaluation
- Hyperparameter Tuning
- Deployment

## 1- What are three stages of building a machine learning model ? 
- Building a machine learning model involves three main stages:
    - **Training phase:** after splitting the data into training, validation and testing sets, the training data is used to fit the model on a labeled dataset. During this phase, the model learns the relationships between the input data and the corresponding target values by adjusting its internal parameters. The goal is a model that predicts accurately on unseen data, not one that merely reproduces the training set.
    - **Validation phase:** once the model is trained, we evaluate it on a separate dataset known as the validation set (typically 10-15% of the data). This dataset is not used to fit the model. The validation stage helps detect overfitting (the model performs well on training data but poorly on new data) and underfitting (the model is too simple or too little trained to capture the underlying patterns in the data).
    - **Testing (Inference) phase:** during this phase, the trained and validated model is applied to an unseen dataset, called the test dataset. This phase evaluates the model's performance and indicates how accurately it is likely to predict in a production environment.
 

## 2- How to split your data while building a machine learning model ?    
- During the model building phase, it is required to split the data into three main sets to evaluate the model's performance and effectiveness. The three sets are: 
    - Training: used to fit the model so that it learns the relationship between inputs and outputs. It typically contains 70-80% of the total dataset.
    - Validation: used to fine-tune the model's hyperparameters and assess its performance during training. It helps detect overfitting and underfitting. It typically contains 10-15% of the total data.
    - Testing: used after the validation phase to evaluate the model's performance on unseen data. It measures how effective the model is likely to be in a production environment. It typically contains 10-15% of the total data.

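- Preprocessing that does not learn anything from the data (e.g., fixing data types, removing duplicates) can be done before splitting. Steps that learn statistics from the data (imputing missing values, encoding categorical features, scaling features, etc.) should be fitted on the training set only and then applied to the validation and testing sets, otherwise information from those sets leaks into training.
- It is important to ensure that each split is representative of the overall distribution of the data to avoid biased results.
- When data is limited, cross-validation is preferable to a single fixed validation set.
- There is no fixed rule for the split proportions: they vary with the size of the dataset and the problem.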
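
As a minimal sketch of a three-way split, the snippet below uses scikit-learn's `train_test_split` twice. The Iris dataset, the 70/15/15 proportions and the `random_state` value are assumptions chosen for illustration. The scaler is fitted on the training set only, so that no statistics from the validation or testing sets leak into training.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (assumption): any feature matrix X and target y would do.
X, y = load_iris(return_X_y=True)

# Step 1: hold out 15% of the data as the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)

# Step 2: split the remainder into training and validation sets.
# 0.15 / 0.85 of the remainder corresponds to about 15% of the full dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, stratify=y_temp, random_state=42
)

# Fit the scaler on the training set only, then reuse it for the other sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
```

The `stratify` argument keeps the class proportions similar in each subset, which is one way to keep the split representative.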

## 3- What are the types of ML algorithms ? 
Machine learning algorithms can be categorized into several types based on their learning styles and the nature of the task they are designed to solve.

Here are some common types of machine learning algorithms:
- **Supervised Learning** 
- **Unsupervised Learning**
- **Semi-Supervised Learning**
- **Deep Learning** 
- **Reinforcement Learning** 
- **Ensemble learning**  
- **Ranking**
- **Recommendation system** 


## 4- What does supervised, unsupervised and semi-supervised mean in ML? 

In machine learning, the terms "supervised learning," "unsupervised learning," and "semi-supervised learning" refer to different approaches based on the type of training data available and the learning task at hand:

- **Supervised Learning :** trains a model on a labeled dataset, where the algorithm learns the relationship between input features and corresponding target labels. Can be used for Regression (continuous output) or Classification (discrete output).
- **Unsupervised Learning :** deals with unlabeled data and aims to find patterns, structures, or relationships within the data. Can be used for Clustering (grouping similar data points together) or Association.
- **Semi-Supervised Learning:** Utilizes a combination of labeled and unlabeled data to improve learning performance, often in situations where obtaining labeled data is challenging and expensive.

### 4.1- What are Unsupervised Learning techniques ?
We have two techniques, clustering and association:
- Clustering : involves grouping similar data points together based on inherent patterns or similarities. Example: grouping customers with similar purchasing behavior for targeted marketing.
- Association : identifying patterns of association between different variables or items. Example: an e-commerce website suggests other items for you to buy based on prior purchases.
 
### 4.2- What are Supervised Learning techniques ? 
We have two techniques, classification and regression:
- Regression : involves predicting a continuous output or numerical value based on input features. Examples : predicting house prices, temperature, stock prices, etc.
- Classification : is the task of assigning predefined labels or categories to input data. We have two types of classification problems:
    - Binary classification (two classes). Example: identifying whether an email is spam or not.
    - Multiclass classification (multiple classes). Example: classifying images of animals into different species.

## 5- Examples of well-known machine learning algorithms used to solve Regression problems

Here are some well-known machine learning algorithms commonly used to solve regression problems:

- Linear Regression
- Decision Trees
- Random Forest
- Gradient Boosting Algorithms (e.g., XGBoost, LightGBM)
- K-Nearest Neighbors (KNN)
- Bayesian Regression
- Lasso Regression
- Ridge Regression
- Neural Networks (Deep Learning)

More details regarding each algorithm can be found in `DS_ML_Interview_Questions_Regression_Analysis`

## 6- Examples of well-known machine learning algorithms used to solve Classification problems

Here are some well-known machine learning algorithms commonly used to solve classification problems:

- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Naive Bayes
- AdaBoost
- Gradient Boosting Machines (GBM)
- XGBoost
- CatBoost
- LightGBM
- Neural Networks (Deep Learning)

More details regarding each algorithm can be found in `DS_ML_Interview_Questions_Classification_Analysis`

## 7- Examples of well-known machine learning algorithms used to solve Clustering problems

Several well-known machine learning algorithms are commonly used for solving clustering problems. Here are some examples:

- K-Means Clustering 
- Hierarchical Clustering
- Agglomerative Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Mean Shift
- Gaussian Mixture Model (GMM)

These algorithms address different types of clustering scenarios and have varying strengths depending on the nature of the data and the desired outcomes. The choice of clustering algorithm often depends on factors such as the shape of clusters, noise in the data, and the number of clusters expected.
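
To illustrate how cluster shape affects the choice of algorithm, here is a small sketch that compares K-Means and DBSCAN on two interleaving half-moons. The synthetic dataset and the parameter values (`eps=0.2`, `min_samples=5`) are assumptions picked for this example, not recommended defaults.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Synthetic, non-convex clusters (assumption: chosen only for illustration).
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means assumes roughly spherical clusters around centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups densely connected points, whatever the cluster shape.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand Index: 1.0 means perfect agreement with the true grouping.
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
print(f"K-Means ARI: {ari_kmeans:.2f}")
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
```

On this kind of data the density-based method should recover the two moons far better than K-Means. On compact, blob-like clusters the ranking can be reversed.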

## 8- How to choose which ML algorithm to use given a dataset?
- Choosing the right machine learning algorithm for a given dataset involves considering various factors related to the nature of the data and the problem at hand.
- There is no master algorithm: it all depends on the situation.
- Here is a step-by-step guide to making the right decision:
    - **Understand the problem or the situation:**
        - Understand the nature of the target variable (is it continuous or categorical? Are all outputs labeled, or only some?).
        - Determine the type of problem we are trying to solve (is it classification, regression or clustering?).
    - **Domain Knowledge:**
        - Look for domain-specific knowledge that might influence the choice of algorithm.
        - Certain algorithms may be well-suited to specific industries or types of problems.
    - **Explore the Data:**
        - Determine data dimension 
        - Perform exploratory data analysis (EDA) to understand the characteristics of the dataset.
        - Understand features distribution, identify patterns and detect outliers etc.
    - **Consider the Size of the Dataset:**
        - Small: simpler models or models with fewer parameters may be more suitable to avoid overfitting.
        - Large: more complex models can be considered.
    - **Check for Linearity:** 
        - Study whether the relationships between the features and the target variable are linear or nonlinear.
        - If linear: linear models are usually a good first choice, as they are simple and effective in this case.
        - If nonlinear: nonlinear models (e.g., decision trees, kernel support vector machines) may be better suited to the more complex relationships.
    - **Data pre-processing :**
        - Handle Categorical Data : some algorithms can handle categorical data natively (e.g., tree-based implementations such as CatBoost), while others require encoding.
        - Dealing with Missing Values: some algorithms can handle missing data, while others may require imputation or removal of missing values.
        - Check for Outliers: some algorithms may be sensitive to outliers, while others are more robust.
    - **Consider the Speed and Scalability:**
        - Take into account the computational requirements of the algorithm.
        - Some algorithms are faster and more scalable than others, making them suitable for large datasets.
    - **Evaluate Model Complexity:**
        - Simple models like linear regression are interpretable but may not capture complex patterns. 
        - More complex models like ensemble methods (e.g., random forests, gradient boosting) can capture intricate relationships but may be prone to overfitting.
    - **Validation and Cross-Validation:**
        - Use validation techniques, such as cross-validation, to assess the performance of different algorithms.
        - This helps you choose the one that generalizes well to new, unseen data.
    - **Experiment and Iterate:**
        - It's often beneficial to experiment with multiple algorithms and compare their performance.
        - Iterate on the choice of algorithm based on performance metrics and insights gained during the modeling process.

**Note:**
- There is no one-size-fits-all solution, and the choice of the algorithm may involve some trial and error.
- Additionally, ensemble methods, which combine multiple models, can sometimes provide robust solutions.
- Keep in mind that the performance of an algorithm depends on the specific characteristics of your dataset and the goals of your analysis.
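
The "experiment and iterate" advice above can be sketched as a small comparison loop. The dataset, the three candidate models and their settings are assumptions for illustration. In practice the candidates and the scoring metric would follow from the problem at hand.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative binary classification dataset (assumption).
X, y = load_breast_cancer(return_X_y=True)

# Candidate models; scaling is wrapped in a pipeline so it is refit on each training fold.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Evaluate every candidate with the same 5-fold cross-validation.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean and the spread of the scores gives a first, rough basis for shortlisting an algorithm before any tuning.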

## 9- What is Ensemble learning?

Ensemble learning is a machine learning technique that involves combining the predictions of multiple individual models to improve overall performance and accuracy. Instead of relying on a single model, ensemble methods leverage the strengths of diverse models to compensate for each other's weaknesses. The idea is that by aggregating the predictions of multiple models, the ensemble can achieve better generalization and make more robust predictions than any individual model.

There are several ensemble learning methods, with two primary types being:
- **Bagging (Bootstrap Aggregating) :** 
    - Involves training multiple instances of the same model on different subsets of the training data, typically sampled with replacement. 
    - Examples : Random Forest, Bagged Decision Trees, Bagged SVM (Support Vector Machines), Bagged K-Nearest Neighbors, Bagged Neural Networks
- **Boosting :**
    - Focuses on sequentially training models, with each subsequent model giving more attention to the instances that the previous models misclassified. 
    - Examples: AdaBoost (Adaptive Boosting), Gradient Boosting, XGBoost (Extreme Gradient Boosting), LightGBM (Light Gradient Boosting Machine), CatBoost, GBM (Gradient Boosting Machine)
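
As a rough sketch of the two families, the snippet below compares a single decision tree with a bagged ensemble of trees and with AdaBoost, using scikit-learn. The dataset and the number of estimators are assumptions for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset (assumption).
X, y = load_breast_cancer(return_X_y=True)

models = {
    # Baseline: one fully grown decision tree.
    "single_tree": DecisionTreeClassifier(random_state=0),
    # Bagging: 50 trees, each trained on a bootstrap sample, predictions aggregated.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
    ),
    # Boosting: 50 weak learners trained sequentially, reweighting hard samples.
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=0),
}

ensemble_scores = {}
for name, model in models.items():
    ensemble_scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {ensemble_scores[name]:.3f}")
```

Ensembles often, though not always, score higher than the single tree. The exact gap depends on the data and the settings.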


## 10- What is Reinforcement Learning?

Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties based on its actions, allowing it to learn optimal strategies over time to maximize cumulative rewards. It is inspired by the way humans and animals learn from trial and error.

Here are some applications of Reinforcement Learning : 
- Automated Robots
- Natural Language Processing
- Marketing and Advertising 
- Image Processing
- Recommendation Systems
- Traffic Control 
- Healthcare 
- Etc.

## 11- What are Recommender Systems ?
Recommender systems, also known as recommendation systems or engines, are applications or algorithms designed to suggest items or content to users based on their preferences and behavior. These systems leverage data about users and items to make personalized recommendations, aiming to enhance user experience and satisfaction. There are two main types of recommender systems:

- Content-Based Recommender Systems
- Collaborative Filtering Recommender Systems

Recommender systems are widely used in various industries, including e-commerce, streaming services, social media, and more. They help users discover new items, increase user engagement, and contribute to business success by promoting relevant content and products.

### 11. 1- What are Content-Based Recommender Systems ? 
### 11. 2- What are Collaborative Filtering Recommender Systems ?

## 12- What is Ranking ? 

Ranking in machine learning refers to the process of assigning a meaningful order or ranking to a set of items based on their relevance or importance. This is often used in scenarios where the goal is to prioritize or sort items based on their predicted or observed characteristics.

Ranking problems are common in various applications, including information retrieval, recommendation systems, and search engines.

## 13- What is Overfitting, causes and mitigation? 

- Overfitting is a common challenge in machine learning that relates to the performance of a model on unseen data.
- It occurs when a model learns the training data too well, capturing noise and random fluctuations in addition to the underlying patterns.
- It shows up as a low error on the training dataset but a high error on the testing dataset.

### 13. 1-Key characteristics of Overfitting :

- Excellent Performance on Training Data.
- Poor Generalization to New Data
- Low Bias, High Variance: the model is not biased toward any particular assumption, but its predictions are highly sensitive to variations in the training data.
- Complex Model: overfit models are often overly complex and may capture noise or outliers as if they were significant patterns.
- Memorization of Training Data: instead of learning the underlying patterns, an overfit model may memorize specific details of the training data, including noise and outliers.

### 13. 2- Causes of Overfitting : 
- Too many features or parameters.
- Model is too complex for the available data.
- Training on a small dataset or training for too many iterations

### 13. 3- Overfitting mitigation :
- Regularization techniques (e.g., L1 or L2 regularization).
- Feature selection or dimensionality reduction.
- Increasing the amount of training data.
- Using simpler model architectures with fewer variables and parameters, so that variance is reduced.
- Using cross-validation to detect overfitting early and to guide model selection.

**Note:**
- It is important to find balance between model complexity and the ability to generalize to new, unseen data.

## 14- What is Underfitting, causes and mitigation? 
- It is the case when the model is too simple to capture the underlying patterns in the training data.
- As a result, the model performs poorly not only on new, unseen data but also on the training data itself.
- High error rate on both training and testing datasets.
- It occurs when the model lacks the complexity or flexibility to represent the underlying relationships between the features and the target variable.

### 14. 1-Key characteristics of underfitting :
- Poor Performance on Training Data
- Poor Generalization to New Data
- High Bias, Low Variance : The model is biased toward making overly simple assumptions about the data.
- Inability to Capture Patterns
- Simplistic Model: underfit models are often too simplistic and may not capture the nuances or complexities present in the data.

### 14. 2- Causes of Underfitting: 
- Too few features or parameters: inadequate feature representation.
- Insufficient model complexity: using a model that is too basic for the complexity of the data.
- Inadequate training time or data.


### 14. 3- Underfitting mitigation: 
- Increasing the complexity of the model
- Adding relevant features.
- Training for a longer duration.
- Considering more sophisticated model architectures.

**Note:**
- Increasing model complexity excessively may lead to overfitting. 
- Achieving a balance between overfitting and underfitting is crucial.
- This balance, often referred to as the model's "sweet spot", results in a model that generalizes well to new, unseen data. 
- Techniques like cross-validation, hyperparameter tuning, and monitoring learning curves can help strike this balance during the model development process.
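
The contrast between underfitting and overfitting can be sketched by fitting polynomials of increasing degree to noisy data. The synthetic sine data, the noise level and the degrees (1, 4, 15) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Model complexity is controlled by the polynomial degree.
errors = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The degree-1 model has high error on both sets (underfitting). The training error keeps falling as the degree grows, while the test error of a very high degree typically rises again (overfitting). The intermediate degree is closer to the sweet spot.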

## 15- What are the types of Regularization in Machine Learning ?

- Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the objective/loss function.
- It consists of adding a cost term that penalizes large model weights.
- Two types of regularization are most commonly used: L1 regularization (Lasso) and L2 regularization (Ridge).
- Additionally, Elastic Net combines both L1 and L2 regularization.

The main regularization techniques are:
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
- Elastic Net

### 15. 1 - L1 Regularization (Lasso) : 
L1 regularization tends to shrink some coefficients exactly to zero, effectively excluding the corresponding features from the model. It is often used when there is a belief that some features are irrelevant. The penalty term is the sum of the absolute values of the regression coefficients.

### 15. 2 - L2 Regularization (Ridge) : 

L2 regularization tends to shrink coefficients toward zero without eliminating them entirely. It is effective in dealing with multicollinearity (high correlation between predictors) and preventing overfitting. The penalty term is the sum of the squared values of the regression coefficients.


### 15. 3 - Elastic Net: 

Elastic Net combines both L1 and L2 penalties in the objective function. It has two control parameters, alpha (which controls the overall strength of regularization) and the mixing parameter, which determines the ratio between L1 and L2 penalties. It is useful when there are many correlated features, and it provides a balance between Lasso and Ridge.

**Note:**
- These regularization techniques help improve the generalization performance of machine learning models by preventing them from becoming too complex and fitting noise in the training data.
- The choice between L1, L2, or Elastic Net depends on the specific characteristics of the dataset and the modeling goals.
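
A short sketch of the three penalties with scikit-learn is given below. The synthetic dataset and the value `alpha=5.0` are assumptions for illustration. In scikit-learn's `ElasticNet`, `alpha` plays the role of the overall regularization strength and `l1_ratio` is the mixing parameter between the L1 and L2 penalties.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic regression data (assumption): only 3 of the 10 features are informative.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Illustrative regularization strength; in practice it is tuned by cross-validation.
lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)
enet = ElasticNet(alpha=5.0, l1_ratio=0.5).fit(X, y)

# Count how many coefficients each model sets exactly to zero.
zero_counts = {}
for name, model in (("Lasso", lasso), ("Ridge", ridge), ("ElasticNet", enet)):
    zero_counts[name] = int(np.sum(model.coef_ == 0))
    print(f"{name}: {zero_counts[name]} of {model.coef_.size} coefficients are exactly zero")
```

Lasso should drop uninformative features by zeroing their coefficients, while Ridge only shrinks them. This matches the feature-selection behavior described for L1 regularization.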


## 16- What is Model Validation Technique?

Validation techniques in machine learning are essential for assessing the performance of a model and ensuring its ability to generalize well to unseen data. 

Here are some common validation techniques:
- Train-Test Split 
- K-Fold Cross-Validation 
- Stratified K-Fold Cross-Validation
- Leave-One-Out Cross-Validation (LOOCV)
- Holdout Validation
- Time Series Cross-Validation

### 16. 1 - What is train-test-validation split?
- It is an important step to indicate how well a model will perform with real-world, unseen data.
- A good train-test-validation split helps mitigate overfitting and ensures that evaluation is not biased.
- It consists of dividing the input dataset into three subsets:
    - Training: 70-80% of the data
    - Validation: 10-15% of the data
    - Testing: 10-15% of the data
- This split aims to ensure that the model is trained on a sufficiently large dataset, validated on a separate set to fine-tune parameters, and tested on a completely independent set to provide an unbiased evaluation of its performance.

### 16. 2 - What is K-Fold Cross-Validation?
- It is a technique used to assess the performance and generalization ability of a model. 
- The input dataset will be divided into k equally sized folds/groups.
- (K-1) folds are used for training and one fold is used for testing. Then, we evaluate the model. 
- Repeating the training and evaluation K times.
- Each time a different fold is taken as the test set while the remaining data is used for training.
- Here are the steps of the process :
    - Data Splitting
    - Model Training and Evaluation : iteration
    - Performance Metrics : error, accuracy, recall, precision, etc. are computed for each iteration.
    - Average Performance : the metric is averaged across all K iterations, which provides a more reliable estimate of the model's performance.
- Error formula : $e(n)={y(n)-\hat y(n)}$ is calculated in each iteration, where $y$ is the true value and $\hat y$ is the predicted value.
- K is commonly set to 5 or 10. The best value depends on the size and nature of the dataset.
- A higher K value can result in a more reliable performance estimate but increases the computational cost.
- K-fold limits the issues related to the variability of a single train-test split. It provides a more robust evaluation of a model's performance by ensuring that every data point is used for testing exactly once.
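
The K-fold procedure can be sketched as an explicit loop with scikit-learn's `KFold`. The diabetes dataset, the linear model and `K=5` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Illustrative regression dataset (assumption).
X, y = load_diabetes(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_errors = []
test_counts = np.zeros(len(X), dtype=int)  # how often each sample is used for testing

for train_idx, test_idx in kfold.split(X):
    # Train on K-1 folds, evaluate on the remaining fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_errors.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    test_counts[test_idx] += 1

print("MSE per fold:", np.round(fold_errors, 1))
print("Mean MSE across folds:", round(float(np.mean(fold_errors)), 1))
```

The `test_counts` array confirms that each sample lands in a test fold exactly once. For routine use, `cross_val_score` wraps the same loop in a single call.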

### 16. 3 - What is Stratified K-Fold Cross-Validation? 
- It is an extension of K-Fold Cross-Validation that ensures the distribution of the target variable's classes is approximately the same in each fold as it is in the entire dataset.
- For imbalanced datasets, this technique is preferred because some classes may be underrepresented.
- It helps address issues related to overrepresented or underrepresented classes in specific folds, which could lead to biased model evaluations.
- Here are the main steps for Stratified K-Fold Cross-Validation :
    - Data Splitting : each fold keeps approximately the same class proportions as the full dataset.
    - Model Training and Evaluation: the same K-fold cross-validation, steps repeated K times
    - Average Performance : the average performance is calculated at the end of all K iterations to provide a robust performance estimate.
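
To make the effect of stratification concrete, this sketch compares the share of positive labels per test fold for plain and stratified K-fold. The toy label vector (90 negatives followed by 10 positives) is an assumption built for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels (assumption): 90 negatives followed by 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not matter for the split itself


def positive_rates(splitter):
    """Return the share of positive labels in each test fold."""
    return [float(y[test_idx].mean()) for _, test_idx in splitter.split(X, y)]


plain_rates = positive_rates(KFold(n_splits=5))
stratified_rates = positive_rates(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_rates)
print("StratifiedKFold:", stratified_rates)
```

Without shuffling, plain K-fold puts all the positives into the last fold, so four folds contain no positive sample at all. The stratified variant keeps the 10% positive rate in every fold.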
    
### 16. 4 - What is Leave-One-Out Cross-Validation (LOOCV)?
- It is a specific case of k-fold cross-validation where the number of folds (K) is set equal to the number of data points in the dataset. 
- In each iteration, one data point is used for testing while the remaining samples are used for training.
- As in k-fold, we calculate the performance metric for each iteration, then average the results.
- The process is repeated until each data point has been used as a test set exactly once.
- It has the following advantages:
    - It minimizes the bias introduced by the choice of a specific train-test split.
    - It uses almost all of the data for training in every iteration, and every data point serves as test data exactly once.
- It has the following limitations:
    - It is computationally expensive, especially for large datasets, since one model is trained per sample.
    - It is therefore best suited to small datasets.
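
A minimal LOOCV sketch with scikit-learn's `LeaveOneOut` is shown below. The Iris dataset and the K-Nearest Neighbors classifier are assumptions chosen because they are small and fast to fit.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small illustrative dataset (assumption): LOOCV trains one model per sample.
X, y = load_iris(return_X_y=True)

# Each of the 150 iterations tests on a single held-out sample.
loo_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=LeaveOneOut())

print("Number of fitted models:", len(loo_scores))
print("LOOCV accuracy:", round(float(loo_scores.mean()), 3))
```

Each individual score is either 0 or 1, since a single test sample is either classified correctly or not. Only the average is meaningful.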
### 16. 5 - What is Holdout Validation ?
- It is also known as a train-test split.
- The input dataset is divided into two subsets: a training set (70-80%) and a testing set (20-30%).
- The exact split ratio depends on factors such as the size of the dataset and the nature of the machine learning task.
- The testing set is also called the holdout set, and it provides an initial estimate of a model's performance.
- The performance metrics are accuracy, precision, recall, error, etc.
- This technique is suitable when the input dataset is large enough to provide sufficient data for both training and testing, and when computational resources are too limited for more intensive methods like cross-validation.
- This technique can be unreliable, as the measured performance is influenced by the specific random split of the data into training and testing sets.
- To address this variability, multiple iterations of the holdout process can be performed, and the results can be averaged.

## 17- What does correlation mean in ML?
- It corresponds to the statistical relationship between two variables x and y. 
- It measures the degree to which changes in one variable correspond to changes in another variable.
- It does not imply causation, it only quantifies the strength and direction of the linear relationship between variables.
- The Pearson correlation coefficient r, is the most used correlation measure and it ranges from -1 to 1: 
    - If **r > 0: Positive Correlation**, if one variable increases, the other variable tends to increase as well.
    - If **r < 0: Negative Correlation**, if one variable increases, the other variable tends to decrease.
    - If **r = 0: Zero Correlation**, indicates no linear relationship between the variables.
- Here are some key points about correlation: 
    - **Relationship Strength:** 
        - The magnitude of the correlation coefficient (closer to 1 or -1) indicates the strength of the relationship.
        - A value of 1 or -1 suggests a perfect linear relationship.
    - **Correlation Matrix:** 
        - Displays the correlations between multiple variables, providing insights into the relationships among them.
        - Here is an example of correlation matrix: 
        <div>
        <img src="images/corr_matrix_CHSI.png" width="400"/>
        </div>
    - **Feature Selection:**
        - Correlation analysis can be used in feature selection to identify and remove highly correlated features. 
        - Redundant features may not provide additional information and can lead to overfitting.
    - **Model Interpretability:**
        - Understanding the correlation between input features and the target variable can aid in model interpretability. 
        - Features with high correlation to the target may have stronger predictive power.
    
### 17. 1- What are the correlation limitations ? 

- It captures linear relationships but may not detect non-linear associations.
- Additionally, correlation does not imply causation, and spurious correlations may exist.
- For monotonic but non-linear relationships, or when the data is not normally distributed or contains outliers, Spearman's rank correlation coefficient is an alternative measure.




## 18- What does cardinality mean in ML?

## 19- What is Variance? 

- Understanding variance is crucial in assessing the stability and generalization capability of models.
- It refers to the degree of spread or dispersion in a set of values.
- It measures how far the individual data points (observations) deviate from the mean (average) of the dataset:
    - Higher variance: 
        - Data points are more spread out from the mean
        - More dispersed distribution
        - A broader spread
    - Lower variance:
        - Data points are closer to the mean
        - More concentrated distribution.
        
- Formula:  $$\sigma^2 = { \sum \limits _{i=1} ^{n}(X_{i} - \overline{X})^2 \over {n-1}}$$
- Where :
    - $\overline{X}$: the mean
    - $n$: the number of data points
    - $X_{i}$: represents each individual data point.
    
- The standard deviation ($\sigma$) is the square root of the variance.
- Understanding the variance of a model's predictions is essential. If the variance of the predictions is :
    - Low: predictions vary little from each other.
    - High:
        - Can indicate overfitting
        - The model reads too deeply into the noise: good performance on training data but poor performance on testing data

**Note:**  
- Outliers can have a significant impact on the variance, making it sensitive to extreme values.
- Understanding and managing variance is crucial for building models that generalize well to new data and avoiding overfitting.
- Do not forget the bias-variance trade-off.
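
As a small worked example, the sample variance (the sum of squared deviations from the mean, divided by n-1) can be computed by hand and checked against NumPy. The data values and the outlier are made up for illustration.

```python
import numpy as np

# Made-up sample (assumption): 8 observations with mean 5.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Manual computation: squared deviations from the mean, divided by n - 1.
mean = data.mean()
manual_var = np.sum((data - mean) ** 2) / (len(data) - 1)

# NumPy equivalent: ddof=1 selects the n - 1 denominator (sample variance).
sample_var = np.var(data, ddof=1)
sample_std = np.std(data, ddof=1)

# Adding a single extreme value inflates the variance considerably.
with_outlier = np.append(data, 50.0)
outlier_var = np.var(with_outlier, ddof=1)

print(f"manual variance:       {manual_var:.3f}")
print(f"np.var(ddof=1):        {sample_var:.3f}")
print(f"standard deviation:    {sample_std:.3f}")
print(f"variance with outlier: {outlier_var:.3f}")
```

Note that `np.var` divides by n by default. Passing `ddof=1` is needed to match the n-1 formula.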

## 20- What does data bias mean?

- It is when the available data used in the training phase is not representative of the real-world population or phenomenon under study.
- It refers to the presence of systematic errors or inaccuracies in a dataset that can lead to unfair, unrepresentative, or skewed results when analyzing or modeling the data.
- The information provided by the data does not truly represent the situation.
- Biased data can lead to undesired and often unfair outcomes (discriminatory results) when the model is applied to new data, because the model learns these biases too.
- Example: a biased facial recognition model may perform poorly for certain demographic groups.
- Various types of bias exist :
    - Selection bias : when the process of selecting data points for the dataset is not random, leading to a non-representative sample
    - Measurement bias : errors or inconsistencies in how data is measured or recorded. Examples: errors in sensors, discrepancies in data collection methods, or differences in measurement standards.
    - Sampling Bias : if the method used to collect or sample data introduces a bias. 
    - Observer Bias: occurs when the individuals collecting or interpreting the data have subjective opinions or expectations that influence the data
    - Historical Bias : when historical data includes biased decisions or reflects societal biases, machine learning models trained on such data may perpetuate or even exacerbate those biases.
    - Algorithmic Bias : occurs when machine learning algorithms learn and perpetuate biases present in the training data. 

**Note:**
- Addressing data bias is an ongoing challenge in the field of machine learning, and researchers and practitioners are actively working to develop methods and tools to identify, measure, and mitigate bias in models.
    
### 20. 1- How to mitigate Bias in ML?      
- To mitigate bias, it is crucial to follow well-defined steps:
    - Collect diverse and representative data.
    - Implement ethical data collection practices.
    - Examine the data thoroughly to detect and identify biases.
    - Regularly check model predictions to detect and rectify biases and ensure fairness.
    - Develop and use algorithms that are designed to be aware of and mitigate biases : `Fairness-aware Algorithms`

### 20. 2- What is Bias term ?
- The bias term (intercept) is a learnable constant added to the model's output, which lets it represent patterns that do not pass through the origin.
- Example: in y=ax+b, b is the bias term.

## 21- What does Bias-variance trade off mean?
- It is a fundamental concept in machine learning that involves finding the right balance between two sources of error, namely bias and variance, when building predictive models.
- The tradeoff arises because decreasing bias often increases variance, and vice versa. 
- Key points about the bias-variance tradeoff: 
    - **High bias : underfitting:** the model is too simple and fails to capture the relevant relations between features and target outputs. It leads to systematic errors on both the training and test data.
    - **High variance : overfitting:** the model is too complex and fits the training data too closely, capturing noise as if it were a real pattern. It performs poorly on new, unseen data.
- The goal is to find the model complexity that minimizes the total error coming from bias and variance together, resulting in good generalization to new data.

![title](images/bias_variance_tradeoff.jpeg)

### 21. 1- How to find the right balance between variance and bias?
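- Cross-validation techniques, such as k-fold cross-validation, can be used to estimate how well a model generalizes. Comparing training and validation scores reveals high bias (both scores are low) or high variance (a large gap between them) and guides the selection of an appropriate model complexity.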
- Techniques like regularization can be employed to penalize overly complex models, helping to mitigate overfitting and find a better bias-variance tradeoff.
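
A minimal sketch of this idea, assuming `scikit-learn` is available: it uses `validation_curve` with 5-fold cross-validation to compare training and validation scores for decision trees of increasing depth. The synthetic data, the choice of model and the depth values are illustrative assumptions only.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a sine wave plus noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# Model complexity is controlled here by the maximum tree depth
depths = [1, 4, 20]
train_scores, val_scores = validation_curve(
    DecisionTreeRegressor(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)

# Average the R^2 scores over the 5 folds
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for d, tr, va in zip(depths, train_mean, val_mean):
    print(f"max_depth={d:>2}  train R^2={tr:.2f}  validation R^2={va:.2f}")
```

A shallow tree tends to score poorly on both sets (high bias), while a very deep tree fits the training folds almost perfectly but shows a large gap to the validation score (high variance).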

**Note:**
- Balancing bias and variance is a central challenge in machine learning, and understanding this tradeoff is essential for model selection, training, and evaluation. 

### 18 - Hyperparameter tuning or hyperparameter optimization
#### 18. 1- What does hyperparameter mean?
- Hyperparameters are external configuration settings that are not learned from the data but are set before the training process begins.
- These settings influence the learning process and the overall behavior of the model.
- Examples of hyperparameters :
    - Learning rates
    - Regularization parameters
    - Number of hidden layers
    - Number of nodes
    - Decision tree depth
- The choice of hyperparameters can strongly influence the performance of a machine learning model. The process of choosing them is called hyperparameter tuning.
- Finding good values is crucial to achieve the best possible predictive performance.

#### 18. 2- What does hyperparameter tuning mean? 
- It is also called hyperparameter optimization and is a form of model selection.
- It corresponds to finding the best set of hyperparameters for a machine learning model.
- Here are common steps of Hyperparameter tuning :
    - Define a Search Space
    - Choose a Search Method
    - Choose the right Objective Function
    - Search for Optimal Hyperparameters
    - Evaluate Performance
    - Select Best Hyperparameters
    - Final Model Training
- **Define a Search Space :** select the set of hyperparameters to be tuned and define a range of possible values for each.
- **Choose a Search Method:** common options are Grid Search, Random Search, and more advanced techniques like Bayesian optimization.
- **Choose the right Objective Function:** select an objective function that evaluates the performance of the model for a given set of hyperparameters. Examples: accuracy, precision, recall, or any other relevant measure.
- **Select Best Hyperparameters:** train and evaluate the model with various hyperparameter combinations, then choose the values that give the best score.

- Hyperparameter tuning is essential for improving the generalization performance of a machine learning model.
- It helps to avoid overfitting and ensures that the model is well-configured to handle new, unseen data effectively.

#### 18. 3- What is Grid Search? 

- Grid search :
    - Can be performed using `GridSearchCV` of `scikit-learn`.
    - It consists of performing an exhaustive search over a predefined hyperparameter grid.
    - The data scientist sets up a grid of hyperparameter values and, for each combination, trains a model and evaluates its performance on validation data (typically with cross-validation) ==> to select, at the end, the optimal hyperparameters. The test set is kept aside for the final evaluation.
    - It explores the entire search space by following a grid pattern.
    - The search space is defined by specifying discrete values or discretized ranges for each hyperparameter.
    - It is thorough, as it guarantees that every combination is evaluated.
    - However, it is computationally intensive, especially when dealing with a large number of hyperparameters or a broad range of values.
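
The sketch below shows one possible way to run a grid search with `GridSearchCV`. The dataset (iris), the model (`SVC`) and the grid values are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Keep a test set aside; it is not used during the search
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 3 values of C x 3 values of gamma = 9 combinations, all of them evaluated
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Combinations evaluated:", len(search.cv_results_["params"]))
print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
print("Held-out test accuracy:", round(search.score(X_test, y_test), 3))
```

Each combination is scored with 5-fold cross-validation on the training data, and by default the best one is refitted on the whole training set before the final check on the test set.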
    
#### 18. 4- What is Random search?
    
- Random search :
    - Can be performed using `RandomizedSearchCV` of `scikit-learn`.
    - It defines a search space of hyperparameter values and selects random combinations from it to train and score the model.
    - Method: random search randomly samples a specified number of hyperparameter combinations from the defined search space.
    - Exploration: it explores the hyperparameter space randomly, which can be more efficient than a grid, especially when only a few hyperparameters really matter.
    - Search space: it is defined similarly to grid search but does not require discretization; it can handle continuous and discrete hyperparameters.
    - Computational efficiency: random search is often more computationally efficient than grid search because it does not exhaustively evaluate every combination.
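
As a hedged illustration, the sketch below uses `RandomizedSearchCV` with continuous log-uniform distributions from `scipy.stats`. The dataset, the model, the ranges and the budget of 10 samples are assumptions made for the example.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Continuous distributions instead of a fixed grid of values
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-3, 1e1),
}

# Only n_iter=10 random combinations are trained and scored
search = RandomizedSearchCV(
    SVC(kernel="rbf"), param_distributions,
    n_iter=10, cv=5, random_state=0,
)
search.fit(X, y)

print("Combinations evaluated:", len(search.cv_results_["params"]))
print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

The compute budget is fixed by `n_iter` regardless of how many hyperparameters are searched, which is the main practical difference from a grid.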
    
    
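#### 18. 5- How to choose between Random Search and Grid Search ?

- Grid search is comprehensive but computationally intensive: it is thorough and guarantees that every combination is evaluated, but the cost grows quickly with the number of hyperparameters and the range of values. It fits small search spaces.
- Random search is often more computationally efficient because it does not exhaustively evaluate every combination, and it can handle continuous hyperparameters. It fits large search spaces or a limited compute budget.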

### 19- What is the difference between a parameter and a hyperparameter?
- Each machine learning model has : 
    - Parameters
    - Hyperparameters
- **Model parameters:**
    - They are configuration variables that are internal to the model
    - They are estimated or learned from the data by the model and are not set manually
    - They are required by the model to make predictions
    - Examples:
        - $y=mx+c$ : m and c are parameters
        - $y=ax^2+bx+c$ : a, b,c are parameters
        - Support vectors in SVM 
        - Weights in ANN and Linear regression
- **Model hyperparameters:**
    - They are set before training the model ==> hyperparameter tuning
    - They are external to the model 
    - Good values can be found using:
        - GridSearch 
        - RandomSearch
        - Values reused from previous, similar problems
    - Or they can be set manually
    - Examples:
        - Learning rate of NN
        - C and *sigma* in SVM
        - K in KNN
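
A small sketch of the distinction, assuming `scikit-learn`: in a `Ridge` regression, `alpha` (the regularization strength) is a hyperparameter passed in before training, while `coef_` and `intercept_` are parameters that only exist after fitting. The toy data is made up for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data generated from y = 3x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 * X.ravel() + 1.0

# Hyperparameter: chosen by us before training, not learned from the data
model = Ridge(alpha=1e-3)
print("hyperparameters:", model.get_params())

# An unfitted model has no learned parameters yet
unfitted_has_params = hasattr(Ridge(alpha=1e-3), "coef_")
print("unfitted model has coef_:", unfitted_has_params)

# Parameters: estimated from the data during fit()
model.fit(X, y)
print("learned slope (m):", model.coef_[0])
print("learned intercept (c):", model.intercept_)
```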
    

### 20- What do interpolation and extrapolation mean?
- **Interpolation :** 
    - It is a mathematical and statistical technique used to estimate values that fall between known, observed, or measured data points.
    - The goal is to predict values within the range of the existing data.
- **Extrapolation :**
    - It estimates values that fall outside the range of the known data points.
    - Extrapolation comes with more uncertainty compared to interpolation, as it relies on the assumption that the underlying pattern persists outside the known range. 
    - Extrapolation can be risky, especially when the data may exhibit behavior that deviates from the observed pattern.
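
To make the difference concrete, the sketch below fits a quadratic to samples of $\sin(x)$ taken on $[0, \pi]$, then queries it once inside that range and once far outside it. The function, the polynomial degree and the query points are arbitrary choices for illustration.

```python
import numpy as np

# Known data: sin(x) observed only on [0, pi]
x_known = np.linspace(0.0, np.pi, 20)
y_known = np.sin(x_known)

# Fit a simple model (degree-2 polynomial) to the known points
coeffs = np.polyfit(x_known, y_known, deg=2)

x_inside = np.pi / 2    # within the observed range: interpolation
x_outside = 2 * np.pi   # beyond the observed range: extrapolation

interp_error = abs(np.polyval(coeffs, x_inside) - np.sin(x_inside))
extrap_error = abs(np.polyval(coeffs, x_outside) - np.sin(x_outside))

print(f"interpolation error at x=pi/2: {interp_error:.3f}")
print(f"extrapolation error at x=2pi:  {extrap_error:.3f}")
```

The fitted curve tracks the data closely where observations exist, but outside that range the parabola keeps bending downward while the true function comes back to zero.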

### 21- Correlation matrix versus Covariance matrix ? 
- Correlation:
    - It is the normalized form of covariance: the covariance divided by the product of the two standard deviations.
    - It measures the strength and direction of the linear relationship between variables.
    - Correlation values belong to $[-1, 1]$, covering negative and positive relations.
    - It indicates how closely a change in one variable is associated with a change in another (this does not imply causation).
    - It tells how strongly two random variables are related to each other.

- Covariance : 
    - Tells us the direction of the linear relationship between two random variables
    - It is used to determine how much two random variables vary together
    - Its magnitude depends on the units of the variables, so it is not bounded
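
The relation between the two can be checked numerically. This sketch, on made-up synthetic data, computes both matrices with NumPy and shows that rescaling a variable changes the covariance but not the correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)  # y is linearly related to x, plus noise

cov = np.cov(x, y)         # 2x2 covariance matrix
corr = np.corrcoef(x, y)   # 2x2 correlation matrix

# Correlation = covariance divided by the product of standard deviations
manual_corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])

# Changing the units of x (multiplying by 100) scales the covariance only
cov_scaled = np.cov(100 * x, y)[0, 1]
corr_scaled = np.corrcoef(100 * x, y)[0, 1]

print("covariance:", cov[0, 1], "-> after rescaling x:", cov_scaled)
print("correlation:", corr[0, 1], "-> after rescaling x:", corr_scaled)
```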

 


### What is the difference between Type I error and Type II error ?
- Type I error :
    - Occurs when the null hypothesis is true and we reject it
    - False positive: we conclude that something has happened when it really has not.
- Type II error :
    - Occurs when the null hypothesis is false and we fail to reject it
    - False negative: something has happened and we miss it.
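
A rough simulation can make both error types tangible. The sketch below, assuming `scipy` is available, repeats a one-sample t-test many times, first when the null hypothesis (mean = 0) is true and then when it is false. The sample size, effect size and significance level are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_trials, n = 0.05, 2000, 30

# Case 1: the null hypothesis is TRUE (the population mean really is 0)
null_data = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n))
p_null = stats.ttest_1samp(null_data, popmean=0.0, axis=1).pvalue
type_1_rate = np.mean(p_null < alpha)    # rejected although H0 is true

# Case 2: the null hypothesis is FALSE (the population mean is 0.5)
alt_data = rng.normal(loc=0.5, scale=1.0, size=(n_trials, n))
p_alt = stats.ttest_1samp(alt_data, popmean=0.0, axis=1).pvalue
type_2_rate = np.mean(p_alt >= alpha)    # not rejected although H0 is false

print(f"Type I error rate (false positives):  {type_1_rate:.3f}")
print(f"Type II error rate (false negatives): {type_2_rate:.3f}")
```

The Type I rate should land close to the chosen significance level, while the Type II rate depends on the sample size and on how far the true mean is from the null value.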

### 22- Distributed computing versus parallel computing ? 

- **Parallel computing:**
    - Allows breaking down a large computational task into smaller subtasks that can be executed simultaneously.
    - Subtasks execution is done simultaneously in parallel using multiple processors or cores within a single machine.
    - Characteristics: Shared Memory, Data Sharing, Single Machine (with multiple processors), lower communication overhead due to accessing shared memory directly. 
    - Applications : problems that can be divided into independent subtasks such as image processing, numerical simulations, scientific computing.
- **Distributed computing:**
    - Distributed computing divides a single task between multiple computers (nodes) to achieve a common goal.
    - Each computer (node) used in distributed computing has its own processor and memory.
    - Machines are connected over a network and work together on a computational task.
    - Characteristics: multiple machines, communication over network, data distribution, designed with fault tolerance mechanisms since individual nodes may fail
    - Applications :
        - Large-scale data processing (e.g., big data analytics).
        - Web services, cloud computing, and distributed databases.
        - Solving problems that require the coordination of multiple machines.
        
- The choice between them depends on the nature of the problem, scale requirements, and communication considerations.


### What is the Maximal Information Coefficient (MIC)?
- It is used to identify relationships between pairs of variables.
- It measures the strength of linear and non-linear association between two variables x and y.
- It captures a wide range of associations, both functional and non-functional.
- For functional relationships, its score is roughly equal to the coefficient of determination $R^2$ of the data relative to the regression function.

### 23- What does multicollinearity mean?
- It is a statistical concept where several independent variables in a model are correlated with each other.
- If the correlation coefficient between two variables is +/- 1 ==> those two variables are "perfectly collinear".
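
One simple way to spot collinear features is to inspect their pairwise correlation coefficients. In the sketch below, the synthetic features and the 0.9 cut-off are assumptions made for illustration, not a standard rule.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 2.0 * x1 + rng.normal(scale=0.05, size=300)  # almost a linear function of x1
x3 = rng.normal(size=300)                         # unrelated to x1 and x2

names = ["x1", "x2", "x3"]
corr = np.corrcoef(np.vstack([x1, x2, x3]))       # each row is one variable

# Flag pairs whose absolute correlation exceeds an illustrative threshold
threshold = 0.9
flagged = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

print(np.round(corr, 3))
print("Highly correlated pairs:", flagged)
```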

### 10- What does Instance-Based Learning mean ?
Instance-based learning, also known as instance-based reasoning or memory-based learning, is a type of machine learning approach that makes predictions based on the similarity between new instances and instances in the training dataset. Instead of learning an explicit model during training, instance-based learning stores the entire training dataset and uses it to make predictions for new, unseen instances. K-Nearest Neighbors (KNN) is a classic example. 

It is suited for tasks where the relationships between input features and output labels are not easily captured by a simple model. It can be robust in the presence of noise and is capable of handling complex decision boundaries. However, it may be computationally expensive, especially when dealing with large datasets.
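
A minimal KNN sketch, assuming `scikit-learn`, shows the idea: fitting mostly stores the training instances, and each prediction is a vote among the stored points closest to the new one. The six training points are made up for the example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two small clusters of labeled instances (illustrative data)
X_train = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],   # class 0
    [4.0, 4.2], [4.1, 3.9], [3.8, 4.0],   # class 1
])
y_train = np.array([0, 0, 0, 1, 1, 1])

# fit() keeps the training instances; no explicit model is derived from them
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Prediction compares each new point with the stored instances
X_new = np.array([[1.1, 1.0], [3.9, 4.1]])
pred = knn.predict(X_new)
distances, indices = knn.kneighbors(X_new)

print("predictions:", pred)  # → [0 1]
print("indices of the 3 nearest stored instances:", indices)
```

Because every prediction searches the stored data, the cost grows with the size of the training set, which is the computational drawback mentioned above.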

## 10- Why should we create an ML pipeline ?
An ML pipeline is an end-to-end construct that orchestrates the flow of data into, and the output from, an ML model (or a set of multiple models). 
It is a way to codify and automate the workflow it takes to produce an ML model. 
It consists of multiple sequential steps, from data extraction and preprocessing to model training and deployment.
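
As a small-scale illustration, `scikit-learn` offers a `Pipeline` class that chains preprocessing and training into one object. The sketch below covers only those two steps; the dataset, the step names and the model are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Sequential steps: scale the features, then train a classifier
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# One call runs every step in order; the scaler is fitted on training data only
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", round(test_accuracy, 3))
```

Bundling the steps this way means the same preprocessing is applied automatically at prediction time, which helps keep training and serving consistent.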

## 9- What is the difference between Inductive ML and Deductive ML?
- **Inductive Learning:**
    - Observes instances and generalizes from them to draw a conclusion
    - Example: telling a child to stay away from fire and showing them a video about it
- **Deductive Learning:**
    - Draws its conclusion from experience
    - Example: allowing a child to play with fire and learn from what happens