# Data Scientist Interview Questions
## Part 3 : ML : modelling and evaluation

This Jupyter notebook serves as a focused resource for individuals gearing up for technical interviews in the fields of machine learning engineering and data science. It specifically delves into questions related to all phases of machine learning model evaluation and deployment. Whether you're a candidate looking to sharpen your interview skills or an interviewer seeking insightful questions, this notebook provides valuable content for honing your understanding of machine learning evaluation and deployment.

### 0- What does Machine Learning mean?

### What are the three stages of building a machine learning model?
The three stages are:

- Model building
- Model testing
- Model application

### 1- What are the types of ML algorithms ? 
Machine learning algorithms can be categorized into several types based on their learning styles and the nature of the task they are designed to solve.

Here are some common types of machine learning algorithms:
- **Supervised Learning** 
- **Unsupervised Learning**
- **Semi-Supervised Learning**
- **Deep Learning** 
- **Reinforcement Learning** 
- **Ensemble learning**  
- **Ranking**
- **Recommendation system** 


#### 1.1- What do supervised, unsupervised and semi-supervised mean in ML?

In machine learning, the terms "supervised learning," "unsupervised learning," and "semi-supervised learning" refer to different approaches based on the type of training data available and the learning task at hand:

- **Supervised Learning :** Trains a model on a labeled dataset, where the algorithm learns the relationship between input features and corresponding target labels. It can be used for regression (continuous output) or classification (discrete output).
- **Unsupervised Learning :** Deals with unlabeled data and aims to find patterns, structures, or relationships within the data. It can be used for clustering (grouping similar data points together) or association (finding items that tend to occur together).
- **Semi-Supervised Learning:** Utilizes a combination of labeled and unlabeled data to improve learning performance, often in situations where obtaining labeled data is challenging and expensive.

#### 1.2- What are Unsupervised Learning techniques ?
There are two main techniques, clustering and association:
- Clustering : groups similar data points together based on inherent patterns or similarities. Example: grouping customers with similar purchasing behavior for targeted marketing.
- Association : identifies patterns of association between different variables or items. Example: an e-commerce website suggests other items for you to buy based on prior purchases.

#### 1.3- What are Supervised Learning techniques ?
There are two main techniques, classification and regression:
- Regression : involves predicting a continuous output or numerical value based on input features. Examples : predicting house prices, temperature, stock prices etc.
- Classification : is the task of assigning predefined labels or categories to input data. We have two types of classification algorithms: 
    - Binary classification (two classes). Example: identifying whether an email is spam or not.
    - Multiclass classification (multiple classes). Example: classifying images of animals into different species.
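
To make the distinction concrete, here is a minimal pure-Python sketch of both tasks. The data, the function names (`fit_line`, `fit_centroids`, `classify`) and the choice of a nearest-centroid classifier are illustrative assumptions, not a recommended production approach.

```python
def fit_line(xs, ys):
    """Regression: one-feature least-squares fit, returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_centroids(xs, labels):
    """Classification: store the mean feature value of each class."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: sum(values) / len(values) for label, values in groups.items()}


def classify(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Regression: hypothetical house sizes -> prices (continuous target).
sizes = [50, 60, 80, 100]
prices = [110, 130, 170, 210]
slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 70 + intercept

# Binary classification: hypothetical count of suspicious words -> spam or ham.
word_counts = [0, 1, 2, 8, 9, 10]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]
centroids = fit_centroids(word_counts, labels)
predicted_label = classify(centroids, 7)
```

Both models learn from labeled pairs; the only difference is that the regression target is a number and the classification target is a category.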

### 2- Examples of well-known machine learning algorithms used to solve regression problems

Some well-known machine learning algorithms commonly used to solve regression problems:

- Linear Regression
- Lasso Regression
- Ridge Regression
- Decision Trees
- Random Forest
- Gradient Boosting Algorithms (e.g., XGBoost, LightGBM)
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Bayesian Regression
- Neural Networks (Deep Learning)

### 3- Examples of well-known machine learning algorithms used to solve classification problems

Some well-known machine learning algorithms commonly used to solve classification problems:

- Logistic Regression
- Decision Trees
- Random Forest
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Naive Bayes
- Neural Networks (Deep Learning)
- AdaBoost
- Gradient Boosting Machines (GBM)
- XGBoost
- CatBoost
- LightGBM


### 4- Examples of well-known machine learning algorithms used to solve clustering problems

Several well-known machine learning algorithms are commonly used for solving clustering problems. Here are some examples:

- K-Means Clustering 
- Hierarchical Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Mean Shift
- Gaussian Mixture Model (GMM)
- Agglomerative Clustering

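These algorithms fall into different families: centroid-based (K-Means), hierarchical (Hierarchical and Agglomerative Clustering), density-based (DBSCAN, Mean Shift) and distribution-based (GMM).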

These algorithms address different types of clustering scenarios and have varying strengths depending on the nature of the data and the desired outcomes. The choice of clustering algorithm often depends on factors such as the shape of clusters, noise in the data, and the number of clusters expected.

#### 4.1- K-Means 

#### 4.2- Hierarchical Clustering Versus Agglomerative Clustering
Hierarchical clustering is a type of clustering algorithm, and agglomerative clustering is a specific approach within hierarchical clustering: 
- **Hierarchical Clustering :** builds a hierarchy of clusters by iteratively merging or splitting clusters based on their similarity; the result can be drawn as a dendrogram. It is broadly classified into two types: agglomerative (bottom-up), which starts with each data point as a separate cluster, and divisive (top-down), which starts with all data points in one cluster.
- **Agglomerative Clustering :** is a specific approach within hierarchical clustering. Here is how it works : 
    - Each data point is initially a separate cluster.
    - The closest pair of clusters is merged into a single cluster.
    - The merge step is repeated until all data points belong to a single cluster.
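
The agglomerative steps above can be sketched in a few lines of pure Python. This is a minimal illustration on 1-D points that assumes single linkage (distance between clusters = distance between their closest members); the function name `agglomerative` and the `n_clusters` stopping parameter are hypothetical additions for the example.

```python
def agglomerative(points, n_clusters=1):
    """Single-linkage agglomerative clustering on 1-D points.

    Returns the final clusters and the merge history (pairs merged, in order).
    """
    # Each data point starts as its own cluster.
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > n_clusters:
        best = None
        # Find the closest pair of clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        history.append((sorted(clusters[i]), sorted(clusters[j])))
        # Merge the pair into a single cluster.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [sorted(c) for c in clusters], history


points = [1.0, 1.5, 5.0, 5.2, 9.0]
three_clusters, history = agglomerative(points, n_clusters=3)
one_cluster, full_history = agglomerative(points)
```

Stopping at `n_clusters=3` corresponds to cutting the dendrogram at a given level; with the default `n_clusters=1` the loop runs until everything is merged, as in the description above.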

### 5- What is Ensemble learning?

Ensemble learning is a machine learning technique that involves combining the predictions of multiple individual models to improve overall performance and accuracy. Instead of relying on a single model, ensemble methods leverage the strengths of diverse models to compensate for each other's weaknesses. The idea is that by aggregating the predictions of multiple models, the ensemble can achieve better generalization and make more robust predictions than any individual model.

There are several ensemble learning methods, with two primary types being:
- **Bagging (Bootstrap Aggregating) :** 
    - Involves training multiple instances of the same model on different subsets of the training data, typically sampled with replacement. 
    - Examples : Random Forest, Bagged Decision Trees, Bagged SVM (Support Vector Machines), Bagged K-Nearest Neighbors, Bagged Neural Networks
- **Boosting :**
    - Focuses on sequentially training models, with each subsequent model giving more attention to the instances that the previous models misclassified. 
    - Examples: AdaBoost (Adaptive Boosting), GBM (Gradient Boosting Machine), XGBoost (Extreme Gradient Boosting), LightGBM (Light Gradient Boosting Machine), CatBoost
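
As an illustration of bagging, the sketch below trains several copies of one weak learner on bootstrap samples and combines them by majority vote. The weak learner (a 1-D threshold "stump"), the toy data and all function names are assumptions made for this example; it also assumes class 1 has the larger feature values.

```python
import random


def train_stump(sample):
    """Weak learner: threshold halfway between the two class means (1-D data)."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > threshold else 0


def bagging_fit(data, n_models=25, seed=0):
    """Train n_models stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    models = []
    while len(models) < n_models:
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        if len({y for _, y in sample}) < 2:
            continue  # skip degenerate samples that miss a class
        models.append(train_stump(sample))
    return models


def bagging_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = sum(model(x) for model in models)
    return 1 if votes > len(models) / 2 else 0


data = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
        (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
models = bagging_fit(data)
low = bagging_predict(models, 1.0)
high = bagging_predict(models, 9.0)
```

Boosting would differ in the training loop: models are built one after another, and each new model is fitted with more weight on the examples the previous ones got wrong.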


### 6- What is Deep Learning?

Deep learning is a subset of machine learning that involves training artificial neural networks with multiple layers (deep neural networks) to model complex patterns and representations. The term "deep" refers to the depth of the neural network, which consists of multiple hidden layers through which data is processed.

It has achieved remarkable success in various domains, including image and speech recognition, natural language processing, and game playing. It reduces the need for manual feature engineering by automatically learning hierarchical representations from raw data during training. Deep learning models are typically trained on large amounts of data.

Two main examples are:
- Convolutional Neural Networks (CNNs) for image processing
- Recurrent Neural Networks (RNNs) for sequence data.

#### 6.1- What is a neural network?
#### 6.2- What are Convolutional Neural Networks (CNNs)?
#### 6.3- What are Recurrent Neural Networks (RNNs)?

### 7- What are Recommender Systems?
Recommender systems, also known as recommendation systems or engines, are applications or algorithms designed to suggest items or content to users based on their preferences and behavior. These systems leverage data about users and items to make personalized recommendations, aiming to enhance user experience and satisfaction. There are two main types of recommender systems:

- Content-Based Recommender Systems
- Collaborative Filtering Recommender Systems

Recommender systems are widely used in various industries, including e-commerce, streaming services, and social media. They help users discover new items, increase user engagement, and contribute to business success by promoting relevant content and products.

#### 7.1- What are Content-Based Recommender Systems?
#### 7.2- What are Collaborative Filtering Recommender Systems?

### 8- What is Reinforcement Learning?

Reinforcement learning is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties based on its actions, allowing it to learn optimal strategies over time to maximize cumulative rewards. It is inspired by the way humans and animals learn from trial and error.
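
The reward-driven loop can be illustrated with a very small example: an epsilon-greedy agent on a two-armed bandit. This is only a sketch under assumed settings (the reward probabilities, `epsilon`, the number of steps and the function name `run_bandit` are all hypothetical), and a bandit is the simplest possible environment, with no state transitions.

```python
import random


def run_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running estimate of each action's average reward
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_arms)  # explore: try a random action
        else:
            action = max(range(n_arms), key=lambda a: values[a])  # exploit
        # The environment answers with a reward (1) or nothing (0).
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        # Incremental mean update of the action-value estimate.
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    return values, counts, total_reward


values, counts, total_reward = run_bandit([0.2, 0.8])
```

After enough trials the estimate for the second action should end up higher than the first, so the agent picks it most of the time: it has learned from rewards alone which action is better.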

Here are some applications of Reinforcement Learning : 
- Automated Robots
- Natural Language Processing
- Marketing and Advertising 
- Image Processing
- Recommendation Systems
- Traffic Control 
- Healthcare 
- Etc.

### 9- What is Ranking ? 

Ranking in machine learning refers to the process of assigning a meaningful order or ranking to a set of items based on their relevance or importance. This is often used in scenarios where the goal is to prioritize or sort items based on their predicted or observed characteristics.

Ranking problems are common in various applications, including information retrieval, recommendation systems, and search engines.

### 10- Overfitting and Underfitting

Overfitting and underfitting are common challenges in machine learning that relate to the performance of a model on unseen data.
- Overfitting : occurs when a machine learning model learns the training data too well, capturing noise and random fluctuations in addition to the underlying patterns, so it performs well on the training data but poorly on unseen data.

- Underfitting : happens when a model is too simplistic and cannot capture the underlying patterns in the training data.

#### a- Overfitting Causes and Mitigation:
- Causes:
    - Too many features or parameters.
    - Complex model architectures.
    - Limited training data.
- Consequences:
    - Low error on the training data but high error on unseen data (poor generalization).
- Mitigation:
    - Regularization techniques (e.g., L1 or L2 regularization).
    - Feature selection or dimensionality reduction.
    - Increasing the amount of training data.
    - Using simpler model architectures.
    
#### b- Underfitting Causes and Mitigation:
- Causes:
    - Too few features or parameters.
    - Insufficient model complexity.
    - Inadequate training time or data.
- Mitigation:
    - Increasing the complexity of the model.
    - Adding relevant features.
    - Training for a longer duration.
    - Considering more sophisticated model architectures.

Achieving a balance between overfitting and underfitting is crucial. This balance, often referred to as the model's "sweet spot," results in a model that generalizes well to new, unseen data. Techniques like cross-validation, hyperparameter tuning, and monitoring learning curves can help strike this balance during the model development process.

### 11- What are the types of Regularization in Machine Learning?

Regularization is a technique used in machine learning to prevent overfitting by adding a penalty term to the objective function. There are mainly two types of regularization commonly used: L1 regularization (Lasso) and L2 regularization (Ridge). Additionally, Elastic Net is a combination of both L1 and L2 regularization. 

The techniques covered here are:
- L1 Regularization (Lasso)
- L2 Regularization (Ridge)
- Elastic Net

#### - L1 Regularization (Lasso) : 
L1 regularization tends to shrink some coefficients exactly to zero, effectively excluding the corresponding features from the model. It is often used when there is a belief that some features are irrelevant. The penalty term is the sum of the absolute values of the regression coefficients.

#### - L2 Regularization (Ridge) : 

L2 regularization tends to shrink coefficients toward zero without eliminating them entirely. It is effective in dealing with multicollinearity (high correlation between predictors) and preventing overfitting. The penalty term is the sum of the squared values of the regression coefficients.
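
The different behavior of the two penalties shows up even with a single feature and no intercept, where both have closed-form solutions. The sketch below assumes the objectives `0.5 * sum((y - w*x)**2) + alpha * abs(w)` for L1 and `0.5 * sum((y - w*x)**2) + 0.5 * alpha * w**2` for L2; the data and function names are made up for illustration.

```python
def _sums(xs, ys):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy, sxx


def ols_1d(xs, ys):
    """Unregularized least-squares coefficient."""
    sxy, sxx = _sums(xs, ys)
    return sxy / sxx


def ridge_1d(xs, ys, alpha):
    """L2 penalty: the coefficient is scaled down but stays non-zero."""
    sxy, sxx = _sums(xs, ys)
    return sxy / (sxx + alpha)


def lasso_1d(xs, ys, alpha):
    """L1 penalty: soft-thresholding, which can return exactly zero."""
    sxy, sxx = _sums(xs, ys)
    shrunk = max(abs(sxy) - alpha, 0.0)
    return (shrunk if sxy >= 0 else -shrunk) / sxx


# A weakly relevant feature: y is roughly 0.1 * x.
xs = [1.0, 2.0, 3.0]
ys = [0.1, 0.2, 0.3]
w_ols = ols_1d(xs, ys)
w_ridge = ridge_1d(xs, ys, alpha=2.0)
w_lasso = lasso_1d(xs, ys, alpha=2.0)
```

With this data the L2 estimate shrinks from about 0.1 to about 0.0875, while the L1 estimate is exactly 0, which is why Lasso can act as a feature selector.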


#### - Elastic Net: 

Elastic Net combines both L1 and L2 penalties in the objective function. It has two control parameters, alpha (which controls the overall strength of regularization) and the mixing parameter, which determines the ratio between L1 and L2 penalties. It is useful when there are many correlated features, and it provides a balance between Lasso and Ridge.


These regularization techniques help improve the generalization performance of machine learning models by preventing them from becoming too complex and fitting noise in the training data. The choice between L1, L2, or Elastic Net depends on the specific characteristics of the dataset and the modeling goals.
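
For reference, the three penalized objectives can be written side by side. The notation is an assumption for this summary: $L(w)$ is the unregularized loss, $w_1, \dots, w_p$ are the coefficients, $\alpha \ge 0$ is the regularization strength and $\rho \in [0, 1]$ is the mixing parameter. Libraries often use slightly different scaling constants.

$$
\begin{aligned}
\text{L1 (Lasso):}\quad & J(w) = L(w) + \alpha \sum_{j=1}^{p} |w_j| \\
\text{L2 (Ridge):}\quad & J(w) = L(w) + \alpha \sum_{j=1}^{p} w_j^2 \\
\text{Elastic Net:}\quad & J(w) = L(w) + \alpha \Big( \rho \sum_{j=1}^{p} |w_j| + (1 - \rho) \sum_{j=1}^{p} w_j^2 \Big)
\end{aligned}
$$

Setting $\rho = 1$ recovers Lasso and $\rho = 0$ recovers Ridge.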


### 12- What is Model Validation ?

Validation techniques in machine learning are essential for assessing the performance of a model and ensuring its ability to generalize well to unseen data. 

Here are some common validation techniques:
- Train-Test Split 
- K-Fold Cross-Validation 
- Stratified K-Fold Cross-Validation
- Leave-One-Out Cross-Validation (LOOCV)
- Holdout Validation
- Time Series Cross-Validation
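
To show what K-Fold Cross-Validation does mechanically, here is a minimal sketch of the index splitting, without shuffling or stratification. The function name `k_fold_indices` is hypothetical; in practice a library implementation would normally be used.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    # A model would be fitted on `train` and scored on `test` here;
    # the k scores are then averaged.
    pass
```

Every sample is used for testing exactly once. Setting `k` equal to the number of samples gives Leave-One-Out Cross-Validation, and a single split corresponds to a plain train-test split.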

#### What are performance metrics for Classification?

#### What are performance metrics for Regression?

#### What are performance metrics for Clustering?

### What is transfer learning?

Topics to cover:

- Model convergence
- Bias metric
- Central limit theorem
- Correlation matrix versus covariance matrix
- Grid search versus random search
- Distributed computing versus parallel computing
- Interpolation and extrapolation
- Reduction techniques

### 13- What does Instance-Based Learning mean?
Instance-based learning, also known as instance-based reasoning or memory-based learning, is a type of machine learning approach that makes predictions based on the similarity between new instances and instances in the training dataset. Instead of learning an explicit model during training, instance-based learning stores the entire training dataset and uses it to make predictions for new, unseen instances. K-Nearest Neighbors (KNN) is a classic example.

It is suited for tasks where the relationships between input features and output labels are not easily captured by a simple model. It can be robust in the presence of noise and is capable of handling complex decision boundaries. However, it may be computationally expensive, especially when dealing with large datasets.
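
A minimal KNN classifier makes the "store everything, compare at prediction time" idea concrete. The training points, labels and the function name `knn_predict` are assumptions for this sketch; it uses Euclidean distance and a plain majority vote.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict the label of `query` from its k nearest stored instances.

    train: list of (features, label) pairs; query: a feature tuple.
    """
    # No training step: all the work happens at prediction time.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
near_a = knn_predict(train, (1.1, 1.0))
near_b = knn_predict(train, (4.9, 5.0))
```

Each prediction sorts the whole training set by distance, which is the source of the computational cost mentioned above for large datasets.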

### Why should we create a ML pipeline?
An ML pipeline is an end-to-end construct that orchestrates the flow of data into, and output from, an ML model (or a set of multiple models).
It is a way to codify and automate the workflow it takes to produce an ML model, through multiple sequential steps from data extraction and preprocessing to model training and deployment.
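
The idea of chaining sequential steps can be sketched in plain Python. Everything here is a toy stand-in: the step functions, the in-memory data and the helper `make_pipeline` are hypothetical, and "deployment" simply means returning a callable predictor.

```python
def make_pipeline(*steps):
    """Chain (name, function) steps; each step receives the previous step's output."""
    def run(data=None):
        for _name, step in steps:
            data = step(data)
        return data
    return run


def extract(_):
    # Stand-in for reading from a database or a file.
    return [{"x": 1.0, "y": 3.0}, {"x": 2.0, "y": 5.0},
            {"x": None, "y": 4.0}, {"x": 3.0, "y": 7.0}]


def preprocess(rows):
    # Drop rows with a missing feature.
    return [row for row in rows if row["x"] is not None]


def train(rows):
    # Least-squares fit of y = slope * x + intercept.
    n = len(rows)
    mean_x = sum(row["x"] for row in rows) / n
    mean_y = sum(row["y"] for row in rows) / n
    sxy = sum((row["x"] - mean_x) * (row["y"] - mean_y) for row in rows)
    sxx = sum((row["x"] - mean_x) ** 2 for row in rows)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def deploy(model):
    # Expose the trained model as a prediction function.
    return lambda x: model["slope"] * x + model["intercept"]


pipeline = make_pipeline(
    ("extract", extract),
    ("preprocess", preprocess),
    ("train", train),
    ("deploy", deploy),
)
predict = pipeline()
prediction = predict(4.0)
```

Because the whole workflow is one callable, it can be rerun unchanged when new data arrives, and an individual step can be swapped without touching the others.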