# Data Scientist Interview Questions
## Part 4 : ML : Regression Analysis

This Jupyter notebook serves as a comprehensive resource for machine learning enthusiasts and those preparing for technical interviews in the fields of data science and machine learning. It encompasses essential information and detailed insights regarding regression analysis.

Whether strengthening foundational knowledge or delving into specific regression concepts, this notebook provides a concise yet thorough guide. For those aspiring to excel in data science interviews, exploring this resource offers a valuable edge in understanding regression analysis within the machine learning domain.

### Q0- What does regression analysis mean?

- It is a statistical technique used in data science and machine learning fields. 
- It aims to model the relationship between a dependent variable and one or more independent variables.
- By modeling the relationship between inputs and output, it helps us understand the nature and strength of that relationship and make predictions based on it.

- Mainly, we use regression analysis to resolve problems and answer questions such as:
    - How does a change in one variable (independent variable) impact another variable (dependent variable)?
    - Can we predict the value of the dependent variable based on the values of one or more independent variables?
- It is widely used in various fields, including economics, finance, biology, psychology, and machine learning.
 

### Q1- Examples of well-known machine learning algorithms used to solve regression problems

Here are some well-known machine learning algorithms commonly used to solve regression problems:

- Linear Regression
- Decision Trees
- Bayesian Regression
- Lasso Regression
- Ridge Regression
- Support Vector Machines (SVM)
- K-Nearest Neighbors (KNN)
- Random Forest
- Gradient Boosting Algorithms (e.g., XGBoost, LightGBM)
- Neural Networks (Deep Learning)

### Q2- What is linear regression, and how does it work?

- It is one of the simplest and most popular Machine Learning algorithms for predictive analysis.
- Linear regression (LR) is a statistical method to model the relationship between a dependent variable (target) and one or more independent variables (inputs).
- It is called "linear" because we assume a linear relationship between these variables.
- It aims to predict continuous (real-valued) variables such as temperature, salary, quantity, or price from the remaining features.
- It can be classified into two main types : 
    - Simple Linear Regression : models the relationship between one independent variable (x) and a dependent variable (y).
    - Multiple Linear Regression : uses more than one independent variable (X) to model the relationship with the dependent variable (y).
- It requires a continuous dependent variable (y). The independent variables can be continuous or categorical, provided the categorical ones are encoded numerically (e.g., one-hot encoding). For a categorical target, a classification method such as logistic regression is used instead.
    
 

### Q3- How does Simple Linear Regression work?

- It is used to model relationship between an independent variable (x) and a dependent variable (y).
- Example: Predicting the price of a house based on its size.
<img src="images/lin_reg.png" width="300">   

- The regression line is the line of best fit through the data points on a scatter plot, as shown in the figure above.
- The equation of this line is : $$y=w \times x + b$$

    - Where : 
        - y: dependent/response/target variable, the one we want to predict or explain.
        - x: independent/input/predictor variable, used to predict or explain the variability of y.
        - w: regression coefficient (slope): the parameter that indicates the strength and direction of the relationship between x and y.
        - b: bias term (intercept): the value of y when x = 0, which lets the line fit patterns that do not pass through the origin.
        
- The line is determined by finding the values of the slope (w) and intercept (b) that minimize the sum of squared residuals (the least squares criterion).
- Residuals: 
    - A residual is the prediction error, i.e. the difference between the observed value (y) and the predicted value ($\hat y$).
    - Formula : $e=y-\hat y$
    - We calculate the sum of squared residuals: $SSR = \sum e_i^2$. The residuals are squared so that positive and negative errors do not cancel out.
- Our main goal is to find the best fit line, for which the error between predicted values and actual values is minimized.

*Source: https://www.javatpoint.com/linear-regression-in-machine-learning*
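
The least squares fit described above has a closed-form solution for a single predictor: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below is a minimal illustration in plain Python; the function name and the house-price numbers are made up for the example.

```python
def fit_simple_linear_regression(x, y):
    """Return (w, b) minimizing the sum of squared residuals for y = w*x + b."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    w = s_xy / s_xx
    # Intercept: the fitted line passes through the point (x_mean, y_mean)
    b = y_mean - w * x_mean
    return w, b

# Hypothetical data: house sizes (m^2) and prices (in thousands)
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 260, 300, 360]

w, b = fit_simple_linear_regression(sizes, prices)

# Residuals e = y - y_hat and their sum of squares
residuals = [yi - (w * xi + b) for xi, yi in zip(sizes, prices)]
ssr = sum(e ** 2 for e in residuals)

print(f"w = {w:.2f}, b = {b:.2f}, SSR = {ssr:.2f}")
```

On this toy data the fitted line is approximately y = 2.6x + 20, and the residuals sum to roughly zero, which is a property of any least squares fit that includes an intercept.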

### Q4- How does Multiple Linear Regression work?
- The only difference between Simple and Multiple Linear Regression lies in the number of independent variables used in the regression model.
- We have multiple independent variables $x_1, x_2, ..., x_n$
- New equation: $y=b_0+b_1 x_1 + b_2x_2+ ...+b_n x_n$
- Where $b_0$ represents the intercept, and $b_1, b_2, ..., b_n$ represent the coefficients of the independent variables.
- Simple linear regression involves one independent variable, while multiple linear regression involves two or more independent variables.
- Example: Predicting the performance of a student based on their age, gender, IQ, etc.
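
As a rough sketch of how the coefficients $b_0, b_1, ..., b_n$ can be estimated, the example below solves the least squares problem with NumPy's `np.linalg.lstsq`. The data is synthetic (generated from known coefficients so the result can be checked), and adding a column of ones for the intercept is one common convention, not the only one.

```python
import numpy as np

# Hypothetical data: two predictors (x1, x2); the target is generated
# exactly as y = 1 + 2*x1 + 3*x2 so the true coefficients are known
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so that the first coefficient acts as b_0
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares solution of X_design @ coeffs = y
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
b0, b1, b2 = coeffs

# Predictions from the fitted model
y_hat = X_design @ coeffs

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

Because the target here contains no noise, the recovered coefficients match the generating ones up to floating-point error; on real data they would only be estimates.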

### Q5-  What assumptions should you consider before starting a linear regression analysis?

Here are some important assumptions of Linear Regression: 

- Linear relationship between the independent and dependent variables.
- No or little multicollinearity between the features: independent variables are not highly correlated with each other.
- Normal distribution of error terms: residuals (errors) are normally distributed with a mean of zero and a constant variance.
- The residuals are independent of each other: no autocorrelation in the error terms.
- The model includes all the relevant independent variables needed to accurately predict the dependent variable.
- Homoscedasticity assumption: the variance of the error term is the same for all values of the independent variables. With homoscedasticity, a scatter plot of residuals against predicted values shows no clear pattern (for example, no funnel shape).

**Note:**
- Multicollinearity means high correlation between the independent variables.
- In this situation, it becomes difficult to find the true relationship between the predictors and the target variable.
- More precisely, it is challenging to identify which predictor variable has the major influence on the target variable.
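
One simple way to screen for multicollinearity is to compute the pairwise Pearson correlation between predictors. The sketch below does this in plain Python on made-up features; the idea that a correlation close to 1 in absolute value signals a problem is a rule of thumb, and any specific cut-off is an assumption rather than a fixed standard.

```python
import math

def pearson_correlation(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical predictors: size in m^2, the same size in square feet
# (a redundant feature), and the age of the house in years
size_m2 = [50, 70, 90, 110, 130]
size_ft2 = [s * 10.764 for s in size_m2]
age_years = [30, 5, 18, 2, 25]

r_redundant = pearson_correlation(size_m2, size_ft2)  # redundant pair
r_age = pearson_correlation(size_m2, age_years)       # weakly related pair

print(f"size_m2 vs size_ft2: {r_redundant:.3f}")
print(f"size_m2 vs age_years: {r_age:.3f}")
```

The redundant pair has a correlation of essentially 1, so keeping both features would make their individual coefficients hard to interpret; dropping one of them is a typical remedy.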

### Q6- What are the performance metrics for Regression Analysis? 
- Several performance metrics are commonly used to evaluate the accuracy and goodness of fit of regression models.
- Here are some common performance metrics for regression:
    - **Mean Absolute Error (MAE)**
    - **Mean Squared Error (MSE)**
    - **Root Mean Squared Error (RMSE)**
    - **Mean Absolute Percentage Error (MAPE)**
    - **R-squared (R2)**
- The choice of metric depends on the goals and characteristics of the regression problem to solve.
- Each metric emphasizes a different aspect of performance: the magnitude of the errors (MAE, MSE, RMSE), the relative error (MAPE), or the share of variance explained (R2).
- Considering multiple metrics is a better way to gain a comprehensive understanding of the model performance.
- Almost all regression tasks use an error measure to evaluate the model: if the error is high ==> we need either to change the model or to retrain it with more data.

### Q7- What is Mean Absolute Error (MAE) ? 

- As its name indicates, it represents the average absolute difference between the predicted values and the actual values.
- **Formula :** $$MAE = {1\over n} {\sum \limits _{i=1} ^{n}|y_{i}-\hat{y}_{i}|}$$
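
A minimal sketch of the MAE formula in plain Python; the function name and the sample values are illustrative only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences |y_i - y_hat_i|."""
    return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2 + 1) / 4
print(mae)
```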

### Q8- What is Mean Squared Error (MSE) ?
- It represents the average squared difference between the predicted values and the actual values.
- It penalizes larger errors more heavily than MAE.
- **Formula:** $$MSE = {1\over n} {\sum \limits _{i=1} ^{n}(y_{i}-\hat{y}_{i})^2}$$ 
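
A minimal sketch of the MSE formula in plain Python, using illustrative values. The single error of size 2 contributes 4 to the sum, which shows how squaring weights larger errors more heavily.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences (y_i - y_hat_i)^2."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mse = mean_squared_error(y_true, y_pred)  # (1 + 0 + 4 + 1) / 4
print(mse)
```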

### Q9- What is Root Mean Squared Error (RMSE) ? 
- It represents the square root of the MSE
- It provides a measure of the average magnitude of errors in the same units as the target variable.
- **Formula:** $$RMSE = \sqrt{MSE}$$
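
A minimal sketch of RMSE in plain Python, with illustrative values: the MSE is computed first and its square root brings the error back to the units of the target.

```python
import math

def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error."""
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

rmse = root_mean_squared_error(y_true, y_pred)  # sqrt(1.5)
print(round(rmse, 4))
```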

### Q10- What is Mean Absolute Percentage Error (MAPE) ? 
- It calculates the average percentage difference between the predicted and actual values.
- It is a relative error metric
- **Formula:** $$MAPE={1\over n} {\sum \limits _{i=1} ^{n}({|y_{i}-\hat{y}_{i}| \over |y_{i}|})} \times 100$$
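
A minimal sketch of MAPE in plain Python, with illustrative values. Since the formula divides by $|y_i|$, it is undefined when an actual value is zero; raising an error in that case is a choice made for this example, not a standard behavior.

```python
def mean_absolute_percentage_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| / |y_i|, expressed as a percentage."""
    if any(yt == 0 for yt in y_true):
        # The ratio is undefined for zero actual values
        raise ValueError("MAPE is undefined when an actual value is 0")
    ratios = [abs(yt - yp) / abs(yt) for yt, yp in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"{mape:.2f}%")
```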

### Q11- What is R-squared (R2) ?
- It measures the proportion of the variance in the target variable that is predictable from the independent variables.
- When the model is fitted by ordinary least squares with an intercept, it equals the squared correlation between the true values and the predicted values
- **Formula:** $$ R^2= 1 - {MSE \over Var(y) }= 1- {{\sum \limits _{i=1} ^{n}(y_{i}-\hat{y}_{i})^2} \over {\sum \limits _{i=1} ^{n}(y_{i}-\overline{y})^2}}$$
- $\overline{y}$: is the mean of the target variable
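- It is possible to use **Adjusted R-squared**, a penalized version of R-squared that accounts for the number of predictors in the model: $R^2_{adj} = 1 - (1-R^2){n-1 \over n-p-1}$, where $n$ is the number of samples and $p$ the number of predictors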
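
A minimal sketch of R-squared in plain Python, together with the usual adjusted variant $1 - (1-R^2)(n-1)/(n-p-1)$, where $n$ is the number of samples and $p$ the number of predictors. The values and the assumption of a single predictor are illustrative only.

```python
def r2_score(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_mean) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r2(r2, n_samples, n_predictors):
    """Penalize R-squared for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical actual and predicted values from a model with one predictor
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

r2 = r2_score(y_true, y_pred)
r2_adj = adjusted_r2(r2, n_samples=len(y_true), n_predictors=1)
print(f"R2 = {r2:.3f}, adjusted R2 = {r2_adj:.3f}")
```

The adjusted value is lower than the plain R-squared here, because with only four samples even one predictor uses up a noticeable share of the degrees of freedom.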


## 4- How are Decision Trees used for Regression analysis ?
### What does decision tree mean ? 
- Decision tree can be used for :
    - Classification
    - Regression 
- The tree is built by recursively splitting the dataset into smaller subsets; for regression, each leaf predicts a numeric value such as the mean target of its subset
- It can handle both categorical and numerical data 
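
To make the idea of splitting data into subsets concrete for regression, the sketch below fits a one-split tree (often called a stump) in plain Python: it tries each candidate threshold on a single feature and keeps the one with the lowest total squared error, and each side predicts the mean of its targets. The function names and data are hypothetical, and a real decision tree repeats this step recursively over several features.

```python
def fit_regression_stump(x, y):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Total squared error when each side predicts its own mean
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict_stump(stump, x_new):
    """Predict the mean of the side on which x_new falls."""
    threshold, left_mean, right_mean = stump
    return left_mean if x_new <= threshold else right_mean

# Hypothetical data: house sizes and prices forming two clear groups
sizes = [40, 50, 60, 100, 110, 120]
prices = [100, 110, 120, 300, 310, 320]

stump = fit_regression_stump(sizes, prices)
print(stump)
print(predict_stump(stump, 55), predict_stump(stump, 115))
```

On this data the chosen threshold falls between the two groups of houses, and each new size is assigned the mean price of its group.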

### What does pruning a decision tree mean?
- Pruning is a technique in ML that reduces the size of a decision tree (DT) ==> to reduce the complexity of the final model 
- Pruning helps improve the predictive accuracy on unseen data by reducing overfitting 
- Pruning can occur in :
    - Top-down fashion 
    - Bottom-up fashion
- Top-down fashion : it traverses the nodes and trims subtrees starting at the root
- Bottom-up fashion : it begins at the leaf nodes 
### What are the popular pruning algorithms?
- Reduced error pruning :
    - Starts at the leaves: each node is replaced with its most popular class (for a regression tree, the mean target value of the node)
    - If the prediction accuracy on a validation set is not affected, the change is kept 
    - It has the advantage of simplicity and speed 
### What are the different methods to split a tree in decision tree method?
Here is the list of methods :
- Variance :
    - Splitting using the variance
    - Preferred when the target variable is continuous
    - Variance is ......
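- Information gain :
    - Splitting the nodes using information gain
    - Preferred when the target variable is categorical
    - Formula : $IG = Entropy(parent) - \sum_{k} {n_k \over n} Entropy(child_k)$, where $n_k$ is the number of samples in child $k$ and $n$ the number of samples in the parent
    - $Entropy =-{\sum p_{i} \log_{2}p_{i}}$, where $p_i$ is the proportion of samples belonging to class $i$
- Gini Impurity :
    - Splitting the nodes of a decision tree using Gini impurity 
    - Preferred when the target variable is categorical
    - Formula : $Gini = 1 - \sum p_{i}^2$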
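
The three criteria share the same pattern: measure the impurity of the parent node, subtract the size-weighted impurity of the children, and prefer the split with the largest reduction. The sketch below illustrates this in plain Python on made-up data; the helper names (`split_gain` and the impurity functions) are hypothetical, and variance here means the mean squared deviation from the node mean.

```python
import math
from collections import Counter

def variance(values):
    """Mean squared deviation from the mean (continuous targets)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def entropy(labels):
    """-sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(impurity, parent, left, right):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

# Continuous target: a split that separates low prices from high prices
prices = [100, 110, 120, 300, 310, 320]
variance_reduction = split_gain(variance, prices, prices[:3], prices[3:])

# Categorical target: a split that separates the two classes perfectly
labels = ["yes", "yes", "yes", "no", "no", "no"]
information_gain = split_gain(entropy, labels, labels[:3], labels[3:])
gini_gain = split_gain(gini, labels, labels[:3], labels[3:])

print(variance_reduction, information_gain, gini_gain)
```

For the perfectly separating split on the balanced two-class labels, the information gain is 1 bit and the Gini reduction is 0.5, the largest values these measures can take for two balanced classes.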
    

## 5- What does Random Forest mean and how does it work?

## 6- What does Bayesian Regression mean and how does it work?

## 7- How are Gradient Boosting Algorithms (e.g., XGBoost, LightGBM) used to solve regression problems?

## 8- How can we use Neural Networks to solve regression problems?

## 9- How to determine whether a predictive model is underfitting or overfitting the training data?

## 10- What is Lasso Regression and how does it work?

## 11- What is Ridge Regression and how does it work?