# Data Scientist Interview Questions
## Part 3 : ML : Regression Analysis

This Jupyter notebook is a reference for machine learning enthusiasts and for anyone preparing for technical interviews in data science and machine learning. It gathers the essential concepts of regression analysis in a concise form, and it can be used either to strengthen foundational knowledge or to review specific regression topics before an interview.

### 0- What does regression analysis mean?
- Regression analysis is a statistical technique used in data science and statistics to model the relationship between a dependent variable and one or more independent variables.
- The primary goal of regression analysis is to understand the nature and strength of the relationship between variables and to make predictions based on that understanding.
- It helps answer questions such as:
    - How does a change in one variable (independent variable) impact another variable (dependent variable)?
    - Can we predict the value of the dependent variable based on the values of one or more independent variables?
- Regression formula: $Y = W \times X + b$
- Where:
    - Y: dependent variable (response variable): the variable that we want to predict or explain.
    - X: independent variable(s) (predictor variable(s)): the variable(s) used to predict or explain the variability in the dependent variable.
    - W: regression coefficients: these are the parameters in the regression equation that indicate the strength and direction of the relationship between variables.
    - b: bias term (intercept): the predicted value of Y when all independent variables are zero.
- Residuals:
    - Differences between the observed values ($y(n)$) and the predicted values ($\hat{y}(n)$), representing the model's error.
    - Formula: $e(n) = y(n) - \hat{y}(n)$
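
The regression formula and the residuals can be illustrated with a minimal sketch. It assumes a single independent variable and fits $W$ and $b$ with ordinary least squares in closed form; the helper name `fit_simple_linear_regression` and the data points are made up for illustration and are not part of any library.

```python
# Minimal sketch: fit Y = W * X + b for one feature with ordinary least squares.
# The function name and the data below are illustrative assumptions.

def fit_simple_linear_regression(xs, ys):
    """Return (w, b) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("All x values are identical; the slope is undefined.")
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.5]

w, b = fit_simple_linear_regression(xs, ys)
predictions = [w * x + b for x in xs]

# Residual e(n) = y(n) - y_hat(n): observed value minus predicted value.
residuals = [y - y_hat for y, y_hat in zip(ys, predictions)]

print(f"W = {w:.2f}, b = {b:.2f}")
print("residuals:", [round(e, 2) for e in residuals])
```

With an intercept in the model, the least-squares residuals sum to zero (up to floating-point error), which is a quick sanity check on the fit.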


### 3- Examples of well-known machine learning algorithms used to solve regression problems

Some well-known machine learning algorithms commonly used to solve regression problems:

- Linear Regression
- Lasso Regression
- Ridge Regression
- Decision Trees
- Random Forest
- Gradient Boosting Algorithms (e.g., XGBoost, LightGBM)
- Support Vector Machines (SVM), in the form of Support Vector Regression (SVR)
- K-Nearest Neighbors (KNN)
- Bayesian Regression
- Neural Networks (Deep Learning)

### 15- What are the performance metrics for Regression? 
- Several performance metrics are commonly used to evaluate the accuracy and goodness of fit of regression models.
- Here are some common performance metrics for regression:
    - **Mean Absolute Error (MAE)**
    - **Mean Squared Error (MSE)**
    - **Root Mean Squared Error (RMSE)**
    - **Mean Absolute Percentage Error (MAPE)**
    - **R-squared (R2)**
- The choice of metric depends on the goals and characteristics of the regression problem, for example whether large errors should be penalized more heavily or whether errors should be expressed relative to the scale of the target.
- The metrics above capture different aspects of performance: average error size (MAE), sensitivity to large errors (MSE, RMSE), relative error (MAPE), and the proportion of variance explained (R2).
- Considering multiple metrics gives a more comprehensive understanding of model performance.
- Almost all regression tasks use an error measure to evaluate the model: if the error is high, we need either to change the model or to retrain it with more data.

#### 15. 1- What is Mean Absolute Error (MAE) ? 

- As its name indicates, it represents the average absolute difference between the predicted values and the actual values.
- **Formula:** $$MAE = {1\over n} {\sum \limits _{i=1} ^{n}|y_{i}-\hat{y}_{i}|}$$
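
As a minimal sketch of the MAE formula, the snippet below computes it in plain Python. The function name `mean_absolute_error` and the sample values are illustrative assumptions.

```python
# Minimal sketch of MAE in plain Python; the sample values are made up.

def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Absolute errors are 0.5, 0.0, 2.0 and 1.0, so MAE = 3.5 / 4.
mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 0.875
```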

#### 15. 2- What is Mean Squared Error (MSE) ?
- It represents the average squared difference between the predicted values and the actual values.
- It penalizes larger errors more heavily than MAE.
- **Formula:** $$MSE = {1\over n} {\sum \limits _{i=1} ^{n}(y_{i}-\hat{y}_{i})^2}$$ 
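
The following sketch computes MSE in plain Python on made-up values, and prints how much of the total squared error comes from the single largest error. The function name `mean_squared_error` is an illustrative assumption.

```python
# Minimal sketch of MSE in plain Python; the sample values are made up.

def mean_squared_error(y_true, y_pred):
    """Average of (y_i - y_hat_i)^2 over all samples."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Squared errors are 0.25, 0.0, 4.0 and 1.0, so MSE = 5.25 / 4.
mse = mean_squared_error(y_true, y_pred)
print(mse)  # 1.3125

# The largest error (2.0) accounts for 4.0 of the 5.25 total squared error,
# which is how MSE penalizes large errors more heavily than MAE.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
largest_share = max(squared_errors) / sum(squared_errors)
print(round(largest_share, 2))
```
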
#### 15. 3- What is Root Mean Squared Error (RMSE) ? 
- It represents the square root of the MSE
- It provides a measure of the average magnitude of errors in the same units as the target variable.
- **Formula:** $$RMSE = \sqrt{MSE}$$
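
A minimal sketch of RMSE in plain Python, using only the standard `math` module. The function name `root_mean_squared_error` and the sample values are illustrative assumptions.

```python
import math

# Minimal sketch of RMSE in plain Python; the sample values are made up.

def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error, in the units of the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# MSE is 1.3125 for these values, so RMSE is its square root.
rmse = root_mean_squared_error(y_true, y_pred)
print(round(rmse, 3))  # 1.146
```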

#### 15. 4- What is Mean Absolute Percentage Error (MAPE) ? 
- It calculates the average percentage difference between the predicted and actual values.
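- It is a relative error metric, so it is undefined when an actual value $y_{i}$ is zero and becomes very large when actual values are close to zero.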
- **Formula:** $$MAPE={1\over n} {\sum \limits _{i=1} ^{n}({|y_{i}-\hat{y}_{i}| \over |y_{i}|})} \times 100$$
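
A minimal sketch of MAPE in plain Python follows. Raising an error when an actual value is zero is one possible design choice assumed here, not the only one; the function name `mean_absolute_percentage_error` and the sample values are illustrative.

```python
# Minimal sketch of MAPE in plain Python; the sample values are made up.

def mean_absolute_percentage_error(y_true, y_pred):
    """Average of |y_i - y_hat_i| / |y_i|, expressed as a percentage."""
    if any(t == 0 for t in y_true):
        # Assumed handling: the ratio is undefined for a zero actual value.
        raise ValueError("MAPE is undefined when an actual value is zero.")
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return sum(ratios) / len(ratios) * 100

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Relative errors are 0.5/3, 0/5, 2/2 and 1/7.
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"{mape:.2f}%")  # 32.74%
```
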
#### 15. 5- What is R-squared (R2) ?
- It measures the proportion of the variance in the target variable that is predictable from the independent variables.
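- For ordinary least squares linear regression with an intercept, evaluated on the training data, it equals the squared correlation between the true values and the predicted values.
- **Formula:** $$ R^2= 1 - {MSE \over Var(y) }$$ where $Var(y)$ is the variance of the target computed with a $1/n$ factor.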
- $$ R^2= 1- {{\sum \limits _{i=1} ^{n}(y_{i}-\hat{y}_{i})^2} \over {\sum \limits _{i=1} ^{n}(y_{i}-\overline{y})^2}}$$
- $\overline{y}$ is the mean of the target variable.
- It is possible to use **Adjusted R-squared**, a penalized version of R-squared that adjusts for model complexity (the number of independent variables).
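
The sketch below computes R-squared from the sums of squares, and Adjusted R-squared using the commonly used formula $1 - (1 - R^2){(n-1) \over (n-p-1)}$, where $p$ is the number of independent variables. That adjusted formula is not given above and is stated here as an assumption about which variant is meant; the function names and sample values are illustrative.

```python
# Minimal sketch of R-squared and Adjusted R-squared; sample values are made up.

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R-squared for the number of predictors (assumed formula)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Residual sum of squares is 5.25 and total sum of squares is 14.75.
r2 = r_squared(y_true, y_pred)
# Assume the predictions came from a model with a single feature.
adj_r2 = adjusted_r_squared(r2, n_samples=len(y_true), n_features=1)

print(round(r2, 3))      # 0.644
print(round(adj_r2, 3))  # 0.466
```

Dividing both sums of squares by $n$ gives the equivalent form $1 - MSE / Var(y)$.
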
#### a. Correlation :
- It is a measure of linear relationship between two quantitative variables. 
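- The (Pearson) correlation coefficient belongs to $[-1, 1]$: values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship.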
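
As a minimal sketch, the Pearson correlation coefficient can be computed in plain Python as the covariance divided by the product of the standard deviations. The function name `pearson_correlation` and the data are illustrative assumptions; the two examples show a perfectly increasing and a perfectly decreasing linear relationship.

```python
import math

# Minimal sketch of the Pearson correlation coefficient; data is made up.

def pearson_correlation(xs, ys):
    """Covariance of x and y divided by the product of their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined for a constant variable.")
    return cov_xy / math.sqrt(var_x * var_y)

xs = [1.0, 2.0, 3.0, 4.0]

# y increases linearly with x.
r_increasing = pearson_correlation(xs, [2.0, 4.0, 6.0, 8.0])
# y decreases linearly with x.
r_decreasing = pearson_correlation(xs, [8.0, 6.0, 4.0, 2.0])

print(r_increasing)  # 1.0
print(r_decreasing)  # -1.0
```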