# Regression

**Regression** is a supervised machine learning technique that predicts continuous values by modeling the relationship between independent and dependent variables. Typical targets include house prices, market trends, weather patterns, and oil and gas prices.

The goal of a regression algorithm is to fit the line or curve that best describes the data. The trained model can then be used to predict the target for new data points.

**Terminologies Related to Regression Analysis:**
1. Response Variable: The primary factor to predict or understand in regression, also known as the dependent variable or target variable.
2. Predictor Variable: Factors influencing the response variable, used to predict its values; also called independent variables.
3. Outliers: Observations with significantly lower or higher values than the rest. They can distort the fitted model, so they should be investigated and handled before fitting.
4. Multicollinearity: High correlation among independent variables, which makes coefficient estimates unstable and complicates the ranking of influential variables.
5. Underfitting and Overfitting: Overfitting occurs when a model performs well on the training data but poorly on the test data, while underfitting means poor performance on both.


**Regression Types**

**1. Simple Regression** - Used to predict a continuous dependent variable based on a **single independent variable**.

    Y = a + bX + c

- Y = the (dependent) variable to be predicted, plotted on the Y-axis
- X = the (independent) variable used to predict Y, plotted on the X-axis
- a = the intercept
- b = slope
- c = regression residual

Example: weight tends to increase with height. To predict weight from height, weight is the dependent variable (Y-axis) and height is the independent variable (X-axis).
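
The intercept and slope can be computed directly with ordinary least squares. The following is a minimal sketch in plain Python, treating height as the predictor X and weight as the response Y. The function name `fit_simple_regression` and the data are made up for illustration.

```python
from statistics import mean

def fit_simple_regression(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals of y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b

# Made-up illustrative data: height in cm (X) and weight in kg (Y).
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 66, 74, 80]

a, b = fit_simple_regression(heights, weights)

# Residuals are the part of each observation the line does not explain.
residuals = [y - (a + b * x) for x, y in zip(heights, weights)]

# Predict the weight of a new person who is 175 cm tall.
predicted_175 = a + b * 175
```

For this data the fitted slope is 0.72 kg per cm, and the residuals sum to zero (up to rounding), which is a property of least-squares fits that include an intercept.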

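**2. Multiple/Multivariate Regression** - Used to predict a continuous dependent variable based on **multiple independent variables**. (Strictly speaking, "multivariate" regression refers to models with more than one dependent variable, but the term is often used loosely for this case.)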

    Y = a + b1X1 + b2X2 + b3X3 + ... + bnXn + c

X1, X2, X3, etc. are the independent variables. For example, predicting a person's salary may require several variables, such as years of experience, skills, and geographic location.
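
One way to estimate the coefficients a, b1, ..., bn is to solve the normal equations (XᵀX)w = Xᵀy. The sketch below does this in plain Python with a small Gaussian elimination routine. The helper names and the salary data are hypothetical, and in practice a numerical library would normally be used instead.

```python
def solve_linear_system(A, rhs):
    """Solve A w = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def fit_multiple_regression(rows, ys):
    """Return [a, b1, ..., bn] from the normal equations (X^T X) w = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # the leading 1.0 column yields the intercept a
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(XtX, Xty)

# Made-up data generated from: salary = 30 + 2*years + 5*skills (in thousands).
features = [(1, 2), (3, 1), (5, 4), (7, 3), (9, 5)]
salaries = [30 + 2 * yrs + 5 * sk for yrs, sk in features]

a, b1, b2 = fit_multiple_regression(features, salaries)
```

Because the data here is noise-free, the fit recovers the generating coefficients (30, 2, 5) up to rounding; with real data the residual term c would absorb the unexplained variation.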

**3. Nonlinear Regression** - Used when the relationship between the dependent variable and the independent variable(s) follows a nonlinear pattern.


**Regression Algorithms**

**1. Linear Regression**
- Linear regression is one of the simplest and most widely used statistical models.
- It assumes a linear relationship between the dependent variable and the independent variables. This means that the change in the dependent variable is proportional to the change in the independent variables.

**2. Logistic regression** 
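- Logistic regression is used when the target variable is binary, that is, it has two classes. Despite its name, it is a classification algorithm rather than a method for predicting continuous values.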
- It models the probability of an event occurring -- for example, yes/no or success/failure -- based on predictor variables. 
- Logistic regression is commonly used in business contexts for binary classification tasks such as customer churn prediction or transaction fraud detection.
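
To make the "probability of an event" idea concrete, here is a minimal sketch that fits a one-feature logistic model by batch gradient descent on the log loss. The function names, learning rate, epoch count, and churn data are all assumptions chosen for illustration, and a library implementation would normally be preferred.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, labels, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(a + b*x) by batch gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(a + b * x) - y for x, y in zip(xs, labels)]
        grad_a = sum(errors) / n
        grad_b = sum(e * x for e, x in zip(errors, xs)) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Made-up data: monthly support tickets (x) and whether the customer churned (1) or not (0).
tickets = [0, 1, 2, 3, 4, 5, 6, 7]
churned = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_logistic(tickets, churned)
p_low = sigmoid(a + b * 0)   # estimated churn probability with 0 tickets
p_high = sigmoid(a + b * 7)  # estimated churn probability with 7 tickets
```

The model outputs a probability; turning it into a yes/no decision requires a threshold (0.5 is a common default, though the right value depends on the cost of each kind of error).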

**3. Polynomial Regression**
- Polynomial regression is used to model nonlinear relationships between the dependent variable and the independent variables. 
- It adds polynomial terms to the linear regression model to capture more complex relationships.
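
Polynomial regression is still linear in its coefficients, so it can be fitted with the same least-squares machinery after expanding x into the features 1, x, x², and so on. The sketch below illustrates this in plain Python; the helper names and data are hypothetical.

```python
def solve_linear_system(A, rhs):
    """Solve A w = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [c0, c1, ..., c_degree] of y = sum(c_k * x**k)."""
    X = [[float(x) ** k for k in range(degree + 1)] for x in xs]  # polynomial features
    k = degree + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(XtX, Xty)

def sum_squared_error(coeffs, xs, ys):
    """Total squared error of the polynomial with the given coefficients."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds))

# Made-up data generated from the curve y = 1 + 2x + 3x^2.
xs = [-2, -1, 0, 1, 2, 3]
ys = [1 + 2 * x + 3 * x ** 2 for x in xs]

line = fit_polynomial(xs, ys, degree=1)       # straight line: underfits the curve
quadratic = fit_polynomial(xs, ys, degree=2)  # recovers roughly [1, 2, 3]
```

Comparing `sum_squared_error(line, xs, ys)` with `sum_squared_error(quadratic, xs, ys)` shows how the added x² term captures the curvature that the straight line misses.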

**4. Support Vector Regression (SVR)**
- Support vector machines (SVMs) are best known for classification tasks, but the same idea can also be used for regression tasks.
- SVR fits a function that is as flat as possible while keeping most training points within a margin of width epsilon around it. Errors inside this margin are ignored, and only larger deviations are penalized.
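
SVR is usually formulated with an epsilon-insensitive loss, which differs from the squared-error loss of ordinary least squares. The short sketch below compares the two on made-up predictions; the function names and the choice of epsilon = 0.5 are assumptions for illustration, and it shows only the loss, not a full SVR solver.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    """Total SVR-style loss: errors inside the epsilon tube cost nothing."""
    return sum(max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred))

def squared_loss(y_true, y_pred):
    """Total squared error, as minimized by ordinary least squares."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [3.2, 4.6, 7.0, 11.0]  # the last prediction is off by 2.0

eps_loss = epsilon_insensitive_loss(actual, predicted)
sq_loss = squared_loss(actual, predicted)
```

Here the first three predictions fall inside the tube and contribute nothing to `eps_loss`, while the large error grows only linearly; under squared loss the same error dominates the total.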

**5. Time series regression**
- Time series regression models, such as autoregressive integrated moving average (ARIMA) models, incorporate time dependencies and trends to forecast future values based on past observations.
- These are useful for business applications such as sales forecasting, demand prediction and stock market analysis.

**6. Decision Tree Regression**
- Decision tree regression is a type of regression algorithm that builds a decision tree to predict the target value. 
- A decision tree is a tree-like structure that consists of nodes and branches. Each node represents a decision, and each branch represents the outcome of that decision. 
- The goal of decision tree regression is to build a tree that can accurately predict the target value for new data points.
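
The core step of building a regression tree is choosing a split that makes the target values in each branch as similar as possible. The sketch below fits a depth-1 tree (a "stump") on one feature by minimizing the total squared error of the two leaves. The names `fit_stump` and `predict_stump` and the data are hypothetical, and real implementations split recursively and across many features.

```python
def sse(values):
    """Sum of squared errors around the mean of `values`."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(xs, ys):
    """Find the threshold on x that minimizes the total SSE of the two leaves."""
    best = None
    pts = sorted(set(xs))
    for lo, hi in zip(pts, pts[1:]):
        threshold = (lo + hi) / 2  # candidate split halfway between neighbors
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold, sum(left) / len(left), sum(right) / len(right))
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict_stump(stump, x):
    """Each leaf predicts the mean target of the training points it contains."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Made-up data with two clearly separated groups.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

stump = fit_stump(xs, ys)
```

On this data the chosen threshold is 6.5, and the two leaves predict 6.0 and 21.0, the means of the two groups.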

**7. Random Forest Regression**
- Random forest regression is an ensemble method that combines multiple decision trees to predict the target value. 
- Ensemble methods are a type of machine learning algorithm that combines multiple models to improve the performance of the overall model. 
- Random forest regression works by building a large number of decision trees, each of which is trained on a different random (bootstrap) sample of the training data, typically considering only a random subset of features at each split.
- The final prediction is made by averaging the predictions of all of the trees.
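
The build-many-trees-and-average idea can be sketched with very small trees. The example below trains depth-1 trees on bootstrap samples and averages their predictions. All names, the tree count, and the data are assumptions for illustration; because there is only one feature, no feature subsampling takes place, so this is closer to plain bagging than to a full random forest.

```python
import random

def mean(values):
    return sum(values) / len(values)

def fit_stump(xs, ys):
    """Depth-1 regression tree: pick the threshold with the lowest total squared error."""
    best = None
    pts = sorted(set(xs))
    for lo, hi in zip(pts, pts[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        cost = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    if best is None:  # all x values identical: fall back to a constant prediction
        return (xs[0], mean(ys), mean(ys))
    return best[1:]

def fit_forest(xs, ys, n_trees=25, seed=0):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = rng.choices(range(n), k=n)
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return mean([lm if x <= t else rm for t, lm, rm in forest])

# Made-up data with two clearly separated groups.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

forest = fit_forest(xs, ys)
low = predict_forest(forest, 2)
high = predict_forest(forest, 11)
```

Each tree sees slightly different data, so individual trees disagree; averaging smooths out those differences, which is the main reason ensembles tend to generalize better than a single tree.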

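**Regularized Linear Regression Techniques** (Prevent overfitting)

Overfitting occurs when the model fits the training data, including its noise, so closely that it is unable to generalize to new data. Regularization counters this by adding a penalty on the size of the coefficients to the cost function.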

**1. Ridge Regression** 
- The cost function is altered by adding a penalty proportional to the sum of the squared coefficients (an L2 penalty). This shrinks the coefficients toward zero but does not set them exactly to zero.
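
For a single feature, the effect of the squared-coefficient penalty can be seen in closed form: minimizing sum((y - a - b*x)²) + alpha * b² gives b = Sxy / (Sxx + alpha). The sketch below uses that formula; the function name `ridge_slope`, the symbol `alpha` for the penalty strength, and the data are assumptions for illustration.

```python
def ridge_slope(xs, ys, alpha):
    """Slope minimizing sum((y - a - b*x)^2) + alpha * b^2 (intercept not penalized)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / (sxx + alpha)  # alpha = 0 gives the ordinary least-squares slope

# Made-up data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

slope_ols = ridge_slope(xs, ys, alpha=0)      # no penalty
slope_mild = ridge_slope(xs, ys, alpha=10)    # moderate penalty
slope_strong = ridge_slope(xs, ys, alpha=90)  # strong penalty
```

As `alpha` grows, the slope moves from 2.0 to 1.0 to 0.2: it shrinks steadily toward zero but never reaches it exactly.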

**2. Lasso regression** 
- It uses shrinkage, where coefficient estimates are shrunk toward zero.
- It does this by adding a penalty proportional to the sum of the absolute values of the coefficients (an L1 penalty) to the loss function. This can set some weights exactly to zero while keeping others, so it also acts as a form of feature selection.
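
With a single feature, the lasso solution also has a closed form based on soft-thresholding: minimizing 0.5 * sum((y - a - b*x)²) + alpha * |b| gives b = soft(Sxy, alpha) / Sxx. The sketch below uses that formula to show a coefficient being set exactly to zero; the function names, the symbol `alpha`, and the data are assumptions for illustration.

```python
import math

def soft_threshold(z, t):
    """Shrink z toward zero by t, and return exactly 0.0 when |z| <= t."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def lasso_slope(xs, ys, alpha):
    """Slope minimizing 0.5 * sum((y - a - b*x)^2) + alpha * |b| (intercept not penalized)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return soft_threshold(sxy, alpha) / sxx

# Made-up data lying exactly on y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

slope_ols = lasso_slope(xs, ys, alpha=0)      # no penalty
slope_mild = lasso_slope(xs, ys, alpha=5)     # shrunk, still nonzero
slope_zeroed = lasso_slope(xs, ys, alpha=25)  # penalty large enough to remove the feature
```

Here the slope goes from 2.0 to 1.5 and then to exactly 0.0, in contrast to a squared (L2) penalty, which shrinks coefficients without zeroing them.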


**Characteristics of Regression**

1. Continuous Target Variable: 
- Regression deals with predicting continuous target variables that represent numerical values. 
- Examples include predicting house prices, forecasting sales figures, or estimating patient recovery times.

2. Error Measurement: 
- Regression models are evaluated based on their ability to minimize the error between the predicted and actual values of the target variable. 
- Common error metrics include mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).

3. Model Complexity: 
- Regression models range from simple linear models to more complex nonlinear models. 
- The choice of model complexity depends on the complexity of the relationship between the input features and the target variable.

4. Overfitting and Underfitting: Regression models are susceptible to overfitting and underfitting.

5. Interpretability: Simple linear models are highly interpretable, while more complex models may be more difficult to interpret.


**Examples of regression**
1. Sales forecasting
2. Customer lifetime value prediction
3. Churn prediction - Predicting the likelihood of customers leaving the company's services based on their usage patterns, customer interactions and other related features.
4. Employee performance prediction - Predicting the performance of employees based on various factors such as training, experience and demographics.
5. Financial performance analysis - Understanding the relationship between financial metrics (e.g., revenue, profit) and key drivers (e.g., marketing expenses, operational costs).
6. Risk analysis and fraud detection - Predicting the likelihood of events such as credit defaults, insurance claims or fraud based on historical data and risk indicators.
7. Maintenance prediction - Predicting time to failure of critical parts and machinery.


**Regression Evaluation Metrics**

1. Mean Absolute Error (MAE): The average absolute difference between the predicted and actual values of the target variable.

2. Mean Squared Error (MSE): The average squared difference between the predicted and actual values of the target variable.

3. Root Mean Squared Error (RMSE): The square root of the mean squared error.

4. Huber Loss: A hybrid loss function that is quadratic (like MSE) for small errors and linear (like MAE) for errors larger than a threshold delta, making it less sensitive to outliers than MSE.

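5. Root Mean Squared Logarithmic Error (RMSLE): The square root of the mean squared difference between log(1 + predicted) and log(1 + actual) values. It penalizes relative rather than absolute errors and requires non-negative values.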

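6. R2 Score (coefficient of determination): The proportion of variance in the target variable explained by the model. Higher values indicate a better fit. The maximum is 1, and the score can be negative when the model fits worse than always predicting the mean.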
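
The metrics above are short enough to write out directly. The following is a minimal sketch in plain Python; the function names, the Huber threshold `delta=1.0`, and the data are assumptions for illustration.

```python
import math

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        e = abs(t - p)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(y_true)

def rmsle(y_true, y_pred):
    """Uses log(1 + value), so all values must be greater than -1."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))

def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 12.0]      # errors of 1, 0, 1 and 3
poor_guess = [12.0, 12.0, 12.0, 12.0]  # worse than predicting the mean of `actual`

scores = {
    "mae": mae(actual, predicted),
    "mse": mse(actual, predicted),
    "rmse": rmse(actual, predicted),
    "huber": huber(actual, predicted),
    "rmsle": rmsle(actual, predicted),
    "r2": r2_score(actual, predicted),
    "r2_poor": r2_score(actual, poor_guess),
}
```

On this data MAE is 1.25, MSE is 2.75, the Huber loss is 0.875, and R2 is 0.45. The single error of 3 accounts for most of the MSE but much less of the MAE and Huber loss, and `r2_poor` comes out negative because the constant guess is worse than predicting the mean.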