# Linear Regression in Machine Learning

Linear regression is a statistical method used to examine the relationship between a dependent variable (also called the response or outcome variable) and one or more independent variables (also called predictor or explanatory variables).

A linear regression model assumes that there is a linear relationship between the dependent variable and the independent variables. That is, a one-unit change in an independent variable changes the expected value of the dependent variable by a constant amount, regardless of the variable's starting value.

**Example:** predicting housing prices based on features such as the number of bedrooms, square footage, and location.

<img src="Linear-reg1.png" width="350" height="300" />

Mathematically, a linear regression with a single independent variable can be written as:

$y=a_0+a_1 x + \epsilon$

where

- $y$ = dependent variable (target variable)
- $x$ = independent variable (predictor variable)
- $a_0$ = intercept of the line (gives an additional degree of freedom)
- $a_1$ = linear regression coefficient (scale factor applied to each input value)
- $\epsilon$ = random error
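
As a minimal sketch of how $a_0$ and $a_1$ might be estimated from data, the block below uses the ordinary least squares closed form for a single predictor. The function name `fit_simple_linear_regression` and the synthetic data (generated from $y = 2 + 3x$ plus noise) are assumptions made for illustration only.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Estimate the intercept a_0 and slope a_1 by ordinary least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    a1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    a0 = y_mean - a1 * x_mean
    return a0, a1

# Synthetic data (illustrative values): y = 2 + 3x + random error
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)

a0, a1 = fit_simple_linear_regression(x, y)
print(f"estimated intercept a_0 = {a0:.2f}, estimated slope a_1 = {a1:.2f}")
```

The estimates should land close to the values used to generate the data, and the gap between them reflects the random error term $\epsilon$.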


## Types of Linear Regression

Linear regression is commonly divided into two types:

1. **Simple Linear Regression:** If a single independent variable is used to predict the value of a numerical dependent variable, the algorithm is called Simple Linear Regression.
2. **Multiple Linear Regression:** If more than one independent variable is used to predict the value of a numerical dependent variable, the algorithm is called Multiple Linear Regression.
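
With several predictors, the model is usually written as $y = a_0 + a_1 x_1 + \dots + a_n x_n + \epsilon$. The sketch below fits such a model with NumPy's `np.linalg.lstsq`. The two predictors, the coefficient values, and all variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Two hypothetical predictors (for example, standardized bedrooms and square footage)
X = rng.normal(size=(n, 2))
true_coef = np.array([1.5, -2.0])   # assumed values, used only to generate data
y = 4.0 + X @ true_coef + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that the first fitted coefficient is the intercept.
X_design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, slopes = coef[0], coef[1:]

print("intercept:", round(float(intercept), 3))
print("slopes:", np.round(slopes, 3))
```

Simple linear regression is the special case where `X` has a single column.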

### Relationship of regression lines

- A straight line showing the relationship between the dependent and independent variables is called a regression line.
- A regression line can show two types of relationship:

1. **Positive Linear Relationship:** If the dependent variable (on the Y-axis) increases as the independent variable (on the X-axis) increases, the relationship is called a positive linear relationship.

<img src="https://static.javatpoint.com/tutorial/machine-learning/images/linear-regression-in-machine-learning2.png" width="300" height="250" />

2. **Negative Linear Relationship:** If the dependent variable (on the Y-axis) decreases as the independent variable (on the X-axis) increases, the relationship is called a negative linear relationship.

<img src="https://static.javatpoint.com/tutorial/machine-learning/images/linear-regression-in-machine-learning3.png" width="300" height="250" />
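
In terms of the fitted line, the two cases above correspond to the sign of the slope $a_1$. The sketch below reads the direction from a degree-1 fit with `np.polyfit`. The helper name `relationship_direction` and the small data arrays are made up for illustration.

```python
import numpy as np

def relationship_direction(x, y):
    """Classify the linear relationship by the sign of the fitted slope."""
    slope, _intercept = np.polyfit(x, y, deg=1)
    if slope > 0:
        return "positive"
    if slope < 0:
        return "negative"
    return "none"

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # rises as x rises
y_down = np.array([9.9, 8.2, 5.8, 4.1, 2.0])   # falls as x rises

print(relationship_direction(x, y_up))    # positive
print(relationship_direction(x, y_down))  # negative
```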

## Steps in building a regression model

1. **STEP 1: Collect/Extract Data:** The first step in building a regression model is to collect or extract data on the dependent (outcome) variable and the independent (feature) variables from different data sources. Data collection can often be time-consuming and expensive, even when the organization has a well-designed enterprise resource planning (ERP) system.

2. **STEP 2: Pre-Process the Data:** Before the model is built, it is essential to check the quality of the data for issues such as reliability, completeness, usefulness, accuracy, missing data, and outliers.
    - Data imputation techniques may be used to deal with missing data. Descriptive statistics and visualization (such as box plots and scatter plots) may be used to identify outliers and variability in the dataset.
    - New variables (such as the ratio or product of existing variables) can be derived (also known as feature engineering) and used in model building.
    - Categorical data must be pre-processed using dummy variables (part of feature engineering) before it is used in the regression model.


3. **STEP 3: Dividing Data into Training and Validation Datasets:** In this stage the data is divided into two subsets (sometimes more than two): a training dataset and a validation or test dataset. The training dataset usually makes up between 70% and 80% of the data, and the remaining data is treated as the validation data. The subsets may be created using a random or stratified sampling procedure. This step is important because it measures the performance of the model on data that was not used in model building, which is essential for detecting overfitting. In many cases, multiple training and test datasets are used (called cross-validation).
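
Steps 2 and 3 can be sketched with NumPy alone, as below. This is one possible approach under stated assumptions, not a prescribed pipeline: the toy housing data, mean imputation as the imputation technique, the 80/20 ratio, and all variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical raw data: square footage with missing values, a categorical
# location label, and a price in thousands (all values are made up).
sqft = np.array([1400.0, np.nan, 2100.0, 1750.0, np.nan,
                 1600.0, 2400.0, 1900.0, 1300.0, 2000.0])
location = np.array(["north", "south", "east", "north", "east",
                     "south", "north", "east", "south", "north"])
price = np.array([240.0, 310.0, 365.0, 300.0, 330.0,
                  280.0, 410.0, 335.0, 225.0, 345.0])

# Step 2a: mean imputation replaces each missing value with the observed mean.
sqft_filled = np.where(np.isnan(sqft), np.nanmean(sqft), sqft)

# Step 2b: dummy variables for the categorical feature. The first category is
# dropped so the dummies are not perfectly collinear with the intercept.
categories, codes = np.unique(location, return_inverse=True)
dummies = np.eye(len(categories))[codes][:, 1:]

X = np.column_stack([sqft_filled, dummies])

# Step 3: random 80/20 split into training and test subsets.
indices = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = indices[:n_train], indices[n_train:]
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = price[train_idx], price[test_idx]

# Fit on the training subset only, then compare errors on both subsets.
train_design = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(train_design, y_train, rcond=None)

def rmse(X_part, y_part):
    """Root mean squared error of the fitted model on one subset."""
    design = np.column_stack([np.ones(len(X_part)), X_part])
    return float(np.sqrt(np.mean((design @ coef - y_part) ** 2)))

print("train RMSE:", rmse(X_train, y_train))
print("test RMSE:", rmse(X_test, y_test))
```

A test error much larger than the training error is a common sign of overfitting. With a dataset this small the comparison is only illustrative.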