# Multiple Linear Regression

## Assumptions of Linear Regression

![Assumptions of Linear Regression](Multiple-Linear-Regression-Intuition-00.PNG)

Before choosing a Linear Regression Model, we first have to check whether these assumptions hold for our dataset.

## Intuition behind Multiple Linear Regression

Linear Regression is useful when we want to predict a continuous numerical value for linear datasets (i.e. datasets with a linear relationship between the Features and the Target Variable). Depending on the number of features, Linear Regression can be:

1.   Simple Linear Regression - only a single feature
2.   Multiple Linear Regression - multiple features


Multiple Linear Regression involves finding the best-fitting linear function (a line for a single feature, a plane or hyperplane for several) that relates multiple features to the target variable.
![Multiple Linear Regression Equation](Multiple-Linear-Regression-Intuition-01.PNG)
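
Written out, the standard form of this model is shown below. The symbol names ($b_0$ for the constant, $b_i$ for the coefficient of feature $x_i$) are an assumption and may differ from those used in the figure.

$$
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
$$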

### Problem Statement

Let's say we have a dataset containing profit data for 50 startups whose spending on R&D, Administration and Marketing varies widely. Each startup is located in one of 3 states: New York, California or Florida. Here,


*   Independent Variables/Features = R&D Spend, Administration Spend, Marketing Spend, State
*   Dependent/Target Variable = Profit

The data is to be analyzed for a Venture Capitalist Fund to find out whether Profit is related to the amounts spent under each head (R&D, Administration, Marketing) and to the state where the startup is located. Since we have multiple features and the Target Variable (Profit) is continuous, this is a Multiple Linear Regression problem.

How this scenario fits into Multiple Linear Regression can be visualized as below:
![Multiple Linear Regression for Profit Prediction: Problem](Multiple-Linear-Regression-Intuition-02.PNG)

### Dummy Variables

As you can see, we encounter difficulty in mapping the 'State' column into the Multiple Linear Regression Equation, because it holds category names rather than numbers. We first need to convert such categorical columns into numerical values using $Dummy$ $Variables$. One-hot Encoding can be used to achieve that.

Suppose we have only two States, New York and California. Then we can apply Dummy Variables as below:
![Multiple Linear Regression for Profit Prediction: Dummy Variables](Multiple-Linear-Regression-Intuition-03.PNG)

Only one Dummy Column needs to be retained in this case, since it fully represents our State information.

$D_1 = 0 : State = California$

$D_1 = 1 : State = New York$

The phenomenon where one or several independent variables in a Linear Regression Model predict another is called $Multicollinearity$. If we keep both dummy variables $D_1$ and $D_2$, then $D_2 = 1 - D_1$ always holds, so the model cannot separate their effects from the constant term. This is the $Dummy$ $Variable$ $Trap$. As a general rule, always omit one dummy variable from a set, irrespective of the number of dummy variables in it. If you have two sets of dummy variables, apply the same rule to each set.

![Multiple Linear Regression for Profit Prediction: Dummy Variable Trap](Multiple-Linear-Regression-Intuition-04.PNG)

But in our dataset, the 'State' column has three categories: New York, California and Florida. Hence we need to use two Dummy Variables (one fewer than the number of categories), with the omitted category acting as the baseline.
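
As a minimal sketch of this rule, the snippet below encodes a three-category 'State' column with `pandas.get_dummies`, first in full and then with one dummy dropped. The four-row sample is hypothetical and only mirrors the category names described above. Which category is dropped (here the alphabetically first one) is a property of `drop_first=True`, not a choice made by the original notebook.

```python
import pandas as pd

# Hypothetical mini-sample; only the category names mirror the dataset described above.
states = pd.DataFrame({"State": ["New York", "California", "Florida", "New York"]})

# Full one-hot encoding: one column per category (3 columns, each row sums to 1).
full = pd.get_dummies(states["State"], dtype=int)

# drop_first=True omits the first category ('California'), leaving 2 dummy columns.
# A 'California' row is then represented by zeros in both remaining columns.
reduced = pd.get_dummies(states["State"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

The omitted category is not lost: its effect is absorbed into the constant term of the regression equation.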

### Building a Model

Our requirement is to build a model that best predicts the general trend from the dataset. Hence, we need to choose the variables carefully so as not to fall into the trap of $Overfitting$ or $Underfitting$.

![Multiple Linear Regression for Profit Prediction: Building a Model](Multiple-Linear-Regression-Intuition-05.PNG)

The different methods for building a model are as below:

![Multiple Linear Regression for Profit Prediction: Methods for Building a Model](Multiple-Linear-Regression-Intuition-06.PNG)

#### Method 1: All-in

This is applicable if you are sure that you have to use all the independent variables. This can be the scenario in the following cases:

![Multiple Linear Regression - Building a Model: All-in Method](Multiple-Linear-Regression-Intuition-07.PNG)

#### Method 2: Backward Elimination

The Backward Elimination Method is an approach that yields results quickly. The steps for the Backward Elimination Method are as below:

![Multiple Linear Regression - Building a Model: Backward Elimination Method](Multiple-Linear-Regression-Intuition-08.PNG)

**Important Note:** Each time a variable is removed, the model must be refitted in Step 5, so that the coefficients and the constant are re-estimated for the remaining variables.
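
The refit-after-every-removal loop can be sketched as below. This assumes the usual form of the procedure (choose a significance level, fit with all predictors, remove the predictor with the highest p-value if it exceeds the level, refit, repeat); the exact steps in the figure may differ. The helper names `ols_p_values` and `backward_elimination`, the 0.05 significance level, and the synthetic data are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Fit ordinary least squares and return the two-sided p-value of each coefficient."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    dof = n - p
    sigma2 = residuals @ residuals / dof            # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)           # covariance of the coefficients
    t_stats = beta / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t_stats), dof)

def backward_elimination(X, y, names, sl=0.05):
    """Drop the least significant feature until every remaining p-value is <= sl."""
    kept = list(range(X.shape[1]))
    while kept:
        # The model is rebuilt on every pass, with a column of ones for the constant.
        design = np.column_stack([np.ones(len(y)), X[:, kept]])
        p_values = ols_p_values(design, y)[1:]      # skip the constant's p-value
        worst = int(np.argmax(p_values))
        if p_values[worst] <= sl:
            break
        kept.pop(worst)
    return [names[i] for i in kept]

# Synthetic data (not the startup dataset): only x0 actually drives y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 3.0 + 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

selected = backward_elimination(X_demo, y_demo, ["x0", "x1", "x2"])
print(selected)
```

Because the p-values are recomputed from a fresh fit on each pass, a variable that looked insignificant alongside a removed one gets a fair re-evaluation.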

#### Method 3: Forward Selection

The steps for Forward Selection Method are as below:

![Multiple Linear Regression - Building a Model: Forward Selection Method](Multiple-Linear-Regression-Intuition-09.PNG)

#### Method 4: Bidirectional Elimination

The steps for Bidirectional Elimination Method are as below:

![Multiple Linear Regression - Building a Model: Bidirectional Elimination Method](Multiple-Linear-Regression-Intuition-10.PNG)

#### Method 5: Score Comparison of All Possible Models

The steps for Score Comparison Method are as below:

![Multiple Linear Regression - Building a Model: Score Comparison Method](Multiple-Linear-Regression-Intuition-11.PNG)

**Important Note:** Here, we do not remove non-significant columns one at a time. Instead, we build every possible model for the dataset ($2^N - 1$ models for $N$ columns) and then pick the best one. Hence, it is a very resource-intensive approach and is not recommended for datasets with a large number of columns.
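
To illustrate why the cost grows so quickly, here is a rough sketch that fits one model per non-empty feature subset and keeps the best-scoring one. The scoring criterion is an assumption: adjusted $R^2$ is used here purely for illustration, and the figure may name a different one. The helper names and the synthetic data are hypothetical.

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression

def adjusted_r2(model, X, y):
    """Adjusted R^2: R^2 penalised for the number of features used."""
    n, p = X.shape
    r2 = model.score(X, y)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def best_subset(X, y, names):
    """Fit one model per non-empty feature subset and return the best-scoring one."""
    best_score, best_names = -np.inf, ()
    n_features = X.shape[1]
    for k in range(1, n_features + 1):
        for cols in combinations(range(n_features), k):
            subset = X[:, list(cols)]
            model = LinearRegression().fit(subset, y)
            score = adjusted_r2(model, subset, y)
            if score > best_score:
                best_score, best_names = score, tuple(names[i] for i in cols)
    return best_names, best_score

# Synthetic data (not the startup dataset): only x0 actually drives y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 3.0 + 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)

features, score = best_subset(X_demo, y_demo, ["x0", "x1", "x2"])
print(features, round(score, 4))
```

With 3 features the loops fit 7 models. With 10 features they would fit 1,023, which is why this method does not scale.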

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
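# Load the startup data; the last column (Profit) is the target
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values  # features: every column except the last
y = dataset.iloc[:, -1].values   # target: the last column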

## Encoding categorical data

In [None]:
# One-hot encode the 'State' column (index 3)
# It is replaced by 3 new dummy columns (one per category), placed before the remaining columns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
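
To see what this transformation produces, the sketch below applies the same `ColumnTransformer` to three hypothetical rows laid out as [R&D, Administration, Marketing, State]. The numbers are made up for illustration. The encoder orders categories alphabetically, so the three leading columns correspond to California, Florida and New York.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows in the layout [R&D, Administration, Marketing, State].
X_demo = np.array([
    [100.0, 50.0, 30.0, "New York"],
    [80.0, 60.0, 20.0, "California"],
    [90.0, 55.0, 25.0, "Florida"],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X_demo))

# Each row now starts with 3 dummy columns, followed by the 3 numeric columns.
print(X_encoded)
```

Note that all three dummy columns are kept here, exactly as in the cell above.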

## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Training the Multiple Linear Regression model on the Training set
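
This section has no code in the source, so the following is only a sketch of how the training step might look with scikit-learn's `LinearRegression`. To stay self-contained it uses synthetic stand-in data rather than `50_Startups.csv`; in the notebook, `X_train` and `y_train` would come from the split above. The variable name `regressor` is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the Profit column (not the real data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 3))
y = 10.0 + 0.8 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(scale=1.0, size=50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the constant and one coefficient per feature by ordinary least squares.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

print(regressor.intercept_, regressor.coef_)
```

The same class handles both simple and multiple linear regression; the number of fitted coefficients simply follows the number of columns in `X_train`.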

## Predicting the Test set results
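
This section is also empty in the source. One possible way to predict on the test set and compare against the true values is sketched below, again on synthetic stand-in data so that it runs on its own. The names `regressor`, `y_pred` and `comparison` are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the Profit column (not the real data).
rng = np.random.default_rng(1)
X = rng.uniform(0, 100, size=(50, 3))
y = 10.0 + 0.8 * X[:, 0] + 0.1 * X[:, 2] + rng.normal(scale=1.0, size=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression().fit(X_train, y_train)

# Predict the held-out rows and place predictions next to the true values.
y_pred = regressor.predict(X_test)
comparison = np.column_stack([y_pred, y_test])

np.set_printoptions(precision=2)
print(comparison)  # column 0: predicted, column 1: actual
```

With more than one feature the fit cannot be drawn as a single 2-D line, so a side-by-side table like this is a simple way to inspect the predictions.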