
# Multiple Linear Regression
- more than one independent variable
- y = b0 + b1*x1 + b2*x2 + ... + bn*xn
- keep only the variables that are statistically significant, i.e. that actually help predict y
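
As a minimal sketch of the equation above, a prediction is the intercept plus a dot product of the coefficients with the inputs. The coefficients and inputs here are made up for illustration and do not come from the dataset.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
b0 = 2.0                          # intercept
b = np.array([0.5, -1.0, 3.0])    # b1, b2, b3

# Two observations, each with three independent variables x1, x2, x3
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.0, 1.0]])

# y = b0 + b1*x1 + b2*x2 + b3*x3, computed for every row at once
y = b0 + X @ b
print(y)
```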

## 5 methods of building models
- All-in
  - use all independent variables
  - appropriate when prior knowledge tells you that every variable is relevant
  - also the starting point for backward elimination
- Backward elimination
  - select a significance level (SL) to stay in the model
  - fit the full model with all predictors (independent variables)
  - find the predictor with the highest p-value; if it is greater than the significance level, remove it
  - fit the model again without this predictor
  - repeat the last two steps until the highest p-value is no longer greater than the significance level
- Forward selection
  - select a significance level (SL) to enter the model
  - fit a simple regression model for each variable separately and choose the one with the lowest p-value
  - keeping that variable, fit all possible models with one extra variable added
  - pick the added variable with the lowest p-value
  - if its p-value < SL, keep it and repeat the previous two steps
  - otherwise (p-value > SL) stop and keep the previous model, without the last added variable
- Bidirectional elimination or Stepwise Regression
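  - select a significance level to enter (SLE) and a significance level to stay (SLS)
  - perform the next step of forward selection (a new variable enters only if its p-value < SLE)
  - perform all steps of backward elimination (a variable stays only if its p-value < SLS)
  - repeat the previous two steps until no variable can be added or removed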
- Score comparison
  - select a criterion for goodness of fit (e.g. the Akaike information criterion, AIC)
  - construct all possible models: 2^n - 1 in total for n variables
  - select the one with best fit
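
The score comparison method can be sketched with NumPy alone. This is an illustration on synthetic data, not the notebook's dataset: the data-generating formula, the helper name `aic_for_subset`, and the form of AIC used (ordinary least squares with Gaussian errors, constant terms dropped) are all assumptions made for the example.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y depends on x0 and x2 only
n_samples, n_features = 200, 4
X = rng.normal(size=(n_samples, n_features))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n_samples)

def aic_for_subset(X, y, columns):
    """AIC of an OLS fit on the given columns (Gaussian errors, constants dropped)."""
    n = len(y)
    # Design matrix: intercept column plus the selected predictors
    A = np.column_stack([np.ones(n), X[:, columns]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    k = A.shape[1]  # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k

# Enumerate all 2^n - 1 non-empty subsets of predictors
subsets = [list(c)
           for r in range(1, n_features + 1)
           for c in itertools.combinations(range(n_features), r)]
scores = {tuple(s): aic_for_subset(X, y, s) for s in subsets}

# Lower AIC is better
best = min(scores, key=scores.get)
print(len(subsets), best)
```

The number of models doubles with every extra variable, so this exhaustive search is only practical for a small number of predictors.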

## Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [2]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [3]:
print(X)

[[165349.2 136897.8 471784.1 'New York']
 [162597.7 151377.59 443898.53 'California']
 [153441.51 101145.55 407934.54 'Florida']
 [144372.41 118671.85 383199.62 'New York']
 [142107.34 91391.77 366168.42 'Florida']
 [131876.9 99814.71 362861.36 'New York']
 [134615.46 147198.87 127716.82 'California']
 [130298.13 145530.06 323876.68 'Florida']
 [120542.52 148718.95 311613.29 'New York']
 [123334.88 108679.17 304981.62 'California']
 [101913.08 110594.11 229160.95 'Florida']
 [100671.96 91790.61 249744.55 'California']
 [93863.75 127320.38 249839.44 'Florida']
 [91992.39 135495.07 252664.93 'California']
 [119943.24 156547.42 256512.92 'Florida']
 [114523.61 122616.84 261776.23 'New York']
 [78013.11 121597.55 264346.06 'California']
 [94657.16 145077.58 282574.31 'New York']
 [91749.16 114175.79 294919.57 'Florida']
 [86419.7 153514.11 0.0 'New York']
 [76253.86 113867.3 298664.47 'California']
 [78389.47 153773.43 299737.29 'New York']
 [73994.56 122782.75 303319.26 'Florida']
 [67532

## Encoding categorical data
- If a categorical variable has n categories, one-hot encoding produces n new columns (called dummy variables)
- **Only (n-1) of these columns are needed in the regression equation**
- the category whose column is left out becomes the baseline: its effect is absorbed into the constant (b0) of the regression equation
- If there are 3 dummy variables D1, D2, D3, then D3 = 1 - (D1+D2)
- So D3 can always be determined from D1 and D2 and need not be included in the equation (including all n columns is known as the dummy variable trap)
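
The redundancy of the last dummy column can be checked directly. The sketch below builds the dummy columns by hand with NumPy for a four-row sample of the three states; it is only an illustration and is not how the notebook encodes the data.

```python
import numpy as np

# A small made-up sample of the categorical column
states = np.array(['New York', 'California', 'Florida', 'New York'])

# One-hot encode by comparing each value with each category
categories = np.unique(states)   # sorted: California, Florida, New York
dummies = (states[:, None] == categories).astype(float)
D1, D2, D3 = dummies.T

# Each row contains exactly one 1, so the third column is redundant
print(np.array_equal(D3, 1 - (D1 + D2)))   # True

# Keep n - 1 columns; the dropped category is absorbed into b0
dummies_reduced = dummies[:, :-1]
print(dummies_reduced.shape)               # (4, 2)
```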

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [5]:
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

## Splitting the dataset into the Training set and Test set

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training the Multiple Linear Regression model on the Training set
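- Feature scaling is not needed for multiple linear regression: each coefficient compensates for the scale of its feature
- the dummy variable trap is handled automatically: LinearRegression still fits when all n dummy columns are included, so we do not drop one by hand
- LinearRegression finds the best-fitting coefficients for all the features it is given; it does not remove features, so backward elimination would have to be done separately if we wanted it
- the same class handles both simple and multiple linear regression, depending on how many columns X has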
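
The point about feature scaling can be checked with a quick sketch. It fits ordinary least squares with NumPy on synthetic data (an assumption for illustration; `ols_predict` is a hypothetical helper, not part of scikit-learn) and compares the fitted values with and without standardization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features on very different scales (made up for illustration)
X = np.column_stack([rng.uniform(0, 1, 50), rng.uniform(0, 100000, 50)])
y = 5.0 + 3.0 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(scale=0.1, size=50)

def ols_predict(X_fit, y_fit):
    """Fit ordinary least squares with an intercept and return the fitted values."""
    A = np.column_stack([np.ones(len(X_fit)), X_fit])
    coef, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return A @ coef

# Standardize each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pred_raw = ols_predict(X, y)
pred_scaled = ols_predict(X_scaled, y)

# The coefficients differ, but the predictions agree
print(np.allclose(pred_raw, pred_scaled))
```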

In [8]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

## Predicting the Test set results
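- np.concatenate((vector of predicted profits, vector of real profits), axis)
- axis = 0 stacks the arrays vertically (one under the other) and axis = 1 joins them horizontally (side by side)
- reshape(length of array, 1) turns each 1-D vector into a column, so the two columns can be printed side by side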
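
A short sketch of how reshape and the axis argument behave, using made-up numbers rather than the model's predictions:

```python
import numpy as np

# Made-up predicted and real values, for illustration only
y_pred = np.array([10.0, 20.0, 30.0])
y_true = np.array([11.0, 19.0, 33.0])

# reshape(len, 1) turns each 1-D vector into a column of shape (3, 1)
pred_col = y_pred.reshape(len(y_pred), 1)
true_col = y_true.reshape(len(y_true), 1)

# axis=1 joins the columns side by side: one row per observation
side_by_side = np.concatenate((pred_col, true_col), axis=1)
print(side_by_side.shape)   # (3, 2)

# axis=0 stacks one column under the other
stacked = np.concatenate((pred_col, true_col), axis=0)
print(stacked.shape)        # (6, 1)
```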

In [9]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
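
One way to summarize the comparison printed above in a single number is the mean absolute error. The sketch below retypes the ten printed pairs; the choice of metric is an assumption made for illustration, since the notebook itself does not compute one.

```python
import numpy as np

# Predicted and real profits copied from the printed output above
y_pred = np.array([103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
                   116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

# Average size of the prediction error, ignoring its sign
abs_errors = np.abs(y_pred - y_test)
mae = abs_errors.mean()
print(f"Mean absolute error: {mae:.2f}")
```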


## Statistical Significance
- The significance level is a probability threshold that we choose before looking at the result
- 5% is the most common threshold, but it can be changed based on the problem we are working with
- for example, if it is a matter of life and death, then a 1% threshold is probably better
- the p-value is the probability of observing a result at least as extreme as the one we got, assuming the null hypothesis is true
- if the p-value is less than the significance level, the result is statistically significant and we reject the null hypothesis
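
As a small worked example of comparing a p-value with a significance level, consider a coin-flip test. The scenario (9 heads in 10 flips, one-sided test, null hypothesis of a fair coin) and the helper name `p_value_at_least` are assumptions chosen for illustration.

```python
from math import comb

def p_value_at_least(k, n, p=0.5):
    """Probability of k or more successes in n trials under the null hypothesis."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

significance_level = 0.05

# Null hypothesis: the coin is fair. Observation: 9 heads in 10 flips.
p_val = p_value_at_least(9, 10)
print(round(p_val, 4))               # 0.0107
print(p_val < significance_level)    # True -> reject the null hypothesis
```

With a stricter 1% threshold the same observation would not be enough to reject the null hypothesis, since 0.0107 is greater than 0.01.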