# Multiple Linear Regression

> Formula: $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$  
$y$ is the dependent variable (DV)  
$x_1 \dots x_n$ are the independent variables (IVs)  
$b_1 \dots b_n$ are the coefficients: how much $y$ changes for a one-unit change in the corresponding $x_i$, with the other IVs held fixed  
$b_0$ is the constant (intercept)  

### Assumptions of a Linear Regression:
1. Linearity
2. Homoscedasticity
3. Multivariate normality
4. Independence of errors
5. Lack of multicollinearity

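### Dummy Variables apply to:
- categorical data

> Create new binary (0/1) variables, one per category, to represent the categorical data  
> Always omit one dummy variable per categorical feature: together with the intercept, the full set of dummies is perfectly collinear (the dummy variable trap)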
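
As a minimal sketch of the rule above, `pandas.get_dummies` with `drop_first=True` creates the dummy columns and omits one of them. The toy DataFrame and its column names (`spend`, `state`) are made up for illustration and are not the notebook's dataset.

```python
import pandas as pd

# Hypothetical toy data: column names and values are invented for this example.
df = pd.DataFrame({
    "spend": [10.0, 20.0, 30.0, 40.0],
    "state": ["New York", "California", "Florida", "California"],
})

# One dummy column per category, with the first category (alphabetically) omitted.
# The omitted category becomes the baseline that the intercept b_0 absorbs.
encoded = pd.get_dummies(df, columns=["state"], drop_first=True)
print(encoded.columns.tolist())  # ['spend', 'state_Florida', 'state_New York']
```

Three categories produce two dummy columns here; a row with both dummies equal to 0 is the omitted category (California).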

### The meaning of P value
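- the P value is the probability of observing a result at least as extreme as the one actually obtained, assuming the null hypothesis $H_0$ is true
- statistical significance is the point at which, in intuitive terms, you become uneasy about $H_0$ being true (commonly P < 0.05)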
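
A small worked example may make this concrete. The sketch below assumes a made-up experiment (9 heads in 10 coin flips) with $H_0$ "the coin is fair", and computes a one-sided P value as the probability of seeing at least that many heads under $H_0$. The function name `p_value_at_least` is hypothetical.

```python
from math import comb

def p_value_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of a result at least as extreme as k successes."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# H0: the coin is fair. Observation (invented): 9 heads in 10 flips.
p_val = p_value_at_least(9, 10)
print(round(p_val, 4))  # 0.0107

SL = 0.05  # significance level
print("reject H0" if p_val < SL else "fail to reject H0")  # reject H0
```

With a P value of about 1%, a fair coin would rarely produce this result, which is the point where doubting $H_0$ becomes reasonable.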

## Building A Model

5 methods of building models:
1. All-in
2. Backward Elimination (*)
3. Forward Selection
4. Bidirectional Elimination
5. Score Comparison (All Possible Models)

_Stepwise regression refers to methods 2, 3, and 4._

#### 1. All-in Cases: throw all variables in
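- you have prior knowledge that all the variables are true predictors
- you have to (the set of variables is imposed on you)
- you are preparing for Backward Elimination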



#### 2. Backward Elimination:
1. Select a significance level to stay in the model (usually SL = 0.05)
2. Fit the full model with all possible predictors
3. Consider the predictor with the _highest_ P-value. If _P > SL_, go to Step 4, otherwise FIN (the model is ready)
4. Remove that predictor from the model
5. Refit the model without the removed predictor and go to Step 3
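
The steps above can be sketched with NumPy and SciPy. This is an illustrative implementation on synthetic data, not the notebook's own code: the helper `predictor_p_values`, the function `backward_elimination`, and the generated data are all assumptions made for the example. P-values come from the usual two-sided t-tests on the OLS coefficients.

```python
import numpy as np
from scipy import stats

def predictor_p_values(X, y, cols):
    """Two-sided t-test P-values of the predictors in `cols` (OLS fit with an intercept)."""
    design = np.column_stack([np.ones(len(y)), X[:, cols]])
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k)                      # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    return 2 * stats.t.sf(np.abs(beta / se), df=n - k)[1:]  # drop the intercept's P-value

def backward_elimination(X, y, sl=0.05):
    kept = list(range(X.shape[1]))        # Step 2: start from the full model
    while kept:
        p = predictor_p_values(X, y, kept)
        worst = int(np.argmax(p))         # Step 3: predictor with the highest P-value
        if p[worst] <= sl:
            break                         # every remaining predictor is significant: FIN
        kept.pop(worst)                   # Steps 4-5: remove it, then refit on the next pass
    return kept

# Synthetic data: only columns 0 and 2 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

kept = backward_elimination(X, y)
print(kept)  # indices of the surviving columns
```

The two informative columns survive; a pure-noise column can occasionally survive as well, since a 0.05 threshold lets about 1 in 20 irrelevant predictors through by chance.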

#### 3. Forward Selection
1. Select a SL to enter (usually SL = 0.05)
2. Fit all simple regression (1 IV) models $y$ ~ $x_i$. Select the one with the lowest P-value
3. Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have (IV += 1)
4. Consider the predictor with the _lowest_ P-value. If _P < SL_, go to Step 3, otherwise FIN (keep the previous model, without the last candidate)
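
A possible sketch of Forward Selection, again on synthetic data. The names `predictor_p_values` and `forward_selection` are hypothetical helpers written for this example; the P-value of a candidate is taken from the model that contains the already selected predictors plus that candidate.

```python
import numpy as np
from scipy import stats

def predictor_p_values(X, y, cols):
    """Two-sided t-test P-values of the predictors in `cols` (OLS fit with an intercept)."""
    design = np.column_stack([np.ones(len(y)), X[:, cols]])
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    return 2 * stats.t.sf(np.abs(beta / se), df=n - k)[1:]

def forward_selection(X, y, sl=0.05):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        # Steps 2-3: fit one model per candidate added to the current predictors
        # and record the candidate's own P-value (the last one in the list).
        candidates = [(predictor_p_values(X, y, selected + [j])[-1], j) for j in remaining]
        best_p, best_j = min(candidates)  # Step 4: candidate with the lowest P-value
        if best_p >= sl:
            break                         # not significant: keep the previous model, FIN
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data: only columns 0 and 2 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

selected = forward_selection(X, y)
print(selected)  # column indices in the order they entered the model
```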

#### 4. Bidirectional Elimination
1. Select a SL to enter and a SL to stay in the model (SLENTER = 0.05, SLSTAY = 0.05)
2. Perform the next step of Forward Selection (add a new variable with _P < SLENTER_)
3. Perform ALL steps of Backward Elimination (remove every old variable with _P > SLSTAY_), then go back to Step 2
4. If no new variables can enter and no old variables can exit => FIN
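
One way to combine the two procedures is sketched below. The function names and the `max_rounds` cap are assumptions added for this example; the cap is only a safeguard against the rare case where variables keep entering and leaving in a cycle.

```python
import numpy as np
from scipy import stats

def predictor_p_values(X, y, cols):
    """Two-sided t-test P-values of the predictors in `cols` (OLS fit with an intercept)."""
    design = np.column_stack([np.ones(len(y)), X[:, cols]])
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    return 2 * stats.t.sf(np.abs(beta / se), df=n - k)[1:]

def bidirectional_elimination(X, y, sl_enter=0.05, sl_stay=0.05, max_rounds=100):
    chosen = []
    for _ in range(max_rounds):           # hypothetical cap to guarantee termination
        changed = False
        # Step 2: one Forward Selection step (add the best candidate if P < SLENTER)
        remaining = [j for j in range(X.shape[1]) if j not in chosen]
        candidates = [(predictor_p_values(X, y, chosen + [j])[-1], j) for j in remaining]
        if candidates and min(candidates)[0] < sl_enter:
            chosen.append(min(candidates)[1])
            changed = True
        # Step 3: all Backward Elimination steps (drop while the worst has P > SLSTAY)
        while chosen:
            p = predictor_p_values(X, y, chosen)
            worst = int(np.argmax(p))
            if p[worst] <= sl_stay:
                break
            chosen.pop(worst)
            changed = True
        if not changed:                   # Step 4: nothing entered, nothing left => FIN
            break
    return chosen

# Synthetic data: only columns 0 and 2 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

chosen = bidirectional_elimination(X, y)
print(chosen)
```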

#### 5. All Possible Models
1. Select a criterion of goodness of fit (Akaike criterion, R-Square)
2. Construct All Possible Regression Models: $2^n-1$ total combinations
3. Select the one with the best criterion
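
The exhaustive search can be sketched with `itertools.combinations`. This example scores each subset with adjusted R-squared rather than plain R-squared, because plain R-squared never decreases when a predictor is added and would always pick the full model; that choice, the function names, and the synthetic data are assumptions made for illustration.

```python
from itertools import combinations
import numpy as np

def adjusted_r2(X, y, cols):
    """Adjusted R-squared of an OLS fit (with intercept) on the columns in `cols`."""
    design = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    n, p = len(y), len(cols)
    return 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))

def all_possible_models(X, y):
    n_features = X.shape[1]
    # Every non-empty subset of the predictors: 2^n - 1 models in total.
    subsets = [c for r in range(1, n_features + 1)
               for c in combinations(range(n_features), r)]
    scores = {c: adjusted_r2(X, y, list(c)) for c in subsets}
    best = max(scores, key=scores.get)
    return best, scores

# Synthetic data: only columns 0 and 2 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

best, scores = all_possible_models(X, y)
print(len(scores))  # 15 models for n = 4 predictors
print(best)
```

The number of models doubles with every extra predictor, so this approach is only practical for a small number of columns.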

## Importing the libraries

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Importing the dataset

In [4]:
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

## Encoding categorical data

In [8]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [9]:
print(X)

[[0.0 0.0 1.0 165349.2 136897.8 471784.1]
 [1.0 0.0 0.0 162597.7 151377.59 443898.53]
 [0.0 1.0 0.0 153441.51 101145.55 407934.54]
 [0.0 0.0 1.0 144372.41 118671.85 383199.62]
 [0.0 1.0 0.0 142107.34 91391.77 366168.42]
 [0.0 0.0 1.0 131876.9 99814.71 362861.36]
 [1.0 0.0 0.0 134615.46 147198.87 127716.82]
 [0.0 1.0 0.0 130298.13 145530.06 323876.68]
 [0.0 0.0 1.0 120542.52 148718.95 311613.29]
 [1.0 0.0 0.0 123334.88 108679.17 304981.62]
 [0.0 1.0 0.0 101913.08 110594.11 229160.95]
 [1.0 0.0 0.0 100671.96 91790.61 249744.55]
 [0.0 1.0 0.0 93863.75 127320.38 249839.44]
 [1.0 0.0 0.0 91992.39 135495.07 252664.93]
 [0.0 1.0 0.0 119943.24 156547.42 256512.92]
 [0.0 0.0 1.0 114523.61 122616.84 261776.23]
 [1.0 0.0 0.0 78013.11 121597.55 264346.06]
 [0.0 0.0 1.0 94657.16 145077.58 282574.31]
 [0.0 1.0 0.0 91749.16 114175.79 294919.57]
 [0.0 0.0 1.0 86419.7 153514.11 0.0]
 [1.0 0.0 0.0 76253.86 113867.3 298664.47]
 [0.0 0.0 1.0 78389.47 153773.43 299737.29]
 [0.0 1.0 0.0 73994.56 122782.75 3

## Splitting the dataset into the Training set and Test set

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Training the Multiple Linear Regression model on the Training set

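> _There is no need to drop a dummy column by hand here: sklearn's `LinearRegression` uses a least-squares solver that still returns valid predictions when the dummy columns are collinear, although the individual dummy coefficients are then not uniquely determined._  
> _`LinearRegression` does not select features: it fits every column it is given. Backward elimination (or another selection method) is an optional extra step, not something the model does automatically._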

In [14]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()
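
The note above about the dummy variable trap can be checked on made-up data. The sketch below fits `LinearRegression` once with all three dummy columns and once with one dummy omitted, then compares the predictions; the data and variable names are invented for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 60
category = rng.integers(0, 3, size=n)           # three made-up categories
dummies = np.eye(3)[category]                   # full one-hot encoding (3 columns)
x_num = rng.normal(size=(n, 1))                 # one numeric feature
effect = np.array([0.0, 1.0, -1.0])[category]   # per-category shift in y
y = 5 + 2 * x_num[:, 0] + effect + rng.normal(scale=0.1, size=n)

X_full = np.hstack([dummies, x_num])            # all three dummies kept
X_drop = np.hstack([dummies[:, 1:], x_num])     # one dummy omitted

pred_full = LinearRegression().fit(X_full, y).predict(X_full)
pred_drop = LinearRegression().fit(X_drop, y).predict(X_drop)
print(np.allclose(pred_full, pred_drop))        # True
```

The predictions agree, but the dummy coefficients differ between the two fits, so omitting one dummy still matters when the coefficients themselves are being interpreted.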

## Predicting the Test set results

In [16]:
model.score(X_test, y_test)

0.9347068473282303

In [26]:
y_pred = model.predict(X_test)
np.set_printoptions(precision=2)
y_pred_display = y_pred.reshape(len(y_pred), 1)
y_test_display = y_test.reshape(len(y_test), 1)
print(np.concatenate((y_pred_display, y_test_display), axis=1))

[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
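
For a regressor, `model.score` returns the coefficient of determination $R^2 = 1 - SS_{res} / SS_{tot}$. As a check, the value can be recomputed by hand from the (prediction, actual) pairs printed above; since those are rounded to two decimals, the result matches the score only approximately.

```python
import numpy as np

# (prediction, actual) pairs copied from the printed output above
pairs = np.array([
    [103015.2, 103282.38],
    [132582.28, 144259.4],
    [132447.74, 146121.95],
    [71976.1, 77798.83],
    [178537.48, 191050.39],
    [116161.24, 105008.31],
    [67851.69, 81229.06],
    [98791.73, 97483.56],
    [113969.44, 110352.25],
    [167921.07, 166187.94],
])
y_pred_shown, y_test_shown = pairs[:, 0], pairs[:, 1]

ss_res = np.sum((y_test_shown - y_pred_shown) ** 2)         # unexplained variation
ss_tot = np.sum((y_test_shown - y_test_shown.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.9347
```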


In [27]:
print(model.coef_)
print(model.intercept_)

[ 8.66e+01 -8.73e+02  7.86e+02  7.73e-01  3.29e-02  3.66e-02]
42467.52924854249
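
To see how these numbers produce a prediction, the sketch below applies $y = b_0 + b_1 x_1 + \dots + b_n x_n$ to the first row of `X` printed earlier. The coefficients are copied from the rounded output above, so the result is only an approximation of what `model.predict` would return for that row.

```python
import numpy as np

# Coefficients and intercept as printed above (coefficients are rounded to 3 significant digits)
coef = np.array([8.66e+01, -8.73e+02, 7.86e+02, 7.73e-01, 3.29e-02, 3.66e-02])
intercept = 42467.52924854249

# First row of the encoded X: three dummy columns followed by three numeric columns
x_row = np.array([0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1])

# b_0 + sum of b_i * x_i
approx_pred = intercept + coef @ x_row
print(round(approx_pred, 1))
```

Only the third dummy is 1 for this row, so of the three dummy coefficients only `7.86e+02` contributes; the fourth coefficient (`7.73e-01`) dominates because it multiplies the largest-impact numeric column.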


## Advanced: Implement Backward Elimination Approach
[backward elimination](mutiple_linear_regression_backward.ipynb)

## Advanced: Implement Multiple Linear Regression from Scratch