# This notebook explains the backward elimination method for building a model


### Importing the libraries 

In [1]:
import numpy as np 
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression

### Reading the dataset (50 startups with a statement of their profit and loss)

In [2]:
dataset = pd.read_csv("./50_Startups.csv")
print (dataset.head())
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:,4].values

   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94


### As we can see, the "State" column holds categorical data, so it must be converted to a numerical representation.

### To convert the categorical data into a numerical representation, one-hot encoding will be used.

In [3]:
labelencoder_X = LabelEncoder()

In [4]:
X[:, 3] = labelencoder_X.fit_transform(X[:,3])

In [5]:
onehotencoder = OneHotEncoder(categorical_features=[3])

In [6]:
X = onehotencoder.fit_transform(X).toarray()

Note: in recent scikit-learn versions `OneHotEncoder` can encode string categories directly, so the `LabelEncoder` step above is no longer required.
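
In recent scikit-learn releases the `categorical_features` argument of `OneHotEncoder` is no longer available. Below is a minimal sketch of the same encoding with `ColumnTransformer`, assuming a scikit-learn version that provides `sklearn.compose`. The five rows are copied from the `dataset.head()` output above; the names `sample`, `encoder` and `X_encoded` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First five rows of the dataset, as printed by dataset.head()
sample = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51, 144372.41, 142107.34],
    "Administration": [136897.80, 151377.59, 101145.55, 118671.85, 91391.77],
    "Marketing Spend": [471784.10, 443898.53, 407934.54, 383199.62, 366168.42],
    "State": ["New York", "California", "Florida", "New York", "Florida"],
})

# One-hot encode "State" and pass the numeric columns through unchanged.
# sparse_threshold=0 asks for a dense array instead of a sparse matrix.
encoder = ColumnTransformer(
    transformers=[("state", OneHotEncoder(), ["State"])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = np.asarray(encoder.fit_transform(sample), dtype=float)

print(X_encoded.shape)
print(X_encoded[0])
```

The encoded columns come first, in alphabetical order of the category names, followed by the untouched numeric columns. This is the same layout that the `X[:, 1:]` step relies on.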


### Avoiding the dummy variable trap

In [7]:
X = X[:, 1:]
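
Dropping the first dummy column matters because the state dummies always sum to 1, which duplicates the intercept term of the regression. Here is a small NumPy sketch of that redundancy, using made-up state labels rather than the real dataset:

```python
import numpy as np

# Hypothetical labels: 0 = California, 1 = Florida, 2 = New York
states = np.array([2, 0, 1, 2, 1, 0])
dummies = np.eye(3)[states]                  # one dummy column per state
intercept = np.ones((len(states), 1))

full = np.hstack([intercept, dummies])            # intercept + all 3 dummies
reduced = np.hstack([intercept, dummies[:, 1:]])  # first dummy dropped

# full has 4 columns but only rank 3: one column is redundant
print(full.shape[1], np.linalg.matrix_rank(full))
# reduced has 3 columns and rank 3: no redundancy left
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```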

### Splitting the data into training and testing set

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

### With the data split, we are now ready for multiple linear regression

### For multiple linear regression we again use `LinearRegression`, because multiple linear regression is simply linear regression with more than one slope coefficient

In [9]:
regressor = LinearRegression()

In [10]:
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

### With the line above we have trained a multiple linear regression model, and it is ready for testing

In [11]:
y_pred = regressor.predict(X_test)

In [12]:
y_hat = pd.DataFrame(y_pred, columns=["calculated"])
y_act = pd.DataFrame(y_test, columns=["actual"])

output = pd.concat([y_act, y_hat], axis=1)
output

      actual     calculated
0  103282.38  103015.201598
1  144259.40  132582.277608
2  146121.95  132447.738452
3   77798.83   71976.098513
4  191050.39  178537.482211
5  105008.31  116161.242302
6   81229.06   67851.692097
7   97483.56   98791.733747
8  110352.25  113969.435330
9  166187.94  167921.065696
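
One way to summarise how close the two columns are is the mean absolute error. The sketch below uses the ten actual and calculated values shown above. The metric is an illustrative addition, and the names `actual`, `calculated` and `mae` are assumptions.

```python
import numpy as np

# Values copied from the "actual" and "calculated" columns of the output
actual = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])
calculated = np.array([103015.201598, 132582.277608, 132447.738452,
                       71976.098513, 178537.482211, 116161.242302,
                       67851.692097, 98791.733747, 113969.435330,
                       167921.065696])

# Mean absolute error: average size of the prediction error, in profit units
mae = np.mean(np.abs(actual - calculated))
print(round(mae, 2))
```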


# The example above makes it clear that multiple independent variables can be used in linear regression. But is that enough to train a good model? Can we select only the statistically significant columns to make the prediction?


In [13]:
! pip install statsmodels



In [19]:
import statsmodels.api as sm

#### Multiple linear regression has the equation
##### y = b0 + b1*X1 + b2*X2 + b3*X3 + ...... + bn*Xn
#### In the equation above, the constant "b0" is not multiplied by any variable. We can treat it as multiplied by a variable "X0" that is always equal to 1, so the multiple linear regression equation can be rewritten as:
##### y = b0*X0 + b1*X1 + b2*X2 + b3*X3 + ...... + bn*Xn
#### To implement this, we add a column to the dataset in which every value is 1. The statsmodels OLS class does not add this constant automatically.
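
To see why the column of ones is needed, here is a small NumPy sketch on synthetic data (the coefficients 4, 2 and -3 are made up, as are the names `X_demo` and `y_demo`). Once X0 = 1 is included as a column, least squares recovers b0 like any other coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))        # two synthetic predictors X1, X2

# Noise-free target with b0 = 4, b1 = 2, b2 = -3
y_demo = 4.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

# Prepend X0 = 1 so that b0 becomes an ordinary coefficient
X_with_ones = np.hstack([np.ones((30, 1)), X_demo])

coefs, *_ = np.linalg.lstsq(X_with_ones, y_demo, rcond=None)
print(coefs)   # approximately [4, 2, -3]
```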


# Now, as per the backward elimination theory:

## Backward Elimination: (Fastest one)
    Step 1: Select the significance level (SL) to stay in the model (0.05 is the most commonly used value).
    
    Step 2: Fit the full model with all the possible predictors.
    
    Step 3: Consider the predictor with the highest p-value. If P > SL (significance level):
    
            Step 4: Remove the predictor.
            
            Step 5: Fit the model without this variable, then return to Step 3.
         Else:
            The model is ready.

In [15]:
X = np.append(arr =  np.ones((50,1)).astype(int), values = X, axis = 1)

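#### We introduce a new variable "X_optimal", which will eventually contain only the statistically significant features from the dataset.

#### As per the theory of the backward elimination method, step 2 is to fit the full model. Column 0 is the constant, columns 1 and 2 are the two remaining State dummy variables, and columns 3, 4 and 5 are R&D Spend, Administration and Marketing Spend.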

In [17]:
X_optimal = X[:, [0,1,2,3,4,5]]

##### OLS : Ordinary Least Squares

In [21]:
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()

In [22]:
regressor_OLS.summary()

Dep. Variable:     y                  R-squared:            0.951
Model:             OLS                Adj. R-squared:       0.945
Method:            Least Squares      F-statistic:          169.9
Date:              Wed, 01 Apr 2020   Prob (F-statistic):   1.34e-27
Time:              07:28:34           Log-Likelihood:       -525.38
No. Observations:  50                 AIC:                  1063.0
Df Residuals:      44                 BIC:                  1074.0
Df Model:          5
Covariance Type:   nonrobust

            coef   std err       t   P>|t|     [0.025     0.975]
const  5.013e+04  6884.820   7.281   0.000   3.62e+04    6.4e+04
x1      198.7888  3371.007   0.059   0.953  -6595.030   6992.607
x2      -41.8870  3256.039  -0.013   0.990  -6604.003   6520.229
x3        0.8060     0.046  17.369   0.000      0.712      0.900
x4       -0.0270     0.052  -0.517   0.608     -0.132      0.078
x5        0.0270     0.017   1.574   0.123     -0.008      0.062

Omnibus:        14.782   Durbin-Watson:      1.283
Prob(Omnibus):  0.001    Jarque-Bera (JB):   21.266
Skew:           -0.948   Prob(JB):           2.41e-05
Kurtosis:       5.572    Cond. No.           1450000.0


### From the summary above we can see that variable x2 (column 2, one of the State dummy variables) has a p-value of 0.990, which is far above the significance level (SL) of 0.05. So, as per the backward elimination method, we remove this variable and fit the model again.

In [23]:
X_optimal = X[:, [0,1,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()

Dep. Variable:     y                  R-squared:            0.951
Model:             OLS                Adj. R-squared:       0.946
Method:            Least Squares      F-statistic:          217.2
Date:              Wed, 01 Apr 2020   Prob (F-statistic):   8.49e-29
Time:              07:34:03           Log-Likelihood:       -525.38
No. Observations:  50                 AIC:                  1061.0
Df Residuals:      45                 BIC:                  1070.0
Df Model:          4
Covariance Type:   nonrobust

            coef   std err       t   P>|t|     [0.025     0.975]
const  5.011e+04  6647.870   7.537   0.000   3.67e+04   6.35e+04
x1      220.1585  2900.536   0.076   0.940  -5621.821   6062.138
x2        0.8060     0.046  17.606   0.000      0.714      0.898
x3       -0.0270     0.052  -0.523   0.604     -0.131      0.077
x4        0.0270     0.017   1.592   0.118     -0.007      0.061

Omnibus:        14.758   Durbin-Watson:      1.282
Prob(Omnibus):  0.001    Jarque-Bera (JB):   21.172
Skew:           -0.948   Prob(JB):           2.53e-05
Kurtosis:       5.563    Cond. No.           1400000.0


### After removing x2 in the previous step, we fit the model again. The labels are renumbered after each removal, so x1 here is still column 1 (the other State dummy variable). It has a p-value of 0.940, and our significance level allows only up to 0.05, so we perform the elimination step again.

In [24]:
X_optimal = X[:, [0,3,4,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()

Dep. Variable:     y                  R-squared:            0.951
Model:             OLS                Adj. R-squared:       0.948
Method:            Least Squares      F-statistic:          296.0
Date:              Wed, 01 Apr 2020   Prob (F-statistic):   4.53e-30
Time:              07:38:28           Log-Likelihood:       -525.39
No. Observations:  50                 AIC:                  1059.0
Df Residuals:      46                 BIC:                  1066.0
Df Model:          3
Covariance Type:   nonrobust

            coef   std err       t   P>|t|     [0.025     0.975]
const  5.012e+04  6572.353   7.626   0.000   3.69e+04   6.34e+04
x1        0.8057     0.045  17.846   0.000      0.715      0.897
x2       -0.0268     0.051  -0.526   0.602     -0.130      0.076
x3        0.0272     0.016   1.655   0.105     -0.006      0.060

Omnibus:        14.838   Durbin-Watson:      1.282
Prob(Omnibus):  0.001    Jarque-Bera (JB):   21.442
Skew:           -0.949   Prob(JB):           2.21e-05
Kurtosis:       5.586    Cond. No.           1400000.0


#### After removing the variables with very high p-values, the highest remaining p-value is 0.602, for x2 (column 4, Administration), so we remove it next.

In [25]:
X_optimal = X[:, [0,3,5]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()

Dep. Variable:     y                  R-squared:            0.950
Model:             OLS                Adj. R-squared:       0.948
Method:            Least Squares      F-statistic:          450.8
Date:              Wed, 01 Apr 2020   Prob (F-statistic):   2.16e-31
Time:              07:46:48           Log-Likelihood:       -525.54
No. Observations:  50                 AIC:                  1057.0
Df Residuals:      47                 BIC:                  1063.0
Df Model:          2
Covariance Type:   nonrobust

            coef   std err       t   P>|t|     [0.025     0.975]
const  4.698e+04  2689.933  17.464   0.000   4.16e+04   5.24e+04
x1        0.7966     0.041  19.266   0.000      0.713      0.880
x2        0.0299     0.016   1.927   0.060     -0.001      0.061

Omnibus:        14.677   Durbin-Watson:      1.257
Prob(Omnibus):  0.001    Jarque-Bera (JB):   21.161
Skew:           -0.939   Prob(JB):           2.54e-05
Kurtosis:       5.575    Cond. No.           532000.0


#### This time we can clearly see that x2 (column 5, Marketing Spend) has a p-value of 0.060, which is slightly above our SL of 0.05, so as per the method we remove this variable as well.

In [26]:
X_optimal = X[:, [0,3]]
regressor_OLS = sm.OLS(endog = y, exog = X_optimal).fit()
regressor_OLS.summary()

Dep. Variable:     y                  R-squared:            0.947
Model:             OLS                Adj. R-squared:       0.945
Method:            Least Squares      F-statistic:          849.8
Date:              Wed, 01 Apr 2020   Prob (F-statistic):   3.50e-32
Time:              07:51:31           Log-Likelihood:       -527.44
No. Observations:  50                 AIC:                  1059.0
Df Residuals:      48                 BIC:                  1063.0
Df Model:          1
Covariance Type:   nonrobust

            coef   std err       t   P>|t|     [0.025     0.975]
const  4.903e+04  2537.897  19.320   0.000   4.39e+04   5.41e+04
x1        0.8543     0.029  29.151   0.000      0.795      0.913

Omnibus:        13.727   Durbin-Watson:      1.116
Prob(Omnibus):  0.001    Jarque-Bera (JB):   18.536
Skew:           -0.911   Prob(JB):           9.44e-05
Kurtosis:       5.361    Cond. No.           165000.0


# Following the theory of backward elimination, we have now built a model that contains only the statistically significant columns: the constant and column 3 (R&D Spend).