## Backward Elimination

In the [previous post](http://javahabit.com/2019/02/02/part-4-ml-multiple-linear-regression/) we learnt about multiple linear regression. The problem with that approach was that we used all the features without considering that some of them may play little or no role in the outcome. We also talked about 5 ways of removing noisy features. Backward elimination is one of them.

### What is *`Backward Elimination`*?
Backward elimination is a process to remove features that have little effect on the dependent variable.

### What could possibly be wrong with leaving the features if they are not impacting or have little impact?
The `New England Patriots` won the Super Bowl on Feb 3, 2019. The team won because it had a better roster, better skills and a good coach. If I said the team also won because Patriots fans are great at cheering, and that opposing fans are tamer when the Patriots play, I would be wrong. I could also say that all the players wore white jerseys and they won, or that they won because they played on a Sunday and `T. Brady` thinks it's his luckiest day. You would call baloney on all the facts that I just mentioned. They may have helped slightly, but they are too insignificant to make a real difference. Insignificant features in a data set are exactly that - `Baloney`. They only add noise to the model, and many small, insignificant features can give us a model that is way off the mark. All else being equal, a simpler model tends to give a better result.

### How do we implement *`Backward elimination`*?
In backward elimination, you start by fitting a model with all the variables. Select a significance level, then look at the predictor with the *highest P-value*. If `P-Value > Significance level`, eliminate that variable and refit the model with the remaining ones; otherwise keep it and stop. Repeat until no remaining predictor has a P-value above the significance level.
![bkwd-elim](resources\img\bkwdelim\backward-elimination.PNG)


### How do we implement it in Python?
In the [previous post](http://javahabit.com/2019/02/02/part-4-ml-multiple-linear-regression/) we were trying to predict a company's profit by looking at 4 independent variables - `R&D Spent`, `Administration cost`, `Market spending` & `State`. We created a model with all the features. So let's pick up from where we left off. Here's what our dataset looks like:
![dataset](resources\img\bkwdelim\dataset.PNG)

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#Read the dataset
dataset = pd.read_csv("50_Startups.csv")

#Divide the dataset in dependent and Independent variables
X= dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values



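#Taking care of categorical values: one-hot encode the State column (index 3).
#OneHotEncoder(categorical_features=...) was removed from scikit-learn, so use ColumnTransformer.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
column_transformer = ColumnTransformer([("state", OneHotEncoder(), [3])],
                                       remainder="passthrough", sparse_threshold=0)
#The encoded State columns come first, followed by the three numeric columns.
X = column_transformer.fit_transform(X).astype(float)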

#Getting out of the dummy variable trap by dropping the first dummy column
X = X[:,1:] # Select all the rows and all the columns from index 1 onwards.

#Create training and test set
from  sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.20,
                                                    train_size=0.80,
                                                    random_state=0)


#Check for missing data
null_columns=dataset.columns[dataset.isnull().any()]
t = dataset[null_columns].isnull().sum()

#Training the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the results of the test set
y_pred = regressor.predict(X_test)  


print(regressor.coef_)
print(regressor.intercept_)

print('Train Score: ', regressor.score(X_train, y_train))
print('Test Score: ', regressor.score(X_test, y_test))



[-9.59284160e+02  6.99369053e+02  7.73467193e-01  3.28845975e-02
  3.66100259e-02]
42554.16761772438
Train Score:  0.9501847627493607
Test Score:  0.9347068473282446


Take note of the `Train Score` and `Test Score`:
>Train Score:  0.9501847627493607 

>Test Score:  0.9347068473282446

The difference between them is `0.01547791542` or `1.548%`

So far we have used all the features. To apply backward elimination we will use a different package, `statsmodels`, and its `OLS` class. However, before we begin, we need to decide on a `significance level`. In this case let's choose a level equal to **0.05**.


In [5]:
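#OLS lives in statsmodels.api; recent versions no longer expose it from statsmodels.formula.api.
#The alias smf is kept so the cells below run unchanged.
import statsmodels.api as smf

#Prepending a column of ones for the constant
X = np.append(arr=np.ones((X.shape[0], 1)), values=X, axis=1)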

### Why did we append 1's to the existing dataset?
>Y = C + b1X1 + b2X2 + b3X3 + b4X4 + b5X5

In the above equation, every **X<sub>n</sub>** has a multiplier **b<sub>n</sub>**, but the constant **C** has no feature attached to it. If we add a feature **X<sub>0</sub>** that is always 1, we can write **C** as **b<sub>0</sub>X<sub>0</sub>**, and that solves the problem. Why do we need a feature of 1's for the constant? The answer lies in the library and class that we use. `OLS` in the statsmodels package only estimates a multiplier for columns that are present in the data, and it does not add a constant on its own. Without a column for it, **C** would be dropped from the model. Hence we need to create a feature with value = 1.
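
To see what the column of ones does, here is a small sketch using NumPy's least-squares solver rather than statsmodels. The toy data and variable names are assumptions made for illustration only. The same idea applies to `OLS`: without a constant column, the fitted line is forced through the origin.

```python
import numpy as np

# Hypothetical toy data generated from y = 10 + 2 * x, with no noise.
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 10.0 + 2.0 * x[:, 0]

# Without a column of ones there is no constant term,
# so the fit is forced through the origin.
coef_no_const, *_ = np.linalg.lstsq(x, y, rcond=None)

# With a leading column of ones, the first coefficient plays the role of C.
x_with_const = np.append(arr=np.ones((x.shape[0], 1)), values=x, axis=1)
coef_with_const, *_ = np.linalg.lstsq(x_with_const, y, rcond=None)

print(coef_no_const)    # a single slope, no constant
print(coef_with_const)  # constant first, then the slope
```

With the ones column the solver recovers both the constant and the slope used to generate the data. Without it, the single slope has to absorb the missing constant and ends up distorted.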

In [6]:
##Creating a model with all variables
x_opt = X[:,[0,1,2,3,4,5]]
regressor_OLS = smf.OLS(endog=y, exog=x_opt).fit()
print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.945
Method:                 Least Squares   F-statistic:                     169.9
Date:                Sun, 10 Feb 2019   Prob (F-statistic):           1.34e-27
Time:                        22:42:26   Log-Likelihood:                -525.38
No. Observations:                  50   AIC:                             1063.
Df Residuals:                      44   BIC:                             1074.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5.013e+04   6884.820      7.281      0.0

Based on the process, we now have to find the feature with the highest **`P-value`**, and if it is greater than our **`SL`** we drop it. In this case
> x2 has the highest P-value = 0.990 > 0.05.

So we will drop feature x2, which corresponds to a `State` dummy variable.
![allfeature](resources\img\bkwdelim\featureall.PNG)

We will continue and rerun the model with the remaining 5 columns (the constant plus 4 features).


In [7]:
### Removing index 2 as P>0.05 and is the highest P
x_opt = X[:,[0,1,3,4,5]]
regressor_OLS = smf.OLS(endog=y, exog=x_opt).fit()
print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.946
Method:                 Least Squares   F-statistic:                     217.2
Date:                Sun, 10 Feb 2019   Prob (F-statistic):           8.49e-29
Time:                        22:52:32   Log-Likelihood:                -525.38
No. Observations:                  50   AIC:                             1061.
Df Residuals:                      45   BIC:                             1070.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5.011e+04   6647.870      7.537      0.0

Once again, in the above output
> x1 has the highest P-value = 0.940 > 0.05

So we will drop x1, which in this case is the second dummy variable for `State`.
![allfeature](resources\img\bkwdelim\feature-5.PNG)


We will continue until no variable has a P-value greater than our **`significance level`**.


In [9]:
### Removing index 1 as P>0.05 and is the highest P
x_opt = X[:,[0,3,4,5]]
regressor_OLS = smf.OLS(endog=y, exog=x_opt).fit()
print(regressor_OLS.summary())



                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.951
Model:                            OLS   Adj. R-squared:                  0.948
Method:                 Least Squares   F-statistic:                     296.0
Date:                Sun, 10 Feb 2019   Prob (F-statistic):           4.53e-30
Time:                        22:59:29   Log-Likelihood:                -525.39
No. Observations:                  50   AIC:                             1059.
Df Residuals:                      46   BIC:                             1066.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       5.012e+04   6572.353      7.626      0.0

In [10]:
### Removing index 4 as P>0.05 and is the highest P
x_opt = X[:,[0,3,5]]
regressor_OLS = smf.OLS(endog=y, exog=x_opt).fit()
print(regressor_OLS.summary())



                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.950
Model:                            OLS   Adj. R-squared:                  0.948
Method:                 Least Squares   F-statistic:                     450.8
Date:                Sun, 10 Feb 2019   Prob (F-statistic):           2.16e-31
Time:                        22:59:39   Log-Likelihood:                -525.54
No. Observations:                  50   AIC:                             1057.
Df Residuals:                      47   BIC:                             1063.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4.698e+04   2689.933     17.464      0.0

In [11]:
### Removing index 5 as P>0.05 and is the highest P
x_opt = X[:,[0,3]]
regressor_OLS = smf.OLS(endog=y, exog=x_opt).fit()
print(regressor_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.947
Model:                            OLS   Adj. R-squared:                  0.945
Method:                 Least Squares   F-statistic:                     849.8
Date:                Sun, 10 Feb 2019   Prob (F-statistic):           3.50e-32
Time:                        22:59:43   Log-Likelihood:                -527.44
No. Observations:                  50   AIC:                             1059.
Df Residuals:                      48   BIC:                             1063.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       4.903e+04   2537.897     19.320      0.0

So in the end, we find that only the constant `C` and **`R&D Spending`** are significant at our chosen level. That makes R&D spending the most important feature for deciding whether we should invest in a new business venture.

### How can I be sure that keeping just the R&D feature will improve our model accuracy?
Let's retrain our model using the **Linear Regression library** and compare the accuracy scores.

In [12]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#Read the dataset
dataset = pd.read_csv("50_Startups.csv")

#Divide the dataset in dependent and Independent variables
X= dataset.iloc[:, 0].values ##Get the R&D spend column only
y = dataset.iloc[:, -1].values 

In [13]:
#Create training and test set
from  sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.20,
                                                    train_size=0.80,
                                                    random_state=0)

In [14]:
#Training the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(np.array(X_train).reshape(-1,1), y_train)

# Predicting the results of the test set
y_pred = regressor.predict(np.array(X_test).reshape(-1,1))


print(regressor.coef_)
print(regressor.intercept_)

print('Train Score: ', regressor.score(np.array(X_train).reshape(-1,1), y_train))
print('Test Score: ', regressor.score(np.array(X_test).reshape(-1,1), y_test))

[0.8516228]
48416.297661385026
Train Score:  0.9449589778363044
Test Score:  0.9464587607787219


Let's look at the `Train Score` and `Test Score` with all the features and with just `R&D spending`.
> With all Features
>Train Score:  `0.9501847627493607`  & Test Score:  `0.9347068473282446`

> Difference = `0.01547791542` or `1.548%`

> With just R&D spending feature
> Train Score: `0.9449589778363044`  & Test Score: `0.9464587607787219`

> Difference = 0.0014997829424 or 0.150%

Also, notice that the test score has improved from 93.5% to 94.6%.
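
The arithmetic behind this comparison can be checked with a few lines of Python. This is just a sketch that reuses the scores printed earlier; the helper name `gap` is made up for illustration. The scores are the R² values returned by `regressor.score`.

```python
# Scores copied from the two runs in this post.
all_features = {"train": 0.9501847627493607, "test": 0.9347068473282446}
rd_only = {"train": 0.9449589778363044, "test": 0.9464587607787219}


def gap(scores):
    """Absolute difference between the train and test scores."""
    return abs(scores["train"] - scores["test"])


print(f"All features: gap = {gap(all_features):.5f}")
print(f"R&D only:     gap = {gap(rd_only):.5f}")
print(f"Test score change: {rd_only['test'] - all_features['test']:+.5f}")
```

A smaller gap between the train and test scores suggests that the simpler model generalizes at least as well on this particular split, although a single split of 50 rows is not strong evidence on its own.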

Hopefully, you enjoyed this series. In the next series, we will look at a slightly more interesting topic called **Support Vector Regression (SVR)**, the regression form of SVMs.