# Multivariate Linear Regression
Multivariate linear regression with backward elimination, keeping only the features that are statistically significant to the model.
### Step 1: Importing required libraries and modules

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# OLS is exported by statsmodels.api; recent releases no longer expose it via statsmodels.formula.api
from statsmodels.api import OLS as regression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline

### Step 2: Reading the CSV dataset into a pandas DataFrame

In [None]:
dframe = pd.read_csv("50_startups.csv")
dframe.head()

Copying the X data (every column except the last) to a new DataFrame

In [None]:
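# .copy() gives an independent frame, so the later assignment to State does not trigger SettingWithCopyWarning
x_data_frame = dframe.iloc[:, :-1].copy()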
x_data_frame.head()

Similarly, copying the Y data (the Profit column) to its own variable

In [None]:
y_data_frame = dframe.Profit
y_data_frame.head()

### Step 3: Label Encoding
Convert the categorical column (State) to numerical data for model fitting and prediction.

In [None]:
label_encoder = LabelEncoder()
x_data_frame.State = label_encoder.fit_transform(x_data_frame.State)
x_data_frame.head()

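> Integer labels imply an ordering between states that does not exist. To eliminate that relation between the encoded labels, apply one-hot encoding to them. Dropping one of the resulting dummy columns then avoids multicollinearity.<br>

**In this example, I avoid the dummy variable trap by dropping the very first column of the encoded training data.**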

In [None]:
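from sklearn.compose import ColumnTransformer
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22; ColumnTransformer replaces it
hot_encoder = ColumnTransformer([("state", OneHotEncoder(), [3])], remainder="passthrough", sparse_threshold=0)
# Encoded State columns come first; drop the first one to avoid the dummy variable trap
x_data = hot_encoder.fit_transform(x_data_frame).astype(int)[:, 1:]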
x_data[:5,:]
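As a side note, the same "one-hot encode, then drop one dummy column" idea can be sketched with pandas alone. This is only an illustration on a made-up toy frame, not the notebook's method; the column values and the names `toy` and `encoded` are assumptions for the example.

```python
import pandas as pd

# Toy frame standing in for the startup data; the values are made up.
toy = pd.DataFrame({
    "Spend": [100.0, 200.0, 300.0, 400.0],
    "State": ["New York", "California", "Florida", "New York"],
})

# drop_first=True drops one dummy per encoded column (the first category
# in sorted order), which avoids the dummy variable trap.
encoded = pd.get_dummies(toy, columns=["State"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Here the dropped category becomes the baseline: a row with zeros in all remaining dummy columns belongs to it.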

In [None]:
y_data = y_data_frame.values.astype(int)  # np.int was removed in NumPy 1.24; the built-in int is equivalent
y_data[:5]

### Step 4: Split and Fit
In this step, I split the x and y data into training and test sets, and then fit the regression model on the training data.
<br>
In this case, we are not using scikit-learn's regression models because they do not provide p-value information for the fitted model, so I use statsmodels for the regression instead.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.2, random_state=0)
regressor = regression(exog=x_train, endog=y_train).fit()
regressor.summary()

### Step 5: Building a better model with Backward Elimination
To get a model that predicts better, I use backward elimination to remove the features that do not significantly affect the model.

Backward Elimination steps:
- Choose a significance level, say 5%
- Fit the model with all features and look for a feature whose p-value exceeds the significance level; if there is none, finish
- Remove that feature and fit the model again with the reduced feature set
- Repeat until no feature in the training set has a p-value above the significance level

In this case, we remove x1 (see the summary above), which has a p-value of 0.920 (92%), from the training set. We then fit the model with the remaining features and check again for a feature whose p-value exceeds the significance level.

In [None]:
# Drop column 0 (reported as x1 in the summary) from both sets
x_train = x_train[:, [1, 2, 3, 4]]
x_test = x_test[:, [1, 2, 3, 4]]
regressor = regression(exog=x_train, endog=y_train).fit()
regressor.summary()

Now no remaining feature has a p-value greater than the significance level, so the model is ready for prediction.