### Dataset

The 50 Startups dataset poses a venture capital fund challenge.
It is a realistic business problem.

We have only five columns:

- R&D Spend
- Administration
- Marketing Spend
- State
- Profit

We have to build a model that predicts Profit from R&D Spend, Administration, Marketing Spend and State. The goal is not simply to pick one company to invest in (you do not need to be a data scientist to see which company performs best in this table!). Instead, we want to understand, for instance, whether companies perform better in New York or in California, and which independent variable matters most in our dataset.
Should we look for companies that spend more on research and development, or for companies that spend more on marketing? In other words, which of these two kinds of spending yields more profit?

In [1]:
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
dataset.head()

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
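
The questions raised in the introduction (which state, which kind of spending) can be explored informally before any modelling. The following is a minimal sketch that assumes only the five rows printed by `dataset.head()`, typed in by hand so the snippet runs without the CSV file; the names `sample`, `mean_profit_by_state` and `corr_with_profit` are illustrative and not part of the original notebook.

```python
import pandas as pd

# The five rows shown by dataset.head(), re-typed so the example is self-contained.
sample = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51, 144372.41, 142107.34],
    "Administration": [136897.8, 151377.59, 101145.55, 118671.85, 91391.77],
    "Marketing Spend": [471784.1, 443898.53, 407934.54, 383199.62, 366168.42],
    "State": ["New York", "California", "Florida", "New York", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

# Average profit per state: a first look at "where do companies perform better?"
mean_profit_by_state = sample.groupby("State")["Profit"].mean()
print(mean_profit_by_state)

# Correlation of each numeric column with Profit: a first look at which spend matters.
corr_with_profit = sample.drop(columns="State").corr()["Profit"].drop("Profit")
print(corr_with_profit)
```

With only five rows these numbers mean very little. On the full data you would run the same two lines on `dataset`, and the regression model answers the same questions while accounting for the other variables.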


In [3]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

In [4]:
X

array([[165349.2, 136897.8, 471784.1, 'New York'],
       [162597.7, 151377.59, 443898.53, 'California'],
       [153441.51, 101145.55, 407934.54, 'Florida'],
       [144372.41, 118671.85, 383199.62, 'New York'],
       [142107.34, 91391.77, 366168.42, 'Florida'],
       [131876.9, 99814.71, 362861.36, 'New York'],
       [134615.46, 147198.87, 127716.82, 'California'],
       [130298.13, 145530.06, 323876.68, 'Florida'],
       [120542.52, 148718.95, 311613.29, 'New York'],
       [123334.88, 108679.17, 304981.62, 'California'],
       [101913.08, 110594.11, 229160.95, 'Florida'],
       [100671.96, 91790.61, 249744.55, 'California'],
       [93863.75, 127320.38, 249839.44, 'Florida'],
       [91992.39, 135495.07, 252664.93, 'California'],
       [119943.24, 156547.42, 256512.92, 'Florida'],
       [114523.61, 122616.84, 261776.23, 'New York'],
       [78013.11, 121597.55, 264346.06, 'California'],
       [94657.16, 145077.58, 282574.31, 'New York'],
       [91749.16, 114175.79, 29491

In [5]:
y

array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
       156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
       141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
       124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
       108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
        99937.59,  97483.56,  97427.84,  96778.92,  96712.8 ,  96479.51,
        90708.19,  89949.14,  81229.06,  81005.76,  78239.91,  77798.83,
        71498.49,  69758.98,  65200.33,  64926.08,  49490.75,  42559.73,
        35673.41,  14681.4 ])

In [6]:
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer 

labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])

ctransformer = ColumnTransformer([("State", OneHotEncoder(), [3])],remainder="passthrough")
X = ctransformer.fit_transform(X)

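Note: the `LabelEncoder` step is only needed with older versions of scikit-learn. Recent versions of `OneHotEncoder` accept string categories directly, so if you previously used a `LabelEncoder` to convert the categories to integers first, you can drop it and use the `OneHotEncoder` on its own.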
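
A minimal sketch of one-hot encoding the State column directly from its string values, without a `LabelEncoder`. It uses three rows copied from the table above; the names `X_demo`, `ct` and `X_encoded` are illustrative, and `sparse_threshold=0` is added here only to force a dense result.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three illustrative rows in the same column order as X.
X_demo = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.59, 443898.53, "California"],
    [153441.51, 101145.55, 407934.54, "Florida"],
], dtype=object)

# One-hot encode column 3 (State) from its strings; pass the other columns through.
ct = ColumnTransformer(
    [("State", OneHotEncoder(), [3])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = np.asarray(ct.fit_transform(X_demo), dtype=float)

# Categories are sorted alphabetically: California, Florida, New York.
print(ct.named_transformers_["State"].categories_)
# The dummy columns come first, followed by the passthrough columns.
print(X_encoded[:, :3])
```

Because the categories are sorted alphabetically in both cases, the dummy columns come out in the same order as with the `LabelEncoder` route.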


#### The Dummy Variable Trap
You can never include all the dummy variables of a set at the same time; in our example we omit the California dummy.
But why is that?
What happens if we include the omitted dummy variable in the model as well?

The problem is that you are basically duplicating a variable. In the simplest case of two dummies, D1 and D2, D2 always equals 1 - D1. With more categories the same holds: the dummies of a set always sum to one, so any one of them can be computed from the others. The phenomenon where one or several independent variables in a linear regression predict another is called multicollinearity. As a result, the model cannot distinguish the effect of D1 from the effect of D2, and therefore it won't work properly.
This is called `the dummy variable trap`.
To sum up: whenever you build a model, always omit one dummy variable. This applies irrespective of how many dummy variables there are in that specific dummy set.
If you have 9, include only 8; if you have 100, include only 99.
Also note that if you have two sets of dummy variables, you need to apply the same rule to each set.
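
The trap can be seen numerically. The sketch below builds a small design matrix from hypothetical state labels (six made-up companies, not the real data) and compares its number of columns with its rank, first with all dummies and then with one dummy dropped.

```python
import numpy as np

# Hypothetical state labels for six companies (illustrative only).
states = np.array(["New York", "California", "Florida",
                   "New York", "Florida", "California"])
categories = np.unique(states)  # alphabetical: California, Florida, New York

# One dummy column per category.
dummies = (states[:, None] == categories[None, :]).astype(float)
intercept = np.ones((len(states), 1))

# Intercept plus ALL dummies: the dummy columns sum to the intercept column,
# so the matrix has 4 columns but only rank 3 (perfect multicollinearity).
X_trap = np.hstack([intercept, dummies])
print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))

# Dropping one dummy (here the first, California) restores full column rank.
X_ok = np.hstack([intercept, dummies[:, 1:]])
print(X_ok.shape[1], np.linalg.matrix_rank(X_ok))
```

A rank lower than the number of columns means the least-squares coefficients are not uniquely determined, which is exactly why one dummy per set has to go.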

In [7]:
# Avoiding the Dummy Variable Trap
X = X[:, 1:]

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

### Multiple Linear Regression
Multiple linear regression is basically this formula:

\begin{equation*}
y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n
\end{equation*}

`y` is the dependent variable.

`X_1, ..., X_n` are the independent variables, `b_0` is the intercept and `b_1, ..., b_n` are the coefficients.


In simple linear regression we have one dependent variable and one independent variable. Everything is easy: we just build the model and it works well.

But now our data has several columns, and all of them are potential predictors of the dependent variable.
Most of the time, we need to decide which ones to keep and which ones to throw out.
You may ask: why do we need to throw out columns? Why get rid of data? Why can't we just use everything in our model?
I can think of two reasons off the top of my head.
   - Number one is garbage in, garbage out. If you throw a lot of irrelevant variables into your model, it will not be a good, reliable model.
   - Number two is that you will have to explain these variables: what it means that a certain variable predicts the behavior of your dependent variable. If you have a thousand variables, explaining them all is not practical.

So we want to keep only the important ones, the ones that actually predict something.

**How do we construct a model?**

There are five methods of building models.

- **All in:** Throw in all your variables. We do this when we have prior knowledge that these exact variables are the ones the model needs, or when we are preparing for backward elimination.

- **Backward Elimination:**
  1. Select a significance level (SL) to stay in the model. By default we go with 5 percent (0.05).
  2. Fit the full model with all possible predictors (the all-in approach).
  3. Consider the predictor with the highest p-value. If its p-value is greater than SL, go to step 4; otherwise the model is finished.
  4. Remove that predictor.
  5. Refit the model with the remaining variables and go back to step 3.

  Keep going until even the variable with the highest p-value is below the significance level.

- **Forward Selection:**
  1. Select a significance level (SL) to enter the model.
  2. Fit all possible simple regression models: one model of the dependent variable on each single independent variable. Out of all those models, select the one whose independent variable has the lowest p-value.
  3. Keep this variable and fit all possible models with one extra predictor added to the ones you already have. For example, after choosing a simple linear regression with one variable, construct all possible regressions with two variables in which one of the two is the variable already chosen.
  4. Consider the newly added predictor with the lowest p-value. If its p-value is less than SL, keep it and go back to step 3; otherwise finish and keep the previous model, the one without the last variable.

  In other words, the model grows one variable at a time, not at random but by choosing the best of all possible additions each time, and it stops when the best variable left to add has a p-value greater than the significance level.

- **Bidirectional Elimination:** A combination of backward elimination and forward selection.
  1. Select a significance level to enter and a significance level to stay.
  2. Perform the next step of forward selection (a new variable must have a p-value less than the enter significance level to enter).
  3. Perform all steps of backward elimination (old variables must have a p-value less than the stay significance level to stay).
  4. Repeat steps 2 and 3 until no new variable can enter and no old variable can exit. Done!

- **Score Comparison:**
  1. Select a criterion of goodness of fit, for instance R-squared.
  2. Construct all possible regression models: if you have n variables, there are 2^n - 1 combinations of these variables.
  3. Select the model with the best value of the criterion.

There you go, your model is ready.
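
The score comparison method is easy to sketch with NumPy alone. The example below is an illustration under stated assumptions: the data are synthetic (not the startup data), the helper `adjusted_r2` is hypothetical, and adjusted R-squared is used as the criterion because plain R-squared never decreases when a variable is added and would always pick the full model.

```python
from itertools import combinations
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept by least squares and return adjusted R-squared."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Synthetic data (an assumption for illustration): y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 4))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 2] + rng.normal(scale=0.5, size=200)

# Enumerate every non-empty subset of the n columns: 2**n - 1 models in total.
n_features = X_syn.shape[1]
subsets = [c for r in range(1, n_features + 1)
           for c in combinations(range(n_features), r)]
scores = {cols: adjusted_r2(X_syn[:, list(cols)], y_syn) for cols in subsets}

# Pick the subset with the best criterion value.
best = max(scores, key=scores.get)
print(len(subsets), best, round(scores[best], 3))
```

The number of models doubles with every extra variable, so this brute-force approach is only practical for a handful of predictors.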



> Note that you will sometimes hear the term `stepwise regression`. It refers to methods 2, 3 and 4, because they are the true step-by-step methods.

In [9]:
from sklearn.linear_model import LinearRegression

# Fitting Multiple Linear Regression to the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [10]:
# Predicting the Test set results
y_pred = regressor.predict(X_test)

In [11]:
# Comparing the predictions of our model with the actual values
Y_test = pd.DataFrame({'Y_test_actual': y_test, 'Y_predict': y_pred})
Y_test

| | Y_test_actual | Y_predict |
|---|---|---|
| 0 | 103282.38 | 103015.201598 |
| 1 | 144259.4 | 132582.277608 |
| 2 | 146121.95 | 132447.738452 |
| 3 | 77798.83 | 71976.098513 |
| 4 | 191050.39 | 178537.482211 |
| 5 | 105008.31 | 116161.242302 |
| 6 | 81229.06 | 67851.692097 |
| 7 | 97483.56 | 98791.733747 |
| 8 | 110352.25 | 113969.43533 |
| 9 | 166187.94 | 167921.065696 |


In [12]:
print("The coefficients of the model are: ", regressor.coef_)
print("The intercept of the model is: ", regressor.intercept_)

The coefficients of the model are:  [-9.59284160e+02  6.99369053e+02  7.73467193e-01  3.28845975e-02
  3.66100259e-02]
The intercept of the model is:  42554.1676177278
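
To connect these numbers with the regression formula: a prediction is the intercept plus the dot product of the coefficients with one feature row. The sketch below uses the values printed above and the first row of the dataset, a New York company. The column order (Florida dummy, New York dummy, R&D Spend, Administration, Marketing Spend) is an assumption based on how `X` was encoded and then sliced.

```python
import numpy as np

# Values copied from the printed output above.
coef = np.array([-9.59284160e+02, 6.99369053e+02, 7.73467193e-01,
                 3.28845975e-02, 3.66100259e-02])
intercept = 42554.1676177278

# First row of the dataset after encoding (assumed column order):
# [Florida, New York, R&D Spend, Administration, Marketing Spend]
x_row = np.array([0.0, 1.0, 165349.2, 136897.8, 471784.1])

# y = b0 + b1*x1 + ... + bn*xn
y_hat = intercept + coef @ x_row
print(round(y_hat, 2))
```

The result can be compared with the actual profit of that company, 192261.83, from the first row of the table. `regressor.predict` performs the same computation for every row at once.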


### Backward Elimination

In [13]:
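# Building the optimal model using Backward Elimination
import statsmodels.api as sm

# statsmodels' OLS does not add an intercept by itself,
# so we prepend a column of ones for the constant b_0
X = np.append(arr = np.ones((50, 1)).astype(int), values = X.astype('float64'), axis = 1)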

In [14]:
X 

array([[1.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.6534920e+05,
        1.3689780e+05, 4.7178410e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.6259770e+05,
        1.5137759e+05, 4.4389853e+05],
       [1.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.5344151e+05,
        1.0114555e+05, 4.0793454e+05],
       [1.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.4437241e+05,
        1.1867185e+05, 3.8319962e+05],
       [1.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.4210734e+05,
        9.1391770e+04, 3.6616842e+05],
       [1.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.3187690e+05,
        9.9814710e+04, 3.6286136e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.3461546e+05,
        1.4719887e+05, 1.2771682e+05],
       [1.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.3029813e+05,
        1.4553006e+05, 3.2387668e+05],
       [1.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.2054252e+05,
        1.4871895e+05, 3.1161329e+05],
       [1.0000000e+00, 0.0000000e+00,

In [15]:
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,169.9
Date:,"Wed, 15 Apr 2020",Prob (F-statistic):,1.34e-27
Time:,13:57:00,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1063.0
Df Residuals:,44,BIC:,1074.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.013e+04,6884.820,7.281,0.000,3.62e+04,6.4e+04
x1,198.7888,3371.007,0.059,0.953,-6595.030,6992.607
x2,-41.8870,3256.039,-0.013,0.990,-6604.003,6520.229
x3,0.8060,0.046,17.369,0.000,0.712,0.900
x4,-0.0270,0.052,-0.517,0.608,-0.132,0.078
x5,0.0270,0.017,1.574,0.123,-0.008,0.062

0,1,2,3
Omnibus:,14.782,Durbin-Watson:,1.283
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.266
Skew:,-0.948,Prob(JB):,2.41e-05
Kurtosis:,5.572,Cond. No.,1450000.0


We look for the highest p-value. It belongs to x2, at 0.990 (99 percent), which is way above the significance level of 5 percent. As we remember from backward elimination, if the highest p-value is above the significance level we go to step 4 and remove this predictor.
Let's remove the predictor x2, which is our second dummy variable for State.

In [16]:
X_opt = X[:, [0, 1, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.946
Method:,Least Squares,F-statistic:,217.2
Date:,"Wed, 15 Apr 2020",Prob (F-statistic):,8.49e-29
Time:,13:57:00,Log-Likelihood:,-525.38
No. Observations:,50,AIC:,1061.0
Df Residuals:,45,BIC:,1070.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.011e+04,6647.870,7.537,0.000,3.67e+04,6.35e+04
x1,220.1585,2900.536,0.076,0.940,-5621.821,6062.138
x2,0.8060,0.046,17.606,0.000,0.714,0.898
x3,-0.0270,0.052,-0.523,0.604,-0.131,0.077
x4,0.0270,0.017,1.592,0.118,-0.007,0.061

0,1,2,3
Omnibus:,14.758,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.172
Skew:,-0.948,Prob(JB):,2.53e-05
Kurtosis:,5.563,Cond. No.,1400000.0


Again we look for the independent variable with the highest p-value. It is now x1, at 0.940 (94 percent).
That is still way above the 5 percent significance level, so we remove x1, the first dummy variable for State, from our independent variables. The dummy variables for State will therefore not be part of the final model.

In [17]:
X_opt = X[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.951
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,296.0
Date:,"Wed, 15 Apr 2020",Prob (F-statistic):,4.53e-30
Time:,13:57:00,Log-Likelihood:,-525.39
No. Observations:,50,AIC:,1059.0
Df Residuals:,46,BIC:,1066.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,5.012e+04,6572.353,7.626,0.000,3.69e+04,6.34e+04
x1,0.8057,0.045,17.846,0.000,0.715,0.897
x2,-0.0268,0.051,-0.526,0.602,-0.130,0.076
x3,0.0272,0.016,1.655,0.105,-0.006,0.060

0,1,2,3
Omnibus:,14.838,Durbin-Watson:,1.282
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.442
Skew:,-0.949,Prob(JB):,2.21e-05
Kurtosis:,5.586,Cond. No.,1400000.0


Continue removing the variable with the highest p-value as long as it is above the significance level. Now it is x2 (Administration), at 0.602.

In [18]:
X_opt = X[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.95
Model:,OLS,Adj. R-squared:,0.948
Method:,Least Squares,F-statistic:,450.8
Date:,"Wed, 15 Apr 2020",Prob (F-statistic):,2.16e-31
Time:,13:57:00,Log-Likelihood:,-525.54
No. Observations:,50,AIC:,1057.0
Df Residuals:,47,BIC:,1063.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.698e+04,2689.933,17.464,0.000,4.16e+04,5.24e+04
x1,0.7966,0.041,19.266,0.000,0.713,0.880
x2,0.0299,0.016,1.927,0.060,-0.001,0.061

0,1,2,3
Omnibus:,14.677,Durbin-Watson:,1.257
Prob(Omnibus):,0.001,Jarque-Bera (JB):,21.161
Skew:,-0.939,Prob(JB):,2.54e-05
Kurtosis:,5.575,Cond. No.,532000.0

Now the highest p-value belongs to x2 (Marketing Spend), at 0.060. That is only slightly above the 5 percent significance level, but following the rule strictly we remove it as well, which leaves R&D Spend as the only predictor.


In [19]:
X_opt = X[:, [0, 3]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.947
Model:,OLS,Adj. R-squared:,0.945
Method:,Least Squares,F-statistic:,849.8
Date:,"Wed, 15 Apr 2020",Prob (F-statistic):,3.50e-32
Time:,13:57:01,Log-Likelihood:,-527.44
No. Observations:,50,AIC:,1059.0
Df Residuals:,48,BIC:,1063.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.903e+04,2537.897,19.320,0.000,4.39e+04,5.41e+04
x1,0.8543,0.029,29.151,0.000,0.795,0.913

0,1,2,3
Omnibus:,13.727,Durbin-Watson:,1.116
Prob(Omnibus):,0.001,Jarque-Bera (JB):,18.536
Skew:,-0.911,Prob(JB):,9.44e-05
Kurtosis:,5.361,Cond. No.,165000.0
