## Interview questions 

https://drive.google.com/drive/folders/1BrZ3ZRDYvwGiiFoH13PEGv-c9RQ0-vWE

https://colab.research.google.com/drive/1zfOkg1XCkXVHBQ26soQgPFL3eVFhHLfS - Linear regression notebook

## Linear regression



### Imports

In [57]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pickle
from pandas_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV, ElasticNet, ElasticNetCV
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import seaborn as sns

In [22]:
df = pd.read_csv('Advertising.csv')
df

,Unnamed: 0,TV,Radio,Newspaper,Sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9
...,...,...,...,...,...
195,196,38.2,3.7,13.8,7.6
196,197,94.2,4.9,8.1,9.7
197,198,177.0,9.3,6.4,12.8
198,199,283.6,42.0,66.2,25.5


### Defining the problem 
    > We have 3 columns for 3 different types of advertisement:
        * TV,
        * Radio, 
        * Newspaper.

Our task is to study how much the different types of advertisement impact sales, and to build a model that predicts sales from the advertising spend.

Task #2 - Will the amount of money invested in a specific type of advertisement allow us to achieve the sales goal?
Task #3 - Are there relations between the variables? If there are, we can try to implement Linear Regression.

In [23]:
# Basic EDA
df.head()

,Unnamed: 0,TV,Radio,Newspaper,Sales
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


In [24]:
df.describe()

,Unnamed: 0,TV,Radio,Newspaper,Sales
count,200.0,200.0,200.0,200.0,200.0
mean,100.5,147.0425,23.264,30.554,14.0225
std,57.879185,85.854236,14.846809,21.778621,5.217457
min,1.0,0.7,0.0,0.3,1.6
25%,50.75,74.375,9.975,12.75,10.375
50%,100.5,149.75,22.9,25.75,12.9
75%,150.25,218.825,36.525,45.1,17.4
max,200.0,296.4,49.6,114.0,27.0


In [25]:
# Pandas profiling
ProfileReport(df)




In [26]:
# Saving PF to variable for beautification
pf = ProfileReport(df)
pf.to_widgets()


In [27]:
# Saving report
pf.to_file('pf_advertising.html')

### Multicollinearity

Multicollinearity means that two (or more) independent variables are highly correlated with each other. In this case one of them usually needs to be dropped. Why? Because the model cannot separate the individual effect of x1 from that of x2 on y, so the coefficient estimates become unstable and hard to interpret.


> In this dataset we do not have multicollinearity, so we do not need to drop any columns because of it.
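
A quick way to look for highly correlated feature pairs is the pairwise correlation matrix. The sketch below is only an illustration on synthetic data, not on the Advertising dataset; the helper `correlated_pairs`, the column names `a`, `b`, `c` and the 0.8 threshold are assumptions made for the example.

```python
import numpy as np

def correlated_pairs(X, names, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j]))
    return pairs

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)  # almost a copy of `a`
c = rng.normal(size=200)            # independent of `a` and `b`

X = np.column_stack([a, b, c])
print(correlated_pairs(X, ["a", "b", "c"]))
```

On the notebook's data, the same kind of check could be run with `df[['TV', 'Radio', 'Newspaper']].corr()`.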

## Building a model

Building a linear regression model with TV and Sales (the pair with the highest correlation).

y = mx + c

y - predictions
m - slope (coefficient): how much y changes when x increases by 1
c - intercept: the value of y when x = 0
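
For a single feature, the least-squares values of m and c have a closed form: m is the covariance of x and y divided by the variance of x, and c follows from the means. The sketch below shows this on made-up numbers; the helper name `fit_line` and the sample values are assumptions for illustration only.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope m and intercept c for y = m*x + c."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    c = y_mean - m * x_mean
    return m, c

# Points that lie exactly on y = 2x + 1
m, c = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(f"m = {m:.1f}, c = {c:.1f}")
```

Because the sample points lie exactly on y = 2x + 1, the fit recovers m = 2 and c = 1.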

In [28]:
# Selecting variables X, y
X = df[['TV']]
y = df['Sales']

### Using sklearn

In [29]:
from sklearn.linear_model import LinearRegression

In [30]:
lin_reg = LinearRegression()

In [31]:
lin_reg.fit(X, y)

LinearRegression()

#### Checking the values of m and c

In [32]:
lin_reg.intercept_ # c value

7.032593549127693

In [33]:
lin_reg.coef_ # m value

array([0.04753664])

#### Equation for sales

    > sales = m*x + c

In [34]:
# Saving model to file (the with-block closes the file handle afterwards)
model = 'lin_reg.sav'
with open(model, 'wb') as f:
    pickle.dump(lin_reg, f)

#### Predictions

In [35]:
# Predicting sales for a TV spend of 45
lin_reg.predict([[45]])

array([9.17174237])
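
The same number can be reproduced by hand from the equation sales = m*x + c. This is a small check that uses the intercept and coefficient printed earlier in the notebook.

```python
# Values copied from lin_reg.intercept_ and lin_reg.coef_ above
c = 7.032593549127693
m = 0.04753664

tv = 45
sales = m * tv + c
print(round(sales, 4))  # 9.1717
```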

In [36]:
# Predicting on multiple values
l = [4,5,6,7,89,34,45,67,23]

for i in l:
    print(lin_reg.predict([[i]]))

[7.22274011]
[7.27027675]
[7.31781339]
[7.36535003]
[11.26335455]
[8.64883932]
[9.17174237]
[10.21754846]
[8.12593628]


In [37]:
# Loading saved model
with open(model, 'rb') as f:
    sav_model = pickle.load(f)

In [38]:
# Predicting with the saved and reloaded model
sav_model.predict([[45]])

array([9.17174237])

#### Evaluations of the model

Note that we evaluate on the same data we trained on only for learning purposes. In practice the model should be evaluated on data it has not seen during training, otherwise the score is overly optimistic.
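
A hold-out evaluation could look roughly like the sketch below. It uses synthetic data as a stand-in for the TV and Sales columns (an assumption, not the real dataset) and does the split by hand with NumPy; the `train_test_split` function already imported in this notebook would do the same job on `X` and `y`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for TV spend (x) and Sales (y); not the real data
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.05 * x + rng.normal(0, 3, size=200)

# Shuffle the row indices and hold out 25% of the rows for testing
idx = rng.permutation(len(x))
n_test = len(x) // 4
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Fit y = m*x + c on the training rows only
m, c = np.polyfit(x[train_idx], y[train_idx], deg=1)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_train = r_squared(y[train_idx], m * x[train_idx] + c)
r2_test = r_squared(y[test_idx], m * x[test_idx] + c)
print(f"train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")
```

The test score is the one to report, because it is computed on rows the model never saw while fitting.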

In [39]:
lin_reg.score(X, y) # Evaluation with R-squared

0.611875050850071

## Multilinear Regression

In [41]:
data_x = df[['TV', 'Radio', 'Newspaper']]

In [42]:
data_x

,TV,Radio,Newspaper
0,230.1,37.8,69.2
1,44.5,39.3,45.1
2,17.2,45.9,69.3
3,151.5,41.3,58.5
4,180.8,10.8,58.4
...,...,...,...
195,38.2,3.7,13.8
196,94.2,4.9,8.1
197,177.0,9.3,6.4
198,283.6,42.0,66.2


In [44]:
y = df.Sales

In [45]:
lin_reg.fit(data_x, y)

LinearRegression()

In [46]:
lin_reg.intercept_, lin_reg.coef_

(2.9388893694594085, array([ 0.04576465,  0.18853002, -0.00103749]))

### Equation for multilinear regression

y = m1*x1 + m2*x2 + m3*x3 + c
y = 0.0458 (TV) + 0.1885 (Radio) - 0.0010 (Newspaper) + 2.9389 (intercept -> c)
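
As a worked example, the equation can be applied by hand to the first row of the dataset (TV = 230.1, Radio = 37.8, Newspaper = 69.2). The snippet uses the intercept and coefficients printed above; the dictionary layout is just one convenient way to write it down.

```python
# Values copied from lin_reg.intercept_ and lin_reg.coef_ above
intercept = 2.9388893694594085
coefs = {"TV": 0.04576465, "Radio": 0.18853002, "Newspaper": -0.00103749}

# First row of the dataset
row = {"TV": 230.1, "Radio": 37.8, "Newspaper": 69.2}

predicted = intercept + sum(coefs[name] * row[name] for name in coefs)
print(round(predicted, 2))  # 20.52
```

The observed Sales value for that row is 22.1, so the model underestimates it by about 1.6.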


### What will be the Confidence Interval on r-squared?

In [48]:
lin_reg.score(data_x, y)

0.8972106381789522

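**Conclusion**
    > The score is better for multilinear regression (0.897 vs 0.612 with TV alone). Using all three columns fits the data better. Keep in mind that r-squared measured on the training data never decreases when a column is added.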

#### Checking r-squared for TV and radio

In [49]:
sel_data = df[['TV', 'Radio']]

In [50]:
lin_reg.fit(sel_data, y)

LinearRegression()

In [51]:
lin_reg.score(sel_data, y)

0.8971942610828956

**Conclusion**
    > The score is practically the same with 2 or 3 columns (0.89719 vs 0.89721). Adding the Newspaper column does not improve the fit in this case.
    
    > Based on these results, the Newspaper column can be dropped. It does not help to explain sales once TV and Radio are in the model.
    
Note: r-squared is not the only way to evaluate features.

### Evaluation based on statistical models - Feature selection

In [52]:
import statsmodels.formula.api as smf

In [54]:
lm = smf.ols(formula = 'Sales ~ TV', data=df).fit() # Ordinary Least Squares
lm.summary()

Dep. Variable:,Sales,R-squared:,0.612
Model:,OLS,Adj. R-squared:,0.61
Method:,Least Squares,F-statistic:,312.1
Date:,"Mon, 07 Mar 2022",Prob (F-statistic):,1.47e-42
Time:,15:37:17,Log-Likelihood:,-519.05
No. Observations:,200,AIC:,1042.0
Df Residuals:,198,BIC:,1049.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,7.0326,0.458,15.360,0.000,6.130,7.935
TV,0.0475,0.003,17.668,0.000,0.042,0.053

Omnibus:,0.531,Durbin-Watson:,1.935
Prob(Omnibus):,0.767,Jarque-Bera (JB):,0.669
Skew:,-0.089,Prob(JB):,0.716
Kurtosis:,2.779,Cond. No.,338.0


In [55]:
lm = smf.ols(formula = 'Sales ~ TV + Radio', data=df).fit() # Ordinary Least Squares
lm.summary()

Dep. Variable:,Sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,859.6
Date:,"Mon, 07 Mar 2022",Prob (F-statistic):,4.83e-98
Time:,15:39:37,Log-Likelihood:,-386.2
No. Observations:,200,AIC:,778.4
Df Residuals:,197,BIC:,788.3
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.9211,0.294,9.919,0.000,2.340,3.502
TV,0.0458,0.001,32.909,0.000,0.043,0.048
Radio,0.1880,0.008,23.382,0.000,0.172,0.204

Omnibus:,60.022,Durbin-Watson:,2.081
Prob(Omnibus):,0.0,Jarque-Bera (JB):,148.679
Skew:,-1.323,Prob(JB):,5.19e-33
Kurtosis:,6.292,Cond. No.,425.0


In [56]:
lm = smf.ols(formula = 'Sales ~ TV + Radio + Newspaper', data=df).fit() # Ordinary Least Squares
lm.summary()

Dep. Variable:,Sales,R-squared:,0.897
Model:,OLS,Adj. R-squared:,0.896
Method:,Least Squares,F-statistic:,570.3
Date:,"Mon, 07 Mar 2022",Prob (F-statistic):,1.58e-96
Time:,15:40:01,Log-Likelihood:,-386.18
No. Observations:,200,AIC:,780.4
Df Residuals:,196,BIC:,793.6
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.9389,0.312,9.422,0.000,2.324,3.554
TV,0.0458,0.001,32.809,0.000,0.043,0.049
Radio,0.1885,0.009,21.893,0.000,0.172,0.206
Newspaper,-0.0010,0.006,-0.177,0.860,-0.013,0.011

Omnibus:,60.414,Durbin-Watson:,2.084
Prob(Omnibus):,0.0,Jarque-Bera (JB):,151.241
Skew:,-1.327,Prob(JB):,1.44e-33
Kurtosis:,6.332,Cond. No.,454.0


#### Legend
* std err -> Standard error of the coefficient estimate, i.e. the estimated standard deviation of that coefficient across repeated samples.
* Adj R-squared -> R-squared adjusted for the number of rows and the number of columns (features). Unlike plain r-squared, it does not automatically increase when a feature is added, so it is a more robust signal of possible overfitting.
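
Adjusted r-squared can be recomputed from r-squared, the number of rows n and the number of features p as 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below applies this standard formula to the three scores obtained earlier with `lin_reg.score`; the function name `adjusted_r2` is an assumption for illustration.

```python
def adjusted_r2(r2, n_rows, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

adj_tv = adjusted_r2(0.611875050850071, 200, 1)         # TV
adj_tv_radio = adjusted_r2(0.8971942610828956, 200, 2)  # TV + Radio
adj_all = adjusted_r2(0.8972106381789522, 200, 3)       # TV + Radio + Newspaper

print(round(adj_tv, 3), round(adj_tv_radio, 3), round(adj_all, 3))
```

The rounded values agree with the Adj. R-squared entries in the three summaries. Before rounding, the value for TV + Radio is slightly higher than the one for all three columns, even though plain r-squared goes the other way.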

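> Take a look at the p-value of Newspaper. Its value is 0.860, which means Newspaper is not statistically significant in this model.

What is the meaning of the p-value of Newspaper here? If the true Newspaper coefficient were zero, a t-statistic at least as extreme as the observed one (-0.177) would appear in about 86% of samples. The data therefore give no evidence that Newspaper affects Sales once TV and Radio are included; the 95% confidence interval [-0.013, 0.011] contains zero.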
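
The p-value and the confidence interval of Newspaper can be approximated from the numbers in the summary table. The sketch below assumes a standard normal distribution in place of the t distribution that statsmodels uses; with 196 residual degrees of freedom the two are very close. The inputs are the rounded values from the table.

```python
import math

coef, std_err = -0.0010, 0.006  # rounded values from the summary table
t_stat = -0.177                 # t value reported in the summary

# Two-sided p-value under a standard normal approximation
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

# Approximate 95% confidence interval: coef +/- 1.96 * std_err
ci_low = coef - 1.96 * std_err
ci_high = coef + 1.96 * std_err

print(round(p_value, 2))                    # 0.86
print(round(ci_low, 3), round(ci_high, 3))  # -0.013 0.011
```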

**Conclusion**
    > Taking into consideration both r-squared and p-value of the Newspaper, we can drop Newspaper. 

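### Multicollinearity - variance inflation factor - Feature selection

To check for multicollinearity, calculate the variance inflation factor (VIF) of each feature: VIF = 1 / (1 - R²), where R² comes from regressing that feature on all the other features. A VIF near 1 means no collinearity; values above roughly 5 to 10 are commonly treated as a warning sign.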
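
A minimal NumPy sketch of the VIF calculation is shown below, run on synthetic data in which two of three features are nearly identical. The function name `variance_inflation_factors` and the data are assumptions for illustration; statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Design matrix with an intercept column
        A = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        ss_res = resid @ resid
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)  # nearly collinear with `a`
c = rng.normal(size=200)            # independent feature

X = np.column_stack([a, b, c])
print(np.round(variance_inflation_factors(X), 2))
```

The first two features get a very large VIF, while the independent one stays close to 1.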

### Regularization

Regularization adds a penalty on the size of the coefficients (m) to the loss that the model minimizes.
It is used to stabilize the predictions and to reduce overfitting.

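#### L1 - Lasso Regression
> Penalizes the sum of the absolute values of m. It has a higher impact on small values of m: it can shrink them to exactly zero, which effectively removes those features.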
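
The effect of the L1 penalty on small coefficients can be illustrated with the soft-thresholding operator, which gives the Lasso solution in the simplified case of orthonormal features. This is a sketch under that assumption, not how `Lasso` is fitted in general; the function name and the sample coefficients are made up.

```python
import numpy as np

def soft_threshold(m, penalty):
    """Shrink each coefficient toward zero by `penalty`; values within +/-penalty become 0."""
    m = np.asarray(m, dtype=float)
    return np.sign(m) * np.maximum(np.abs(m) - penalty, 0.0)

m = np.array([2.0, 0.3, -0.05, -1.5])
shrunk = soft_threshold(m, penalty=0.1)
print(shrunk)
```

The large coefficients only lose 0.1 in magnitude, while the small coefficient -0.05 is set to zero.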


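#### L2 - Ridge Regression
> Penalizes the sum of the squared values of m. It has less impact on small values of m: it shrinks all coefficients toward zero but rarely makes them exactly zero.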
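
Ridge regression has a closed-form solution, which makes the shrinkage easy to see. The sketch below omits the intercept for brevity and uses synthetic data; the function name `ridge_coefficients` and the `alpha` values are assumptions for illustration.

```python
import numpy as np

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution without intercept: (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

for alpha in (0.0, 10.0, 1000.0):
    print(alpha, np.round(ridge_coefficients(X, y, alpha), 3))
```

As `alpha` grows, all coefficients move toward zero together, but none of them becomes exactly zero.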

#### Elastic net
> L1 and L2 combined. A mixing ratio controls how much of each penalty is applied, so it can behave more like Lasso or more like Ridge as needed.
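
The combined penalty can be written as a small function. The form below follows the penalty term documented for scikit-learn's `ElasticNet`, with its `alpha` and `l1_ratio` parameters; the function name and the sample coefficients are assumptions for illustration.

```python
import numpy as np

def elastic_net_penalty(m, alpha=1.0, l1_ratio=0.5):
    """alpha * l1_ratio * sum|m| + 0.5 * alpha * (1 - l1_ratio) * sum(m^2)."""
    m = np.asarray(m, dtype=float)
    l1 = np.sum(np.abs(m))
    l2 = np.sum(m ** 2)
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

m = [2.0, -0.5, 0.0]
print(elastic_net_penalty(m, l1_ratio=1.0))  # pure L1 penalty: 2.5
print(elastic_net_penalty(m, l1_ratio=0.0))  # pure L2 penalty: 2.125
print(elastic_net_penalty(m, l1_ratio=0.5))  # mix of both: 2.3125
```

With `l1_ratio=1.0` only the L1 term remains, and with `l1_ratio=0.0` only the L2 term remains.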

## Loading new data

In [58]:
pd.read_csv('Admission_Prediction.csv')

,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337.0,118.0,4.0,4.5,4.5,9.65,1,0.92
1,2,324.0,107.0,4.0,4.0,4.5,8.87,1,0.76
2,3,,104.0,3.0,3.0,3.5,8.00,1,0.72
3,4,322.0,110.0,3.0,3.5,2.5,8.67,1,0.80
4,5,314.0,103.0,2.0,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...,...
495,496,332.0,108.0,5.0,4.5,4.0,9.02,1,0.87
496,497,337.0,117.0,5.0,5.0,5.0,9.87,1,0.96
497,498,330.0,120.0,5.0,4.5,5.0,9.56,1,0.93
498,499,312.0,103.0,4.0,4.0,5.0,8.43,0,0.73


### Task
Predicting chance of admission based on given information. 

### EDA

In [59]:
# Storing data in variable adm
adm = pd.read_csv('Admission_Prediction.csv')

In [60]:
# Pandas profiling
ProfileReport(adm)


