# Multiple Linear Regression

### Importing Libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.linear_model import LinearRegression

### Loading Dataset

In [3]:
os.chdir('..')

In [5]:
path = os.path.join(os.getcwd(), 'Datasets', 'multiple_linear_regression.csv')
path

'D:\\Babin\\Internship\\Fusemachine-Internship\\Data Science\\Datasets\\multiple_linear_regression.csv'

In [6]:
data = pd.read_csv(path)
data.head()

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| 0 | 1714 | 2.4 | 1 |
| 1 | 1664 | 2.52 | 3 |
| 2 | 1760 | 2.54 | 3 |
| 3 | 1685 | 2.74 | 3 |
| 4 | 1693 | 2.83 | 2 |


In [7]:
data.isnull().sum()

SAT           0
GPA           0
Rand 1,2,3    0
dtype: int64

In [8]:
data.describe()

| | SAT | GPA | Rand 1,2,3 |
|---|---|---|---|
| count | 84.0 | 84.0 | 84.0 |
| mean | 1845.27381 | 3.330238 | 2.059524 |
| std | 104.530661 | 0.271617 | 0.855192 |
| min | 1634.0 | 2.4 | 1.0 |
| 25% | 1772.0 | 3.19 | 1.0 |
| 50% | 1846.0 | 3.38 | 2.0 |
| 75% | 1934.0 | 3.5025 | 3.0 |
| max | 2050.0 | 3.81 | 3.0 |


In [9]:
x = data.drop('GPA', axis = 1)
y = data['GPA']

In [11]:
reg = LinearRegression()
reg.fit(x,y)

LinearRegression()

In [12]:
print(f'Coefficient: {reg.coef_}')

Coefficient: [ 0.00165354 -0.00826982]


In [13]:
print(f'Intercept: {reg.intercept_}')

Intercept: 0.29603261264909486


In [14]:
x.shape

(84, 2)

In [15]:
y.shape

(84,)

In [23]:
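# Two new observations to predict; columns are SAT and Rand 1,2,3
pred_sat = np.array([[1700, 1], [200, 2]])
pred_sat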

array([[1700,    1],
       [ 200,    2]])

In [26]:
pred_sat.shape

(2, 2)

In [28]:
predicted_score = reg.predict(pred_sat)
predicted_score



array([3.09878385, 0.61020133])
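
To see what `reg.predict` computes, the same numbers can be reproduced by hand: a linear model's prediction is the intercept plus the dot product of each input row with the coefficients. The sketch below hard-codes the intercept and the rounded coefficients printed above, so it should agree with `predict` only to about four decimal places. It is an illustration, not a replacement for `reg.predict`.

```python
import numpy as np

# Values copied from the printed outputs above (coefficients are rounded).
intercept = 0.29603261264909486
coef = np.array([0.00165354, -0.00826982])  # order: [SAT, Rand 1,2,3]

# Same two observations as pred_sat: columns are SAT and Rand 1,2,3.
pred_sat = np.array([[1700, 1],
                     [200, 2]])

# Prediction = intercept + sum(coefficient_i * feature_i) for each row.
manual_pred = intercept + pred_sat @ coef
print(manual_pred.round(4))
```

Note that an SAT of 200 lies far below the minimum of 1634 seen in the data, so the second prediction is an extrapolation and should be read with caution.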

### Calculating the R-squared

In [31]:
reg.score(x,y)

0.4066811952814283
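
For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. A minimal sketch of that calculation follows. It uses small synthetic data: the names `X_demo`, `y_demo`, and `model` are made up for illustration and are not the SAT/GPA dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, not the dataset used in this notebook.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

ss_res = np.sum((y_demo - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

# The manual value should match what score() reports.
print(np.isclose(r2_manual, model.score(X_demo, y_demo)))
```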

### Formula for Adjusted R^2

$R^2_{adj.} = 1 - (1-R^2)\cdot\frac{n-1}{n-p-1}$

where $n$ is the number of observations and $p$ is the number of predictors (features).

In [33]:
r2 = reg.score(x,y)

n = x.shape[0]
p = x.shape[1]

adjusted_r2 = 1 - (1-r2) * (n-1)/(n-p-1)
adjusted_r2

0.3920313482513401
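
As a check on the arithmetic, plugging in the values from this notebook ($R^2 \approx 0.4067$, $n = 84$, $p = 2$) gives the same figure by hand:

$$R^2_{adj.} = 1 - (1 - 0.4067)\cdot\frac{84-1}{84-2-1} = 1 - 0.5933\cdot\frac{83}{81} \approx 0.392$$

The adjusted value is slightly lower than $R^2$ because the factor $\frac{n-1}{n-p-1}$ is greater than 1 whenever the model has at least one predictor.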

### How to detect variables that are not needed in a model

### Feature Selection with F-regression

In [34]:
from sklearn.feature_selection import f_regression

In [38]:
f_regression(x,y)

(array([56.04804786,  0.17558437]), array([7.19951844e-11, 6.76291372e-01]))

In [39]:
f_regression(x,y)[1]

array([7.19951844e-11, 6.76291372e-01])

Here, `f_regression` effectively runs two simple linear regressions: one that predicts GPA from SAT alone and one that predicts GPA from Rand 1,2,3 alone. It then computes the F-statistic for each regression and returns the corresponding p-values. <br/>
With 50 features, it would run 50 simple regressions. <br/>
<b>Note: If a variable has a p-value > 0.05, we can disregard it at the 5% significance level.</b>

In the output, the first array contains the F-statistic of each regression and the second array contains the corresponding p-values.
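
The "one simple regression per feature" description can be checked numerically. The sketch below fits a separate `LinearRegression` on each column and converts its $R^2$ into an F-statistic with $F = \frac{R^2}{1-R^2}(n-2)$, then compares the result with `f_regression`. It assumes the default settings of `f_regression` and uses synthetic data: `X_demo` and `y_demo` are hypothetical names, not the SAT/GPA dataset.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the first column drives the target, the second is noise.
rng = np.random.default_rng(0)
n = 100
X_demo = rng.normal(size=(n, 2))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=n)

f_stats, p_vals = f_regression(X_demo, y_demo)

manual_f, manual_p = [], []
for j in range(X_demo.shape[1]):
    xj = X_demo[:, [j]]                       # keep a 2-D shape for scikit-learn
    r2 = LinearRegression().fit(xj, y_demo).score(xj, y_demo)
    dof = n - 2                               # one slope and one intercept estimated
    f_j = r2 / (1 - r2) * dof                 # F-statistic of a simple regression
    manual_f.append(f_j)
    manual_p.append(stats.f.sf(f_j, 1, dof))  # upper-tail probability of F(1, dof)

print(np.allclose(f_stats, manual_f), np.allclose(p_vals, manual_p))
```

Because each feature is tested on its own, this approach does not account for relationships between features.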

In [40]:
p_values = f_regression(x,y)[1]
p_values

array([7.19951844e-11, 6.76291372e-01])

In [41]:
p_values.round(3)

array([0.   , 0.676])

Here, the p-value of SAT rounds to 0.000 (it is about 7.2e-11, not exactly zero) and that of Rand 1,2,3 is 0.676. <b>A variable with a p-value greater than 0.05 is not statistically significant at the 5% level. Hence, SAT is a useful variable and Rand 1,2,3 is not.</b>
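
One way to act on this rule is to turn the p-values into a boolean mask and keep only the columns that pass the threshold. This is a minimal sketch under stated assumptions: the data is synthetic, the names `X_demo`, `y_demo`, `alpha`, and `X_selected` are hypothetical, and 0.05 is a convention rather than a fixed rule.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_regression

# Hypothetical toy data: 'signal' drives the target, 'noise' does not.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=200),
    'noise': rng.normal(size=200),
})
y_demo = 3.0 * X_demo['signal'] + rng.normal(size=200)

alpha = 0.05                       # significance threshold (a convention)
_, p_vals = f_regression(X_demo, y_demo)
keep = p_vals < alpha              # boolean mask, one entry per column
X_selected = X_demo.loc[:, keep]   # keep only the columns that pass

print(list(X_selected.columns))
```

In this notebook the same mask applied to `x` would keep SAT and drop Rand 1,2,3, given the p-values shown above.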

### Creating a summary table

In [45]:
x.columns.values

array(['SAT', 'Rand 1,2,3'], dtype=object)

In [46]:
reg_summary = pd.DataFrame(data = x.columns.values, columns=['Features'])
reg_summary

| | Features |
|---|---|
| 0 | SAT |
| 1 | Rand 1,2,3 |


In [47]:
reg.coef_

array([ 0.00165354, -0.00826982])

In [48]:
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)

In [49]:
reg_summary

| | Features | Coefficients | p-values |
|---|---|---|---|
| 0 | SAT | 0.001654 | 0.0 |
| 1 | Rand 1,2,3 | -0.00827 | 0.676 |
