# Multiple Linear Regression

Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. 

It is used to find:

1. How strong the relationship is between two or more independent variables and one dependent variable.

2. The value of the dependent variable at given values of the independent variables.


## Multiple linear regression formula

The formula for a multiple linear regression is:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + e$$

$y$ = the value of the dependent variable being predicted

$b_0$ = the y-intercept (the value of $y$ when all independent variables are set to 0)

$b_1 x_1$ = the regression coefficient ($b_1$) of the first independent variable ($x_1$), i.e. the effect that a one-unit increase in $x_1$ has on the predicted $y$ value when the other variables are held constant

$\dots$ = the same for however many independent variables you are testing

$b_n x_n$ = the regression coefficient ($b_n$) of the last independent variable ($x_n$)

$e$ = model error (how much variation there is in our estimate of $y$)
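
To make the formula concrete, the short sketch below evaluates its deterministic part (everything except the error term) for two observations. The coefficient values and inputs are made up for illustration and do not come from any fitted model.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration.
b0 = 1.0                     # intercept
b = np.array([2.0, -0.5])    # b1 and b2

# Two observations, each with a value for x1 and x2.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Deterministic part of the formula: b0 + b1*x1 + b2*x2
y_hat = b0 + X @ b
print(y_hat)  # [2. 5.]
```

The matrix product `X @ b` computes `b1*x1 + b2*x2` for every row at once, which is why the same expression works for any number of independent variables.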

Multiple linear regression makes all of the same assumptions as simple linear regression:

Homogeneity of variance (homoscedasticity): the size of the error in our prediction doesn’t change significantly across the values of the independent variables.

Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among variables.

In multiple linear regression, it is possible that some of the independent variables are actually correlated with one another, so it is important to check this before developing the regression model. If two independent variables are too highly correlated (r² > ~0.6), then only one of them should be used in the regression model.

Normality: the residuals (errors) of the model follow a normal distribution.

Linearity: the line of best fit through the data points is a straight line, rather than a curve or some sort of grouping factor.
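
Several of these assumptions are usually checked on the residuals of a fitted model. The sketch below is a rough illustration on synthetic data (not the Cheddar dataset), using only numpy. The split-in-half spread comparison and the skewness/kurtosis summary are simple stand-ins chosen here for illustration, not formal statistical tests.

```python
import numpy as np

# Synthetic data, an assumption for illustration only.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Fit by ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef
residuals = y - fitted

# Homoscedasticity: the residual spread should be similar for
# low and high fitted values (ratio close to 1).
order = np.argsort(fitted)
low = residuals[order[: n // 2]]
high = residuals[order[n // 2:]]
spread_ratio = high.std() / low.std()

# Normality: skewness and excess kurtosis of the residuals
# should both be close to 0.
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)
excess_kurtosis = np.mean(z ** 4) - 3.0

print(f"spread ratio: {spread_ratio:.2f}")
print(f"skewness: {skewness:.2f}, excess kurtosis: {excess_kurtosis:.2f}")
```

In practice a residuals-versus-fitted plot and a Q-Q plot are common visual complements to numbers like these.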

In [2]:
# Import the pandas library
import pandas as pd

In [3]:
# Read the dataset
df = pd.read_csv(r'C:\Users\Lenovo\Desktop\Archana Pachupate PC\Praticals\BSc 5 sem\Paper 3\3 rd Pratical\Cheddar.csv')

In [4]:
df

,taste,Acetic,H2S,Lactic
0,12.3,4.543,3.135,0.86
1,20.9,5.159,5.043,1.53
2,39.0,5.366,5.438,1.57
3,47.9,5.759,7.496,1.81
4,5.6,4.663,3.807,0.99
5,25.9,5.697,7.601,1.09
6,37.3,5.892,8.726,1.29
7,21.9,6.078,7.966,1.78
8,18.1,4.898,3.85,1.29
9,21.0,5.242,4.174,1.58


In [5]:
# Check the first few rows
df.head()

,taste,Acetic,H2S,Lactic
0,12.3,4.543,3.135,0.86
1,20.9,5.159,5.043,1.53
2,39.0,5.366,5.438,1.57
3,47.9,5.759,7.496,1.81
4,5.6,4.663,3.807,0.99


In [6]:
# Check the dimensions (rows and columns)
df.shape

(30, 4)

In [7]:
# Check data types of columns
df.dtypes

taste     float64
Acetic    float64
H2S       float64
Lactic    float64
dtype: object

In [8]:
# Check for missing values
df.isnull().sum()

taste     0
Acetic    0
H2S       0
Lactic    0
dtype: int64

In [9]:
# General information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   taste   30 non-null     float64
 1   Acetic  30 non-null     float64
 2   H2S     30 non-null     float64
 3   Lactic  30 non-null     float64
dtypes: float64(4)
memory usage: 1.1 KB


In [10]:
# Column names
df.columns

Index(['taste', 'Acetic', 'H2S', 'Lactic'], dtype='object')

In [11]:
# Calculate the correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)

           taste    Acetic       H2S    Lactic
taste   1.000000  0.549539  0.755752  0.704236
Acetic  0.549539  1.000000  0.617956  0.603783
H2S     0.755752  0.617956  1.000000  0.644812
Lactic  0.704236  0.603783  0.644812  1.000000


In [12]:
df.corr()

,taste,Acetic,H2S,Lactic
taste,1.0,0.549539,0.755752,0.704236
Acetic,0.549539,1.0,0.617956,0.603783
H2S,0.755752,0.617956,1.0,0.644812
Lactic,0.704236,0.603783,0.644812,1.0
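
The correlation matrix can be compared against the r² > ~0.6 rule of thumb stated earlier. The sketch below re-enters the predictor correlations printed above by hand (so it runs without the CSV file), squares them, and lists any pair above the threshold. The variable names `threshold` and `flagged` are introduced here for illustration.

```python
import pandas as pd

predictors = ["Acetic", "H2S", "Lactic"]

# Pairwise correlations among the predictors, copied from the output above.
corr = pd.DataFrame(
    [[1.000000, 0.617956, 0.603783],
     [0.617956, 1.000000, 0.644812],
     [0.603783, 0.644812, 1.000000]],
    index=predictors,
    columns=predictors,
)

# Square the correlations to get r^2 for each pair.
r_squared = corr ** 2

# Collect each unordered pair whose r^2 exceeds the rule-of-thumb threshold.
threshold = 0.6
flagged = [
    (a, b)
    for i, a in enumerate(predictors)
    for b in predictors[i + 1:]
    if r_squared.loc[a, b] > threshold
]

print(r_squared.round(3))
print(flagged)  # []
```

Read this way, no pair of predictors is flagged, since the largest r² (H2S and Lactic) is about 0.416. Note that the raw correlations r for all three pairs are slightly above 0.6, so the conclusion depends on applying the threshold to r² as the text states.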


In [13]:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

In [17]:
# Extract x and y values
x = df[['Acetic', 'H2S', 'Lactic']]
y = df['taste']
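
Before fitting, it may help to see what an ordinary least-squares fit with an intercept does under the hood. The sketch below uses numpy only and just the ten rows displayed earlier, so its coefficients will not match a fit on the full 30-row dataset. The column of ones plays the same role as the constant that `sm.add_constant` adds in the fitting cell.

```python
import numpy as np

# The ten rows displayed earlier (a subset of the 30-row dataset).
taste = np.array([12.3, 20.9, 39.0, 47.9, 5.6, 25.9, 37.3, 21.9, 18.1, 21.0])
predictors = np.array([
    # Acetic, H2S, Lactic
    [4.543, 3.135, 0.86],
    [5.159, 5.043, 1.53],
    [5.366, 5.438, 1.57],
    [5.759, 7.496, 1.81],
    [4.663, 3.807, 0.99],
    [5.697, 7.601, 1.09],
    [5.892, 8.726, 1.29],
    [6.078, 7.966, 1.78],
    [4.898, 3.850, 1.29],
    [5.242, 4.174, 1.58],
])

# Prepend a column of ones so the first coefficient acts as the intercept b0.
design = np.column_stack([np.ones(len(taste)), predictors])

# Least squares: choose coefficients minimising the sum of squared residuals.
coef, *_ = np.linalg.lstsq(design, taste, rcond=None)
residuals = taste - design @ coef

for name, value in zip(["const", "Acetic", "H2S", "Lactic"], coef):
    print(f"{name:>6}: {value:.3f}")
```

With only ten observations and four parameters, these estimates are very unstable. The point is the mechanics, not the numbers.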

In [24]:
# Fit multiple linear regression model
X = sm.add_constant(x)  # Add constant term for intercept
model = sm.OLS(y, X).fit()

# Print summary statistics
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                  taste   R-squared:                       0.652
Model:                            OLS   Adj. R-squared:                  0.612
Method:                 Least Squares   F-statistic:                     16.22
Date:                Wed, 08 May 2024   Prob (F-statistic):           3.81e-06
Time:                        11:34:25   Log-Likelihood:                -109.89
No. Observations:                  30   AIC:                             227.8
Df Residuals:                      26   BIC:                             233.4
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -28.8768     19.735     -1.463      0.1