# Multiple linear regression

**Multiple linear regression** is a statistical technique used to <u>model the relationship between **one dependent variable and two or more independent variables.**</u>

This method extends **simple linear regression,** which involves only <u>one independent variable, by allowing for multiple predictors.</u>

The goal is to understand *how changes in the independent variables are associated with changes in the dependent variable* and <u>to predict the dependent variable based on known values of the independent variables.</u>

## Key Concepts
**1. Dependent Variable (Y):** The outcome or the variable you are trying to predict or explain.<br>
**2. Independent Variables (X1, X2, ... Xn):** The predictors or the variables you use to predict the dependent variable.<br>
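**3. Regression Coefficients (β0, β1, β2, ... βn):** β0 is the intercept. Each of β1, ... βn gives the expected change in the dependent variable for a one-unit increase in the corresponding independent variable, holding the other independent variables constant; its sign shows whether the association is positive or negative.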


## Model Representation
The multiple linear regression model can be represented by the following equation:

**𝑌 = 𝛽0 + 𝛽1𝑋1 + 𝛽2𝑋2 + . . . + 𝛽𝑛𝑋𝑛 + 𝜀**

Where:

- **𝑌** is the dependent variable.<br>
- **𝑋1, 𝑋2, . . ., 𝑋𝑛** are the independent variables.<br>
- **𝛽0** is the intercept.<br>
- **𝛽1, 𝛽2, . . . ,𝛽𝑛** are the regression coefficients.<br>
- **𝜀** is the error term, representing the variation in 𝑌 that cannot be explained by the independent variables.
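
As a minimal sketch of how the equation above turns inputs into a prediction, the NumPy snippet below evaluates it for two observations. The coefficient values and input rows are made up for illustration and do not come from any fitted model.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration
beta_0 = 1.0                     # intercept
betas = np.array([2.0, 0.5])     # beta_1, beta_2

# Two observations, each with values for X1 and X2
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Y_hat = beta_0 + beta_1*X1 + beta_2*X2 (the error term is unknown, so it is omitted)
y_hat = beta_0 + X @ betas
print(y_hat)  # [4. 9.]
```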

## Objectives
**1. Estimation:** Determine the values of the regression coefficients that best fit the observed data.<br>
**2. Prediction:** Use the model to predict the value of the dependent variable for given values of the independent variables.<br>
**3. Interpretation:** Understand the influence of each independent variable on the dependent variable.<br>
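
The estimation objective is commonly met with ordinary least squares, which picks the coefficients that minimise the sum of squared residuals. The sketch below assumes that method and uses synthetic, noise-free data generated from made-up coefficients, so the recovered values can be compared with the known ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic predictors: 20 observations of 2 independent variables (made-up data)
X = rng.uniform(0, 10, size=(20, 2))

# Made-up "true" coefficients: intercept 3.0, slopes 1.5 and -2.0
true_beta = np.array([3.0, 1.5, -2.0])
y = true_beta[0] + X @ true_beta[1:]

# Prepend a column of ones so the intercept is estimated together with the slopes
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||X_design @ beta - y||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(beta_hat, 6))
```

Because the synthetic data contain no noise, the estimates should match the made-up coefficients up to floating-point error; with real data they would only approximate the underlying relationship.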

## Assumptions
Multiple linear regression relies on several key assumptions:

**1. Linearity:** The relationship between the dependent variable and each independent variable is linear.<br>
**2. Independence:** The observations are independent of each other.<br>
**3. Homoscedasticity:** The variance of the error terms is constant across all levels of the independent variables.<br>
**4. Normality:** The error terms are normally distributed.
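
These assumptions are usually checked on the residuals of a fitted model. The following is a rough sketch on synthetic data; the data, the noise level, and the split-in-half comparison are all illustrative assumptions, not a formal test. It fits by least squares, then compares the residual spread for low and high fitted values as a crude look at homoscedasticity.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with constant-variance normal noise (illustrative assumption)
n = 200
X = rng.uniform(0, 10, size=(n, 2))
y = 5.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1.0, size=n)

# Fit by ordinary least squares with an intercept column
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

fitted = X_design @ beta_hat
residuals = y - fitted

# With an intercept in the model, OLS residuals average to (numerically) zero
print("mean residual:", residuals.mean())

# Crude homoscedasticity look: residual spread for low vs high fitted values
order = np.argsort(fitted)
low, high = residuals[order[: n // 2]], residuals[order[n // 2:]]
print("residual std (low fitted): ", low.std())
print("residual std (high fitted):", high.std())
```

Similar spreads in the two halves are consistent with constant variance. A residual-versus-fitted plot or a formal statistical test would be more informative on real data.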

## Example
Suppose you want to predict sales **(Y)** based on advertising expenditure on TV **(X1),** radio **(X2),** and newspapers **(X3).** <br>
The <u>**multiple linear regression** model</u> might look like:

#### Sales = 𝛽0 + 𝛽1(TV) + 𝛽2(Radio) + 𝛽3(Newspapers) + ε

By fitting this model to your data, you can estimate the coefficients <u>**𝛽0, 𝛽1, 𝛽2, and 𝛽3,**</u> which can then be used to predict future sales based on different levels of advertising expenditures.

## Applications
Multiple linear regression is widely used in various fields, including:

- **Economics:** To model relationships between economic indicators.
- **Finance:** To predict stock prices based on multiple financial indicators.
- **Medicine:** To understand the impact of various factors on health outcomes.
- **Social Sciences:** To study the influence of multiple factors on human behavior.<br>

By understanding and applying multiple linear regression, researchers and analysts can gain insights into complex relationships and make informed predictions based on multiple factors.

# Import Library

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sklearn import linear_model


# Dataset Loading

In [2]:
# load the hiring dataset from the CSV file
url = "https://raw.githubusercontent.com/akdubey2k/ML/main/2_linear_regression_multivariate/Exercise/hiring.csv"
df = pd.read_csv(url)
df

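| | experience | test_score (out of 10) | interview_score (out of 10) | salary ($) |
|---|---|---|---|---|
| 0 | NaN | 8.0 | 9 | 50000 |
| 1 | NaN | 8.0 | 6 | 45000 |
| 2 | five | 6.0 | 7 | 60000 |
| 3 | two | 10.0 | 10 | 65000 |
| 4 | seven | 9.0 | 6 | 70000 |
| 5 | three | 7.0 | 10 | 62000 |
| 6 | ten | NaN | 7 | 72000 |
| 7 | eleven | 7.0 | 8 | 80000 |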


# Dataset Exploratory Data Analysis (EDA)

In [3]:
# count the missing values in each column
df.isna().sum()

experience                     2
test_score (out of 10)         1
interview_score (out of 10)    0
salary ($)                     0
dtype: int64

In [4]:
# isnull() is an alias of isna(), so the counts are identical
df.isnull().sum()

experience                     2
test_score (out of 10)         1
interview_score (out of 10)    0
salary ($)                     0
dtype: int64

In [5]:
# flag the rows where the experience value is missing
df.experience.isna()

0     True
1     True
2    False
3    False
4    False
5    False
6    False
7    False
Name: experience, dtype: bool

# Data Cleaning

In [6]:
# fill the missing experience values with the string 'zero', so that the
# whole column can later be converted from number words to integers
df[['experience']] = df[['experience']].fillna('zero')
df[['experience']]
# df.experience = df.experience.fillna('zero')
# df.experience

| | experience |
|---|---|
| 0 | zero |
| 1 | zero |
| 2 | five |
| 3 | two |
| 4 | seven |
| 5 | three |
| 6 | ten |
| 7 | eleven |


## Word to Number

In [7]:
# Use the word2number package to convert number words (e.g. "twenty one") to
# numeric digits (21).
# It works for positive numbers up to 999,999,999,999 (i.e. billions).
!pip install word2number
from word2number import w2n



In [8]:
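# convert each number word in the experience column to an integer
df.experience = df.experience.apply(w2n.word_to_num)
# quick demonstration of the converter on a longer phrase
print(w2n.word_to_num("two million three thousand nine hundred and eighty four"))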

2003984
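
For a column that only holds a handful of small number words, a plain dictionary lookup is a possible alternative that avoids the extra dependency. This is a sketch rather than the notebook's method: the `WORD_TO_NUM` mapping is a hypothetical helper covering only the words that appear in this dataset.

```python
import pandas as pd

# Hypothetical lookup table covering only the words present in this dataset
WORD_TO_NUM = {
    "zero": 0, "two": 2, "three": 3, "five": 5,
    "seven": 7, "ten": 10, "eleven": 11,
}

# The experience column after missing values were filled with "zero"
experience = pd.Series(
    ["zero", "zero", "five", "two", "seven", "three", "ten", "eleven"]
)

# map() returns NaN for any word missing from the dictionary,
# so unmapped values are easy to detect
experience_num = experience.map(WORD_TO_NUM)
print(experience_num.tolist())  # [0, 0, 5, 2, 7, 3, 10, 11]
```

The trade-off is coverage: the dictionary must list every word that can occur, whereas `word2number` parses arbitrary number phrases.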


In [9]:
df

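| | experience | test_score (out of 10) | interview_score (out of 10) | salary ($) |
|---|---|---|---|---|
| 0 | 0 | 8.0 | 9 | 50000 |
| 1 | 0 | 8.0 | 6 | 45000 |
| 2 | 5 | 6.0 | 7 | 60000 |
| 3 | 2 | 10.0 | 10 | 65000 |
| 4 | 7 | 9.0 | 6 | 70000 |
| 5 | 3 | 7.0 | 10 | 62000 |
| 6 | 10 | NaN | 7 | 72000 |
| 7 | 11 | 7.0 | 8 | 80000 |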


In [10]:
import math
# select the column as a Series (single brackets) so that mean() returns a
# scalar; math.floor() on a one-element Series triggers a deprecation warning
mean_testScore = math.floor(df['test_score (out of 10)'].mean())
mean_testScore

7

In [11]:
# same approach for the median: a Series gives a scalar that math.floor accepts
median_testScore = math.floor(df['test_score (out of 10)'].median())
median_testScore

8

In [12]:
# fill the missing test score with the floored mean test score computed above
df['test_score (out of 10)'] = df['test_score (out of 10)'].fillna(mean_testScore)

In [13]:
df

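| | experience | test_score (out of 10) | interview_score (out of 10) | salary ($) |
|---|---|---|---|---|
| 0 | 0 | 8.0 | 9 | 50000 |
| 1 | 0 | 8.0 | 6 | 45000 |
| 2 | 5 | 6.0 | 7 | 60000 |
| 3 | 2 | 10.0 | 10 | 65000 |
| 4 | 7 | 9.0 | 6 | 70000 |
| 5 | 3 | 7.0 | 10 | 62000 |
| 6 | 10 | 7.0 | 7 | 72000 |
| 7 | 11 | 7.0 | 8 | 80000 |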
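
The choice between the mean and the median as the fill value matters here because the two differ. As a small sketch, the snippet below rebuilds the test score column from the original table and compares both candidates; note that `fillna` returns a new Series and leaves the original unchanged unless the result is assigned back.

```python
import math

import numpy as np
import pandas as pd

# Test scores as listed in the dataset, with one missing value
scores = pd.Series([8.0, 8.0, 6.0, 10.0, 9.0, 7.0, np.nan, 7.0])

# mean() and median() skip NaN by default
mean_fill = math.floor(scores.mean())      # floor(55 / 7) = 7
median_fill = math.floor(scores.median())  # median of the 7 known scores = 8

filled_with_mean = scores.fillna(mean_fill)
filled_with_median = scores.fillna(median_fill)

print(mean_fill, median_fill)  # 7 8
print(filled_with_mean.isna().sum())  # 0
```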


# Model Creation

In [14]:
model = linear_model.LinearRegression()
# Note: df.drop[['salary ($)']] raises "TypeError: 'method' object is not subscriptable".
# drop is a method, so call it with parentheses: df.drop('salary ($)', axis='columns')

# fit on the three feature columns, with salary as the target
model.fit(df[['experience', 'test_score (out of 10)', 'interview_score (out of 10)']], df['salary ($)'])
# predict the salary for 2 years of experience, test score 9, interview score 6
model.predict([[2, 9, 6]])



array([53713.86677124])

# Model Training and Prediction

In [15]:
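# refit using drop() to select the same three feature columns as before
model.fit(df.drop('salary ($)', axis='columns'), df['salary ($)'])
# predict the salary for 12 years of experience, test score 10, interview score 10
model.predict([[12, 10, 10]])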



array([93747.79628651])
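
To see how the fitted model maps back to the regression equation, the coefficients can be read from the estimator's `coef_` and `intercept_` attributes. The sketch below rebuilds the cleaned table by hand so that it runs on its own, then reproduces a prediction manually. The shortened column names are chosen here for brevity and are not the ones used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Cleaned hiring data, copied from the table above (column names shortened here)
df = pd.DataFrame({
    "experience": [0, 0, 5, 2, 7, 3, 10, 11],
    "test_score": [8.0, 8.0, 6.0, 10.0, 9.0, 7.0, 7.0, 7.0],
    "interview_score": [9, 6, 7, 10, 6, 10, 7, 8],
    "salary": [50000, 45000, 60000, 65000, 70000, 62000, 72000, 80000],
})

features = ["experience", "test_score", "interview_score"]
model = LinearRegression().fit(df[features], df["salary"])

# One coefficient per feature, plus the intercept (beta_0)
for name, coef in zip(features, model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {model.intercept_:.2f}")

# Reproduce a prediction by hand: beta_0 + beta_1*X1 + beta_2*X2 + beta_3*X3
candidate = pd.DataFrame([[2, 9, 6]], columns=features)
manual = model.intercept_ + candidate.to_numpy() @ model.coef_
print(manual, model.predict(candidate))
```

Passing a DataFrame with matching column names to `predict` also avoids the feature-name warning that scikit-learn emits when a model fitted on a DataFrame is given a plain list.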