**Multiple Linear Regression (MLR)** 

  
This notebook contains **step-by-step code, explanation, and interpretation** for building and evaluating a **Multiple Linear Regression (MLR)** model using Python.

It is designed for **teaching and learning purposes**, and also simulates a real-world business scenario where a company wants to identify **which department or factor most strongly impacts promotions** — to decide where to invest their time and resources.

**Learning Objectives**

- Understand how **Multiple Linear Regression (MLR)** works behind the scenes
- Learn to **preprocess data**: encoding, cleaning, reshaping
- Apply and interpret a **linear regression model**
- Use **OLS (Ordinary Least Squares)** to get detailed statistics
- Perform **Backward Elimination** using p-values
- Understand concepts like:
  - **Bias and Variance**
  - **Intercepts and Coefficients**
  - **Adjusted R² vs R²**
  - **T-Test & p-values**
  - **Feature Elimination**
  - Basic idea of **API (Application Programming Interface)**



**Tools & Libraries Used**
    
- **Python** 
- **NumPy** – numerical operations
- **Pandas** – data manipulation
- **Matplotlib** – visualization
- **scikit-learn** – machine learning
- **statsmodels** – statistical modeling


**Final Goal:**

To help a company answer:
    
> "**Which department or variable impacts promotion the most, so we can confidently invest in it?**"


# Import Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Load the Dataset

In [None]:
df = pd.read_csv(r"C:\Users\shali\Desktop\DS_Road_Map\8. Machine Learning\Regression\Multiple_Linear_Regression\Investment.csv")
df.head()

|   | DigitalMarketing | Promotion | Research | State | Profit |
| --- | --- | --- | --- | --- | --- |
| 0 | 165349.2 | 136897.8 | 471784.1 | Hyderabad | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | Bangalore | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Chennai | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | Hyderabad | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Chennai | 166187.94 |


# Understand the Data 

Let's view the available columns and understand the structure of the dataset.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   DigitalMarketing  50 non-null     float64
 1   Promotion         50 non-null     float64
 2   Research          50 non-null     float64
 3   State             50 non-null     object 
 4   Profit            50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


In [4]:
df.columns #View Columns

Index(['DigitalMarketing', 'Promotion', 'Research', 'State', 'Profit'], dtype='object')

In [5]:
df.describe()

|   | DigitalMarketing | Promotion | Research | Profit |
| --- | --- | --- | --- | --- |
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 73721.6156 | 121344.6396 | 211025.0978 | 112012.6392 |
| std | 45902.256482 | 28017.802755 | 122290.310726 | 40306.180338 |
| min | 0.0 | 51283.14 | 0.0 | 14681.4 |
| 25% | 39936.37 | 103730.875 | 129300.1325 | 90138.9025 |
| 50% | 73051.08 | 122699.795 | 212716.24 | 107978.19 |
| 75% | 101602.8 | 144842.18 | 299469.085 | 139765.9775 |
| max | 165349.2 | 182645.56 | 471784.1 | 192261.83 |


In [6]:
df.isnull().sum() 

DigitalMarketing    0
Promotion           0
Research            0
State               0
Profit              0
dtype: int64

# Define Independent & Dependent Variables


In [7]:
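X = df.iloc[:, :-1]  # All columns except the last (Profit); note that this still includes Promotion
y = df.iloc[:, -4]   # Fourth column from the end, i.e. Promotion, used as the target (change the index to suit your dataset)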
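With the positional slices above, `Promotion` ends up both in `X` and as `y`, so the model can read the answer straight from its inputs. A minimal sketch of a leak-free setup follows, assuming `Promotion` is the intended target (as the stated goal suggests). The frame `demo` is stand-in data built from the five rows shown by `df.head()`, and the names `demo`, `target`, `X_clean` and `y_clean` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Stand-in data: the five rows displayed by df.head() above.
demo = pd.DataFrame({
    "DigitalMarketing": [165349.20, 162597.70, 153441.51, 144372.41, 142107.34],
    "Promotion": [136897.80, 151377.59, 101145.55, 118671.85, 91391.77],
    "Research": [471784.10, 443898.53, 407934.54, 383199.62, 366168.42],
    "State": ["Hyderabad", "Bangalore", "Chennai", "Hyderabad", "Chennai"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

target = "Promotion"  # assumption: the column the analysis wants to explain

# Select by name rather than by position, and drop the target from the features.
y_clean = demo[target]
X_clean = demo.drop(columns=[target])

print(list(X_clean.columns))
```

Selecting by column name makes the intent visible in the code and avoids off-by-one slips with negative indices. Whether `Profit` belongs among the features is a separate modelling choice.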

# Encode Categorical Data (if any)

In [8]:
X = pd.get_dummies(X, dtype=int)
print(X)

    DigitalMarketing  Promotion   Research  State_Bangalore  State_Chennai  \
0          165349.20  136897.80  471784.10                0              0   
1          162597.70  151377.59  443898.53                1              0   
2          153441.51  101145.55  407934.54                0              1   
3          144372.41  118671.85  383199.62                0              0   
4          142107.34   91391.77  366168.42                0              1   
5          131876.90   99814.71  362861.36                0              0   
6          134615.46  147198.87  127716.82                1              0   
7          130298.13  145530.06  323876.68                0              1   
8          120542.52  148718.95  311613.29                0              0   
9          123334.88  108679.17  304981.62                1              0   
10         101913.08  110594.11  229160.95                0              1   
11         100671.96   91790.61  249744.55                1     
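One thing to watch with one-hot encoding: when every category gets its own column, the dummy columns sum to 1 in each row, which duplicates the intercept (the "dummy variable trap"). The sketch below, on a tiny illustrative frame named `states`, shows the `drop_first=True` option of `pd.get_dummies`, which drops one category to avoid this.

```python
import pandas as pd

# Illustrative data only: a single categorical column.
states = pd.DataFrame({"State": ["Hyderabad", "Bangalore", "Chennai", "Hyderabad"]})

all_dummies = pd.get_dummies(states, dtype=int)                 # one column per category
reduced = pd.get_dummies(states, dtype=int, drop_first=True)    # first category dropped

print(list(all_dummies.columns))
print(list(reduced.columns))
print(all_dummies.sum(axis=1).tolist())  # every row sums to 1 with the full set
```

The dropped category becomes the baseline, and each remaining dummy coefficient is read as a difference from that baseline.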

# Split the Dataset

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the Model

In [10]:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

| Parameter | Value |
| --- | --- |
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


# Predict and Compare

In [11]:
y_pred = regressor.predict(X_test)
print(y_pred)

[182645.56  91790.61 110594.11  84710.77 101145.55 127864.55  65947.93
 152701.92 122782.75  91391.77]


In [12]:
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison.head())

       Actual  Predicted
28  182645.56  182645.56
11   91790.61   91790.61
10  110594.11  110594.11
41   84710.77   84710.77
2   101145.55  101145.55


# Model Evaluation

In [17]:
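# .score() returns R². A low training R² points to underfitting (high bias);
# a test R² far below the training R² points to overfitting (high variance).
train_r2 = regressor.score(X_train, y_train)
test_r2 = regressor.score(X_test, y_test)

print("Train R² (bias check):", train_r2)
print("Test R² (variance check):", test_r2)

Train R² (bias check): 1.0
Test R² (variance check): 1.0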
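Scores of exactly 1.0 on both train and test are rarely a sign of a good model on real data. They are what you get when the target is also one of the input columns. The sketch below reproduces that effect on synthetic data. All names (`features`, `target`, `leaky`, `train_test_r2`) are illustrative, and the data is random, not the Investment dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
# The target depends only weakly on the features, plus noise.
target = 0.5 * features[:, 0] + rng.normal(size=50)

# Leaky design matrix: the target itself is appended as an extra column.
leaky = np.column_stack([features, target])


def train_test_r2(X, y):
    """Fit on an 80/20 split and return (train R², test R²)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)


clean_scores = train_test_r2(features, target)
leaky_scores = train_test_r2(leaky, target)
print("clean:", clean_scores)
print("leaky:", leaky_scores)
```

The leaky version scores essentially 1.0 on both splits regardless of how noisy the real relationship is, which matches the pattern in the output above.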


# Model Parameters

In [19]:
m_slope = regressor.coef_  # one coefficient per column of X, in column order
print("Slope/Coefficients (b₁ … b₆):", m_slope)

Slope/Coefficients (b₁ … b₆): [-3.58815785e-16  1.00000000e+00 -3.89677049e-17 -1.53880995e-13
  3.15122730e-13 -1.61241735e-13]


In [20]:
c_intercept = regressor.intercept_
print("Intercept (b₀):", c_intercept)

Intercept (b₀): 8.731149137020111e-11
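A bare coefficient array is hard to read because it does not say which value belongs to which feature. Since `coef_` follows the column order of `X`, the second value (1.0) above lines up with `Promotion`. The sketch below shows one way to label coefficients, using a small made-up frame (`X_demo`, `y_demo` are illustrative names) where the true relationship is known.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative data with a known relationship: y = 2*a + 3*b + 1.
X_demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.0, 1.0, 0.0, 1.0]})
y_demo = 2.0 * X_demo["a"] + 3.0 * X_demo["b"] + 1.0

model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its column name.
coef_by_name = pd.Series(model.coef_, index=X_demo.columns)
print(coef_by_name)
print("intercept:", model.intercept_)
```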


# Add Constant for OLS

OLS (Ordinary Least Squares) from `statsmodels` does not add an intercept on its own, so a column of ones must be added to `X` to represent the constant term.


# Add Constant Column

In [21]:
X = np.append(arr=np.ones((X.shape[0], 1)).astype(int), values=X, axis=1)

# Backward Elimination (Full Features)



In [23]:
import statsmodels.api as sm

X_opt = X[:, [0, 1, 2, 3, 4, 5]]  # const + first five features; column 6 (the last State dummy) is left out. Adjust to your column count
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Dep. Variable: | Promotion | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 7.5e+29 |
| Date: | Fri, 11 Jul 2025 | Prob (F-statistic): | 0.0 |
| Time: | 11:07:08 | Log-Likelihood: | 1082.9 |
| No. Observations: | 50 | AIC: | -2154.0 |
| Df Residuals: | 44 | BIC: | -2142.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | -3.638e-11 | 7.46e-11 | -0.488 | 0.628 | -1.87e-10 | 1.14e-10 |
| x1 | -1.943e-16 | 4.98e-16 | -0.390 | 0.698 | -1.2e-15 | 8.09e-16 |
| x2 | 1.0000 | 5.6e-16 | 1.78e+15 | 0.000 | 1.000 | 1.000 |
| x3 | 3.469e-16 | 1.84e-16 | 1.886 | 0.066 | -2.37e-17 | 7.18e-16 |
| x4 | 2.183e-11 | 3.49e-11 | 0.625 | 0.535 | -4.86e-11 | 9.22e-11 |
| x5 | 7.276e-12 | 3.58e-11 | 0.203 | 0.840 | -6.49e-11 | 7.95e-11 |

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Omnibus: | 0.163 | Durbin-Watson: | 0.214 |
| Prob(Omnibus): | 0.922 | Jarque-Bera (JB): | 0.041 |
| Skew: | 0.066 | Prob(JB): | 0.98 |
| Kurtosis: | 2.953 | Cond. No. | 1.47e+06 |


**OLS Regression Results – Full Explanation**

**Context:**

You ran the following code:

```python
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
```

**Part-by-Part Explanation of the Output**


**Header Summary**

| Term                 | Meaning                                                             |
| -------------------- | ------------------------------------------------------------------- |
| **Dep. Variable**    | `Promotion`: The dependent variable (target/output)                 |
| **Model**            | `OLS` = Ordinary Least Squares Regression                           |
| **Method**           | Least Squares method used to estimate coefficients                  |
| **No. Observations** | 50 data points (rows in dataset)                                    |
| **Df Residuals**     | 44 = 50 - 6 → total observations minus number of model coefficients |
| **Df Model**         | 5 → You used 5 predictors (x1 to x5)                                |
| **Covariance Type**  | Non-robust standard errors used                                     |


**Model Quality Metrics**

| Metric                 | Value                      | Meaning                                                                                                     |
| ---------------------- | -------------------------- | ----------------------------------------------------------------------------------------------------------- |
| **R-squared**          | `1.000`                    | Model explains **100%** of variance in target. A perfect fit on real data is a warning sign: here the target `Promotion` is also a column of `X`, so the model simply copies it (target leakage). |
| **Adj. R-squared**     | `1.000`                    | R² adjusted for the number of predictors. Still 1.000, for the same reason.                                 |
| **F-statistic**        | `7.5e+29`                  | Extremely high → a symptom of the perfect fit rather than evidence of a useful model                        |
| **Prob (F-statistic)** | `0.00`                     | p-value for F-test is 0 → the model overall is significant                                                  |
| **AIC / BIC**          | AIC: `-2154`, BIC: `-2142` | Lower values indicate a better model (used for comparing models)                                            |


**Coefficient Table (Main Focus)**

| Term   | Coef        | Std Err     | t       | P>\|t\| | Significance      |
|--------|-------------|-------------|---------|------|----------------------|
| const  | -3.638e-11  | 7.46e-11    | -0.488  | 0.628| ❌ Not significant    |
| x1     | -1.943e-16  | 4.98e-16    | -0.390  | 0.698| ❌ Not significant    |
| x2     | 1.0000      | 5.6e-16     | 1.78e+15| 0.000| ✅ Highly significant |
| x3     | 3.469e-16   | 1.84e-16    | 1.886   | 0.066| ⚠️ Borderline         |
| x4     | 2.183e-11   | 3.49e-11    | 0.625   | 0.535| ❌ Not significant    |
| x5     | 7.276e-12   | 3.58e-11    | 0.203   | 0.840| ❌ Not significant    |


**How to Interpret This?**

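- **Coef (Coefficient)**: The change in the target variable for a 1-unit change in that predictor, holding the other predictors constant.
- **P > |t| (p-value)**: Tells you whether the coefficient differs significantly from zero.
  - If **p < 0.05**, the variable is **significant** → Keep it.
  - If **p ≥ 0.05**, the variable is **not significant** → It is a candidate for removal (Backward Elimination).
- **t** and **Std Err**: `t = coef / std err`; the p-value is computed from this t-statistic.
- Column mapping: `X_opt` keeps the column order of `X`, so `x1` = `DigitalMarketing`, `x2` = `Promotion`, `x3` = `Research`, `x4` = `State_Bangalore`, `x5` = `State_Chennai`.
- Based on this table:
  - Only `x2` is significant, but `x2` is `Promotion`, the same column used as the target `y`, which is why its coefficient is exactly 1
  - `x1`, `x3`, `x4`, `x5` all have p ≥ 0.05; Backward Elimination removes them one at a time, starting with the highest p-value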


**Statistical Tests**

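| Metric              | Value                                                  | Meaning |
| ------------------- | ------------------------------------------------------ | ------- |
| **Omnibus / JB**    | Omnibus `0.163` (p = `0.922`), JB `0.041` (p = `0.98`) | Test whether residuals are normally distributed (good if p > 0.05); both pass here |
| **Durbin-Watson**   | `0.214`                                                | Values well below 2 indicate **positive autocorrelation** in the residuals (bad). With a perfect fit the residuals are likely just floating-point noise, so read this with caution |
| **Skew / Kurtosis** | `0.066` / `2.953`                                      | `Skew ≈ 0`, `Kurtosis ≈ 3` → consistent with normally distributed residuals |
| **Cond. No.**       | `1.47e+06`                                             | Very high → possible **multicollinearity**, although unscaled features with values in the hundreds of thousands also inflate it |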



**What to Do Next? → Backward Elimination**

Based on this summary:

* Remove the variable with **highest p-value > 0.05** → `x5` (`0.840`)
* Rerun OLS without it
* Repeat the process until all p-values < 0.05



**OLS Regression Results**

We started with 5 independent variables (x1 to x5) and added a constant term.

**Summary Highlights:**

- **R-squared = 1.000** → Model fits data perfectly, because `Promotion` is both the target and a column of `X` (x2)
- **Only x2 is statistically significant (p < 0.05)**
- **x1, x4, x5 have high p-values → candidates for removal**; x3 (p = 0.066) is also above 0.05
- **Durbin-Watson = 0.214** → Indicates autocorrelation (not ideal)
- **Condition Number = 1.47e+06** → Multicollinearity suspected; unscaled features can also inflate it

**Next Step:**
Perform **Backward Elimination**:
- Remove variable with highest p-value (`x5`)
- Rerun the model
- Continue until all p-values < 0.05


# Remove Feature with Highest p-value
Keep repeating this by removing the feature with the highest p-value above 0.05 until all remaining features are significant.

In [24]:
X_opt = X[:, [0, 1, 2, 3, 4]]  # Removed column 5 (x5, State_Chennai), the highest p-value
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Dep. Variable: | Promotion | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 1.217e+30 |
| Date: | Fri, 11 Jul 2025 | Prob (F-statistic): | 0.0 |
| Time: | 11:22:39 | Log-Likelihood: | 1088.9 |
| No. Observations: | 50 | AIC: | -2168.0 |
| Df Residuals: | 45 | BIC: | -2158.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | -1.164e-10 | 6.47e-11 | -1.798 | 0.079 | -2.47e-10 | 1.4e-11 |
| x1 | -1.943e-16 | 4.35e-16 | -0.447 | 0.657 | -1.07e-15 | 6.82e-16 |
| x2 | 1.0000 | 4.91e-16 | 2.04e+15 | 0.000 | 1.000 | 1.000 |
| x3 | -3.886e-16 | 1.59e-16 | -2.442 | 0.019 | -7.09e-16 | -6.81e-17 |
| x4 | -7.276e-12 | 2.69e-11 | -0.270 | 0.788 | -6.15e-11 | 4.7e-11 |

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Omnibus: | 1.887 | Durbin-Watson: | 0.742 |
| Prob(Omnibus): | 0.389 | Jarque-Bera (JB): | 1.425 |
| Skew: | -0.208 | Prob(JB): | 0.491 |
| Kurtosis: | 2.285 | Cond. No. | 1.44e+06 |


# Final Model after Elimination

In [25]:
X_opt = X[:, [0, 2]]  # Final selected features: const and column 2 (Promotion)
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Dep. Variable: | Promotion | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 4.132e+31 |
| Date: | Fri, 11 Jul 2025 | Prob (F-statistic): | 0.0 |
| Time: | 11:23:02 | Log-Likelihood: | 1140.7 |
| No. Observations: | 50 | AIC: | -2277.0 |
| Df Residuals: | 48 | BIC: | -2274.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | -5.457e-11 | 1.94e-11 | -2.818 | 0.007 | -9.35e-11 | -1.56e-11 |
| x1 | 1.0000 | 1.56e-16 | 6.43e+15 | 0.000 | 1.000 | 1.000 |

| Field | Value | Field | Value |
| --- | --- | --- | --- |
| Omnibus: | 50.89 | Durbin-Watson: | 0.057 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 632.888 |
| Skew: | -2.078 | Prob(JB): | 3.72e-138 |
| Kurtosis: | 19.927 | Cond. No. | 5.59e+05 |


# Concept Highlights

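- **API** (Application Programming Interface): A defined set of functions through which one piece of software uses another. For example, `statsmodels.api` exposes `sm.OLS`; a web API connects a front-end to a back-end in the same way.
- **OLS**: A statistical method that fits linear regression by minimizing the sum of squared residuals.
- **p-value**: Helps determine if a feature is statistically significant (p < 0.05 is the conventional cutoff).
- **T-test**: Tests the hypothesis that a coefficient is zero, using `t = coef / std err`.
- **Backward Elimination**: A feature selection method based on p-values: start with all features and remove the least significant one at each step.
- **Adjusted R² vs R²**: Adjusted R² penalizes extra predictors and never exceeds R². It rises only when an added variable improves the fit more than chance would, so it is the more reliable figure when adding/removing variables.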


# Final Decision: Which Department to Focus On?

Based on statistical analysis using **Backward Elimination in OLS**, the model retained only **one feature (x1)**, which:

- Has a **p-value = 0.000** → highly significant
- Has **coefficient = 1.0** → 1 unit increase in x1 increases promotion by 1 unit
- Explains **100% of the variation** in the Promotion outcome (R² = 1.000)

In this final model `x1` is column 2 of `X`, which is the `Promotion` column itself. The perfect fit therefore shows only that Promotion predicts Promotion; it does not yet single out a department or factor that drives promotions.


# **Recommendation to Company:**

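The analysis as run does not yet support a department-level recommendation: the retained **x1** is `Promotion`, not a department, and the dataset has no `Department_Marketing` column. Before committing time, budget, and resources, rerun the model with `Promotion` removed from `X`, then invest in the factor that stays significant with the strongest effect.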
