# Project -1

## Sales Prediction (Simple Linear Regression)

## Problem Statement

Build a model which predicts sales based on the money spent on different platforms for marketing.

## Steps for building a regression model using Python:

1. Import the pandas and numpy libraries.
2. Use `read_csv` to load the dataset into a DataFrame.
3. Identify the feature(s) (`X`) and outcome (`Y`) variable in the DataFrame for building the model.
4. Split the dataset into training and validation sets using `train_test_split()`.
5. Import the statsmodels library and fit the model using the `OLS()` method.
6. Print the model summary and conduct model diagnostics.

## Data
- Data is available at https://www.kaggle.com/datasets/ashydv/advertising-dataset?select=advertising.csv
- The dataset contains the salaries of 50 graduating MBA students of a business school in 2016 and their corresponding percentage marks in Grade 10.
  
  ![image.png](attachment:image.png)

In this notebook, we'll build a linear regression model to predict Sales using an appropriate predictor variable.

## Building simple linear regression model

### Importing important libraries

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')

np.set_printoptions(precision=4, linewidth=100)

### Data file path

In [4]:
# Provide the relative path to the data file
file_path = "../ml-data/MBA-Salary.csv"

### Importing the data file

In [5]:
# importing the data file
mba_salary_df = pd.read_csv(file_path)
mba_salary_df.head(10)

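| | S. No. | Percentage in Grade 10 | Salary |
|---|---|---|---|
| 0 | 1 | 62.0 | 270000 |
| 1 | 2 | 76.33 | 200000 |
| 2 | 3 | 72.0 | 240000 |
| 3 | 4 | 60.0 | 250000 |
| 4 | 5 | 61.0 | 180000 |
| 5 | 6 | 55.0 | 300000 |
| 6 | 7 | 70.0 | 260000 |
| 7 | 8 | 68.0 | 235000 |
| 8 | 9 | 82.8 | 425000 |
| 9 | 10 | 59.0 | 240000 |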


### Checking information about the dataset

In [8]:
mba_salary_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   S. No.                  50 non-null     int64  
 1   Percentage in Grade 10  50 non-null     float64
 2   Salary                  50 non-null     int64  
dtypes: float64(1), int64(2)
memory usage: 1.3 KB


In [19]:
mba_salary_df.dtypes

S. No.                      int64
Percentage in Grade 10    float64
Salary                      int64
dtype: object

In [22]:
mba_salary_df.shape

(50, 3)

### Importing statsmodels library for linear regression analysis

- The statsmodels library is used in Python for building statistical models.
- The OLS (Ordinary Least Squares) API available in `statsmodels.api` is used to estimate the parameters of the simple linear regression model.
- The OLS model takes two parameters, Y and X.

In [11]:
# If you encounter `ModuleNotFoundError` for statsmodels, install it using
%pip install statsmodels



### Creating the two parameters
- In the present case, 'Percentage in Grade 10' will be X and Salary will be Y.
- The OLS API available in `statsmodels.api` estimates only the coefficient of the X parameter in $Y=\beta_0+\beta_1 X+\epsilon$.
- To estimate the intercept $\beta_0$, a constant column of 1s is added to X using `sm.add_constant()`.
- As the value of this column remains the same across all samples, the parameter estimated for it will be the intercept term.
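
The role of the constant column can be illustrated with a small sketch that solves the least-squares normal equations by hand, using only the ten rows displayed by `head(10)` earlier. The variable names (`design`, `coef_const`, `coef_x`) are chosen here for illustration, and because only ten rows are used, the numbers will differ from the model fitted on the training set.

```python
# First ten rows shown by mba_salary_df.head(10)
percentage = [62.0, 76.33, 72.0, 60.0, 61.0, 55.0, 70.0, 68.0, 82.8, 59.0]
salary = [270000, 200000, 240000, 250000, 180000,
          300000, 260000, 235000, 425000, 240000]

# Design matrix with a leading column of ones, mirroring what
# sm.add_constant() produces for a single feature
design = [(1.0, x) for x in percentage]

# Sums needed for the 2x2 normal equations (X'X) b = X'y
n = len(design)
sum_x = sum(row[1] for row in design)
sum_xx = sum(row[1] ** 2 for row in design)
sum_y = sum(salary)
sum_xy = sum(row[1] * y for row, y in zip(design, salary))

# Closed-form solution of the 2x2 system
det = n * sum_xx - sum_x ** 2
coef_const = (sum_xx * sum_y - sum_x * sum_xy) / det  # coefficient of the ones column
coef_x = (n * sum_xy - sum_x * sum_y) / det           # coefficient of the feature

# The coefficient of the ones column is the intercept: y_bar - slope * x_bar
intercept_check = sum_y / n - coef_x * (sum_x / n)
print(coef_const, coef_x, intercept_check)
```

The coefficient attached to the column of ones coincides with the usual intercept formula, which is why the constant column is needed when the fitting routine estimates one coefficient per column.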

In [13]:
# Creating feature Set(X) and Outcome Variable (Y)
import statsmodels.api as sm

X = sm.add_constant(mba_salary_df['Percentage in Grade 10'])
X.head(5)

| | const | Percentage in Grade 10 |
|---|---|---|
| 0 | 1.0 | 62.0 |
| 1 | 1.0 | 76.33 |
| 2 | 1.0 | 72.0 |
| 3 | 1.0 | 60.0 |
| 4 | 1.0 | 61.0 |


In [15]:
Y = mba_salary_df['Salary']

### Splitting the Dataset into Training and Validation Sets

- The `train_test_split()` function from the `sklearn.model_selection` module splits the dataset randomly into
  - training and
  - validation datasets.
- The parameter `train_size` takes a fraction between `0` and `1` specifying the training set size.
- The remaining samples in the original set form the
  - test or
  - validation set.
- The records for the training and test sets are randomly sampled.
- The function takes a seed value in the parameter `random_state`, which fixes which samples go to the training set and which go to the test set.
- `train_test_split()` returns four objects, in this order:
  1. `train_X` contains `X` features of the training set.
  2. `test_X` contains `X` features of the test set.
  3. `train_y` contains the values of the response variable for the training set.
  4. `test_y` contains the values of the response variable for the test set.

#### Importing Sklearn library

If it is not installed, use

`pip install scikit-learn`

In [16]:
from sklearn.model_selection import train_test_split

In [17]:
train_test_split

<function sklearn.model_selection._split.train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)>

In [18]:
train_X, test_X, train_y, test_y = train_test_split(X, Y, train_size=0.8, random_state=100)

- `train_size` = 0.8 implies 80% of the data is used for training the model and the remaining 20% is used for validating the model.
- `random_state` = 100: This parameter sets the seed value for the random number generator. It ensures that the data splitting process is reproducible. 
- Using the same `random_state` value will produce the same train-test split each time the code is executed, which is useful for consistent results and debugging.
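
The effect of the seed can be illustrated with a small standard-library sketch. The helper `simple_train_test_split` below is hypothetical: it is not how scikit-learn implements `train_test_split()` and it will not select the same rows, but it shows why a fixed seed yields the same split on every run.

```python
import random


def simple_train_test_split(rows, train_size, seed):
    """Hypothetical helper: shuffle row positions with a seeded RNG, then cut."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # same seed -> same shuffle
    n_train = int(len(rows) * train_size)      # 80% of 50 rows -> 40 rows
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


rows = list(range(50))  # stand-in for the 50 records in the dataset

train_a, test_a = simple_train_test_split(rows, train_size=0.8, seed=100)
train_b, test_b = simple_train_test_split(rows, train_size=0.8, seed=100)

print(len(train_a), len(test_a))  # sizes of the two subsets
print(train_a == train_b)         # identical split for identical seed
```

Every record lands in exactly one of the two subsets, and repeating the call with the same seed reproduces the split.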

In [26]:
train_X.head()

| | const | Percentage in Grade 10 |
|---|---|---|
| 0 | 1.0 | 62.0 |
| 11 | 1.0 | 60.0 |
| 18 | 1.0 | 70.0 |
| 45 | 1.0 | 57.58 |
| 38 | 1.0 | 54.0 |


In [24]:
test_X

| | const | Percentage in Grade 10 |
|---|---|---|
| 6 | 1.0 | 70.0 |
| 36 | 1.0 | 68.0 |
| 37 | 1.0 | 52.0 |
| 28 | 1.0 | 58.0 |
| 43 | 1.0 | 74.5 |
| 49 | 1.0 | 60.8 |
| 5 | 1.0 | 55.0 |
| 33 | 1.0 | 78.0 |
| 20 | 1.0 | 63.0 |
| 42 | 1.0 | 74.4 |


In [27]:
test_y

6     260000
36    177600
37    236000
28    360000
43    250000
49    300000
5     300000
33    330000
20    120000
42    300000
Name: Salary, dtype: int64

### Fitting the Model

We will fit the model using the `OLS()` method, passing `train_y` and `train_X` as parameters.

In [28]:
mba_salary_lm = sm.OLS(train_y, train_X).fit()

The `fit()` method on `OLS()` estimates the parameters and returns model information to the variable `mba_salary_lm`, which contains the model parameters, accuracy measures, and residual values among other details.

### Printing Estimated Parameters and Interpreting Them

In [29]:
print(mba_salary_lm.params)

const                     30587.285652
Percentage in Grade 10     3560.587383
dtype: float64


The estimated model can be written as

`MBA Salary = 30587.286 + 3560.587 * (Percentage in Grade 10)`

The equation can be interpreted as follows:
- For every one percentage point increase in Grade 10 marks, the predicted salary of an MBA student increases by 3560.587 on average.
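
As a worked example, the estimated equation can be evaluated directly. The sketch below hard-codes the two parameters printed above; the function name `predict_salary` and the input values 70 and 71 are chosen here purely for illustration.

```python
# Parameters copied from the printed output of mba_salary_lm.params
intercept = 30587.285652
slope = 3560.587383


def predict_salary(percentage_grade_10):
    """Predicted salary from the fitted line (illustrative helper)."""
    return intercept + slope * percentage_grade_10


predicted_70 = predict_salary(70.0)
predicted_71 = predict_salary(71.0)

print(round(predicted_70, 2))                 # prediction at 70 percent
print(round(predicted_71 - predicted_70, 3))  # change for one extra percentage point
```

The difference between the two predictions equals the slope, which is the interpretation given above.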

## Model Diagnostics

It is important to validate the regression model and check its goodness of fit before it is used for practical applications. The following measures are used to validate simple linear regression models:

1. Co-efficient of determination (R-squared).
2. Hypothesis test for the regression coefficient.
3. Analysis of variance for overall model validity (important for multiple linear regression).
4. Residual analysis to validate the regression model assumptions.
5. Outlier analysis, since the presence of outliers can significantly impact the regression parameters.


### 1. Co-efficient of Determination (R-Squared or $R^2$)

- The primary objective of regression is to explain the variation in `Y` using the knowledge of `X`. 
- The co-efficient of determination (R-squared or $R^2$) measures the percentage of variation in `Y` explained by the model $(\beta_0+\beta_1 x)$. 
- The simple linear regression model decomposes each observation into two parts:
    1. The part of the outcome variable explained by the model.
    2. The unexplained part (the error term), as shown in the following equation.

        $\underbrace{Y_i}_{\text{Variation in Y}} = \underbrace{\beta_0 + \beta_1 X_i}_{\text{Variation in Y explained by the model}} + \underbrace{\epsilon_i}_{\text{Variation in Y not explained by the model}}$
    
        It can be proven mathematically that

        $\underbrace{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}_{SST} = \underbrace{\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2}_{SSR} + \underbrace{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}_{SSE}$

        where $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ is the predicted value of $Y_i$.

        The least-squares estimates are

        $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i-\bar{X})(Y_i-\bar{Y})}{\sum_{i=1}^{n} (X_i-\bar{X})^2}$

        $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$

        - The hat symbol (as in $\hat{\beta}_0$ or $\hat{\beta}_1$) denotes an estimated value.
        - SST is the sum of squares of total variation $\Rightarrow \text{SST}=\sum_{i=1}^{n} (Y_i - \bar{Y})^2$,
        - SSR is the sum of squares of explained variation due to the regression model $\Rightarrow \text{SSR}=\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$, and
        - SSE is the sum of squares of unexplained variation (error) $\Rightarrow \text{SSE}=\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$.

        $\boxed{R^2 = 1 - \frac{SSE}{SST}}$ 
        
        - For simple linear regression, R-squared ($R^2$) is the square of the Pearson correlation co-efficient $r$ between $X$ and $Y$ ($R^2 = r^2$).
        - A higher R-squared indicates a better fit; however, one should be careful about spurious relationships.
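
The decomposition can be checked numerically. The sketch below is a minimal illustration that uses only the ten rows displayed by `head(10)` earlier, so its R-squared will not match the value reported for the model fitted on the training set; the variable names are chosen here for illustration.

```python
import math

# First ten rows shown by mba_salary_df.head(10)
percentage = [62.0, 76.33, 72.0, 60.0, 61.0, 55.0, 70.0, 68.0, 82.8, 59.0]
salary = [270000, 200000, 240000, 250000, 180000,
          300000, 260000, 235000, 425000, 240000]

n = len(percentage)
x_bar = sum(percentage) / n
y_bar = sum(salary) / n

# Least-squares estimates of the slope and intercept
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(percentage, salary))
s_xx = sum((x - x_bar) ** 2 for x in percentage)
beta_1 = s_xy / s_xx
beta_0 = y_bar - beta_1 * x_bar

fitted = [beta_0 + beta_1 * x for x in percentage]

# Sums of squares: total, explained by the regression, and error
sst = sum((y - y_bar) ** 2 for y in salary)
ssr = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(salary, fitted))

r_squared = 1 - sse / sst
pearson_r = s_xy / math.sqrt(s_xx * sst)

print(math.isclose(sst, ssr + sse, rel_tol=1e-9))            # SST = SSR + SSE
print(math.isclose(r_squared, pearson_r ** 2, rel_tol=1e-9))  # R^2 = r^2
```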

<div style="background-color: #f9f9f9; border: 1px solid #ddd; padding: 10px;">

**Assumptions of the Linear Regression Model**
    
1. The errors $\epsilon_i$ are assumed to follow a normal distribution with expected value $E(\epsilon_i)=0$.
2. The variance of error, $\text{VAR}(\epsilon_i)$, is constant for various values of independent variable $X$. This is known as homoscedasticity. When the variance is not constant, it is called heteroscedasticity.
3. The error and independent variable are uncorrelated.
4. The functional relationship between the outcome variable and feature is correctly defined.

**Properties of Simple Linear Regression**

1. The mean value of $Y_i$ for a given $X_i$ is $E(Y_i | X_i) = \beta_0 + \beta_1 X_i$.
2. $Y_i$ follows a normal distribution with mean $\beta_0 + \beta_1 X_i$ and variance $\text{Var}(\epsilon_i)$.
</div>

### 2. Hypothesis Test for the Regression Co-efficient

- The regression co-efficient ($\beta_1$) captures the existence of a linear relationship between the outcome variable and the feature. 
- If $\beta_1 = 0$, we can conclude that there is no statistically significant linear relationship between the two variables. 
- It can be proved that the estimate $\hat{\beta}_1$, standardized by its estimated standard error, follows a t-distribution (`Kutner et al., 2013`; `U Dinesh Kumar, 2017`).
- The null and alternative hypotheses are
    
    - $H_0:\beta_1 = 0$
    - $H_1: \beta_1 \neq 0$.

- **t-test:** the test statistic is $t = \frac{\hat{\beta}_1}{S_e(\hat{\beta}_1 )}$, which is compared with the critical value $t_{\alpha/2, n-2}$.
- **Standard error of the estimate of the regression co-efficient**:
  
  $S_e(\hat{\beta}_1) = \frac{S_e}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$

  where $S_e$ is the standard error of the residuals and is given by

  $S_e = \sqrt{\frac{\sum_{i=1}^{n} (Y_i-\hat{Y}_i)^2}{n-2}}$.

  The hypothesis test is a two-tailed test. Under $H_0$, the test statistic follows a t-distribution with $n - 2$ degrees of freedom (two degrees of freedom are lost due to the estimation of the two regression parameters $\beta_0$ and $\beta_1$).

### 3. Analysis of Variance (ANOVA) in Regression Analysis

We can check the overall validity of the regression model using ANOVA in the case of multiple linear regression model with k features. The null and alternative hypotheses are given by

- $H_0: \beta_1 = \beta_2 = ... = \beta_k = 0$.
- $H_A:$ Not all regression coefficients are zero.

The corresponding F-statistic is given by

$F=\frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n-k-1)}$

where

- $\text{MSR}= \text{SSR}/k$ is the mean squared regression, and
- $\text{MSE} = \text{SSE}/(n - k - 1)$ is the mean squared error.
- The F-test is used for checking whether the overall regression model is statistically significant.
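
For a model with an intercept, dividing the numerator and denominator of the F-statistic by SST gives the equivalent form $F = \frac{R^2/k}{(1-R^2)/(n-k-1)}$. The sketch below applies it to the values reported in the model summary in the next section (R-squared 0.211, 40 observations, one feature); the function name `f_statistic` is chosen here for illustration, and because R-squared is rounded to three decimals the result is approximate.

```python
def f_statistic(r_squared, n_obs, k_features):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for a model with an intercept."""
    explained = r_squared / k_features
    unexplained = (1.0 - r_squared) / (n_obs - k_features - 1)
    return explained / unexplained


# Values read from the model summary (R-squared is rounded there)
f_value = f_statistic(r_squared=0.211, n_obs=40, k_features=1)
print(round(f_value, 2))
```

With a single feature, the F-statistic also equals the square of the t-statistic for the slope, so both tests lead to the same conclusion.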

## Regression Model Summary Using Python

The method `summary2()` returns the model summary, which contains the information required for diagnosing a regression model.

In [30]:
mba_salary_lm.summary2()

| | | | |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.19 |
| Dependent Variable: | Salary | AIC: | 1008.868 |
| Date: | 2023-06-20 10:12 | BIC: | 1012.2458 |
| No. Observations: | 40 | Log-Likelihood: | -502.43 |
| Df Model: | 1 | F-statistic: | 10.16 |
| Df Residuals: | 38 | Prob (F-statistic): | 0.00287 |
| R-squared: | 0.211 | Scale: | 5012100000.0 |

| | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 30587.2857 | 71869.4497 | 0.4256 | 0.6728 | -114904.8089 | 176079.3802 |
| Percentage in Grade 10 | 3560.5874 | 1116.9258 | 3.1878 | 0.0029 | 1299.4892 | 5821.6855 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.048 | Durbin-Watson: | 2.611 |
| Prob(Omnibus): | 0.359 | Jarque-Bera (JB): | 1.724 |
| Skew: | 0.369 | Prob(JB): | 0.422 |
| Kurtosis: | 2.3 | Condition No.: | 413.0 |


The output can be separated in three tables as follows:

In [33]:
mba_salary_lm.summary2().tables[0]

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Model: | OLS | Adj. R-squared: | 0.19 |
| 1 | Dependent Variable: | Salary | AIC: | 1008.868 |
| 2 | Date: | 2023-06-20 10:15 | BIC: | 1012.2458 |
| 3 | No. Observations: | 40 | Log-Likelihood: | -502.43 |
| 4 | Df Model: | 1 | F-statistic: | 10.16 |
| 5 | Df Residuals: | 38 | Prob (F-statistic): | 0.00287 |
| 6 | R-squared: | 0.211 | Scale: | 5012100000.0 |


In [34]:
mba_salary_lm.summary2().tables[1]

| | Coef. | Std.Err. | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 30587.285652 | 71869.449679 | 0.425595 | 0.672804 | -114904.808889 | 176079.380192 |
| Percentage in Grade 10 | 3560.587383 | 1116.925833 | 3.187846 | 0.002867 | 1299.489245 | 5821.685521 |
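
The `t` column and the confidence bounds for the slope can be cross-checked by hand from the coefficient and its standard error. The sketch below copies the four numbers from the 'Percentage in Grade 10' row; the variable names are chosen here for illustration, and the critical value is backed out from the interval rather than looked up in a t-table.

```python
# Values copied from the 'Percentage in Grade 10' row of the coefficient table
coef = 3560.587383
std_err = 1116.925833
ci_lower = 1299.489245
ci_upper = 5821.685521

# t statistic: estimated coefficient divided by its standard error
t_stat = coef / std_err

# The 95% interval is coef +/- t_crit * std_err, so the half-width
# divided by the standard error recovers the critical value that was used
t_crit_from_upper = (ci_upper - coef) / std_err
t_crit_from_lower = (coef - ci_lower) / std_err

print(round(t_stat, 4))
print(round(t_crit_from_upper, 4), round(t_crit_from_lower, 4))
```

Both bounds imply the same critical value, close to 2.02, which is consistent with a two-tailed test at the 5% level with 38 residual degrees of freedom.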


In [35]:
mba_salary_lm.summary2().tables[2]

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | Omnibus: | 2.048 | Durbin-Watson: | 2.611 |
| 1 | Prob(Omnibus): | 0.359 | Jarque-Bera (JB): | 1.724 |
| 2 | Skew: | 0.369 | Prob(JB): | 0.422 |
| 3 | Kurtosis: | 2.3 | Condition No.: | 413.0 |


- The model R-squared value is 0.211, that is, the model explains 21.1% of the variation in salary.