In [2]:
import pandas as pd
import numpy as np

In [3]:
adv_data = pd.read_csv("Data/Advertising.csv")

In [4]:
n = adv_data.shape[0]  # number of observations

## I. Creating a simple linear regression ##  

### 1. TV linear regression  

We can approximate the relationship between X (TV) and Y (sales) with a linear function:

<b>Y = β0 + β1.X + error</b><br>
where β0 is the intercept and β1 is the slope.<br>
The error in our estimates of β0 and β1 is reducible (a better fit or more data shrinks it), whereas the "error" term itself is irreducible.<br>


<b>Estimating the parameters</b>

In simple linear regression, we can estimate the parameters in closed form with a little calculus.<br>
<b>What is RSS?</b> <br>
The residual sum of squares is RSS = $\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$,<br>
where n is the sample size.<br>
Taking the partial derivatives of RSS with respect to $\hat{\beta}_0$ and $\hat{\beta}_1$ and setting them to zero (the condition for a minimum) gives the expressions for both estimates. <br>

![image.png](attachment:30de7a91-5c31-436e-9dbe-83273a6454b2.png)
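
For reference, the minimization described above can be written out as follows. This is a sketch of the standard derivation, where $\bar{x}$ and $\bar{y}$ denote the sample means of X and Y (notation introduced here, not taken from the figure):

$$
\frac{\partial\,\mathrm{RSS}}{\partial \hat{\beta}_0} = -2\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0
\quad\Rightarrow\quad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$

$$
\frac{\partial\,\mathrm{RSS}}{\partial \hat{\beta}_1} = -2\sum_{i=1}^n x_i\,(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0
\quad\Rightarrow\quad
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}
$$

The second result follows by substituting $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ into the second equation and using $\sum_{i=1}^n (y_i - \bar{y}) = 0$ and $\sum_{i=1}^n (x_i - \bar{x}) = 0$.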

In [5]:
tv_data = np.array(adv_data['TV'])
tv_data_mean = tv_data.mean()
y_data = np.array(adv_data['sales'])
y_data_mean = y_data.mean()

# Least-squares estimates of the slope (beta_hat1) and the intercept (beta_hat0)
beta_hat1 = np.sum((tv_data - tv_data_mean)*(y_data - y_data_mean)) / np.sum((tv_data - tv_data_mean)**2)
beta_hat0 = y_data_mean - beta_hat1 * tv_data_mean

# Residual sum of squares of the fitted line
rss = np.sum((y_data - (beta_hat0 + beta_hat1 * tv_data))**2)
print(rss)
print(beta_hat1)
print(beta_hat0)

2102.5305831313512
0.047536640433019736
7.0325935491276965
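
As a sanity check, the closed-form estimates can be compared against a library fit. The sketch below is only illustrative: it uses synthetic data rather than the Advertising data set, and `fit_simple_ols`, `x_demo` and `y_demo` are hypothetical names introduced here. `np.polyfit` with `deg=1` fits the same least-squares line.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (intercept, slope) from the closed-form least-squares formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Synthetic data (an assumption for illustration, not the Advertising data)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 300, size=200)
y_demo = 7.0 + 0.05 * x_demo + rng.normal(0, 3, size=200)

b0, b1 = fit_simple_ols(x_demo, y_demo)

# np.polyfit returns coefficients from the highest degree to the lowest
slope_ref, intercept_ref = np.polyfit(x_demo, y_demo, deg=1)
print(np.isclose(b0, intercept_ref), np.isclose(b1, slope_ref))
```

The same comparison could be run on `tv_data` and `y_data` once the CSV is loaded.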


<b>Assessing the accuracy of the estimates</b>

We use a hypothesis test based on the t-statistic: <br>
H0 : there is no relationship between X and Y, i.e. β1 = 0<br>
Ha : there is some relationship between X and Y, i.e. β1 ≠ 0<br>




![image.png](attachment:6b0e2e72-118f-4234-8946-26ee2c0ce2b8.png)

![image.png](attachment:3b7f3083-15b4-4023-99d4-5095d368c03a.png) <br>
Here, σ² = Var(error). σ is unknown in practice, so we estimate it with the residual standard error (RSE).<br>
![image.png](attachment:57a97ef2-77eb-4d91-8af8-022c45315878.png)

In [13]:
# Residual standard error: estimate of sigma with n-2 degrees of freedom
rse = (rss/(n-2))**0.5
# Standard errors of the slope and the intercept
se_beta1 = np.sqrt(rse**2 / np.sum((tv_data - tv_data_mean)**2))
se_beta0 = np.sqrt((rse**2)*( (1/n) + tv_data_mean**2/(np.sum((tv_data - tv_data_mean)**2))))
# t-statistics for H0: coefficient = 0
t_beta1 = beta_hat1 / se_beta1
t_beta0 = beta_hat0 / se_beta0

# The P-values are entered by hand (read from a t-distribution table), not computed here
estimates_results = pd.DataFrame({'estimate': ['beta_0','beta_1'] , 'coefficients':[beta_hat0,beta_hat1] , 'T-statistic' : [t_beta0,t_beta1] , 'P-value' : ['<0.0001' ,'<0.0001']})
print(estimates_results)

  estimate  coefficients  T-statistic  P-value
0   beta_0      7.032594    15.360275  <0.0001
1   beta_1      0.047537    17.667626  <0.0001


With n-2 degrees of freedom, the t-distribution table shows that both p-values are approximately 0. <br>
At the 5% significance level, we therefore reject H0 in favour of Ha: there is a relationship between TV and sales.
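
Instead of reading the p-value from a table, it can be computed from the t-distribution. The following is a minimal sketch assuming SciPy is available; `two_sided_p_value` is a hypothetical helper, the t-statistic is the one printed for beta_1 above, and `df=198` is an assumption (it corresponds to n = 200, so replace it with `n - 2` in the notebook).

```python
from scipy import stats

def two_sided_p_value(t_stat, df):
    """Two-sided p-value of a t-statistic with df degrees of freedom."""
    # sf is the survival function (1 - cdf); doubling it covers both tails
    return 2 * stats.t.sf(abs(t_stat), df)

# t-statistic of beta_1 as printed above; df = 198 is an assumption (n = 200)
p_beta1 = two_sided_p_value(17.667626, df=198)
print(p_beta1 < 1e-4)
```

Using the survival function rather than `1 - cdf` avoids losing precision when the p-value is extremely small.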

<b>Assessing the accuracy of the model</b>

The test gives strong evidence of a relationship between sales and TV. Next, we assess how well the model fits the data using <b>the residual standard error (RSE)</b> and the <b>R^2 statistic</b>.<br>
<b>RSE</b> : an estimate of the standard deviation of the error term, expressed in the units of Y.<br>
<b>R^2 statistic</b> : the proportion of the variability of Y explained by our model.

![image.png](attachment:d7b55c6e-1948-4183-98f7-f5af841391a4.png) ![image.png](attachment:edae9b65-0231-443e-960e-73bc8207e687.png)

In [20]:
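# Total sum of squares: variability of Y around its mean
tss = np.sum((y_data - y_data_mean)**2)
# R is the square root of R^2 = (TSS - RSS) / TSS
R = np.sqrt((tss - rss) / tss)
print(tss)
print(R**2)
model_results = pd.DataFrame({'RSE': [rse] , 'R^2':[R**2]})
model_results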

5417.14875
0.6118750508500711


        RSE       R^2
0  3.258656  0.611875
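
In simple linear regression, R^2 equals the squared Pearson correlation between X and Y, which gives an independent check on the computation. The sketch below illustrates this on synthetic data (an assumption, not the Advertising data); `r_squared`, `x_demo` and `y_demo` are hypothetical names introduced here.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of the simple least-squares line of y on x, as (TSS - RSS) / TSS."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    rss = np.sum((y - (intercept + slope * x)) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return (tss - rss) / tss

# Synthetic data (an assumption for illustration, not the Advertising data)
rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 300, size=200)
y_demo = 7.0 + 0.05 * x_demo + rng.normal(0, 3, size=200)

r2 = r_squared(x_demo, y_demo)
corr = np.corrcoef(x_demo, y_demo)[0, 1]  # Pearson correlation of X and Y
print(np.isclose(r2, corr ** 2))
```

On the notebook's data, the equivalent check would compare `R**2` with `np.corrcoef(tv_data, y_data)[0, 1]**2`.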
