# Introduction to Linear Regression Using OLS - Practical


* Linear regression is one of the most widely used techniques in statistical modeling. 

* It is a method used to model the relationship between a dependent variable and one or more explanatory variables. 

* The goal of linear regression is to find the line of best fit that accurately represents the relationship between the variables. 

* This practical will introduce you to the concept of linear regression and guide you through the steps of a multiple linear regression analysis using simulated data. 

* By the end of the practical, you will have a clear understanding of how to interpret the results and use linear regression to make predictions.
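
For a single explanatory variable, the line of best fit has a closed-form solution: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below illustrates this on made-up numbers; the data and the helper name `fit_line` are assumptions for illustration and are not part of the practical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The fitted line always passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(x, y)
print(intercept, slope)
```

With several explanatory variables, as in the rest of this practical, the same idea applies but the solution is computed with matrix algebra.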

In [None]:
# Import libraries
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Generate data 

In [None]:
# Define the number of samples and explanatory variables
num_samples = 100
num_features = 7

# Generate random values for the explanatory variables
np.random.seed(0)
X = np.hstack((np.ones((num_samples, 1)), np.random.normal(size=(num_samples, num_features))))

# Define the true parameter values for the linear regression
true_coefs = np.array([0, 2, -3, 4, -2, -10, 5, 9])

# Generate the target values using the explanatory variables
y = X @ true_coefs + np.random.normal(size=num_samples)

# Combine the explanatory variables and response values into a single data frame
df = pd.DataFrame(np.hstack([X, y.reshape(-1, 1)]), columns=["x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "y"])

Have a look at the data with `.head()`

In [None]:
df.head()

Use `.describe()` to generate descriptive statistics

In [None]:
df.describe()

## 2. Visualise the data

Use seaborn's `pairplot` function to plot each explanatory variable (`x_vars`) against the response (`y_vars`). What do you see?

(If the plot seems small in your notebook, set the `height` argument to around 4 or 5, then double-click the plot to enlarge it once it renders.)

In [None]:
# Create the seaborn pairplot of each independent variable versus the dependent variable below.

# Your code here
sns.pairplot(data=df,
             x_vars = df.columns[:-1],
             y_vars = df.columns[-1], height=5);

print('Visually, we can already guess that the weight for x_7 will be positive and the weight for x_5 will be negative.')
print('We will need to test formally to confirm.')


## 3. Perform OLS regression

Use `statsmodels.formula.api` to perform linear regression. 

The `statsmodels.formula.api.ols()` function lets you specify the functional form of the model as follows: `formula="y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7"`

You can then run `.fit()` to get the optimal parameters.

In [None]:
# Import the package
from statsmodels.formula.api import ols

# Enter the code here
model = ols(formula='y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7', data=df)
results = model.fit()

# Get the overall results for the multivariate model
print(results.summary())
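
The formula interface adds the intercept automatically, which is why `x0` does not appear in the formula. As a cross-check, the same coefficient estimates can be reproduced with plain NumPy by solving the least-squares problem on the design matrix that already contains the column of ones. This is a minimal sketch that assumes the same seed and data-generating code as in section 1; it does not show how `statsmodels` works internally.

```python
import numpy as np

# Recreate the simulated data from section 1 (same seed, same order of draws)
np.random.seed(0)
num_samples, num_features = 100, 7
X = np.hstack((np.ones((num_samples, 1)),
               np.random.normal(size=(num_samples, num_features))))
true_coefs = np.array([0, 2, -3, 4, -2, -10, 5, 9])
y = X @ true_coefs + np.random.normal(size=num_samples)

# Least-squares solution: the beta that minimises the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

names = ["Intercept"] + [f"x{i}" for i in range(1, 8)]
for name, b in zip(names, beta):
    print(f"{name:9s} {b: .4f}")
```

The printed estimates should sit close to `true_coefs`, with small differences caused by the random noise added to `y`.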

## 4. Interpret the weights

Now that we have the optimal weights and bias, let's interpret the weights. What do the weights for $x_1$ and $x_5$ tell us about the model?

In [None]:
# Your code here ...

print('beta_1 = 1.9862: y increases by 1.9862 when x_1 increases by 1 unit, with the other variables held constant.')
print('beta_5 = -10.126: y decreases by 10.126 when x_5 increases by 1 unit, with the other variables held constant.')
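
The "other variables held constant" reading can be checked numerically: in a linear model, raising one variable by one unit changes the prediction by exactly that variable's coefficient, whatever the other values are. The sketch below uses the true coefficients from section 1 as stand-ins for the fitted ones, and the observation `x` is an arbitrary made-up row.

```python
import numpy as np

# True coefficients from the data-generating step (intercept first);
# used here as stand-ins for the fitted weights
coefs = np.array([0, 2, -3, 4, -2, -10, 5, 9], dtype=float)

# An arbitrary, made-up observation: leading 1 for the intercept, then x1..x7
x = np.array([1.0, 0.5, -1.2, 0.3, 2.0, 0.1, -0.7, 1.5])

x_plus = x.copy()
x_plus[1] += 1.0  # increase x1 by one unit, keep everything else fixed

# The change in the prediction equals the coefficient of x1
change = x_plus @ coefs - x @ coefs
print(change)
```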


## 5. Determine the significance of the weights and bias

Are all of the parameters significant?

We can use a two-sided t-test on each parameter to determine this.

* We either reject or fail to reject the null hypothesis at a 5% level of significance in this example.
    - Null hypothesis: $X_j$ has no effect on the response $Y$, i.e. H₀: $\theta_j = 0$

    - Alternative hypothesis: $X_j$ has an effect on the response $Y$, i.e. H₁: $\theta_j \neq 0$


In [None]:
# Your code here ...

print('All of the weights except for the bias term are significantly different from 0 at a 5% level of significance.') 
print('The two-sided p-value (P>|t|) reported for each weight is less than 0.05 and therefore we reject the null hypothesis that the parameter is zero.')
print('The bias term has a p-value above the significance level and therefore we fail to reject the null hypothesis. The bias term is not significantly different from 0.')
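
Each t-statistic in the summary is the estimated coefficient divided by its standard error. The sketch below recomputes them with NumPy for the simulated data, assuming the same seed as in section 1. Instead of exact p-values it compares each |t| with 1.99, which is only an approximate two-sided 5% critical value for 92 residual degrees of freedom.

```python
import numpy as np

# Recreate the simulated data from section 1
np.random.seed(0)
n, k = 100, 7
X = np.hstack((np.ones((n, 1)), np.random.normal(size=(n, k))))
true_coefs = np.array([0, 2, -3, 4, -2, -10, 5, 9])
y = X @ true_coefs + np.random.normal(size=n)

# OLS estimates from the normal equations
xtx_inv = np.linalg.inv(X.T @ X)
beta = xtx_inv @ X.T @ y

# Residual variance and coefficient standard errors
resid = y - X @ beta
dof = n - X.shape[1]  # residual degrees of freedom: 100 - 8 = 92
sigma2 = resid @ resid / dof
se = np.sqrt(sigma2 * np.diag(xtx_inv))

t_stats = beta / se
t_crit = 1.99  # approximate critical value, see the note above

names = ["Intercept"] + [f"x{i}" for i in range(1, 8)]
for name, t in zip(names, t_stats):
    print(f"{name:9s} t = {t:8.2f}  |t| > {t_crit}: {abs(t) > t_crit}")
```

In the notebook itself, the exact two-sided p-values are available directly as `results.pvalues`.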


## 6. Determine if the overall model is significant

We can use the F-statistic and its corresponding p-value to determine if the overall model is significant.


In [None]:
# Your code here ...

print('The F-statistic = 3725 and the p-value = 6.18e-110, which is very close to 0, so we reject the null hypothesis at a 5% level of significance because the p-value <= 0.05.')
print('This means that the overall model is significant even though the bias term is not significantly different from 0')
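
The F-test's null hypothesis is that all slope coefficients are zero at the same time. The statistic compares the variation explained per explanatory variable with the unexplained variation per residual degree of freedom: $F = \frac{(TSS - RSS)/k}{RSS/(n - k - 1)}$. The sketch below computes it with NumPy, assuming the same seed and data as in section 1.

```python
import numpy as np

# Recreate the simulated data from section 1
np.random.seed(0)
n, k = 100, 7
X = np.hstack((np.ones((n, 1)), np.random.normal(size=(n, k))))
true_coefs = np.array([0, 2, -3, 4, -2, -10, 5, 9])
y = X @ true_coefs + np.random.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

tss = np.sum((y - y.mean()) ** 2)  # total sum of squares
rss = resid @ resid                # residual sum of squares

# Explained variation per regressor over unexplained variation per residual dof
f_stat = ((tss - rss) / k) / (rss / (n - k - 1))
print(f"F-statistic: {f_stat:.1f}")
```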


## 7. Interpret the R squared score

The overall model is significant, so we can now assess how well the model explains the variation in the dependent variable.


In [None]:
print('The R squared score is 0.996. This means that 99.6% of the variation in the target variable is explained by the multiple linear regression model.')
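
R squared is one minus the ratio of the residual sum of squares to the total sum of squares, $R^2 = 1 - RSS/TSS$. The sketch below computes it by hand for the simulated data, assuming the same seed as in section 1.

```python
import numpy as np

# Recreate the simulated data from section 1
np.random.seed(0)
n, k = 100, 7
X = np.hstack((np.ones((n, 1)), np.random.normal(size=(n, k))))
true_coefs = np.array([0, 2, -3, 4, -2, -10, 5, 9])
y = X @ true_coefs + np.random.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

rss = np.sum((y - fitted) ** 2)    # variation left unexplained by the model
tss = np.sum((y - y.mean()) ** 2)  # total variation around the mean of y
r_squared = 1 - rss / tss
print(f"R squared: {r_squared:.3f}")
```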

## 8. Generate predictions

To obtain the predictions for each observation in the data, use the `.predict()` method of the fitted results object returned by `.fit()`.

In [None]:
# Your code here ...
pred = results.predict()


print(pred)
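
Called without arguments, `.predict()` returns the fitted values for the training data. To predict for new observations, pass a data frame with the same column names that appear in the formula. The sketch below refits the model on the simulated data and predicts for two made-up observations; the frame `new_data` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Recreate the simulated data (same seed and order of draws as section 1)
np.random.seed(0)
features = [f"x{i}" for i in range(1, 8)]
X = np.random.normal(size=(100, 7))
y = X @ np.array([2, -3, 4, -2, -10, 5, 9]) + np.random.normal(size=100)
df = pd.DataFrame(X, columns=features).assign(y=y)

results = ols("y ~ " + " + ".join(features), data=df).fit()

# Two made-up observations: all zeros, and all zeros except x1 = 1
new_data = pd.DataFrame(0.0, index=[0, 1], columns=features)
new_data.loc[1, "x1"] = 1.0

new_pred = results.predict(new_data)
print(new_pred)
```

The first prediction equals the estimated intercept, and the second differs from it by the estimated weight of `x1`.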