# Simple Linear Regression with `statsmodels`

The Python libraries `statsmodels` and `scikit-learn` make implementing linear regression much easier than coding it from scratch, as we did in the last lesson.

We will start with the `statsmodels` library. First, let's import the `advertising` data.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

# Import and display first five rows of advertising dataset
advert = pd.read_csv('advertising.csv')
advert.head()

This dataset contains data about the advertising budget spent on TV, Radio, and Newspapers for a particular product and the resulting sales. We expect a positive correlation between such advertising costs and sales. 

Let’s start with TV advertising costs to create a simple linear regression model. First, let’s plot the two variables to get a better sense of their relationship:

In [None]:
# Create scatter plot
plt.figure(figsize=(12, 6))
plt.plot(advert['TV'], advert['Sales'], 'o')
plt.xlabel('TV Advertising Costs')
plt.ylabel('Sales')
plt.title('TV vs Sales')

plt.show()

As TV advertising costs increase, sales tend to increase as well: the two are positively correlated!
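
To put a number on "positively correlated", one option is the Pearson correlation coefficient. The sketch below uses NumPy's `np.corrcoef` on small made-up arrays (`tv_toy` and `sales_toy` are illustrative names and values, not the advertising data); a value close to 1 indicates a strong positive linear relationship.

```python
import numpy as np

# Made-up illustrative values, NOT taken from advertising.csv
tv_toy = np.array([10.0, 50.0, 100.0, 150.0, 200.0])
sales_toy = np.array([7.0, 10.0, 11.5, 14.0, 17.0])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(tv_toy, sales_toy)[0, 1]
print(f'Pearson r = {r:.3f}')
```

On the real data, `advert['TV'].corr(advert['Sales'])` should compute the same quantity.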

Now, with the `statsmodels` library, let’s create a line of best fit using the ordinary least squares (OLS) method, which minimises the sum of squared residuals.

In [None]:
import statsmodels.formula.api as smf

# Initialise and fit linear regression model using `statsmodels`
model1 = smf.ols('Sales ~ TV', data=advert)
model1 = model1.fit()

In the code above, we used the `ols` function from `statsmodels` to initialise our simple linear regression model. It takes a formula of the form `y ~ X`, where `X` is the predictor variable (TV advertising costs) and `y` is the output variable (Sales); an intercept term is included automatically. We then fit the model by calling the OLS object’s `fit()` method, which returns a results object that we store back in `model1`. If you’d like to learn more about `ols`, you can read the documentation [here](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.OLS.html).
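
As a sanity check on what `ols` and `fit()` compute, the sketch below fits a model on synthetic stand-in data (the `toy` DataFrame and its columns `x` and `y` are assumptions for illustration, not the advertising data) and compares the fitted parameters with the closed-form least squares estimates for a single predictor.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the advertising dataset): y is roughly 2 + 3x
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': np.linspace(0, 10, 50)})
toy['y'] = 2.0 + 3.0 * toy['x'] + rng.normal(scale=1.0, size=50)

# Same two steps as above: describe the model with a formula, then fit it
toy_model = smf.ols('y ~ x', data=toy).fit()

# Closed-form least squares estimates for one predictor
x_dev = toy['x'] - toy['x'].mean()
y_dev = toy['y'] - toy['y'].mean()
slope = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
intercept = toy['y'].mean() - slope * toy['x'].mean()

# The two columns printed on each line should agree
print(toy_model.params['Intercept'], intercept)
print(toy_model.params['x'], slope)
```

The fitted parameters are indexed by name: `Intercept` for the constant term and the column name for the predictor.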

Calling `.params` will show us the model’s parameters:

In [None]:
model1.params

In the notation that we have been using, $\alpha$ is the intercept and $\beta$ is the slope. Rounded to three decimal places:

$\alpha \approx 7.032, \quad \beta \approx 0.047$

Thus, the equation for the model will be:

$\text{Sales} = 7.032 + 0.047 \times \text{TV}$
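
To make the equation concrete, here is a small worked example that plugs a TV cost into it. The function name `predict_sales` and the input value of 100 are hypothetical choices for illustration; the coefficients are the rounded values quoted above.

```python
# Rounded coefficients quoted above
alpha = 7.032   # intercept
beta = 0.047    # slope for TV

def predict_sales(tv_cost):
    """Apply Sales = alpha + beta * TV to a single TV cost."""
    return alpha + beta * tv_cost

# Hypothetical TV cost of 100, in the dataset's units
print(round(predict_sales(100), 3))   # 7.032 + 0.047 * 100 = 11.732
```

With a TV cost of 0 the prediction is just the intercept, and each additional unit of TV cost adds 0.047 to predicted sales.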

Let's also check an indicator of the model's efficacy, *R<sup>2</sup>*. Luckily, `statsmodels` exposes it as a ready-made attribute of the fitted model, so we don’t need to code all the math ourselves:

In [None]:
model1.rsquared
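
For reference, the math that `rsquared` saves us from writing can be sketched in a few lines of NumPy. The helper name `r_squared` and the toy values are assumptions for illustration; the formula is the usual one minus the ratio of residual to total sum of squares.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration
y_toy = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_toy, y_toy))        # perfect predictions give 1.0
print(r_squared(y_toy, [2.5] * 4))    # always predicting the mean gives 0.0
```

Once predictions are available for the advertising data, calling `r_squared` on the actual and predicted sales should reproduce `model1.rsquared`.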

We can also print the full model summary:

In [None]:
print(model1.summary())

There is a lot here. Of these results, we have discussed:
- R-squared
- F-statistic
- Prob (F-statistic) - this is the p-value of the F-statistic
- Intercept coef - this is $\alpha$
- TV coef - this is $\beta$ for the predictor `TV`
- P>|t| - this is the p-value for each coefficient

You can learn more about the other linear regression results [here](https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html).
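
The values listed above can also be read directly from the fitted results object rather than from the printed table. The sketch below does this on synthetic stand-in data (the `toy` DataFrame and the names `res` and `summary_values` are assumptions for illustration), using the `rsquared`, `fvalue`, `f_pvalue`, `params` and `pvalues` attributes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (NOT the advertising dataset)
rng = np.random.default_rng(1)
toy = pd.DataFrame({'x': np.linspace(0, 10, 40)})
toy['y'] = 1.0 + 0.5 * toy['x'] + rng.normal(scale=0.5, size=40)
res = smf.ols('y ~ x', data=toy).fit()

# Each summary entry discussed above, pulled out as a plain number
summary_values = {
    'R-squared': res.rsquared,
    'F-statistic': res.fvalue,
    'Prob (F-statistic)': res.f_pvalue,
    'Intercept coef': res.params['Intercept'],
    'x coef': res.params['x'],
    'P>|t| for x': res.pvalues['x'],
}
for name, value in summary_values.items():
    print(f'{name}: {value:.4g}')
```

With a single predictor, the F-statistic equals the square of the t-statistic for the slope, so the two p-values describe the same test.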

Now that we’ve fit a simple regression model, we can try to predict the values of sales based on the equation we just derived!

In [None]:
sales_pred = model1.predict(advert['TV'])

The `.predict` method predicts a sales value for each row by applying the model equation to that row's TV cost. Apart from the rounding of the coefficients, this is equivalent to manually typing out our equation: `sales_pred = 7.032 + 0.047*(advert['TV'])`.
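
The same method can score values the model has not seen. The sketch below, on made-up data (the `toy` and `new_x` DataFrames are assumptions for illustration), passes a new DataFrame whose column name matches the predictor in the formula, and checks the result against the equation built from the unrounded parameters.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data for illustration, NOT the advertising dataset
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [2.1, 3.9, 6.2, 7.8, 10.1]})
toy_model = smf.ols('y ~ x', data=toy).fit()

# New observations: the column name must match the predictor in the formula
new_x = pd.DataFrame({'x': [6.0, 7.5]})
pred = toy_model.predict(new_x)

# Manual equivalent using the fitted (unrounded) parameters
manual = toy_model.params['Intercept'] + toy_model.params['x'] * new_x['x']
print(np.allclose(pred, manual))   # True
```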

We can visualise our regression model by plotting `sales_pred` against the TV advertising costs, which draws the line of best fit over the actual data:

In [None]:
# Plot regression against actual data
plt.figure(figsize=(12, 6))
plt.plot(advert['TV'], advert['Sales'], 'o')           # scatter plot showing actual data
plt.plot(advert['TV'], sales_pred, 'r', linewidth=2)   # regression line
plt.xlabel('TV Advertising Costs')
plt.ylabel('Sales')
plt.title('TV vs Sales')

plt.show()

Now let's calculate the residual standard error (RSE) to measure how accurate our model is in predicting sales:

In [None]:
# Create new column to store predictions
advert['sales_pred'] = sales_pred

# Calculate RSE
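advert['SSD'] = (advert['Sales'] - advert['sales_pred'])**2   # squared residuals
SSD = advert['SSD'].sum()
n = len(advert)                   # number of observations (200)
RSE = np.sqrt(SSD / (n - 2))      # n - 2 degrees of freedom: two parameters estimated
salesmean = np.mean(advert['Sales'])
error = RSE / salesmean           # RSE as a fraction of mean sales

print(f'RSE = {RSE}\nMean sale = {salesmean}\nError = {error:.2%}')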

The RSE is about 23.24% of the mean sales value, so by this rough measure (one minus the relative error) the model has an average accuracy of 76.76%. This can definitely be improved upon!
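
The RSE calculation can be wrapped in a small reusable helper so that the degrees of freedom follow from the data instead of being typed in by hand. This is a minimal sketch: the function name `residual_standard_error`, its `n_params` argument, and the toy values are assumptions for illustration.

```python
import numpy as np

def residual_standard_error(y_true, y_pred, n_params=2):
    """RSE = sqrt(SSD / (n - n_params)); n_params=2 covers intercept and slope."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ssd = np.sum((y_true - y_pred) ** 2)   # sum of squared differences
    return np.sqrt(ssd / (len(y_true) - n_params))

# Made-up values for illustration, NOT the advertising data
y_toy = [1.0, 2.0, 3.0, 4.0]
pred_toy = [1.0, 2.0, 3.0, 6.0]
rse_toy = residual_standard_error(y_toy, pred_toy)

# RSE, and RSE as a fraction of the mean of the actual values
print(rse_toy, rse_toy / np.mean(y_toy))
```

On the advertising data, `residual_standard_error(advert['Sales'], advert['sales_pred'])` should match the `RSE` value computed above. The fitted results object also exposes the residual mean square as `mse_resid`, so `np.sqrt(model1.mse_resid)` should agree as well.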

In the next step, we will add more features as predictors and see whether it improves our model. Go back to the notebook directory in Jupyter by pressing `File` > `Open…` in the toolbar at the top, then open the notebook called `2.2 Multiple regression with statsmodels.ipynb`.