In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

## Loading dataset
The data set contains information about money spent on advertising and the resulting sales. The money was spent on TV, radio and newspaper ads.
The objective is to use linear regression to understand how advertising spending affects sales.

In [None]:
data = pd.read_csv("data/Advertising.csv")

## Simple linear regression
First, we want to see how well we can predict sales given only the money spent on TV ads. Below, we visualize sales as a function of money spent on TV ads.

In [None]:
plt.figure(figsize=(16, 8))
plt.scatter(
    data['TV'],
    data['sales'],
    c='black'
)
plt.xlabel("Money spent on TV ads ($)")
plt.ylabel("Sales ($)")
plt.show()

We start by finding the best approximation $y_i \approx \beta x_i$, where $y_i\in\mathbb{R}$ is the sales and $x_i\in\mathbb{R}$ is the money spent on TV ads for observation $i$. We want to find the constant $\beta$ that minimizes $\sum_{i=1}^n (y_i - \beta x_i)^2$.
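
Setting the derivative of this objective with respect to $\beta$ to zero gives the closed form $\beta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}$. The following is a minimal sketch of that formula on synthetic, noiseless data, not on the advertising data. The names `toy_x`, `toy_y` and `fit_through_origin` are hypothetical and chosen so that they do not overwrite the notebook's variables.

```python
import numpy as np

# Synthetic data (an assumption for illustration only, not the Advertising data).
rng = np.random.default_rng(0)
toy_x = rng.uniform(1.0, 10.0, size=50)
toy_y = 2.0 * toy_x  # noiseless, so the true slope is exactly 2


def fit_through_origin(x, y):
    """Least-squares slope for the model y ≈ beta * x (no intercept)."""
    return np.dot(x, y) / np.dot(x, x)


toy_beta = fit_through_origin(toy_x, toy_y)

# Predictions at the observed points are simply beta * x_i.
toy_predictions = toy_beta * toy_x
```

Because the toy data has no noise, the recovered slope should match the true slope up to floating-point error; on real data the fit is only the best line through the origin.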

In [None]:
y = data['sales'].values
X_simple = data['TV'].values
beta_simple = # TODO: fill in
print("The linear model is: Y = {:.5}X".format(beta_simple))

Given this $\beta$, we compute the predictions for the points $x_i$ we have in the data set as $\beta x_i$.

In [None]:
predictions_simple = # TODO: fill in

Now we plot our predictions to see how well they fit the data.

In [None]:
plt.figure(figsize=(16, 8))
plt.scatter(
    data['TV'],
    data['sales'],
    c='black'
)
plt.plot(
    data['TV'],
    predictions_simple,
    c='blue',
    linewidth=2
)
plt.xlabel("Money spent on TV ads ($)")
plt.ylabel("Sales ($)")
plt.show()

We typically want to quantify the quality of learned models. In this lab, we will use the root mean squared error (RMSE), defined as $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat y_i)^2}$, where $n$ is the number of observations, $y_i$ are the actual outcomes and $\hat y_i$ are our predictions. A lower RMSE means a better fit.
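
As a sanity check on the definition, here is a small sketch that evaluates the formula on three hand-picked points. The helper `rmse_sketch` and the toy arrays are hypothetical names used only for this illustration.

```python
import numpy as np


def rmse_sketch(preds, true_y):
    """Root mean squared error between predictions and actual outcomes."""
    preds = np.asarray(preds, dtype=float)
    true_y = np.asarray(true_y, dtype=float)
    return np.sqrt(np.mean((true_y - preds) ** 2))


# Hand-picked toy values: the errors are 1, 0 and -2.
toy_true = np.array([3.0, 5.0, 7.0])
toy_preds = np.array([2.0, 5.0, 9.0])

# Squared errors are 1, 0 and 4, so the RMSE is sqrt(5 / 3).
toy_rmse = rmse_sketch(toy_preds, toy_true)
```

The RMSE is in the same units as the outcome, which makes it easier to interpret than the raw sum of squared errors.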

In [None]:
def RMSE(preds, true_y):
    # Root mean squared error between predictions and actual outcomes
    rmse = # TODO: fill in
    return rmse

Compute the RMSE of our first model.

In [None]:
print("The RMSE of our model is: {}".format(RMSE(predictions_simple, y)))

Now we check whether adding an intercept term helps. We want to approximate $y_i \approx \beta_0 + \beta_1 x_i$, where $y_i,x_i\in\mathbb{R}$ are sales and money spent on TV ads, and we look for the $\beta=(\beta_0,\beta_1)$ that minimizes $\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2$. To do so, we augment the data matrix with a constant, so that every observation becomes $x_i=(1,x_i(1))=(x_i(0),x_i(1))$ and we minimize $\sum_{i=1}^n (y_i - \beta_0 x_i(0) - \beta_1 x_i(1))^2$.
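
With the augmented data matrix $X$ (one row per observation), the minimizer satisfies the normal equations $X^\top X \beta = X^\top y$. The sketch below solves them on synthetic, noiseless data. It builds the column of ones with `np.column_stack` instead of `sm.add_constant` only to stay self-contained; the constant comes first, as in the augmented observations above. All `toy_` names are hypothetical.

```python
import numpy as np

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(1)
toy_x = rng.uniform(0.0, 10.0, size=100)
toy_y = 3.0 + 0.5 * toy_x  # noiseless: intercept 3, slope 0.5

# Augment with a leading column of ones so that beta[0] acts as the intercept.
toy_X = np.column_stack([np.ones_like(toy_x), toy_x])

# Solve the normal equations X^T X beta = X^T y.
toy_beta = np.linalg.solve(toy_X.T @ toy_X, toy_X.T @ toy_y)

# Predictions are beta_0 * 1 + beta_1 * x_i, i.e. a matrix-vector product.
toy_predictions = toy_X @ toy_beta
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse, which is generally the more numerically stable choice.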

In [None]:
X_simple_w_constant = sm.add_constant(X_simple) # augmenting the data matrix with a constant
beta_w_constant = # TODO: fill in
print("The linear model is: Y = {:.5} + {:.5}X".format(beta_w_constant[0], beta_w_constant[1]))

Given this $\beta$, we compute the predictions for the points $x_i$ we have in the data set as $\beta_0x_i(0) + \beta_1 x_i(1)$.

In [None]:
predictions_w_constant = # TODO: fill in

Now we plot our predictions to see how well they fit the data.

In [None]:
plt.figure(figsize=(16, 8))
plt.scatter(
    data['TV'],
    data['sales'],
    c='black'
)
plt.plot(
    data['TV'],
    predictions_w_constant,
    c='blue',
    linewidth=2
)
plt.xlabel("Money spent on TV ads ($)")
plt.ylabel("Sales ($)")
plt.show()

Again we compute the RMSE of our model.

In [None]:
print("The RMSE of our model is: {}".format(RMSE(predictions_w_constant, y)))

## Transforming the features

Looking at the data, it seems that the square root function might fit the mapping from TV ads to sales better. We can still use the linear regression formula, because the model remains linear in the coefficients $\beta$; we just need to transform the feature TV ads to $\sqrt{\text{TV ads}}$. Find the model to approximate $y_i \approx \beta_0 + \beta_1 \sqrt{x_i}$.
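
The idea can be sketched on synthetic data as follows: the feature is transformed first, and the fit itself is an ordinary least-squares problem. Here `np.linalg.lstsq` is used as one possible solver, and all `toy_` names are hypothetical.

```python
import numpy as np

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(2)
toy_x = rng.uniform(1.0, 100.0, size=100)
toy_y = 1.0 + 2.0 * np.sqrt(toy_x)  # noiseless square-root relationship

# Transform the feature, then augment with a constant column as before.
toy_X_sqrt = np.column_stack([np.ones_like(toy_x), np.sqrt(toy_x)])

# Ordinary least squares on the transformed feature.
toy_beta_sqrt, *_ = np.linalg.lstsq(toy_X_sqrt, toy_y, rcond=None)

# Predictions must use the same transformed design matrix.
toy_predictions_sqrt = toy_X_sqrt @ toy_beta_sqrt
```

The same pattern applies to other transformations, such as logarithms or polynomial terms, as long as the model stays linear in $\beta$.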

In [None]:
X_sqrt_w_constant = sm.add_constant(np.sqrt(X_simple))
beta_sqrt_w_constant = # TODO: fill in
print("The linear model is: Y = {:.5} + {:.5}sqrt(X)".format(beta_sqrt_w_constant[0], beta_sqrt_w_constant[1]))

As before, compute the predictions.

In [None]:
predictions_sqrt_w_constant = # TODO: fill in

Now we compute the RMSE of this model.

In [None]:
print("The RMSE of our model is: {}".format(RMSE(predictions_sqrt_w_constant, y)))

## Multivariate linear regression
Now we will try predicting the sales given the money spent on TV, radio and newspaper ads. In this case, unfortunately, we can't visualize the whole data set.

We want to approximate $y_i \approx \beta^\top x_i$, where $y_i\in\mathbb{R}$ is sales, and $x_i\in\mathbb{R}^4$ is money spent on TV, radio and newspaper ads, already augmented with a constant 1. We want to find the vector $\beta$ that minimizes $\sum_{i=1}^n (y_i - \beta^\top x_i)^2$.
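
Stacking the observations $x_i$ as rows of a matrix $X$, the minimizer satisfies the normal equations $X^\top X \beta = X^\top y$, exactly as in the single-feature case. The following sketch applies this to three synthetic features plus a constant. The data, the coefficient values in `true_beta` and the helper `fit_least_squares` are all hypothetical.

```python
import numpy as np

# Synthetic data with three features (an assumption for illustration only).
rng = np.random.default_rng(3)
toy_features = rng.uniform(0.0, 50.0, size=(200, 3))
true_beta = np.array([4.0, 0.05, 0.2, -0.01])  # intercept first, then 3 slopes

# Augment with a leading constant column so each row lives in R^4.
toy_X = np.column_stack([np.ones(len(toy_features)), toy_features])
toy_y = toy_X @ true_beta  # noiseless outcomes


def fit_least_squares(X, y):
    """Solve the normal equations X^T X beta = X^T y for beta."""
    return np.linalg.solve(X.T @ X, X.T @ y)


toy_beta = fit_least_squares(toy_X, toy_y)
toy_predictions = toy_X @ toy_beta
```

The normal equations have a unique solution only when $X^\top X$ is invertible, that is, when no feature column is a linear combination of the others.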

In [None]:
Xs = data.drop(['sales', 'Unnamed: 0'], axis=1)
X_multi = Xs.values
X_multi = sm.add_constant(X_multi) # augmenting the data matrix with a constant
beta_multi = # TODO: fill in
print("The linear model is: Y = {:.5} + {:.5}*TV + {:.5}*radio + {:.5}*newspaper".format(beta_multi[0], beta_multi[1], beta_multi[2], beta_multi[3]))

Now we compute the predictions:

In [None]:
predictions_multi = # TODO: fill in

We compute the RMSE of the final model:

In [None]:
print("The RMSE of our model is: {}".format(RMSE(predictions_multi, y)))

Summarize your observations in 2-3 sentences.