# Linear Regression - Advertising Example
Adapted from Chapter 3 of the textbook *An Introduction to Statistical Learning with Applications in Python*.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import warnings
warnings.filterwarnings('ignore')  # ignore some deprecation warnings

### Load Data

Consider the advertising data as mentioned in the Linear Models notes:

In [None]:
data = pd.read_csv('Advertising.csv')
data.head()

## Simple Linear Regression
Let's investigate the relationship between sales and the money spent on TV advertising.

### Arrange data into feature matrix and target vector

In [None]:
x = data['TV']
print(x.shape)
y = data['sales']
print(y.shape)

In [None]:
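# scikit-learn expects a 2-D feature array of shape (n_samples, n_features),
# so reshape the 1-D TV column into a single-column array
X = np.array(x)
X = X.reshape(-1, 1)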
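
The reshape step can also be written by selecting the column with a list of names, which keeps the data two-dimensional from the start. The sketch below uses a small made-up frame (`toy` and its values are hypothetical, not the advertising data) to show that both routes give the same single-column array.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the advertising data (values are made up)
toy = pd.DataFrame({"TV": [10.0, 20.0, 30.0], "sales": [1.0, 2.0, 3.0]})

x_1d = toy["TV"]                             # Series with shape (3,)
X_reshaped = np.array(x_1d).reshape(-1, 1)   # ndarray with shape (3, 1)
X_frame = toy[["TV"]]                        # DataFrame with shape (3, 1)

print(x_1d.shape, X_reshaped.shape, X_frame.shape)
```

Passing a list of column names (`toy[["TV"]]`) returns a DataFrame rather than a Series, so no reshape is needed before fitting.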

### Do we have a linear relationship?

In [None]:
# lmplot draws a scatter plot with a fitted regression line and returns a FacetGrid
g = sns.lmplot(x='TV', y='sales', data=data)

### Implement linear regression model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=55)

lr = LinearRegression().fit(X_train, y_train)

What are the estimated coefficients?

In [None]:
print("lr.coef_: ", lr.coef_)
print("lr.intercept_: ", lr.intercept_)
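
The two printed values define the fitted line: predicted sales = intercept + coefficient × TV. As a minimal sketch, assuming made-up, noise-free data (the numbers 7.0 and 0.05 are chosen for illustration only), the following shows that applying this formula by hand reproduces `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data: sales = 7.0 + 0.05 * TV (numbers are made up)
tv = np.array([0.0, 50.0, 100.0, 150.0, 200.0]).reshape(-1, 1)
sales = 7.0 + 0.05 * tv.ravel()

model = LinearRegression().fit(tv, sales)

# coef_ holds one slope per feature; intercept_ is the prediction at TV = 0
slope = model.coef_[0]
intercept = model.intercept_

# Reproduce predict() by hand from the fitted line
manual_pred = intercept + slope * tv.ravel()
print(np.allclose(manual_pred, model.predict(tv)))
```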

How accurate is our estimated model?

In [None]:
from sklearn.metrics import mean_squared_error

y_pred = lr.predict(X_test)
mean_squared_error(y_test, y_pred)

In [None]:
print("Training score:", lr.score(X_train, y_train))
print("Testing score:", lr.score(X_test, y_test))

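Looking at the R^2 scores, it seems that we are underfitting the data: when the training score is modest and close to the testing score, the model has high bias and low variance, which is what we would expect from a model with a single feature.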
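
For reference, the two metrics used here can be computed by hand. MSE is the mean of the squared residuals, and R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. The sketch below checks both against scikit-learn on synthetic data (`X_toy` and `y_toy` are hypothetical and randomly generated, not the advertising data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical synthetic data: a noisy straight line
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 300, size=(100, 1))
y_toy = 7.0 + 0.05 * X_toy.ravel() + rng.normal(0, 3, size=100)

model = LinearRegression().fit(X_toy, y_toy)
y_hat = model.predict(X_toy)

ss_res = np.sum((y_toy - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)   # total sum of squares

mse_manual = ss_res / len(y_toy)
r2_manual = 1 - ss_res / ss_tot

# Both should agree with scikit-learn's implementations
print(np.isclose(mse_manual, mean_squared_error(y_toy, y_hat)))
print(np.isclose(r2_manual, model.score(X_toy, y_toy)))
```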

What happens if we include more than one feature?

## Multiple Linear Regression
We will repeat the same analysis as above, but with all three advertising budgets.

In [None]:
X = data.drop('sales', axis=1)
print(X.shape)
y = data['sales']
print(y.shape)

### Are the features correlated?

In [None]:
sns.pairplot(data)

In [None]:
g = sns.heatmap(data.corr(method='spearman'), annot=True)

### Implement linear regression model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)

What are the estimated coefficients?

In [None]:
print("lr.coef_: ", lr.coef_)
print("lr.intercept_: ", lr.intercept_)

The coefficient for TV is similar to the one from the simple linear regression example. The largest coefficient belongs to radio, and the coefficient for newspaper is close to zero. Assuming all three budgets are measured in the same units, the coefficients can be compared directly, so newspaper spending contributes much less to the model than the other two.

Looking at the heatmap, this makes sense: newspaper is correlated with radio, but not with sales. In other words, radio and newspaper budgets tend to rise together, but increasing newspaper spending on its own does not necessarily increase sales.
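
The mechanism can be sketched with synthetic data. In the hypothetical setup below (all numbers are made up), sales depend only on radio, while newspaper merely moves together with radio. Fitted alone, newspaper picks up a positive slope through its link with radio; once radio is included, the newspaper coefficient shrinks towards zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic budgets: newspaper tracks radio, sales depend on radio only
rng = np.random.default_rng(0)
n = 2000
radio = rng.uniform(0, 50, size=n)
newspaper = 0.5 * radio + rng.normal(0, 10, size=n)
sales = 3.0 + 0.2 * radio + rng.normal(0, 1, size=n)

# Newspaper as the only feature
simple = LinearRegression().fit(newspaper.reshape(-1, 1), sales)

# Radio and newspaper together
both = LinearRegression().fit(np.column_stack([radio, newspaper]), sales)

newspaper_alone = simple.coef_[0]
radio_joint, newspaper_joint = both.coef_

print("newspaper alone:     ", newspaper_alone)
print("radio (joint):       ", radio_joint)
print("newspaper (joint):   ", newspaper_joint)
```

This is only an illustration of how correlated features share credit in a multiple regression; it does not reproduce the advertising results.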

In [None]:
from sklearn.metrics import mean_squared_error

y_pred = lr.predict(X_test)
mean_squared_error(y_test, y_pred)

In [None]:
print("Training score:", lr.score(X_train, y_train))
print("Testing score:", lr.score(X_test, y_test))

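We can see that, by adding the additional features, we were able to decrease the test error (MSE) and the bias. Note that the two models were evaluated on different train/test splits (`random_state=55` versus `random_state=0`), so the comparison is indicative rather than exact.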