## Linear Regression

*Regression analysis* is a common statistical process for estimating the relationships between variables. This can allow us to make numeric predictions based on past data. *Simple Linear Regression* predicts a numeric response variable based on a single input variable (feature).

### Example 1: Simple Linear Regression

To demonstrate the use of simple linear regression with scikit-learn, we will first create sample data in the form of NumPy arrays.

In [None]:
import numpy as np
np.random.seed(0)
x = np.random.random(size=(15, 1))
y = 3 * x.flatten() + 2 + np.random.randn(15)

Plot the data using Matplotlib:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(x, y, 'o')

Apply simple linear regression to learn (fit) the model, where *x* is our input variable and *y* is the target variable that we would like to learn how to predict:

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x, y)

Display the model parameters that we have learned: 

In [None]:
print("Model intercept is", model.intercept_)
print("Model slope is", model.coef_[0])
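As a sanity check, the fitted parameters can be compared against the textbook closed-form least-squares estimates for a single feature. The sketch below assumes the same seed and data-generating formula as above; the names `check_model`, `slope` and `intercept` are introduced here purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate the sample data from this example (same seed, same formula)
np.random.seed(0)
x = np.random.random(size=(15, 1))
y = 3 * x.flatten() + 2 + np.random.randn(15)

# Closed-form least-squares estimates for one input variable:
#   slope = covariance(x, y) / variance(x)
#   intercept = mean(y) - slope * mean(x)
x_flat = x.flatten()
slope = np.sum((x_flat - x_flat.mean()) * (y - y.mean())) / np.sum((x_flat - x_flat.mean()) ** 2)
intercept = y.mean() - slope * x_flat.mean()

# Fit the same data with scikit-learn for comparison
check_model = LinearRegression().fit(x, y)
print("Closed-form intercept and slope:", intercept, slope)
print("scikit-learn intercept and slope:", check_model.intercept_, check_model.coef_[0])
```

The two pairs of values should agree up to floating-point rounding.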

This model can now be used to make predictions for *y* given new values of *x*:

In [None]:
# predict expects a 2D array of shape (n_samples, n_features)
x_unseen = np.array([[0.78]])
model.predict(x_unseen)
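To predict for several new values at once, we can arrange them as a single column. The sketch below refits the same sample data so that it runs on its own; the values in `x_new` are arbitrary, and `line_model` and `by_hand` are names introduced only for this illustration. It also shows that a prediction is simply the intercept plus the slope times *x*.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Recreate and fit the sample data from this example
np.random.seed(0)
x = np.random.random(size=(15, 1))
y = 3 * x.flatten() + 2 + np.random.randn(15)
line_model = LinearRegression().fit(x, y)

# Three arbitrary new x values, reshaped into a column of shape (3, 1)
x_new = np.array([0.1, 0.5, 0.78]).reshape(-1, 1)
y_new = line_model.predict(x_new)

# The same predictions computed by hand: intercept + slope * x
by_hand = line_model.intercept_ + line_model.coef_[0] * x_new.flatten()
print(y_new)
print(by_hand)
```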

Plot the data and the model prediction (i.e. the regression line):

In [None]:
# create predictions which we will use to generate our line
X_fit = np.linspace(0, 1, 100)[:, np.newaxis]
y_fit = model.predict(X_fit)
# plot the data
plt.plot(x.flatten(), y, 'o')
# plot the line
plt.plot(X_fit, y_fit)

### Example 2: Simple Linear Regression

As another example of simple linear regression, we will load a CSV dataset related to product advertising. We would like to analyse the relationship between the budget spent on different advertising media and product sales.

In [None]:
import pandas as pd
df = pd.read_csv("advertising.csv", index_col=0)
df.head()

We will try building a simple linear model to predict Sales based on the TV budget spend:

In [None]:
model = LinearRegression()
# create a copy of the data frame, with a single input variable
x = df[["TV"]]
# fit the model based on the original response variable
model.fit(x,df["Sales"])

In [None]:
print("Model intercept is", model.intercept_)
print("Model slope is", model.coef_[0])

Let's try to predict the first five values of the original data (note: normally we would use a separate test dataset in a real evaluation).

In [None]:
test_x = x[0:5]
model.predict(test_x)

When we compare the predictions to the actual sales values for the first 5 rows, we see there are some errors:

In [None]:
df["Sales"][0:5]

We can create a plot that shows how the regression line fits to the data for this feature:

In [None]:
plt.scatter(df["TV"], df["Sales"])
plt.xlabel("TV Budget Spend")
plt.ylabel("Sales")
# add the predictions from regression
plt.plot(df["TV"], model.predict(x), color="red")
plt.show()

We can calculate the overall *mean squared error* between the predictions and the actual sales data. This gives us an idea of how well the model based on TV budget predicts sales.

In [None]:
np.mean((df["Sales"] - model.predict(x)) ** 2)
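The mean squared error is the average of the squared differences between actual and predicted values. The sketch below uses a few made-up numbers (not taken from the advertising data) to show that the manual NumPy calculation matches `mean_squared_error` from `sklearn.metrics`. Taking the square root gives an error in the same units as the target, which can be easier to interpret.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values, purely for illustration
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

# Manual calculation: mean of the squared differences
mse_manual = np.mean((y_true - y_pred) ** 2)

# The equivalent scikit-learn helper
mse_sklearn = mean_squared_error(y_true, y_pred)

# Root mean squared error, in the same units as the target
rmse = np.sqrt(mse_manual)
print(mse_manual, mse_sklearn, rmse)
```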

We can repeat the same process using a different feature, such as newspaper budget spend:

In [None]:
# extract the relevant column
x = df[["Newspaper"]]
# build the model
model = LinearRegression()
model.fit(x,df["Sales"])

When we calculate the overall *mean squared error* between the predictions and the actual sales data, we see that making predictions based on the newspaper spend leads to a higher error - i.e. this feature is a less reliable predictor.

In [None]:
np.mean((df["Sales"] - model.predict(x)) ** 2)

For real evaluations we would use a separate *test set* to measure the quality of predictions:

In [None]:
# separate the training and test data - normally we would do this randomly
train_df = df[0:160]
test_df = df[160:200]
train_x = train_df[["Newspaper"]]
test_x = test_df[["Newspaper"]]

In [None]:
# only build a model on the training set
model = LinearRegression()
model.fit(train_x,train_df["Sales"])

In [None]:
model.predict(test_x)

In [None]:
np.mean((test_df["Sales"] - model.predict(test_x)) ** 2)
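The split above simply takes the first 160 rows for training. A random split is usually preferable, and one possible way to do it is with `train_test_split` from `sklearn.model_selection`. The sketch below runs on a synthetic stand-in for the advertising data (the values in `demo_df` are made up so that the example is self-contained), and the `demo_` names are introduced here only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data (values are made up)
rng = np.random.RandomState(0)
demo_df = pd.DataFrame({"Newspaper": rng.uniform(0, 100, size=200)})
demo_df["Sales"] = 12 + 0.05 * demo_df["Newspaper"] + rng.normal(0, 5, size=200)

# Randomly hold out 20% of the rows as a test set
demo_train_x, demo_test_x, demo_train_y, demo_test_y = train_test_split(
    demo_df[["Newspaper"]], demo_df["Sales"], test_size=0.2, random_state=0
)

# Fit on the training rows only, then measure the error on the held-out rows
demo_model = LinearRegression().fit(demo_train_x, demo_train_y)
demo_test_mse = np.mean((demo_test_y - demo_model.predict(demo_test_x)) ** 2)
print("Test MSE:", demo_test_mse)
```

Fixing `random_state` makes the split reproducible; without it, each run selects a different set of test rows.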

### Example 3: Multiple Linear Regression

Simple linear regression can easily be extended to include multiple features, where we try to learn a model with one coefficient per input feature.

We will use the previous advertising dataset, which had 3 independent features: TV, Radio, Newspaper.

In [None]:
df = pd.read_csv("advertising.csv", index_col=0)
# we remove the sales column that we are going to predict
x = df.drop("Sales",axis=1)
x.head()

Now use all 3 input variables to fit a linear regression model:

In [None]:
model = LinearRegression()
model.fit(x,df["Sales"])

When we build the model, note that each input feature has its own slope coefficient:

In [None]:
print("Model intercept is", model.intercept_)
print("Model slopes are", model.coef_)
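With several features, a prediction is the intercept plus the sum of each feature value multiplied by its coefficient. The sketch below illustrates this on synthetic data with 3 features (the data and the `demo_` names are made up for this illustration and are not the advertising dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with 3 input features (values are made up)
rng = np.random.RandomState(0)
demo_X = rng.uniform(0, 100, size=(50, 3))
demo_y = 3 + demo_X @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1, size=50)

demo_model = LinearRegression().fit(demo_X, demo_y)

# One coefficient per input feature
print(demo_model.coef_.shape)

# Predictions by hand: intercept + sum over features of (coefficient * value)
manual = demo_model.intercept_ + demo_X @ demo_model.coef_
print(np.allclose(manual, demo_model.predict(demo_X)))
```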

Again we can make predictions for sales based on new values for the 3 input features:

In [None]:
test_x = x[0:1]
print(test_x)
# predict returns an array, so take its first element before formatting
print("Predicted Sales = %.2f" % model.predict(test_x)[0])
print("Actual Sales = %.2f" % df["Sales"].iloc[0])

We can make predictions for multiple new unseen examples in the same way:

In [None]:
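# use the same column names as the training data, so the features line up by name
unseen_X = pd.DataFrame( [ [ 140.0, 45.3, 70.5 ], [ 70.0, 84.62, 98.95 ] ], columns=x.columns )
unseen_X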

In [None]:
model.predict( unseen_X )