# Linear Regression

In the last lab we manually figured out the multiplier that minimises the error, but this is a machine learning course! In this lab we'll let the machine figure out a model.

Linear regression is one of the simplest machine learning models. It attempts to do what we just performed manually: fit a line to a set of input features. Unlike our previous manual model, however, it also does the following:

- Includes a bias (the $+ c$ part of $y=mx + c$).
- Allows an arbitrary number of input features. For example, if you had 2 features $x_1$ and $x_2$, the model equation could be written as $y = w_1 x_1 + w_2 x_2 + c$.
- Automatically calculates the weights to minimise the Residual Sum of Squares.
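To make the last point concrete, here is a minimal sketch of a least-squares fit using NumPy. The toy data, the variable names and the `residual_sum_of_squares` helper are assumptions made up for illustration, and this is not necessarily how Scikit-Learn computes the weights internally. The data is generated from $y = 2 x_1 + 3 x_2 + 5$, so the fit should recover those weights and bias.

```python
import numpy as np

# Hypothetical toy data generated exactly from y = 2*x1 + 3*x2 + 5.
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

# Append a column of ones so the bias c is learned like any other weight.
X_with_bias = np.column_stack([X, np.ones(len(X))])

# Least squares finds the weights that minimise the Residual Sum of Squares.
weights, *_ = np.linalg.lstsq(X_with_bias, y, rcond=None)
w1, w2, c = weights


def residual_sum_of_squares(w):
    """Sum of squared differences between the labels and the predictions."""
    residuals = y - X_with_bias @ w
    return float(np.sum(residuals ** 2))


print("w1, w2, c:", w1, w2, c)
print("RSS at the fitted weights:", residual_sum_of_squares(weights))
print("RSS at nudged weights:", residual_sum_of_squares(weights + 0.1))
```

Nudging the weights away from the fitted values increases the Residual Sum of Squares, which is what "minimise" means here.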

In the code below we make use of the Scikit-Learn library, which provides us with functions to create and train a model.

Use the [Scikit-Learn Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf) to help you understand the code below.

**Functions**
- `sklearn.linear_model.LinearRegression` to create the model
- `.fit` and `.predict` to train the model and make predictions
- `.coef_` and `.intercept_` to get information on the resulting model

### First, load the data

As with the last lab, we'll load our dataset and split it into training and test sets.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

def load_housing_data(test_size=0.2, random_state=2020):
    # Load data from Eliiza's github page
    raw_data = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/housing_data.csv") 

    # Separate labels from feature columns.
    X = raw_data.drop('SalePrice', axis=1)
    y = raw_data['SalePrice']
    
    # Split the dataset with the requested proportions.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    # Return in standard order.
    return (X_train, y_train), (X_test, y_test)

In [None]:
(X_train, y_train), (X_test, y_test) = load_housing_data()

### Build a model!

In [None]:
from sklearn.linear_model import LinearRegression

selected_columns = ['OverallQual']

lin_reg = LinearRegression()

# Fit the model on the selected columns
lin_reg.fit(X_train[selected_columns], y_train)

### And test it

That's it! We've built a simple model, and now we can start making predictions using the test data.

In [None]:
from sklearn.metrics import mean_absolute_error

y_pred = lin_reg.predict(X_test[selected_columns])

mae = mean_absolute_error(y_test, y_pred)

print("Mean Absolute Error:", mae)
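As a reminder of what this metric measures, the sketch below computes the Mean Absolute Error by hand and compares it with `mean_absolute_error`. The prices and the names `y_true`, `y_guess` and `manual_mae` are hypothetical values chosen for illustration, not taken from the housing dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical sale prices and predictions, for illustration only.
y_true = np.array([200000, 150000, 320000])
y_guess = np.array([210000, 140000, 300000])

# MAE is the average of the absolute differences between labels and predictions.
manual_mae = np.mean(np.abs(y_true - y_guess))

print("Manual MAE:", manual_mae)
print("Scikit-Learn MAE:", mean_absolute_error(y_true, y_guess))
```

Both lines should print the same value, which is in the same units as the label (here, dollars).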

### Look at the model parameters

It is possible to inspect the model parameters, in this case the coefficients and bias:

In [None]:
print("Multiplier:", lin_reg.coef_)
print("Bias:", lin_reg.intercept_)

If you have time, try plugging these numbers into the simple predictor we made in the previous lab. Remember to add in a bias parameter though.
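One way this could look is sketched below on a small made-up dataset. The `simple_predictor` function is a hypothetical stand-in for the predictor from the previous lab (its actual name and signature may differ), and `X_toy`, `y_toy` and `toy_reg` are illustrative names only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical single-feature toy data (not the housing dataset).
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.1, 4.9, 7.2, 8.8])

toy_reg = LinearRegression().fit(X_toy, y_toy)


def simple_predictor(x, multiplier, bias):
    """Hypothetical manual predictor: y = m * x + c."""
    return multiplier * x + bias


# Plug the learned multiplier and bias into the manual predictor.
manual_pred = simple_predictor(X_toy[:, 0], toy_reg.coef_[0], toy_reg.intercept_)

# Check whether the manual predictions agree with the model's own predictions.
print(np.allclose(manual_pred, toy_reg.predict(X_toy)))
```

Note that `.coef_` is an array with one entry per input feature, so with a single feature we take `coef_[0]`.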

### Exercise 

Look back at the correlation matrix from the previous lab and try using other columns for the linear regression model.

- Do you get more accurate results?
- How many `.coef_` coefficients are there in the new model? Why?
- Does the model take longer to train?

## An introduction to input pipelines

When training and testing our models, we want to have a consistent input pipeline to ensure the data is sanitised correctly.

In this case we'll use a very simple pipeline that will provide a starting point to expand in future notebooks.

In [None]:
from sklearn.compose import ColumnTransformer

selected_columns = ['OverallQual', 'BedroomAbvGr']

simple_pipeline = ColumnTransformer([('column filter', 'passthrough', selected_columns)])
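To see what a `'passthrough'` column filter does, here is a small sketch on a made-up DataFrame. The values, the extra `YearBuilt` column and the names `toy` and `toy_filter` are assumptions for illustration. It relies on `ColumnTransformer` dropping any columns that are not listed, which is its default behaviour (`remainder='drop'`).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Hypothetical toy frame with one extra column that is not selected.
toy = pd.DataFrame({
    "OverallQual": [5, 7, 8],
    "BedroomAbvGr": [2, 3, 4],
    "YearBuilt": [1990, 2005, 2015],
})

toy_filter = ColumnTransformer(
    [("column filter", "passthrough", ["OverallQual", "BedroomAbvGr"])]
)

# The selected columns pass through unchanged; the rest are dropped.
toy_prepared = toy_filter.fit_transform(toy)

print(toy_prepared.shape)
print(toy_prepared)
```

Only the two selected columns remain in the output, with their values untouched.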

We need to fit both the pipeline and the model. In this case, the pipeline doesn't apply any computations so fitting it doesn't actually do anything. In future notebooks this will change.

In [None]:
# Pre-process input data. Include fit to configure any parameters in the pipeline.
X_train_prepared = simple_pipeline.fit_transform(X_train)

# Train Linear Regression model.
lin_reg = LinearRegression()
lin_reg.fit(X_train_prepared, y_train)

Our input pipeline has been configured, and our model has been trained. Now we can try them out and evaluate the performance.

In [None]:
# When testing, we don't want to fit the pipeline, only apply the transformations
X_test_prepared = simple_pipeline.transform(X_test)

y_pred = lin_reg.predict(X_test_prepared)

mae = mean_absolute_error(y_test, y_pred)

print("Mean Absolute Error:", mae)