# Simple ML models

We'll be using the following data science libraries throughout this course (click the links for cheat sheets provided by DataCamp):
* [numpy](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf) (for vectorised math operations)
* [pandas](https://datacamp-community-prod.s3.amazonaws.com/9f0f2ae1-8bd8-4302-a67b-e17f3059d9e8) (for dataframes)
* [keras](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf) (for neural networks)
* [scikit-learn](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Scikit_Learn_Cheat_Sheet_Python.pdf) (for other machine learning models)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn

The first step to training models is to figure out a way to tell how good a model is.  

You can't just train until the model "gets the right results": there will almost always be some difference between the predicted outcomes and the measured outcomes.

To tackle this, we'll split our data into a "training set" (for inspecting the data and training models) and a "test set" (for evaluating models).

This ensures a fair test of the model's ability to generalise to new examples, for the same reason that an exam contains different questions from the practice exams a student learns from.

Throughout this module, we'll be using the [Ames Housing Prices Data Set](https://www.kaggle.com/c/house-prices-advanced-regression-techniques).

The data is described [here](https://github.com/eliiza/ml-training-data/blob/master/housing_price_data/data_description.txt).

## Scikit-learn library

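- Intro to scikit-learn (imported as `sklearn`).
- After wrangling data with pandas, scikit-learn is the library we use for machine learning.
- Its big advantage is consistency: estimators share the same `fit` / `predict` interface.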

## Fetching the data

We'll start by making a function that fetches the dataset and splits it into a training set and a test set. The function will allow you to:

- Choose the percentage split between training and test data.
- Seed the random number generator to split the data the same way each run-through.

**sklearn function:** [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [None]:
from sklearn.model_selection import train_test_split

def load_housing_data(test_size=0.2, random_state=2020):
    # Load data from Eliiza's github page
    raw_data = pd.read_csv("https://raw.githubusercontent.com/eliiza/ml-training-data/master/housing_price_data/housing_data.csv") 

    # Separate labels from feature columns.
    X = raw_data.drop('SalePrice', axis=1)
    y = raw_data['SalePrice']
    
    # Split the dataset with the requested proportions.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    
    # Return in standard order.
    return (X_train, y_train), (X_test, y_test)

In [None]:
(X_train, y_train), (X_test, y_test) = load_housing_data()
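
To see what the two parameters do in isolation, here is a minimal sketch on a tiny made-up array (the `*_toy` names and values are illustrative assumptions, not part of the housing data). It checks the sizes produced by `test_size=0.2` and that reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up rows with two features each, purely for illustration.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Split twice with the same seed.
X_a, X_b, y_a, y_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=2020)
X_c, X_d, y_c, y_d = train_test_split(X_toy, y_toy, test_size=0.2, random_state=2020)

# 80% of the rows go to training, 20% to testing.
print("train shape:", X_a.shape, "test shape:", X_b.shape)

# The same seed selects the same test rows on both runs.
print("same test rows:", np.array_equal(y_b, y_d))
```

Changing `random_state` to a different integer would generally shuffle the rows differently, which is why fixing it makes results comparable between runs.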

### Exercise
- How many observations are in the training and test datasets?

## Select Performance Measure

- See scikit-learn's [regression metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics) for the available measures.

# Mean Absolute Error

Now that we have a test set, we can start to evaluate some models!

We'll use the mean absolute error (MAE). This is the average size of the difference between the predicted value and the observed value.

Formally, this is defined as

$$ \mathsf{MAE} = \frac{1}{n} \sum_{i=1}^n \left| \mathsf{predicted\_value}[i] - \mathsf{actual\_value}[i] \right| $$

It is good to understand how MAE is calculated, but scikit-learn provides functions to calculate it for you.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

# ...
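
As a quick sanity check of the formula, the sketch below computes the MAE by hand with NumPy and compares it with `mean_absolute_error`. The three prices are made-up values chosen for illustration, and the `*_toy` and `mae_*` names are assumptions of this example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up observed and predicted prices, for illustration only.
y_true_toy = np.array([200000, 150000, 310000])
y_pred_toy = np.array([210000, 140000, 300000])

# MAE by hand: mean of the absolute differences.
mae_manual = np.mean(np.abs(y_pred_toy - y_true_toy))

# The same quantity from scikit-learn (observed values first, predictions second).
mae_sklearn = mean_absolute_error(y_true_toy, y_pred_toy)

print("manual:", mae_manual, "sklearn:", mae_sklearn)
```

Each prediction here is off by exactly 10,000, so both calculations should agree on that value.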

# A heuristic model

You will most likely be familiar with the equation $y = mx + c$. We are going to go one step simpler and drop the intercept $c$, leaving $y = mx$. In this section we will pick a single column, which will be our $x$, and manually fit the gradient $m$, which we will call the multiplier.

In [None]:
def simple_predictor(X, target_column='OverallQual', multiplier=50000):
    # Predict the sale price as one feature column scaled by the multiplier (y = m * x).
    return multiplier * X[target_column]

In [None]:
# Reload the training and test data.
(X_train, y_train), (X_test, y_test) = load_housing_data(random_state=42)

In [None]:
# Make some predictions on the test dataset.
y_pred = simple_predictor(X_test, multiplier=50000)

mae = mean_absolute_error(y_test, y_pred)

print("Mean Absolute Error:", mae)

### Exercise

Try some different values of the multiplier and see if you can find an optimal value. What techniques can you use to improve your guessing?

One way to improve your guesses is to slightly adjust the multiplier up and down, determine which direction makes the error smaller and try a new guess in that direction.
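
The "adjust up and down" idea can be sketched as a simple loop. This is only an illustration under stated assumptions: the `quality` and `price` arrays are made-up stand-ins for a feature column and the sale prices, and `mae_for`, the starting guess, and the step size are hypothetical choices, not part of the course code.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Toy stand-in data (made up for illustration, not the Ames data).
quality = np.array([4, 5, 6, 7, 8])
price = np.array([120000, 150000, 185000, 210000, 250000])

def mae_for(multiplier):
    # Error of the one-parameter model: price is approximated by multiplier * quality.
    return mean_absolute_error(price, multiplier * quality)

multiplier = 50000.0  # initial guess
step = 8000.0         # how far to nudge the guess each time

while step >= 1:
    current = mae_for(multiplier)
    if mae_for(multiplier + step) < current:
        multiplier += step   # nudging up reduces the error
    elif mae_for(multiplier - step) < current:
        multiplier -= step   # nudging down reduces the error
    else:
        step /= 2            # neither direction helps: try smaller nudges

print("multiplier:", multiplier, "MAE:", mae_for(multiplier))
```

Halving the step whenever neither direction helps lets the search start with coarse moves and finish with fine ones. The same loop could be pointed at `simple_predictor` and the training data instead of the toy arrays.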

Another way is to plot the error across a range of multipliers, then estimate the best value.

In [None]:
errors = []
multipliers = []

for i in range(1000,60000,1000):
    y_pred = simple_predictor(X_train, multiplier=i)

    mae = mean_absolute_error(y_train, y_pred)
    
    errors.append(mae)
    multipliers.append(i)
    
plt.plot(multipliers, errors)
plt.xlabel('Multiplier')
plt.ylabel('Mean Abs. Error')

### Exercise

Based on the above graph, what is the best choice of multiplier? Are you able to get an MAE below $40,000?