# Basics

It's exam season! We collected samples from 200 students who had a free day before their last exam, and some of them decided to throw a party.

<div class="alert alert-block alert-info">
Load the dataset. It contains data on 200 "students": their age, the number of hours they worked, the number of pints they had the night before and their final grade (0-20).

Before going further, back to basics: compute the minimum, maximum and mean of each column.

In [2]:
import numpy as np
import matplotlib.pyplot as plt

In [4]:
ds = np.load("dataset.npy")

columns_names = ['age', 'hours', 'pints', 'grade']
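As a minimal sketch of the per-column statistics, the block below uses a synthetic array (`ds_demo`, a hypothetical stand-in with invented value ranges) instead of the real `dataset.npy`, and assumes one student per row and the four columns in the order given by `columns_names`.

```python
import numpy as np

# Synthetic stand-in for np.load("dataset.npy"): 200 rows, 4 columns.
# The value ranges are invented for illustration only.
rng = np.random.default_rng(0)
ds_demo = np.column_stack([
    rng.integers(18, 26, size=200),   # age
    rng.uniform(0, 10, size=200),     # hours
    rng.uniform(0, 8, size=200),      # pints
    rng.uniform(0, 20, size=200),     # grade
])
columns_names = ['age', 'hours', 'pints', 'grade']

# axis=0 reduces over the rows, giving one value per column.
col_min = ds_demo.min(axis=0)
col_max = ds_demo.max(axis=0)
col_mean = ds_demo.mean(axis=0)

for name, lo, hi, avg in zip(columns_names, col_min, col_max, col_mean):
    print(f"{name:>5}: min={lo:.2f}  max={hi:.2f}  mean={avg:.2f}")
```

With the real data, replacing `ds_demo` by `ds` should be enough, provided the file really holds a 2-D array with that column order.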

## A tale of prediction

Machine learning is mostly used to predict the output of a given system provided a known input. The prediction itself can take different forms,
require more or less postprocessing to be usable, but in the end, it's a guess.

<div class="alert alert-block alert-info">
To explore this world a bit, let's turn to the dataset and check the link between the number of hours worked and the number of pints...
Plot the former against the latter. It's always a good idea to visualize your data.
</div>

<div class="alert alert-block alert-warning">
Note that there's no reason for the samples to be ordered, so it's probably better to use <code>plt.scatter</code> instead of <code>plt.plot</code>. An interesting trick when data points may overlap is to use the <code>alpha</code> parameter to make the markers translucent, which makes possible clusters easier to see (<code>alpha=0.2</code> should work alright).
</div>

In [792]:
age = ds[:,0]
hours = ds[:,1]
pints = ds[:,2]
grade = ds[:,3]
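A possible scatter plot is sketched below on synthetic data (`hours_demo` and `pints_demo` are hypothetical stand-ins, and the relationship between them is invented). Putting hours on the horizontal axis is an assumption; swap the two arguments if you prefer the other orientation.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the hours and pints columns (not the real dataset).
rng = np.random.default_rng(1)
hours_demo = rng.uniform(0, 10, size=200)
pints_demo = np.clip(6 - 0.5 * hours_demo + rng.normal(0, 1, size=200), 0, None)

fig, ax = plt.subplots()
# alpha < 1 makes overlapping markers add up, so dense regions look darker.
ax.scatter(hours_demo, pints_demo, alpha=0.2)
ax.set_xlabel("hours worked")
ax.set_ylabel("pints")
```

In a notebook the figure is rendered at the end of the cell; in a script you would typically call `plt.show()`.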

Let's try to build a model to represent that relationship.

<div class="alert alert-block alert-success">
A <b>model</b> is an idealized representation of reality. The goal is to capture the <b>essential behaviour</b> and discard noise effects.

Here, we could try a 1- or 2-parameter model (constant or linear with an intercept). If we go down this path, we already know that we will not capture all of the observations. To determine the parameters, we can proceed in three different ways:

- Compute the least-squares regression by solving the normal equations $M^TM p = M^T y$, with $p = [a_N, a_{N-1},\ldots,a_1,a_0]$ the vector of coefficients, $M = [x^N|x^{N-1}|\ldots|x|1]$ the matrix of powers of the input, and $(x, y)$ the input/output **vectors**
- Use numpy's ``polyfit`` function
- Use sklearn's ``LinearRegression`` class
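To make the first two options concrete, here is a minimal sketch on synthetic data (an invented noisy quadratic, not the student dataset). It builds $M$ with `np.vander`, solves the normal equations with `np.linalg.solve`, and compares the result with `np.polyfit`.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
# Invented noisy quadratic, used only to have something to fit.
y = 1.5 * x**2 - 2.0 * x + 0.5 + rng.normal(0, 1.0, size=50)

N = 2  # polynomial degree
# Columns are x^N, x^(N-1), ..., x, 1: the same ordering as the vector p.
M = np.vander(x, N + 1)

# Solve M^T M p = M^T y for p.
p_normal = np.linalg.solve(M.T @ M, M.T @ y)

# np.polyfit also returns the coefficients highest degree first.
p_polyfit = np.polyfit(x, y, N)

print(p_normal)
print(p_polyfit)
```

Both vectors should agree up to rounding. For high degrees the product $M^TM$ becomes badly conditioned, which is one reason to prefer `np.polyfit` in practice.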

<div class="alert alert-block alert-info">
    Try a 3-parameter polynomial fit, i.e. degree 2, on the proposed data (<code>numpy.polyfit</code>). Plot the result against the data.<br/>
What do you observe?
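A degree-2 fit can be sketched as follows, assuming synthetic data (the decaying relationship below is invented and only stands in for the real columns). `np.polyfit` estimates the three coefficients and `np.polyval` evaluates the fitted polynomial.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=200)
# Invented relationship, chosen so that a parabola cannot follow it closely.
y = 5.0 * np.exp(-0.4 * x) + rng.normal(0, 0.3, size=200)

# Degree 2 means three coefficients: a_2, a_1, a_0.
coeffs = np.polyfit(x, y, 2)

# Evaluate the fit on a sorted grid so that it draws as a smooth line.
x_grid = np.linspace(x.min(), x.max(), 100)
y_grid = np.polyval(coeffs, x_grid)
print(coeffs)
```

Drawing `(x_grid, y_grid)` with `plt.plot` on top of the scatter of `(x, y)` gives the comparison asked for in the exercise.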

It looks like the fit misses part of what is going on.

<div class="alert alert-block alert-success">
This is typical of <b>underfitting</b>:<ul>
    <li>The model is too simple to accurately represent the data</li>
    <li>It doesn't capture noise but misses information</li>
</ul>
</div>

<div class="alert alert-block alert-info">
Try increasing the polynomial degree to 3, 4 or 10 to fit data better.

Hmm... this time, we see some improvement. The model seems to somewhat capture the data for a polynomial of degree 4.
Between a 4th order and a 10th order polynomial, there's not much improvement though. So, which one should we use?

## How close to the "truth"?

Here, we're playing with simulated data, so the "ground truth" is known.

<div class="alert alert-block alert-info">
Load the <code>real_pint_data</code> file and plot it alongside the previous fits. It contains denoised data and will show us what the actual phenomenon looks like.

In [5]:
ground_truth = np.load("real_pint_data.npy")

OK so it seems that a 4th order fit is enough here... 

<div class="alert alert-block alert-success">
    Higher order models might look OK at first sight, but they have an inherent issue: <b>they overfit</b>.
That means:
    <ul>
        <li>the large number of parameters allows them to fit the proposed data <i>too well</i>, so they capture noise</li>
        <li>they are not able to <b>generalize</b>: if presented with new inputs, they won't be able to correctly predict an output.</li>
    </ul>


Note that there's always a model large enough to perfectly match all presented datapoints.
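For polynomials this is easy to check: a polynomial of degree $n-1$ has $n$ coefficients and can pass through $n$ points with distinct inputs. The sketch below fits pure noise exactly (the six points are invented for the illustration).

```python
import numpy as np

rng = np.random.default_rng(3)
x_pts = np.linspace(0.0, 1.0, 6)        # 6 distinct inputs
y_pts = rng.normal(0.0, 1.0, size=6)    # pure noise, no structure at all

# Degree n-1 gives n coefficients, enough to pass through n points.
coeffs = np.polyfit(x_pts, y_pts, deg=len(x_pts) - 1)
residual = y_pts - np.polyval(coeffs, x_pts)

# The largest residual is at rounding-error level: the noise is fitted exactly.
print(np.max(np.abs(residual)))
```

A perfect match on the presented points therefore says nothing about how the model behaves between or beyond them.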

## Generalization, prediction accuracy

Let's try to measure the accuracy of our prediction.
The easy way to do that is to compute an error measure on the predicted data vs real data.

$$\epsilon = \frac{1}{N}\sum_i L(y_i, f(x_i))$$

with the model $f$ linking input $x_i$ to a prediction that must be compared to the observed data $y_i$ through a "distance" $L$.

Let's use the squared error : $L(y_{true}, y_{pred}) = (y_{true} - y_{pred})^2$

<div class="alert alert-block alert-info">
Compute the error for the different polynomial orders we tried, then plot the error versus the polynomial degree.

In [8]:
def MSE(x, y, model):
    # code here
    pass
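One possible way to fill in such a function is sketched below. It assumes the model is an array of polynomial coefficients as returned by `np.polyfit`; the name `mse_poly` is hypothetical and the tiny dataset is made up so the result can be checked by hand.

```python
import numpy as np

def mse_poly(x, y, coeffs):
    """Mean squared error of a polynomial model on the samples (x, y).

    `coeffs` is assumed to be ordered highest degree first, as in np.polyfit.
    """
    y_pred = np.polyval(coeffs, x)
    return np.mean((y - y_pred) ** 2)

# Hand-checkable example: y = 2x + 1 is reproduced exactly by coeffs [2, 1].
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 2.0 * x_demo + 1.0
print(mse_poly(x_demo, y_demo, [2.0, 1.0]))  # 0.0
print(mse_poly(x_demo, y_demo, [2.0, 0.0]))  # 1.0: every prediction is off by 1
```

This is the formula for $\epsilon$ above with the squared error as $L$ and the polynomial as $f$.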

The error we just computed is called the *train* error, because it is the error observed between the developed model and the data we used to adjust it.

In terms of machine learning, what we're actually looking for is not just to correctly approximate a set of observations but also to be able to handle information that we've never seen before.

<div class="alert alert-block alert-info">
Let's extend the example to a larger dataset. We have the records of 1000 students in the file <code>full_ds</code>. Load it, plot the scatter of the work hours/pints relationship and use your models trained on 200 points to match the 1000 points. If the model is correct, it should generalize.

In [9]:
fullds = np.load('full_ds.npy')
age1k = fullds[:,0]
hours1k = fullds[:,1]
pints1k = fullds[:,2]
grade1k = fullds[:,3]

Let's plot the evolution of the error on the large dataset with increasing polynomial degree. For the sake of getting a nice graph, retrain your polynomial models for degrees going from 0 to 8 and recompute the errors.

For each model there's a sweet spot, and this graph is perfect for finding it. The model is optimal when the test error (the error on data that was not used for fitting) is at a minimum, which usually happens just after a drop in the train error.
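The sweep over degrees can be sketched as below. The data is synthetic: `true_curve` is a hypothetical ground truth chosen only for this illustration, with 200 noisy training samples and 1000 noisy test samples mimicking the two dataset sizes.

```python
import numpy as np

def true_curve(x):
    # Hypothetical ground truth, chosen only for this illustration.
    return np.sin(x) + 0.1 * x

rng = np.random.default_rng(4)
x_train = rng.uniform(0, 6, size=200)
y_train = true_curve(x_train) + rng.normal(0, 0.3, size=200)
x_test = rng.uniform(0, 6, size=1000)
y_test = true_curve(x_test) + rng.normal(0, 0.3, size=1000)

degrees = range(0, 9)
train_err, test_err = [], []
for d in degrees:
    # Fit on the small set only, then measure the error on both sets.
    coeffs = np.polyfit(x_train, y_train, d)
    train_err.append(np.mean((y_train - np.polyval(coeffs, x_train)) ** 2))
    test_err.append(np.mean((y_test - np.polyval(coeffs, x_test)) ** 2))

best = int(np.argmin(test_err))
print("degree with the lowest test error:", best)
```

Plotting `train_err` and `test_err` against `degrees` gives the graph discussed above. The train error can only decrease as the degree grows, while the test error is the one that reveals the sweet spot.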

<div class="alert alert-block alert-info">
Let's play some more with this by changing how many of the records we use to train the regressions. We'll keep p=4, join the two datasets together and take 10%, 20%, 30%, etc. of the joined dataset to train, keeping the rest to evaluate the quality of the training.
</div>

<div class="alert alert-block alert-warning">
There's a useful function to do that in scikit-learn: <code>sklearn.model_selection.train_test_split</code>

In [10]:
from sklearn.model_selection import train_test_split
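A sketch of that experiment follows, assuming synthetic data in place of the joined datasets (`x_all` and `y_all` are hypothetical, as is the underlying sine relationship) and reading p=4 as the polynomial degree. `train_size` is the fraction of samples kept for training, and `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
x_all = rng.uniform(0, 6, size=1200)
y_all = np.sin(x_all) + rng.normal(0, 0.3, size=1200)  # synthetic stand-in

degree = 4
results = []
for frac in (0.1, 0.3, 0.5, 0.7, 0.9):
    # Shuffle and split: `frac` of the samples train, the rest evaluate.
    x_tr, x_te, y_tr, y_te = train_test_split(
        x_all, y_all, train_size=frac, random_state=0
    )
    coeffs = np.polyfit(x_tr, y_tr, degree)
    err_tr = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
    err_te = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)
    results.append((frac, len(x_tr), err_tr, err_te))
    print(f"train fraction {frac:.1f}: train MSE={err_tr:.3f}, test MSE={err_te:.3f}")
```

Each entry of `results` holds the fraction, the number of training samples and both errors, ready to be plotted against the training fraction.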

# Looking at the full dataset

Let's stop focusing on that hours/pints relationship and move on to the full dataset.

You've seen how even linear models can manage to represent complex links between features. Now, how can we predict the grade from the other columns?

Scikit-learn is a very complete toolbox for all sorts of data processing and machine learning tasks (it even has simple neural nets to do some preliminary analyses).
    
In what follows, we'll stick to linear models but use a built-in regressor based on Stochastic Gradient Descent. You'll meet this term again in the neural net realm, as it is a rather efficient optimisation strategy.

<div class="alert alert-block alert-info">
From <code>sklearn.linear_model</code>, import and instantiate a <code>SGDRegressor</code> object.

In [780]:
from sklearn import linear_model

<div class="alert alert-block alert-info">Join the age, hours and pints vectors into a matrix, 1 sample per row.

<div class="alert alert-block alert-info">Use the <code>fit</code> method of the regressor to learn from the inputs/grade pairs.

<div class="alert alert-block alert-info">Join the age, hours and pints vectors of the 1k sample dataset the same way and pass the resulting matrix to the <code>predict</code> method.

<div class="alert alert-block alert-info">Plot a scatter of the predicted grades versus the actual grades from the 1k sample dataset. If your regression worked well, the points should align on the $x=y$ line.
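The mechanics of these steps can be sketched as follows, on synthetic data: `make_students` is a hypothetical helper, and its column ranges and grade formula are invented. The features are deliberately left unscaled, so the quality of the fit is not the point here, only the shapes and the calls.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(6)

def make_students(n):
    # Synthetic stand-in: the column ranges and the grade formula are invented.
    age = rng.integers(18, 26, size=n).astype(float)
    hours = rng.uniform(0, 10, size=n)
    pints = rng.uniform(0, 8, size=n)
    grade = np.clip(8 + 1.2 * hours - 0.8 * pints + rng.normal(0, 1, size=n), 0, 20)
    return age, hours, pints, grade

age_s, hours_s, pints_s, grade_s = make_students(200)
age_l, hours_l, pints_l, grade_l = make_students(1000)

# One sample per row, one feature per column: shape (n_samples, 3).
X_train = np.column_stack([age_s, hours_s, pints_s])
X_test = np.column_stack([age_l, hours_l, pints_l])

reg = SGDRegressor(random_state=0)
reg.fit(X_train, grade_s)
predicted_grade = reg.predict(X_test)
print(X_train.shape, predicted_grade.shape)
```

A scatter of `predicted_grade` against `grade_l` then shows how far the predictions fall from the $x=y$ line.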

Hmmm... You don't have to believe me right away, but I'm telling you: this dataset **is** linearly separable, so this should work. What is wrong, then?

<div class="alert alert-block alert-warning">It is always a good idea to normalize your inputs and outputs when dealing with data.</div>

<div class="alert alert-block alert-info">Using the <code>StandardScaler</code> and <code>MinMaxScaler</code> from <code>sklearn.preprocessing</code>, transform your input and output data for both the training and testing datasets.

In [787]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler

<div class="alert alert-block alert-info">Retrain a <code>SGDRegressor</code> on the normalized data and draw the same scatter plot as before. What can you conclude?
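One way to wire the scalers in is sketched below, again on synthetic data (`make_xy` is a hypothetical helper with an invented linear grade rule). Using `StandardScaler` for the inputs and `MinMaxScaler` for the grades is one choice among several, and fitting the scalers on the training data only is an assumption about good practice rather than something the exercise prescribes.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(7)

def make_xy(n):
    # Synthetic stand-in: feature ranges and the linear grade rule are invented.
    X = np.column_stack([
        rng.integers(18, 26, size=n).astype(float),  # age
        rng.uniform(0, 10, size=n),                   # hours
        rng.uniform(0, 8, size=n),                    # pints
    ])
    y = 8 + 1.2 * X[:, 1] - 0.8 * X[:, 2] + rng.normal(0, 0.5, size=n)
    return X, y

X_train, y_train = make_xy(200)
X_test, y_test = make_xy(1000)

# Fit the scalers on the training data, then reuse them on the test data.
x_scaler = StandardScaler().fit(X_train)
y_scaler = MinMaxScaler().fit(y_train.reshape(-1, 1))  # scalers expect 2-D input

reg = SGDRegressor(random_state=0)
reg.fit(x_scaler.transform(X_train),
        y_scaler.transform(y_train.reshape(-1, 1)).ravel())

# Predict in scaled units, then map the predictions back to grades.
pred_scaled = reg.predict(x_scaler.transform(X_test))
pred = y_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).ravel()
mse = np.mean((y_test - pred) ** 2)
print("test MSE:", mse)
```

Mapping the predictions back with `inverse_transform` keeps the scatter plot in grade units, so it can be compared directly with the previous one.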

<div class="alert alert-block alert-info">Some samples are still off the line. Look into the scikit-learn documentation and the previous work on the hours/pints relationship to find a way to match these anyway.