# Correlation Metrics

## Authors
V. Aquiviva, B.W. Holwerda

## Learning Goals
* Introduction to regression with machine learning.
* Comparing a classical chi-squared fit with ML solutions.
* Understanding why choosing the right algorithm for the problem is key.

## Keywords
regression, machine learning, line fitting

## Companion Content


## Summary
One of the first and simplest things to do with data is to fit a linear relation. Here we do so first with a classical $\chi^2$ grid search and then with machine learning techniques.


<hr>


## Student Name and ID:



## Date:

<hr>

In [5]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# %matplotlib inline

In [6]:
font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
#matplotlib.rcParams.update({'figure.autolayout': True})
#matplotlib.rcParams['figure.dpi'] = 300

### Time to Generate Our Data

First we generate linear data with some Gaussian noise: x is the time and y is the distance from the observer.

In [7]:
np.random.seed(16) #set seed for reproducibility purposes

sample_size = 10
x = np.arange(sample_size) 
y = 2.*x + 5. + np.random.randn(sample_size) #generate some data with random gaussian scatter

## Minimize the $\chi^2$ fit

First we plot the data and generate a series of models to find a minimum for the model-data
separation. This is the non-ML, classical approach.
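For reference, the quantity being minimized can be written out explicitly. The sketch below assumes every data point carries the same Gaussian uncertainty $\sigma_i = 1$, which matches the unit-variance noise from np.random.randn used to generate the data. The symbols $m$ (slope), $b$ (intercept), and $N$ (number of points) are notation introduced here for illustration.

$$
\chi^2(m, b) = \sum_{i=1}^{N} \frac{\left[\,y_i - (m\,x_i + b)\,\right]^2}{\sigma_i^2}
$$

With $\sigma_i = 1$ this reduces to the plain sum of squared residuals, $\sum_i \left[\,y_i - (m\,x_i + b)\,\right]^2$, so minimizing $\chi^2$ and minimizing the squared error pick out the same slope and intercept.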

### Exercise 1

Plot y against x. Label x as the time in seconds and y as the distance traveled in meters.

In [8]:
# student work


In [9]:
print(x,np.round(y,1))

[0 1 2 3 4 5 6 7 8 9] [ 5.1  5.5  8.4 11.1 11.8 14.4 16.1 19.5 20.2 23.1]


In [10]:
y = np.round(y,1)

In [11]:
slopes = np.linspace(1,3,101) 
intercepts = np.linspace(4,6,101)

## Note: these two grids already give 101 × 101 = 10,201 models (curse of dimensionality!)
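To make the scaling behind this note concrete, the following sketch counts how many models a grid needs as parameters are added. It assumes the same 101 steps per parameter as above; `steps_per_parameter`, `grid_sizes`, and the loop over `n_params` are illustrative names, not part of the original notebook.

```python
# Illustrative only: number of grid models for 101 steps per parameter.
steps_per_parameter = 101

grid_sizes = {}
for n_params in range(1, 5):
    # Every extra parameter multiplies the number of models by the step count.
    grid_sizes[n_params] = steps_per_parameter ** n_params
    print(f"{n_params} parameter(s): {grid_sizes[n_params]:,} models")
```

A straight line has two parameters, so the grid here is still cheap to evaluate; a model with four parameters at the same resolution would already need over a hundred million evaluations.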

For convenience, we can define two functions that describe our model (a straight line) and the squared error function:

In [13]:
def model(x,m,b):
    return m*x+b #straight line

def se(m,b,x,y):
    return np.sum(((model(x,m,b) - y)**2))

In [14]:
square_errs = np.array([[se(m,b,x,y) for b in intercepts] for m in slopes]) 
#This generates an array where first index refers to slope and second index refers to intercept
square_errs.shape #check that the array has been built properly

(101, 101)
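The nested list comprehension above is easy to read but loops in Python. As an alternative sketch, the same landscape can be built with NumPy broadcasting. The data are copied from the printed (rounded) values earlier in the notebook, and `predictions` and `square_errs_vec` are names introduced here for illustration.

```python
import numpy as np

# Data as printed earlier in the notebook (y rounded to one decimal).
x = np.arange(10)
y = np.array([5.1, 5.5, 8.4, 11.1, 11.8, 14.4, 16.1, 19.5, 20.2, 23.1])

slopes = np.linspace(1, 3, 101)
intercepts = np.linspace(4, 6, 101)

# Shapes: slopes -> (101, 1, 1), intercepts -> (1, 101, 1), x -> (10,).
# Broadcasting gives one predicted curve per (slope, intercept) pair,
# i.e. an array of shape (101, 101, 10).
predictions = slopes[:, None, None] * x + intercepts[None, :, None]

# Sum the squared residuals over the data axis (the last one).
square_errs_vec = np.sum((predictions - y) ** 2, axis=-1)

print(square_errs_vec.shape)
```

As in the loop version, the first index refers to the slope and the second to the intercept.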

### Exercise 2

Plot the $\chi^2$ landscape using imshow() and colorbar().

What can you say about the model parameter space? 

In [16]:
# student work here


*student answer here*

### Finding our minimum

To find the minimum in the $\chi^2$ landscape above, we use min() and argmin(). We then unravel the flat index into a (row, column) pair to get the slope and intercept with the lowest $\chi^2$.

In [62]:
print(square_errs.min()) #min Squared Error value

print(square_errs.argmin()) #index of min; however this corresponds to flattened array

indices = np.unravel_index(square_errs.argmin(), square_errs.shape) #indices of minimum value as a (row, col) pair

print(indices)

3.6479999999999944
5269
(52, 17)
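The relation between the flat index and the (row, column) pair is easier to see on a small array. The sketch below uses a made-up 2 × 3 array named `toy` as a stand-in for the error landscape.

```python
import numpy as np

# A small 2-D array standing in for the error landscape (values are made up).
toy = np.array([[9.0, 7.0, 8.0],
                [6.0, 2.0, 5.0]])

flat_index = toy.argmin()  # position of the minimum in the flattened array
row, col = np.unravel_index(flat_index, toy.shape)

print(flat_index)  # 4, because flattening gives [9, 7, 8, 6, 2, 5]
print(row, col)    # 1 1

# For a row-major array: flat index = row * number_of_columns + column.
assert flat_index == row * toy.shape[1] + col
```

The same arithmetic applies to the output above: 52 × 101 + 17 = 5269.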


In [63]:
#Derive the slope and intercept for best model

bestm, bestb = slopes[indices[0]],intercepts[indices[1]]
bestm, bestb

(2.04, 4.34)
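A grid search can only return values that lie on the grid. As a cross-check, and not as part of the original notebook, the sketch below solves the same least-squares problem in closed form with np.polyfit, using the rounded data printed earlier. The names `m_ls`, `b_ls`, and `sse_ls` are introduced here for illustration.

```python
import numpy as np

# Data as printed earlier in the notebook (y rounded to one decimal).
x = np.arange(10)
y = np.array([5.1, 5.5, 8.4, 11.1, 11.8, 14.4, 16.1, 19.5, 20.2, 23.1])

# With deg=1, np.polyfit returns [slope, intercept] of the least-squares line.
m_ls, b_ls = np.polyfit(x, y, deg=1)

# Sum of squared residuals for the closed-form solution.
sse_ls = np.sum((m_ls * x + b_ls - y) ** 2)

print(f"slope = {m_ls:.4f}, intercept = {b_ls:.4f}, squared error = {sse_ls:.4f}")
```

The closed-form slope and intercept come out at roughly 2.05 and 4.30, close to the grid values, and the squared error is slightly lower than the grid minimum because the true optimum falls between grid points.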

### Exercise 3

Plot the data with the best-fit model (lowest $\chi^2$) as a line 

In [17]:
#student work here

### Machine Learning Algorithms

Now we will fit a model to the same data using two machine learning algorithms from sklearn.
Let's import two simple models: a decision tree regressor and a linear regression.

In [18]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

### Split the data into training and testing samples

A keystone principle of most machine learning applications is to split the sample into a training set, which the algorithm learns from, and a test set, which is held back to check the resulting solution.
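Conceptually, a random split shuffles the sample indices and cuts them into two disjoint groups. The sketch below does this by hand with NumPy on made-up data. It is an illustration of the idea, not scikit-learn's implementation, and the seed, the arrays `features` and `targets`, and the 25% test fraction are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # hypothetical seed, for illustration only

features = np.arange(20)        # made-up inputs
targets = 3.0 * features + 1.0  # made-up outputs

test_fraction = 0.25
n_test = int(np.ceil(test_fraction * len(features)))

# Shuffle the indices once, then slice them into disjoint test/train groups.
shuffled = rng.permutation(len(features))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

features_train, targets_train = features[train_idx], targets[train_idx]
features_test, targets_test = features[test_idx], targets[test_idx]

print(len(features_train), len(features_test))
```

The key property is that no point appears in both groups, so the test set measures how well the model handles data it has not seen.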

In [19]:
from sklearn.model_selection import train_test_split 

In [20]:
np.random.seed(10) #fix for reproducibility

X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.3) #create train/test split

In [21]:
X_train, y_train

(array([6, 3, 1, 0, 7, 4, 9]),
 array([16.1, 11.1,  5.5,  5.1, 19.5, 11.8, 23.1]))

In [22]:
X_test

array([8, 2, 5])

### Exercise 4

How big is the training set, and how big is the test set?

*student work here*

### Two models

We instantiate two models, a decision tree and a linear regression.

In [23]:
treemodel = DecisionTreeRegressor() # default params

In [24]:
regmodel = LinearRegression() # default params

Build the model on the training set and use it to predict the output for the test set!
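The calls to reshape(-1, 1) in the cells below are needed because scikit-learn estimators expect the inputs as a two-dimensional array of shape (n_samples, n_features). The sketch below only inspects shapes with NumPy; `X` and `single` are names introduced here for illustration.

```python
import numpy as np

x = np.arange(10)     # 1-D array, shape (10,)
X = x.reshape(-1, 1)  # 2-D column, shape (10, 1): 10 samples, 1 feature

# A single new input must also be 2-D: one sample with one feature.
single = np.array(12).reshape(-1, 1)

print(x.shape, X.shape, single.shape)
```

The -1 tells NumPy to infer that dimension from the number of elements, so the same call works for any sample size.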

In [25]:
y_pred_tree = treemodel.fit(X_train.reshape(-1, 1), y_train).predict(X_test.reshape(-1, 1))

In [26]:
y_pred_reg = regmodel.fit(X_train.reshape(-1, 1), y_train).predict(X_test.reshape(-1, 1))

In [27]:
y_test, y_pred_reg, y_pred_tree #True/predicted by LR and DT respectively

(array([20.2,  8.4, 14.4]),
 array([20.89279279,  8.41981982, 14.65630631]),
 array([19.5,  5.5, 11.8]))
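To see where these numbers come from, both sets of predictions can be reproduced with plain NumPy. This is a hand-written emulation under stated assumptions, not scikit-learn's code: the linear regression is treated as ordinary least squares, and the default decision tree is assumed to grow until every training point sits in its own leaf, with split thresholds halfway between neighbouring training inputs. The training and test values are copied from the printed cells above.

```python
import numpy as np

# Training and test inputs exactly as printed in the cells above.
X_train = np.array([6, 3, 1, 0, 7, 4, 9])
y_train = np.array([16.1, 11.1, 5.5, 5.1, 19.5, 11.8, 23.1])
X_test = np.array([8, 2, 5])

# --- Linear regression as ordinary least squares ---
# Design matrix: one column for the slope, one column of ones for the intercept.
A = np.column_stack([X_train, np.ones_like(X_train, dtype=float)])
(slope, intercept), *_ = np.linalg.lstsq(A, y_train, rcond=None)
pred_line = slope * X_test + intercept

# --- Fully grown 1-D regression tree, emulated by hand ---
# Assumption: one leaf per training point, thresholds at the midpoints between
# neighbouring training inputs, and inputs <= a threshold go to the left leaf.
order = np.argsort(X_train)
xs, ys = X_train[order], y_train[order]
thresholds = (xs[:-1] + xs[1:]) / 2.0
leaf = np.searchsorted(thresholds, X_test, side="left")
pred_tree = ys[leaf]

print(np.round(pred_line, 4), pred_tree)
```

Under these assumptions the tree can only return values of y that it saw during training, whereas the line interpolates between them.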

### Exercise 5

Based on the test results, which of the two models does a better job of approximating the data?

*student work here*

### Exercise 6 - Mean $\chi^2$

Calculate the mean squared error (MSE), i.e. the $\chi^2$ per data point for unit uncertainties, for the two models:

In [28]:
np.mean((y_test-y_pred_reg)**2)

0.18201586721857252

In [29]:
np.mean((y_test-y_pred_tree)**2)

5.22
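The mean and the sum of the squared residuals differ only by the number of points. The sketch below wraps the calculation in a small helper, `mse`, which is a hypothetical name introduced here and not part of the notebook, and checks that relation on the decision-tree predictions printed earlier.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average of the squared residuals (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Test-set values as printed earlier in the notebook.
y_test = np.array([20.2, 8.4, 14.4])
y_pred_tree = np.array([19.5, 5.5, 11.8])

mse_tree = mse(y_test, y_pred_tree)
sse_tree = np.sum((y_test - y_pred_tree) ** 2)

# The summed squared error equals the mean times the number of points.
print(mse_tree, sse_tree, np.isclose(sse_tree, mse_tree * len(y_test)))
```

This matters when comparing against the grid search: the value found there was a sum over all ten data points, while the values here are means over three test points, so they should be put on the same footing before being compared.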

### Exercise 7

Does the $\chi^2$ agree with your assessment in Exercise 5?

*student work here*

### Exercise 8

How does the $\chi^2$ compare with the initial minimum value we found with the grid of models? Which model is best?

*student work here*

### Exercise 9

Models can also be used to *predict* a future value. Predict the value at t = 12 s with the two models.
Use model.predict(np.array(12).reshape(-1, 1)).

In [31]:
# student work


### Exercise 10

How do the two predictions compare? 

*student work here*

### Exercise 11

Can you make a prediction using our original best model's slope and intercept? How does it compare?

In [32]:
# student work


### Exercise 12

Plot the original best $\chi^2$ model, the linear regression, and the Decision-Tree solutions. Include a legend for the three models. HINT: use xx = np.arange(0,12) as input.

In [33]:
# student work here
