In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Supervised Learning Part 1 -- Regression Analysis

In regression we are trying to predict a continuous output variable -- in contrast to the nominal variables we will be predicting in the classification examples later. 

Let's start with a simple toy example with one feature dimension (explanatory variable) and one target variable. We will create a dataset out of a sine curve with some noise:

In [None]:
x = np.linspace(-3, 3, 100)
print(x)

In [None]:
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))

In [None]:
plt.plot(x, y, 'o');

## Linear Regression

The first model that we will introduce is the so-called simple linear regression. Here, we want to fit a line to the data.

A linear model is one of the simplest models: it predicts the data as lying on a straight line. One way to find such a line is `LinearRegression` (also known as [*Ordinary Least Squares (OLS)*](https://en.wikipedia.org/wiki/Ordinary_least_squares) regression).

The scikit-learn API requires us to provide the target variable (`y`) as a 1-dimensional array and the samples (`X`) in the form of a 2-dimensional array, even if the samples consist of only 1 feature. Thus, let us convert the 1-dimensional `x` NumPy array into an `X` array with 2 axes:

In [None]:
print('Before: ', x.shape)
X = x[:, np.newaxis]
print('After: ', X.shape)
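
The same conversion can be written in more than one way. The following is a small sketch, not part of the original notebook, comparing `np.newaxis` indexing with `reshape(-1, 1)`; the variable names `X_newaxis`, `X_reshape`, and `X_none` are chosen here for illustration only.

```python
import numpy as np

x = np.linspace(-3, 3, 100)

# Three equivalent ways to turn a 1-D array into a single-column 2-D array
X_newaxis = x[:, np.newaxis]
X_reshape = x.reshape(-1, 1)  # -1 lets NumPy infer the number of rows
X_none = x[:, None]           # np.newaxis is an alias for None

print(X_newaxis.shape, X_reshape.shape, X_none.shape)
print(np.array_equal(X_newaxis, X_reshape))
```

All three produce an array with one row per sample and one column per feature, which is the layout scikit-learn expects.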

Regression is a supervised task, and since we are interested in its performance on unseen data, we split our data into two parts:

1. a training set that the learning algorithm uses to fit the model
2. a test set to evaluate the generalization performance of the model

The ``train_test_split`` function from the ``model_selection`` module does that for us -- we will use it to split a dataset into 75% training data and 25% test data.

<img width="50%" src='https://github.com/fordanic/cmiv-ai-course/blob/master/notebooks/figures/train_test_split_matrix.png?raw=1'/>


We start by splitting our dataset into a training (75%) and a test set (25%):

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
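
To confirm that the split behaves as described, one can inspect the shapes of the resulting arrays. This is a minimal sketch that recreates the toy data from above so that it runs on its own; the names `X_train2` and `X_test2` are introduced here only to illustrate that a fixed `random_state` gives a reproducible split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Recreate the toy data so that this cell is self-contained
x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))
X = x[:, np.newaxis]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# With 100 samples, 75 go to the training set and 25 to the test set
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

# Splitting again with the same random_state gives the same partition
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(np.array_equal(X_test, X_test2))
```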

## The scikit-learn estimator API and Linear Regression

<img width="50%" src='https://github.com/fordanic/cmiv-ai-course/blob/master/notebooks/figures/supervised_workflow.png?raw=1'/>

Every algorithm is exposed in scikit-learn via an ''Estimator'' object. (All models in scikit-learn have a very consistent interface). For instance, we first import the linear regression class.

In [None]:
from sklearn.linear_model import LinearRegression

Next, we use the learning algorithm implemented in `LinearRegression` to **fit a regression model to the training data**:

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

(Some estimator methods such as `fit` return `self` by default. Thus, after executing the code snippet above, you will see a representation of this particular instance of `LinearRegression`; depending on the scikit-learn version, it may or may not list the default parameters. Another way of retrieving the estimator's initialization parameters is to execute `regressor.get_params()`, which returns a parameter dictionary.)
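
A short sketch of both points follows. The tiny arrays `X_demo` and `y_demo` are made up for this illustration and are not part of the notebook's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

# get_params() returns the initialization parameters as a dictionary
params = regressor.get_params()
print(params)

# Hypothetical three-point dataset lying exactly on y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0]])
y_demo = np.array([1.0, 3.0, 5.0])

# fit returns the estimator itself, which is why the notebook echoes it
returned = regressor.fit(X_demo, y_demo)
print(returned is regressor)
```

Because `fit` returns the estimator, construction and fitting can also be chained, as in `LinearRegression().fit(X_demo, y_demo)`.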

After fitting to the training data, we have parameterized a linear regression model with the following values:

In [None]:
print('Weight coefficients: ', regressor.coef_)
print('y-axis intercept: ', regressor.intercept_)

Since our regression model is a linear one, the relationship between the target variable (y) and the feature variable (x) is defined as:

$$y = \text{weight} \times x + \text{intercept}$$

Plugging the min and max values of x into this equation, we can plot the regression fit to our training data:

In [None]:
min_pt = X.min() * regressor.coef_[0] + regressor.intercept_
max_pt = X.max() * regressor.coef_[0] + regressor.intercept_

plt.plot([X.min(), X.max()], [min_pt, max_pt])
plt.plot(X_train, y_train, 'o');

Similar to the estimators for classification in the previous notebook, we use the `predict` method to predict the target variable. And we expect these predicted values to fall onto the line that we plotted previously:

In [None]:
y_pred_train = regressor.predict(X_train)
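
That the predictions fall on the fitted line can also be checked numerically rather than visually. The sketch below recreates the data and the model so that it runs on its own, then compares `predict` with the linear equation evaluated by hand; `y_manual` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the toy data and the split used above
x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))
X = x[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

regressor = LinearRegression().fit(X_train, y_train)

# Evaluate weight * x + intercept by hand for every training sample
y_manual = X_train[:, 0] * regressor.coef_[0] + regressor.intercept_
y_pred_train = regressor.predict(X_train)

print(np.allclose(y_manual, y_pred_train))
```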

In [None]:
plt.plot(X_train, y_train, 'o', label="data")
plt.plot(X_train, y_pred_train, 'o', label="prediction")
plt.plot([X.min(), X.max()], [min_pt, max_pt], label='fit')
plt.legend(loc='best');

As we can see in the plot above, the line is able to capture the general slope of the data, but not many details.

Next, let's try the test set:

In [None]:
y_pred_test = regressor.predict(X_test)

In [None]:
plt.plot(X_test, y_test, 'o', label="data")
plt.plot(X_test, y_pred_test, 'o', label="prediction")
plt.plot([X.min(), X.max()], [min_pt, max_pt], label='fit')
plt.legend(loc='best');

There is also a convenience method, ``score``, that all scikit-learn estimators provide to compute how good the model is.
For regression tasks, this is the **R<sup>2</sup>** score:

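$$R^2 = 1 - \frac{\sum_{i}(f_i-y_i)^2}{\sum_{i}(y_i-\mu)^2}$$

Here $f_i$ is the predicted value for sample $i$, $y_i$ is its true target value, and $\mu$ is the mean of the true target values.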

In [None]:
regressor.score(X_test, y_test)
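
To connect the formula with the `score` method, the following sketch computes R<sup>2</sup> by hand on the test set and compares it with `regressor.score` and with `sklearn.metrics.r2_score`. It recreates the data and the model so that it is self-contained; the names `ss_res`, `ss_tot`, and `r2_manual` are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Recreate the toy data, the split, and the fitted model
x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))
X = x[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
regressor = LinearRegression().fit(X_train, y_train)

y_pred_test = regressor.predict(X_test)

# Numerator: squared differences between predictions and true values
ss_res = np.sum((y_pred_test - y_test) ** 2)
# Denominator: squared differences between true values and their mean
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(regressor.score(X_test, y_test))
print(r2_score(y_test, y_pred_test))
```

A score of 1 means the predictions match the targets exactly, while a score of 0 corresponds to always predicting the mean of the targets.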

___
## Exercise

We will now look at a dataset with more than one variable. The ideas from the simple example above can be reused (we do not have to add an additional axis to the data).

Create a linear regressor that fits the ``diabetes`` data.

The methods that plot the data above cannot be used directly; can you adapt them in some way?

In [None]:
from sklearn.datasets import load_diabetes

diabetes_data = load_diabetes()
X, y = diabetes_data.data, diabetes_data.target
print("The data is described at https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html. The short description is:")
print("Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.")

# We need to:
# Split the data into training and test sets
# Create a LinearRegression object
# Fit the regressor object to the training data
# Evaluate the model on the test data

# ...

___

## KNeighborsRegression

This is a simple regression method: given a new, unknown observation, look up the points in the reference database (the training set) whose features are closest to it, and predict either the output of the single nearest point or the average output of several nearest points. This method is less popular for regression than for classification, but it is still a good baseline.
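
The lookup-and-average idea can be sketched in a few lines of NumPy. This is an illustrative brute-force version, not how scikit-learn implements it internally; the function `knn_predict` and the small arrays `X_ref`, `y_ref`, and `X_query` are hypothetical and exist only for this example. The result is compared with `KNeighborsRegressor` using its default uniform weighting.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor


def knn_predict(X_ref, y_ref, X_query, k=1):
    """Brute-force k-nearest-neighbors regression (illustrative sketch)."""
    # Pairwise Euclidean distances, shape (n_query, n_ref)
    diffs = X_query[:, np.newaxis, :] - X_ref[np.newaxis, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Indices of the k closest reference points for every query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Average the targets of those neighbors
    return y_ref[nearest].mean(axis=1)


# Hypothetical reference data and query points (no distance ties)
X_ref = np.array([[0.0], [1.0], [2.0], [4.0]])
y_ref = np.array([0.0, 1.0, 4.0, 16.0])
X_query = np.array([[1.2], [3.5]])

pred_k1 = knn_predict(X_ref, y_ref, X_query, k=1)
pred_k2 = knn_predict(X_ref, y_ref, X_query, k=2)
print(pred_k1)  # nearest points are x=1 and x=4
print(pred_k2)  # averages of the two nearest targets

sk_k2 = KNeighborsRegressor(n_neighbors=2).fit(X_ref, y_ref).predict(X_query)
print(np.allclose(pred_k2, sk_k2))
```

When two reference points are exactly equidistant from a query point, the outcome depends on how ties are broken, so the sketch and scikit-learn may disagree in that case.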

In [None]:
# First recreate our synthetic data
x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))
X = x[:, np.newaxis]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [None]:
from sklearn.neighbors import KNeighborsRegressor
kneighbor_regression = KNeighborsRegressor(n_neighbors=1)
kneighbor_regression.fit(X_train, y_train)

Again, let us look at the behavior on training and test set:

In [None]:
y_pred_train = kneighbor_regression.predict(X_train)

plt.plot(X_train, y_train, 'o', label="data", markersize=10)
plt.plot(X_train, y_pred_train, 's', label="prediction", markersize=4)
plt.legend(loc='best');

On the training set, we do a perfect job: each point is its own nearest neighbor!

In [None]:
y_pred_test = kneighbor_regression.predict(X_test)

plt.plot(X_test, y_test, 'o', label="data", markersize=8)
plt.plot(X_test, y_pred_test, 's', label="prediction", markersize=4)
plt.legend(loc='best');

On the test set, we also do a better job of capturing the variation, but our estimates look much messier than before.
Let us look at the R<sup>2</sup> score:

In [None]:
kneighbor_regression.score(X_test, y_test)

Much better than before! Here, the linear model was not a good fit for our problem; it was lacking in complexity and thus under-fit our data.
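
How many neighbors to average over is a choice that affects this trade-off. The sketch below, which is an addition and not part of the original notebook, fits `KNeighborsRegressor` for a few arbitrarily chosen values of `n_neighbors` and prints the training and test R<sup>2</sup> scores; the dictionary `scores` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Recreate the toy data and the split used above
x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))
X = x[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

scores = {}
for k in [1, 3, 5, 10, 20]:  # arbitrary values chosen for illustration
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(k, scores[k])
```

With a single neighbor the training score is perfect by construction, so only the test score says anything about generalization.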

As with linear regression, use the diabetes dataset and create a KNN-regressor.

In [None]:
from sklearn.datasets import load_diabetes

diabetes_data = load_diabetes()
X, y = diabetes_data.data, diabetes_data.target
print("The data is described at https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html. The short description is:")
print('"Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline."')

# We need to:
# Split the data into training and test sets
# Create a KNeighborsRegressor object (how many neighbours do you want to use?)
# Fit the regressor object to the training data
# Evaluate the model on the test data

# ...

___
## Exercise

Create a KNN-regressor for the diabetes data.
___

___
## Exercise
Compare the KNeighborsRegressor and LinearRegression on the Boston housing dataset. You can load the dataset using ``sklearn.datasets.load_boston`` and learn about it by reading the ``DESCR`` attribute. Note that ``load_boston`` was removed in scikit-learn 1.2, so this exercise requires an older version of the library.

In [None]:
# %load solutions/knn_vs_linreg.py

On Google Colab, visit [knn_vs_linreg.py](https://github.com/fordanic/cmiv-ai-course/blob/master/notebooks/solutions/knn_vs_linreg.py), copy the content of the solution manually, and paste it into the cell above.

___