# Machine Learning - A First Example

After we've familiarized ourselves with the theory of machine learning, let's go into a practical example.

The data available for download [here][data] contains measurements of the electrical energy output (PE) of a [combined cycle power plant][ccpp], together with four input variables:

- Ambient pressure (AP)
- Relative humidity (RH)
- Exhaust vacuum (V)
- Temperature (T)

We are interested in predicting the power output (PE).

[data]: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
[ccpp]: https://en.wikipedia.org/wiki/Combined_cycle

In [None]:
print(open('data/CCPP/Readme.txt').read())

# Pandas

The [pandas][pd] package is a powerful tool for loading small-ish datasets into Python, applying transformations, calculating aggregates, and making basic visualizations. We'll learn more about `pandas` later in the course.

[pd]: http://pandas.pydata.org/

In [None]:
import pandas as pd

In [None]:
# make sure to check the other pandas.read_XXX functions
data = pd.read_excel('data/CCPP/Folds5x2_pp.xlsx')

In [None]:
# the basic pandas object is the DataFrame
type(data)

In [None]:
# get some basic information about the data columns,
# very similar to R's summary function.
data.describe()

## Covariance

As we've learned in the theoretical part of the session, your target variable $Y$ *must* depend in some way on some of the inputs $X_i$ if we are to have any hope of making a sensible prediction. One measure of this dependence is the *covariance* of the two variables, defined as

$$\operatorname{Cov}(X_i, Y) =  \operatorname{E}\left[\left(X_i - \operatorname{E}[X_i]\right)\left(Y - \operatorname{E}[Y]\right)\right]$$

In [None]:
# calculates the covariance between all pairs of variables
# we want to _predict_ the power output PE, which means we
# are interested in variables highly correlated or anticorrelated
# with PE
data.cov()

We see a strong covariance between `PE` and `V`, so let's choose `V` for our $X$.
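The covariance depends on the units of both variables, so it is often read together with the scale-free Pearson correlation. The following is a minimal sketch on synthetic stand-in data (the names `x`, `y` and `df` are made up for illustration and are not the power plant measurements). It writes out the sample version of the formula above and compares it with what `pandas` computes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, NOT the power plant measurements.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = -2.0 * x + rng.normal(scale=0.5, size=500)
df = pd.DataFrame({'x': x, 'y': y})

# Sample covariance, written out as in the formula above.
# pandas normalizes by n - 1 rather than n.
n = len(df)
cov_manual = ((df.x - df.x.mean()) * (df.y - df.y.mean())).sum() / (n - 1)
cov_pandas = df.cov().loc['x', 'y']

# Dividing by both standard deviations gives the Pearson correlation,
# which always lies between -1 and 1.
corr_manual = cov_manual / (df.x.std() * df.y.std())
corr_pandas = df.corr().loc['x', 'y']
```

A negative value means the two variables tend to move in opposite directions. For prediction purposes, the magnitude is what matters.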

In [None]:
%matplotlib inline
# enable inline plotting

## Plotting

`pandas` supports a number of ways to make plots of your data. We'll discuss them in detail in a later session.

In [None]:
data.plot.scatter(x = 'V', y = 'PE')

## Accessing data

Pandas supports two ways to access a given column `C` of a data frame `data`. The two syntaxes `data.C` and `data['C']` are mostly equivalent.
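To illustrate the "mostly", here is a small sketch on a toy frame. The frame `df` and its column name `mean` are hypothetical and chosen on purpose: when a column name collides with an existing `DataFrame` attribute, only the bracket syntax reaches the column.

```python
import pandas as pd

# Toy frame; the column name 'mean' deliberately collides with DataFrame.mean.
df = pd.DataFrame({'V': [50.0, 60.0], 'mean': [1.0, 2.0]})

# For an ordinary column name, both syntaxes return the same data.
same = (df.V == df['V']).all()

# Attribute access finds the DataFrame.mean method, not the column ...
is_method = callable(df.mean)

# ... while brackets always return the column.
is_column = isinstance(df['mean'], pd.Series)
```

The bracket syntax is therefore the safer choice whenever column names are not under your control.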

In [None]:
# We've seen in the plot that we have multiple values for PE 
# for a given V. To make our lives easier, we take the mean
# values for all columns for any given V. We will discuss 
# details of grouping and summarizing in a later session.
import numpy as np
# data['V'] = np.round(data['V'], 1) would work just as well
data.V = np.round(data['V'], 1)
data = data.groupby('V', as_index=False)\
    .mean()

In [None]:
data.plot.scatter(x = 'V', y = 'PE')
# looking much better already
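What the rounding and grouping step does is easiest to see on a tiny example. The following sketch uses a made-up three-row frame `toy` (the values are invented for illustration): two rows fall into the same rounded `V` bin and are replaced by their mean.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only.
toy = pd.DataFrame({'V': [50.01, 50.04, 60.02],
                    'PE': [450.0, 452.0, 440.0]})

# Round V to one decimal so that nearby measurements share a key ...
toy['V'] = np.round(toy['V'], 1)

# ... then average all columns within each V group.
grouped = toy.groupby('V', as_index=False).mean()
```

After this step every value of `V` appears exactly once, which is what thins out the scatter plot.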

# Machine Learning

Let's get to work and take a closer look at KNN.

## Splitting into training and testing data

We want to fit a K-Nearest Neighbor model to our data, as discussed in the theory part. The objective is to do as well as possible at predicting `PE` for unseen values of `V`. So how can we test our model? We choose to train it only on, say, 70% of the data available and use the 30% we left out as a stand-in for unseen data to test model performance. The function `train_test_split` in `sklearn.model_selection` does the splitting for us.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
?train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data[['V']],
                                                    data.PE,
                                                    test_size=0.3)
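The split is random, so rerunning the cell gives different training and testing sets (and slightly different numbers further down). A minimal sketch on a made-up ten-row frame `toy` shows the resulting sizes and how the optional `random_state` argument makes the split reproducible. The variable names here are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with ten rows, for illustration only.
toy = pd.DataFrame({'V': np.arange(10.0), 'PE': 2 * np.arange(10.0)})

# Fixing random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(toy[['V']], toy.PE,
                                          test_size=0.3, random_state=0)

# 70% of the rows go to training, 30% to testing, with no overlap,
# and features and targets stay aligned row by row.
sizes = (len(X_tr), len(X_te))
no_overlap = set(X_tr.index).isdisjoint(X_te.index)
aligned = (X_tr.index == y_tr.index).all()
```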

## Training our model

It's now time to train our K-Nearest Neighbor model.

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
# choose k=5 for starters
five_nearest = KNeighborsRegressor(n_neighbors = 5).fit(X_train, y_train)

In [None]:
# we can now use the model to predict PE for a few values of V;
# predict expects a 2-D input with one row per sample
five_nearest.predict(pd.DataFrame({'V': [50]}))

In [None]:
five_nearest.predict(pd.DataFrame({'V': [70]}))
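To make explicit what such a prediction does, here is a sketch of k-Nearest Neighbor regression for a single input variable, written by hand and compared with `KNeighborsRegressor` on synthetic data. The helper `knn_predict` and the arrays `x_toy`, `y_toy` are hypothetical names introduced for this illustration. The sketch assumes the default settings of `KNeighborsRegressor` (uniform weights, Euclidean distance).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def knn_predict(x_train, y_train, x0, k):
    """Average the targets of the k training points closest to x0 (1-D inputs)."""
    distances = np.abs(x_train - x0)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
x_toy = rng.uniform(0, 10, size=50)
y_toy = np.sin(x_toy) + rng.normal(scale=0.1, size=50)

by_hand = knn_predict(x_toy, y_toy, 5.0, k=5)

# scikit-learn wants a 2-D feature array, hence the reshape and [[5.0]].
model = KNeighborsRegressor(n_neighbors=5).fit(x_toy.reshape(-1, 1), y_toy)
by_sklearn = model.predict([[5.0]])[0]
```

Both numbers should agree: the prediction is simply the mean of the five closest training targets.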

### How well did we do?

Let's consider the residual sum of squares,

$$\operatorname{RSS}(k) = \sum_i \left(\hat y_i^{(k)} - y_i\right)^2$$

where the sum runs over the indices of the testing data, $\hat y_i^{(k)}$ is the k-Nearest Neighbor prediction for our $i$-th testing data point, and $y_i$ is the known value.
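As a quick worked example with made-up numbers (three known values and three predictions, not taken from the data set), the sum can be evaluated by hand:

```python
import numpy as np

# Made-up known values and predictions, for illustration only.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

residuals = y_pred - y_true      # [0.5, 0.0, -1.0]
rss = np.sum(residuals ** 2)     # 0.25 + 0.0 + 1.0 = 1.25

# Dividing by the number of points gives the mean squared error.
mse = rss / len(y_true)
```

Note that the RSS grows with the number of test points, so it is only comparable between models evaluated on the same test set, which is how we use it here.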

In [None]:
sum((five_nearest.predict(X_test) - y_test)**2)

In [None]:
# let's repeat for k = 10
ten_nearest = KNeighborsRegressor(n_neighbors = 10).fit(X_train, y_train)

In [None]:
sum((ten_nearest.predict(X_test) - y_test)**2)
# we did a little better ...

In [None]:
# ... but let's be systematic here. We start by
# defining an RSS function.
def RSS(f, X, y):
    return sum((f.predict(X) - y)**2)

In [None]:
RSS(ten_nearest, X_test, y_test)

Let's do a scan over values for $k$ to better map out the dependence of RSS on $k$. What are sensible values? We start with 1 and stop when we have reached a $k$ of about half the size of the training sample.

In [None]:
len(y_train)

In [None]:
ks = np.arange(1, 100)

In [None]:
models = [KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
          for k in ks]

In [None]:
RSS_test = [RSS(f, X_test, y_test) for f in models]

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot(ks, RSS_test)
plt.xlabel('k')
plt.ylabel("RSS")

## Bias-Variance tradeoff

For very small $k$, the model has high *variance*, i.e. the error is dominated by the noise in the data. If we choose $k$ too big, the model becomes *biased*, meaning that we don't reproduce the function's shape (i.e. that of `PE` as a function of `V`) very well. Somewhere in between lies the "sweet spot". A lot of work in machine learning projects goes into identifying that sweet spot.
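The same behavior can be reproduced in a controlled setting where the true function is known. The sketch below is an illustration on synthetic data, not a result for the power plant data. It assumes a noisy sine curve as the underlying dependence and scans $k$ from 1 up to half the training sample size. All names (`x`, `y`, `ks_toy`, `rss_toy`, `best_k`) are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data: a sine curve plus Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=400)
y = np.sin(x) + rng.normal(scale=0.5, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(x.reshape(-1, 1), y,
                                          test_size=0.3, random_state=0)

# Scan k from 1 to half the training sample size.
ks_toy = np.arange(1, len(y_tr) // 2 + 1)
rss_toy = np.array([
    np.sum((KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
            .predict(X_te) - y_te) ** 2)
    for k in ks_toy
])

# The k with the smallest test RSS is the "sweet spot" for this data.
best_k = ks_toy[np.argmin(rss_toy)]
print(best_k, rss_toy[0], rss_toy.min(), rss_toy[-1])
```

In this setup the smallest test RSS is found at an intermediate $k$: at $k = 1$ the prediction inherits the full noise of a single neighbor, while at the largest $k$ the average runs over such a wide window that the sine shape is washed out.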

## Training data

Let's now have a look at the RSS on our **training** data. What do we expect? We won't be able to do much on the high-$k$ side since our function just can't fit the underlying distribution well. But on the low-$k$ side it should have a decisive advantage: It *knows* the noise and should produce for $k$=1 a perfect fit. 
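The perfect fit at $k = 1$ can be checked on a toy example. It relies on every input value appearing only once in the training data, which holds here because we averaged over all rows sharing the same `V`. The arrays `X_toy` and `y_toy` below are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data with distinct input values.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([10.0, 12.0, 9.0, 11.0])

# k = 1: each training point is its own nearest neighbor.
one_nearest = KNeighborsRegressor(n_neighbors=1).fit(X_toy, y_toy)
rss_train_k1 = np.sum((one_nearest.predict(X_toy) - y_toy) ** 2)

# k = 2: each prediction also averages in a second, different point.
two_nearest = KNeighborsRegressor(n_neighbors=2).fit(X_toy, y_toy)
rss_train_k2 = np.sum((two_nearest.predict(X_toy) - y_toy) ** 2)
```

The training RSS vanishes for $k = 1$ and becomes positive as soon as a second neighbor enters the average.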

In [None]:
RSS_train = [RSS(f, X_train, y_train) for f in models]

In [None]:
plt.plot(ks, RSS_train)
plt.xlabel('k')
plt.ylabel("RSS_train")

In [None]:
# compare the two ...
plt.plot(ks, np.array(RSS_test) / RSS_test[-1], label = 'test')
plt.plot(ks, np.array(RSS_train) / RSS_train[-1], label = 'train')
plt.xlabel('k')
plt.ylabel("RSS, normalized")
plt.legend()

## Degrees of freedom

Another way of looking at this is analyzing the dependence of RSS on the degrees of freedom. Too few degrees of freedom don't allow us to reproduce the dependence of `PE` on `V` accurately. Too many degrees of freedom and we're "chasing noise".
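For k-Nearest Neighbors, the effective number of degrees of freedom is commonly taken to be $N/k$, with $N$ the size of the training sample. One way to motivate this: the fitted values on the training set are a linear function of the training targets, $\hat y = S y$, and the trace of the matrix $S$ counts the effective parameters. The sketch below builds $S$ for synthetic data and checks that its trace equals $N/k$. The names `X_toy`, `y_toy` and `S` are assumptions made for this illustration, and the argument assumes distinct input values.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(60, 1))
y_toy = np.sin(X_toy[:, 0])
n, k = len(y_toy), 5

model = KNeighborsRegressor(n_neighbors=k).fit(X_toy, y_toy)

# For every training point, the indices of its k nearest training points
# (the point itself is always among them, at distance zero).
_, neighbors = model.kneighbors(X_toy)

# Smoother matrix: row i holds weight 1/k for each neighbor of point i,
# so that the fitted values on the training set are S @ y_toy.
S = np.zeros((n, n))
S[np.arange(n)[:, None], neighbors] = 1.0 / k

effective_dof = np.trace(S)
```

Each diagonal entry of `S` is $1/k$, so the trace is $N/k$: small $k$ means many effective parameters, large $k$ means few.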

In [None]:
plt.plot(float(len(y_train))/ ks, RSS_test)
plt.xscale('log')
plt.xlabel('Effective DOF')
plt.ylabel('RSS')

## Functional form

Finally, let's look at the functional form of our k-Nearest Neighbor models. For low $k$, we expect noisy, erratic behavior, while for high $k$, we expect an increasingly smooth function.

In [None]:
plot_ks = (1, 5, 30)
plot_models = [models[list(ks).index(k)] for k in plot_ks]

In [None]:
for k, f in zip(plot_ks, plot_models):
    xs = [[i] for i in np.arange(65, 70, 0.1)]
    plt.plot(xs, f.predict(xs), label="k = {}".format(k))
plt.plot(X_train, y_train, 'o', label='train')
plt.plot(X_test, y_test, 's', label='test')
plt.xlim(65, 70)
plt.ylim(430, 455)
plt.legend()