# Machine Learning - A First Example

Now that we are familiar with the theory of machine learning, let's work through a practical example.

The data available for download [here][data] contains measurements of the electrical energy output (PE) of a [combined cycle power plant][ccpp], together with four explanatory variables:

- Ambient pressure (AP)
- Relative humidity (RH)
- Exhaust vacuum (V)
- Temperature (T)



[data]: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
[ccpp]: https://en.wikipedia.org/wiki/Combined_cycle

In [None]:
# print() with parentheses works in both Python 2 and Python 3
print(open('data/CCPP/Readme.txt').read())

In [None]:
import pandas as pd

In [None]:
data = pd.read_excel('data/CCPP/Folds5x2_pp.xlsx')

In [None]:
type(data)

In [None]:
data.describe()

In [None]:
data.cov()

In [None]:
%matplotlib inline

In [None]:
data.plot.scatter(x = 'V', y = 'PE')

In [None]:
import numpy as np
# data.V is shorthand for data['V'].
# Round V to one decimal so that nearby measurements fall into the same group.
data['V'] = np.round(data['V'], 1)

In [None]:
data = data.groupby('V', as_index=False)\
    .mean()

In [None]:
data.plot.scatter(x = 'V', y = 'PE')

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
?train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data[['V']],
                                                    data.PE,
                                                    test_size=0.3)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
five_nearest = KNeighborsRegressor(n_neighbors = 5).fit(X_train, y_train)

In [None]:
# predict expects a 2D array of shape (n_samples, n_features)
five_nearest.predict([[50]])

In [None]:
five_nearest.predict([[70]])
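
To make the prediction step less of a black box, here is a minimal sketch of what a k-nearest-neighbours regressor computes for one feature: the mean of the targets of the k closest training points (this matches the default uniform weighting of `KNeighborsRegressor`; tie-breaking and the neighbour search are simplified). The function `knn_predict` and the toy values `toy_V` and `toy_PE` are hypothetical and are not taken from the CCPP data.

```python
def knn_predict(x_train, y_train, x_query, k):
    """Average the targets of the k training points closest to x_query."""
    # Sort the training indices by absolute distance to the query point.
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x_query))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / float(k)


# Made-up toy values for illustration only.
toy_V = [40.0, 45.0, 50.0, 55.0, 60.0, 65.0]
toy_PE = [480.0, 470.0, 462.0, 455.0, 446.0, 440.0]

print(knn_predict(toy_V, toy_PE, 50.0, k=1))  # uses only the point at V = 50
print(knn_predict(toy_V, toy_PE, 50.0, k=3))  # averages the points at V = 45, 50, 55
```

With k = 1 the prediction reproduces the nearest training target exactly, while larger k averages over more neighbours and yields a smoother curve.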

In [None]:
sum((five_nearest.predict(X_test) - y_test)**2)

In [None]:
ten_nearest = KNeighborsRegressor(n_neighbors = 10).fit(X_train, y_train)

In [None]:
sum((ten_nearest.predict(X_test) - y_test)**2)

In [None]:
def RSS(f, X, y):
    return sum((f.predict(X) - y)**2)

In [None]:
RSS(ten_nearest, X_test, y_test)
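
As a sanity check on the residual sum of squares, the sketch below evaluates the same kind of `RSS` helper on a tiny hand-checkable example. The class `MeanPredictor` and the arrays `X_toy` and `y_toy` are hypothetical stand-ins, assumed here only so that the result can be verified by hand; any object with a `predict` method would work the same way.

```python
import numpy as np


def RSS(f, X, y):
    # Residual sum of squares: sum of squared differences
    # between the model's predictions and the observed targets.
    return np.sum((f.predict(X) - y) ** 2)


class MeanPredictor(object):
    """Hypothetical baseline that always predicts the mean training target."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


# Made-up toy data: the mean of y_toy is 3, so the residuals are -2, -1, 0, 3.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([1.0, 2.0, 3.0, 6.0])

baseline = MeanPredictor().fit(X_toy, y_toy)
rss_value = RSS(baseline, X_toy, y_toy)
print(rss_value)  # 4 + 1 + 0 + 9 = 14
```

A constant baseline like this gives a reference value: a useful model should reach a clearly lower RSS on the test set.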

In [None]:
len(y_train)

In [None]:
ks = np.arange(1, 100)

In [None]:
models = [KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
          for k in ks]

In [None]:
RSS_test = [RSS(f, X_test, y_test) for f in models]

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.plot(ks, RSS_test)
plt.xlabel('k')
plt.ylabel("RSS_test")

In [None]:
RSS_train = [RSS(f, X_train, y_train) for f in models]

In [None]:
plt.plot(ks, RSS_train)
plt.xlabel('k')
plt.ylabel("RSS_train")

In [None]:
# Convert the lists to arrays so the division is element-by-element
plt.plot(ks, np.array(RSS_test) / RSS_test[-1])
plt.plot(ks, np.array(RSS_train) / RSS_train[-1])

In [None]:
1. / ks # element-by-element

In [None]:
plt.plot(float(len(y_train))/ ks, RSS_test)
plt.xscale('log')

In [None]:
plot_ks = (1, 5, 30)
plot_models = [models[list(ks).index(k)] for k in plot_ks]

In [None]:
for k, f in zip(plot_ks, plot_models):
    xs = [[i] for i in np.arange(65, 70, 0.1)]
    plt.plot(xs, f.predict(xs), label="k = {}".format(k))
plt.plot(X_train, y_train, 'o', label='train')
plt.plot(X_test, y_test, 's', label='test')
plt.xlim(65, 70)
plt.ylim(430, 455)
plt.legend()