## Supervised learning

<img src="figures/supervised_workflow.svg" width=100%>

# Data Representations

<img src="figures/data_representation.svg" width=100%>

# Dataset Split

<img src="figures/train_test_split_matrix.svg" width=100%>

In [None]:
from preamble import *
%matplotlib notebook

In [None]:
# read data.
# you can find a description in bank/bank-campaign-desc.txt
data = pd.read_csv("data/bank-campaign.csv")

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.head()

In [None]:
y = data.target.values

In [None]:
X = data.drop("target", axis=1).values

In [None]:
X.shape

In [None]:
y.shape

**Data is always a numpy array (or sparse matrix) of shape (n_samples, n_features)**
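As a minimal sketch of this convention, the same steps can be run on a tiny made-up table. The column names and values below are hypothetical and are not taken from the bank dataset:

```python
import pandas as pd

# Toy table with made-up columns; "target" plays the role of the label column.
toy = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "balance": [100.0, 2500.0, 0.0, 730.5],
    "target": [0, 1, 0, 1],
})

y_toy = toy["target"].values                # 1-D array of shape (n_samples,)
X_toy = toy.drop("target", axis=1).values   # 2-D array of shape (n_samples, n_features)

print(X_toy.shape)  # (4, 2)
print(y_toy.shape)  # (4,)
```

Each row of `X_toy` is one sample and each column one feature; `y_toy` holds one label per row.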

Splitting the data:

In [None]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
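To see what the split does, here is a small sketch on synthetic arrays (the names `X_demo` and `y_demo` are made up for illustration). It imports from `sklearn.model_selection`, where the function lives in current scikit-learn releases, and relies on the default of holding out 25% of the samples for testing:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples with 2 features each, and alternating 0/1 labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# With no test_size given, 25% of the samples go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# Fixing random_state makes the shuffle reproducible.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(X_tr, X_tr2))  # True
```

Passing `random_state` is what makes the scores in this notebook repeatable from run to run.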

In [None]:
# import model
from sklearn.linear_model import LogisticRegression
# instantiate model, set parameters
lr = LogisticRegression()
# fit model
lr.fit(X_train, y_train)

Make predictions:

In [None]:
lr.predict(X_train)[:10]

In [None]:
lr.score(X_train, y_train)

In [None]:
lr.score(X_test, y_test)

<table style="border:None">
<tr style="border:None; font-size:20px; padding:10px;"><th colspan=2>``model.fit(X_train, [y_train])``</th></tr>
<tr style="border:None; font-size:20px; padding:10px;"><th>``model.predict(X_test)``</th><th>``model.transform(X_test)``</th></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Classification</td><td>Preprocessing</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Regression</td><td>Dimensionality Reduction</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Clustering</td><td>Feature Extraction</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>&nbsp;</td><td>Feature selection</td></tr>
</table>
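The two columns of the table can be sketched side by side. This is an illustration on synthetic data, with `StandardScaler` chosen here as one example of a preprocessing step; it is not part of the notebook's bank-campaign workflow:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 60 samples, 3 features, label depends on the first feature.
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(60, 3))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

# Transformer: fit learns per-feature mean and scale (no y needed),
# transform returns a new array with the same shape as the input.
scaler = StandardScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)

# Predictor: fit needs the labels, predict returns one label per sample.
clf = LogisticRegression().fit(X_scaled, y_demo)
y_pred = clf.predict(X_scaled)

print(X_scaled.shape, y_pred.shape)  # (60, 3) (60,)
```

The brackets around `y_train` in `fit(X_train, [y_train])` mark it as optional: unsupervised steps such as the scaler ignore it, while classifiers and regressors require it.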

### Additional methods
__Model evaluation__ : ``score(X, [y])``

__Uncertainties from Classifiers__: ``decision_function(X)`` and ``predict_proba(X)``
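A short sketch of both uncertainty methods, assuming a binary problem generated with `make_classification` (the data and the variable names are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

# decision_function: one signed score per sample in the binary case;
# positive scores correspond to the second class in clf.classes_.
scores = clf.decision_function(X_demo[:5])

# predict_proba: one column per class, each row sums to 1.
proba = clf.predict_proba(X_demo[:5])

print(scores.shape)        # (5,)
print(proba.shape)         # (5, 2)
print(proba.sum(axis=1))   # all ones, up to floating point error
```

For this model the predicted label is simply the class whose score is positive, or equivalently the class with the larger probability.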

# Method chaining
Shorter, but maybe less readable.

In [None]:
# this is short, but we never stored the model
LogisticRegression().fit(X_train, y_train).score(X_test, y_test)
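Chaining works because `fit` returns the estimator itself. The sketch below checks this on synthetic data (names such as `lr_demo` are made up here) and shows a middle ground that stays short but keeps the fitted model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# fit returns the very same object it was called on.
model = LogisticRegression()
returned = model.fit(X_demo, y_demo)
print(returned is model)  # True

# Chain only up to fit, so the fitted model is still available afterwards.
lr_demo = LogisticRegression().fit(X_demo, y_demo)
accuracy = lr_demo.score(X_demo, y_demo)
print(accuracy)
```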

# Exercise
Load the dataset ``data/bike_day_raw.csv``, which has the regression target ``cnt``.
This dataset contains daily bike rental counts from the citybike platform. The ``cnt`` column is the number of rentals, which we want to predict from date and weather data.

Split the data into a training and a test set using ``train_test_split``.
Use the ``LinearRegression`` class to learn a regression model on this data. You can evaluate the model with its ``score`` method, which returns the $R^2$, or with the ``mean_squared_error`` function from ``sklearn.metrics`` (or write it yourself in numpy).
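As a hint for the evaluation step, here is a sketch of how the two metrics relate to their definitions. It uses made-up synthetic data rather than the bike dataset, so loading, splitting, and fitting on the real data are still left to you:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic regression data: linear signal plus a little noise.
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(80, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=80)

reg = LinearRegression().fit(X_demo, y_demo)
y_pred = reg.predict(X_demo)

# Mean squared error: average squared residual.
mse_manual = np.mean((y_demo - y_pred) ** 2)

# R^2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_demo, y_pred)))  # True
print(np.isclose(r2_manual, reg.score(X_demo, y_demo)))            # True
```

In the exercise itself, compute these on the test set rather than on the data the model was fitted to.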

In [None]:
pd.read_csv("data/bike_day_raw.csv")