## Supervised learning

<img src="figures/supervised_workflow.svg" width=100%>

# Data Representations

<img src="figures/data_representation.svg" width=100%>

# Dataset Split

<img src="figures/train_test_split_matrix.svg" width=100%>

In [1]:
from preamble import *
%matplotlib notebook

In [2]:
# read data.
# you can find a description in bank/bank-campaign-desc.txt
data = pd.read_csv("data/bank-campaign.csv")

In [3]:
data.shape

(41188, 64)

In [4]:
data.columns

Index(['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oc

In [5]:
data.head()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success,target
0,56,261,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,no
1,57,149,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,no
2,37,226,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,no
3,40,151,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,no
4,56,307,1,999,0,1.1,93.994,-36.4,4.857,5191.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,no


In [6]:
y = data.target.values

In [7]:
X = data.drop("target", axis=1).values

In [8]:
X.shape

(41188, 63)

In [9]:
y.shape

(41188,)

In [11]:
data.target.value_counts()

no     36548
yes     4640
Name: target, dtype: int64

**Data is always a numpy array (or sparse matrix) of shape (n_samples, n_features)**
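
As a small illustration of this convention, the sketch below builds a toy feature matrix and label vector. The names `X_demo`, `y_demo`, and `X_sparse` and the values are made up for this example and are not part of the bank dataset.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 4 samples (rows), 3 features (columns).
X_demo = np.array([[5.1, 3.5, 1.4],
                   [4.9, 3.0, 1.4],
                   [6.2, 3.4, 5.4],
                   [5.9, 3.0, 5.1]])
y_demo = np.array(["no", "no", "yes", "yes"])  # one label per sample

n_samples, n_features = X_demo.shape
print(n_samples, n_features)  # 4 3
print(y_demo.shape)           # (4,)

# Stored as a sparse matrix, the data keeps the (n_samples, n_features) layout.
X_sparse = sparse.csr_matrix(X_demo)
print(X_sparse.shape)         # (4, 3)
```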

Splitting the data:

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
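
To make the split above more concrete, here is a minimal sketch on made-up data (`X_toy` and `y_toy` are hypothetical). It shows the default 75% / 25% split and, as an optional variation that the notebook itself does not use, the `stratify` argument, which can help when classes are as imbalanced as `target` is here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 100 samples, 20 of them labeled "yes".
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["no"] * 80 + ["yes"] * 20)

# Default split: 75% of the samples for training, 25% for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# stratify=y_toy keeps the class ratio the same in both parts.
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0)
print((y_tr_s == "yes").mean(), (y_te_s == "yes").mean())  # 0.2 0.2
```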

In [14]:
# import model
from sklearn.linear_model import LogisticRegression
# instantiate model, set parameters
lr = LogisticRegression()
# fit model
lr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Make predictions:

In [15]:
lr.predict(X_train)[:10]

array(['no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no', 'no'], dtype=object)

In [16]:
lr.score(X_train, y_train)

0.90890550645819168

In [17]:
lr.score(X_test, y_test)

0.91201320773040695

<table style="border:None">
<tr style="border:None; font-size:20px; padding:10px;"><th colspan=2>``model.fit(X_train, [y_train])``</th></tr>
<tr style="border:None; font-size:20px; padding:10px;"><th>``model.predict(X_test)``</th><th>``model.transform(X_test)``</th></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Classification</td><td>Preprocessing</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Regression</td><td>Dimensionality Reduction</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>Clustering</td><td>Feature Extraction</td></tr>
<tr style="border:None; font-size:20px; padding:10px;"><td>&nbsp;</td><td>Feature selection</td></tr>
</table>
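
The two columns of the table can be sketched side by side. The example below uses synthetic data from `make_blobs`, with `StandardScaler` standing in for the `transform` column and `LogisticRegression` for the `predict` column. All variable names are hypothetical, and the choice of estimators is only for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data, used only for illustration.
X_blob, y_blob = make_blobs(n_samples=100, centers=2, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_blob, y_blob, random_state=0)

# Transformer: fit learns the column means and standard deviations
# from the training data only ...
scaler = StandardScaler().fit(Xb_train)
# ... and transform returns a new representation of whatever data is passed in.
Xb_train_scaled = scaler.transform(Xb_train)
Xb_test_scaled = scaler.transform(Xb_test)

# Predictor: fit also takes the labels, predict returns one output per sample.
clf = LogisticRegression().fit(Xb_train_scaled, yb_train)
pred = clf.predict(Xb_test_scaled)

print(Xb_test_scaled.shape)  # (25, 2)
print(pred.shape)            # (25,)
```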

### Additional methods
__Model evaluation__ : ``score(X, [y])``

__Uncertainties from Classifiers__: ``decision_function(X)`` and ``predict_proba(X)``
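
A minimal sketch of how the two uncertainty methods relate to `predict` for a binary problem, using synthetic data rather than the bank dataset. The names `X_unc`, `y_unc`, and `clf_unc` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, used only for illustration.
X_unc, y_unc = make_blobs(n_samples=200, centers=2, random_state=0)
clf_unc = LogisticRegression().fit(X_unc, y_unc)

# predict_proba: one row per sample, one column per class; each row sums to 1.
proba = clf_unc.predict_proba(X_unc[:5])
print(proba.shape)  # (5, 2)

# decision_function: one signed score per sample for a binary problem.
scores = clf_unc.decision_function(X_unc[:5])
print(scores.shape)  # (5,)

# A positive score corresponds to the second entry of clf_unc.classes_,
# and the most probable class is the one that predict returns.
pred_from_scores = clf_unc.classes_[(scores > 0).astype(int)]
pred_from_proba = clf_unc.classes_[proba.argmax(axis=1)]
```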

# Method chaining
Shorter, but maybe less readable.

In [18]:
# this is short, but we never stored the model
LogisticRegression().fit(X_train, y_train).score(X_test, y_test)

0.91201320773040695

In [None]:
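# Shortcut methods that combine two steps in one call:
# model.fit_predict(X)    fits the model and returns predictions for X (e.g. clustering)
# model.fit_transform(X)  fits the model and returns the transformed X (e.g. preprocessing)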
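
The two shortcut methods named in this cell can be tried on synthetic data. This is a sketch under assumptions: `StandardScaler` and `KMeans` are picked here only as convenient examples of a transformer and a clusterer, and `X_fp` is a hypothetical name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic data with three blobs, used only for illustration.
X_fp, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# fit_transform(X) gives the same result as fit(X).transform(X) for this scaler.
scaled_a = StandardScaler().fit_transform(X_fp)
scaled_b = StandardScaler().fit(X_fp).transform(X_fp)
print(np.allclose(scaled_a, scaled_b))  # True

# fit_predict(X) fits a clusterer and returns one cluster label per sample.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_fp)
print(labels.shape)  # (60,)
```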

# Exercise
Load the dataset ``data/bike_day_raw.csv``, which has the regression target ``cnt``.
This dataset contains daily bike rentals from the citybike platform. The ``cnt`` column is the number of rentals, which we want to predict from date and weather data.

Split the data into a training and a test set using ``train_test_split``.
Use the ``LinearRegression`` class to learn a regression model on this data. You can evaluate the model with the ``score`` method, which returns the $R^2$, or with the ``mean_squared_error`` function from ``sklearn.metrics`` (or write it yourself in numpy).
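
For the "write it yourself in numpy" option, one possible sketch is shown below. The arrays `y_true` and `y_hat` are made-up values rather than predictions on the bike data, and the $R^2$ formula used is the standard $1 - SS_{res} / SS_{tot}$.

```python
import numpy as np

# Hypothetical true targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error: average of the squared residuals.
mse_numpy = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_numpy = 1 - ss_res / ss_tot

print(mse_numpy)  # 0.5
print(r2_numpy)   # 0.9
```

Both values should agree with ``mean_squared_error(y_true, y_hat)`` and ``r2_score(y_true, y_hat)`` from ``sklearn.metrics``.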

In [20]:
data = pd.read_csv("data/bike_day_raw.csv")
X = data.drop("cnt", axis=1)
y = data.cnt

In [28]:
data.head()

Unnamed: 0,season,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,cnt
0,1,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,985
1,1,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,801
2,1,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309,1349
3,1,1,0,2,1,1,0.2,0.212122,0.590435,0.160296,1562
4,1,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869,1600


In [21]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [33]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2 names this parameter sparse_output

In [34]:
X_train_ohe = ohe.fit_transform(X_train)
X_test_ohe = ohe.transform(X_test)

In [35]:
from sklearn.linear_model import LinearRegression

In [36]:
X_train_ohe.shape

(548, 34)

In [38]:
X_train_ohe.std(axis=0)

array([ 0.429,  0.439,  0.434,  0.43 ,  0.277,  0.26 ,  0.291,  0.28 ,
        0.283,  0.28 ,  0.266,  0.277,  0.26 ,  0.277,  0.28 ,  0.283,
        0.168,  0.168,  0.344,  0.344,  0.358,  0.357,  0.349,  0.353,
        0.344,  0.46 ,  0.46 ,  0.48 ,  0.472,  0.152,  0.   ,  0.   ,
        0.   ,  0.   ])

In [39]:
# scale here

In [40]:
lr = LinearRegression().fit(X_train, y_train)

In [41]:
lr.score(X_train, y_train)

0.53289255294986992

In [42]:
lr.score(X_test, y_test)

0.49910337568762719

In [43]:
from sklearn.metrics import mean_squared_error
y_pred = lr.predict(X_test)
mean_squared_error(y_test, y_pred)

2220717.2087004711