# Supervised Learning Algorithms


## Model Complexity

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import sklearn
sklearn.set_config(print_changed_only=True)

In [2]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
blood = fetch_openml('blood-transfusion-service-center')

X_train, X_test, y_train, y_test = train_test_split(
    blood.data, blood.target, random_state=0)

In [3]:
X_train.shape

(561, 4)

In [4]:
import pandas as pd
pd.Series(y_train).value_counts()

1    438
2    123
dtype: int64

In [5]:
pd.Series(y_train).value_counts(normalize=True)

1    0.780749
2    0.219251
dtype: float64

Really Simple API
-------------------
0) Import your model class

In [6]:
from sklearn.svm import LinearSVC

1) Instantiate an object and set the parameters

In [7]:
svm = LinearSVC()

2) Fit the model

In [8]:
svm.fit(X_train, y_train)



LinearSVC()

3) Apply / evaluate

In [9]:
print(svm.predict(X_train))
print(y_train)

['1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1'
 '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' '1' ...

In [10]:
svm.score(X_train, y_train)

0.7807486631016043

In [11]:
svm.score(X_test, y_test)

0.7058823529411765
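The training accuracy of 0.7807 is the same as the share of class `1` in `y_train` (438 of 561), which is what a model scores when it predicts `1` for every sample. A minimal sketch of how such a majority-class baseline could be checked with `DummyClassifier`; the data below is a synthetic stand-in (random features and labels with an assumed 78/22 class split), not the blood transfusion dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced binary problem (assumption, not the
# blood data): roughly 78% of the labels are '1', the rest are '2'.
rng = np.random.RandomState(0)
X = rng.normal(size=(748, 4))
y = np.where(rng.uniform(size=748) < 0.78, '1', '2')

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# strategy='most_frequent' always predicts the majority class seen in fit,
# so its accuracy is the floor that a real model should beat.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)

majority_fraction = np.mean(y_train == '1')
baseline_train_score = baseline.score(X_train, y_train)
baseline_test_score = baseline.score(X_test, y_test)
print(majority_fraction, baseline_train_score, baseline_test_score)
```

If a classifier's accuracy matches this baseline, accuracy alone does not show that it learned anything beyond the class frequencies.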

And again
---------

In [12]:
from sklearn.ensemble import RandomForestClassifier

In [13]:
rf = RandomForestClassifier()

In [14]:
rf.fit(X_train, y_train)

RandomForestClassifier()

In [15]:
rf.score(X_train, y_train)

0.9411764705882353

In [16]:
rf.score(X_test, y_test)

0.7112299465240641
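Both models were used through the same three calls: instantiate, `fit`, `score`. A minimal sketch of that shared interface, looping over the two estimators from this section; the dataset is synthetic (`make_classification` with arbitrary sizes and seeds, chosen here only for illustration), so the scores will not match the ones above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data; sizes and seeds are arbitrary assumptions.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    'LinearSVC': LinearSVC(),
    'RandomForestClassifier': RandomForestClassifier(random_state=0),
}

# Every estimator exposes the same fit / score methods, so one loop
# can train and evaluate all of them.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train),
                    model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```

Comparing the two numbers per model is one way to see the gap between training and test accuracy that the random forest shows above.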

## Materials: https://github.com/amueller/ml-workshop-1-of-4

## Exercises

## Exercise 1
Load the iris dataset from the ``sklearn.datasets`` module using the ``load_iris`` function.

Split it into a training set and a test set using ``train_test_split``.

## Exercise 2
Train and evaluate ``sklearn.neighbors.KNeighborsClassifier``, ``sklearn.ensemble.RandomForestClassifier`` and ``sklearn.linear_model.LogisticRegression`` on the iris dataset.
How does each model perform on the training set compared with the test set? Which one is best on the training set, and which one is best on the test set?

## Exercise 3 (extra)
Can you construct a binary classification dataset (using ``np.random``, for example) on which ``sklearn.linear_model.LogisticRegression`` achieves an accuracy of 1? Can you construct a binary classification dataset on which it achieves an accuracy of 0.5?

In [17]:
# %load solutions/train_iris.py