In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Supervised Learning Part 2a -- Classification

To visualize the workings of machine learning algorithms, it is often helpful to study one-dimensional or two-dimensional data, that is, data with only one or two features. In practice, datasets usually have many more features, but high-dimensional data is hard to plot on a two-dimensional screen.

We will illustrate some very simple examples before we move on to more "real world" data sets.

First, we will look at a two-class classification problem in two dimensions. We use synthetic data generated by the ``make_blobs`` function.

In [0]:
from sklearn.datasets import make_blobs

X, y = make_blobs(centers=2, random_state=0)

print('X ~ n_samples x n_features:', X.shape)
print('y ~ n_samples:', y.shape)

print('\nFirst 5 samples:\n', X[:5, :])
print('\nFirst 5 labels:', y[:5])

As the data is two-dimensional, we can plot each sample as a point in a two-dimensional coordinate system, with the first feature being the x-axis and the second feature being the y-axis.

In [0]:
plt.scatter(X[y == 0, 0], X[y == 0, 1], 
            c='blue', s=40, label='0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], 
            c='red', s=40, label='1', marker='s')

plt.xlabel('first feature')
plt.ylabel('second feature')
plt.legend(loc='upper right');

Classification is a supervised task, and since we are interested in the model's performance on unseen data, we split our data into two parts:

1. a training set that the learning algorithm uses to fit the model
2. a test set to evaluate the generalization performance of the model

The ``train_test_split`` function from the ``model_selection`` module does that for us -- we will use it to split a dataset into 75% training data and 25% test data.

<img width="50%" src='https://github.com/fordanic/cmiv-ai-course/blob/master/notebooks/figures/train_test_split_matrix.png?raw=1'/>


In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=1234,
                                                    stratify=y)

plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 
            c='blue', s=40, label='0')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 
            c='red', s=40, label='1', marker='s')

plt.xlabel('first feature')
plt.ylabel('second feature')
plt.legend(loc='upper right');
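
As a quick sanity check on the split described above, the following sketch prints the sizes of the two parts and the class counts in each. It rebuilds the same synthetic data so that it can run on its own; the use of `np.bincount` to count labels is our choice for illustration, not something the split itself requires.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Same data and split settings as in the cells above
X, y = make_blobs(centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y)

# Sizes of the two parts (make_blobs creates 100 samples by default)
print('train:', X_train.shape, 'test:', X_test.shape)

# Class counts in the full data and in each part;
# stratify=y keeps the class proportions roughly equal across the parts
print('all:  ', np.bincount(y))
print('train:', np.bincount(y_train))
print('test: ', np.bincount(y_test))
```

Without `stratify=y`, the class proportions in the two parts could drift apart by chance, which matters most for small or imbalanced datasets.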

## The scikit-learn estimator API and Logistic Regression

<img width="50%" src='https://github.com/fordanic/cmiv-ai-course/blob/master/notebooks/figures/supervised_workflow.png?raw=1'/>

Every algorithm in scikit-learn is exposed via an *Estimator* object, and all models share a very consistent interface. For instance, we first import the logistic regression class.

In [0]:
from sklearn.linear_model import LogisticRegression

Next, we instantiate the estimator object.

In [0]:
classifier = LogisticRegression()

In [0]:
X_train.shape

In [0]:
y_train.shape

To build the model from our data, that is, to learn how to classify new points, we call the ``fit`` method with the training data and the corresponding training labels (the desired output for each training data point):

In [0]:
classifier.fit(X_train, y_train)

(Some estimator methods such as `fit` return `self`. Thus, after executing the code snippet above, the notebook displays a representation of this particular instance of `LogisticRegression`; older scikit-learn versions list all of its default parameters there. Another way of retrieving the estimator's initialization parameters is to execute `classifier.get_params()`, which returns a parameter dictionary.)
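
A minimal sketch of both points, using a separate estimator named `clf` (a name chosen here for illustration) and fitting on the full synthetic data only to keep the example short:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(centers=2, random_state=0)

clf = LogisticRegression()
fitted = clf.fit(X, y)

# fit returns the estimator itself, not a copy
print(fitted is clf)

# get_params returns the initialization parameters as a dictionary
params = clf.get_params()
print(params['C'], params['max_iter'])
```

Because `fit` returns the estimator, calls can also be chained, for example `LogisticRegression().fit(X, y).predict(X)`.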

We can then apply the model to unseen data and predict its labels using the ``predict`` method:

In [0]:
prediction = classifier.predict(X_test)

We can compare these against the true labels:

In [0]:
print(prediction)
print(y_test)

We can evaluate our classifier quantitatively by measuring what fraction of its predictions are correct. This is called **accuracy**:

In [0]:
np.mean(prediction == y_test)

There is also a convenience method, ``score``, that all scikit-learn classifiers provide to compute this directly from the test data:

In [0]:
classifier.score(X_test, y_test)
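
The sketch below checks that these ways of computing accuracy agree. It repeats the data generation and fitting so that it is self-contained, and it additionally uses `accuracy_score` from `sklearn.metrics`, which the text above does not mention.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_blobs(centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y)

classifier = LogisticRegression().fit(X_train, y_train)
prediction = classifier.predict(X_test)

# Three equivalent ways to compute the test accuracy
acc_manual = np.mean(prediction == y_test)        # fraction of correct predictions
acc_metric = accuracy_score(y_test, prediction)   # metrics function
acc_score = classifier.score(X_test, y_test)      # estimator method

print(acc_manual, acc_metric, acc_score)
```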

It is often helpful to compare the generalization performance (on the test set) to the performance on the training set:

In [0]:
classifier.score(X_train, y_train)

``LogisticRegression`` is a so-called linear model, which means that its decision boundary is linear in the input space. In 2D, this simply means it finds a line that separates the blue points from the red ones:

In [0]:
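def plot_2d_separator(classifier, X, fill=False, ax=None, eps=None):
    """Plot the decision boundary of a fitted two-class classifier on 2D data.

    The classifier is evaluated on a 100 x 100 grid spanning the range of X
    (padded by eps); with fill=True the two regions are colored instead.
    """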
    if eps is None:
        eps = X.std() / 2.
    x_min, x_max = X[:, 0].min() - eps, X[:, 0].max() + eps
    y_min, y_max = X[:, 1].min() - eps, X[:, 1].max() + eps
    xx = np.linspace(x_min, x_max, 100)
    yy = np.linspace(y_min, y_max, 100)

    X1, X2 = np.meshgrid(xx, yy)
    X_grid = np.c_[X1.ravel(), X2.ravel()]
    try:
        decision_values = classifier.decision_function(X_grid)
        levels = [0]
        fill_levels = [decision_values.min(), 0, decision_values.max()]
    except AttributeError:
        # no decision_function
        decision_values = classifier.predict_proba(X_grid)[:, 1]
        levels = [.5]
        fill_levels = [0, .5, 1]

    if ax is None:
        ax = plt.gca()
    if fill:
        ax.contourf(X1, X2, decision_values.reshape(X1.shape),
                    levels=fill_levels, colors=['blue', 'red'])
    else:
        ax.contour(X1, X2, decision_values.reshape(X1.shape), levels=levels,
                   colors="black")
    ax.set_xlim(x_min, x_max)
    ax.set_ylim(y_min, y_max)
    ax.set_xticks(())
    ax.set_yticks(())


In [0]:
plt.scatter(X[y == 0, 0], X[y == 0, 1], 
            c='blue', s=40, label='0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], 
            c='red', s=40, label='1', marker='s')

plt.xlabel("first feature")
plt.ylabel("second feature")
plot_2d_separator(classifier, X)
plt.legend(loc='upper right');

**Estimated parameters**: All the estimated model parameters are attributes of the estimator object whose names end with an underscore. Here, these are the coefficients and the offset of the line:

In [0]:
print(classifier.coef_)
print(classifier.intercept_)
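
To make the role of these parameters concrete, the following sketch recomputes the classifier's output by hand for this two-class problem. The names `w`, `b`, `scores`, and `proba` are ours; the comparison assumes the default binary setup, in which the predicted label is 1 wherever the linear score is positive and the class-1 probability is the sigmoid of that score.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_blobs(centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y)
classifier = LogisticRegression().fit(X_train, y_train)

w = classifier.coef_[0]        # one weight per feature, shape (2,)
b = classifier.intercept_[0]   # scalar offset

# Linear score w . x + b for every test point
scores = X_test @ w + b
print(np.allclose(scores, classifier.decision_function(X_test)))

# Points with a positive score are assigned to class 1
manual_pred = (scores > 0).astype(int)
print(np.array_equal(manual_pred, classifier.predict(X_test)))

# The class-1 probability is the sigmoid of the score
proba = 1.0 / (1.0 + np.exp(-scores))
print(np.allclose(proba, classifier.predict_proba(X_test)[:, 1]))
```

The separating line drawn above is the set of points where the score is zero, that is, `w[0] * x1 + w[1] * x2 + b = 0`.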

___
## Exercise
The example above was very simple, with only two features. Logistic regression also works for data with many more dimensions. We will now create a classifier that predicts whether a breast cancer tumour is malignant or benign.

Use the breast cancer data and create a logistic regression classifier.

You can look at the simple example above for guidance.

NOTE: ``plot_2d_separator`` will not work with this dataset because it has more than two features!

In [0]:
from sklearn.datasets import load_breast_cancer
bc_data = load_breast_cancer()
print(bc_data.DESCR)

X, y = bc_data.data, bc_data.target

# ...

___

## Another classifier: K Nearest Neighbors
Another popular and easy-to-understand classifier is K nearest neighbors (kNN). It has one of the simplest learning strategies: given a new, unknown observation, look up in your reference database the samples whose features are closest to it and assign the predominant class among them.
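
The strategy can be sketched in a few lines of NumPy. This is an illustrative toy version, not how scikit-learn implements it: the function name `knn_predict` and the tiny dataset are made up for this example, distances are plain Euclidean, and labels are assumed to be non-negative integers.

```python
import numpy as np

def knn_predict(X_ref, y_ref, X_new, k=1):
    """Predict labels for X_new by majority vote among the k closest
    reference points (hypothetical helper for illustration)."""
    # Pairwise Euclidean distances, shape (n_new, n_ref)
    diffs = X_new[:, np.newaxis, :] - X_ref[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k nearest reference points for each new point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Labels of those neighbors, then the most frequent label per row
    votes = y_ref[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Toy reference database: two points per class
X_ref = np.array([[0., 0.], [1., 0.], [5., 5.], [6., 5.]])
y_ref = np.array([0, 0, 1, 1])
X_new = np.array([[0.2, 0.1], [5.8, 4.9]])

print(knn_predict(X_ref, y_ref, X_new, k=1))  # [0 1]
print(knn_predict(X_ref, y_ref, X_new, k=3))  # [0 1]
```

Note that there is no real training step: the "model" is the reference data itself, and all the work happens at prediction time.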

In [0]:
# First reload our synthetic data
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, y = make_blobs(centers=2, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=1234,
                                                    stratify=y)

The interface is exactly the same as for ``LogisticRegression`` above.

In [0]:
from sklearn.neighbors import KNeighborsClassifier

This time we set a parameter of the KNeighborsClassifier to tell it we only want to look at one nearest neighbor:

In [0]:
knn = KNeighborsClassifier(n_neighbors=1)

We fit the model with our training data:

In [0]:
knn.fit(X_train, y_train)

In [0]:
plt.scatter(X[y == 0, 0], X[y == 0, 1], 
            c='blue', s=40, label='0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], 
            c='red', s=40, label='1', marker='s')

plt.xlabel("first feature")
plt.ylabel("second feature")
plot_2d_separator(knn, X)
plt.legend(loc='upper right');

In [0]:
knn.score(X_test, y_test)
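
As with logistic regression, it is informative to compare the test score with the training score. The sketch below does this on the synthetic data for a few values of ``n_neighbors``; the particular values 1, 5, and 15 are arbitrary choices for illustration. With a single neighbor, every training point is its own nearest neighbor, so the training accuracy is perfect as long as no two identical points carry different labels.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1234, stratify=y)

# Map each n_neighbors value to (training accuracy, test accuracy)
scores = {}
for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print('n_neighbors=%d  train=%.2f  test=%.2f' % (k, *scores[k]))
```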

Now train a kNN classifier on the breast cancer dataset.

In [0]:
from sklearn.datasets import load_breast_cancer
bc_data = load_breast_cancer()
print(bc_data.DESCR)

X, y = bc_data.data, bc_data.target

___
## Exercise
Apply the ``KNeighborsClassifier`` to the ``iris`` dataset. Play with different values of ``n_neighbors`` and observe how the training and test scores change.

In [0]:
# %load solutions/knn_with_diff_k.py

On Google Colab, visit [knn_with_diff_k.py](https://github.com/fordanic/cmiv-ai-course/blob/master/notebooks/solutions/knn_with_diff_k.py), copy the content of the solution, and paste it into the cell above.

___