<!--NAVIGATION-->

<a href="https://colab.research.google.com/github/bpesquet/machine-learning-katas/blob/master/classic-datasets/Breast_Cancer.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open in Google Colaboratory"></a>


# Kata: Breast Cancer Dataset

| Learning type | Activity type | Objective |
| - | - | - |
| Supervised | Binary classification | Predict if a tumor is benign or malignant |


## Instructions

This is a self-correcting exercise generated by [nbgrader](https://github.com/jupyter/nbgrader). 

Complete the cells beginning with `# YOUR CODE HERE` and run the subsequent cells to check your code.

## About the dataset

The [Breast Cancer][1] dataset is used for multivariate binary classification. It contains 569 samples (212 malignant, 357 benign), each described by 30 numeric features. The features were computed from digitized images of fine needle aspirates of breast masses and describe characteristics of the cell nuclei present in each image.

![](images/breast-cancer-logo.jpg)

[1]: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
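
The sample and feature counts quoted above can be checked directly on the object returned by `load_breast_cancer`. The sketch below is only an illustration; the variable name `bc` is arbitrary, chosen so that it does not collide with the `dataset` variable requested in Step 1.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# Feature matrix: one row per sample, one column per feature
print(bc.data.shape)  # (569, 30)

# Class labels; the integer targets index into this array
print(bc.target_names)

# Number of samples per class, in the order of target_names
counts = np.bincount(bc.target)
print(dict(zip(bc.target_names, counts)))
```

Note that target `0` corresponds to the first entry of `target_names` and target `1` to the second.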

## Package setup

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
import keras.optimizers
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical
from nose.tools import assert_equal, assert_true

# Display plots inline and render them at retina resolution
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Set Seaborn aesthetic parameters to defaults
sns.set()

## Utility functions

In [None]:
def plot_loss_acc(history):
    """Plot training and (optionally) validation loss and accuracy"""

    loss = history.history['loss']
    epochs = range(1, len(loss) + 1)

    plt.figure(figsize=(10, 10))

    plt.subplot(2, 1, 1)
    plt.plot(epochs, loss, '.--', label='Training loss')
    final_loss = loss[-1]
    title = 'Training loss: {:.4f}'.format(final_loss)
    plt.ylabel('Loss')
    if 'val_loss' in history.history:
        val_loss = history.history['val_loss']
        plt.plot(epochs, val_loss, 'o-', label='Validation loss')
        final_val_loss = val_loss[-1]
        title += ', Validation loss: {:.4f}'.format(final_val_loss)
    plt.title(title)
    plt.legend()

    acc = history.history['acc']

    plt.subplot(2, 1, 2)
    plt.plot(epochs, acc, '.--', label='Training acc')
    final_acc = acc[-1]
    title = 'Training accuracy: {:.2f}%'.format(final_acc * 100)
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    if 'val_acc' in history.history:
        val_acc = history.history['val_acc']
        plt.plot(epochs, val_acc, 'o-', label='Validation acc')
        final_val_acc = val_acc[-1]
        title += ', Validation accuracy: {:.2f}%'.format(final_val_acc * 100)
    plt.title(title)
    plt.legend()

## Step 1: Loading the data

### Question

* Load the Breast Cancer dataset included with scikit-learn. Store it in a variable named `dataset`. 
* Display 10 random samples with feature names, target and class.
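
As a hint for the second bullet, `DataFrame.sample` draws random rows from a pandas DataFrame. The sketch below uses a small made-up frame: the column names, values and class labels are placeholders, not taken from the dataset, and `random_state` is only set to make the draw repeatable.

```python
import pandas as pd

# Hypothetical toy data standing in for the real features and targets
df = pd.DataFrame({
    "feature_a": [0.1, 0.2, 0.3, 0.4, 0.5],
    "target": [0, 1, 1, 0, 1],
})

# Map each numeric target to a readable class name (placeholder labels)
class_names = ["neg", "pos"]
df["class"] = [class_names[t] for t in df["target"]]

# Draw 3 distinct random rows; random_state makes the draw repeatable
subset = df.sample(n=3, random_state=0)
print(subset)
```

The same pattern should carry over to the real data once the features, target and class name are gathered in one DataFrame.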

In [None]:
# YOUR CODE HERE

## Step 2: Using a simple model

In [None]:
# Baseline: fit scikit-learn's built-in logistic regression classifier
# and report its accuracy on the training data
model = LogisticRegression()
model.fit(dataset.data, dataset.target)
accuracy = model.score(dataset.data, dataset.target)
print('Accuracy: {:.2f}%'.format(accuracy * 100))
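
For a classifier, `score` returns the mean accuracy, i.e. the fraction of samples whose predicted label matches the true label. The sketch below recomputes it by hand to make that explicit. Because it is measured on the same data the model was fitted on, it is a training accuracy. The names `X`, `y` and `clf` are illustrative, and `max_iter=10000` is an assumption added here to give the solver room to converge on unscaled features; the cell above uses the defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# max_iter raised from the default (an assumption, see the note above)
clf = LogisticRegression(max_iter=10000)
clf.fit(X, y)

# Accuracy computed by hand: share of correct predictions
manual_accuracy = (clf.predict(X) == y).mean()

# score() reports the same quantity
print('Manual:  {:.2f}%'.format(manual_accuracy * 100))
print('score(): {:.2f}%'.format(clf.score(X, y) * 100))
```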

## Step 3: Training a neural network model

### Question

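Train a neural network on the data to obtain a training accuracy > 85%. Store the training history (the object returned by `fit`) in a variable named `history`. The checks below read `history.history['acc']`, so compile the model with `metrics=['acc']`: recent Keras versions name the history key after the metric string passed to `compile`.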

In [None]:
# YOUR CODE HERE

In [None]:
# Plot training history
plot_loss_acc(history)

In [None]:
# Retrieve final accuracy
final_acc = history.history['acc'][-1]
# Assert final accuracy
assert_true(final_acc > 0.85)

## Step 4: Beating the simple model

### Question

Optimize training to beat the simple model by achieving a training accuracy > 96%.
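
One common lever when a neural network trains poorly on this dataset is feature scaling: the 30 features span very different ranges, which tends to slow gradient-based training. The sketch below standardizes the features with scikit-learn's `StandardScaler`. This is an assumption about what may help, not the kata's required solution, and `StandardScaler` is not among the imports in the package setup cell.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Raw features have very different magnitudes
feature_means = X.mean(axis=0)
print('Largest feature mean:  {:.2f}'.format(feature_means.max()))
print('Smallest feature mean: {:.4f}'.format(feature_means.min()))

# Rescale each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
print('Scaled means close to 0:', np.allclose(X_scaled.mean(axis=0), 0))
print('Scaled stds close to 1: ', np.allclose(X_scaled.std(axis=0), 1))
```

Other options, such as adjusting the learning rate, the number of epochs or the network size, may work as well.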

In [None]:
# YOUR CODE HERE

In [None]:
# Plot training history
plot_loss_acc(history)

In [None]:
# Retrieve final accuracy
final_acc = history.history['acc'][-1]
# Assert final accuracy
assert_true(final_acc > 0.96)