# Summative assessment

This is your summative assignment for the Machine Learning introduction course. Solve it on your own: this is an individual assessment, not a group assessment!

All plots you produce in this notebook need to have proper labels and legends where appropriate. 

For this assignment you can use facilities in the `numpy` and `sklearn` libraries, and plots should be produced using `matplotlib`. The functions and classes imported at the beginning of answer cells are suggestions; there is no requirement to use them.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn

For this notebook we will use the breast cancer dataset from `sklearn`:

In [None]:
from sklearn.datasets import load_breast_cancer

dic = load_breast_cancer()
data = dic['data']
target = dic['target']

**TASK 1:**

Standardize the inputs and prepare a training and a test sample with ratio 3:1 using `train_test_split`. Use `random_state=42` to make your result easily comparable. We will train our models on the training set, and keep the test set to compare the models at the end of this notebook. [4 marks]


In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# use random_state=42 in train_test_split!

# YOUR CODE HERE
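
One possible order of operations, sketched here on synthetic data rather than the assignment dataset, is to split first and then fit the scaler on the training portion only, so that no test-set statistics enter the preprocessing. The names `X_demo` and `y_demo` and the use of `make_classification` are assumptions made purely for illustration; `test_size=0.25` gives the 3:1 ratio.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 200 samples, 6 features (not the assignment dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# test_size=0.25 yields a 3:1 train/test ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
```

Standardizing the whole dataset before splitting is the other common reading of the task; the difference is whether the test set contributes to the mean and variance used for scaling.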

We will consider two inputs:

**full feature set**: this is the full set of features you used above.

**reduced feature set**: this considers only the first three features of the full dataset.

Later in this notebook we will consider a "quick" test that only uses the first three features and a "thorough" test that uses all features.

**TASK 2:**

Use a logistic regression model to classify the data using the **reduced feature set**. Plot the score as a function of the regularisation parameter, together with an estimate of the uncertainty on the score obtained from 5-fold cross-validation. You might want to use `fill_between` for the uncertainties. Your plot should have appropriate labels. [15 marks]


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# YOUR CODE HERE
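
As a rough illustration of the mechanics, the sketch below runs a 5-fold grid search over the `C` parameter of `LogisticRegression` (the inverse of the regularisation strength in `sklearn`) and draws the mean score with a band of one standard deviation across folds. The data are synthetic and the grid `C_values` is an arbitrary assumption, not a recommended range.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data with three features (not the assignment dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Assumed grid of C values, evenly spaced on a log scale.
C_values = np.logspace(-3, 3, 13)
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid={"C": C_values}, cv=5
)
search.fit(X_demo, y_demo)

# Mean and standard deviation of the validation score over the 5 folds.
mean_score = search.cv_results_["mean_test_score"]
std_score = search.cv_results_["std_test_score"]

fig, ax = plt.subplots()
ax.plot(C_values, mean_score, label="mean CV accuracy")
ax.fill_between(
    C_values, mean_score - std_score, mean_score + std_score,
    alpha=0.3, label="±1 std over folds",
)
ax.set_xscale("log")
ax.set_xlabel("C (inverse regularisation strength)")
ax.set_ylabel("accuracy")
ax.legend()
```

In a notebook the figure renders inline; in a plain script a call to `plt.show()` would be needed.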

**TASK 3:**

Use a logistic regression model to classify the data using the **full feature set**. Plot the score as a function of the regularisation parameter, together with an estimate of the uncertainty on the score obtained from 5-fold cross-validation. You might want to use `fill_between` for the uncertainties. Your plot should have appropriate labels. [13 marks]


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# YOUR CODE HERE

**TASK 4:**

Using the test set, plot the ROC curve (true positive rate vs false positive rate) for each of the two models above to compare their performance. [12 marks]


In [None]:
from sklearn.metrics import roc_curve

# YOUR CODE HERE
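
For reference, `roc_curve` takes the true labels and a continuous score per sample, and returns the false positive rates, true positive rates and the thresholds that produce them. The sketch below shows this for a single model on synthetic data; the data and the name `model` are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data (not the assignment dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Signed distance to the decision boundary serves as the continuous score.
scores = model.decision_function(X_test)
fpr, tpr, thresholds = roc_curve(y_test, scores)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label="demo model")
ax.plot([0, 1], [0, 1], linestyle="--", label="random guess")
ax.set_xlabel("false positive rate")
ax.set_ylabel("true positive rate")
ax.legend()
```

Calling `ax.plot` once per model on the same axes places several ROC curves in one figure for comparison.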

**TASK 5:**

Consider how the two models could be used as diagnostics in practice. The reduced feature set model would act as a quick test, as it relies on fewer features, while the full feature set model would require more individual tests to be performed on a patient. For this task we consider the following diagnostic strategy: first classify using the "quick" reduced feature set model; if its decision function comes within one unit of the decision boundary (i.e. the model is not very confident of its answer), perform the more thorough test using the full feature set model, with that model's decision boundary fixed at 0. Produce a ROC curve for this diagnostic strategy and show it on a plot together with the ROC curves for the two models above. [20 marks]


In [None]:
# YOUR CODE HERE
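
One way to read this strategy, offered only as an interpretation, is to sweep the threshold `t` of the quick model: samples whose quick score lies within one unit of `t` are handed to the full model (boundary fixed at 0), and all others keep the quick decision. The helper `two_stage_roc`, its `band` parameter and the synthetic data below are hypothetical names introduced for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def two_stage_roc(quick_scores, full_scores, y_true, thresholds, band=1.0):
    """Return (FPR, TPR) arrays for the two-stage rule as the quick threshold varies."""
    positives = y_true == 1
    negatives = ~positives
    fpr, tpr = [], []
    for t in thresholds:
        quick_decision = quick_scores > t
        # Samples within `band` of the quick threshold get the thorough test.
        unsure = np.abs(quick_scores - t) < band
        # Thorough test: full-model decision boundary fixed at 0.
        decision = np.where(unsure, full_scores > 0, quick_decision)
        tpr.append(decision[positives].mean())
        fpr.append(decision[negatives].mean())
    return np.array(fpr), np.array(tpr)


# Hypothetical stand-in data (not the assignment dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# "Quick" model on the first three features, "thorough" model on all of them.
quick = LogisticRegression(max_iter=1000).fit(X_train[:, :3], y_train)
full = LogisticRegression(max_iter=1000).fit(X_train, y_train)
quick_scores = quick.decision_function(X_test[:, :3])
full_scores = full.decision_function(X_test)

# Sweep thresholds beyond the score range so the curve reaches both corners.
thresholds = np.linspace(quick_scores.min() - 2.0, quick_scores.max() + 2.0, 200)
fpr, tpr = two_stage_roc(quick_scores, full_scores, y_test, thresholds)
```

The arrays `fpr` and `tpr` can be passed to `ax.plot` alongside the single-model curves. Because the full model overrides the quick decision inside the band, the resulting points may not form a monotonic curve.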

**TASK 6:**

Consider the following models:

- a neural network with 4 hidden layers of 4 nodes each, with a sigmoid activation function,
- a $k$-neighbors model with $k=5$,
- a Support Vector Machine.

Use each of these models to fit the data using the full feature set. Using 5-fold cross-validation, make an estimate of the expected score and its uncertainty. Show your results in an error bar plot. Use cross-validation to select parameters if necessary. [36 marks]


In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# YOUR CODE HERE
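
A minimal sketch of the comparison, again on synthetic data, is given below. It uses `cross_val_score` with `cv=5` and takes the mean and standard deviation over folds as the score estimate and its uncertainty. In `MLPClassifier` the sigmoid activation is selected with `activation="logistic"`. The values of `max_iter`, the scaling pipeline and the absence of any hyperparameter tuning are assumptions of this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical stand-in data (not the assignment dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "MLP (4x4, sigmoid)": MLPClassifier(
        hidden_layer_sizes=(4, 4, 4, 4), activation="logistic",
        max_iter=2000, random_state=0,
    ),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "Linear SVM": LinearSVC(max_iter=10000),
}

names, means, stds = [], [], []
for name, model in models.items():
    # Scaling inside the pipeline is refit on each training fold.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
    names.append(name)
    means.append(scores.mean())
    stds.append(scores.std())

positions = np.arange(len(names))
fig, ax = plt.subplots()
ax.errorbar(positions, means, yerr=stds, fmt="o", capsize=4, label="mean ± std over 5 folds")
ax.set_xticks(positions)
ax.set_xticklabels(names)
ax.set_ylabel("accuracy")
ax.legend()
```

Where a model has parameters worth tuning, wrapping it in `GridSearchCV` before passing it to `cross_val_score` gives a nested cross-validation estimate.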