# Model Assessment

### Objective

This notebook introduces the concept of model assessment of supervised machine learning models.

### Learning objective

After finishing this notebook, you should be able to design experiments to select and evaluate supervised machine learning models. The main concepts include:

- training and testing sets
- cross-validation
- bootstrap
- performance assessment of classifiers


### Requirements

1. This notebook was created using
  * python 3.7.1
  * numpy 1.15.4
  * matplotlib 3.0.2
  * scikit-learn 0.20.1
  * seaborn 0.9.0
  * scipy 1.1.0

You can check your Python version by running
```python
import sys
print(sys.version)
```

and the version of any module by running
```python
import <module name>
print(<module name>.__version__)
```

## Loading the libraries

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline

We use `matplotlib` for data visualization together with its built-in interface called `pyplot`. Please check https://matplotlib.org/contents.html for more information.


## Wisconsin Prognostic Breast Cancer dataset

The data are available at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.data

### Data description

* The dataset contains 569 samples of **malignant** and **benign** tumor cells
* It has 32 columns: the first stores the **unique ID** number of the sample, the second the diagnosis (M = malignant, B = benign), and columns 3--32 hold real-valued features derived from ten measurements computed for each cell nucleus:

    * radius (mean of distances from center to points on the perimeter)
    * texture (standard deviation of gray-scale values)
    * perimeter
    * area
    * smoothness (local variation in radius lengths)
    * compactness ($\frac{perimeter^2}{area} - 1.0$)
    * concavity (severity of concave portions of the contour)
    * concave points (number of concave portions of the contour)
    * symmetry
    * fractal dimension ("coastline approximation" - 1)

  Columns 3--32 are divided into three groups of ten: the mean (columns 3--12), the standard error (columns 13--22), and the "worst" value, i.e. the mean of the three largest values (columns 23--32), of each of the ten measurements above (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension).
    
The goal is to classify whether the breast cancer is **benign** or **malignant**. 

### Loading the data





In [None]:
from sklearn.datasets import load_breast_cancer
cancer_ds = # TODO 
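
One possible way to complete the cell above, assuming the copy of the data bundled with scikit-learn is acceptable, is to call `load_breast_cancer()`. It returns a dictionary-like `Bunch` object exposing `data`, `target`, `feature_names`, `target_names`, and `DESCR`. This is a sketch of one solution, not the only one:

```python
from sklearn.datasets import load_breast_cancer

# Load the copy of the Wisconsin breast cancer data bundled with scikit-learn
cancer_ds = load_breast_cancer()

# The Bunch behaves like a dictionary with attribute access
print(sorted(cancer_ds.keys()))
print(cancer_ds.data.shape)
```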

### Checking the shape of the data

In [None]:
cancer_ds.data.shape

Checking the name of the features

In [None]:
cancer_ds.feature_names

Listing the classes of the tumor

In [None]:
cancer_ds.target_names

### Checking the number of samples for each class

In [None]:
{n: v for n, v in zip(cancer_ds.target_names, np.bincount(cancer_ds.target))}

In [None]:
print(cancer_ds.DESCR)

According to the description, **0 represents a malignant tumor** and **1 a benign tumor**, and there are 212 malignant examples. Thus, we can check the class counts to ensure that our mapping is correct.

In [None]:
# Class Distribution: 212 - Malignant, 357 - Benign
# TODO (confirm that 0 represents malign tumor)
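
One way to carry out this check, sketched here under the assumption that the dataset comes from `load_breast_cancer()`, is to count the samples per integer label and compare the result with the class distribution quoted in the description:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer_ds = load_breast_cancer()

# np.bincount counts how often each integer label (0, 1) occurs
counts = np.bincount(cancer_ds.target)
class_counts = {str(name): int(n) for name, n in zip(cancer_ds.target_names, counts)}
print(class_counts)

# Label 0 should be the class with 212 samples, i.e. the malignant one
print(cancer_ds.target_names[0], counts[0])
```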

Converting the scikit-learn dataset type to a pandas DataFrame

In [None]:
# TODO
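
A possible completion of the cell above is sketched below. The name `cancer_df` is taken from the cells that follow; building the frame directly from `data` and `feature_names` is one reasonable choice among several:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer_ds = load_breast_cancer()

# One row per sample, one named column per feature
cancer_df = pd.DataFrame(cancer_ds.data, columns=cancer_ds.feature_names)
print(cancer_df.shape)
print(cancer_df.iloc[:3, :4])
```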

In [None]:
cancer_df['diagnosis'] = cancer_ds.target
# cancer_df['diagnosis'] = cancer_df['diagnosis'].map({0:'M', 1:'B'})

In [None]:
cancer_df['diagnosis'].unique()

In [None]:
cancer_df.columns

We can use histograms to check how much each feature helps in classifying malignant and benign tumors. If the two histograms of a feature barely overlap, then that feature is useful for telling the classes apart.

In [None]:
fig, axes = plt.subplots(10,3, figsize=(12, 9))

# filter the data based on the type of diagnosis (0 = malignant, 1 = benign)
malignant = cancer_ds.data[cancer_ds.target==0]
benign = cancer_ds.data[cancer_ds.target==1]

ax = axes.ravel()

for i in range(30):
    _,bins = np.histogram(cancer_ds.data[:,i], bins=40)
    ax[i].hist(malignant[:,i], bins=bins, color='r', alpha=.5)
    ax[i].hist(benign[:,i], bins=bins, color='g', alpha=0.3)
    ax[i].set_title(cancer_ds.feature_names[i],fontsize=9)
    ax[i].axes.get_xaxis().set_visible(False)
    ax[i].set_yticks(())

ax[0].legend(['malignant','benign'], loc='best',fontsize=8)
plt.tight_layout()
plt.show()

It seems that **mean fractal dimension** (row 4, column 1) isn't very useful for discerning _malignant_ from _benign_ tumors. On the other hand, **worst radius**, **worst perimeter**, and **worst concave points** are important features that can give us strong hints about the type of the tumor.

In [None]:
# plt.subplot(1,3,1)
plt.subplots(1,3, figsize=(12, 5))

plt.subplot(1,3,1)
plt.scatter(cancer_df['worst symmetry'], cancer_df['worst texture'], s=cancer_df['worst area']*0.05, color='magenta', label='check', alpha=0.3)
plt.xlabel('Worst Symmetry',fontsize=12)
plt.ylabel('Worst Texture',fontsize=12)

plt.subplot(1,3,2)
plt.scatter(cancer_df['mean radius'], cancer_df['mean concave points'], s=cancer_df['mean area']*0.05, color='purple', label='check', alpha=0.3)
plt.xlabel('Mean Radius',fontsize=12)
plt.ylabel('Mean Concave Points',fontsize=12)

plt.subplot(1,3,3)
plt.scatter(cancer_df['mean radius'], cancer_df['worst perimeter'], s=cancer_df['mean area']*0.05, color='purple', label='check', alpha=0.3)
plt.xlabel('Mean Radius',fontsize=12)
plt.ylabel('Worst perimeter',fontsize=12)

plt.tight_layout()
plt.show()

## Dimension reduction through Principal Component Analysis (PCA)

In [None]:
from sklearn import decomposition, preprocessing

std_scale = preprocessing.StandardScaler().fit(cancer_ds.data)
X_scaled = std_scale.transform(cancer_ds.data)

In [None]:
#TODO reduce to two principal components


In [None]:
# project the two principal components
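
A possible completion of the two cells above is sketched here. The names `pca` and `X_projected` are the ones used by later cells; fitting and projecting in two separate steps mirrors the two cells, although `fit_transform` would do both at once:

```python
import numpy as np
from sklearn import decomposition, preprocessing
from sklearn.datasets import load_breast_cancer

cancer_ds = load_breast_cancer()
X_scaled = preprocessing.StandardScaler().fit_transform(cancer_ds.data)

# Keep only the two directions of largest variance
pca = decomposition.PCA(n_components=2)
pca.fit(X_scaled)

# Project the standardized samples onto those two directions
X_projected = pca.transform(X_scaled)
print(X_projected.shape)

# Share of the total variance captured by each retained component
print(pca.explained_variance_ratio_)
```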

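Computing the explained variance. The ratio below is taken relative to the variance retained by the two projected components only, so the two values sum to 1; the fitted PCA object's `explained_variance_ratio_` attribute instead reports each component's share of the total variance.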

In [None]:
explained_variance = np.var(X_projected, axis=0)
explained_variance_ratio = explained_variance / np.sum(explained_variance)
print (explained_variance_ratio)

### Creating a pandas DataFrame from reduced data

In [None]:
## TODO
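
One way to build this DataFrame is sketched below. The column names `pc1`, `pc2`, and `diagnosis` are the ones the plotting cell expects; the projection is recomputed here only to keep the example self-contained:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer_ds = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer_ds.data)
X_projected = PCA(n_components=2).fit_transform(X_scaled)

# Two component columns plus the class label of each sample
cancer_pca_df = pd.DataFrame(X_projected, columns=['pc1', 'pc2'])
cancer_pca_df['diagnosis'] = cancer_ds.target
print(cancer_pca_df.head())
```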

In [None]:
fig, ax = plt.subplots(figsize=(7, 5))
fig.patch.set_facecolor('white')

markers = {0: '*', 1: 'o'}
alphas = {0: .3, 1: .5}
labels = {0: 'Malignant', 1: 'Benign'}
colors = {0: 'r', 1: 'g'}

# plot each class separately so that it gets its own marker, color, and legend entry
for target in sorted(cancer_pca_df['diagnosis'].unique()):
    ix = cancer_pca_df['diagnosis'] == target
    ax.scatter(cancer_pca_df.loc[ix, 'pc1'], cancer_pca_df.loc[ix, 'pc2'], c=colors[target], s=50,
               label=labels[target], marker=markers[target], alpha=alphas[target])

ax.legend()  # picks up the label passed to each scatter call
ax.set_xlabel('First Principal Component', fontsize=14)
ax.set_ylabel('Second Principal Component', fontsize=14)
plt.show()

In [None]:
plt.matshow(pca.components_, cmap='viridis')
plt.yticks([0,1],['1st Comp','2nd Comp'], fontsize=10)
plt.colorbar()
plt.xticks(range(len(cancer_ds.feature_names)),cancer_ds.feature_names,rotation=65,ha='left')
plt.tight_layout()
plt.show()

In [None]:
feature_worst = list(cancer_df.columns[20:31])
s = sns.heatmap(cancer_df[feature_worst].corr(),cmap='coolwarm') 
s.set_yticklabels(s.get_yticklabels(),rotation=30,fontsize=7)
s.set_xticklabels(s.get_xticklabels(),rotation=30,fontsize=7)
plt.show()

## Classification model: Logistic Regression 

Logistic regression is a classification model that is simple to implement and performs very well on linearly separable classes. It is one of the most widely used classification algorithms in industry.

We can understand logistic regression as a probabilistic model built on the **odds ratio**, that is, the odds in favor of a particular event. The odds ratio can be written as $\frac{p}{1-p}$, where $p$ stands for the probability of the positive event. Positive in this case does not necessarily mean _good_; it refers to the event that we want to predict, for example, that a patient has a certain disease, or that an email is spam rather than ham. We can see the positive event as the class label $y = 1$. We can then define the **logit** function as the logarithm of the odds ratio (the log-odds):

$$logit(p) = \log\frac{p}{1-p}$$

The logit function takes input values in the range 0 to 1 and transforms them to values over the entire real line $\mathbb{R}$, which we can use to express a linear relationship between the feature values and the log-odds:

$$logit(p(y=1|x)) = w_0x_0 + w_1x_1 + \ldots{} + w_mx_m = \sum\limits_{i=0}^{m}w_ix_i = w^{T}x$$

where, $p(y=1|x)$ is the conditional probability that a particular sample belongs to the class $1$ given its features $x$. 

To predict the probability that a certain sample belongs to a specific class, we apply the inverse of the logit function, the logistic _sigmoid_ function:

$$\phi(z) = \frac{1}{1+e^{-z}}$$

where $z$ is the linear combination of the weights and the sample features, calculated as $z = w^Tx = w_0x_0 + w_1x_1+\ldots+w_mx_m$, with $x_0 = 1$ so that $w_0$ acts as the bias (intercept) term.

Under the assumption of linear boundaries separating the classes, the posterior probability of class $C_1$ can be written as a logistic sigmoid acting on a linear function of the feature vector $X$, so that $\mathbb{P}(C_1|X) = \sigma(\beta^TX)$ (here $\sigma$ and $\beta$ play the roles of $\phi$ and $w$ above).
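
The relationship between the log-odds and the sigmoid can be checked numerically. The following minimal sketch defines the logit as the logarithm of the odds $\frac{p}{1-p}$ and shows that the sigmoid undoes it; the probability values are arbitrary illustrations:

```python
import numpy as np

def logit(p):
    # log-odds of a probability p in (0, 1)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    # logistic sigmoid, maps a real number to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])
z = logit(p)
print(z)           # log-odds; a probability of 0.5 has log-odds 0
print(sigmoid(z))  # recovers the original probabilities
```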

We can check this by plotting the sigmoid function for values in the range between -5 and 5.

In [None]:
def sigmoid(z):
    # logistic sigmoid: maps any real number to the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.arange(-5, 5, 0.1)
phi_z = sigmoid(z)

plt.plot(z, phi_z)
plt.axvline(0.0, color='k')
plt.axhspan(0.0, 1.0, facecolor='1.0', alpha=1.0, ls='dotted')
plt.axhline(y=0.5, ls='dotted', alpha=1.0, color='k')
plt.yticks([0.0, 0.5, 1.0])
plt.ylim(-0.1, 1.1)
plt.xlabel('z')
plt.ylabel(r'$\phi(z)$')  # raw string so that \p is not read as an escape sequence
plt.show()

$\phi(z)$ approaches 1 as $z$ goes towards infinity ($z \to \infty$), since $e^{-z}$ becomes very small for large values of $z$. Similarly, $\phi(z)$ goes towards $0$ as $z \to -\infty$, as a result of an increasingly large denominator. We can conclude that the _sigmoid_ function takes real-valued inputs and transforms them to values in the range (0, 1), crossing the vertical axis at $\phi(0) = 0.5$.

The output of the _sigmoid_ function is then interpreted as the probability that a particular sample belongs to class 1, $\phi(z) = \mathbb{P}(y=1|x; w)$, given its features $x$ parametrized by the weights $w$. The probability can then be converted into a binary outcome via a unit step function:

$$\hat{y} =
  \begin{cases}
    1 & \text{if $\phi(z) \geq 0.5$} \\    
    0 & \text{otherwise}
  \end{cases}$$
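
The thresholding rule can be sketched in a few lines of NumPy. The net inputs below are hypothetical values chosen for illustration; because $\phi(0) = 0.5$, thresholding the probability at 0.5 is the same as thresholding $z$ at 0:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical net inputs z = w^T x for five samples
z = np.array([-3.0, -0.5, 0.0, 0.7, 4.0])
phi = sigmoid(z)

# Unit step: predict class 1 when the estimated probability is at least 0.5
y_hat = np.where(phi >= 0.5, 1, 0)
print(y_hat)

# Equivalent rule expressed on z directly
print(np.where(z >= 0.0, 1, 0))
```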

### Using the scikit-learn implementation of LogisticRegression

scikit-learn implements Logistic Regression through the class [linear_model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

In [None]:
# TODO import the libraries
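
The later cells call `LogisticRegression` directly, so the import they rely on could look like the sketch below. The constructor arguments simply mirror the ones used further down in this notebook:

```python
from sklearn.linear_model import LogisticRegression

# Instantiating the class does not train anything yet;
# a very large C means very weak regularization
lr = LogisticRegression(C=1e7, solver='liblinear')
print(lr.get_params()['solver'])
```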

### Splitting the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split
# from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(cancer_ds.data, cancer_ds.target, random_state=10)

Creating the logistic model

In [None]:
lr = LogisticRegression(C=1e7, solver='liblinear')
lrm = lr.fit(X_train, y_train)

### Evaluating the model

In [None]:
print("Training set score: {:.3f}".format(lrm.score(X_train, y_train)))
print("Test set score: {:.3f}".format(lrm.score(X_test, y_test)))

### Checking the prediction of the model using the reduced data

In [None]:
X = cancer_pca_df.drop(['diagnosis'], axis = 1).values
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X, cancer_pca_df.diagnosis, random_state=1)

lrm_pca = LogisticRegression(C=1e10, solver='liblinear').fit(X_train_pca, y_train_pca)

In [None]:
print("Training set score: {:.3f}".format(lrm_pca.score(X_train_pca, y_train_pca)))
print("Test set score: {:.3f}".format(lrm_pca.score(X_test_pca, y_test_pca)))

## Cross-validation

In [None]:
# from sklearn import cross_validation
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

# shuffling is off by default, so random_state is omitted
# (recent scikit-learn versions reject random_state without shuffle=True)
kfold = StratifiedKFold(n_splits=10)

scores = []
for i, (train, test) in enumerate(kfold.split(X_train, y_train)):
    # fit a fresh copy so that lrm keeps the weights learned on the full training set
    fold_model = clone(lrm).fit(X_train[train], y_train[train])
    score = fold_model.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold %s, Class dist.: %s, Acc. %.3f' % (i + 1, np.bincount(y_train[train]), score))

print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

In [None]:
# from sklearn import cross_validation
from sklearn.base import clone
from sklearn.model_selection import KFold

# plain k-fold: the folds are not stratified by class
kf = KFold(n_splits=10)

scores = []
for i, (train, test) in enumerate(kf.split(X_train)):
    # fit a fresh copy so that lrm keeps the weights learned on the full training set
    fold_model = clone(lrm).fit(X_train[train], y_train[train])
    score = fold_model.score(X_train[test], y_train[test])
    scores.append(score)
    print('Fold %s, Class dist.: %s, Acc. %.3f' % (i + 1, np.bincount(y_train[train]), score))

print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

scikit-learn also implements a k-fold cross-validation scorer through `cross_val_score`, which enables us to evaluate a model using stratified k-fold cross-validation more efficiently. 

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator=lrm,
                         X=X_train_pca,
                         y=y_train_pca,
                         cv=10,
                         n_jobs=1)
print('Cross-validation accuracy scores: %s' %(scores))
print('CV accuracy: %.3f +/- %.3f' %(np.mean(scores), np.std(scores)))

### Looking at the different performance evaluation metrics

In [None]:
from sklearn.metrics import confusion_matrix

y_pred = lrm.predict(X_test)
conf_matrix = #TODO

print('Confusion matrix \n{}'.format(conf_matrix))
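
A small sketch of how `confusion_matrix` behaves is given below, using hypothetical toy labels rather than the notebook's predictions. In the cell above, the analogous call would presumably be `confusion_matrix(y_true=y_test, y_pred=y_pred)`:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical), with 0 = malignant and 1 = benign
y_true = np.array([0, 0, 1, 1, 1])
y_hat = np.array([0, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true=y_true, y_pred=y_hat)
print(cm)
```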

Formatting the confusion matrix

In [None]:
fig, ax = plt.subplots(figsize=(2.5, 2.5))
ax.matshow(conf_matrix, cmap=plt.cm.Blues, alpha=0.3)
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i,j], va='center', ha='center')

plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()

### Computing precision, recall, and F1-score

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

print('Precision: %.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('Recall: %.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1-score: %.3f' % f1_score(y_true=y_test, y_pred=y_pred))
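
To make these three metrics concrete, the sketch below recomputes them by hand from the confusion-matrix counts of a hypothetical toy example. By default the scikit-learn scorers treat label 1 as the positive class, which in this dataset is the benign class:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels (hypothetical); class 1 is treated as the positive class
y_true = np.array([0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

Passing `pos_label=0` to the scorers would report the same metrics for the malignant class instead.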

### Plotting the ROC curve

In [None]:
from sklearn.base import clone
from sklearn.metrics import roc_curve, auc

# keep only two features (columns 4 and 14) for this experiment
X_train2 = X_train[:, [4, 14]]

fig = plt.figure(figsize=(7, 5))
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)

for i, (train, test) in enumerate(kfold.split(X_train2, y_train)):
    # fit a fresh copy on this fold and predict probabilities for the held-out part
    probas = clone(lrm).fit(X_train2[train], y_train[train]).predict_proba(X_train2[test])
    fpr, tpr, thresholds = roc_curve(y_train[test], probas[:, 1], pos_label=1)
    # np.interp is the function that the deprecated scipy.interp alias pointed to
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=1, label='ROC Fold %d (area = %0.2f)' % (i + 1, roc_auc))

plt.plot([0, 1], [0, 1], linestyle='--', color=(0.6, 0.6, 0.6), label='random guessing')
mean_tpr /= kfold.n_splits
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, 'k--', label='mean ROC (area = %0.2f)' % mean_auc, lw=2)
plt.plot([0, 0, 1],
         [0, 1, 1],
         lw=2,
         linestyle=':',
         color='black',
         label='perfect performance')

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()


## Combining transformers and estimators in a pipeline

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline_lr = Pipeline([('scl', StandardScaler()), 
                        ('pca', PCA(n_components=2)),
                        ('clf', LogisticRegression(C=1e7, solver='liblinear', random_state=1))])
pipeline_lr.fit(X_train, y_train)
print('Test accuracy: %.3f' %pipeline_lr.score(X_test, y_test))
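
A pipeline can also be passed to `cross_val_score` as a single estimator. The sketch below, which reuses the same steps and split parameters as the cells above, shows the idea: the scaler and PCA are refitted on the training part of every fold, so no information from the held-out part leaks into the preprocessing:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cancer_ds = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer_ds.data, cancer_ds.target, random_state=10)

pipeline_lr = Pipeline([('scl', StandardScaler()),
                        ('pca', PCA(n_components=2)),
                        ('clf', LogisticRegression(C=1e7, solver='liblinear', random_state=1))])

# Each fold refits the scaler, PCA, and classifier on its own training part
scores = cross_val_score(pipeline_lr, X_train, y_train, cv=10)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))
```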