https://www.dropbox.com/s/py7m9jsfquhb6w4/Photo%2006.04.15%2000%2024%2020.png

# A classification problem
This example uses the Breast Cancer Wisconsin (Diagnostic) dataset. It contains measurements of breast cancer tumors together with a classification label: malignant or benign (self-note: pronunciation [bɪˈnaɪn]). There are 569 instances, each described by 30 attributes (also called characteristics or features) such as tumor radius, texture, smoothness, and area.

We will build a machine learning model that uses tumor information to predict whether a tumor is malignant or benign.

The first step is to import **sklearn** and the dataset **breast_cancer**.



In [0]:
import sklearn
from sklearn.datasets import load_breast_cancer
print(load_breast_cancer)

Load the dataset into the *data* variable and separate its components: class names, labels, attribute names, and instances.

In [0]:
data = load_breast_cancer()

classes = data['target_names']
labels = data['target']

attributes = data['feature_names']
instances = data['data']

print(len(instances))  # a bare expression in the middle of a cell is not displayed
print(labels)

The class names are malignant and benign, which are mapped to the binary values 0 (malignant) and 1 (benign). Our goal is to be able to diagnose patients according to tumor characteristics.

This type of classification is possible when a learning algorithm induces a model based on the relationships between labels and attributes.
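
To make the mapping between class names and integer codes concrete, here is a minimal sketch. It uses a made-up label array (`toy_labels`, not the real dataset) and assumes the class order `['malignant', 'benign']`, which is the order `load_breast_cancer` uses for `target_names`.

```python
import numpy as np

# Same order as data['target_names'] in load_breast_cancer.
classes = np.array(['malignant', 'benign'])

# Toy labels, invented for illustration only.
toy_labels = np.array([0, 0, 1, 1, 1, 0, 1])

# Indexing the names with the integer codes recovers the class names.
toy_names = classes[toy_labels]
print(toy_names)

# Count how many instances fall in each class.
counts = np.bincount(toy_labels, minlength=len(classes))
for name, count in zip(classes, counts):
    print(name, count)

# Because benign is coded as 1, the mean of the labels is the benign fraction.
benign_fraction = toy_labels.mean()
print(benign_fraction)
```

This is also why `sum(labels) / len(labels)` can be read as the prevalence of benign cases.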


In [0]:
print(classes) # only names

In [0]:
labels[10:30] # y: only values

In [0]:
sum(labels) / len(labels) # out of curiosity...

In [0]:
print(attributes) # only names

In [0]:
print(instances[2:4]) # X: only values

To evaluate the performance of a classifier, we must always test the model on new individuals, i.e. those whose classes were never shown to the induced model.

To simulate the existence of new individuals, the data should be divided into two parts before constructing a model: a training set and a testing set.

We use the training set to train and evaluate/select a model during the development stage. *More on this later*.

Then we use the induced (trained) model to make predictions on the testing set. This approach gives us an estimate of the model's generalization performance and robustness.

The train_test_split function helps us with this task. For now, we will experiment directly on the testing data for simplicity, but at the end we will simulate a more realistic training/evaluation/selection/testing workflow.


In [0]:
from sklearn.model_selection import train_test_split
train, test, train_labels, test_labels = train_test_split(instances,
                                                          labels,
                                                          test_size=0.33,
                                                          random_state=42)
print('# training instances', len(train))
print('# testing instances', len(test))

We now have a testing set that represents 33% of the original dataset.
The remaining data (train) forms the training data.
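
As a rough illustration of what a hold-out split does, the sketch below shuffles row indices and cuts them in two. The function `holdout_split` and the toy arrays are hypothetical; this is not how sklearn implements `train_test_split`, only the same idea in a few lines of NumPy.

```python
import numpy as np

def holdout_split(X, y, test_size=0.33, seed=42):
    """Shuffle the rows, then cut them into a training and a testing part."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))             # random order of row indices
    n_test = int(np.ceil(test_size * len(X)))   # number of rows held out
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 100 instances with 3 attributes each.
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 2
X_train, X_test, y_train, y_test = holdout_split(X, y)
print(len(X_train), len(X_test))
```

Every row ends up in exactly one of the two parts, so the model never sees a testing instance during training.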

There are many algorithms for machine learning, and each of them has strengths and weaknesses (*learning bias*, complexity, interpretability, ...).

A simple and fast classification algorithm is Naive Bayes (NB).

Let's import the GaussianNB class and induce the model with the fit() method.

In [0]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(train, train_labels)



We can use the induced model to make predictions on our testing set with the predict() method.
It returns an array with one predicted label per instance in the testing set. We can then print the predictions to get an idea of what the model has to say about the health of the individuals in the testing set.

In [0]:
preds = gnb.predict(test)
print(preds)

In [0]:
sum(preds) / len(test), sum(preds) # out of curiosity..., what is the prevalence of healthy individuals in prediction?

In [0]:
sum(test_labels) / len(test), sum(test_labels) # out of curiosity..., what is the prevalence of healthy individuals in reality?

It seems that at least one person could have died had we relied on Naive Bayes. Let's check how many false negatives there are.

In [0]:
false_negatives = (1 - test_labels) * preds
sum(false_negatives)

Actually, 6 people should not trust Naive Bayes.

In [0]:
false_positives = test_labels * (1 - preds)
sum(false_positives)

And 5 people would probably be needlessly worried and screened again with more accurate (and more expensive) laboratory tests.

We could also evaluate the accuracy of our model's predictions with sklearn's accuracy_score() function. However, it does not take the different types of errors into account. Precision, recall, F-score and other measures can do a better job in medical areas (and are included in sklearn). We will stick with the 'miss rate' here for simplicity.

In [0]:
from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels, preds))

In [0]:
sum(false_negatives) / sum(1 - test_labels) # miss rate [ERRATA: it was 'preds' in denominator]
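
The error counts and rates discussed above can be gathered in one place. The sketch below is a hypothetical helper (`malignant_metrics`) run on invented toy labels; it assumes, as in this dataset, that malignant is coded as 0 and benign as 1, and treats malignant as the "positive" class.

```python
import numpy as np

def malignant_metrics(y_true, y_pred):
    """Error counts and rates, treating malignant (label 0) as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 0) & (y_pred == 0)))  # malignant, detected
    fn = int(np.sum((y_true == 0) & (y_pred == 1)))  # malignant, missed
    fp = int(np.sum((y_true == 1) & (y_pred == 0)))  # benign, flagged as malignant
    tn = int(np.sum((y_true == 1) & (y_pred == 1)))  # benign, cleared
    return {
        'tp': tp, 'fn': fn, 'fp': fp, 'tn': tn,
        'recall': tp / (tp + fn),      # equals 1 - miss rate
        'miss_rate': fn / (tp + fn),
        'precision': tp / (tp + fp),
    }

# Toy labels and predictions, invented for illustration.
toy_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
m = malignant_metrics(toy_true, toy_pred)
print(m)
```

Note that the miss rate divides by the number of actual malignant cases, not by the number of predictions.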

# Cross validation
For a better estimate, and given the small number of examples (569), we can use cross-validation.

In [0]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(gnb, train, train_labels, cv=10)
scores

Considering an approximate 95% confidence interval for the mean accuracy (mean ± 2 standard errors, significance level 0.05)...

In [0]:
import numpy
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2 / numpy.sqrt(len(scores))))

...roughly 95 times out of a hundred attempts the mean accuracy will be between 0.91 and 0.97 (according to https://www.mathsisfun.com/data/confidence-interval.html).
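
The interval arithmetic can be spelled out step by step. The helper `mean_with_interval` and the fold accuracies below are hypothetical (they are not the output of the cell above). The sketch follows the same formula as the notebook, two standard errors around the mean, and should be read as a rough approximation since cross-validation folds are not independent samples.

```python
import numpy as np

def mean_with_interval(scores, z=2.0):
    """Mean of the fold scores and z standard errors of that mean."""
    scores = np.asarray(scores, dtype=float)
    standard_error = scores.std() / np.sqrt(len(scores))
    return scores.mean(), z * standard_error

# Hypothetical fold accuracies, invented for illustration.
toy_scores = [0.90, 0.95, 1.00, 0.95, 0.90, 0.95, 1.00, 0.95, 0.90, 0.95]
mean, half_width = mean_with_interval(toy_scores)
print("Accuracy: %0.3f, interval: [%0.3f, %0.3f]"
      % (mean, mean - half_width, mean + half_width))
```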

# Neural network

How well would a neural network perform? Hopefully, nothing *deep* is needed here...

In [0]:
# cleaning excess of output regarding MLP convergence
import warnings, os
warnings.simplefilter("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

from sklearn.neural_network import MLPClassifier as MLP

mlp = MLP()
model = mlp.fit(train, train_labels)
model

In [0]:
preds = mlp.predict(test)
false_negatives = (1 - test_labels) * preds
sum(false_negatives)

2 fewer people at risk.

In [0]:
false_positives = test_labels * (1 - preds)
sum(false_positives)

3 fewer worried people. However, the cross-validated accuracy is "worse" (0.93 < 0.94). How is that possible?

In [0]:
scores = cross_val_score(mlp, train, train_labels, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2 / numpy.sqrt(len(scores))))

We can implement our own scoring function for cross-validation: the function fn counts the number of false negatives.

In [0]:
from sklearn.metrics import make_scorer
def fn(y_true, y_pred): return sum((1 - y_true) * y_pred)
scoring = make_scorer(fn)
scoring

In [0]:
scores = cross_val_score(mlp, train, train_labels, cv=10, scoring=scoring)
print("FN: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2 / numpy.sqrt(len(scores))))
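
A raw count of false negatives depends on the fold size, so a rate can be easier to compare. The sketch below is one possible variant, not the notebook's own method: a hypothetical `miss_rate` function wrapped with `make_scorer` and cross-validated with Gaussian Naive Bayes on the full dataset. It assumes malignant is coded as 0, as in `load_breast_cancer`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def miss_rate(y_true, y_pred):
    """Fraction of malignant cases (label 0) predicted as benign (label 1)."""
    malignant = (y_true == 0)
    return ((y_pred == 1) & malignant).sum() / malignant.sum()

X, y = load_breast_cancer(return_X_y=True)

# Default greater_is_better=True keeps the values as plain rates for reporting.
miss_rate_scorer = make_scorer(miss_rate)
miss_rates = cross_val_score(GaussianNB(), X, y, cv=10, scoring=miss_rate_scorer)
print("Miss rate: %0.2f (min %0.2f, max %0.2f)"
      % (miss_rates.mean(), miss_rates.min(), miss_rates.max()))
```

If such a scorer were used for automatic model selection, where higher scores win, it would make sense to pass `greater_is_better=False` to `make_scorer` so that lower miss rates are preferred.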

Let's try to improve it by taking advantage of probability predictions (without cross-validation, for simplicity).

In [0]:
probs = mlp.predict_proba(test)[:,1] # taking second column, since there the value 1 means class 1 (fst col = 1 - snd col)
print(probs[:6])



In [0]:
def rectify(threshold): return lambda predictions: [0 if x < threshold else 1 for x in predictions]
rectify(0.5)(probs[:6])

In [0]:
rectify(0.5)(probs[:20])

In [0]:
rectify(0.9)(probs[:20])

In [0]:
rectify(0.99)(probs[:20])

In [0]:
rectify(0.999)(probs[:20])

Only malignant tumors predicted. Going in the opposite direction...

In [0]:
rectify(0.5)(probs[:20])

In [0]:
rectify(0.1)(probs[:20])

In [0]:
rectify(0.01)(probs[:20])

In [0]:
rectify(0.001)(probs[:20])

In [0]:
rectify(0.000001)(probs[:20])

It seems that the MLP refuses to consider all tumors benign. Some patients are lucky in the sense of being correctly diagnosed.

As long as the threshold has some effect on predictions, we can calibrate it to avoid leaving patients undiagnosed, at the expense of the healthy ones, who will have to undergo unnecessary additional tests.

In [0]:
calibrated_preds = rectify(0.5)(probs)
false_negatives = (1 - test_labels) * calibrated_preds
false_positives = test_labels * (1 - numpy.array(calibrated_preds))
print('usual threshold... undiagnosed:', sum(false_negatives), '     wasting additional tests:', sum(false_positives))

In [0]:
calibrated_preds = rectify(0.6)(probs)
false_negatives = (1 - test_labels) * calibrated_preds
false_positives = test_labels * (1 - numpy.array(calibrated_preds))
print('threshold 0.6... undiagnosed:', sum(false_negatives), '     wasting additional tests:', sum(false_positives))

In [0]:
calibrated_preds = rectify(0.7)(probs)
false_negatives = (1 - test_labels) * calibrated_preds
false_positives = test_labels * (1 - numpy.array(calibrated_preds))
print('threshold 0.7... undiagnosed:', sum(false_negatives), '     wasting additional tests:', sum(false_positives))

Losing money, but saving lives.

In [0]:
calibrated_preds = rectify(0.4)(probs) # Out of curiosity...
false_negatives = (1 - test_labels) * calibrated_preds
false_positives = test_labels * (1 - numpy.array(calibrated_preds))
print('threshold 0.4... undiagnosed:', sum(false_negatives), '     wasting additional tests:', sum(false_positives))

In [0]:
calibrated_preds = rectify(0.3)(probs)
false_negatives = (1 - test_labels) * calibrated_preds
false_positives = test_labels * (1 - numpy.array(calibrated_preds))
print('threshold 0.3... undiagnosed:', sum(false_negatives), '     wasting additional tests:', sum(false_positives))

Saving money (high *precision*), but losing lives (*"low" recall*).
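
The trade-off can be seen in a single loop instead of one cell per threshold. The sketch below uses a hypothetical helper (`sweep_thresholds`) and invented probabilities and labels, not the MLP output. It keeps the notebook's convention: the probability refers to class 1 (benign), and anything below the threshold is predicted as 0 (malignant).

```python
import numpy as np

def sweep_thresholds(probs_benign, y_true, thresholds):
    """For each threshold, count missed malignant cases and needless alarms."""
    probs_benign = np.asarray(probs_benign)
    y_true = np.asarray(y_true)
    rows = []
    for t in thresholds:
        y_pred = (probs_benign >= t).astype(int)   # 1 = benign, 0 = malignant
        fn = int(np.sum((1 - y_true) * y_pred))    # malignant predicted benign
        fp = int(np.sum(y_true * (1 - y_pred)))    # benign predicted malignant
        rows.append((t, fn, fp))
    return rows

# Toy probabilities of being benign and toy true labels (invented).
toy_probs = [0.05, 0.35, 0.55, 0.65, 0.45, 0.75, 0.95, 0.99]
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
rows = sweep_thresholds(toy_probs, toy_true, [0.3, 0.5, 0.7])
for t, fn, fp in rows:
    print('threshold %.1f... undiagnosed: %d     wasting additional tests: %d' % (t, fn, fp))
```

On this toy data, raising the threshold from 0.3 to 0.7 brings the undiagnosed count from 3 down to 0 while the number of needless extra tests goes from 0 to 1.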

See more here:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

# A harder non-medical problem

Vegetation cover type/satellite data. The dataset is undersampled due to time and hardware restrictions.

In [0]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_covtype  # public loader; the covtype submodule is private in recent sklearn versions
data = fetch_covtype()
from imblearn.under_sampling import RandomUnderSampler
cc = RandomUnderSampler(random_state=42)
dados, rótulos = cc.fit_resample(data['data'], data['target'])
train, test, train_labels, test_labels = train_test_split(dados,
                                                          rótulos,
                                                          test_size=0.33,
                                                          random_state=42)
len(train)
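
For intuition about what random undersampling does, here is a minimal NumPy sketch. The function `random_undersample` and the toy arrays are hypothetical, and it only approximates the default behaviour of `RandomUnderSampler` as I understand it: every class is randomly reduced to the size of the rarest class.

```python
import numpy as np

def random_undersample(X, y, seed=42):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy imbalanced data: 50 rows of class 1, 30 of class 2, 5 of class 3.
y_toy = np.repeat([1, 2, 3], [50, 30, 5])
X_toy = np.arange(len(y_toy) * 2).reshape(-1, 2)
X_small, y_small = random_undersample(X_toy, y_toy)
print(np.unique(y_small, return_counts=True))
```

The result is a smaller, balanced dataset, which is cheaper to train on but discards information from the larger classes.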

Estimation for several classifiers.

In [0]:
mlp = MLP()
scores = cross_val_score(mlp, train, train_labels, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2 / numpy.sqrt(len(scores))))

In [0]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
scores = cross_val_score(gnb, train, train_labels, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2 / numpy.sqrt(len(scores))))

In [0]:
from sklearn.neighbors import KNeighborsClassifier  # sklearn.neighbors.classification was removed in recent versions
knn = KNeighborsClassifier(n_neighbors = 5)
scores = cross_val_score(knn, train, train_labels, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2 / numpy.sqrt(len(scores))))

In [0]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 100, random_state = 42)
scores = cross_val_score(rf, train, train_labels, cv=10)
print("Accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2 / numpy.sqrt(len(scores))))

Now that we have chosen an algorithm, does it generalize with the same accuracy on unseen data? The reported spread of 0.003 suggests low variability.

In [0]:
real_model = rf.fit(train, train_labels)
preds = real_model.predict(test)
print(accuracy_score(test_labels, preds))

Indeed, the generalization accuracy is not far from the model selection accuracy.
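
The same select-then-test workflow can be sketched end to end on a dataset that ships with sklearn, so it runs without downloads. This is only an illustration on the breast cancer data, not a rerun of the cover type experiment: cross-validation on the training part gives the selection estimate, and the testing part is used once at the end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Model selection estimate: uses the training part only.
cv_accuracy = cross_val_score(rf, X_train, y_train, cv=10).mean()

# Final check: fit once on all training data, score once on the untouched test part.
test_accuracy = accuracy_score(y_test, rf.fit(X_train, y_train).predict(X_test))

print("cross-validated: %0.3f   held-out: %0.3f" % (cv_accuracy, test_accuracy))
```

If the two numbers are close, the cross-validated estimate was a reasonable guide to performance on unseen data.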

Based on:

https://www.digitalocean.com/community/tutorials/como-construir-um-classificador-de-machine-learning-em-python-com-scikit-learn-pt
