- Read the dataset from the URL into a dataframe.
- Display the first few rows to make sure it was read properly.

In [1]:
import pandas as pd
# Reading the dataset into a pandas dataframe object
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
                header=None)
# Print some data to make sure the data was properly read into the dataframe
print df.head(n=3)

         0  1      2      3      4       5        6        7       8   \
0    842302  M  17.99  10.38  122.8  1001.0  0.11840  0.27760  0.3001   
1    842517  M  20.57  17.77  132.9  1326.0  0.08474  0.07864  0.0869   
2  84300903  M  19.69  21.25  130.0  1203.0  0.10960  0.15990  0.1974   

        9    ...        22     23     24      25      26      27      28  \
0  0.14710   ...     25.38  17.33  184.6  2019.0  0.1622  0.6656  0.7119   
1  0.07017   ...     24.99  23.41  158.8  1956.0  0.1238  0.1866  0.2416   
2  0.12790   ...     23.57  25.53  152.5  1709.0  0.1444  0.4245  0.4504   

       29      30       31  
0  0.2654  0.4601  0.11890  
1  0.1860  0.2750  0.08902  
2  0.2430  0.3613  0.08758  

[3 rows x 32 columns]


Let us get a sense of the size of the dataset we are dealing with.

In [2]:
rows, columns = df.shape
print "#rows: ", rows
print "#columns: ", columns

#rows:  569
#columns:  32


The first column is just an ID. The second column is the actual diagnosis for the tumour: M (malignant) or B (benign).

The column numbers 3-32 (indices 2-31 in the dataframe) hold the 30 features of the dataset.

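Assign the 30 features of the dataset and the target variable to separate numpy arrays. Because the file was read with ```header=None```, the column labels are the integers 0-31 and coincide with the column positions, so the next 2 lines give the same result if we replace ```iloc``` with ```loc```.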

In [3]:
X = df.iloc[:, 2:].values
Y = df.iloc[:, 1].values
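
The equivalence of ```iloc``` and ```loc``` only holds while column labels and column positions agree. The sketch below illustrates this on a toy frame built from the first few values printed above (the frame and variable names are illustrative, and the code is written for Python 3 rather than the Python 2 used in the cells).

```python
import pandas as pd

# Toy frame with default integer column labels, as produced by header=None.
toy = pd.DataFrame([[842302, "M", 17.99, 10.38],
                    [842517, "M", 20.57, 17.77],
                    [84300903, "M", 19.69, 21.25]])

# Labels and positions coincide, so both selections return the same data.
same_features = toy.iloc[:, 2:].equals(toy.loc[:, 2:])
same_target = toy.iloc[:, 1].equals(toy.loc[:, 1])
print(same_features, same_target)

# After reordering the columns, labels and positions no longer line up.
shuffled = toy[[3, 2, 1, 0]]
by_position = shuffled.iloc[:, 1].tolist()  # second column by position (label 2)
by_label = shuffled.loc[:, 1].tolist()      # column whose label is 1
print(by_position, by_label)
```

Once the columns are renamed or reordered, ```iloc``` remains the safer choice for purely positional slicing.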

Transform the class labels from their original string representation (M & B) to integers. We do this with ```LabelEncoder``` from the preprocessing module of scikit-learn.

In [4]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Y = le.fit_transform(Y)
print Y[:50]

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 0 1 1 1 1 1 1 1 1 0 1 0 0]
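
To see which string ends up as which integer, one can inspect the fitted encoder. The snippet below is a small sketch on a hypothetical four-element label list (Python 3 syntax); ```LabelEncoder``` sorts the unique labels, so B maps to 0 and M maps to 1.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["M", "B", "B", "M"])  # hypothetical labels

classes = list(encoder.classes_)                     # sorted unique labels
decoded = list(encoder.inverse_transform([0, 1]))    # integers back to strings
print(list(encoded), classes, decoded)
```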


We can now see that the Y array has been converted from a string label (M & B) to an integer label (1 & 0). Let us now split the dataset into a training set (80%) and a test set (20%).

In [5]:
from sklearn.cross_validation import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
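
As a quick sanity check on the split sizes, the arithmetic can be done by hand. This sketch assumes the test count is rounded up and the remainder goes to training, which is consistent with the support of 114 reported later in the classification report.

```python
import math

n_samples = 569   # rows reported by df.shape
test_size = 0.2   # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```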

Create a pipeline that chains the following steps:
- For optimal performance, transform all feature values onto the same scale, i.e. standardize the columns before feeding them to the classifier.
- Compress the initial 30-dimensional data into a lower 2-dimensional space using Principal Component Analysis (PCA).
- Apply the logistic regression algorithm.
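
The standardization in the first step amounts to subtracting each column's mean and dividing by its standard deviation. A minimal NumPy sketch on a hypothetical toy matrix (using the population standard deviation, which is what ```StandardScaler``` computes):

```python
import numpy as np

# Hypothetical toy matrix: two features on very different scales.
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)        # population standard deviation (ddof=0)
X_std = (X_toy - mean) / std   # each column now has mean 0 and std 1
print(X_std)
```

After scaling, both columns contain the same values, so neither feature dominates the PCA step merely because of its units.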

In [6]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

Define the steps of the pipeline. Each step is a tuple of a name and either a transformer or an estimator object; every step except the last one must be a transformer.

In [7]:
classification_pipeline = Pipeline([('standard_scaler', StandardScaler()),
                                   ('pca', PCA(n_components=2)),
                                   ('classifier', LogisticRegression(random_state=1))])
classification_pipeline.fit(X_train, Y_train)

Pipeline(steps=[('standard_scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, n_components=2, whiten=False)), ('classifier', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
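
To make explicit what the fitted pipeline does, the sketch below writes the same chain out by hand and checks that both give identical predictions. It uses synthetic data from ```make_classification``` as a stand-in for the real dataset and is written for Python 3 with a recent scikit-learn; it illustrates the idea and is not a description of the library internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical, for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=30, random_state=1)

pipe = Pipeline([("standard_scaler", StandardScaler()),
                 ("pca", PCA(n_components=2)),
                 ("classifier", LogisticRegression(random_state=1))])
pipe.fit(X_demo, y_demo)

# The same chain by hand: fit and transform each step, then fit the classifier.
scaler = StandardScaler()
pca = PCA(n_components=2)
clf = LogisticRegression(random_state=1)
X_reduced = pca.fit_transform(scaler.fit_transform(X_demo))
clf.fit(X_reduced, y_demo)

# At prediction time only transform() is applied before the classifier.
manual_pred = clf.predict(pca.transform(scaler.transform(X_demo)))
same = np.array_equal(pipe.predict(X_demo), manual_pred)
print(same)
```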

Compute the accuracy of the model on the test data.

In [8]:
accuracy = classification_pipeline.score(X_test, Y_test)
print "Test Accuracy: %.3f" % accuracy

Test Accuracy: 0.947


Instead of just a single test on the test data, let's do a stratified k-fold cross-validation (k = 10) on the training data.

In [9]:
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator=classification_pipeline, X=X_train, y=Y_train, cv=10, n_jobs=1)
import numpy as np
print "CV Accuracy %.3f +/- %.3f" % (np.mean(scores), np.std(scores))

CV Accuracy 0.950 +/- 0.029
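
When ```cv``` is an integer and the estimator is a classifier, the folds are stratified, i.e. each fold keeps roughly the class proportions of the full training set. The sketch below shows this on hypothetical labels (60 negatives, 40 positives), using ```StratifiedKFold``` from ```sklearn.model_selection```, which is where recent scikit-learn versions keep the cross-validation tools.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 60 negatives and 40 positives.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((len(y_demo), 1))  # features do not influence the split

skf = StratifiedKFold(n_splits=10)
# Count (negatives, positives) in the held-out part of every fold.
fold_counts = [tuple(np.bincount(y_demo[test_idx], minlength=2))
               for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_counts)
```

Every held-out fold contains 6 negatives and 4 positives, the same 60/40 ratio as the whole set.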


We can now compute the confusion matrix.

The answer we get is a 2x2 array. The first row represents the true class 0 (-ve) and the second row represents the true class 1 (+ve); the columns are the predicted classes in the same order:

[71, 1] : [True negative, False positive]  
[5, 37] : [False negative, True positive]  
Rows are TRUE/OBSERVED CLASS. Columns are PREDICTED CLASS above.

To convert this into the convention we saw in class, switch the elements on the diagonal to get  
[37, 1] : [True positive, False positive]  
[5, 71] : [False negative, True negative]  
Rows are PREDICTED CLASS. Columns are TRUE/OBSERVED CLASS above.

In [10]:
from sklearn.metrics import confusion_matrix
y_pred = classification_pipeline.predict(X_test)
confusionMatrix = confusion_matrix(y_true=Y_test, y_pred=y_pred)
print confusionMatrix

[[71  1]
 [ 5 37]]
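
The four counts can be unpacked directly from the printed matrix. This is a small sketch in plain Python 3, with the numbers copied from the output above and class 1 treated as the positive class.

```python
# Matrix printed above: rows are true classes (0, 1), columns are predicted classes (0, 1).
confusion = [[71, 1],
             [5, 37]]

(tn, fp), (fn, tp) = confusion

# Positives first, rows = predicted class, columns = true class.
class_convention = [[tp, fp],
                    [fn, tn]]

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(class_convention, "%.3f" % accuracy)
```

The accuracy obtained this way agrees with the test accuracy of 0.947 reported earlier.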


Compute precision, recall and the F1 measure. All of these are part of the classification report shown below, which gives them for both the positive and the negative class.

In [12]:
from sklearn.metrics import classification_report
print "", classification_report(y_true=Y_test, y_pred=y_pred)

              precision    recall  f1-score   support

          0       0.93      0.99      0.96        72
          1       0.97      0.88      0.93        42

avg / total       0.95      0.95      0.95       114



Now we compute the same statistics for the positive class only (label 1), which is the default ```pos_label``` of these functions.

In [13]:
from sklearn.metrics import precision_score, recall_score, f1_score
print "Precision: %.3f" % precision_score(y_true=Y_test, y_pred=y_pred)
print "Recall: %.3f" % recall_score(y_true=Y_test, y_pred=y_pred)
print "F1: %.3f" % f1_score(y_true=Y_test, y_pred=y_pred)

Precision: 0.974
Recall: 0.881
F1: 0.925
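
These three numbers follow directly from the confusion matrix counts. A short sketch in plain Python 3, with the counts taken from the matrix printed earlier:

```python
# Counts for the positive class (label 1), read off the confusion matrix.
tp, fp, fn = 37, 1, 5

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Precision: %.3f" % precision)
print("Recall: %.3f" % recall)
print("F1: %.3f" % f1)
```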


Now we apply the logistic regression model with different parameters and try to fine-tune these hyperparameters using the grid search approach.  
Any parameter provided when constructing an estimator may be optimized in this manner.

To find the names and current values of all the parameters of a given estimator, use ```estimator.get_params()```. For our pipeline, the names of the parameters available for tuning can be displayed with ```classification_pipeline.get_params().keys()```.

The parameter name is constructed as the string name given to the step in the pipeline constructor, followed by 2 underscores and then the parameter name of that step, e.g. ```classifier__C```.
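
The same double-underscore names work with ```set_params```, which is the mechanism a grid search relies on to try each candidate. A minimal sketch (Python 3, recent scikit-learn; the pipeline is rebuilt here only so the snippet is self-contained):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("standard_scaler", StandardScaler()),
                 ("pca", PCA(n_components=2)),
                 ("classifier", LogisticRegression(random_state=1))])

params = pipe.get_params()
default_c = params["classifier__C"]   # <step name>__<parameter name>

# Update nested parameters through the same naming scheme.
pipe.set_params(classifier__C=10.0, pca__n_components=3)
new_c = pipe.named_steps["classifier"].C
new_k = pipe.named_steps["pca"].n_components
print(default_c, new_c, new_k)
```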

In [15]:
from sklearn.grid_search import GridSearchCV
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
print "Parameters available for tuning: ", classification_pipeline.get_params().keys()
param_grid = [{"classifier__penalty": ['l1', 'l2'], "classifier__C": param_range}]
# In the above line, each set of items enclosed in {} defines one grid. Here we only have one grid, but we could have multiple grids to explore
# The estimator to tune is the whole pipeline, so every step is refit for each candidate
gs = GridSearchCV(estimator=classification_pipeline,
                 param_grid=param_grid,
                 scoring='accuracy',
                 cv=10,
                 n_jobs=-1)
gs.fit(X_train, Y_train)
print "Best accuracy score: ", gs.best_score_
print "Best parameters: ", gs.best_params_

Parameters available for tuning:  ['standard_scaler__copy', 'pca__whiten', 'classifier__max_iter', 'classifier__C', 'classifier__multi_class', 'standard_scaler__with_mean', 'classifier__intercept_scaling', 'classifier__warm_start', 'pca', 'classifier__class_weight', 'pca__copy', 'classifier__solver', 'classifier__dual', 'classifier__fit_intercept', 'classifier__n_jobs', 'classifier__verbose', 'pca__n_components', 'classifier__tol', 'standard_scaler__with_std', 'standard_scaler', 'steps', 'classifier', 'classifier__random_state', 'classifier__penalty']
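
To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below (plain Python 3) expands the single grid defined in ```param_grid``` into its candidate settings; the variable names are illustrative.

```python
from itertools import product

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
penalties = ["l1", "l2"]

# Every combination of penalty and C is one candidate setting.
candidates = [{"classifier__penalty": penalty, "classifier__C": c}
              for penalty, c in product(penalties, param_range)]

n_candidates = len(candidates)
n_cv_fits = n_candidates * 10   # each candidate is evaluated with cv=10
print(n_candidates, n_cv_fits)
```

So the search trains the pipeline 160 times during cross-validation, plus one final refit on the full training set when ```refit``` is left at its default.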