- Read the dataset from the URL into a dataframe.
- Display the first few rows to make sure it was read properly.

In [2]:
import pandas as pd
# Reading the dataset into a pandas dataframe object
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
                header=None)
# Print some data to make sure the data was properly read into the dataframe
print(df.head(n=3))

         0  1      2      3      4       5        6        7       8   \
0    842302  M  17.99  10.38  122.8  1001.0  0.11840  0.27760  0.3001   
1    842517  M  20.57  17.77  132.9  1326.0  0.08474  0.07864  0.0869   
2  84300903  M  19.69  21.25  130.0  1203.0  0.10960  0.15990  0.1974   

        9    ...        22     23     24      25      26      27      28  \
0  0.14710   ...     25.38  17.33  184.6  2019.0  0.1622  0.6656  0.7119   
1  0.07017   ...     24.99  23.41  158.8  1956.0  0.1238  0.1866  0.2416   
2  0.12790   ...     23.57  25.53  152.5  1709.0  0.1444  0.4245  0.4504   

       29      30       31  
0  0.2654  0.4601  0.11890  
1  0.1860  0.2750  0.08902  
2  0.2430  0.3613  0.08758  

[3 rows x 32 columns]


Let us get a sense of the size of the dataset we are dealing with.

In [3]:
rows, columns = df.shape
print("#rows: ", rows)
print("#columns: ", columns)

#rows:  569
#columns:  32


The first column is just an ID. The second column is the diagnosis for the tumour: M (malignant) or B (benign).

Columns 3-32 (zero-based indices 2-31) hold the 30 features of the dataset.

Assign the 30 features of the dataset and the target variable to separate NumPy arrays. Because the file was read with `header=None`, the column labels are the default integers 0-31, so replacing `iloc` with `loc` in the next 2 lines gives the same result.
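The equivalence of `iloc` and `loc` here depends on the default integer labels and on the slices being open-ended. A minimal sketch on a hypothetical toy frame (`toy` is not part of the dataset) shows where the two agree and where they differ:

```python
import pandas as pd

# Hypothetical 3x4 frame with default integer column labels (as with header=None)
toy = pd.DataFrame([[1, "M", 0.1, 0.2],
                    [2, "B", 0.3, 0.4],
                    [3, "M", 0.5, 0.6]])

# Open-ended slices: label-based and position-based selection coincide
same_features = toy.loc[:, 2:].equals(toy.iloc[:, 2:])
same_target = toy.loc[:, 1].equals(toy.iloc[:, 1])

# Bounded slices differ: loc includes the end label, iloc excludes the end position
loc_width = toy.loc[:, 1:2].shape[1]    # columns 1 and 2
iloc_width = toy.iloc[:, 1:2].shape[1]  # column 1 only

print(same_features, same_target, loc_width, iloc_width)
```

If the frame had named columns or a bounded slice, the two accessors would no longer be interchangeable.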

In [4]:
X = df.iloc[:, 2:].values
Y = df.iloc[:, 1].values

Transform the class labels from their original string representation (M and B) to integers, using the `LabelEncoder` class from scikit-learn's preprocessing module.

In [6]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Y = le.fit_transform(Y)
print(Y[:50])

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 0 1 1 1 1 1 1 1 1 0 1 0 0]
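
To check which integer stands for which diagnosis, the fitted encoder can be inspected. The sketch below uses a small hypothetical label array rather than the full dataset; `LabelEncoder` sorts the classes, so B maps to 0 and M maps to 1:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of diagnosis labels (not read from the dataset)
labels = np.array(["M", "B", "B", "M"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so position 0 is 'B' and position 1 is 'M'
mapping = {str(cls): int(idx) for idx, cls in enumerate(encoder.classes_)}

# inverse_transform recovers the original string labels
decoded = encoder.inverse_transform(encoded)

print(mapping)
```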


We can now see that the Y array has been converted from string labels (M and B) to integer labels (1 and 0). Let us now split the dataset into training and test sets, holding out 20% of the rows for testing.

In [7]:
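# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split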
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
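
The split above is purely random. As an optional variation that the notebook itself does not use, `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. A minimal sketch on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 60 of class 0 and 40 of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

# With stratify, both parts keep the 60/40 class ratio
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)

print(train_counts, test_counts)
```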

Create a pipeline with the following steps chained to each other:
- Standardize the columns so that all feature values are on the same scale before they are fed to the classifier.
- Compress the initial 30-dimensional data into a 2-dimensional subspace using Principal Component Analysis (PCA).
- Apply the logistic regression algorithm.

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

Define the steps of the pipeline. Each step is a tuple of a name and either a transformer or an estimator object; every step except the last must be a transformer.

In [9]:
classification_pipeline = Pipeline([('standard_scaler', StandardScaler()),
                                   ('pca', PCA(n_components=2)),
                                   ('classifier', LogisticRegression(random_state=1))])
classification_pipeline.fit(X_train, Y_train)

Pipeline(steps=[('standard_scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('pca', PCA(copy=True, n_components=2, whiten=False)), ('classifier', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=1, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
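
The names given in the step tuples are not only labels: after fitting, each step can be retrieved by name through `named_steps`. The sketch below assumes synthetic stand-in data (`X_demo`, `y_demo`) in place of the real features, and looks up the fitted PCA step to see how much variance the two components retain:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 30-feature dataset (assumption, not the real data)
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(100, 30))
y_demo = (X_demo[:, 0] > 0).astype(int)

pipe = Pipeline([('standard_scaler', StandardScaler()),
                 ('pca', PCA(n_components=2)),
                 ('classifier', LogisticRegression(random_state=1))])
pipe.fit(X_demo, y_demo)

# Each fitted step is reachable by the name given in its tuple
pca_step = pipe.named_steps['pca']
explained = pca_step.explained_variance_ratio_

print("Variance explained by the 2 components:", explained.sum())
```

On the real data, the same lookup on `classification_pipeline` would show how much information is lost by keeping only two components.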

Compute the accuracy of the model on the test data.

In [10]:
accuracy = classification_pipeline.score(X_test, Y_test)
print("Test Accuracy: %.3f" % accuracy)

Test Accuracy: 0.947
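
For a classifier, `score` returns the mean accuracy, i.e. the fraction of samples whose predicted label matches the true label. A pipeline's `score` applies the transformers first and then delegates to the final estimator. The sketch below checks the equivalence on synthetic stand-in data with a bare `LogisticRegression` (all names here are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption, not the real dataset)
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Fit on the first 150 samples, evaluate on the remaining 50
clf = LogisticRegression(random_state=1).fit(X_demo[:150], y_demo[:150])

score_value = clf.score(X_demo[150:], y_demo[150:])
manual_value = np.mean(clf.predict(X_demo[150:]) == y_demo[150:])

print(score_value, manual_value)
```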


Instead of relying on a single evaluation on the test data, let's run stratified k-fold cross-validation (k = 10) on the training data.

In [None]:
from sklearn.model_selection import cross_val_score
import numpy as np
# With an integer cv and a classifier, cross_val_score uses stratified folds
scores = cross_val_score(estimator=classification_pipeline, X=X_train, y=Y_train, cv=10, n_jobs=1)
print("CV Accuracy %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))
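
To make the stratification visible, the folds can also be generated explicitly with `StratifiedKFold`. This is a minimal sketch on synthetic stand-in labels (`X_demo` and `y_demo` are hypothetical), showing that every test fold keeps the overall class ratio:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in labels: 60 of class 0 and 40 of class 1
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((100, 1))  # features are irrelevant for building the folds

skf = StratifiedKFold(n_splits=10)

# Count the classes in the held-out part of every fold
fold_counts = [np.bincount(y_demo[test_idx])
               for _, test_idx in skf.split(X_demo, y_demo)]

for counts in fold_counts:
    print(counts)
```

An object like `skf` can be passed as the `cv` argument of `cross_val_score` in place of the integer 10.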