## Wisconsin Diagnostic Breast Cancer (WDBC) Dataset with Decision Tree Classifier

I am writing this exploratory analysis of the dataset using a set of Python (ver. 2.7.11) packages that simplify both data access and statistical analysis. In particular, I am using pandas (ver. 0.17) and scikit-learn (ver. 0.17), together with the more common numerical packages available for Python, such as numpy and matplotlib. I am also using Jupyter notebooks for this type of quick data analysis. I usually save everything into a regular Python script once I have streamlined the major features of the predictive model.

In [190]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

import urllib
from math import sqrt
from sklearn import tree
import pylab as pl
import pandas as pd
from pydot import graph_from_dot_data as gdot
# StringIO from six replaces the plain "import StringIO", which this import would shadow anyway
from sklearn.externals.six import StringIO

The description of the attributes of the dataset can be found on the UCI Machine Learning Repository website (https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names):
* radius (mean of distances from center to points on the perimeter)
* texture (standard deviation of gray-scale values)
* perimeter
* area
* smoothness (local variation in radius lengths)
* compactness (perimeter^2 / area - 1.0)
* concavity (severity of concave portions of the contour)
* concave points (number of concave portions of the contour)
* symmetry 
* fractal dimension ("coastline approximation" - 1)
 
The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features.  For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
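
The field layout described above can be reproduced programmatically. This is a minimal sketch that is not part of the original notebook; the names `base_features`, `groups`, `feature_names` and `field_of` are illustrative, and the labels follow the wording of wdbc.names rather than the `attributes` list used in the notebook.

```python
# Ten base measurements, in the order listed in wdbc.names.
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# Three statistics are computed for every base measurement.
groups = ["mean", "SE", "worst"]

# All ten means come first, then the ten standard errors, then the ten "worst" values.
feature_names = ["%s %s" % (group, name) for group in groups for name in base_features]

# Fields 1 and 2 hold the ID and the diagnosis, so the features start at field 3.
field_of = dict((name, i + 3) for i, name in enumerate(feature_names))

print(len(feature_names))        # 30
print(field_of["mean radius"])   # 3
print(field_of["SE radius"])     # 13
print(field_of["worst radius"])  # 23
```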

In [191]:
# URL for the Wisconsin Diagnostic Breast Cancer (WDBC) (UCI Machine Learning Repository)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
# download the file
raw_data = urllib.urlopen(url)
# load the file as a numpy structured array, with one named field per attribute
attributes = ['id','cancer_type','mean_radius', 'mean_texture', 'mean_perimeter', 'mean_area',
       'mean_smoothness', 'mean_compactness', 'mean_concavity',
       'mean_concave_points', 'mean_symmetry', 'mean_fractal_dimension',
       'radius_error', 'texture_error', 'perimeter_error', 'area_error',
       'smoothness_error', 'compactness_error', 'concavity_error',
       'concave_points_error', 'symmetry_error', 'fractal_dimension_error',
       'worst_radius', 'worst_texture', 'worst_perimeter', 'worst_area',
       'worst_smoothness', 'worst_compactness', 'worst_concavity',
       'worst_concave_points', 'worst_symmetry', 'worst_fractal_dimension']
bcd =  np.genfromtxt(raw_data, delimiter=',', dtype=None, names= attributes)

In [192]:
# convert the numpy array to a pandas DataFrame
bcdf = pd.DataFrame(data=bcd, columns=attributes)

I split the dataset into a randomly sampled training set (80% of the rows) and a test set made of the remaining 20%, which will be used to test my prediction model.


In [199]:
train = bcdf.sample(frac=0.8)
test  = bcdf.drop(train.index)
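
Because `sample` draws rows at random, every run produces a different split and therefore slightly different results. Below is a minimal sketch of a reproducible 80/20 split on row positions, using only the standard library. The seed value and the variable names are assumptions made for illustration, and 569 is the number of observations in the WDBC dataset. In pandas, a similar effect can be obtained by passing `random_state` to `sample`.

```python
import random

n_rows = 569                          # observations in the WDBC dataset
n_train = int(round(0.8 * n_rows))    # size of the training set

rng = random.Random(42)               # fixed seed, arbitrary choice
train_idx = sorted(rng.sample(range(n_rows), n_train))

# Every row position that was not drawn goes to the test set.
train_set = set(train_idx)
test_idx = [i for i in range(n_rows) if i not in train_set]

print("%d train, %d test" % (len(train_idx), len(test_idx)))  # 455 train, 114 test
```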

I separate the 30 feature columns (X) from the diagnosis label (y) in the training set.

In [200]:
# columns 2 to 31 hold the 30 features, column 1 holds the diagnosis (B or M)
labels = train.columns[2:32]
X = train.iloc[:, 2:32]
y = train.iloc[:, 1]

In [201]:
# create decision tree
dt_train = tree.DecisionTreeClassifier(max_depth=500, min_samples_leaf=1)
dt_train.fit(X,y)
tree.export_graphviz(dt_train,feature_names=labels, out_file='tree_train.dot') 
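
By default, `DecisionTreeClassifier` uses the Gini impurity as its split criterion: at each node it looks for the feature threshold that most reduces the impurity of the resulting child nodes. The following is a minimal sketch of that quantity in plain Python. The function names and the class counts are made up for illustration and are not taken from the fitted tree.

```python
def gini(counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Impurity after a split: child impurities weighted by child size."""
    total = float(sum(sum(child) for child in children))
    return sum(sum(child) / total * gini(child) for child in children)


# Hypothetical node with 60 benign and 40 malignant samples.
parent = [60, 40]
# Hypothetical split of that node into two children.
children = [[55, 5], [5, 35]]

decrease = gini(parent) - weighted_gini(children)
print("parent impurity: %.3f" % gini(parent))
print("after the split: %.3f" % weighted_gini(children))
print("decrease:        %.3f" % decrease)
```

A pure node (all samples in one class) has impurity 0, which is where the tree stops splitting when `min_samples_leaf=1` and the depth limit is not reached.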

To create a PNG of the tree, we can run the following command from the terminal:

```dot -Tpng tree_train.dot -o tree_train.png```

![fig:tree_train](tree_train.png)

I prepare the test set in the same way, separating the feature columns from the diagnosis label.

In [None]:
# same column layout as the training set: 30 features, then the diagnosis label
labels = test.columns[2:32]
X = test.iloc[:, 2:32]
y = test.iloc[:, 1]

Let's test the model with a prediction on the remaining 20% of the dataset 

In [208]:
test_predict = dt_train.predict(X)

A cross-tabulation is a convenient way to check the success rate. The model reaches an accuracy of 94.7% on the test dataset (108 of 114 observations classified correctly).

In [209]:
pd.crosstab(test['cancer_type'], test_predict, rownames=['actual'], colnames=['preds'])

| actual / preds | B | M |
|:--|--:|--:|
| B | 71 | 4 |
| M | 2 | 37 |
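
The accuracy quoted above follows directly from the four cell counts of the cross-tabulation. The sketch below recomputes it, together with sensitivity and specificity. Treating malignant (M) as the positive class is an assumption made here for illustration, and the variable names are not from the notebook.

```python
# Counts from the cross-tabulation (rows: actual, columns: predicted).
tn, fp = 71, 4   # actual B: predicted B, predicted M
fn, tp = 2, 37   # actual M: predicted B, predicted M

total = tn + fp + fn + tp
accuracy = (tp + tn) / float(total)
sensitivity = tp / float(tp + fn)   # share of malignant cases detected
specificity = tn / float(tn + fp)   # share of benign cases recognized

print("n           = %d" % total)          # 114
print("accuracy    = %.3f" % accuracy)     # 0.947
print("sensitivity = %.3f" % sensitivity)  # 0.949
print("specificity = %.3f" % specificity)  # 0.947
```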


We can also fit a decision tree on the remaining test dataset. It is not really useful for the analysis, but it shows how sensitive the shape of the tree is to the data it is trained on. I think this is one of the limitations of the decision tree method, which is partially addressed by Random Forest.


In [212]:
dt_test = tree.DecisionTreeClassifier(max_depth=500, min_samples_leaf=1)
dt_test.fit(X, y)
tree.export_graphviz(dt_test,feature_names=labels, out_file='tree_test.dot') 

![fig:tree_test](tree_test.png)
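
The Random Forest idea mentioned above is to train many trees on different resamples of the data and combine their predictions, so that the quirks of any single tree matter less. The following is a simplified sketch of the combination step using a plain majority vote. The per-tree predictions are hypothetical, and this illustrates the voting idea rather than scikit-learn's actual implementation.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label among the votes of the individual trees."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical predictions of five trees (rows) for three observations (columns).
per_tree = [
    ["B", "M", "B"],
    ["B", "M", "M"],
    ["M", "M", "B"],
    ["B", "B", "B"],
    ["B", "M", "B"],
]

# zip(*per_tree) regroups the votes by observation instead of by tree.
forest_prediction = [majority_vote(votes) for votes in zip(*per_tree)]
print(forest_prediction)  # ['B', 'M', 'B']
```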