The Breast Cancer Wisconsin (Diagnostic) Data Set contains 30 columns and 569 rows, corresponding to 569 samples
of biopsied tissue. Each sample is imaged, and the cell nuclei present in each image are described by the mean
value of the following 10 characteristics:

1. Radius
2. Texture
3. Perimeter
4. Area
5. Smoothness
6. Compactness
7. Concavity
8. Number of concave portions of contour
9. Symmetry
10. Fractal dimension

The next ten columns contain the standard deviation of each of attributes 1 to 10, and the last ten columns give the maximum measurement of the corresponding feature (1 to 10) obtained in a single image.

The dataset is already divided into (X_train, Y_train) and (X_test, Y_test) sets, where the Y values are either 0 or 1, depending on whether the tissue was found to be benign (0) or malignant (1).

This is a classification problem, which involves a binary choice, based on several parameters.

The initial approach will be to train a Decision Tree using all the parameters available. The quality of the trained model will be measured by the area under the ROC curve.

Thanks to the function GridSearchCV, I am able to observe how the quality of the model changes for different settings of the Decision Tree input parameters, such as max_depth, which is the maximum depth that the tree can reach.

My suspicion is that blindly using all the attributes of the training set loses some quite crucial information: the standard deviation and maximum values of the attributes are intrinsically related to the attributes themselves, so it is necessary to find a way to communicate this relationship to the model. It is a bit like fitting raw data together with their errors to a model: including the errors helps determine the model more accurately.

The variable I thought of introducing for each of the ten attributes is what I very boringly called "attributes_relations_[i]", and is simply defined as: (attribute[max]-attribute[mean])/attribute[std]. Basically, I am determining how many standard deviations the mean lies below the maximum measured value of the feature. I don't have any background in the subject, but I would argue that a mean value close to the maximum implies that the attribute is homogeneous, whereas an average radius of 2mm with a maximum radius of 10mm tells me that the feature observed is extremely irregular, which I understand is not good news.
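
As a minimal sketch of this definition, the function below computes the ratio for a single feature. The helper name `attribute_relation` and the standard deviation of 2mm are assumptions made for illustration; only the mean and maximum radii of the irregular case come from the example above.

```python
def attribute_relation(mean, std, maximum):
    """Distance between the maximum and the mean, in units of standard deviation."""
    return (maximum - mean) / std

# Irregular feature: mean radius 2mm, maximum 10mm (the std of 2mm is assumed)
irregular = attribute_relation(mean=2.0, std=2.0, maximum=10.0)

# Homogeneous feature: the mean sits close to the maximum (all values assumed)
homogeneous = attribute_relation(mean=9.0, std=2.0, maximum=10.0)

print(irregular, homogeneous)  # 4.0 0.5
```

A larger value means the mean sits further below the maximum, which is the irregular case described above.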

In the following I will present my analysis step by step and draw some conclusions.

1) First things first: load the data, rename all the columns with the name of the feature rather than a number, and check the quality of the datasets:


In [None]:
import pandas as pd
trainY= pd.read_csv('trainY.csv', header=None)
trainX= pd.read_csv('trainX.csv', header=None)
testY= pd.read_csv('testY.csv', header=None)
testX= pd.read_csv('testX.csv', header=None)

#new names for the columns of the train and test datasets
features = ['Radius','Texture','Perimeter','Area','Smoothness','Compactness','Concavity',
            'Number of concave portions of contour','Symmetry','Fractal dimension']
#first ten columns: mean values; next ten: 'std_' prefix; last ten: 'max_' prefix
columnNames = features + ['std_' + f for f in features] + ['max_' + f for f in features]
trainX.columns = columnNames
testX.columns = list(columnNames)

#Check for presence of non-float values (I'm expecting every value is a float)

trainX.info()
testX.info()
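
Since info() only prints a summary, the same check can also be made programmatically. The sketch below uses small made-up DataFrames in place of trainX, and the helper name `is_clean` is hypothetical.

```python
import pandas as pd

def is_clean(df):
    """True if every column has a float dtype and there are no missing values."""
    all_float = all(pd.api.types.is_float_dtype(dtype) for dtype in df.dtypes)
    no_missing = not df.isna().any().any()
    return all_float and no_missing

# Toy stand-ins for trainX (values are made up)
good = pd.DataFrame({'Radius': [12.3, 14.1], 'Texture': [17.2, 20.5]})
bad = pd.DataFrame({'Radius': [12.3, None], 'Texture': ['17.2', '20.5']})

print(is_clean(good), is_clean(bad))  # True False
```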

2) I want to train a decision tree on the dataset, but I don't know which depth is optimal, that is, when it is best to prune the tree to avoid overfitting.
As a rule of thumb, I take ⌈log2(n)+1⌉ as the smallest depth worth trying, where n is the number of samples (569 in this case), which gives 11. Thanks to the function GridSearchCV() I can then set the depth to different values and obtain the optimal depth: the depth for which the quality of the model (AUC ROC) is highest.
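
The bound quoted above can be evaluated directly. This is only a check of the arithmetic, and the variable names are illustrative.

```python
import math

n_samples = 569  # number of rows in the dataset
min_depth = math.ceil(math.log2(n_samples) + 1)

print(min_depth)  # 11
```

This is why the depth grid in the next cell starts at 11.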

In [None]:
from sklearn import tree
from sklearn.model_selection import GridSearchCV

treeCheck = tree.DecisionTreeClassifier(random_state=0)
parameters={'max_depth': [11,15,20,25,30,35]}
clf = GridSearchCV(treeCheck, parameters, scoring='roc_auc', cv=3)
clf.fit(trainX, trainY[0]) #trainY has a single column, named 0 because of header=None

print(clf.best_estimator_)

The best estimator turns out to be...

In [None]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=11, #here's the tree depth
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')

... a Decision Tree with the minimum depth tried.
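
To see how the score changes with depth, rather than only the winning estimator, the fitted search object also exposes the results for every setting in `cv_results_`. The following is a sketch on synthetic data from `make_classification`, not on the biopsy data, and the depth grid is chosen arbitrarily for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for (trainX, trainY)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

depths = [2, 4, 8, 16]  # arbitrary grid, for illustration only
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': depths}, scoring='roc_auc', cv=3)
search.fit(X, y)

# Mean cross-validated AUC for every depth in the grid, in grid order
mean_scores = search.cv_results_['mean_test_score']
for depth, score in zip(depths, mean_scores):
    print(depth, round(score, 3))

print(search.best_params_)
```

If the best depth is the smallest one in the grid, as happens above with the real data, it may be worth extending the grid towards smaller values as well.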

3) I calculate the AUC ROC obtained by predicting the data in testX with a Decision Tree with max_depth=11:

In [None]:
from sklearn.metrics import roc_curve, auc

pred=clf.predict(testX)
fpr, tpr, thresholds = roc_curve(testY, pred) 
aucroc=auc(fpr, tpr)

I obtain AUC ROC=0.8375.
Let's see if I can improve it by introducing the variable "attributes_relations_[i]".
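
One caveat about how this number is computed: roc_curve receives the hard 0/1 predictions rather than scores, so the ROC curve has a single interior point and the area under it reduces to (TPR + 1 - FPR)/2. The sketch below checks this on made-up labels; the helper name `auc_from_hard_labels` is hypothetical.

```python
def auc_from_hard_labels(y_true, y_pred):
    """Area under the three-point ROC curve (0,0) -> (FPR,TPR) -> (1,1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    tpr = tp / (tp + fn)  # true positive rate
    fpr = fp / (fp + tn)  # false positive rate
    return (tpr + 1 - fpr) / 2

# Made-up labels: one false negative and one false positive out of eight
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

result = auc_from_hard_labels(y_true, y_pred)
print(result)  # 0.75
```

Passing clf.predict_proba(testX)[:, 1] to roc_curve instead would sweep over thresholds, although a deep tree often outputs probabilities of exactly 0 or 1, so the difference may be small.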

4) I add 10 new columns containing "attributes_relations_[i]", one for every feature, to both the train and test X datasets:

In [None]:
for i in range(0,10):
    #(max-mean)/std; the feature name is read from the first ten columns
    newCol = (testX[testX.columns[20+i]] - testX[testX.columns[i]])/testX[testX.columns[10+i]]
    testX['attributes_relations_' + str(testX.columns[i])] = newCol

for i in range(0,10):
    #(max-mean)/std
    newCol = (trainX[trainX.columns[20+i]] - trainX[trainX.columns[i]])/trainX[trainX.columns[10+i]]
    trainX['attributes_relations_' + str(trainX.columns[i])] = newCol

Now, the only attributes of the dataset I am interested in are the mean values of the features and their "attributes_relations_[i]" values, so:

In [None]:
list_ib = trainX.columns.values
#the only attributes I care about are in the first ten and last ten columns
attributes = [name for count, name in enumerate(list_ib) if count<10 or count>29]

trainSet=trainX[attributes]
testSet=testX[attributes]

Because some mean, max and standard deviation values are 0 (as for the attribute 'Concavity' in row 223 of the training set), the new set of columns will contain some NaN values. I replace them with 0 (trainSet = trainSet.fillna(0), and likewise for testSet).
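
A minimal sketch of why the NaN values appear and how fillna handles them, using made-up numbers rather than the actual columns:

```python
import pandas as pd

# Toy columns (made-up values); the first row mimics a feature that is 0 everywhere
mean = pd.Series([0.0, 2.0])
std = pd.Series([0.0, 2.0])
maximum = pd.Series([0.0, 10.0])

relation = (maximum - mean) / std     # 0/0 in the first row gives NaN
had_nan = relation.isna().tolist()
print(had_nan)                        # [True, False]

relation = relation.fillna(0)         # fillna returns a new Series, so reassign
print(relation.tolist())              # [0.0, 4.0]
```

A nonzero numerator over a zero standard deviation would give inf rather than NaN, and fillna leaves inf untouched; whether that case occurs in this data is not checked here.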

5) As above, I use GridSearchCV() to determine the optimal tree depth. The code is the same, and so is the result.

So I train the Decision Tree:

In [None]:
clf = tree.DecisionTreeClassifier(max_depth=11,random_state=0)
clf.fit(trainSet, trainY[0]) #trainY has a single column, named 0 because of header=None

from sklearn.metrics import roc_curve, auc
pred=clf.predict(testSet)
fpr, tpr, thresholds = roc_curve(testY, pred) 
aucroc=auc(fpr, tpr)

The AUC ROC is 0.86, which is slightly higher than the value obtained previously.

This suggests that some information is lost if we don't consider the intrinsic relation among the variables.

6) Finally, let's see whether a big difference between the mean and max value of a feature (like the radius) really indicates that the tissue is cancerous.

To find out, I produce a scatter plot of "attributes_relations_[Radius]" versus "Radius" for both the actual data and the data predicted by the model:

In [None]:
#get the index of the 0 (benign) and 1 (malignant) values from the actual data
ind_negatives=testY[0][testY[0]==0].index
ind_pos=testY[0][testY[0]==1].index

#get the index of the 0 (benign) and 1 (malignant) values from the predicted data
import numpy as np
ind_posModel =  np.nonzero(pred)[0]
ind_negativesModel = np.where(pred == 0)[0]

#plot: Radius on the x axis and attributes_relations_Radius on the y axis, to match the axis labels
import matplotlib.pyplot as plt
plt.subplot(1, 2, 1)
plt.plot(testSet['Radius'][ind_negatives],testSet['attributes_relations_Radius'][ind_negatives],'bo',label='Benign')
plt.plot(testSet['Radius'][ind_pos],testSet['attributes_relations_Radius'][ind_pos],'ro', label='Malignant')
plt.legend(loc='upper left')
plt.xlabel('Radius (mm)')
plt.ylabel('(MaxRadius-MeanRadius)/StdRadius')
plt.title('Actual data')
plt.subplot(1, 2, 2)
plt.plot(testSet['Radius'][ind_negativesModel],testSet['attributes_relations_Radius'][ind_negativesModel],'bo',label='Benign')
plt.plot(testSet['Radius'][ind_posModel],testSet['attributes_relations_Radius'][ind_posModel],'ro', label='Malignant')
plt.legend(loc='upper left')
plt.xlabel('Radius (mm)')
plt.ylabel('(MaxRadius-MeanRadius)/StdRadius')
plt.title('Predicted by Decision Tree')
plt.show()

The result is the following:

![title](C:/Users/adach/Desktop/ModelandData.PNG)

As can be seen, in general this criterion is a good predictor, but not always.

The Decision Tree delivers false negatives in almost 10% of the cases, meaning that either the dataset needs to be enlarged or another predictor should be found.