In [52]:
import graphlab

In [53]:
data1 = graphlab.SFrame('breast-cancer-wisconsin.csv')
data3 = graphlab.SFrame('ovarian.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,int,int,int,int,str,int,int,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[float,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
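
In the first type-hint list above, the seventh column (`bare_nuclei`) is inferred as `str` while every other column is `int`. In the original UCI release of this data set, missing values in that column are written as `?`, which would explain the inference. Whether this particular CSV uses the same marker is an assumption. A minimal sketch in plain Python of converting such a column to integers with `None` for missing cells (the helper name `parse_optional_int` and the sample rows are hypothetical):

```python
import csv

def parse_optional_int(text, missing_marker="?"):
    """Return int(text), or None when the cell is empty or holds the missing marker."""
    text = text.strip()
    if text == "" or text == missing_marker:
        return None
    return int(text)

# Hypothetical excerpt: the second row has a missing bare_nuclei value.
sample_text = (
    "id,bare_nuclei,class\n"
    "1000025,1,2\n"
    "1057013,?,4\n"
)
rows = list(csv.DictReader(sample_text.splitlines()))
bare_nuclei = [parse_optional_int(row["bare_nuclei"]) for row in rows]
print(bare_nuclei)  # [1, None]
```

The notebook itself leaves `bare_nuclei` out of the feature list, so this conversion is only needed if that column is to be used as a numeric feature.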


In [54]:
data1


id,clump_thickness,size_uniformity,shape_uniformity,marginal_adhesion,epithelial_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitoses,class
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4
1018099,1,1,1,1,2,10,3,1,1,2
1018561,2,1,2,1,2,1,3,1,1,2
1033078,2,1,1,1,2,1,1,1,5,2
1033078,4,2,1,1,2,1,2,1,1,2


In [55]:
graphlab.canvas.set_target('ipynb')
data1.show(view="Summary")

### Visualizing Data (Standard)

In [56]:
data1.show(view="Line Chart", x="clump_thickness", y='id')

In [58]:
data1.show(view="Bar Chart", x="bland_chromatin", y='id')

In [60]:
data1.show(view="BoxWhisker Plot", x="marginal_adhesion", y='id')

In the standard data, we observe the following trend: the higher the value of a feature (such as bland chromatin or marginal adhesion), the higher the chance of malignancy, and in turn the higher the probability of mortality.

We fit two classification models to the above data, an ANN and an SVM, to assess how different data sets affect model metrics such as efficiency and specificity.

### Visualizing Data (Gene expression)

In [61]:
data3

M/Z,Intensity
-7.8602611e-05,4.1045752
2.1773576e-07,4.1106083
9.6021472e-05,4.0542986
0.00036601382,4.1206637
0.00081019477,4.0221217
0.0014285643,3.9537456
0.0022211225,3.8833585
0.0031878693,3.9698341
0.0043288047,4.0180995
0.0056439287,4.0100553


In [62]:
data3.show(view="Summary")

In [63]:
data3.show(view="Line Chart", x="M/Z", y="Intensity")

In [65]:
data3.show(view="BoxWhisker Plot", x="M/Z", y="Intensity")

The genetic data set is highly spread out, so it is difficult to observe a consistent trend among its entries. However, for binary classification purposes it has been medically proven that an intensity of more than 50 is more likely to indicate ovarian/prostate cancer.
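
The threshold rule described above can be sketched as a one-line classifier. This is only an illustration of the rule as stated in the text, not a model fitted in this notebook. The function name `flag_by_intensity` and the two high readings at the end of the list are made up for the example, and treating a reading of exactly 50 as "not above" is an assumption.

```python
def flag_by_intensity(intensities, threshold=50.0):
    """Return 1 for readings strictly above the threshold, else 0."""
    return [1 if value > threshold else 0 for value in intensities]

# First five intensities from the table above, plus two made-up readings.
readings = [4.1045752, 4.1106083, 4.0542986, 4.1206637, 4.0221217, 62.5, 50.0]
flags = flag_by_intensity(readings)
print(flags)  # [0, 0, 0, 0, 0, 1, 0]
```

All ten rows displayed above have intensities near 4, so none of them would be flagged by this rule.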

Just as physical examination yields the features of the breast cancer data set, here the examination marker comes from the CA125 process, which produces the data set shown above.

### Creating the artificial neural network model

In [66]:
train_data, test_data = data1.random_split(0.8)
feature_array = ['clump_thickness', 'size_uniformity', 'shape_uniformity', 'marginal_adhesion', 'bland_chromatin', 'mitoses', 'epithelial_size', 'normal_nucleoli']
model_neural = graphlab.neuralnet_classifier.create(train_data, target='class', features=feature_array)

Using network:

### network layers ###
layer[0]: FullConnectionLayer
  init_sigma = 0.01
  init_random = gaussian
  init_bias = 0
  num_hidden_units = 10
layer[1]: SigmoidLayer
layer[2]: FullConnectionLayer
  init_sigma = 0.01
  init_random = gaussian
  init_bias = 0
  num_hidden_units = 2
layer[3]: SoftmaxLayer
### end network layers ###

### network parameters ###
learning_rate = 0.001
momentum = 0.9
### end network parameters ###

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.
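
A note on `random_split(0.8)`: the split is random, so the two parts are only approximately 80% and 20% of the rows and differ from run to run unless a seed is fixed. The sketch below imitates that behaviour in plain Python by assigning each row independently. This per-row scheme is an assumption about how such a split can be done, not a description of GraphLab's internals, and the 1000-row stand-in data is hypothetical.

```python
import random

def random_split(rows, fraction, seed=None):
    """Send each row to the first part with probability `fraction`, else to the second."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of a data set
train_rows, test_rows = random_split(rows, 0.8, seed=0)
print(len(train_rows), len(test_rows))  # roughly 800 and 200
```

Fixing the seed makes the split, and therefore every accuracy figure reported further down, reproducible across runs.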



In [67]:
model_neural.show(view="Summary")
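
The network printed above is small: a fully connected layer with 10 hidden units, a sigmoid, a fully connected layer with 2 units, and a softmax. A forward pass through that shape can be sketched in plain Python as follows. The weights here are freshly drawn from a Gaussian with standard deviation 0.01 and zero bias, mirroring `init_sigma` and `init_bias` in the printout; they are not the trained weights of `model_neural`, and the helper names are hypothetical.

```python
import math
import random

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def softmax(values):
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_out, n_in, rng, sigma=0.01):
    """Gaussian weights with standard deviation sigma, biases set to zero."""
    weights = [[rng.gauss(0.0, sigma) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
w1, b1 = init_layer(10, 8, rng)   # 8 input features -> 10 hidden units
w2, b2 = init_layer(2, 10, rng)   # 10 hidden units -> 2 classes

# First row of data1, in the order of feature_array.
features = [5, 1, 1, 1, 3, 1, 2, 1]
hidden = sigmoid(dense(features, w1, b1))
probabilities = softmax(dense(hidden, w2, b2))
print(probabilities)
```

With weights this small and no training, both outputs sit close to 0.5, which is the starting point from which training has to move the network.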

In [68]:
pred = model_neural.classify(test_data)

In [69]:
pred

row_id,class,probability
0,2,0.587500989437
1,2,0.585788607597
2,2,0.588369131088
3,2,0.588559150696
4,2,0.584629893303
5,2,0.581713438034
6,2,0.588684380054
7,2,0.585021317005
8,2,0.586781799793
9,2,0.586582005024


In [70]:
eval_neural = model_neural.evaluate(test_data)

In [71]:
eval_neural

{'accuracy': 0.680272102355957, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 2
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      2       |        2        |  100  |
 |      4       |        2        |   47  |
 +--------------+-----------------+-------+
 [2 rows x 3 columns]}
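
The confusion matrix above has only two rows, both with predicted label 2: the network assigns every test row to class 2. Its accuracy is therefore the same as that of a baseline that always predicts the most common class. A small check in plain Python, using only the counts from the matrix (the helper name is hypothetical):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))

# True labels of the test set, reconstructed from the confusion matrix counts.
test_labels = [2] * 100 + [4] * 47
baseline = majority_baseline_accuracy(test_labels)
print(round(baseline, 6))  # 0.680272
```

This matches the reported accuracy of 0.680272 up to floating-point precision.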

In [72]:
pred_topk = model_neural.predict_topk(test_data, k=2)

In [73]:
pred_topk

row_id,class,probability
0,2,0.587500989437
0,4,0.412498980761
1,2,0.585788607597
1,4,0.414211392403
2,2,0.588369131088
2,4,0.411630839109
3,2,0.588559150696
3,4,0.411440849304
4,2,0.584629893303
4,4,0.415370106697
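
In the `predict_topk` output, each `row_id` appears once per returned class, ordered by probability, and with `k=2` on a two-class problem the two probabilities sum to 1 up to rounding. The ranking step can be sketched in plain Python as below; `top_k` is a hypothetical helper, not part of GraphLab.

```python
def top_k(class_probabilities, k):
    """Return the k (class, probability) pairs with the highest probability."""
    ranked = sorted(class_probabilities.items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Probabilities for row 0, copied from the output above.
row0 = {2: 0.587500989437, 4: 0.412498980761}
print(top_k(row0, 1))  # [(2, 0.587500989437)]
print(top_k(row0, 2))  # both classes, most probable first
```

With `k=1` the result carries the same information as the `classify` output shown earlier.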


In [74]:
model_neural.show(view="Evaluation")

### Comparing with the support vector machine model

In [75]:
model_vector = graphlab.svm_classifier.create(train_data, target='class', features=feature_array, validation_set='auto', verbose=True)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [76]:
model_vector.show()

In [77]:
prediction_vector = model_vector.predict(test_data)

In [78]:
prediction_vector.show()

In [79]:
eval_ = model_vector.evaluate(test_data)

In [81]:
eval_

{'accuracy': 0.9863945578231292, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 3
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      4       |        4        |   47  |
 |      2       |        4        |   2   |
 |      2       |        2        |   98  |
 +--------------+-----------------+-------+
 [3 rows x 3 columns], 'f1_score': 0.9791666666666666, 'precision': 0.9591836734693877, 'recall': 1.0}
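
The reported scores can be recomputed by hand from the confusion matrix. Treating class 4 as the positive class (an assumption, but one that reproduces the reported precision and recall), the counts are 47 true positives, 2 false positives, 98 true negatives and 0 false negatives. The sketch below also derives specificity, which `evaluate` does not report here; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion matrix counts."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return {
        "accuracy": (tp + tn) / float(tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / float(tn + fp),
        "f1_score": 2 * precision * recall / (precision + recall),
    }

# Counts read from the SVM confusion matrix above, with class 4 as positive.
metrics = binary_metrics(tp=47, fp=2, tn=98, fn=0)
for name in sorted(metrics):
    print(name, round(metrics[name], 4))
```

The accuracy, precision, recall and F1 values agree with the output above, and the specificity comes out at 98/100 = 0.98.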

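We see that for a data set with few entries, the support vector machine performs better: it reduces to a linear model, which evaluates better here than the artificial neural network. The neural network predicted class 2 for every test row, so its accuracy of about 0.68 is just the share of class 2 in the test set (100 of 147), whereas the SVM reaches about 0.986.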
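
To make "linear model" concrete: a linear SVM classifies a row by the sign of a weighted sum of its features plus a bias. The sketch below shows only that decision rule. The weights and bias are invented for illustration and are not the coefficients learned by `model_vector`.

```python
def svm_predict(features, weights, bias, negative_label=2, positive_label=4):
    """Linear decision rule: positive class when w . x + b is above zero."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return positive_label if score > 0 else negative_label

# Hypothetical coefficients, one per entry of feature_array.
weights = [0.3, 0.2, 0.2, 0.1, 0.2, 0.1, 0.1, 0.2]
bias = -4.0

# Two rows from data1, in feature_array order.
row_1000025 = [5, 1, 1, 1, 3, 1, 2, 1]      # labelled class 2
row_1017122 = [8, 10, 10, 8, 9, 1, 7, 7]    # labelled class 4

print(svm_predict(row_1000025, weights, bias))  # 2
print(svm_predict(row_1017122, weights, bias))  # 4
```

Because higher feature values go with malignancy in this data set, a single weighted sum like this can already separate most rows.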

## Using the above techniques on the BUPA Liver Cancer data set

In [82]:
data2 = graphlab.SFrame('bupa.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,int,int,int,int,float,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [83]:
data2

mcv,alkphos,sgpt,sgot,gammagt,drink,class
85,92,45,27,31,0.0,1
85,64,59,32,23,0.0,2
86,54,33,16,54,0.0,2
91,78,34,24,36,0.0,2
87,70,12,28,10,0.0,2
98,55,13,17,17,0.0,2
88,62,20,17,9,0.5,1
88,67,21,11,11,0.5,1
92,54,22,20,7,0.5,1
90,60,25,19,5,0.5,1


In [84]:
graphlab.canvas.set_target('ipynb')
data2.show(view="Summary")

In [85]:
data2.show(view="Line Chart", x="sgpt", y='class')

#### Observation: SGPT is higher for malignant patients than for benign ones. A similar trend is observed for the other features as well.
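
One way to check an observation like this numerically is to compare the mean of a feature per class. The sketch below does so in plain Python for the ten rows of `data2` displayed above. It makes no claim about which class label denotes which condition, and ten rows are far too few to confirm the trend; `mean_by_class` is a hypothetical helper.

```python
from collections import defaultdict

def mean_by_class(rows, feature, label="class"):
    """Average of `feature` within each value of `label`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label]].append(row[feature])
    return {cls: sum(vals) / float(len(vals)) for cls, vals in groups.items()}

# (sgpt, class) pairs copied from the ten rows shown above.
pairs = [(45, 1), (59, 2), (33, 2), (34, 2), (12, 2),
         (13, 2), (20, 1), (21, 1), (22, 1), (25, 1)]
rows = [{"sgpt": sgpt, "class": cls} for sgpt, cls in pairs]
means = mean_by_class(rows, "sgpt")
print(sorted(means.items()))  # [(1, 26.6), (2, 30.2)]
```

Running the same comparison over all rows and all six features would show whether the trend holds beyond this small excerpt.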

In [86]:
data2.show(view="Bar Chart", x="alkphos", y='class')

In [87]:
data2.show(view="BoxWhisker Plot", x="sgot", y='class')

In [88]:
train_data2, test_data2 = data2.random_split(0.8)
feature_array2 = ['alkphos', 'sgot', 'sgpt', 'mcv', 'gammagt', 'drink']
model_neural2 = graphlab.neuralnet_classifier.create(train_data2, target='class', features=feature_array2)

Using network:

### network layers ###
layer[0]: FullConnectionLayer
  init_sigma = 0.01
  init_random = gaussian
  init_bias = 0
  num_hidden_units = 10
layer[1]: SigmoidLayer
layer[2]: FullConnectionLayer
  init_sigma = 0.01
  init_random = gaussian
  init_bias = 0
  num_hidden_units = 2
layer[3]: SoftmaxLayer
### end network layers ###

### network parameters ###
learning_rate = 0.001
momentum = 0.9
### end network parameters ###

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [89]:
model_neural2.show(view="Summary")

In [90]:
pred2 = model_neural2.classify(test_data2)

In [91]:
pred2

row_id,class,probability
0,2,0.514295995235
1,2,0.514591932297
2,2,0.51342612505
3,2,0.513011157513
4,2,0.51477253437
5,2,0.513086438179
6,2,0.513864278793
7,2,0.513604283333
8,2,0.51284044981
9,2,0.517160654068


In [92]:
eval2 = model_neural2.evaluate(test_data2)

In [93]:
eval2

{'accuracy': 0.6393442749977112, 'confusion_matrix': Columns:
 	target_label	int
 	predicted_label	int
 	count	int
 
 Rows: 2
 
 Data:
 +--------------+-----------------+-------+
 | target_label | predicted_label | count |
 +--------------+-----------------+-------+
 |      1       |        2        |   22  |
 |      2       |        2        |   39  |
 +--------------+-----------------+-------+
 [2 rows x 3 columns]}

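We see that the neural network struggles to reach good accuracy and precision because of the low number of entries in the data set (345, to be exact). It predicts class 2 for every test row, so its accuracy of about 0.64 is simply the share of class 2 in the test set (39 of 61).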

#### Modelling with SVM (Liver Cancer)

In [94]:
model_vector2 = graphlab.svm_classifier.create(train_data2, target='class', features=feature_array2, validation_set='auto', verbose=True)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [95]:
model_vector2.show(view="Summary")

In [96]:
prediction_2 = model_vector2.predict(test_data2)

In [97]:
prediction_2.show()


In [98]:
evaluation_2 = model_vector2.evaluate(test_data2)

In [99]:
model_vector2.show(view="Evaluation")