# Question 1: Cancer Diagnosis Using Machine Learning

In [1]:
# import libraries and packages
import numpy as np
import pandas as pd

## a - Read the dataset file “Cancer.csv” and assign it to a Pandas DataFrame

In [2]:
cancer_df = pd.read_csv("https://github.com/mpourhoma/CS4661/raw/master/Cancer.csv")

Inspect the dataset. It includes 9 numerical features, and the last column is the binary label (“1” means a malignant cancer, “0” means a benign tumor). You will use all 9 features in this homework.

In [3]:
cancer_df[::10]

Unnamed: 0,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses,Malignant_Cancer
0,5,1,1,1,2,1,3,1,1,0
10,5,3,3,3,2,3,4,4,1,1
20,5,4,4,9,2,10,5,6,1,1
30,9,5,8,1,2,3,2,1,5,1
40,5,3,5,5,3,3,4,10,1,1
50,5,1,3,1,2,1,2,1,1,0
60,2,2,2,1,1,1,7,1,1,0
70,1,1,1,1,2,1,3,1,1,0
80,10,3,5,1,10,5,3,10,2,1
90,1,3,1,2,2,2,5,3,2,0


## b - Use sklearn functions to split the dataset into testing and training sets with the following parameters: test_size=0.35, random_state=3.

In [4]:
# create a Python list of the feature names to select from the dataset:
feature_cols = ['Clump_Thickness','Uniformity_of_Cell_Size','Uniformity_of_Cell_Shape',
                'Marginal_Adhesion','Single_Epithelial_Cell_Size','Bare_Nuclei',
                'Bland_Chromatin','Normal_Nucleoli','Mitoses']

# use the above list to select the features from the original DataFrame
X = cancer_df[feature_cols] 

# select a Series of labels (the last column) from the DataFrame
y = cancer_df['Malignant_Cancer']

# print the first 5 rows
X.head()

# print("============")
# print(y.head())

Unnamed: 0,Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses
0,5,1,1,1,2,1,3,1,1
1,5,4,4,5,7,10,3,2,1
2,3,1,1,1,2,2,3,1,1
3,6,8,8,1,3,4,3,7,1
4,4,1,1,3,2,1,3,1,1


In [5]:
from sklearn.model_selection import train_test_split

# Randomly splitting the original dataset into training set and testing set:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=3)

# print the size of the training set:
print(X_train.shape)
print(y_train.shape)

# print the size of the testing set:
print(X_test.shape)
print(y_test.shape)

(97, 9)
(97,)
(53, 9)
(53,)
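
The shapes above imply that the dataset has 150 rows (97 + 53). As a rough check of how `test_size=0.35` produces that split, the sketch below runs `train_test_split` on stand-in data of the same shape; `X_demo` and `y_demo` are hypothetical random arrays, not the Cancer data. scikit-learn rounds the test count up, so 0.35 × 150 = 52.5 becomes 53 test rows and the remaining 97 go to training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 150 rows, 9 integer features, binary labels
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 11, size=(150, 9))
y_demo = rng.integers(0, 2, size=150)

# Same split parameters as in part (b)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=3
)

print(X_tr.shape, X_te.shape)
```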


## c - Use “Decision Tree Classifier” to predict Cancer based on the training/testing datasets that you built in part (b).  Then, calculate and report the accuracy of your classifier. 

### Use this command to define your tree: 
my_DecisionTree = DecisionTreeClassifier(random_state=3).

In [6]:
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


my_decisiontree = DecisionTreeClassifier(random_state=3)

my_decisiontree.fit(X_train, y_train)

y_predict_dt = my_decisiontree.predict(X_test)

accuracy =  accuracy_score(y_test, y_predict_dt)


print("Decision Tree: ", accuracy)

Decision Tree:  0.8301886792452831
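
Accuracy is the fraction of test samples whose predicted label matches the true label; the value above is consistent with 44 correct predictions out of 53. The sketch below illustrates the computation on hypothetical toy labels (`y_true_demo` and `y_pred_demo` are made up for illustration) and compares a manual calculation with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical toy labels, not taken from the Cancer dataset
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Manual accuracy: share of positions where prediction equals truth
manual_accuracy = np.mean(y_true_demo == y_pred_demo)

# Same quantity computed by scikit-learn
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy, sklearn_accuracy)  # 6 of 8 correct -> 0.75 0.75
```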


## d - Now, we want to perform a new Ensemble Learning method called “Bagging” based on Voting on 19 decision tree classifiers.

Note: you should write your own code to perform Bagging (don’t use scikit-learn functions for Bagging!)
To do so, you need to perform bootstrapping first. You can write a “for” loop with loop variable i = 0...18.

### In each iteration of the loop, you have to:

1. Make a bootstrap sample of the original “Training” Dataset (built in part (b)) with the
size of bootstarp_size = 0.8*(Size of the original dataset). You can use the following command to generate a random bootstrap dataset (“i” is the loop variable, so the random_state changes in each iteration):
resample(X_train, n_samples = bootstarp_size , random_state=i , replace = True)

2. Define and train a new base decision tree classifier on this dataset in each iteration: 
Base_DecisionTree = DecisionTreeClassifier(random_state=3).
   
3. Perform prediction using “this base classifier” on the original “Testing” Dataset X_test (built in part (b)), and save the prediction results for all testing samples.

After finishing the “for” loop, you should have 19 different predictions for EACH sample in your testing set. 

Then, perform Voting to make the final decision on each data sample based on the votes of all 19 classifiers.

Finally, calculate and report the final accuracy of your Bagging (Voting) method.

#### Note: You do NOT need to calculate the accuracy of each one of the base classifiers in each round of the loop! You have to just perform Voting to make the final decision on each data sample, and then calculate the accuracy on the final results.
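
To make the bootstrapping step concrete, here is a minimal sketch of what `resample` does with `replace=True`. The arrays `X_small` and `y_small` are hypothetical toy data in which each feature value is the row id, so repeated rows are easy to spot. Passing features and labels together keeps each sampled row paired with its own label.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical toy data: feature value equals the row id, label is its parity
X_small = np.arange(10).reshape(10, 1)
y_small = np.arange(10) % 2

# Bootstrap size: 80% of the number of rows (8 here)
n_boot = int(0.8 * X_small.shape[0])

# Draw rows with replacement; X and y are sampled with the same indices
X_boot, y_boot = resample(
    X_small, y_small, n_samples=n_boot, random_state=0, replace=True
)

print(X_boot.ravel())  # some row ids may repeat, others may be missing
print(y_boot)          # labels stay aligned with the sampled rows
```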

In [12]:
# collect the test-set predictions of each base classifier
predictionList = []

# bootstrap sample size: 80% of the training set size
bootstarp_size = int(np.floor(0.8 * X_train.shape[0]))

# import resample from sklearn
from sklearn.utils import resample

# 19 classifiers (i = 0..18)
for i in range(19):
    
    # new training set: a bootstrap sample drawn with replacement
    newX_train, newY_train = resample(X_train, y_train, n_samples = bootstarp_size , random_state=i , replace = True)
    
    # new Decision Tree model
    Base_DecisionTree = DecisionTreeClassifier(random_state=3)
    
    # train the new model on the new training set
    Base_DecisionTree.fit(newX_train, newY_train)
    
    # use the newly trained model to predict on the testing set
    y_predict_dt = Base_DecisionTree.predict(X_test)
    
    # add the new prediction to predictionList
    predictionList.append(y_predict_dt)
    
# after the loop, predictionList has shape 19 x 53
# 19 rows -> 19 classifiers (0 - 18)
# 53 columns -> 53 testing samples
# in other words, each column holds the predictions for one testing sample


# Step 1: convert predictionList to a 2-D NumPy array
# Step 2: transpose it and call the result predictionMatrix
# Reason: after transposing, the shape is 53 x 19 -> 53 rows (samples) x 19 cols (classifiers)
#         => each row (one sample) contains the predictions from all classifiers.

predictionMatrix = np.array(predictionList).T
# print(predictionMatrix)

# Step 3: now we have predictionMatrix with shape 53 x 19.
# Step 4: perform voting by taking the majority value of each row.
# Reason: each column holds the predictions of one classifier.

# import Counter to count the frequency of distinct values
from collections import Counter

# prediction after voting
y_predict_afterVoting = []

# loop through all testing samples (53 rows = 53 testing samples)
for row in predictionMatrix:
    
    # count how many of the 19 classifiers voted for each label
    c = Counter(row.tolist())
    
    # value = majority label, count = number of votes it received
    value, count = c.most_common(1)[0]
    
    # we only need value -> add value to y_predict_afterVoting
    y_predict_afterVoting.append(value)
    
# after looping through all testing samples, y_predict_afterVoting holds the labels of all testing samples
# then, we can calculate the accuracy against the original y_test
accuracy =  accuracy_score(y_test, y_predict_afterVoting)

print("After voting: ", accuracy)

After voting:  0.9056603773584906
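
For binary 0/1 labels, the voting step can also be written without an explicit loop: a sample is labeled 1 when more than half of the classifiers predict 1. The sketch below shows this alternative on a small hypothetical `votes` array (5 classifiers, 4 samples); it is an illustration of the idea rather than a replacement for the cell above, and it assumes an odd number of voters so that ties cannot occur.

```python
import numpy as np

# Hypothetical predictions: 5 classifiers (rows) x 4 test samples (columns)
votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
])

# Count the 1-votes per sample and compare with half the number of classifiers
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

print(majority)  # column sums are 4, 1, 3, 1 -> [1 0 1 0]
```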



## e - Use scikit-learn “Random Forest” classifier to predict Cancer based on the training/testing datasets that you built in part (b). 

### Then, calculate and report the accuracy of your classifier. Use this command to import and define your classifier:

##### from sklearn.ensemble import RandomForestClassifier
##### my_RandomForest = RandomForestClassifier(n_estimators = 19, bootstrap = True, random_state=3)

### Similar to previous syntax, use my_RandomForest.fit for training your random forest classifier and my_RandomForest.predict for prediction.



In [8]:
from sklearn.ensemble import RandomForestClassifier

my_RandomForest = RandomForestClassifier(n_estimators = 19, bootstrap = True, random_state=3)

# for training: 
my_RandomForest.fit(X_train, y_train)

# for testing/prediction:  
y_predict_rf = my_RandomForest.predict(X_test)

accuracy =  accuracy_score(y_test, y_predict_rf)

print("Random Forest: ", accuracy)

Random Forest:  0.9245283018867925
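
A random forest is closely related to the bagging procedure in part (d): each tree is trained on a bootstrap sample, and in addition only a random subset of features is considered at each split. The sketch below fits the same classifier configuration on synthetic data (`X_syn` and `y_syn` come from `make_classification` and are stand-ins, not the Cancer data) and inspects the fitted `estimators_` attribute, which holds the individual trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with the same shape as the Cancer features
X_syn, y_syn = make_classification(n_samples=150, n_features=9, random_state=0)

# Same configuration as in part (e)
forest = RandomForestClassifier(n_estimators=19, bootstrap=True, random_state=3)
forest.fit(X_syn, y_syn)

# The fitted forest exposes its individual decision trees
n_trees = len(forest.estimators_)
first_predictions = forest.predict(X_syn[:5])

print(n_trees)
print(first_predictions)
```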
