# [Breast Cancer Wisconsin (Original) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29)

Samples arrive periodically as Dr. Wolberg reports his clinical cases. The database therefore reflects this chronological grouping of the data. This grouping information appears immediately below, having been removed from the data itself:

- Group 1: 367 instances (January 1989)
- Group 2:  70 instances (October 1989)
- Group 3:  31 instances (February 1990)
- Group 4:  17 instances (April 1990)
- Group 5:  48 instances (August 1990)
- Group 6:  49 instances (Updated January 1991)
- Group 7:  31 instances (June 1991)
- Group 8:  86 instances (November 1991)

Total: 699 points (as of the donated database on 15 July 1992)
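
As a quick consistency check, the eight group sizes listed above can be summed to confirm the stated total. This is only a small sketch; the variable names are illustrative and do not come from the dataset description.

```python
# Group sizes copied from the list above (Group 1 .. Group 8).
group_sizes = [367, 70, 31, 17, 48, 49, 31, 86]

total = sum(group_sizes)
print(total)  # 699
```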

Note that the results summarized in the Past Usage section of the dataset description refer to a dataset of size 369, while Group 1 has only 367 instances. Group 1 originally contained 369 instances; 2 were removed. The following list summarizes the changes to the original Group 1 data:

   - Group 1: 367 points: 200B 167M (January 1989)
   - Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
   - Revised Nov 22, 1991:
       - Removed 765878,4,5,9,7,10,10,10,3,8,1 (no record)
       - Removed 484201,2,7,8,8,4,3,10,3,4,1 (zero epithelial)
       - Changed 0 to 1 in field 6 of sample 1219406
       - Changed 0 to 1 in field 8 of the following sample: 1182404,2,3,1,1,1,2,0,1,1,1

The data can be accessed at [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data), and its description is given [here](https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.names).
### Number of Instances: 699
### Number of Attributes: 10 plus the class attribute

Attribute Information: (class attribute has been moved to last column)

   1. Sample code number            id number
   2. Clump Thickness               1 - 10
   3. Uniformity of Cell Size       1 - 10
   4. Uniformity of Cell Shape      1 - 10
   5. Marginal Adhesion             1 - 10
   6. Single Epithelial Cell Size   1 - 10
   7. Bare Nuclei                   1 - 10
   8. Bland Chromatin               1 - 10
   9. Normal Nucleoli               1 - 10
  10. Mitoses                       1 - 10
  11. Class:                        2 for benign, 4 for malignant
  
- Missing attribute values:
   There are 16 instances in Groups 1 to 6 that contain a single missing 
   (i.e., unavailable) attribute value, now denoted by "?".
- Class distribution:
    - Benign: 458 (65.5%)
    - Malignant: 241 (34.5%)
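
The "?" marker described above can also be handled at read time. The sketch below assumes pandas and uses three made-up rows in the same 11-column layout (they are not real records from the dataset); `na_values="?"` is a standard `read_csv` parameter, while the sample values and variable names are purely illustrative.

```python
import io

import pandas as pd

# Three invented rows in the dataset's 11-column layout; NOT real records.
sample_csv = io.StringIO(
    "1000001,5,1,1,1,2,1,3,1,1,2\n"
    "1000002,8,4,5,1,2,?,7,3,1,4\n"
    "1000003,1,1,1,1,2,2,3,1,1,2\n"
)

# na_values="?" makes pandas parse the marker as NaN while reading,
# so the affected column stays numeric instead of becoming strings.
df = pd.read_csv(sample_csv, header=None, na_values="?")

# Count missing entries per column; column 6 is the seventh field
# (Bare Nuclei in the attribute list above).
missing_per_column = df.isna().sum()
print(missing_per_column[6])
print(df[6].dtype)
```

Reading the marker as NaN up front avoids a separate string replacement step, at the cost of hiding where the "?" entries were unless they are counted as shown.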

## Members
- #### Usama Saeed [BSEF14M547]
- #### H. Usama Tariq [BSEF14M556]

In [1]:
import pandas as pd 
import numpy as np

# Read the comma-separated data straight from the UCI repository into a DataFrame.
# The file has no header row, so the columns are numbered 0..10 for now.
data_frame = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data',
                         sep=',', header=None)

# Optional: widen the display limits to inspect the whole table
# pd.options.display.max_rows = 700
# pd.options.display.max_columns = 15
# print(data_frame)

In [2]:
# Work on a copy so the raw data_frame stays untouched
data_frame_1 = data_frame.copy()

# Column names, in the order given in the attribute list above
columns = ["Sample Code Number", "Clump Thickness", "Uniformity of Cell Size",
           "Uniformity of Cell Shape", "Marginal Adhesion", "Single Epithelial Cell Size",
           "Bare Nuclei", "Bland Chromatin", "Normal Nucleoli", "Mitoses", "Class"]
data_frame_1.columns = columns


In [3]:
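from sklearn.impute import SimpleImputer

# Cleaning the data: '?' marks a missing value, so turn it into NaN and make
# every column numeric (a column containing '?' is read in as strings)
data_frame_1 = data_frame_1.replace('?', np.nan).astype(float)

# Replace every missing value with the mode (most frequent value) of its column.
# SimpleImputer replaces the Imputer class, which newer scikit-learn releases no longer ship.
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
data_frame_1 = pd.DataFrame(imputer.fit_transform(data_frame_1), columns=data_frame_1.columns)

# Separate the target 'Y' from the data and recode it: 2 (benign) -> 0, 4 (malignant) -> 1
Y = data_frame_1.pop('Class')
Y = [0 if i == 2 else 1 for i in Y]

# Note: 'Sample Code Number' is an identifier, not a measurement, but it stays
# in data_frame_1 and is therefore counted among the 10 features used below.

# Optional: inspect the cleaned data
# print(data_frame_1)
# print(Y)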
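
To make the mode-imputation step concrete, here is a minimal sketch on an invented toy column. It assumes a recent scikit-learn, where the imputer class is `sklearn.impute.SimpleImputer`; the array values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing entry; the values are invented for illustration.
toy = np.array([[1.0], [1.0], [10.0], [np.nan], [3.0]])

# strategy="most_frequent" fills each NaN with the column's mode.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
filled = imputer.fit_transform(toy)

print(filled.ravel())       # the NaN is replaced by the mode, 1.0
print(imputer.statistics_)  # per-column fill value learned during fit
```

Mode imputation suits these attributes because they are small integer grades (1 to 10) rather than continuous measurements, so the filled value is always one that actually occurs in the column.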

In [4]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score, KFold
from sklearn.model_selection import GridSearchCV

# Scale every feature to the [0, 1] range
scaler = MinMaxScaler()
data_frame_1_scale = scaler.fit_transform(data_frame_1)

# Cross-Validation folds
cv = KFold(n_splits=10, shuffle=True)
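
The cell above scales the full dataset once and then shuffles the folds without a seed. One possible alternative, which is not what this notebook does, is to put the scaler inside a scikit-learn pipeline so it is fit on each training split only, and to fix `random_state` so repeated runs use the same folds. The sketch below uses synthetic stand-in data and a fixed `n_neighbors`, both of which are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 200 samples, 9 integer grades in 1..10.
rng = np.random.RandomState(0)
X = rng.randint(1, 11, size=(200, 9)).astype(float)
y = (X[:, 0] + X[:, 1] > 11).astype(int)

# The scaler sits inside the pipeline, so every fold fits it on its own
# training split and only applies it to the held-out split.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=5))

# random_state fixes the shuffle so repeated runs use identical folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(scores.mean())
```

With min-max scaling on bounded 1 to 10 grades the practical difference is likely small, but the pipeline form keeps the held-out split out of every fitting step.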

In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

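# Each helper below runs a grid search with the shared 10-fold CV and
# returns (best mean cross-validated accuracy, best parameters).
def KNeighbors_Classifier(x, y):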
    params = dict(n_neighbors = list(range(1,31)))
    grid = GridSearchCV(KNeighborsClassifier(), params, n_jobs=-1, cv=cv)
    grid.fit(x,y)
    return (grid.best_score_,grid.best_params_)

def Gaussian_NB(x, y):
    params = dict(priors = [None])
    grid = GridSearchCV(GaussianNB(), params, n_jobs=-1, cv=cv)
    grid.fit(x,y)
    return (grid.best_score_,grid.best_params_)

def DecisionTree_Classifier(x, y, nf):
    params = dict(max_depth= list(range(1,nf)), max_features= [nf])
    grid = GridSearchCV(DecisionTreeClassifier(), params, n_jobs=-1, cv=cv)
    grid.fit(x,y)
    return (grid.best_score_,grid.best_params_)

def RandomForest_Classifier(x, y, nf):
    params = dict(max_depth= list(range(1,nf)), max_features= [nf], n_estimators=[nf])
    grid = GridSearchCV(RandomForestClassifier(), params, n_jobs=-1, cv=cv)
    grid.fit(x,y)
    return (grid.best_score_,grid.best_params_)
        
def SupportVector_Classifier(x, y):
    params = dict(C = [0.001, 0.01, 0.1, 1, 10, 15, 20, 25], gamma= [0.001, 0.01, 0.1, 1])
    grid = GridSearchCV(SVC(), params, n_jobs=-1, cv=cv)
    grid.fit(x,y)
    return (grid.best_score_,grid.best_params_)
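
Each helper above returns only `best_score_` and `best_params_`. When comparing classifiers whose scores are close, it can also help to look at the spread across folds, which `GridSearchCV` exposes through `cv_results_`. The following is a rough sketch on synthetic data from `make_classification` (an assumption, used only to keep the example self-contained) with a shortened parameter grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption) so the sketch runs on its own.
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=cv)
grid.fit(X, y)

# cv_results_ holds one entry per parameter combination;
# best_index_ points at the winning combination.
best = grid.best_index_
mean_score = grid.cv_results_["mean_test_score"][best]
std_score = grid.cv_results_["std_test_score"][best]

print(grid.best_params_, mean_score, std_score)
```

A standard deviation that is larger than the gap between two classifiers suggests the ranking between them may not be stable.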


In [6]:
KNeighbors_Classifier_score, KNeighbors_Classifier_params = KNeighbors_Classifier(data_frame_1_scale, Y)
Gaussian_NB_score, Gaussian_NB_params = Gaussian_NB(data_frame_1_scale, Y)
DecisionTree_Classifier_score, DecisionTree_Classifier_params = DecisionTree_Classifier(data_frame_1_scale, 
                                                                                        Y, data_frame_1_scale.shape[1])
RandomForest_Classifier_score, RandomForest_Classifier_params = RandomForest_Classifier(data_frame_1_scale, 
                                                                                        Y, data_frame_1_scale.shape[1])
SupportVector_Classifier_score, SupportVector_Classifier_params = SupportVector_Classifier(data_frame_1_scale, Y)

In [7]:
print("KNeighbors Classifier Score : ", KNeighbors_Classifier_score)
print("KNeighbors Classifier Params: ", KNeighbors_Classifier_params)
print("-------------------------------------------------------")
print("Gaussian Naive Bayes Score : ", Gaussian_NB_score)
print("Gaussian Naive Bayes Params : ", Gaussian_NB_params)
print("-------------------------------------------------------")
print("Decision Tree Classifier Score : ", DecisionTree_Classifier_score)
print("Decision Tree Classifier Params : ", DecisionTree_Classifier_params)
print("-------------------------------------------------------")
print("Random Forest Classifier Score : ", RandomForest_Classifier_score)
print("Random Forest Classifier Params : ", RandomForest_Classifier_params)
print("-------------------------------------------------------")
print("Support Vector Classifier Score : ", SupportVector_Classifier_score)
print("Support Vector Classifier Params : ", SupportVector_Classifier_params)
print("-------------------------------------------------------")

KNeighbors Classifier Score :  0.9699570815450643
KNeighbors Classifier Params:  {'n_neighbors': 8}
-------------------------------------------------------
Gaussian Naive Bayes Score :  0.9599427753934192
Gaussian Naive Bayes Params :  {'priors': None}
-------------------------------------------------------
Decision Tree Classifier Score :  0.9456366237482118
Decision Tree Classifier Params :  {'max_features': 10L, 'max_depth': 3}
-------------------------------------------------------
Random Forest Classifier Score :  0.9585121602288984
Random Forest Classifier Params :  {'max_features': 10L, 'n_estimators': 10L, 'max_depth': 9}
-------------------------------------------------------
Support Vector Classifier Score :  0.9670958512160229
Support Vector Classifier Params :  {'C': 0.1, 'gamma': 1}
-------------------------------------------------------
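
For context when reading the scores above, it may help to compare them with a trivial baseline: a rule that always predicts the majority class. The short calculation below uses only the class counts given in the class distribution earlier in this document; the variable names are illustrative.

```python
# Class counts taken from the class distribution earlier in this document.
benign, malignant = 458, 241
total = benign + malignant

# Accuracy of a rule that always predicts the majority class (benign).
baseline_accuracy = max(benign, malignant) / total
print(round(baseline_accuracy, 4))  # 0.6552
```

All five classifiers score well above this baseline.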


In [8]:
from operator import itemgetter

from sklearn.feature_selection import SelectPercentile
from sklearn.preprocessing import PolynomialFeatures

# Expand the scaled features with all polynomial terms up to degree 2
data_frame_1_poly = PolynomialFeatures(degree=2).fit_transform(data_frame_1_scale)
# print(data_frame_1_poly.shape)

# Percentages of the polynomial features to keep: 10, 25, 40, 55, 70, 85
percentiles = range(10, 100, 15)

best_accuracy = []

for p in percentiles:
    # Keep the top p percent of the features (default scoring function)
    data_frame_1_select = SelectPercentile(percentile=p).fit_transform(data_frame_1_poly, Y)
    nf = data_frame_1_select.shape[1]

    # (name, (best score, best params)) for every classifier on this feature subset
    results = [
        ("KNeighbors_Classifier", KNeighbors_Classifier(data_frame_1_select, Y)),
        ("Gaussian_NB", Gaussian_NB(data_frame_1_select, Y)),
        ("DecisionTree_Classifier", DecisionTree_Classifier(data_frame_1_select, Y, nf)),
        ("RandomForest_Classifier", RandomForest_Classifier(data_frame_1_select, Y, nf)),
        ("SupportVector_Classifier", SupportVector_Classifier(data_frame_1_select, Y)),
    ]

    # Record the best classifier for this subset as (name, n_features, score, params);
    # on a tie, max() keeps the first entry, matching the order above
    name, (score, params) = max(results, key=lambda r: r[1][0])
    best_accuracy.append((name, nf, score, params))

# Sort the per-subset winners by accuracy, best first
best_accuracy = sorted(best_accuracy, key=itemgetter(2), reverse=True)



In [9]:
for i in best_accuracy:
    print ("'%s', with selected features '%d', and params '%s' provides Best Accuracy: %f" % (i[0], i[1], i[3], i[2]))

'KNeighbors_Classifier', with selected features '36', and params '{'n_neighbors': 5}' provides Best Accuracy: 0.971388
'KNeighbors_Classifier', with selected features '46', and params '{'n_neighbors': 3}' provides Best Accuracy: 0.971388
'KNeighbors_Classifier', with selected features '17', and params '{'n_neighbors': 7}' provides Best Accuracy: 0.969957
'SupportVector_Classifier', with selected features '26', and params '{'C': 0.01, 'gamma': 1}' provides Best Accuracy: 0.969957
'KNeighbors_Classifier', with selected features '56', and params '{'n_neighbors': 3}' provides Best Accuracy: 0.968526
'Gaussian_NB', with selected features '7', and params '{'priors': None}' provides Best Accuracy: 0.964235
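
The feature counts in the output above (7 to 56) are subsets of the degree-2 polynomial expansion of the 10 scaled input columns. The size of that expansion can be worked out in closed form and checked against `PolynomialFeatures`; the sketch below assumes Python 3.8+ for `math.comb` and uses a dummy all-zero row, since only the output shape matters here.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_inputs = 10  # columns passed to PolynomialFeatures in the cell above
degree = 2

# Closed form: number of monomials of total degree <= 2 in 10 variables
# (1 constant + 10 linear + 55 quadratic terms).
expected = comb(n_inputs + degree, degree)

# Same count measured from the transformer on a dummy row.
poly = PolynomialFeatures(degree=degree)
measured = poly.fit_transform(np.zeros((1, n_inputs))).shape[1]

print(expected, measured)  # 66 66
```

Keeping 10 to 85 percent of 66 columns gives subsets of roughly the sizes reported in the output.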


In [10]:
for i in best_accuracy:
    print(i[0], " with selected features '", i[1], "', and params '", i[3], "', provides Accuracy :'", i[2], "'", '\n')

KNeighbors_Classifier  with selected features ' 36 ', and params ' {'n_neighbors': 5} ', provides Accuracy :' 0.9713876967095851 ' 

KNeighbors_Classifier  with selected features ' 46 ', and params ' {'n_neighbors': 3} ', provides Accuracy :' 0.9713876967095851 ' 

KNeighbors_Classifier  with selected features ' 17 ', and params ' {'n_neighbors': 7} ', provides Accuracy :' 0.9699570815450643 ' 

SupportVector_Classifier  with selected features ' 26 ', and params ' {'C': 0.01, 'gamma': 1} ', provides Accuracy :' 0.9699570815450643 ' 

KNeighbors_Classifier  with selected features ' 56 ', and params ' {'n_neighbors': 3} ', provides Accuracy :' 0.9685264663805436 ' 

Gaussian_NB  with selected features ' 7 ', and params ' {'priors': None} ', provides Accuracy :' 0.9642346208869814 ' 



# Conclusion

The best accuracy on the data with the original features (10 in number) is **97.00%**, while after feature engineering (polynomial features of degree 2) followed by feature selection the best accuracy is **97.14%** with 36 features. The gain is only about **0.14** percentage points, a very small difference. This suggests that the models already perform well on the original features, without feature engineering or feature selection.
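
To put the size of that gain in perspective, the two best scores can be converted back into counts of correctly classified samples. This sketch assumes each reported score is the fraction of all 699 samples predicted correctly across the folds, which is an interpretation of the output rather than something the notebook states.

```python
n_samples = 699  # dataset size stated at the top of this document

# Best scores copied from the outputs of cells 7 and 10.
score_original = 0.9699570815450643
score_engineered = 0.9713876967095851

# Assumption: each score is (correct predictions) / (all samples).
correct_original = score_original * n_samples
correct_engineered = score_engineered * n_samples

print(round(correct_original), round(correct_engineered))   # 678 679
print(round((score_engineered - score_original) * 100, 2))  # 0.14
```

Under that assumption, the improvement corresponds to a single additional sample classified correctly.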