In this project, we work with a breast cancer dataset and build a Support Vector Machine (SVM) model to classify a cell sample as benign or malignant.

### 1. Importing necessary libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Train test split of data
from sklearn.model_selection import train_test_split

#Modelling
from sklearn import svm

#Evaluation
from sklearn.metrics import classification_report

### 2. Dataset description

There are 699 records, each with 11 fields: 9 predictors, the sample ID, and the class of the cell. The fields in each record are:

| Field name | Description |
| --- | --- |
| ID | Identifier |
| Clump | Clump thickness |
| UnifSize | Uniformity of cell size |
| UnifShape | Uniformity of cell shape |
| MargAdh | Marginal Adhesion |
| SingEpiSize | Single Epithelial Cell Size |
| BareNuc | Bare Nuclei |
| BlandChrom | Bland Chromatin |
| NormNucl | Normal Nucleoli |
| Mit | Mitosis |
| Class | Benign or malignant |

### 3. Importing and analysing dataset

In [3]:
cellDataFrame = pd.read_csv('cell_samples.csv')
cellDataFrame.head()

Unnamed: 0,ID,Clump,UnifSize,UnifShape,MargAdh,SingEpiSize,BareNuc,BlandChrom,NormNucl,Mit,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2


In [4]:
cellDataFrame.shape

(699, 11)

In [5]:
#count the non-null values in each column to check for missing entries (placeholder strings are not detected this way)
cellDataFrame.count()

ID             699
Clump          699
UnifSize       699
UnifShape      699
MargAdh        699
SingEpiSize    699
BareNuc        699
BlandChrom     699
NormNucl       699
Mit            699
Class          699
dtype: int64

In [6]:
#Counting the number of malignant and benign cells in the dataset
cellDataFrame['Class'].value_counts()

2    458
4    241
Name: Class, dtype: int64

In the Class column, 2 denotes benign and 4 denotes malignant.

In [18]:
cellDataFrame.dtypes

ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object
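
The `object` dtype on `BareNuc` suggests that the column holds at least one entry that is not a number. A minimal sketch of how such entries could be located is shown here, using a small made-up Series in place of the real column. The `'?'` placeholder is an assumption about how missing values might be encoded, not something shown in the output above.

```python
import pandas as pd

# Hypothetical stand-in for cellDataFrame['BareNuc']; the '?' entries are assumed
bare_nuc = pd.Series(['1', '10', '?', '4', '?', '2'], name='BareNuc')

# Values that cannot be parsed as numbers become NaN
as_numeric = pd.to_numeric(bare_nuc, errors='coerce')

# Boolean mask of the entries that failed to parse
is_bad = as_numeric.isnull()

print(bare_nuc[is_bad].value_counts())  # which non-numeric strings occur, and how often
print(int(is_bad.sum()))                # number of rows that a numeric filter would drop
```

Running the same two steps on the real column would show which rows are affected before they are filtered out.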

### 4. Split the dataset based on the classes

In [7]:
# A combination of over-sampling the minority class and under-sampling the majority class
# can improve classifier performance on imbalanced data.
# Here we simply take the first 200 rows of each class; the minority class is the malignant cells.
malignantDataFrame = cellDataFrame[cellDataFrame['Class']==4][0:200]
malignantDataFrame.head()

Unnamed: 0,ID,Clump,UnifSize,UnifShape,MargAdh,SingEpiSize,BareNuc,BlandChrom,NormNucl,Mit,Class
5,1017122,8,10,10,8,7,10,9,7,1,4
12,1041801,5,3,3,3,2,3,4,4,1,4
14,1044572,8,7,5,10,7,9,5,5,4,4
15,1047630,7,4,6,4,6,1,4,3,1,4
18,1050670,10,7,7,6,4,10,4,1,2,4


In [8]:
benignDataFrame = cellDataFrame[cellDataFrame['Class']==2][0:200]
benignDataFrame.head()

Unnamed: 0,ID,Clump,UnifSize,UnifShape,MargAdh,SingEpiSize,BareNuc,BlandChrom,NormNucl,Mit,Class
0,1000025,5,1,1,1,2,1,3,1,1,2
1,1002945,5,4,4,5,7,10,3,2,1,2
2,1015425,3,1,1,1,2,2,3,1,1,2
3,1016277,6,8,8,1,3,4,3,7,1,2
4,1017023,4,1,1,3,2,1,3,1,1,2
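
Slicing with `[0:200]` keeps the first 200 rows of each class in file order. If a randomly drawn, class-balanced subset is wanted instead, one possible sketch is random under-sampling with `DataFrame.sample`. The toy frame and the names `toy`, `n_per_class`, and `balanced` are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for cellDataFrame: 6 benign (2) rows and 3 malignant (4) rows
toy = pd.DataFrame({
    'Clump': [5, 3, 4, 1, 2, 1, 8, 10, 7],
    'Class': [2, 2, 2, 2, 2, 2, 4, 4, 4],
})

# The size of the smallest class decides how many rows to keep per class
n_per_class = int(toy['Class'].value_counts().min())

# Randomly draw n_per_class rows from each class, then stack the draws
balanced = pd.concat(
    [group.sample(n=n_per_class, random_state=0) for _, group in toy.groupby('Class')]
)

print(balanced['Class'].value_counts())
```

Fixing `random_state` keeps the draw reproducible between runs.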


### 5. Modify dataset based on requirements

In [11]:
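#drop rows whose 'BareNuc' value is not numeric, then convert the column from object datatype to int datatype
#.copy() makes the filtered frame independent, which avoids pandas' SettingWithCopyWarning on the next line
cellDataFrame = cellDataFrame[pd.to_numeric(cellDataFrame['BareNuc'], errors='coerce').notnull()].copy()
cellDataFrame['BareNuc'] = cellDataFrame['BareNuc'].astype('int')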
cellDataFrame.dtypes

ID             int64
Clump          int64
UnifSize       int64
UnifShape      int64
MargAdh        int64
SingEpiSize    int64
BareNuc        int32
BlandChrom     int64
NormNucl       int64
Mit            int64
Class          int64
dtype: object

### 7. Remove unwanted columns
We will remove the ID column, which won't help us in prediction, and set aside Class, which is the target we want to predict.


In [12]:
cellDataFrame.columns

Index(['ID', 'Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize',
       'BareNuc', 'BlandChrom', 'NormNucl', 'Mit', 'Class'],
      dtype='object')

In [13]:
featureDataFrame = cellDataFrame[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize',
       'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]

#convert featureDataFrame into numpy n-dimensional array
#independent variable
X=np.asarray(featureDataFrame)

#dependent variable
y=np.asarray(cellDataFrame['Class'])

X[0:5]

array([[ 5,  1,  1,  1,  2,  1,  3,  1,  1],
       [ 5,  4,  4,  5,  7, 10,  3,  2,  1],
       [ 3,  1,  1,  1,  2,  2,  3,  1,  1],
       [ 6,  8,  8,  1,  3,  4,  3,  7,  1],
       [ 4,  1,  1,  3,  2,  1,  3,  1,  1]], dtype=int64)

### 8. Divide the data into train/test set

In [14]:
'''
cellDataFrame (100%) --> Train (80% of rows) / Test (20% of rows)

Train(X, y) ## X is a 2D array and y is a 1D array
Test(X, y)
'''

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4) #random_state seeds the shuffle so the split is reproducible

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(546, 9)
(137, 9)
(546,)
(137,)
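
With roughly two benign samples for every malignant one, a purely random split can leave the test set with a somewhat different class ratio than the training set. `train_test_split` accepts a `stratify` argument that keeps the ratio the same in both parts. The sketch below uses made-up arrays (`X_toy`, `y_toy`), not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 60 labelled 2 (benign) and 40 labelled 4 (malignant)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([2] * 60 + [4] * 40)

# stratify=y_toy preserves the 60/40 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4, stratify=y_toy
)

print(np.bincount(y_tr)[[2, 4]])  # class counts in the training split
print(np.bincount(y_te)[[2, 4]])  # class counts in the test split
```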


### 9. Modelling

SVC (Support Vector Classifier) finds the hyperplane that separates the classes with the largest margin, where the margin is the perpendicular distance from the hyperplane to the nearest data points. <br>
Those nearest data points, which alone determine the position of the hyperplane, are called support vectors. <br>
The SVM algorithm offers a choice of kernel functions for performing its processing. Mapping data into a higher-dimensional space is called kernelling. <br>
The mathematical function used for the transformation is known as the kernel function, and can be of different types, such as:

1. Linear
2. Polynomial
3. Radial Basis Function (RBF)
4. Sigmoid

Each of these functions has its characteristics, its pros and cons, and its equations, but as there's no easy way of knowing which function performs best with any given dataset, we usually choose different functions and compare the results. 
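
One way to carry out that comparison is to score each kernel with cross-validation. The sketch below is only illustrative: it uses synthetic data from `make_classification` as a stand-in for the cell samples, and the settings `gamma='scale'`, `C=1`, and `cv=5` are assumptions rather than choices made in this notebook.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with 9 features, mirroring the 9 predictors used above
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=0
)

kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    model = svm.SVC(kernel=kernel, gamma='scale', C=1)
    # Mean accuracy over 5 cross-validation folds
    kernel_scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for kernel, score in kernel_scores.items():
    print(f'{kernel:8s} {score:.3f}')
```

The ranking obtained on synthetic data says nothing about the cell dataset; the same loop would need to be run on `X` and `y` to compare kernels there.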

C, the regularization parameter, tells the SVM optimization how much you want to avoid misclassifying each training example. C is the penalty applied to the misclassification (error) term, so it controls the trade-off between a wide margin and a low training error. A smaller value of C tolerates more misclassified points and creates a larger-margin hyperplane, while a larger value of C penalizes errors more heavily and creates a smaller-margin hyperplane.
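
The effect of C on the margin can be checked numerically. For a linear kernel the margin width equals 2 / ||w||, where w is the fitted weight vector (`coef_`). The sketch below fits two values of C on made-up, overlapping 2-D clusters; the data and the values 0.01 and 10 are illustrative assumptions.

```python
import numpy as np
from sklearn import svm

# Two overlapping 2-D clusters, made up for illustration
rng = np.random.default_rng(0)
X_c = np.vstack([
    rng.normal(loc=0.0, scale=1.5, size=(50, 2)),
    rng.normal(loc=2.0, scale=1.5, size=(50, 2)),
])
y_c = np.array([2] * 50 + [4] * 50)

margins = {}
for C in [0.01, 10]:
    clf = svm.SVC(kernel='linear', C=C).fit(X_c, y_c)
    # For a linear kernel the margin width is 2 / ||w||
    margins[C] = 2 / np.linalg.norm(clf.coef_)
    print(f'C={C}: margin width {margins[C]:.2f}, support vectors {clf.n_support_.sum()}')
```

On data like this, the small-C model should report the wider margin, since the optimizer accepts more margin violations in exchange for a smaller ||w||.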

In [15]:
classifier = svm.SVC(kernel='linear', gamma='auto', C=2) #gamma is ignored by the linear kernel
classifier.fit(X_train, y_train)

y_predict = classifier.predict(X_test)

### 10. Evaluation

In [16]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           2       1.00      0.94      0.97        90
           4       0.90      1.00      0.95        47

    accuracy                           0.96       137
   macro avg       0.95      0.97      0.96       137
weighted avg       0.97      0.96      0.96       137



precision = true_positive/(true_positive + false_positive)

recall = true_positive/(true_positive + false_negative) = true_positive/total_actual_positive

F1: harmonic mean of precision and recall = 2*((precision * recall)/(precision + recall))

support: the number of actual instances of each class in the test set
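
As a worked check of these formulas, the sketch below treats class 4 (malignant) as the positive class. The counts are inferred so that they are consistent with the rounded values in the report (47 true positives, 5 false positives, 0 false negatives); they are an assumption, not output taken from the notebook.

```python
# Counts for class 4 (malignant) as the positive class.
# Assumed values, chosen to be consistent with the rounded report above.
true_positive = 47   # malignant samples predicted as malignant
false_positive = 5   # benign samples predicted as malignant
false_negative = 0   # malignant samples predicted as benign

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f1 = 2 * (precision * recall) / (precision + recall)
support = true_positive + false_negative  # actual malignant samples

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}')
# prints: precision=0.90 recall=1.00 f1=0.95 support=47
```

The result matches the row for class 4 in the report. A recall of 1.00 for the malignant class means that, under these assumed counts, no malignant sample in the test set was labelled benign.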

### Hyperparameter tuning for SVM model - C, gamma, kernel

##### 1. GridSearchCV
##### 2. RandomizedSearchCV
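
A minimal sketch of both search strategies follows. It runs on synthetic stand-in data, and the candidate values in `param_grid` are illustrative assumptions, not tuned recommendations for the cell dataset.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Synthetic stand-in data; in the notebook this would be X and y
X_demo, y_demo = make_classification(
    n_samples=200, n_features=9, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# Candidate values are illustrative only
param_grid = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto'],
}

# 1. GridSearchCV tries every combination in the grid
grid = GridSearchCV(svm.SVC(), param_grid, cv=5)
grid.fit(X_tr, y_tr)
print('grid search:  ', grid.best_params_, round(grid.best_score_, 3))

# 2. RandomizedSearchCV samples a fixed number of combinations from the grid
rand = RandomizedSearchCV(svm.SVC(), param_grid, n_iter=5, cv=5, random_state=0)
rand.fit(X_tr, y_tr)
print('random search:', rand.best_params_, round(rand.best_score_, 3))

# Accuracy of the best grid-search model on the held-out test split
print('test accuracy:', round(grid.score(X_te, y_te), 3))
```

Grid search is exhaustive and therefore slower as the grid grows, while randomized search trades completeness for a fixed budget of `n_iter` fits per fold.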