<h1><center>Support Vector Machine</center></h1>



---
### Learning Objective:
- Use an SVM (Support Vector Machine) to build and train a model on human cell records, then classify each sample as benign or malignant.

<img src="images/SVM0.png" style="width:260px;"> 


### When to use:

<img src="images/SVM6.png" style="width:250px;"> 

### NOTES:

<img src="images/SVM1.png" style="width:250px;"> 

<img src="images/SVM2.png" style="width:250px;"> 
<img src="images/SVM5.png" style="width:250px;"> 

<img src="images/SVM4.png" style="width:250px;"> 



---
#### Prepare Data
- SVM works by mapping data to a high-dimensional feature space so that data points can be categorized, even when the data are not otherwise linearly separable. 

#### Prepare Data
- A separator between the categories is found, then the data is transformed in such a way that the separator could be drawn as a hyperplane. 

#### Modeling
- Following this, characteristics of new data can be used to predict the group to which a new record should belong.
---
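The idea of mapping data to a higher-dimensional space can be illustrated with a small experiment. The sketch below is not part of the original notes: it assumes scikit-learn's `make_circles` toy dataset (two concentric rings that no straight line can separate) and compares a linear kernel with the RBF kernel on it.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line in 2-D separates the classes.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear kernel looks for a separating line in the original 2-D space.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)

# The RBF kernel implicitly maps points to a higher-dimensional space,
# where the two rings can be split by a hyperplane.
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"rbf kernel accuracy:    {rbf_acc:.2f}")
```

On data like this the RBF kernel should score clearly higher than the linear one, which is the practical meaning of "not otherwise linearly separable".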

---
## Road Map

- Visualizing
> - overlay the malignant and benign data

- Cleaning
> - verify the datatypes of all the values
> - remove any rows without integer values

- Preparing
> - convert all values into an np.array
> - normalize the data if feature values vary widely in range
> - split the training and testing sets

- Modeling
> - fit the training sets into the algorithm

- Evaluating
> - create confusion matrix
> - compute the F1 score of y_hat against y_test
---
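The visualizing step in the road map (overlaying malignant and benign samples) could look roughly like the sketch below. It is only an illustration: the six rows are made-up values, the choice of `Clump` and `UnifSize` as axes is an assumption, and it relies on the dataset's convention that `Class` 2 means benign and 4 means malignant.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows that reuse the dataset's column names.
df = pd.DataFrame({
    "Clump":    [1, 2, 3, 8, 9, 10],
    "UnifSize": [1, 1, 2, 7, 8, 10],
    "Class":    [2, 2, 2, 4, 4, 4],
})

fig, ax = plt.subplots()
for label, name, color in [(2, "benign", "tab:blue"), (4, "malignant", "tab:red")]:
    subset = df[df["Class"] == label]
    # Both classes are drawn on the same Axes, so the points are overlaid.
    ax.scatter(subset["Clump"], subset["UnifSize"], color=color, label=name)

ax.set_xlabel("Clump")
ax.set_ylabel("UnifSize")
ax.legend()
# In a notebook, the figure is displayed automatically (or call plt.show()).
```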

## Cleaning:

1. Check data types of each value
- df.dtypes

2. Verify column values
- df.columns

3. Drop rows containing unwanted (non-numeric) values
> From the docs

> - pandas.to_numeric(arg, errors='raise', downcast=None)

> - Convert argument to a numeric type.

Parameters:
> arg : scalar, list, tuple, 1-d array, or Series
> - Argument to be converted.

> - errors : {‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
> - If ‘raise’, then invalid parsing will raise an exception.
> - If ‘coerce’, then invalid parsing will be set as NaN.
> - If ‘ignore’, then invalid parsing will return the input.

4. Convert object columns into integers
> - df.column = df.column.astype('int')

---
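The cleaning steps above can be tried on a tiny frame before running them on the real file. This is a minimal sketch under assumptions: the four rows are invented, and the `'?'` placeholder stands in for whatever non-numeric entries the real `BareNuc` column contains.

```python
import pandas as pd

# Hypothetical toy frame: 'BareNuc' holds text because of the '?' entry.
df = pd.DataFrame({
    "Clump": [5, 3, 8, 1],
    "BareNuc": ["1", "?", "10", "2"],
    "Class": [2, 2, 4, 2],
})
print(df.dtypes)

# errors='coerce' turns unparseable entries into NaN instead of raising.
is_numeric = pd.to_numeric(df["BareNuc"], errors="coerce").notnull()

# Keep only the rows that parsed; .copy() avoids assigning into a view.
clean = df[is_numeric].copy()
clean["BareNuc"] = clean["BareNuc"].astype("int")

print(clean.dtypes)
print(len(df), "->", len(clean), "rows")
```

Building the boolean mask first and filtering with it keeps the original strings intact until the final `astype` call, so the row with `'?'` is dropped rather than silently converted.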

---
## Preparing:


select features
> - x = df[features].values (feature matrix), y = df['Class'].values (labels)

split data
> - sklearn.model_selection.train_test_split(*arrays, **options)
> - options: test_size=0.2 (20% of data), random_state=4

---
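A short sketch of the preparing steps follows. It uses invented data, and the `StandardScaler` step is an assumption added to illustrate the road map's "normalize" item; the notebook's own code further down does not scale the features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 10 samples, 3 features on very different scales.
rng = np.random.default_rng(0)
x = rng.integers(1, 11, size=(10, 3)) * np.array([1, 100, 10_000])
y = np.array([2, 4] * 5)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=4
)

# Fit the scaler on the training rows only, then reuse it on the test rows,
# so no information from the test set leaks into preprocessing.
scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train.shape, x_test.shape)
```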

---
## Modeling:


1. Fit the SVM model to the training data
> From the docs

> - class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)

> C-Support Vector Classification.
> The implementation is based on libsvm. The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using sklearn.svm.LinearSVC or sklearn.linear_model.SGDClassifier instead, possibly after a sklearn.kernel_approximation.Nystroem transformer.

Parameters:
> - kernel : {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’}, default=’rbf’
 
> Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).

Methods: 

> - fit(X, y) : Fit the SVM model according to the given training data.

> - predict(X) : Perform classification on samples in X.

> - score(X, y) : Return the mean accuracy on the given test data and labels.


2. Predict outputs from x_test data

---
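The three methods listed above fit together as in the sketch below. The data here is a synthetic stand-in generated with `make_classification` (an assumption, chosen so the snippet runs without the CSV file); only `SVC`, `fit`, `predict`, and `score` come from the notes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the cell data: 9 features, 2 classes.
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_redundant=0, random_state=4
)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

clf = SVC(kernel="linear")      # the default kernel would be 'rbf'
clf.fit(x_train, y_train)       # learn the separating hyperplane
y_hat = clf.predict(x_test)     # predicted label for each test row

# score() is mean accuracy: the fraction of test rows predicted correctly.
accuracy = clf.score(x_test, y_test)
print(y_hat[:5], accuracy)
```

Keeping the fitted estimator in a variable such as `clf` (a name chosen here for illustration) makes it possible to call `predict` and `score` separately, instead of chaining everything into one expression.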

---
## Evaluation:

1a. confusion matrix

> From the docs

> - sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)

Parameters:

> - labels : array-like of shape (n_classes,), default=None

> List of labels to index the matrix. This may be used to reorder or select a subset of labels. If None is given, those that appear at least once in y_true or y_pred are used in sorted order.

<br>
<br>
<br>
<br>


1b. classification report

> From the docs

> sklearn.metrics.classification_report(y_true, y_pred, *, labels=None, target_names=None, sample_weight=None, digits=2, output_dict=False, zero_division='warn')

Parameters:

> - y_true : 1d array-like, or label indicator array / sparse matrix
Ground truth (correct) target values.

> - y_pred : 1d array-like, or label indicator array / sparse matrix
Estimated targets as returned by a classifier.


<br>
<br>
<br>
<br>

2. F1 score

> From the docs

> - sklearn.metrics.f1_score(y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')

> Compute the F1 score, also known as balanced F-score or F-measure

> The F1 score can be interpreted as a harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score are equal. The formula for the F1 score is:

> $ F1 = 2 * (precision * recall) / (precision + recall) $

> In the multi-class and multi-label case, this is the average of the F1 score of each class with weighting depending on the average parameter.

Parameters:
> - average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’, ‘samples’, ‘weighted’]
This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

> - 'binary': Only report results for the class specified by pos_label. This is applicable only if targets (y_{true,pred}) are binary.

> - 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.

> - 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

> - 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

> - 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification where this differs from accuracy_score).



---
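To see how the confusion matrix, the per-class F1 formula, and the `'weighted'` average relate, the sketch below rebuilds label vectors from the confusion matrix reported in this notebook's output, `[[85, 5], [0, 47]]`. The vectors themselves are a reconstruction for illustration, not the real test rows.

```python
import numpy as np
from sklearn import metrics

# Rebuild label vectors that produce the reported matrix [[85, 5], [0, 47]]:
# rows are true classes (2, 4), columns are predicted classes (2, 4).
y_true = np.array([2] * 90 + [4] * 47)
y_pred = np.array([2] * 85 + [4] * 5 + [4] * 47)

cm = metrics.confusion_matrix(y_true, y_pred, labels=[2, 4])
print(cm)

# average=None returns one F1 score per class, in the order given by labels.
per_class = metrics.f1_score(y_true, y_pred, labels=[2, 4], average=None)

# 'weighted' averages the per-class scores by support (true instances per class).
support = cm.sum(axis=1)
weighted_by_hand = (per_class * support).sum() / support.sum()
weighted = metrics.f1_score(y_true, y_pred, average="weighted")

print(per_class.round(4), round(weighted, 5))
```

Worked by hand, class 2 has precision 85/85 and recall 85/90, and class 4 has precision 47/52 and recall 47/47; weighting the two F1 scores by supports of 90 and 47 gives approximately 0.96390, the value printed in the notebook's output.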

```python
import pandas as pd
from sklearn import model_selection, svm, metrics

# Cleaning: keep only rows whose 'BareNuc' value parses as a number
df = pd.read_csv('datasets/cell_samples.csv')
numeric_rows = pd.to_numeric(df['BareNuc'], errors='coerce').notnull()
df = df[numeric_rows].copy()  # copy so the assignment below does not hit a view
df['BareNuc'] = df['BareNuc'].astype('int')

output = 'Class'
features = ['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize',
            'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']

x = df[features].values
y = df[output].values  # 1-D label vector, as fit() expects

# Preparing: hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = model_selection.train_test_split(
    x, y, test_size=0.2, random_state=4)

# Modeling: fit a linear-kernel SVM, then predict the held-out rows
y_hat = svm.SVC(kernel='linear').fit(x_train, y_train).predict(x_test)

# Evaluation
confusion_matrix = metrics.confusion_matrix(y_test, y_hat, labels=[2, 4])
print('Confusion Matrix: \n', confusion_matrix, '\n')

classification_report = metrics.classification_report(y_test, y_hat)
print('Classification report: \n %s' % classification_report, '\n')

f1_score = metrics.f1_score(y_test, y_hat, average='weighted')
print('F1 Score of the algorithm is %.5f' % f1_score)
```

Output:

```
Confusion Matrix: 
 [[85  5]
 [ 0 47]] 

Classification report: 
               precision    recall  f1-score   support

           2       1.00      0.94      0.97        90
           4       0.90      1.00      0.95        47

   micro avg       0.96      0.96      0.96       137
   macro avg       0.95      0.97      0.96       137
weighted avg       0.97      0.96      0.96       137
 

F1 Score of the algorithm is 0.96390
```


<h2>Disclaimer</h2>

This script was originally from Coursera's [IBM AI Engineering course](https://www.coursera.org/professional-certificates/ai-engineer), authored by <a href="https://ca.linkedin.com/in/saeedaghabozorgi">Saeed Aghabozorgi</a>, and was modified to fit my needs. 