# Support Vector Machine practice notebook with breast cancer data set

In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a **non-probabilistic binary linear classifier** (although methods such as Platt scaling exist to use an SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped ***so that the examples of the separate categories are divided by a clear gap that is as wide as possible***. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on. This gap is called the margin, and because the SVM makes it as wide as possible, the classifier is also called a ***maximum margin classifier***.

In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
![SVM-1](./Images/SVM-1.png)
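
As a rough illustration of the kernel trick, the sketch below compares a linear kernel with an RBF kernel on a toy data set of two concentric rings, which no straight line can separate. The data set (`make_circles`), its settings, and the variable names are assumptions made for this example only; they are not part of the breast cancer analysis that follows.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data (assumption for illustration): an inner ring and an outer ring
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Same classifier, two kernels: a straight-line boundary vs. an implicit
# mapping into a higher-dimensional feature space
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print(f"linear kernel test accuracy: {linear_acc:.2f}")
print(f"RBF kernel test accuracy:    {rbf_acc:.2f}")
```

On data like this the RBF kernel should score clearly higher, since the linear kernel can only draw a straight boundary through the rings.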

## Import libraries and load data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Get the Data

We'll use the built-in breast cancer dataset from scikit-learn. Note the load function:

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
cancer = load_breast_cancer()

**The data set is returned as a dictionary-like `Bunch` object**

In [None]:
cancer.keys()

**We can grab information and arrays out of this dictionary to create a DataFrame and understand the features**

**The description of the features is as follows**

In [None]:
print(cancer['DESCR'])

**Show the feature names**

In [None]:
cancer['feature_names']

## Set up the DataFrame

In [None]:
df = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
df.info()

In [None]:
df.describe()

**Is there any missing data?**

In [None]:
df.isnull().sum().sum() # Total count of null values across all columns of the data frame

**What are the 'target' data in the data set?**

In [None]:
cancer['target']
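
The integer labels are easier to read next to their class names. The short sketch below pairs each label with the corresponding entry of `target_names` and counts the samples per class; the variable name `counts` is introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names[i] is the class name for integer label i
counts = np.bincount(cancer["target"])
for label, (name, count) in enumerate(zip(cancer["target_names"], counts)):
    print(f"{label} -> {name}: {count} samples")
```

In scikit-learn's encoding of this data set, label 0 is malignant and label 1 is benign, which matters when reading the plots and reports later on.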

**Add the target data to the DataFrame**

In [None]:
df['Cancer'] = cancer['target'] # Assign the label array directly as a new column
df.head()

## Exploratory Data Analysis


### Check the relative counts of malignant (0) vs benign (1) cases

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Cancer',data=df,palette='RdBu_r')

### Run a 'for' loop to draw boxplots of all the mean features (first 10 columns) for the '0' and '1' outcomes

In [None]:
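mean_features = list(df.columns[0:10]) # The ten "mean ..." feature columns
for feature in mean_features: # Loop over all ten features, not just the first nine
    plt.figure() # Start a new figure before each plot so no empty figure is left at the end
    sns.boxplot(x='Cancer', y=feature, data=df, palette='winter')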

### Not all the features separate the two classes equally clearly
**For example, the following two plots show that a smaller mean area is generally associated with label 1 (benign in scikit-learn's encoding) and a larger mean area with label 0 (malignant), while mean smoothness on its own does not separate the two classes clearly**

In [None]:
f,(ax1, ax2) = plt.subplots(1, 2, sharey=True,figsize=(12,6))
ax1.scatter(df['mean area'],df['Cancer'])
ax1.set_title("Cancer cases as a function of mean area", fontsize=15)
ax2.scatter(df['mean smoothness'],df['Cancer'])
ax2.set_title("Cancer cases as a function of mean smoothness", fontsize=15)
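
One way to put numbers on the visual impression from these plots is to compare per-class medians of the two features. The sketch below rebuilds the data frame so that it runs on its own; the name `medians` is introduced here for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
df = pd.DataFrame(cancer["data"], columns=cancer["feature_names"])
df["Cancer"] = cancer["target"]  # 0 = malignant, 1 = benign

# Median of each feature within each class; a large gap between the two
# rows suggests the feature separates the classes well
medians = df.groupby("Cancer")[["mean area", "mean smoothness"]].median()
print(medians)
```

The two class medians for mean area should differ far more, relative to their size, than those for mean smoothness.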

## Training and prediction

### Train Test Split

In [None]:
df_feat = df.drop('Cancer',axis=1) # Define a dataframe with only features
df_feat.head()

In [None]:
df_target = df['Cancer'] # Define a Series with only the target labels
df_target.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_feat, df_target, test_size=0.30, random_state=101)

In [None]:
y_train.head()

### Train the Support Vector Classifier

In [None]:
from sklearn.svm import SVC

In [None]:
model = SVC()

In [None]:
model.fit(X_train,y_train)

### Predictions and Evaluations

In [None]:
predictions = model.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

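**With older scikit-learn versions (before 0.22, where `gamma` defaulted to `'auto'`), the default `SVC` on this unscaled data classifies everything into a single class. Even where the newer `gamma='scale'` default avoids that, the model can benefit from having its parameters adjusted, and it may also help to normalize the data**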

In [None]:
print(confusion_matrix(y_test,predictions))

**If every sample is assigned to a single class, the classification report is poor as well**

In [None]:
print(classification_report(y_test,predictions))
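
The remark about normalizing the data can be tried with a small pipeline. This is a minimal sketch, assuming `StandardScaler` as the normalization step (the notebook does not say which scaler to use); the names `scaled_model` and `scaled_predictions` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

# The scaler is fitted on the training data only and then applied to the
# test data, so no information from the test set leaks into training
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(X_train, y_train)

scaled_predictions = scaled_model.predict(X_test)
print(classification_report(y_test, scaled_predictions))
```

Putting the scaler inside the pipeline also means it is refitted on each training fold when the pipeline is used with cross-validation.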

## Gridsearch

Finding the right parameters (such as which C or gamma values to use) is tricky. Luckily, scikit-learn can try a set of parameter combinations and report which one works best, using the built-in GridSearchCV. The CV stands for cross-validation.

**GridSearchCV takes a dictionary that describes the parameters that should be tried and a model to train. The grid of parameters is defined as a dictionary, where the keys are the parameters and the values are the settings to be tested.** 
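
A parameter grid of this kind might look as follows. The candidate values for `C` and `gamma` and the choice of 5-fold cross-validation are assumptions picked for illustration, not settings prescribed by this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

# Keys are SVC parameter names; values are the settings to try
# (illustrative values, not tuned recommendations)
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-2, 1e-3, 1e-4, 1e-5],
}

# Every C/gamma combination is scored with 5-fold cross-validation on the
# training data; the best one is then refitted on the full training set
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X_train, y_train)

print("best parameters:", grid.best_params_)
print(f"best mean cross-validated accuracy: {grid.best_score_:.3f}")
```

After fitting, `grid.best_estimator_` holds the refitted model, and `grid.predict` delegates to it.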


**Compare the optimized model performance vs. the default SVC defined above**
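
One possible way to run this comparison is sketched below: both models are trained on the same split and evaluated on the same held-out test set. The grid values and the names `default_model`, `tuned_model`, and `results` are assumptions for illustration; which model scores higher depends on the scikit-learn version and the grid chosen.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

# Baseline: SVC with default parameters
default_model = SVC().fit(X_train, y_train)

# Tuned: best C/gamma combination found by cross-validated grid search
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-2, 1e-3, 1e-4, 1e-5]}
tuned_model = GridSearchCV(SVC(), param_grid, cv=5).fit(X_train, y_train)

results = {}
for name, estimator in [("default SVC", default_model),
                        ("grid-searched SVC", tuned_model)]:
    y_pred = estimator.predict(X_test)
    results[name] = accuracy_score(y_test, y_pred)
    print(f"=== {name}: test accuracy {results[name]:.3f} ===")
    # zero_division=0 avoids warnings if a class is never predicted
    print(classification_report(y_test, y_pred, zero_division=0))
```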