## Support Vector Machines

In this practice, we will apply support vector machine classifiers to the same data sets we used in the [Linear Discriminant Analysis practice notebook](Linear_Discriminant_Analysis.ipynb). Take a look at that practice first if you haven't done so yet. 

We will start with the first data set that has two linearly separable classes. 

In [None]:
data1 <- read.csv("../../../datasets/toydata/data1.csv",header=TRUE)

# For SVM, we need to make sure class is a factor.
data1$class <- factor(data1$class)
str(data1)
# Visualize the data
library(ggplot2)
pl1 <- ggplot(data1, aes(X, Y)) + geom_point(aes(colour=factor(class),shape=factor(class))) #+ theme(legend.position="none")
pl1

The classes labeled as "0" and "1" are *linearly separable*; we can draw a linear decision boundary to separate them. Let's apply a support vector machine (SVM) to this data set. We will use the "e1071" library for SVMs, and the "caret" library (classification and regression training), which has handy functions for several aspects of the classification process.

In [None]:
library(e1071)
library(caret)
svm_model1 = svm(class ~ ., data=data1, kernel="linear", cost=10, scale=FALSE)
summary(svm_model1)
plot(svm_model1,data1)

We trained an SVM with a *linear* kernel, and it learned a linear decision boundary. 
The data points marked "X" in the above plot are the ones that serve as support vectors; 
there are three of them. Let's compute the confusion matrix; as expected, we'll get perfect accuracy.  

In [None]:
pred=predict(svm_model1, data1[,-3])
conftable1=table(predict=pred, class=data1$class)
conftable1

# caret has a function to compute confusion matrix and 
# other things such as accuracy, sensitivity, specificity, etc.
confusionMatrix(data=pred, data1$class)
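
The accuracy, sensitivity, and specificity that `confusionMatrix` reports can also be derived by hand from the plain `table` output. Here is a minimal sketch in base R; the label vectors `truth` and `pred` are made up for illustration (they are not taken from data1), and the first factor level ("0") is treated as the positive class, which is caret's default for two-class problems.

```r
# Hypothetical labels for illustration only (not taken from data1)
truth <- factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0))
pred  <- factor(c(0, 0, 1, 0, 1, 1, 0, 1, 1, 0))

# Rows are predictions, columns are true classes
conf <- table(predict = pred, class = truth)
print(conf)

# Accuracy: share of samples on the diagonal of the table
accuracy <- sum(diag(conf)) / sum(conf)

# Treat the first level ("0") as the positive class
sensitivity <- conf["0", "0"] / sum(conf[, "0"])  # true positives / all actual positives
specificity <- conf["1", "1"] / sum(conf[, "1"])  # true negatives / all actual negatives

cat("accuracy:", accuracy, "sensitivity:", sensitivity, "specificity:", specificity, "\n")
```

The off-diagonal cells of the table are the misclassified samples, so a perfect classifier has zeros everywhere except the diagonal.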

Let's apply the same to the second data set; **it's your turn.**

In [None]:
data2 <- read.csv("../../../datasets/toydata/data2.csv",header=TRUE)

data2$class <- factor(data2$class)
# Visualize the data
pl2 <- ggplot(data2, aes(X, Y)) + geom_point(aes(colour=factor(class),shape=factor(class))) + theme(legend.position="none")
pl2

In the above plot, you can see that there is an overlap between classes. This means that some of the samples of a class will be misclassified as the other class; these samples will be on the wrong side of the decision boundary. Let's see that. 

In [None]:
svm_model2 = svm(<what goes in here>)
summary(svm_model2)
plot(<what goes in here>)

Let's compute the confusion table. **Again, it's your turn.**

In [None]:
pred=predict(svm_model2, data2[,-3])
conftable2=table(<what goes in here>)
conftable2
# or do this
confusionMatrix(<what goes in here>)

Now, we will apply the same to the third data set where classes are not linearly separable. 
**It's your turn:**

In [None]:
data3 <- read.csv("../../../datasets/toydata/data3.csv",header=TRUE)

data3$class <- <what goes in here>
# Visualize the data
pl3 <- ggplot(data3, aes(X, Y)) + geom_point(aes(colour=factor(class),shape=factor(class))) + theme(legend.position="none")
pl3

In [None]:
svm_model3 = svm(<what goes in here>)
summary(svm_model3)
plot(<what goes in here>)
pred=predict(<what goes in here>)
conftable3=table(<what goes in here>)
conftable3
# or do this
confusionMatrix(<what goes in here>)

This is pretty bad; an SVM with a linear kernel can't classify this data set. Luckily, there are nonlinear kernels that we can use with SVM. Let's try a **radial basis function (RBF)** kernel, which is one of the most widely used kernels.

In [None]:
svm_model_rbf = svm(class ~ ., data=data3, kernel="radial", cost=10, scale=FALSE)
summary(svm_model_rbf)
plot(svm_model_rbf,data3)
pred=predict(svm_model_rbf, data3[,-3])
conftable_rbf=table(predict=pred, class=data3$class)
conftable_rbf

confusionMatrix(data=pred, data3$class)

As you can see, it does a pretty good job in classifying data; the decision boundary does not have to be linear any more, so this SVM model learns a boundary from the data that can be represented by radial basis functions. 
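
To get a feeling for what a radial basis function is, here is a small sketch in base R. The e1071 documentation gives the radial kernel as exp(-gamma*|u-v|^2); the helper `rbf_kernel` and the points `a`, `b`, and `far` are hypothetical names used only for this illustration.

```r
# RBF kernel between two points: exp(-gamma * squared Euclidean distance)
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

a   <- c(0, 0)
b   <- c(1, 1)   # squared distance from a is 2
far <- c(5, 5)   # squared distance from a is 50

rbf_kernel(a, a,   gamma = 0.1)   # identical points give 1
rbf_kernel(a, b,   gamma = 0.1)   # exp(-0.2)
rbf_kernel(a, far, gamma = 0.1)   # exp(-5), close to 0
```

The kernel acts as a similarity measure that decays with distance. A larger gamma makes it decay faster, so each support vector influences a smaller neighborhood and the boundary can become more wiggly.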


But there is a potential problem here. We trained and tested our model with the exact same data set. This can cause *memorization*: instead of learning a general decision boundary, the model memorizes a boundary for this particular data set. We don't know how it will perform on *new, unseen* observations. One of the most important aspects of learning algorithms is their ability to *generalize*; that is, to learn decision boundaries that are general enough to do well on unseen data. 

So, we need to separate our data set into a training subset and a testing subset, train the model with the training set, and test it (predict and compute the accuracy) with the testing set. There are different ways of doing this; we'll look at two methods: 1) split test, 2) cross-validation. 

### Split test 
A split test simply splits the data into training and testing sets; we can use 65% of the data for training and the rest for testing (the training set is usually larger than the testing set). 

Here's how we do it with caret library's functions: 

In [None]:
# define a 65%/35% train/test split of the dataset
split=0.65
# create indices that belong to the training set
trainIndex <- createDataPartition(data3$class, p=split, list=FALSE)
# pick the samples with those indices, they will be training set
train_set <- data3[trainIndex,]
# pick the rest of the samples, they will be testing set
test_set  <- data3[-trainIndex,]
# train a svm model with training set only
svm_model_rbf2 = svm(class ~ ., data=train_set, kernel="radial", cost=10, scale=FALSE)
summary(svm_model_rbf2)
plot(svm_model_rbf2,train_set)

In [None]:
# Now predict both training set and testing set outcomes of the model and compare.
predtr=predict(svm_model_rbf2, train_set[,-3])
predts=predict(svm_model_rbf2, test_set[,-3])

confusionMatrix(data=predtr, train_set$class)
confusionMatrix(data=predts, test_set$class)

Usually, the model is expected to do a better job on predictions for the training set; because that's what it has learned. If it has generalized well enough, it should also produce good performance for the testing set. The problem still continues though; how do we know that this particular training set represents the class distribution well enough? We need to repeat this split test a number of times and compute the mean and standard deviation of the accuracy. We can do this with more random splittings, or we can use the cross-validation approach.
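
Repeating the split test with more random splits can be sketched as follows. This is only an illustration under some assumptions: `toy` is a synthetic stand-in for data3 (an inner blob surrounded by a ring), `split_accuracy` is a hypothetical helper, and the split uses base R's `sample` instead of `createDataPartition`, so the splits are not stratified by class.

```r
library(e1071)

set.seed(1)
# Synthetic stand-in for data3 (an assumption): an inner blob surrounded by a ring
n <- 200
r <- c(runif(n / 2, 0, 1), runif(n / 2, 2, 3))
theta <- runif(n, 0, 2 * pi)
toy <- data.frame(X = r * cos(theta), Y = r * sin(theta),
                  class = factor(rep(c(0, 1), each = n / 2)))

# One random 65%/35% split: returns the accuracy on the held-out part
split_accuracy <- function(data, p = 0.65) {
  idx <- sample(nrow(data), size = floor(p * nrow(data)))
  model <- svm(class ~ ., data = data[idx, ], kernel = "radial", cost = 10, scale = FALSE)
  pred <- predict(model, data[-idx, c("X", "Y")])
  mean(pred == data$class[-idx])
}

# Repeat the split 20 times and summarize the testing accuracies
accs <- replicate(20, split_accuracy(toy))
cat("mean accuracy:", mean(accs), " sd:", sd(accs), "\n")
```

A small standard deviation suggests that the accuracy estimate does not depend much on which samples happened to land in the training set.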

### k-fold cross-validation
k-fold cross-validation splits the data set into *k* subsets (folds) and then picks one subset for testing and trains the model with the remaining *k-1* subsets. It does this for each fold; so for k=10, it'll end up doing 10 training and testing sessions. 
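
The fold bookkeeping itself can be sketched in a few lines of base R. The sample count `n` and the variable names below are hypothetical; the point is only to show that every sample is held out for testing exactly once.

```r
set.seed(42)
n <- 25   # hypothetical number of samples
k <- 5    # number of folds

# Shuffle the fold labels 1..k so that the folds have (nearly) equal size
fold_id <- sample(rep(1:k, length.out = n))

for (fold in 1:k) {
  test_idx  <- which(fold_id == fold)   # this fold is held out for testing
  train_idx <- which(fold_id != fold)   # the remaining k-1 folds are used for training
  cat("fold", fold, ": train on", length(train_idx),
      "samples, test on", length(test_idx), "\n")
}
```

In a real run, a model would be trained on `train_idx` and evaluated on `test_idx` inside the loop, and the k accuracies would then be averaged.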

We can use cross validation to **tune the parameters** of the SVM model to get a better accuracy without the danger of memorization. We will use svm tuning functions for that. Let's see how. 

In [None]:
# start the random number generator with some arbitrary seed
set.seed(42)
# Setup for cross validation:
# sampling="cross" for cross-validation, 
# cross=10 for 10-fold,
# retain the best model and save the performance measures
tctrl <- tune.control(sampling="cross", cross=10, best.model=TRUE, performances=TRUE)                     
 
# now run the tune function to tune the parameters of the model 
# tune function will try to find the best parameters (gamma and cost), that means the parameters with the smallest error.
# it will try different gamma and cost values given as arguments (e.g. cost=1, cost=10, cost=100, etc.)
tuned_params_cv <- tune(svm, class ~ ., data=data3, kernel="radial", ranges=list(gamma=10^(-6:-1), cost=10^(0:2)), tunecontrol=tctrl)
summary (tuned_params_cv)
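
Besides reading the best parameters off the summary, they can be pulled out of the object that `tune` returns. The sketch below uses a synthetic stand-in data set (`toy`, an assumption, not data3) and a smaller grid with 5 folds to keep it quick; `best.parameters`, `best.performance`, and `best.model` are components of the e1071 tune object.

```r
library(e1071)

set.seed(42)
# Synthetic stand-in data (an assumption, not data3)
n <- 100
r <- c(runif(n / 2, 0, 1), runif(n / 2, 2, 3))
theta <- runif(n, 0, 2 * pi)
toy <- data.frame(X = r * cos(theta), Y = r * sin(theta),
                  class = factor(rep(c(0, 1), each = n / 2)))

tuned <- tune(svm, class ~ ., data = toy, kernel = "radial",
              ranges = list(gamma = 10^(-3:-1), cost = 10^(0:2)),
              tunecontrol = tune.control(sampling = "cross", cross = 5))

tuned$best.parameters            # data frame with the selected gamma and cost
tuned$best.performance           # cross-validated error of that combination
best_model <- tuned$best.model   # svm refit on all of `toy` with those parameters
```

Using `tuned$best.parameters$gamma` and `tuned$best.parameters$cost` in later calls avoids copying the values by hand, which matters because the selected values can change with the seed and the data.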

In [None]:
# gamma=0.1 and cost=100 are the best parameters
# now train a model with the tuned parameters.
svm_model_rbf_cv = svm(class ~ ., data=data3, kernel="radial", cost=100, gamma=0.1, scale=FALSE)
summary(svm_model_rbf_cv)
plot(svm_model_rbf_cv,data3)
# find predictions 
pred=predict(svm_model_rbf_cv, data3[,-3])
confusionMatrix(data=pred, data3$class)

In [None]:
# We should really use the training and testing sets here. Even though we did cross validation, it was for 
# parameter tuning, let's train a model with the training set. 
svm_model_rbf_cv2 = svm(class ~ ., data=train_set, kernel="radial", cost=100, gamma=0.1, scale=FALSE)
summary(svm_model_rbf_cv2)
plot(svm_model_rbf_cv2,train_set)

In [None]:
# Let's compare training and testing set accuracies. 
predtr=predict(svm_model_rbf_cv2, train_set[,-3])
predts=predict(svm_model_rbf_cv2, test_set[,-3])
confusionMatrix(data=predtr, train_set$class)
confusionMatrix(data=predts, test_set$class)


We find the parameters with cross-validation and use separate sets for training and testing; this way, we get a more accurate picture of the model's classification performance. 

Now, apply the same ideas to the "XOR pattern" data set, where the two classes are not linearly separable even though their samples seem to be nicely separated in the plot. 

**Again, it's your turn.** First do parameter tuning with 10-fold cross-validation, then train a model and test it just like above. 

In [None]:
data4 <- read.csv("../../../datasets/toydata/data4.csv",header=TRUE)

data4$class <- <what goes in here>
# Visualize the data
pl4 <- ggplot(data4, aes(X, Y)) + geom_point(aes(colour=factor(class),shape=factor(class))) + theme(legend.position="none")
pl4

In [None]:
tctrl <- tune.control(<what goes in here>)                     
tuned_params_cv <- tune(svm, class ~ ., data=data4, kernel="radial", ranges=list(gamma=10^(-6:-1), cost=10^(0:2)), tunecontrol=tctrl)
summary (tuned_params_cv)

In [None]:
# create training and testing sets
split=0.65
trainIndex <- createDataPartition(<what goes in here>)
train_set <- data4[<what goes in here>]
test_set  <- <what goes in here>

# train a svm model with training set only
svm_model_4 = svm(<what goes in here>)
summary(svm_model_4)
plot(svm_model_4,train_set)

In [None]:
predtr=predict(<what goes in here>)
predts=<what goes in here>
confusionMatrix(<what goes in here>)
confusionMatrix(data=predts, <what goes in here>)

Keep in mind that you should convert categorical variables to factors (just like we did in the above examples with the class variable) when using SVM. 

Here are some links to dig deeper: 

[A tour of machine learning algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/)

[Comparing machine learning classifiers](http://tjo-en.hatenablog.com/entry/2014/01/06/234155)

[Training and testing concepts 1](http://machinelearningmastery.com/how-to-choose-the-right-test-options-when-evaluating-machine-learning-algorithms/)

[Training and testing concepts 2](http://machinelearningmastery.com/how-to-estimate-model-accuracy-in-r-using-the-caret-package/)