In [None]:
# install.packages('e1071', repos='http://cran.us.r-project.org')
# install.packages('ROCR', repos='http://cran.us.r-project.org')
# install.packages('ISLR', repos='http://cran.us.r-project.org')   # provides the Khan data used below; lowess() already ships with base R (stats)

**Support vector classifier**

Create a data set which is not linearly separable.

In [None]:
set.seed(1)
x=matrix(rnorm(20*2), ncol=2)    # rnorm draws normally distributed values (here: 40 values arranged in a 20 x 2 matrix)
y=c(rep(-1,10), rep(1,10))   # class labels: first 10 rows are class -1, last 10 rows are class 1
x[y==1,]=x[y==1,] + 1    # shift the points of class 1 by +1 in both coordinates
plot(x, col=(3-y))

In [None]:
dat=data.frame(x=x, y=as.factor(y))  # encode response as a factor variable to do classification (otherwise it does regression)
library(e1071)
svmfit=svm(y~., data=dat, kernel="linear", cost=10,scale=FALSE)
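
The comment about encoding the response as a factor can be checked directly. The following is a minimal sketch, assuming `e1071` is installed; the names `dat_factor`, `dat_numeric`, `fit_cls` and `fit_reg` are illustrative and not part of the notebook. It fits the same data twice and looks at what `predict` returns.

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1

# Same predictors, two encodings of the response.
dat_factor  <- data.frame(x = x, y = as.factor(y))  # factor response  -> classification
dat_numeric <- data.frame(x = x, y = y)             # numeric response -> regression

fit_cls <- svm(y ~ ., data = dat_factor,  kernel = "linear", cost = 10, scale = FALSE)
fit_reg <- svm(y ~ ., data = dat_numeric, kernel = "linear", cost = 10, scale = FALSE)

pred_cls <- predict(fit_cls, dat_factor)   # class labels
pred_reg <- predict(fit_reg, dat_numeric)  # real-valued fitted values

class(pred_cls)
class(pred_reg)
```

With the factor response the predictions are class labels; with the numeric response they are real numbers, which is not what we want for classification.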

Support vectors are plotted as crosses.

In [None]:
plot(svmfit, dat)

In [None]:
svmfit$index  # gives the indices of the support vectors

In [None]:
summary(svmfit)   # note: gamma is not used by the linear kernel (its default is 1 / number of predictors)

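Note that the `cost` argument differs from the $C$ used in the book. Here `cost` is the penalty for margin violations, so a smaller value means more room for error: a wider margin and more support vectors.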

In [None]:
svmfit=svm(y~., data=dat, kernel="linear", cost=0.1, scale=FALSE)
plot(svmfit, dat)
svmfit$index
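
The effect of the cost on the number of support vectors can be tabulated. This is a rough sketch on the same simulated data; the grid `costs` and the name `n_sv` are illustrative choices, not part of the notebook.

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1
dat <- data.frame(x = x, y = as.factor(y))

# Fit one linear SVM per cost value and count its support vectors.
costs <- c(0.1, 1, 10)
n_sv <- sapply(costs, function(cost) {
  fit <- svm(y ~ ., data = dat, kernel = "linear", cost = cost, scale = FALSE)
  length(fit$index)
})

data.frame(cost = costs, support_vectors = n_sv)
```

A lower cost tolerates more margin violations, so more observations end up on or inside the margin and become support vectors.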

Perform cross-validation for different $C$ values.

In [None]:
set.seed(1)
tune.out=tune(svm,y~.,data=dat,kernel="linear",ranges=list(cost=c(0.001, 0.01, 0.1, 1,5,10,100)))

In [None]:
summary(tune.out)

In [None]:
bestmod=tune.out$best.model
summary(bestmod)
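
Besides `best.model`, the object returned by `tune` stores the cross-validation results themselves. The sketch below, which regenerates the training data so that it runs on its own, reads them out; the name `tune_out` is illustrative.

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1
dat <- data.frame(x = x, y = as.factor(y))

cost_grid <- c(0.001, 0.01, 0.1, 1, 5, 10, 100)
tune_out <- tune(svm, y ~ ., data = dat, kernel = "linear",
                 ranges = list(cost = cost_grid))

tune_out$best.parameters    # cost value with the lowest cross-validation error
tune_out$best.performance   # that lowest cross-validation error
tune_out$performances       # error and dispersion for every cost in the grid
```

By default `tune` uses 10-fold cross-validation, so the errors depend on the random fold assignment and therefore on the seed.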

Create a test data set with the same properties as the training data set.

In [None]:
xtest=matrix(rnorm(20*2), ncol=2)
ytest=sample(c(-1,1), 20, replace=TRUE)
xtest[ytest==1,]=xtest[ytest==1,] + 1
testdat=data.frame(x=xtest, y=as.factor(ytest))

In [None]:
ypred=predict(bestmod,testdat)
table(predict=ypred, truth=testdat$y)

In [None]:
svmfit=svm(y~., data=dat, kernel="linear", cost=.01,scale=FALSE)
ypred=predict(svmfit,testdat)
table(predict=ypred, truth=testdat$y)

Using a suboptimal cost parameter leads to one more incorrect classification.
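
Instead of counting misclassifications in the confusion tables by eye, the test error rate can be computed directly. This is a minimal sketch; `misclass_rate` is a hypothetical helper and the cost grid is an illustrative choice.

```r
library(e1071)

set.seed(1)
# Training data, generated as in the cells above.
x <- matrix(rnorm(20 * 2), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1
dat <- data.frame(x = x, y = as.factor(y))

# Test data with the same structure.
xtest <- matrix(rnorm(20 * 2), ncol = 2)
ytest <- sample(c(-1, 1), 20, replace = TRUE)
xtest[ytest == 1, ] <- xtest[ytest == 1, ] + 1
testdat <- data.frame(x = xtest, y = as.factor(ytest))

# Hypothetical helper: fraction of misclassified observations.
misclass_rate <- function(pred, truth) {
  mean(as.character(pred) != as.character(truth))
}

costs <- c(0.01, 0.1, 1)
test_err <- sapply(costs, function(cost) {
  fit <- svm(y ~ ., data = dat, kernel = "linear", cost = cost, scale = FALSE)
  misclass_rate(predict(fit, testdat), testdat$y)
})

data.frame(cost = costs, test_error = test_err)
```

With only 20 test observations each misclassification changes the error rate by 0.05, so small differences between cost values should not be over-interpreted.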

Now create a data set where the classes are linearly separable.

In [None]:
x[y==1,]=x[y==1,]+0.5
plot(x, col=(y+5)/2, pch=19)

In [None]:
dat=data.frame(x=x,y=as.factor(y))
svmfit=svm(y~., data=dat, kernel="linear", cost=1e5)
summary(svmfit)

In [None]:
plot(svmfit, dat)

No training errors were made. Nevertheless, we can choose a smaller cost to get a wider margin and reduce the risk of overfitting.

In [None]:
svmfit=svm(y~., data=dat, kernel="linear", cost=1)
summary(svmfit)
plot(svmfit,dat)
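
For a linear kernel the margin can be quantified. The fitted object stores the support vectors (`SV`), their signed dual coefficients (`coefs`) and the offset (`rho`), from which the weight vector $w=\sum_i \text{coefs}_i\,\text{SV}_i$ and the margin width $2/\lVert w\rVert$ follow. The sketch below assumes `scale=FALSE`, unlike the cells above, so that $w$ is expressed in the original units; `margin_width` is a hypothetical helper.

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(20 * 2), ncol = 2)
y <- c(rep(-1, 10), rep(1, 10))
x[y == 1, ] <- x[y == 1, ] + 1
x[y == 1, ] <- x[y == 1, ] + 0.5   # second shift, as in the separable example
dat <- data.frame(x = x, y = as.factor(y))

# Hypothetical helper: margin width 2 / ||w|| of a linear-kernel fit.
margin_width <- function(fit) {
  w <- t(fit$coefs) %*% fit$SV
  2 / sqrt(sum(w^2))
}

fit_hard <- svm(y ~ ., data = dat, kernel = "linear", cost = 1e5, scale = FALSE)
fit_soft <- svm(y ~ ., data = dat, kernel = "linear", cost = 1,   scale = FALSE)

c(cost_1e5 = margin_width(fit_hard), cost_1 = margin_width(fit_soft))

# Sanity check: x %*% w - rho should reproduce the decision values from predict().
w <- t(fit_soft$coefs) %*% fit_soft$SV
manual <- x %*% t(w) - fit_soft$rho
dv <- attr(predict(fit_soft, dat, decision.values = TRUE), "decision.values")
max_diff <- max(abs(manual - dv))
max_diff
```

The fit with the smaller cost should report the wider margin, which is the trade-off described above.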

**Support vector machine**

Generate data with non-linear decision boundary, by combining three Gaussian distributions.

In [None]:
set.seed(1)
x=matrix(rnorm(200*2), ncol=2)
x[1:100,]=x[1:100,]+2
x[101:150,]=x[101:150,]-2
y=c(rep(1,150),rep(2,50))
dat=data.frame(x=x,y=as.factor(y))

In [None]:
plot(x, col=y)

In [None]:
train=sample(200,100)
svmfit=svm(y~., data=dat[train,], kernel="radial",  gamma=1, cost=1)
plot(svmfit, dat[train,])

In [None]:
summary(svmfit)

Increasing the cost decreases the training error. However, the resulting decision boundary is more irregular and will probably overfit the data.

In [None]:
svmfit=svm(y~., data=dat[train,], kernel="radial",gamma=1,cost=1e5)
plot(svmfit,dat[train,])
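
To see whether a higher cost really overfits, training and test error can be put side by side. This is a sketch under the same data-generating setup; `err_rate` is a hypothetical helper and the cost grid is an illustrative choice. The split depends on the seed and the R version, so the numbers need not match any output shown elsewhere.

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)
x[1:100, ] <- x[1:100, ] + 2
x[101:150, ] <- x[101:150, ] - 2
y <- c(rep(1, 150), rep(2, 50))
dat <- data.frame(x = x, y = as.factor(y))
train <- sample(200, 100)

# Hypothetical helper: misclassification rate of a fit on a data frame.
err_rate <- function(fit, data) {
  mean(as.character(predict(fit, data)) != as.character(data$y))
}

costs <- c(1, 100, 1e5)
errs <- t(sapply(costs, function(cost) {
  fit <- svm(y ~ ., data = dat[train, ], kernel = "radial", gamma = 1, cost = cost)
  c(cost = cost,
    train = err_rate(fit, dat[train, ]),
    test  = err_rate(fit, dat[-train, ]))
}))

errs
```

A training error that keeps falling while the test error stalls or rises is the usual sign of overfitting.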

In [None]:
set.seed(1)  # cross-validation to find optimal cost and gamma
tune.out=tune(svm, y~., data=dat[train,], kernel="radial", ranges=list(cost=c(0.1,1,10,100,1000),gamma=c(0.5,1,2,3,4)))
summary(tune.out)

In [None]:
table(true=dat[-train,"y"], pred=predict(tune.out$best.model,newdata=dat[-train,]))

**ROC curves**

Write a short function to plot ROC curves.

In [None]:
library(ROCR)
rocplot=function(pred, truth, ...){
   predob = prediction(pred, truth)
   perf = performance(predob, "tpr", "fpr")
   plot(perf,...)}
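
To make clear what the ROC plot contains, the points of the curve can also be computed by hand in base R. This is a rough sketch on a tiny made-up example, not the `ROCR` implementation; `roc_points` is a hypothetical helper. Every distinct score is used as a threshold and the resulting true and false positive rates are recorded.

```r
# Hypothetical helper: true/false positive rate at every distinct score threshold.
roc_points <- function(score, truth, positive) {
  is_pos <- truth == positive
  thresholds <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) sum(score >= t & is_pos) / sum(is_pos))
  fpr <- sapply(thresholds, function(t) sum(score >= t & !is_pos) / sum(!is_pos))
  data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
}

# Toy scores: higher means "more likely positive".
score <- c(0.9, 0.8, 0.4, 0.3)
truth <- c(1, 1, 0, 1)

roc <- roc_points(score, truth, positive = 1)
roc

# Step plot of the curve, starting from the origin.
plot(c(0, roc$fpr), c(0, roc$tpr), type = "s",
     xlab = "False positive rate", ylab = "True positive rate")
```

Lowering the threshold can only add predicted positives, so both rates are non-decreasing along the curve.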

In [None]:
svmfit.opt=svm(y~., data=dat[train,], kernel="radial",gamma=2, cost=1,decision.values=TRUE)
fitted=attributes(predict(svmfit.opt,dat[train,],decision.values=TRUE))$decision.values

In [None]:
par(mfrow=c(1,2))
rocplot(fitted,dat[train,"y"],main="Training Data")

Increasing $\gamma$ gives a more flexible fit.

In [None]:
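svmfit.flex=svm(y~., data=dat[train,], kernel="radial", gamma=50, cost=1, decision.values=TRUE)
fitted.opt=attributes(predict(svmfit.opt,dat[train,],decision.values=TRUE))$decision.values
fitted.flex=attributes(predict(svmfit.flex,dat[train,],decision.values=TRUE))$decision.values
rocplot(fitted.opt,dat[train,"y"],main="Training Data")    # gamma=2 fit in black
rocplot(fitted.flex,dat[train,"y"],add=TRUE,col="red")     # gamma=50 fit in red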

However, the test data show that the more flexible model has overfit the training data.

In [None]:
fitted=attributes(predict(svmfit.opt,dat[-train,],decision.values=T))$decision.values
rocplot(fitted,dat[-train,"y"],main="Test Data")
fitted=attributes(predict(svmfit.flex,dat[-train,],decision.values=T))$decision.values
rocplot(fitted,dat[-train,"y"],add=T,col="red")

**SVM with multiple classes**

In [None]:
set.seed(1)
x=rbind(x, matrix(rnorm(50*2), ncol=2))
y=c(y, rep(0,50))
x[y==0,2]=x[y==0,2]+2
dat=data.frame(x=x, y=as.factor(y))
par(mfrow=c(1,1))
plot(x,col=(y+1))

In [None]:
svmfit=svm(y~., data=dat, kernel="radial", cost=10, gamma=1)
plot(svmfit, dat)
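
With more than two classes, `svm` uses the one-versus-one approach: one binary classifier is trained for every pair of classes. The sketch below regenerates three-class data in the same manner (the exact draws differ from the cell above) and inspects the decision values; with $K=3$ classes there are $K(K-1)/2=3$ pairwise classifiers.

```r
library(e1071)

set.seed(1)
x <- matrix(rnorm(250 * 2), ncol = 2)
x[1:100, ] <- x[1:100, ] + 2
x[101:150, ] <- x[101:150, ] - 2
y <- c(rep(1, 150), rep(2, 50), rep(0, 50))
x[y == 0, 2] <- x[y == 0, 2] + 2
dat <- data.frame(x = x, y = as.factor(y))

fit <- svm(y ~ ., data = dat, kernel = "radial", cost = 10, gamma = 1)

pred <- predict(fit, dat, decision.values = TRUE)
dv <- attr(pred, "decision.values")

dim(dv)        # rows: observations, columns: pairs of classes
colnames(dv)   # which two classes each column compares
table(predicted = pred, truth = dat$y)
```

Each observation is assigned to the class that wins the most pairwise comparisons.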

**Application to Gene Expression Data**

In [None]:
library(ISLR)
names(Khan)
dim(Khan$xtrain)
dim(Khan$xtest)
length(Khan$ytrain)
length(Khan$ytest)

In [None]:
table(Khan$ytrain)
table(Khan$ytest)

In [None]:
dat=data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain))
out=svm(y~., data=dat, kernel="linear",cost=10)
summary(out)
table(out$fitted, dat$y)

In [None]:
dat.te=data.frame(x=Khan$xtest, y=as.factor(Khan$ytest))
pred.te=predict(out, newdata=dat.te)
table(pred.te, dat.te$y)