<a href="https://colab.research.google.com/github/ehsung/PUBH6886/blob/main/PUBH6886_RLabs12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PUBH 6886: R Lab 12

## Edward Sung

## 12/10/24

In [None]:
# Install Libraries
install.packages("kernlab")
install.packages("caret")
install.packages("ISLR2")

In [44]:
# Libraries
library(readr)
library(dplyr)
library(kernlab)
library(caret)

# Lecture 12 Code

>SVM Binary Classification Example

>SVM Multiple Class Example



# SVM Binary Classification Example

In [None]:
heart_data <- read_csv("/content/Heart.csv")

# convert ChestPain, Thal, and the outcome AHD to factor variables
heart_data$ChestPain <- factor(heart_data$ChestPain)
heart_data$Thal <- factor(heart_data$Thal)
heart_data$AHD <- factor(heart_data$AHD)

# split into training and test sets
set.seed(1234)
heart_train_row <- sort(sample(1:303, size = 202))
heart_test_row <- setdiff(1:303, heart_train_row)
heart_data_train <- heart_data[heart_train_row,]
heart_data_test <- heart_data[heart_test_row,]

# remove subjects with missing data
heart_data_train <- heart_data_train[complete.cases(heart_data_train),]
heart_data_test <- heart_data_test[complete.cases(heart_data_test),]
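
The split above hard-codes the row count (303) and the training size (202, about two thirds). The block below is a minimal sketch of the same pattern with both numbers derived from the data instead. The `toy` data frame and its values are hypothetical stand-ins, not the heart data.

```r
# Toy stand-in for the heart data (hypothetical values, for illustration only)
toy <- data.frame(id = 1:30, x = c(NA, seq(0.5, 14.5, by = 0.5)))

set.seed(1234)
n <- nrow(toy)                      # row count taken from the data
n_train <- round(2 * n / 3)         # roughly two thirds for training
train_row <- sort(sample(seq_len(n), size = n_train))
test_row <- setdiff(seq_len(n), train_row)

toy_train <- toy[train_row, ]
toy_test <- toy[test_row, ]

# drop incomplete rows after splitting, as in the cell above
toy_train <- toy_train[complete.cases(toy_train), ]
toy_test <- toy_test[complete.cases(toy_test), ]
```

Because rows with missing values are dropped after the split, the final training and test sizes can be smaller than the nominal ones.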

In [46]:
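# seed fixes the random fold assignment used for cross-validation
set.seed(1234)

# linear-kernel SVM with cost C = 10 on unscaled predictors;
# cross = 10 requests a 10-fold cross-validation error estimate
svm_lin_C10 <- ksvm(AHD ~., data = heart_data_train,
                    kernel = "vanilladot",
                    C = 10, scale = FALSE, cross = 10)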

svm_lin_C10

 Setting default kernel parameters  


Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 10 

Linear (vanilla) kernel function. 

Number of Support Vectors : 72 

Objective Function Value : -631.0882 
Training error : 0.131313 
Cross validation error : 0.166842 

In [47]:
# first 5 indices for support vectors
svm_lin_C10@SVindex[1:5]

In [48]:
set.seed(1234)

svm_lin_C0.1 <- ksvm(AHD ~., data = heart_data_train,
                     kernel = "vanilladot",
                     C = 0.10, scale = FALSE, cross = 10)

svm_lin_C0.1

 Setting default kernel parameters  


Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 0.1 

Linear (vanilla) kernel function. 

Number of Support Vectors : 87 

Objective Function Value : -7.409 
Training error : 0.126263 
Cross validation error : 0.172105 

In [49]:
# first 5 indices for support vectors
svm_lin_C0.1@SVindex[1:5]
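
The linear fits above use the `"vanilladot"` kernel, and the fits that follow use `"polydot"` and `"rbfdot"`. As a rough sketch of what these kernels compute, the block below writes them as plain base R functions following the formulas in the kernlab documentation. The function names and the vectors `x` and `y` are made up for illustration.

```r
# Base R versions of the three kernels (illustrative names, not kernlab's API)
lin_kernel <- function(x, y) sum(x * y)                     # inner product
poly_kernel <- function(x, y, degree = 2, scale = 1, offset = 1) {
  (scale * sum(x * y) + offset)^degree                      # polynomial
}
rbf_kernel <- function(x, y, sigma = 1) {
  exp(-sigma * sum((x - y)^2))                              # Gaussian RBF
}

# two made-up observations
x <- c(1, 2)
y <- c(3, 4)

k_lin <- lin_kernel(x, y)
k_poly <- poly_kernel(x, y)
k_rbf <- rbf_kernel(x, y)

# with sigma = 1, a difference of 10 units in a single unscaled predictor
# already drives the RBF kernel to essentially zero
k_far <- rbf_kernel(c(200, 0), c(210, 0))
```

The last line hints at why `sigma = 1` with `scale = FALSE` may be a poor match for predictors measured in large units: distinct observations can look almost unrelated to the RBF kernel.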

In [50]:
set.seed(1234)

svm_poly_C10 <- ksvm(AHD ~., data = heart_data_train,
                     kernel = "polydot",
                     kpar = list(degree = 2, scale = 1,
                     offset = 1),
                     C = 10, scale = FALSE, cross = 10)

svm_poly_C10

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 10 

Polynomial kernel function. 
 Hyperparameters : degree =  2  scale =  1  offset =  1 

Number of Support Vectors : 81 

Objective Function Value : -4.0618 
Training error : 0 
Cross validation error : 0.307632 

In [51]:
# first 5 indices for support vectors
svm_poly_C10@SVindex[1:5]

In [52]:
set.seed(1234)

svm_poly_C0.1 <- ksvm(AHD ~., data = heart_data_train,
                      kernel = "polydot",
                      kpar = list(degree = 2, scale = 1,
                      offset = 1),
                      C = 0.10, scale = FALSE, cross = 10)

svm_poly_C0.1

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 0.1 

Polynomial kernel function. 
 Hyperparameters : degree =  2  scale =  1  offset =  1 

Number of Support Vectors : 97 

Objective Function Value : -2.5734 
Training error : 0.025253 
Cross validation error : 0.297895 

In [53]:
# first 5 indices for support vectors
svm_poly_C0.1@SVindex[1:5]

In [54]:
set.seed(1234)

svm_rbf_C10 <- ksvm(AHD ~., data = heart_data_train,
                    kernel = "rbfdot",
                    kpar = list(sigma = 1),
                    C = 10, scale = FALSE, cross = 10)

svm_rbf_C10

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 10 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  1 

Number of Support Vectors : 198 

Objective Function Value : -92.5452 
Training error : 0 
Cross validation error : 0.388158 

In [55]:
# first 5 indices for support vectors
svm_rbf_C10@SVindex[1:5]

In [56]:
set.seed(1234)

svm_rbf_C0.1 <- ksvm(AHD ~., data = heart_data_train,
                     kernel = "rbfdot",
                     kpar = list(sigma = 1),
                     C = 0.10, scale = FALSE, cross = 10)

svm_rbf_C0.1

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 0.1 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  1 

Number of Support Vectors : 198 

Objective Function Value : -17.6501 
Training error : 0.469697 
Cross validation error : 0.468947 

In [57]:
# first 5 indices for support vectors
svm_rbf_C0.1@SVindex[1:5]
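
To compare the six fits side by side, the sketch below collects the figures printed above into one data frame and sorts by cross-validation error. The numbers are transcribed from the outputs in this notebook. In a live session, kernlab's `nSV()`, `error()`, and `cross()` accessors should return the same quantities from each fitted object.

```r
# Figures transcribed from the ksvm outputs above
svm_summary <- data.frame(
  kernel = c("linear", "linear", "poly (degree 2)", "poly (degree 2)",
             "rbf (sigma 1)", "rbf (sigma 1)"),
  C = c(10, 0.1, 10, 0.1, 10, 0.1),
  n_sv = c(72, 87, 81, 97, 198, 198),
  train_error = c(0.131313, 0.126263, 0, 0.025253, 0, 0.469697),
  cv_error = c(0.166842, 0.172105, 0.307632, 0.297895, 0.388158, 0.468947)
)

# sort from lowest to highest cross-validation error
svm_sorted <- svm_summary[order(svm_summary$cv_error), ]
print(svm_sorted)

# gap between CV and training error as a rough overfitting indicator
svm_summary$gap <- svm_summary$cv_error - svm_summary$train_error
best <- svm_summary[which.min(svm_summary$cv_error), ]
```

On these figures the linear kernels have the lowest cross-validation error. The polynomial fit and the RBF fit with C = 10 reach zero training error but much higher cross-validation error, which suggests overfitting.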

In [58]:
# create predictor matrix and response vector
# since the predictors are a mix of numerical and categorical variables,
# it is simplest to set up a model matrix that converts categorical variables
# to sets of dummy variables
f1 <- formula(AHD ~ Age + Sex + ChestPain + RestBP + Chol + Fbs + RestECG + MaxHR +
                    ExAng + Oldpeak + Slope + Ca + Thal)

mf <- model.frame(f1, data = heart_data_train)
heart_data_train_X <- model.matrix(mf, data = heart_data_train)[,-1] # remove the intercept column
heart_data_train_Y <- heart_data_train$AHD

# set up tuning grid
tg_svmrbk <- expand.grid(C = c(0.001, 0.01, 0.1 , 1, 10),
                         sigma = seq(0.005, 0.10, by = 0.005))

set.seed(1234)

train_svmrbk <- train(x = heart_data_train_X, y = heart_data_train_Y,
                      method = "svmRadial", tuneGrid = tg_svmrbk,
                      trControl = trainControl(method = "cv", number = 10))

train_svmrbk$results[which.max(train_svmrbk$results$Accuracy),]

     C sigma  Accuracy     Kappa AccuracySD   KappaSD
50 0.1  0.05 0.8436591 0.6847598  0.0867336 0.1763347
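
The comments in the cell above say that a model matrix turns categorical predictors into sets of dummy variables. The block below is a small sketch of that conversion on a hypothetical data frame. The `toy` data and its level names are assumptions for illustration, not the heart data.

```r
# Hypothetical data: one numeric predictor and one three-level factor
toy <- data.frame(
  y = factor(c("No", "Yes", "No", "Yes", "No", "Yes")),
  age = c(50, 61, 47, 58, 66, 39),
  thal = factor(c("fixed", "normal", "reversable",
                  "normal", "fixed", "reversable"))
)

f_toy <- y ~ age + thal
mf_toy <- model.frame(f_toy, data = toy)

# drop the intercept column, as in the cell above
X_toy <- model.matrix(f_toy, data = mf_toy)[, -1]

# the factor becomes (levels - 1) indicator columns; the first level is the baseline
print(colnames(X_toy))
print(X_toy)
```

One thing to watch with this approach: if a factor level is absent from the test set, the test matrix can end up with different columns than the training matrix, so it is worth comparing `colnames()` of the two before predicting.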


In [59]:
# have to create dummy variables for categorical variables in test set
mftest <- model.frame(f1, data = heart_data_test)
heart_data_test_X <- model.matrix(mftest, data = heart_data_test)[,-1] # remove the intercept column

# obtain predicted classes
pred_test <- predict(train_svmrbk$finalModel, newdata = heart_data_test_X)

# construct confusion matrix
table(obs = heart_data_test$AHD, pred = pred_test) # accuracy = 80/99 = 0.81

     pred
obs   No Yes
  No  48   7
  Yes 12  32
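
The accuracy quoted in the comment (80/99) can be recomputed from the confusion matrix, along with sensitivity and specificity. The sketch below enters the counts from the table above by hand and treats "Yes" as the positive class, which is an assumption about how the result would be reported.

```r
# Counts copied from the confusion matrix above (rows = observed, cols = predicted)
cm <- matrix(c(48, 12, 7, 32), nrow = 2,
             dimnames = list(obs = c("No", "Yes"), pred = c("No", "Yes")))

accuracy <- sum(diag(cm)) / sum(cm)                  # correct / total
sensitivity <- cm["Yes", "Yes"] / sum(cm["Yes", ])   # true positives / observed Yes
specificity <- cm["No", "No"] / sum(cm["No", ])      # true negatives / observed No

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 3)
```

On these counts the classifier misses more true "Yes" cases (12 of 44) than it falsely flags "No" cases (7 of 55).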

# SVM Multiple Class Example

In [60]:
library(ISLR2)
str(Khan)

List of 4
 $ xtrain: num [1:63, 1:2308] 0.7733 -0.0782 -0.0845 0.9656 0.0757 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:63] "V1" "V2" "V3" "V4" ...
  .. ..$ : NULL
 $ xtest : num [1:20, 1:2308] 0.14 1.164 0.841 0.685 -1.956 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:20] "V1" "V2" "V4" "V6" ...
  .. ..$ : NULL
 $ ytrain: num [1:63] 2 2 2 2 2 2 2 2 2 2 ...
 $ ytest : num [1:20] 3 2 4 2 1 3 4 2 3 1 ...


In [61]:
# look at frequencies for outcome classes in training set
table(Khan$ytrain)


 1  2  3  4 
 8 23 12 20 

In [62]:
# set up training data
xtrain <- Khan$xtrain
xtest <- Khan$xtest
colnames(xtrain) <- paste0("x",1:2308)
colnames(xtest) <- paste0("x",1:2308)
ytrain_fac <- factor(Khan$ytrain)
ytest_fac <- factor(Khan$ytest)

# set up tuning grid
tg_svmlin <- expand.grid(C = c(0.001, 0.01, 0.1 , 1, 10, 100))

set.seed(1234)

train_svmlin <- train(x = xtrain, y = ytrain_fac,
                      method = "svmLinear", tuneGrid = tg_svmlin,
                      trControl = trainControl(method = "cv", number = 10))

train_svmlin

Support Vector Machines with Linear Kernel 

  63 samples
2308 predictors
   4 classes: '1', '2', '3', '4' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 57, 57, 57, 56, 57, 58, ... 
Resampling results across tuning parameters:

  C      Accuracy   Kappa
  1e-03  0.9857143  0.98 
  1e-02  0.9857143  0.98 
  1e-01  0.9857143  0.98 
  1e+00  0.9857143  0.98 
  1e+01  0.9857143  0.98 
  1e+02  0.9857143  0.98 

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 0.001.
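
All six values of C give the same cross-validated accuracy, yet the output reports C = 0.001 as the final value. One plausible reading, assuming the selection behaves like base R's `which.max()` (an assumption about caret's internals that the output itself does not confirm), is that ties go to the first row of the grid. The sketch below shows that tie behavior on the accuracies printed above.

```r
# Grid and accuracies copied from the resampling table above
C_grid <- c(0.001, 0.01, 0.1, 1, 10, 100)
cv_accuracy <- rep(0.9857143, length(C_grid))

# which.max() returns the index of the first maximum when values are tied
first_best <- which.max(cv_accuracy)
C_selected <- C_grid[first_best]
C_selected
```

Since the grid cannot distinguish between these values, the reported C says little about which cost is actually preferable for this data set.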

In [63]:
# confusion matrix for training data
table(pred = predict(train_svmlin$finalModel), obs = ytrain_fac)

    obs
pred  1  2  3  4
   1  8  0  0  0
   2  0 23  0  0
   3  0  0 12  0
   4  0  0  0 20

In [64]:
# confusion matrix for test data
table(pred = predict(train_svmlin$finalModel, newdata = xtest), obs = ytest_fac)

    obs
pred 1 2 3 4
   1 3 0 0 0
   2 0 6 2 0
   3 0 0 4 0
   4 0 0 0 5
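
The test confusion matrix can be summarized as an overall accuracy plus per-class recall and precision. The sketch below enters the counts from the table above by hand (rows are predictions, columns are observed classes, as in the output).

```r
# Counts copied from the test confusion matrix above; filled column by column
cm_test <- matrix(c(3, 0, 0, 0,
                    0, 6, 0, 0,
                    0, 2, 4, 0,
                    0, 0, 0, 5),
                  nrow = 4,
                  dimnames = list(pred = as.character(1:4),
                                  obs = as.character(1:4)))

test_accuracy <- sum(diag(cm_test)) / sum(cm_test)   # correct / total
recall <- diag(cm_test) / colSums(cm_test)           # per observed class
precision <- diag(cm_test) / rowSums(cm_test)        # per predicted class

test_accuracy
round(rbind(recall, precision), 3)
```

The only errors are two class 3 samples predicted as class 2, so class 3 has the lowest recall and class 2 the lowest precision.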