# MATH 3375 Examples Notebook #16

# Comparing Model Predictions and Metrics

We compare making predictions and computing metrics across classification techniques.  We also look at issues that can arise and ways to handle these issues.

Our response variable is **Cargo**, an indicator created below from the body type, where 1=SUV, wagon, minivan, or pickup, and 0=any other body type.


In [None]:
#Look at data set
cars <- read.csv("cars2004.csv",stringsAsFactors=TRUE)
head(cars)
summary(cars$Body)

In [None]:
#Remove columns and add response variable 'Cargo'
row.names(cars) <- paste(cars[,1],cars[,2],cars[,3])
cars$Cargo <- as.integer(cars$Body %in% c("SUV","Wagon","Minivan","Pickup"))
cars <- cars[,-c(1:3,13:14)]
head(cars)
summary(cars$Cargo)

## A Random Training-Test Partition

In [None]:
set.seed(3375)
n_train <- as.integer(0.6*nrow(cars))
train_idx <- sample(1:n_train)
cars_train <- cars[train_idx,]
cars_test <- cars[-train_idx,]
nrow(cars_train)
nrow(cars_test)
summary(as.factor(cars_train$Cargo))
summary(as.factor(cars_test$Cargo))

### Problem: Imbalanced Classes

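The partition above has NO records in the training set with Cargo set to 1. Because `sample(1:n_train)` only shuffles the first `n_train` row numbers, the training set is simply the first 60% of the file, and those rows contain no records with Cargo set to 1. This makes it impossible to train a model to predict Cargo as a response variable.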
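
The difference between shuffling the first rows and sampling from all rows can be seen in a small sketch. The sizes `n` and `n_train` and the names `perm_idx` and `rand_idx` are hypothetical, chosen only for illustration; they are not the cars data.

```r
# Hypothetical sizes, for illustration only (not the cars data)
n <- 10
n_train <- 6

set.seed(3375)
perm_idx <- sample(1:n_train)    # a permutation of 1..6: always the first 6 rows
rand_idx <- sample(n, n_train)   # 6 distinct row numbers drawn from 1..10

sort(perm_idx)   # 1 2 3 4 5 6, whatever the seed
sort(rand_idx)
```

When the rows are ordered by class, the first form can leave a class out of the training set entirely, while the second draws from the whole file.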

One way to address an imbalance in classes is **_stratified_** sampling, where a fixed proportion of each class is sampled for the training set.  
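
A minimal sketch of stratified sampling in base R, assuming a 60% training fraction within each class. The data frame `toy` and its values are hypothetical stand-ins for `cars`, and this is only one of several ways to draw such a sample.

```r
# Toy data standing in for `cars` (hypothetical values, for illustration only)
set.seed(3375)
toy <- data.frame(x = rnorm(100), Cargo = rep(c(0, 1), times = c(70, 30)))

# Group the row numbers by class, then sample 60% of the rows within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$Cargo)
strat_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[strat_idx, ]
toy_test  <- toy[-strat_idx, ]

# Both classes appear in both sets, in the same 70:30 proportion
table(toy_train$Cargo)
table(toy_test$Cargo)
```

Sampling positions with `sample.int()` and then indexing avoids the surprise that `sample(idx, ...)` treats a single number `idx` as the range `1:idx`.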

Below is an example of **_systematic_** sampling, another sampling mechanism that can lead to a more balanced sample. 

In [None]:
#Systematic sample: rows 2 and 3 of every block of 5 go to the test set (40%)
test_idx <- rep_len(c(FALSE,TRUE,TRUE,FALSE,FALSE), nrow(cars))
head(test_idx,10)
cars_train <- cars[!test_idx,]
cars_test <- cars[test_idx,]
nrow(cars_train)
nrow(cars_test)
summary(as.factor(cars_train$Cargo))
summary(as.factor(cars_test$Cargo))

## Create Several Models

We create three different types of model using all features as predictors:

* Logistic Regression with 0.5 as cutoff
* k-Nearest Neighbors (kNN) with k=3
* Support Vector Machine (SVM) with default kernel

Each model is created with the training set.  For each model, we also predict the class of the records in the test set.

In [None]:
#install.packages("e1071")
library(e1071)

#install.packages("class")
library(class)

In [None]:
#Logistic Regression Model
model_logit_cargo <- glm(Cargo ~ ., family="binomial", data=cars_train)
prob_logit_cargo <- predict(model_logit_cargo,cars_test,type="response")
pred_logit_cargo <- as.integer(prob_logit_cargo > 0.5)

#kNN Model
pred_knn_cargo <- knn(cars_train[,1:9], cars_test[,1:9], cl=cars_train[,10], k = 3)

#SVM Model
model_svm_cargo <- svm(as.factor(Cargo)~.,data=cars_train)
pred_svm_cargo <- predict(model_svm_cargo,cars_test)

## Model Metrics

All of the above models can be evaluated with:

* Confusion Matrices and related metrics
    * Accuracy
    * Sensitivity
    * Specificity
    * Precision
* ROC curve and Area Under Curve (AUC) metric

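Note that an ROC curve can be drawn for any model that gives binary classification predictions, although it is most informative when the predictions are scores or probabilities rather than hard 0/1 classes.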
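
As a sketch of how the confusion-matrix metrics listed above are computed, the following uses base R only. The vectors `actual` and `predicted` are hypothetical values for illustration; with the notebook's objects one would presumably substitute `cars_test$Cargo` and one of the prediction vectors.

```r
# Hypothetical labels and predictions, for illustration only
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 1)

# Confusion matrix with rows = predicted class, columns = actual class
cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
            Actual    = factor(actual,    levels = c(0, 1)))

TP <- cm["1", "1"]   # predicted 1, actually 1
TN <- cm["0", "0"]   # predicted 0, actually 0
FP <- cm["1", "0"]   # predicted 1, actually 0
FN <- cm["0", "1"]   # predicted 0, actually 1

accuracy    <- (TP + TN) / sum(cm)   # share of all predictions that are correct
sensitivity <- TP / (TP + FN)        # share of actual 1s predicted as 1
specificity <- TN / (TN + FP)        # share of actual 0s predicted as 0
precision   <- TP / (TP + FP)        # share of predicted 1s that are actual 1s

cm
c(accuracy = accuracy, sensitivity = sensitivity,
  specificity = specificity, precision = precision)
```

Fixing the factor levels to `c(0, 1)` keeps the table 2-by-2 even if a model never predicts one of the classes.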

In [None]:
#install.packages("pROC")
library(pROC)

In [None]:
Actual <- cars_test$Cargo
roc_data=roc(Actual, pred_logit_cargo, quiet=TRUE) 
plot(roc_data, print.auc=TRUE, main ="ROC curve - Logistic Reg")
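
Because `pred_logit_cargo` holds 0/1 classes rather than probabilities, the curve above has a single interior point, and the area under it reduces to the average of sensitivity and specificity. The base R sketch below checks this with the trapezoid rule; the vectors `actual` and `predicted` are hypothetical values for illustration, not the notebook's results.

```r
# Hypothetical labels and hard 0/1 predictions, for illustration only
actual    <- c(1, 0, 1, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 0, 1, 0, 1, 1, 1)

sens <- mean(predicted[actual == 1] == 1)   # true positive rate
spec <- mean(predicted[actual == 0] == 0)   # true negative rate

# With hard classes the ROC curve passes through only three points
fpr <- c(0, 1 - spec, 1)
tpr <- c(0, sens, 1)

# Trapezoid-rule area under the piecewise-linear curve
auc_trap <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

auc_trap
(sens + spec) / 2   # same value
```

Passing the fitted probabilities (`prob_logit_cargo` in this notebook) to `roc()` instead would trace the curve over every cutoff rather than only the 0.5 cutoff.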


### Data Requirements for ROC Curve

The code below results in an error because `knn()` returns its predictions as a factor of class labels, and `roc()` requires a numeric (or ordered) predictor.

In [None]:
roc_data=roc(Actual, pred_knn_cargo, quiet=TRUE) 
plot(roc_data, print.auc=TRUE, main ="ROC curve - kNN-3")

#### Formatting the Predictions to Avoid this Error

In [None]:
pred_knn_int <- as.integer(pred_knn_cargo)
pred_knn_int

#### One Last Adjustment

Notice that the values produced are 1's and 2's, rather than 0's and 1's.  That is because a factor stores its **_levels_** as integer codes beginning with 1, and `as.integer()` returns those codes rather than the labels.  We need to first convert the factor to character (recovering the labels "0" and "1"), then convert THAT to integer.

In [None]:
pred_knn_int <- as.integer(as.character(pred_knn_cargo))
pred_knn_int

### ROC Curve with Correct Data Format for Predictions

In [None]:
roc_data=roc(Actual, pred_knn_int, quiet=TRUE) 
plot(roc_data, print.auc=TRUE, main ="ROC curve - kNN-3")