# Classification Model Selection

## Goal

Apply and compare several classification models and select the one with the best performance score. The following classification techniques will be explored:

- **Logistic Regression**
- **K-Nearest Neighbors (K-NN)**
- **Support Vector Machine (SVM)**
- **Kernel SVM**
- **Naive Bayes**
- **Decision Tree Classification**
- **Random Forest Classification**

Each model will be trained on the training set and evaluated on a held-out test set (the SVM hyperparameters are additionally tuned with 10-fold cross-validation). The best model will be selected based on its performance score.

## Metrics

For breast cancer classification, the F1-Score will be used because the classes (2 for benign and 4 for malignant) are imbalanced. The F1-Score is the harmonic mean of Precision (which penalizes false positives) and Recall (which penalizes false negatives), both of which can have significant consequences. Additionally, ROC-AUC will be considered to evaluate how well the model distinguishes between the two classes overall.
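To make the two metrics concrete, here is a minimal base R sketch. The helpers `f1_from_counts` and `auc_rank` are hypothetical names introduced only for illustration (the notebook itself relies on `MLmetrics` and `pROC`). The counts are taken from the KNN confusion matrix reported further down, and the four scores in the AUC example are made up.

```r
# Precision, recall and F1 from raw confusion-matrix counts (base R only).
# `f1_from_counts` is a hypothetical helper, not part of MLmetrics.
f1_from_counts <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)  # share of predicted positives that are correct
  recall    <- tp / (tp + fn)  # share of actual positives that are found
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Positive class "4" (malignant): 41 true positives,
# 3 false positives, 0 false negatives
knn_metrics <- f1_from_counts(tp = 41, fp = 3, fn = 0)
print(round(knn_metrics, 4))

# ROC-AUC as the probability that a randomly chosen positive case
# receives a higher score than a randomly chosen negative case
# (rank-based formulation). `auc_rank` is a hypothetical helper.
auc_rank <- function(labels, scores, positive) {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  r <- rank(c(pos, neg))  # average ranks handle tied scores
  n_pos <- length(pos)
  n_neg <- length(neg)
  (sum(r[seq_len(n_pos)]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores: three of the four positive/negative pairs are ordered correctly
toy_auc <- auc_rank(labels = c(2, 2, 4, 4),
                    scores = c(0.10, 0.40, 0.35, 0.80),
                    positive = 4)
print(toy_auc)
```

The F1 value obtained from these counts (82/85, about 0.9647) agrees with the KNN score shown in the results, which is a quick sanity check that the confusion matrix and the reported score describe the same predictions.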


## Libraries Loading

In [7]:
# Libraries loading

library(tidyverse)
library(rpart)
library(caret)
library(ggplot2)
library(MLmetrics)
library(class)
library(e1071)
library(randomForest)
library(pROC)

## Data Loading

In [2]:
# Data loading
data <- read_csv('../00_data/Data.csv')

glimpse(data)

Rows: 683 Columns: 11
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
dbl (11): Sample code number, Clump Thickness, Uniformity of Cell Size, Unif...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Rows: 683
Columns: 11
$ `Sample code number`          <dbl> 1000025, 1002945, 1015425, 1016277, 1017…
$ `Clump Thickness`             <dbl> 5, 5, 3, 6, 4, 8, 1, 2, 2, 4, 1, 2, 5, 1…
$ `Uniformity of Cell Size`     <dbl> 1, 4, 1, 8, 1, 10, 1, 1, 1, 2, 1, 1, 3, …
$ `Uniformity of Cell Shape`    <dbl> 1, 4, 1, 8, 1, 10, 1, 2, 1, 1, 1, 1, 3, …
$ `Marginal Adhesion`           <dbl> 1, 5, 1, 1, 3, 8, 1, 1, 1, 1, 1, 1, 3, 1…
$ `Single Epithelial Cell Size` <dbl> 2, 7, 2, 3, 2, 7, 2, 2, 2, 2, 1, 2, 2, 2…
$ `Bare Nuclei`                 <dbl> 1, 10, 2, 4, 1, 10, 10, 1, 1, 1, 1, 1, 3…
$ `Bland Chromatin`             <dbl> 3, 3, 3, 3, 3, 9, 3, 3, 1, 2, 3, 2, 4, 3…
$ `Normal Nucleoli`             <dbl> 1, 2, 1, 7, 1, 7, 1, 1, 1, 1, 1, 1, 4, 1…
$ Mitoses                       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 

In [3]:
# Splitting into training and test set

random_seed <- 444

set.seed(random_seed)
indexes <- sample(1:nrow(data), size = 0.8*nrow(data))

train_data <- data[indexes, -1] # remove `Sample code number`
test_data <- data[-indexes, -1]

head(train_data)

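| Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class |
|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 3 | 3 | 2 | 6 | 3 | 3 | 3 | 5 | 1 | 2 |
| 5 | 10 | 8 | 10 | 8 | 10 | 3 | 6 | 3 | 4 |
| 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 2 |
| 3 | 2 | 2 | 1 | 4 | 3 | 2 | 1 | 1 | 2 |
| 5 | 4 | 6 | 7 | 9 | 7 | 8 | 10 | 1 | 4 |
| 4 | 1 | 1 | 2 | 2 | 1 | 2 | 1 | 1 | 2 |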
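The split above is a simple random sample, so the class proportions in the two parts can drift slightly from those in the full data. Since the classes are described as imbalanced, one possible alternative is a stratified split. The sketch below uses base R only; `stratified_indexes` is a hypothetical helper and the toy labels are invented for illustration, not taken from the dataset.

```r
# Stratified 80/20 split in base R: sample 80% of the row indexes
# within each class so both parts keep the class proportions.
# `stratified_indexes` is a hypothetical helper, not used in the notebook.
stratified_indexes <- function(y, p = 0.8, seed = 444) {
  set.seed(seed)
  idx_by_class <- split(seq_along(y), y)  # row indexes grouped by class
  picked <- lapply(idx_by_class, function(idx) {
    # sample.int avoids the sample() pitfall with length-one vectors
    idx[sample.int(length(idx), size = floor(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

toy_class <- c(rep(2, 65), rep(4, 35))  # illustrative labels only
train_idx <- stratified_indexes(toy_class)

prop.table(table(toy_class[train_idx]))   # class shares in the training part
prop.table(table(toy_class[-train_idx]))  # class shares in the test part
```

With the toy labels, both parts keep the 65/35 class ratio. The `caret` package loaded earlier offers `createDataPartition()`, which performs a similar stratified sampling.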


## Target Variable and Features

+ Target variable:
    + `Class`
        + **2** - benign
        + **4** - malignant
+ Features:
    + `Clump Thickness`
    + `Uniformity of Cell Size`
    + `Uniformity of Cell Shape`
    + `Marginal Adhesion`
    + `Single Epithelial Cell Size`
    + `Bare Nuclei`
    + `Bland Chromatin`
    + `Normal Nucleoli`
    + `Mitoses`
    

## Data Preprocessing

In [4]:
train_data <- train_data |>
        mutate(Class = factor(Class, levels = c(2, 4)))

test_data <- test_data |>
        mutate(Class = factor(Class, levels = c(2, 4)))

colnames(train_data) <- gsub(" ", "", colnames(train_data))
colnames(test_data) <- gsub(" ", "", colnames(test_data))
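The effect of these two preprocessing steps can be seen on a tiny example. The data frame `toy` below is invented for illustration. The sketch shows that the explicit `levels = c(2, 4)` fixes the level order (with a factor response, `glm(..., family = "binomial")` treats the first level as "failure", so the predicted probability refers to class 4), and that removing spaces yields syntactic column names, which are presumably easier to use with formula interfaces such as `Class ~ .`.

```r
# Toy data frame with a non-syntactic column name (illustrative values only)
toy <- data.frame(`Clump Thickness` = c(5, 3),
                  Class = c(4, 2),
                  check.names = FALSE)

# Encode the target as a factor with a fixed level order: 2 first, then 4
toy$Class <- factor(toy$Class, levels = c(2, 4))

# Drop spaces from the column names
colnames(toy) <- gsub(" ", "", colnames(toy))

print(colnames(toy))       # "ClumpThickness" "Class"
print(levels(toy$Class))   # "2" "4"
print(as.integer(toy$Class))  # internal codes: 2 for class 4, 1 for class 2
```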

## Select the Model

In [5]:
random_seed <- 42

# get scores to choose the best model
scores <- function(y_pred, y_prob){
    roc_curve <- roc(test_data$Class, y_prob, levels = c(2, 4), direction = "<")
    auc_value <- auc(roc_curve)
    list('f1' = F1_Score(y_true = test_data$Class, y_pred = y_pred, positive = "4"),
         'roc-auc' = auc_value)
}

# convert probabilities to classes
prob_to_class <- function(y_prob){
    ifelse(y_prob > 0.5, 4, 2)
}

# get confusion matrix
conf_mat <- function(y_pred){
    table(test_data$Class, y_pred)
}

# models configurations
models_config <- list(
    
  # 1. Logistic Regression
  'Logistic Regression' = list(
      name = 'Logistic Regression',
      train_test = function(){
          set.seed(random_seed)
          # fit the model on the train set
          fit <- glm(Class ~ .,
                     family = "binomial",
                     data = train_data)
          # get probabilities predicted on test set
          prob_pred <- predict(fit,
                               type = 'response',
                               newdata = test_data)
          
          return(prob_pred)
      },
      evaluate = conf_mat,
      scores = scores
  ),
    
  # 2. KNN
  'KNN' = list(
      name = 'KNN',
      train_test = function(){
          # fit the model on the train set and get probabilities predicted on test set
          y_pred <- knn(train = train_data[, -ncol(train_data)],
                        test = test_data[, -ncol(test_data)],
                        cl = train_data$Class, # classes from training set
                        k = 5,
                        prob = TRUE) # get probabilities

          winning_class_probs <- attr(y_pred, "prob")
          # Adjust probabilities for the positive class - 4
          positive_probs <- ifelse(y_pred == 4, winning_class_probs, 1 - winning_class_probs)
          
          return(positive_probs)
    },
    evaluate = conf_mat,
    scores = scores
  ),
    
  # 3. SVM
  'SVM' = list(
      name = 'SVM',
      train_test = function(){
          kernels <- list(linear = list(kernel = 'linear',
                                        params = list(cost = c(0.1, 1, 10, 100))),
                          radial = list(kernel = 'radial',
                                        params = list(cost = c(0.1, 1, 10, 100), 
                                                      gamma = c(1, 0.1, 0.01))),
                          poly = list(kernel = 'polynomial',
                                      params = list(cost = c(0.1, 1, 10, 100), 
                                                    gamma = c(1, 0.1, 0.01),
                                                    degree = 2:4))
                         )
          # tune params and select the best SVM kernel
          best_models <- lapply(kernels, function(x){
              set.seed(random_seed)
              tuned <- tune('svm',
                            Class ~ .,
                            data = train_data,
                            kernel = x$kernel,
                            ranges = x$params,
                            tunecontrol = tune.control(cross = 10),
                            probability = TRUE
                           )
            
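              best_model <- tuned$best.model
              # NOTE: kernels are compared by F1 on the test set, so the reported
              # test scores are optimistic; tuned$best.performance holds the
              # cross-validated error if a leakage-free comparison is needed
              y_pred <- predict(best_model, newdata = test_data)
              f1 <- F1_Score(y_true = test_data$Class, y_pred = y_pred, positive = "4")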
              list(kernel = x$kernel,
                   best.params = tuned$best.parameters,
                   best.f1=f1,
                   best.model=best_model)
          })
          # select best SVM
          best_model <- best_models[[which.max(sapply(best_models, function(x) x$best.f1))]]
          # predict probabilities on test set
          y_pred <- predict(best_model$best.model, newdata = test_data, probability = TRUE)
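          probs <- attr(y_pred, 'probabilities')
          # select the positive class ("4") by name: the column order follows
          # the order in which the labels appear in the training data
          return(probs[, "4"])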
      },
      evaluate = conf_mat,
      scores = scores
  ),
                                               
  # 4. Naive Bayes
  'Naive Bayes' = list(
      name = 'Naive Bayes',
      train_test = function(){
          # fit the model on training set
          fit <- naiveBayes(Class ~ ., data = train_data)
          # predict probabilities on test set
          y_pred <- predict(fit, newdata = test_data[-ncol(test_data)], type = 'raw')
          
          return(y_pred[, 2])
      },
      evaluate = conf_mat,
      scores = scores
  ),
                                               
  # 5. Decision Tree
  'Decision Tree' = list(
      name = 'Decision Tree',
      train_test = function(){
          set.seed(random_seed)
          # fit the model on training set
          fit <- rpart(Class ~ ., data = train_data)
          # predict probabilities on test set
          y_pred <- predict(fit, newdata = test_data[-ncol(test_data)], type = 'prob')
          
          return(y_pred[, 2])
      },
      evaluate = conf_mat,
      scores = scores
  ),
                                               
  # 6. Random Forest
  'Random Forest' = list(
      name = 'Random Forest',
      train_test = function(){
          set.seed(random_seed)
          # fit the model on training set
          fit <- randomForest(Class ~ .,
                              data = train_data,
                              ntree = 100)        
          # predict probabilities on test set
          y_pred <- predict(fit, newdata = test_data, type = 'prob')
          
          return(y_pred[, 2])
      },
      evaluate = conf_mat,
      scores = scores
  )
)

# run experiment
best_models <- lapply(models_config, function(x){
    print(paste0('Running ', x[['name']], '...'))
    # train
    prob_pred <- x[['train_test']]()
    #print('Model trained!')
    y_pred <- prob_to_class(prob_pred)
    #print('Evaluating...')
    cm <- x[['evaluate']](y_pred)
    # score
    #print('Scoring...')
    scores <- x[['scores']](y_pred, prob_pred)
    list(
        #name = x[['name']],
        confusion_matrix = cm,
        scores = scores
    )
})

best_models
                                     

[1] "Running Logistic Regression..."
[1] "Running KNN..."
[1] "Running SVM..."
[1] "Running Naive Bayes..."
[1] "Running Decision Tree..."
[1] "Running Random Forest..."


$`Logistic Regression`
$`Logistic Regression`$confusion_matrix
   y_pred
     2  4
  2 93  3
  4  2 39

$`Logistic Regression`$scores
$`Logistic Regression`$scores$f1
[1] 0.939759

$`Logistic Regression`$scores$`roc-auc`
Area under the curve: 0.9936



$KNN
$KNN$confusion_matrix
   y_pred
     2  4
  2 93  3
  4  0 41

$KNN$scores
$KNN$scores$f1
[1] 0.9647059

$KNN$scores$`roc-auc`
Area under the curve: 0.9931



$SVM
$SVM$confusion_matrix
   y_pred
     2  4
  2 93  3
  4  1 40

$SVM$scores
$SVM$scores$f1
[1] 0.952381

$SVM$scores$`roc-auc`
Area under the curve: 0.9957



$`Naive Bayes`
$`Naive Bayes`$confusion_matrix
   y_pred
     2  4
  2 90  6
  4  1 40

$`Naive Bayes`$scores
$`Naive Bayes`$scores$f1
[1] 0.9195402

$`Naive Bayes`$scores$`roc-auc`
Area under the curve: 0.9925



$`Decision Tree`
$`Decision Tree`$confusion_matrix
   y_pred
     2  4
  2 91  5
  4  3 38

$`Decision Tree`$scores
$`Decision Tree`$scores$f1
[1] 0.9047619

$`Decision Tree`$scores$`roc-auc`
Area under the

In [6]:
# Best model selection
f1_scores <- sapply(best_models, function(x) x$scores$f1)
max_val <- max(f1_scores)
# keep every model that reaches the maximum F1 (ties are possible)
best_of_the_best <- best_models[which(f1_scores == max_val)]

print('The best models:')

best_of_the_best

[1] "The best models:"


$KNN
$KNN$confusion_matrix
   y_pred
     2  4
  2 93  3
  4  0 41

$KNN$scores
$KNN$scores$f1
[1] 0.9647059

$KNN$scores$`roc-auc`
Area under the curve: 0.9931



$`Random Forest`
$`Random Forest`$confusion_matrix
   y_pred
     2  4
  2 93  3
  4  0 41

$`Random Forest`$scores
$`Random Forest`$scores$f1
[1] 0.9647059

$`Random Forest`$scores$`roc-auc`
Area under the curve: 0.9926
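KNN and Random Forest tie on F1, so the secondary metric named in the Metrics section can serve as a tie-breaker. The following is only a sketch of that idea: the data frame `tied` is a hypothetical structure filled with the values printed above, not an object produced by the notebook.

```r
# Tie-break sketch: among the models with the highest F1,
# prefer the one with the higher ROC-AUC.
# `tied` is a hypothetical data frame; the values are copied from the output above.
tied <- data.frame(
  model   = c("KNN", "Random Forest"),
  f1      = c(0.9647059, 0.9647059),
  roc_auc = c(0.9931, 0.9926),
  stringsAsFactors = FALSE
)

# Sort by F1 first, then by ROC-AUC, both in descending order
ranked <- tied[order(-tied$f1, -tied$roc_auc), ]
winner <- ranked$model[1]
print(winner)
```

Under this rule KNN comes out ahead, although an AUC difference of 0.0005 on a test set of 137 samples is very small and should be treated as a tie-breaking convention rather than evidence that one model is better.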


