# **Exploration and Research - AD454**

## **Caret Package**

### **Summary**

The caret package provides functions for building predictive models for complex regression and classification problems. Beyond model fitting, it simplifies data splitting, preprocessing, feature selection with recursive feature elimination (RFE), visualisation, and model tuning. It draws on a number of other R packages (32) and offers a uniform interface to their modelling functions, along with a standard way to carry out common tasks. Thanks to Max Kuhn and the other contributors, machine learning approaches can be applied to everyday problems more easily and efficiently.
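
The workflow described above can be sketched in a few lines. This is only an illustration on the built-in `iris` data, not part of the analysis below; the seed, the 80/20 split, the choice of k-nearest neighbours, and the 5-fold cross-validation are all assumptions made for the example.

```r
library(caret)

set.seed(42)  # arbitrary seed, only to make this sketch reproducible

# Stratified 80/20 split on the outcome
idx       <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# 5-fold cross-validation; the same control object can be reused for other methods
ctrl <- trainControl(method = "cv", number = 5)

# k-nearest neighbours with centering and scaling; tuneLength = 3 tries three values of k
knn_fit <- train(Species ~ ., data = train_set,
                 method     = "knn",
                 preProcess = c("center", "scale"),
                 trControl  = ctrl,
                 tuneLength = 3)

# Predictions go through the same predict() interface regardless of the method
knn_pred <- predict(knn_fit, newdata = test_set)
mean(knn_pred == test_set$Species)  # hold-out accuracy
```

Switching to another algorithm would, in most cases, only require changing the `method` argument, which is the uniform interface the summary refers to.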

### **Caret vs other ML Packages**

##### **Comparison with Tidymodels Package**

Both packages are primarily written by Max Kuhn. caret is the older of the two, so solutions to common problems are easy to find. tidymodels is newer, roughly a tidy version of caret. Because it is still under development, you may occasionally run into bugs.

Rebecca Barter states in her blog that caret becomes slow even at a modest level of model building, while tidymodels is a newer interface that integrates more closely with other packages such as the tidyverse.

##### **Comparison with Mlr3 Package**

mlr3 is object-oriented: it provides 'R6' objects throughout the machine learning workflow, and R6 lets it overcome the limitations of R's S3 classes. On the other hand, mlr3 is still new and less mature than caret, so it is harder to find solutions to the problems you encounter. While mlr3 focuses on the core computational operations, caret provides additional functionality.

##### **Comparison with Mlflow Package**

MLflow is an open source platform for managing the end-to-end machine learning lifecycle. Its primary components include "tracking", which records and compares the parameters and results of experiments; "models", which manages and deploys models from a variety of ML libraries to a variety of model serving and inference platforms; and "projects", which packages ML code in a reusable, reproducible form for sharing with other data scientists or moving to production. It is not an alternative to caret or other ML packages. On the contrary, MLflow is an umbrella tool within which the functionality of other ML packages can be used.

#### **References**

- https://www.machinelearningplus.com/machine-learning/caret-package/
- https://www.linkedin.com/pulse/max-kuhns-twins-caret-tidymodels-dr-amita-sharma/?trk=public_profile_article_view
- https://www.r-bloggers.com/2019/12/meta-machine-learning-aggregator-packages-in-r-the-2nd-generation/
- https://cran.r-project.org/web/packages/mlr3/index.html
- https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/

### **Implementation - Logistic Regression on Bank Loan Acceptance**

In [None]:
library(tidyverse)
library(data.table)
library(plotly)
library(DT)
library(broom)
library(caret)
library(psych)
library(GGally)
library(magrittr)
library(lindia)

options(repr.matrix.max.rows=20, repr.matrix.max.cols=15) # for limiting the number of top and bottom rows of tables printed 
datapath <- "~/data_ad454"

In [None]:
datapath <- "~/data_ad454"
realty_data <- readRDS(sprintf("%s/rds/06_02_realty_data3.rds", datapath))

In [None]:
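# Binary target: 1 when premium_neigh is positive, 0 otherwise
realty_data[, premium := as.integer(premium_neigh > 0)]
# Predictors: furnished (esyali), loan eligibility (krediye_uygunluk), building age (bina_yasi),
# number of floors (kat_sayisi), floor (kat), plus every logical column
vars <- c("premium", "esyali", "krediye_uygunluk", "bina_yasi", "kat_sayisi", "kat", realty_data %>% keep(is.logical) %>% names)
realty_data2 <- realty_data %>% select(all_of(vars)) %>% na.omit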

In [None]:
realty_data2$premium <- as.factor(realty_data2$premium)

In [None]:
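set.seed(454) # arbitrary seed so that the random split is reproducible
# Stratified 70/30 split: class proportions of premium are preserved in both parts
train_indices <- createDataPartition(realty_data2$premium, p = .7, 
                                  list = FALSE, 
                                  times = 1)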
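
`createDataPartition()` samples within each class of the outcome, so the training and test parts keep roughly the same class balance as the full data. A small sketch on a hypothetical imbalanced outcome (the vector `y` and the seed are invented for illustration):

```r
library(caret)

set.seed(1)  # arbitrary seed for this sketch

# Hypothetical imbalanced outcome: 80% "0", 20% "1"
y <- factor(c(rep("0", 800), rep("1", 200)))

# list = FALSE returns a one-column matrix of row indices; keep it as a plain vector
idx <- createDataPartition(y, p = 0.7, list = FALSE, times = 1)[, 1]

prop.table(table(y))        # full data
prop.table(table(y[idx]))   # training part
prop.table(table(y[-idx]))  # test part
```

All three tables should show close to the same 80/20 balance, which a purely random split does not guarantee.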

In [None]:
train_data <- realty_data2[ train_indices,]
test_data  <- realty_data2[-train_indices,]

In [None]:
logreg_model <- train(premium~., data = train_data, 
                 method = "glm", family = "binomial")
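
When no `trControl` is supplied, `train()` estimates performance with its default bootstrap resampling. A minimal sketch of switching to 10-fold cross-validation is shown here on hypothetical toy data (the data frame `toy`, its columns, and the seed are assumptions made for illustration, not the realty data):

```r
library(caret)

set.seed(123)  # arbitrary seed for this sketch

# Hypothetical toy data: x1 drives the outcome, x2 is pure noise
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
toy <- data.frame(
  y  = factor(ifelse(runif(n) < plogis(1.5 * x1), "yes", "no"),
              levels = c("no", "yes")),
  x1 = x1,
  x2 = x2
)

# 10-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 10)

toy_fit <- train(y ~ ., data = toy,
                 method = "glm", family = "binomial",
                 trControl = ctrl)

toy_fit$results           # cross-validated Accuracy and Kappa
coef(toy_fit$finalModel)  # coefficients of the underlying glm fit
```

The fitted `glm` object is stored in `finalModel`, so the usual tools for `glm` objects remain available after training through caret.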

In [None]:
logreg_model

In [None]:
summary(logreg_model)

In [None]:
pred_train <- predict(logreg_model, train_data, type = "prob")
train_class <- ifelse(pred_train[,1] < 0.4, 1, 0)
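
The thresholding rule above works on the first probability column, which holds the probability of the first factor level (here class 0). Labelling an observation as 1 when that probability is below 0.4 is the same as requiring the probability of class 1 to exceed 0.6. A sketch with a hypothetical probability table (the values are invented):

```r
# Hypothetical probability table with one column per class, in factor-level order
probs <- data.frame(`0` = c(0.90, 0.55, 0.35, 0.10),
                    `1` = c(0.10, 0.45, 0.65, 0.90),
                    check.names = FALSE)

# Rule used above: label as 1 when P(class 0) < 0.4
by_first_col <- ifelse(probs[, 1] < 0.4, 1, 0)

# Equivalent, more explicit rule: label as 1 when P(class 1) > 0.6
by_second_col <- ifelse(probs[, 2] > 0.6, 1, 0)

by_first_col                            # 0 0 1 1
identical(by_first_col, by_second_col)  # TRUE
```

A cutoff of 0.6 on the positive class is stricter than the usual 0.5, so fewer observations are labelled as premium.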

In [None]:
table(actual = train_data$premium, fitted = train_class)

In [None]:
pred_test <- predict(logreg_model, test_data, type = "prob")
test_class <- ifelse(pred_test[,1] < 0.4, 1, 0)

In [None]:
table(actual = test_data$premium, fitted = test_class)
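
Besides a plain `table()`, caret offers `confusionMatrix()`, which reports accuracy, sensitivity, and specificity together with the cross-tabulation. A sketch on hypothetical label vectors (the values are invented; both inputs must be factors with the same levels):

```r
library(caret)

# Hypothetical actual and predicted labels
actual    <- factor(c(0, 0, 0, 0, 1, 1, 1, 0, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 1, 0, 1, 1, 0, 0, 1, 1), levels = c(0, 1))

# positive = "1" makes sensitivity and specificity refer to class 1
cm <- confusionMatrix(data = predicted, reference = actual, positive = "1")

cm$table                                     # rows = predicted, columns = actual
cm$overall["Accuracy"]                       # 0.8
cm$byClass[c("Sensitivity", "Specificity")]  # 0.8 and 0.8
```

For the models in this notebook, the numeric class vectors produced by `ifelse()` would first need to be converted with `factor(..., levels = c(0, 1))`.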

In [None]:
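# ROC curve on the training set; the marker M should be the probability of the positive class (column 2),
# otherwise the curve is mirrored below the diagonal
p1 <- data.table(D = train_data[,premium], M = pred_train[,2]) %>%
ggplot(aes(m = M, d = D)) +
    plotROC::geom_roc() +
    plotROC::style_roc(theme = theme_grey)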

plotROC::export_interactive_roc(p1) %>% IRdisplay::display_html()

In [None]:
# Area under the ROC curve on the training set, using the probability of the positive class
pROC::auc(train_data[,premium], pred_train[,2])
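
The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. A base R sketch of that definition on hypothetical labels and scores (the values are invented, and this is not how `pROC` computes it internally):

```r
# Hypothetical labels and scores (higher score = more likely class 1)
labels <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.35, 0.40, 0.50, 0.65, 0.70, 0.20, 0.90)

pos <- scores[labels == 1]
neg <- scores[labels == 0]

# Share of (positive, negative) pairs ranked correctly, ties counted as one half
auc_manual <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc_manual  # 0.9375
```

Here 15 of the 16 positive/negative pairs are ordered correctly, giving 0.9375. A value of 0.5 corresponds to random ranking and 1 to perfect separation.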