# CatBoost R Tutorial
R kernel for Jupyter Notebook: [link](https://irkernel.github.io/installation/)

In [1]:
library(catboost)
library(caret)
library(titanic)

Loading required package: lattice
Loading required package: ggplot2


## Make catboost pool

###  `catboost.load_pool`

Two files are needed to create a CatBoost pool in R:

- File with features
  
```sh
> cat adult_train.1000 | head -1
1	28.0	Private	120135.0	Assoc-voc	11.0	Never-married	Sales	Not-in-family	White	Female	0.0	0.0	40.0	United-States
```

- Column description file

```sh
> cat adult.cd | head -3
0	Target
2	Categ
4	Categ
```

Column indices are 0-based. Each column type must be one of:

- Target (one column);
- Categ;
- Num (default type).

Numeric columns do not have to be listed: any column without an entry is treated as `Num`.
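
As a minimal sketch of the format described above, the following base-R snippet writes a column description file and reads it back. The helper name `write_cd` and the temporary file are hypothetical and not part of the catboost package; the categorical indices are an assumption derived from the 1-based column positions used in the R cells of this tutorial, shifted to 0-based.

```r
# Hypothetical helper: write a column description (.cd) file.
# Indices are 0-based; columns that are not listed default to Num.
write_cd <- function(path, target_index, categ_indices) {
  lines <- c(paste(target_index, "Target", sep = "\t"),
             paste(categ_indices, "Categ", sep = "\t"))
  writeLines(lines, path)
  invisible(path)
}

cd_path <- tempfile(fileext = ".cd")
write_cd(cd_path, target_index = 0,
         categ_indices = c(2, 4, 6, 7, 8, 9, 10, 14))

# Read the file back to check the layout: one "<index><TAB><type>" pair per line.
cd <- read.table(cd_path, sep = "\t", col.names = c("index", "type"),
                 stringsAsFactors = FALSE)
print(head(cd, 3))
```

The first three rows printed should correspond to the `head -3` excerpt of `adult.cd` shown above.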

In [45]:
# load pool from path
pool_path = '../R-package/inst/extdata/adult_train.1000'
column_description_path = '../R-package/inst/extdata/adult.cd'
pool <- catboost.load_pool(pool_path, column_description_path)
catboost.head(pool, 1)

# load pool from package
pool_path = system.file("extdata", "adult_train.1000", package="catboost")
column_description_path = system.file("extdata", "adult.cd", package="catboost")
pool <- catboost.load_pool(pool_path, column_description_path)
catboost.head(pool, 1)

###  `catboost.from_matrix`

Categorical features must be converted to numeric columns using a method of your choice (e.g. a string hash). Indices in the **`cat_features`** vector are 0-based and may differ from the indices in the **`.cd`** file, for example when the target column is not part of the feature matrix.

In [74]:
pool_path = '../R-package/inst/extdata/adult_train.1000'

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
    column_description_vector[i] <- 'factor'

data <- read.table(pool_path, head=F, sep="\t", colClasses=column_description_vector)

# Transform categorical features to numeric.
for (i in cat_features)
    data[,i] <- as.numeric(factor(data[,i]))

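target <- c(1)
# cat_features holds 1-based column positions in `data`; the pool expects 0-based
# indices into the feature matrix, which no longer contains the target column.
pool <- catboost.from_matrix(data = as.matrix(data[,-target]),
                             target = as.matrix(data[,target]),
                             cat_features = cat_features - 2)
catboost.head(pool, 1)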
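
Because `cat_features` is 0-based while R indexes columns from 1, the conversion is easy to get wrong. The sketch below works through it on a toy data frame in base R only; the data frame and the variable names (`df`, `cat_cols`, `cat_features0`) are made up for illustration, and it assumes a single target column.

```r
# Toy data frame: column 1 is the target, columns 3 and 4 are categorical.
df <- data.frame(label = c(1, -1, 1),
                 age = c(28, 35, 41),
                 job = c("Sales", "Tech", "Sales"),
                 sex = c("Female", "Male", "Male"),
                 stringsAsFactors = TRUE)
target_col <- 1
cat_cols <- c(3, 4)  # 1-based positions in the full data frame

# Encode each categorical column as integer codes.
for (i in cat_cols) df[, i] <- as.numeric(df[, i])

# Feature matrix without the target column.
features <- as.matrix(df[, -target_col])

# Subtract 1 for the dropped target column (if it sits to the left),
# then subtract 1 again to move from 1-based to 0-based indexing.
cat_features0 <- cat_cols - (cat_cols > target_col) - 1

print(colnames(features)[cat_features0 + 1])
```

The final line maps the 0-based indices back to column names as a sanity check that they point at the categorical columns.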

### `catboost.from_data_frame`

Categorical features must be converted to factors (for example with `as.factor()` or the `colClasses` argument of `read.table()`). Numeric features and the target must be of type numeric.

In [79]:
train_path = '../R-package/inst/extdata/adult_train.1000'
test_path = '../R-package/inst/extdata/adult_test.1000'

column_description_vector = rep('numeric', 15)
cat_features <- c(3, 5, 7, 8, 9, 10, 11, 15)
for (i in cat_features)
    column_description_vector[i] <- 'factor'
    
train <- read.table(train_path, head=F, sep="\t", colClasses=column_description_vector)
test <- read.table(test_path, head=F, sep="\t", colClasses=column_description_vector)
target <- c(1)
train_pool <- catboost.from_data_frame(data=train[,-target], target=train[,target])
test_pool <- catboost.from_data_frame(data=test[,-target], target=test[,target])
catboost.head(train_pool, 1)
catboost.head(test_pool, 1)
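
The type requirements for `catboost.from_data_frame` can be checked before building a pool. The following is a minimal base-R sketch; `check_pool_frame` is a hypothetical helper, not a catboost function, and the toy data frame is made up for illustration.

```r
# Hypothetical helper: report feature columns that are neither factor nor numeric,
# and whether the target is numeric.
check_pool_frame <- function(features, target) {
  ok <- vapply(features,
               function(col) is.factor(col) || is.numeric(col),
               logical(1))
  list(bad_columns = names(features)[!ok],
       target_ok = is.numeric(target))
}

toy <- data.frame(age = c(28, 35),
                  job = c("Sales", "Tech"),           # character: needs as.factor()
                  sex = factor(c("Female", "Male")),
                  stringsAsFactors = FALSE)
print(check_pool_frame(toy, target = c(1, -1)))

# After converting the character column, no columns are reported.
toy$job <- as.factor(toy$job)
print(check_pool_frame(toy, target = c(1, -1)))
```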

## Explore pool

In [123]:
# number of rows and columns
cat("Nrows: ", catboost.nrow(train_pool), ", Ncols: ", catboost.ncol(train_pool), "\n")
# first rows of pool
cat("First row: ")
catboost.head(train_pool, n = 1)

Nrows:  1000 , Ncols:  14 
First row: 

## Train model

See **`help(catboost.train)`** for the full list of arguments and their descriptions. Supported loss functions: RMSE, MAE, Logloss, CrossEntropy, Quantile, LogLinQuantile, Poisson, MAPE, MultiClass, AUC.

In [124]:
fit_params <- list(iterations = 100,
                   thread_count = 10,
                   loss_function = 'Logloss',
                   ignored_features = c(4,9),
                   border_count = 32,
                   depth = 5,
                   learning_rate = 0.03,
                   l2_leaf_reg = 3.5,
                   border = 0.5,
                   train_dir = 'train_dir')
model <- catboost.train(train_pool, test_pool, fit_params)

## Predict and evaluate

In [125]:
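# Fraction of rows where the thresholded probability matches the expected label.
# Assumes the target is encoded as -1/1, as in the adult dataset used here.
calc_accuracy <- function(prediction, expected) {
  labels <- ifelse(prediction > 0.5, 1, -1)
  accuracy <- sum(labels == expected) / length(labels)
  return(accuracy)
}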

prediction <- catboost.predict(model, test_pool, type = 'Probability')
cat("Sample predictions: ", sample(prediction, 5), "\n")

labels <- catboost.predict(model, test_pool, type = 'Class')
table(labels, test[,target])

# works properly only for Logloss
accuracy <- calc_accuracy(prediction, test[,target])
cat("Accuracy: ", accuracy, "\n")

# feature split importances (not finished)
cat(catboost.importance(model, train_pool), "\n")

Sample predictions:  0.2544215 0.08505329 0.8480813 0.1397898 0.1505384 


      
labels  -1   1
     0 436 125
     1  64 375

Accuracy:  0.811 
7.253927 0.5130889 0.5939859 16.50384 0 22.99408 11.65108 10.85036 1.893606 0 21.13951 0.6196936 4.18149 1.805337 
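
The accuracy printed above can be recomputed from the confusion table, which also gives precision and recall for the positive class. This is a base-R sketch that copies the counts from the output; the variable names are illustrative. Note that predicted classes are 0/1 while the true labels are -1/1, so predicted class 0 corresponds to label -1.

```r
# Confusion table copied from the output above:
# rows are predicted classes (0/1), columns are true labels (-1/1).
confusion <- matrix(c(436, 64, 125, 375), nrow = 2,
                    dimnames = list(predicted = c("0", "1"),
                                    actual = c("-1", "1")))

# Correct predictions lie on the diagonal (0 with -1, 1 with 1).
accuracy <- sum(diag(confusion)) / sum(confusion)

# Precision and recall for the positive class.
precision <- confusion["1", "1"] / sum(confusion["1", ])
recall <- confusion["1", "1"] / sum(confusion[, "1"])

cat("Accuracy: ", accuracy, "\n")   # 0.811, matching the value above
cat("Precision:", precision, "\n")
cat("Recall:   ", recall, "\n")
```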


## Catboosting with caret

Load and preprocess the Titanic dataset.

In [131]:
set.seed(12345)

data <- as.data.frame(as.matrix(titanic_train), stringsAsFactors=TRUE)

age_levels <- levels(data$Age)
most_frequent_age <- which.max(table(data$Age))
data$Age[is.na(data$Age)] <- age_levels[most_frequent_age]

drop_columns = c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
x <- data[,!(names(data) %in% drop_columns)]
y <- data[,c("Survived")]
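
The `Age` handling above replaces missing values with the most frequent level, since every column is a factor after the `as.matrix` round trip. The same idea on a toy factor, as a minimal base-R sketch with made-up values:

```r
# Toy factor with two missing values.
age <- factor(c("22", "38", NA, "22", "35", NA))

# table() ignores NA by default, so the counts cover observed levels only.
counts <- table(age)
mode_level <- names(counts)[which.max(counts)]

# Replace missing entries with the most frequent level.
age[is.na(age)] <- mode_level
print(age)
```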

We train with 5-fold cross-validation and search for the best tree depth among 4, 6 and 8.

In [132]:
fit_control <- trainControl(method = "cv",
                            number = 5,
                            classProbs = TRUE)

grid <- expand.grid(depth = c(4, 6, 8),
                    learning_rate = 0.1,
                    iterations = 100,
                    l2_leaf_reg = 1e-3,
                    rsm = 0.95,
                    border_count = 64)

report <- train(x, as.factor(make.names(y)),
                method = catboost.caret,
                verbose = FALSE, preProc = NULL,
                tuneGrid = grid, trControl = fit_control)
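
To see how much work this tuning setup implies, the grid can be inspected on its own. The sketch below uses only base R and the same grid values as above; `n_folds`, `n_candidates` and `n_fits` are illustrative names, and the count assumes caret's default behaviour of refitting the best candidate once on the full data.

```r
# Same tuning grid as above: only depth varies, so there are three candidates.
grid <- expand.grid(depth = c(4, 6, 8),
                    learning_rate = 0.1,
                    iterations = 100,
                    l2_leaf_reg = 1e-3,
                    rsm = 0.95,
                    border_count = 64)
print(grid)

n_folds <- 5
n_candidates <- nrow(grid)

# Each candidate is fit once per fold, plus one final refit on all data.
n_fits <- n_candidates * n_folds + 1
cat("Candidates:", n_candidates, " Model fits:", n_fits, "\n")
```

Adding more values to any other parameter multiplies the number of candidates, so the grid grows quickly.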

Print the resampling results and the variable importance.

In [133]:
print(report)

importance <- varImp(report, scale = FALSE)
print(importance)

Catboost 

891 samples
  7 predictor
  2 classes: 'X0', 'X1' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 714, 712, 713, 713, 712 
Resampling results across tuning parameters:

  depth  Accuracy   Kappa    
  4      0.8024142  0.5654448
  6      0.7980396  0.5627861
  8      0.8114472  0.5872851

Tuning parameter 'learning_rate' was held constant at a value of 0.1
Tuning parameter 'rsm' was held constant at a value of 0.95
Tuning parameter 'border_count' was held constant at a value of 64
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were depth = 8, learning_rate = 0.1,
 iterations = 100, l2_leaf_reg = 0.001, rsm = 0.95 and border_count = 64.
custom variable importance

         Overall
Sex       20.619
Age       17.349
Fare      17.124
Pclass    14.620
SibSp     11.043
Embarked  10.750
Parch      8.495
