Employee attrition information originally provided by IBM Watson Analytics Lab.    
* problem type: supervised binomial classification
* response variable: Attrition (i.e., “Yes”, “No”)
* features: 30
* observations: 1,470
* objective: use employee attributes to predict if they will attrit (leave the company)
* access: provided by the modeldata package (formerly shipped with rsample; Kuhn and Wickham 2019)
* more details: See ?modeldata::attrition


In [None]:
#install.packages("rsample")
#install.packages("h2o")

# Import libraries

In [4]:
# Helper packages
library(dplyr)     # for data manipulation
library(ggplot2)   # for awesome graphics

# Modeling process packages
library(rsample)   # for resampling procedures
library(caret)     # for resampling and model training
library(h2o)       # for resampling and model training

# h2o set-up 
h2o.no_progress()  # turn off h2o progress bars
h2o.init()         # launch h2o

 Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         45 seconds 963 milliseconds 
    H2O cluster timezone:       America/Montevideo 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.30.1.3 
    H2O cluster version age:    6 days  
    H2O cluster name:           H2O_started_from_R_creyesp_wwr066 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   2.56 GB 
    H2O cluster total cores:    8 
    H2O cluster allowed cores:  8 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
    R Version:                  R version 3.6.3 (2020-02-29) 



# Load Dataset

In [12]:
data("attrition", package = "modeldata")


In [13]:
# Job attrition data
attrition <- attrition %>% 
  mutate_if(is.ordered, .funs = factor, ordered = FALSE)
attrition <- as.h2o(attrition)

“data.table cannot be used without R package bit64 version 0.9.7 or higher.  Please upgrade to take advangage of data.table speedups.”


In [14]:
# initial dimension
dim(attrition)

In [15]:
# first rows of the data (Attrition is the response variable)
head(attrition)

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | ⋯ | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | ⋯ | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | College | Life_Sciences | Medium | Female | ⋯ | Excellent | Low | 0 | 8 | 0 | Bad | 6 | 4 | 0 | 5 |
| 2 | 49 | No | Travel_Frequently | 279 | Research_Development | 8 | Below_College | Life_Sciences | High | Male | ⋯ | Outstanding | Very_High | 1 | 10 | 3 | Better | 10 | 7 | 1 | 7 |
| 3 | 37 | Yes | Travel_Rarely | 1373 | Research_Development | 2 | College | Other | Very_High | Male | ⋯ | Excellent | Medium | 0 | 7 | 3 | Better | 0 | 0 | 0 | 0 |
| 4 | 33 | No | Travel_Frequently | 1392 | Research_Development | 3 | Master | Life_Sciences | Very_High | Female | ⋯ | Excellent | High | 0 | 8 | 3 | Better | 8 | 7 | 3 | 0 |
| 5 | 27 | No | Travel_Rarely | 591 | Research_Development | 2 | Below_College | Medical | Low | Male | ⋯ | Excellent | Very_High | 1 | 6 | 3 | Better | 2 | 2 | 2 | 2 |
| 6 | 32 | No | Travel_Frequently | 1005 | Research_Development | 2 | College | Life_Sciences | Very_High | Male | ⋯ | Excellent | High | 0 | 8 | 2 | Good | 7 | 7 | 3 | 6 |


## Splitting dataset

In [16]:
# `attrition` is an H2OFrame at this point: keep a data.frame copy for the
# base R, caret and rsample splits, and use the H2OFrame itself for h2o
churn     <- as.data.frame(attrition)
churn.h2o <- attrition

# Using base R
set.seed(123)  # for reproducibility
index_1 <- sample(1:nrow(churn), round(nrow(churn) * 0.7))
train_1 <- churn[index_1, ]
test_1  <- churn[-index_1, ]

# Using caret package
set.seed(123)  # for reproducibility
index_2 <- createDataPartition(churn$Attrition, p = 0.7, list = FALSE)
train_2 <- churn[index_2, ]
test_2  <- churn[-index_2, ]

# Using rsample package
set.seed(123)  # for reproducibility
split_1  <- initial_split(churn, prop = 0.7)
train_3  <- training(split_1)
test_3   <- testing(split_1)

# Using h2o package
split_2 <- h2o.splitFrame(churn.h2o, ratios = 0.7, seed = 123)
train_4 <- split_2[[1]]
test_4  <- split_2[[2]]
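When the response is imbalanced, it can be worth checking that a split keeps roughly the same class proportions as the full data. The sketch below uses only base R on a simulated response; the vector `y_sim` and its class rate are illustrative assumptions, not the attrition data. It contrasts a simple random 70% split with a stratified one.

```r
set.seed(123)

# Simulated binary response; the 16% "Yes" rate is an illustrative assumption
y_sim <- factor(
  sample(c("No", "Yes"), size = 1000, replace = TRUE, prob = c(0.84, 0.16)),
  levels = c("No", "Yes")
)

# Simple random split: 70% of all row indices
idx_random <- sample(seq_along(y_sim), size = round(0.7 * length(y_sim)))

# Stratified split: sample 70% of the rows within each class, then combine
idx_strat <- unlist(
  lapply(split(seq_along(y_sim), y_sim), function(rows) {
    rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
  }),
  use.names = FALSE
)

# Class proportions in the full data and in each training set
props <- rbind(
  full       = prop.table(table(y_sim)),
  random     = prop.table(table(y_sim[idx_random])),
  stratified = prop.table(table(y_sim[idx_strat]))
)
round(props, 3)
```

Among the packages used above, `createDataPartition()` already samples within the levels of the outcome, and `initial_split()` accepts a `strata` argument for the same purpose.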


# EDA

In [None]:
# ggplot2 needs a data.frame, so convert the H2OFrame first
ggplot(as.data.frame(attrition), aes(x = Attrition)) + geom_bar()

# Modeling
R has a wide range of packages that implement different types of models, and often several implementations of the same model with different approaches. On top of these, there are packages that unify them to speed up the modeling process and standardize the workflow.

There are different ways to pass the data when training a model:
* The classic R approach, using a formula as a symbolic expression of the dependency relationship
    * `fn(Y ~ X)`
* passing $Y$ and $X$ separately as independent arguments
    * `fn(x = ames[, features], y = ames$Sale_Price)`
* passing $Y$ and $X$ separately, but only as the names of variables in a data frame
    * `fn(x = c("Year_Sold", "Longitude", "Latitude"), y = "Sale_Price", data = ames.h2o)`
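As a minimal sketch of the first two interfaces, the example below fits the same linear model with base R's `lm()` (formula) and `lm.fit()` (separate `x` and `y`). It uses the built-in `mtcars` data rather than the data in this notebook, and the chosen predictors are arbitrary. The third style, with variable names only, is not shown here because it needs a running h2o cluster.

```r
# Formula interface: the model is described symbolically as Y ~ X
fit_formula <- lm(mpg ~ wt + hp, data = mtcars)

# Separate x / y interface: pass the design matrix and the response directly.
# lm.fit() does not add an intercept, so a column of 1s is bound on explicitly.
x <- cbind(Intercept = 1, as.matrix(mtcars[, c("wt", "hp")]))
y <- mtcars$mpg
fit_xy <- lm.fit(x = x, y = y)

# Side-by-side coefficients from both interfaces
cbind(
  formula = unname(coef(fit_formula)),
  xy      = unname(fit_xy$coefficients)
)
```

The formula interface handles the intercept and factor encoding automatically, while the `x`/`y` interface leaves that preparation to the caller.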



## Engines
As mentioned earlier, several packages implement the same model in different flavors; these implementations are known as "engines".
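A small illustration of the idea, assuming the built-in `mtcars` data and an arbitrary pair of predictors: base R alone already offers two engines for an ordinary linear model, `lm()` and `glm()` with a Gaussian family. Their coefficient estimates should agree up to numerical tolerance.

```r
# Engine 1: ordinary least squares via lm()
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# Engine 2: the same linear model fitted by glm() with a Gaussian family
fit_glm <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())

# Compare the estimated coefficients from the two engines
same_coefs <- all.equal(coef(fit_lm), coef(fit_glm))
same_coefs
```

The engines differ in what they return around the fit (for example, `glm()` reports deviance and AIC), which is one reason wrapper packages standardize the interface.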

In [None]:
lm_glm <- glm(Attrition ~ ., data = train_2, family = binomial(link = "logit"))
summary(lm_glm)

In [None]:
# same logistic regression, fitted through caret's "glm" engine
lm_caret  <- train(Attrition ~ ., data = train_2, method = "glm", family = "binomial")

In [None]:
lm_caret

In [None]:
# predicted classes on the test set and the resulting confusion table
pred <- predict(lm_caret, newdata = test_2)
table(pred, test_2$Attrition)
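A confusion table like the one from `table(pred, test_2$Attrition)` can be summarized with a few standard metrics. The sketch below computes them by hand in base R on a small hypothetical set of predictions; the `observed` and `predicted` vectors are made up for illustration and are not results from the model above.

```r
# Hypothetical observed and predicted classes (illustrative only)
observed  <- factor(c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No"),
                    levels = c("No", "Yes"))
predicted <- factor(c("No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes", "No"),
                    levels = c("No", "Yes"))

# Rows are predictions, columns are observed classes
cm <- table(pred = predicted, obs = observed)

accuracy    <- sum(diag(cm)) / sum(cm)              # share of correct predictions
sensitivity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # true positives / all observed "Yes"
specificity <- cm["No", "No"] / sum(cm[, "No"])     # true negatives / all observed "No"

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

With an imbalanced response, accuracy alone can look good even when few "Yes" cases are detected, so sensitivity and specificity are worth reading together. caret's `confusionMatrix()` reports these and related statistics directly.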


In [None]:
# predicted class probabilities (one column per class: No, Yes)
pred <- predict(lm_caret, newdata = test_2, type = "prob")

In [None]:
head(pred)

In [None]:
ggplot(pred, aes(Yes) ) + geom_histogram()

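In [None]:
#install.packages("gbm")


In [None]:
library(gbm)       # provides calibrate.plot()


In [None]:
# calibration plot: observed outcome (1 = "Yes", 0 = "No") vs. predicted probability
y <- as.numeric(test_2$Attrition) - 1
calibrate.plot(y, pred$Yes)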
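The idea behind a calibration check can be illustrated in base R by binning predicted probabilities and comparing the mean prediction in each bin with the observed event rate. This is a simplified sketch on simulated values (`p_hat` and `y_obs` are assumptions for illustration) and is not meant to reproduce how `calibrate.plot()` computes its curve.

```r
set.seed(123)

# Simulated predicted probabilities, with outcomes drawn from them so that
# the predictions are well calibrated by construction (illustrative assumption)
p_hat <- runif(2000)
y_obs <- rbinom(2000, size = 1, prob = p_hat)

# Group the predictions into 10 equal-width bins
bins <- cut(p_hat, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

# Per bin: average predicted probability, observed event rate, and bin size
calib <- data.frame(
  mean_predicted = as.vector(tapply(p_hat, bins, mean)),
  observed_rate  = as.vector(tapply(y_obs, bins, mean)),
  n              = as.vector(table(bins)),
  row.names      = levels(bins)
)
calib
```

For a well-calibrated model the two columns stay close to each other in every bin; systematic gaps point to over- or under-confident probabilities.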