-
Notifications
You must be signed in to change notification settings - Fork 4
/
DriveML.Rmd
354 lines (268 loc) · 12.3 KB
/
DriveML.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
---
title: "DriveML: Self-Drive machine learning projects"
subtitle: "Automated machine learning classification model functions"
author: "Dayananda Ubrangala, Sayan Putatunda, Kiran R, Ravi Prasad Kondapalli"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
fig_caption: yes
toc: true
css: custom.css
vignette: >
%\VignetteIndexEntry{Vignette Title Subtitle}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\SweaveUTF8
%\VignetteIndexEntry{Information Value: Usage Examples}
---
```{r setup, include=FALSE}
library(rmarkdown)
library(SmartEDA)
library(DriveML)
library(mlr)
library(knitr)
library(ggplot2)
library(tidyr)
```
## 1. Introduction
The document introduces the **DriveML** package and how it can help you to build effortless machine learning binary classification models in a short period.
**DriveML** is a series of functions such as `AutoDataPrep`, `AutoMAR`, `autoMLmodel`. **DriveML** automates some of the complicated machine learning functions such as exploratory data analysis, data pre-processing, feature engineering, model training, model validation, model tuning and model selection.
This package automates the following steps on any input dataset for machine learning classification problems
1. Data cleaning
+ Replacing NA, infinite values
+ Removing duplicates
+ Cleaning feature names
2. Feature engineering
+ Missing at random features
+ Missing variable imputation
+ Outlier treatment - Outlier flag and imputation with 5th or 95th percentile value
+ Date variable transformation
+ Bulk interactions for numerical features
+ Frequent transformer for categorical features
+ Categorical feature engineering - one hot encoding
+ Feature selection using zero variance, correlation and AUC method
3. Binary classification - Model training and validation
+ Automated test and validation set creations
+ Hyperparameter tuning using random search
+ Multiple binary classification algorithms like logistic regression, randomForest, xgboost, glmnet, rpart
+ Model validation using AUC value
+ Model plots like training and testing ROC curve, threshold plot
+ Probability scores and model objects
4. Model Explanation
+ Lift plot
+ Partial dependence plot
+ Feature importance plot
5. Model report
+ model output in rmarkdown html format
Additionally, we are providing a function SmartEDA for Exploratory data analysis that generates automated EDA report in HTML format to understand the distributions of the data. Please note there are some dependencies on some other R packages such as MLR, caret, data.table, ggplot2, etc. for some specific task.
To summarize, DriveML package helps in getting the complete Machine learning classification model just by running the function instead of writing lengthy r code.
#### Missing not at random features
Algorithm: Missing at random features
1. Select all the missing features X_i where i=1,2,…..,N.
2. For i=1 to N:
+ Define Y_i, which will have value of 1 if X_i has a missing value, 0 if X_i is not having missing value
+ Impute all X_(i+1 ) 〖to X〗_(N ) variables using imputation method
+ Fit binary classifier f_m to the training data using Y_i ~ X_(i+1 ) 〖+⋯+ X〗_(N )
+ Calculate AUC ∝_ivalue between actual Y_i and predicted Y ̂_i
+ If ∝_i is low then the missing values in X_i are missing at random,Y_i to be dropped
+ Repeat steps 1 to 4 for all the independent variables in the original dataset
## 2. Functionalities of DriveML
The DriveML R package has three unique functions
1. Data Pre-processing and Data Preparation
+ `autoDataPrep` function to generate a novel features based on the functional understanding of the dataset
2. Building Machine Learning Models
+ `autoMLmodel` function to develop baseline machine learning models using regression and tree based classification techniques
3. Generating Model Report
+ `autoMLReport` function to print the machine learning model outcome in HTML format
## 3. Machine learning classfication model use-case using DriveML
#### About the dataset
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4.
Data Source `https://archive.ics.uci.edu/ml/datasets/Heart+Disease`
Install the package "DriveML" to get the example data set.
```{r eda-c3-r, warning=FALSE,eval=F}
library("DriveML")
library("SmartEDA")
## Load sample dataset from ISLR pacakge
data(heart)
```
more detailed attribute information is there in `DriveML` help page
### 3.1 Data Exploration
For data exploratory analysis used `SmartEDA` package
#### Overview of the data
Understanding the dimensions of the dataset, variable names, overall missing summary and data types of each variables
```{r od_1,warning=FALSE,eval=F,include=T}
# Overview of the data - Type = 1
ExpData(data=heart,type=1)
# Structure of the data - Type = 2
ExpData(data=heart,type=2)
```
```{r od_2,warning=FALSE,eval=T,include=F}
ovw_tabl <- ExpData(data=heart,type=1)
ovw_tab2 <- ExpData(data=heart,type=2)
```
* Overview of the data
```{r od_3,warning=FALSE,eval=T,render=ovw_tabl,echo=F}
kable(ovw_tabl, "html")
```
* Structure of the data
```{r od_31,warning=FALSE,eval=T,render=ovw_tab2,echo=F}
kable(ovw_tab2, "html")
```
#### Summary of numerical variables
```{r snc1,warning=FALSE,eval=T,include=F}
snc = ExpNumStat(heart,by="GA",gp="target_var",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)
rownames(snc)<-NULL
```
```{r snc2, warning=FALSE,eval=F,include=T}
ExpNumStat(heart,by="GA",gp="target_var",Qnt=seq(0,1,0.1),MesofShape=2,Outlier=TRUE,round=2)
```
```{r snc3,warning=FALSE,eval=T,render=snc,echo=F}
paged_table(snc)
```
#### Distributions of Numerical variables
Box plots for all numerical variables vs categorical dependent variable - Bivariate comparison only with classes
Boxplot for all the numerical attributes by each class of the **target variable**
```{r bp3.1,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=7,fig.width=7}
plot4 <- ExpNumViz(heart,target="target_var",type=1,nlim=3,fname=NULL,Page=c(2,2),sample=8)
plot4[[1]]
```
#### Summary of categorical variables
```{r ed3.3, eval=T,include=F}
et100 <- ExpCTable(heart,Target="target_var",margin=1,clim=10,nlim=3,round=2,bin=NULL,per=F)
rownames(et100)<-NULL
```
**Cross tabulation with target_var variable**
Custom tables between all categorical independent variables and the target variable
```{r ed3.4, warning=FALSE,eval=F,include=T}
ExpCTable(Carseats,Target="Urban",margin=1,clim=10,nlim=3,round=2,bin=NULL,per=F)
```
```{r ed3.5,warning=FALSE,eval=T,render=et100,echo=F,out.height=8,out.width=8}
kable(et100,"html")
```
#### Distributions of categorical variables
Stacked bar plot with vertical or horizontal bars for all categorical variables
```{r ed3.10,warning=FALSE,eval=T,include=T,fig.align='center',fig.height=7,fig.width=7}
plot5 <- ExpCatViz(heart,target = "target_var", fname = NULL, clim=5,col=c("slateblue4","slateblue1"),margin=2,Page = c(2,1),sample=2)
plot5[[1]]
```
#### Outlier analysis using boxplot
```{r ktana, eval=T,include=F}
ana1 <- ExpOutliers(heart, varlist = c("oldpeak","trestbps","chol"), method = "boxplot", treatment = "mean", capping = c(0.1, 0.9))
outlier_summ <- ana1[[1]]
```
```{r out1, warning=FALSE,eval=F,include=T}
ExpOutliers(heart, varlist = c("oldpeak","trestbps","chol"), method = "boxplot", treatment = "mean", capping = c(0.1, 0.9))
```
```{r out11,warning=FALSE,eval=T,render=outlier_summ,echo=F,out.height=8,out.width=8}
kable(outlier_summ,"html")
```
### 3.2 Data preparations using `autoDataprep`
**Data preparation using DriveML autoDataprep function with default options**
```{r , warning=FALSE,eval=T,include=T}
dateprep <- autoDataprep(data = heart,
target = 'target_var',
missimpute = 'default',
auto_mar = FALSE,
mar_object = NULL,
dummyvar = TRUE,
char_var_limit = 15,
aucv = 0.002,
corr = 0.98,
outlier_flag = TRUE,
uid = NULL,
onlykeep = NULL,
drop = NULL)
train_data <- dateprep$master_data
```
**We can use different types of missing imputation using mlr::impute function**
```{r , warning=FALSE,eval=T,include=T}
myimpute <- list(classes=list(factor = imputeMode(),
integer = imputeMean(),
numeric = imputeMedian(),
character = imputeMode()))
dateprep <- autoDataprep(data = heart,
target = 'target_var',
missimpute = myimpute,
auto_mar = FALSE,
mar_object = NULL,
dummyvar = TRUE,
char_var_limit = 15,
aucv = 0.002,
corr = 0.98,
outlier_flag = TRUE,
uid = NULL,
onlykeep = NULL,
drop = NULL)
train_data <- dateprep$master_data
```
**Adding Missing at Random features using autoMAR function**
```{r , warning=FALSE,eval=T,include=T}
marobj <- autoMAR (heart, aucv = 0.9, strataname = NULL, stratasize = NULL, mar_method = "glm")
dateprep <- autoDataprep(data = heart,
target = 'target_var',
missimpute = myimpute,
auto_mar = TRUE,
mar_object = marobj,
dummyvar = TRUE,
char_var_limit = 15,
aucv = 0.002,
corr = 0.98,
outlier_flag = TRUE,
uid = NULL,
onlykeep = NULL,
drop = NULL)
train_data <- dateprep$master_data
```
### 3.3 Machine learning models using `autoMLmodel`
Automated training, tuning and validation of machine learning models. This function includes the following binary classification techniques
+ Logistic regression - logreg
+ Regularised regression - glmnet
+ Extreme gradient boosting - xgboost
+ Random forest - randomForest
+ Random forest - ranger
+ Decision tree - rpart
```{r , warning=FALSE,eval=F,include=T}
mymodel <- autoMLmodel( train = heart,
test = NULL,
target = 'target_var',
testSplit = 0.2,
tuneIters = 100,
tuneType = "random",
models = "all",
varImp = 10,
liftGroup = 50,
maxObs = 4000,
uid = NULL,
htmlreport = FALSE,
seed = 1991)
```
```{r , warning=FALSE,eval=T,include=F}
mymodel <- heart.model
```
### 3.3 Model output
Model performance
```{r out00,warning=FALSE,eval=T,render=mymodel, echo=F,out.height=8,out.width=8}
performance <- mymodel$results
kable(performance, "html")
```
Randomforest model Receiver Operating Characteristic (ROC) and the variable Importance
Training dataset ROC
```{r o1,warning=FALSE,render=mymodel,eval=T,include=T,fig.align='center',fig.height=4,fig.width=7}
TrainROC <- mymodel$trainedModels$randomForest$modelPlots$TrainROC
TrainROC
```
Test dataset ROC
```{r o10,warning=FALSE,render=mymodel,eval=T,include=T,fig.align='center',fig.height=4,fig.width=7}
TestROC <- mymodel$trainedModels$randomForest$modelPlots$TestROC
TestROC
```
Variable importance
```{r o11,warning=FALSE,render=mymodel,eval=T,include=T,fig.align='center',fig.height=4,fig.width=7}
VarImp <- mymodel$trainedModels$randomForest$modelPlots$VarImp
VarImp
```
Threshold
```{r o12,warning=FALSE,render=mymodel,eval=T,include=T,fig.align='center',fig.height=4,fig.width=7}
Threshold <- mymodel$trainedModels$randomForest$modelPlots$Threshold
Threshold
```