
Introduction

Welcome to the autoML wiki; we hope you enjoy your time here!

Whilst you are here, consider checking out my other library, autoEDA: https://github.com/XanderHorn/autoEDA

Note that autoML is a work in progress.

Support my work

Please take a moment to consider contributing to the future development and support of autoML, autoEDA and any future work by checking out my PayPal and Patreon accounts.

Patreon: https://patreon.com/XanderHorn

PayPal: https://www.paypal.me/XanderHorn

Installation

# install.packages("devtools") # if devtools is not already installed
library(devtools)
install_github("XanderHorn/autoML")

Some users might experience issues when installing via GitHub; please see this link for a potential fix: https://github.com/r-lib/devtools/issues/1900

Note on parallelization

autoML uses a parallel back end by default. This unfortunately means that autoLearn will never produce exactly the same models when you run autoLearn or autoML (a wrapper of autoLearn) multiple times on the same dataset. This does not affect predictions.

Library overview

Whilst the library consists of a number of functions, there are a few which are intended to be used the most:

autoML - Automated machine learning

This function is a wrapper of autoPreProcess, autoLearn and autoInterpret, with some of the more advanced settings hidden and automatically set. The function should be used to train models for binary classification, multi-class classification, regression and unsupervised problems.

The function returns the final training set, all trained models, and a results table comparing them.

Because autoML utilises the brilliant mlr library, all models trained are mlr model objects, meaning that all of the functionality in mlr can be applied to the models produced by autoML.
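For example, a minimal sketch of a classification run (the train and target argument names, and the results element, are assumptions for illustration; see ?autoML for the exact interface):

library(autoML)

# Argument names below are assumed for illustration; see ?autoML
res <- autoML(train = iris, target = "Species")

# The returned object should hold the final training set, the trained mlr
# models and the results table; "results" is an assumed element name
res$results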

autoPreProcess - Automated feature engineering and data cleaning

This function will clean your dataset and perform various methods of feature engineering on the data. Various settings can be configured to control how much cleaning and feature engineering takes place. The function also produces a preProcess function, which can be used to re-create all the steps the function took during execution.

The code produced is not intended to be used directly, but rather to be fed to either autoLearn or autoML to create production-code functions for each of the models trained.
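A minimal sketch (the argument names are again assumptions; see ?autoPreProcess for the exact interface):

library(autoML)

# Clean the dataset and engineer features; argument names are assumed
prep <- autoPreProcess(train = iris, target = "Species")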

autoLearn - Automated model training and validation

autoLearn automates model training, model validation, parameter tuning, target creation and target experimentation. Various settings can be configured, ranging from tuning methods and training modes to the models that should be trained on the data. When the output from autoPreProcess is provided to autoLearn, each model will return a production function unique to that model.

Because autoLearn utilises the brilliant mlr library, all models trained are mlr model objects, meaning that all of the functionality in mlr can be applied to the models produced by autoLearn.
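A minimal sketch of feeding autoPreProcess output into autoLearn (the argument names and the element names on the autoPreProcess result are assumptions; see ?autoLearn for the real interface):

library(autoML)

# Pre-process first so that autoLearn can return a production function per model
prep <- autoPreProcess(train = iris, target = "Species")

# Argument and element names are assumed for illustration
models <- autoLearn(train = prep$data, target = "Species", code = prep$code)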

autoInterpret - Automated model interpretability methods

autoInterpret will automatically apply model interpretability methods to any mlr-trained model and its training set. Since all models trained by the autoML library are mlr model objects, autoInterpret can be used with any of them.
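Since any mlr-trained model qualifies, here is a sketch using a plain mlr model (the model and train argument names of autoInterpret are assumptions; see ?autoInterpret):

library(mlr)
library(autoML)

# Train a simple mlr model to interpret
task <- makeClassifTask(data = iris, target = "Species")
mod <- train(makeLearner("classif.rpart"), task)

# autoInterpret argument names are assumed for illustration
interp <- autoInterpret(model = mod, train = iris)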

saveCode - Code generation

Allows the user to save the code generated by autoPreProcess, autoLearn and autoML to an R script. The code generated by autoLearn and autoML can then be embedded into an API for rapid model productionisation.
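A sketch, assuming saveCode takes the result object plus an output path (both are assumptions; see ?saveCode for the exact interface):

library(autoML)

res <- autoML(train = iris, target = "Species")

# Write the generated production code to an R script; the exact
# interface is assumed here, see ?saveCode
saveCode(res, path = "autoML_production_code.R")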

In a nutshell

This library aims to automate various aspects of the traditional machine learning cycle. It automatically performs the following actions on any dataset (an end-to-end sketch follows the list):

  • Data cleaning
    1. Encoding of incorrect missing values, e.g. Unknown => NA
    2. Removal of duplicate observations
    3. Removal of constant features, duplicate features and features containing only missing values
    4. Correction of feature/variable names
    5. Formatting of features/variables, e.g. character => numeric
  • Feature engineering
    1. Imputation and outlier clipping
    2. Correction of sparse categories in categorical features
    3. Categorical feature engineering, e.g. one-hot encoding, proportional encoding
    4. Flagging/tracking features, i.e. keeping track of where missing data was observed
    5. Date and text feature engineering
    6. Numerical feature transformations, e.g. square-root transformation
    7. Numerical feature interactions, e.g. x1 / x2
    8. Unsupervised feature creation using k-means clustering
    9. Feature scaling
  • Model training
    1. Automated target generation, e.g. for regression and unsupervised learning problems
    2. Automated test and validation set creation
    3. Resampling of tuned models, e.g. k-fold cross-validation
    4. Tuning of models, e.g. random search
    5. Different training targets, e.g. balanced vs original vs reduced features
    6. Optimisation of various performance metrics, e.g. auc, brier
    7. Training plots, e.g. learning curve, threshold and calibration plots
    8. Parallel processing / multicore processing
    9. Various models included, e.g. xgboost, lasso, knn
    10. Probability cutoffs included for classification models
  • Model interpretation
    1. Partial dependence plots
    2. Feature importance plots
    3. Local model interpretation plots
    4. Model feature/variable interaction plots
  • Code generation
    1. Code is generated while the functions execute
    2. Code is adapted to each model that is trained and ready for production
    3. Code is easily interpreted, lessening the black-box feeling
  • Lower-level functionality
    1. Most functions utilised in the main functions are also available individually for more flexibility
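Putting it all together, an end-to-end run under the same assumptions as the sketches above (argument and element names may differ from the actual interface):

library(autoML)

# Clean and feature-engineer, then train, validate and tune models
prep <- autoPreProcess(train = iris, target = "Species")
models <- autoLearn(train = prep$data, target = "Species", code = prep$code)

# Export the generated production code; interface assumed, see ?saveCode
saveCode(models, path = "autoML_production_code.R")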