Applying genetic programming to AutoML

automl-genetic

AutoML is a broad term that can be understood and implemented in many different ways, but all approaches share one goal: helping us find the best approaches (models) for solving the problem at hand.

Here we employ evolutionary algorithms and concepts to search the space of classifiers. In particular, we are interested in the automatic construction of ensembles of classifiers, because ensembles have proved to be very effective.

This project has its roots in the following paper and is essentially an attempt to implement, experiment with, and extend those ideas, and to provide a convenient framework with a simple API for researchers and businesses.

Contributions of any kind are very welcome! Please contact me for coordination of our efforts.

Supported by the FIT faculty of the Czech Technical University in Prague and by ShowmaxLab, sponsored by Showmax.


Getting started

To run your own evolution process, start by looking at the example test com.automl.AutoMLSuite.

In a nutshell, you need to load and prepare your data with Spark and then pass it as a DataFrame to the core class com.automl.AutoML.
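The preparation step can be sketched roughly as follows. This is only an illustration that assumes Spark is on the classpath: the object name PrepareDataSketch, its helper functions, and the output column names "features" and "label" are assumptions made for this example, not part of the project's API. It also assumes every non-label column is numeric.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helpers for illustration only; not part of automl-genetic.
object PrepareDataSketch {

  // Read a CSV file with a header row and let Spark infer the column types.
  def loadCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)

  // Pack every column except `labelCol` into one vector column.
  // Assumption: the output columns are named "features" and "label".
  def toFeaturesAndLabel(raw: DataFrame, labelCol: String): DataFrame = {
    val featureCols = raw.columns.filter(_ != labelCol)
    new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(raw)
      .withColumnRenamed(labelCol, "label")
      .select("features", "label")
  }

  // Split the prepared data into a training part and a held-out part.
  def split(prepared: DataFrame, trainFraction: Double, seed: Long): (DataFrame, DataFrame) = {
    val Array(train, test) =
      prepared.randomSplit(Array(trainFraction, 1.0 - trainFraction), seed)
    (train, test)
  }
}
```

The first element returned by split would play the role of trainingSplit in the AutoML constructor call.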

Let's take a look at an example:

val seed: Seq[LeafTemplate[SimpleModelMember]] = Seq(
  // ... the four seed classifier templates go here ...
)

val seedPopulation = new Population(seed)

val autoMl = new AutoML(
  data = trainingSplit,
  maxTime = 200000,                  // milliseconds, i.e. 200 seconds
  useMetaDB = false,
  initialPopulationSize = Some(7),
  seedPopulation = seedPopulation,
  maxGenerations = 5)

Here we pass trainingSplit into AutoML, disable metaDB, and set the maximum search time to 200 seconds (200000 milliseconds). Each evolution will run for 5 generations. The seed consists of 4 classifiers, but the initial population size is 7, which means that there will be duplicates in our initial population.
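Why duplicates are unavoidable can be seen in a small sketch. It assumes that individuals are sampled uniformly with replacement from the seed; the project's actual drawing strategy may differ, and the names InitialPopulationSketch and draw are hypothetical. Whatever the strategy, filling 7 slots from 4 distinct classifiers must repeat at least one of them.

```scala
import scala.util.Random

// Hypothetical illustration; not the project's Population implementation.
object InitialPopulationSketch {

  // Assumption: sample `size` individuals uniformly with replacement.
  def draw[A](seed: Seq[A], size: Int, rng: Random): Seq[A] = {
    require(seed.nonEmpty, "seed population must not be empty")
    val indexed = seed.toIndexedSeq
    Seq.fill(size)(indexed(rng.nextInt(indexed.length)))
  }

  def main(args: Array[String]): Unit = {
    // Placeholder names standing in for four seed classifiers.
    val seed = Seq("A", "B", "C", "D")
    val population = draw(seed, 7, new Random(42L))
    println(population.mkString(", "))
    // At most 4 distinct individuals can appear among the 7 drawn.
    println(s"distinct: ${population.distinct.size} of ${population.size}")
  }
}
```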

Description of AutoML class input parameters:

| Parameter | Description |
| --- | --- |
| data | DataFrame with features and label columns. |
| maxTime | Time in milliseconds during which the algorithm searches for the optimal ensemble. |
| useMetaDB | Whether to use the metaDB of previously found templates, based on similarity of datasets. |
| maxGenerations | Maximum number of generations (cycles of selection and mutation) within one evolution. After this number of generations the algorithm increases the portion of data and essentially runs a new evolution. |
| seedPopulation | If we don't use metaDB (useMetaDB is set to false), we need to solve the cold-start problem. The seed population is a set of classifiers used to construct the initial population. |
| initialPopulationSize | Size of the initial population. Based on this value, individuals are drawn from seedPopulation until the needed size is reached. |
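
How maxTime and maxGenerations might interact can be sketched as a simple control loop. This is not the project's implementation: the names EvolutionLoopSketch, run, and Progress are hypothetical, doubling the data portion between evolutions is an assumption (the growth schedule is not specified here), and so is continuing on the full data until the time budget is spent.

```scala
// Hypothetical control loop for illustration; not automl-genetic's code.
object EvolutionLoopSketch {

  final case class Progress(evolutions: Int, generations: Int, lastPortion: Double)

  // `step` stands in for one generation (selection and mutation) evaluated
  // on the given fraction of the data. `now` is injectable so the loop can
  // be exercised with a fake clock.
  def run(maxTimeMs: Long,
          maxGenerations: Int,
          initialPortion: Double,
          now: () => Long = () => System.currentTimeMillis())
         (step: Double => Unit): Progress = {
    val deadline = now() + maxTimeMs
    var portion = initialPortion
    var lastPortion = portion
    var evolutions = 0
    var generations = 0
    while (now() < deadline) {
      var g = 0
      // One evolution: at most maxGenerations generations, cut short on timeout.
      while (g < maxGenerations && now() < deadline) {
        step(portion)
        g += 1
        generations += 1
      }
      evolutions += 1
      lastPortion = portion
      // Assumption: the data portion doubles, capped at the full dataset.
      portion = math.min(1.0, portion * 2)
    }
    Progress(evolutions, generations, lastPortion)
  }
}
```

Injecting the clock keeps the sketch deterministic under test; with the default clock it simply runs until maxTimeMs of wall-clock time has elapsed.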