CivisML uses the Civis Platform to train machine learning models and parallelize their predictions over large datasets. It contains best-practice models for general-purpose classification and regression modeling as well as model quality evaluations and visualizations. All CivisML models use the scikit-learn API for interoperability with other platforms and to allow you to leverage resources in the open-source software community when creating machine learning models.
You do not need any external libraries installed to use CivisML, but the following pip-installable dependencies enhance the capabilities of the civis.ml.ModelPipeline:
- pandas
- scikit-learn
- glmnet
- feather-format
- civisml-extensions
- muffnn
Install pandas if you wish to download tables of predictions. You can also model on pandas.DataFrame objects in your interpreter.
If you wish to use the civis.ml.ModelPipeline code to model on pandas.DataFrame objects in your local environment, the feather-format package (requires pandas >= 0.20) will improve data transfer speeds and guarantee that your data types are correctly detected by CivisML. You must install feather-format if you wish to use pd.Categorical columns in your DataFrame objects, since that type information is lost when writing data as a CSV.
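To illustrate why categorical type information does not survive a CSV transfer, the following minimal sketch round-trips a small DataFrame through an in-memory CSV using only pandas. The column names and values are made up for illustration, and the sketch does not call CivisML itself.

```python
import io

import pandas as pd

# Hypothetical data: one categorical column and one integer column.
df = pd.DataFrame({
    "color": pd.Categorical(["red", "blue", "red"]),
    "visits": [1, 2, 3],
})

# Write to an in-memory CSV and read it back, as a CSV-based transfer would.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_tripped = pd.read_csv(buffer)

# The values survive, but the categorical dtype does not.
original_is_categorical = isinstance(df["color"].dtype, pd.CategoricalDtype)
round_trip_is_categorical = isinstance(
    round_tripped["color"].dtype, pd.CategoricalDtype
)

print("categorical before CSV:", original_is_categorical)
print("categorical after CSV: ", round_trip_is_categorical)
```

A plain CSV stores only text, so the reader has to guess each column's type again, which is the gap that feather-format is described as closing.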
If you wish to use custom models or download trained models, you'll need scikit-learn installed.
Several pre-defined models rely on public Civis Analytics libraries. The "sparse_logistic", "sparse_linear_regressor", "sparse_ridge_regressor", "stacking_classifier", and "stacking_regressor" models all use the glmnet library. Pre-defined MLP models ("multilayer_perceptron_classifier" and "multilayer_perceptron_regressor") depend on the muffnn library. Finally, models which use the default CivisML ETL, along with models which use stacking or hyperband, depend on civisml-extensions. Install these packages if you wish to download the pre-defined models that depend on them.
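As a quick way to see which of these optional packages are available locally, the sketch below checks whether each one can be imported. The mapping from pip package name to import name is an assumption (in particular "feather" for feather-format), and the helper installed_optional_packages is a hypothetical name, not part of the civis API.

```python
from importlib.util import find_spec

# Assumed mapping from pip package name to top-level import name.
OPTIONAL_PACKAGES = {
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "glmnet": "glmnet",
    "feather-format": "feather",
    "civisml-extensions": "civismlext",
    "muffnn": "muffnn",
}


def installed_optional_packages(packages=OPTIONAL_PACKAGES):
    """Return {pip name: bool} saying whether each package is importable."""
    return {
        pip_name: find_spec(module_name) is not None
        for pip_name, module_name in packages.items()
    }


for name, present in installed_optional_packages().items():
    print(f"{name:20s} {'installed' if present else 'missing'}")
```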
Start the modeling process by defining your model. Do this by creating an instance of the civis.ml.ModelPipeline class. Each civis.ml.ModelPipeline corresponds to a scikit-learn sklearn.pipeline.Pipeline which will run in Civis Platform. A sklearn.pipeline.Pipeline allows you to combine multiple modeling steps (such as missing value imputation and feature selection) into a single model. The sklearn.pipeline.Pipeline is treated as a unit: for example, cross-validation happens over all steps together.
You can define your model in two ways: by selecting a pre-defined algorithm or by providing your own scikit-learn sklearn.pipeline.Pipeline or sklearn.base.BaseEstimator object. Note that whichever option you choose, CivisML will pre-process your data using either its default ETL or ETL that you provide (see the discussion of custom ETL below).
If you have already trained a scikit-learn model outside of Civis Platform, you can register it with Civis Platform as a CivisML model so that you can score it using CivisML. See the discussion of registering pre-trained models below for how to do this.
You can use the following pre-defined models with CivisML. All models start by imputing missing values with the mean of non-null values in a column. The "sparse*" models include a LASSO regression step (using the glmnet package) to do feature selection before passing data to the final model. In some models, CivisML uses default parameters different from those in scikit-learn, as indicated in the "Altered Defaults" column. All models also have random_state=42.
Name | Model Type | Algorithm | Altered Defaults |
---|---|---|---|
sparse_logistic | classification | LogisticRegression | C=499999950, tol=1e-08 |
gradient_boosting_classifier | classification | GradientBoostingClassifier | n_estimators=500, max_depth=2 |
random_forest_classifier | classification | RandomForestClassifier | n_estimators=500, max_depth=7 |
extra_trees_classifier | classification | ExtraTreesClassifier | |
multilayer_perceptron_classifier | classification | muffnn.MLPClassifier | |
stacking_classifier | classification | civismlext.StackedClassifier | |
sparse_linear_regressor | regression | LinearRegression | |
sparse_ridge_regressor | regression | Ridge | |
gradient_boosting_regressor | regression | GradientBoostingRegressor | n_estimators=500, max_depth=2 |
random_forest_regressor | regression | RandomForestRegressor | n_estimators=500, max_depth=7 |
extra_trees_regressor | regression | ExtraTreesRegressor | |
multilayer_perceptron_regressor | regression | muffnn.MLPRegressor | |
stacking_regressor | regression | civismlext.StackedRegressor | |
The "stacking_classifier" model stacks the "gradient_boosting_classifier" and "random_forest_classifier" pre-defined models together with a glmnet.LogitNet(alpha=0, n_splits=4, max_iter=10000, tol=1e-5, scoring='log_loss'). The models are combined using a sklearn.pipeline.Pipeline containing a Normalizer step, followed by LogisticRegressionCV with penalty='l2' and tol=1e-08. The "stacking_regressor" works similarly, stacking together the "gradient_boosting_regressor" and "random_forest_regressor" models and a glmnet.ElasticNet(alpha=0, n_splits=4, max_iter=10000, tol=1e-5, scoring='r2'), and combining them using NonNegativeLinearRegression. The estimators that are being stacked have the same names as the associated pre-defined models, and the meta-estimator steps are named "meta-estimator". Note that although default parameters are provided for multilayer perceptron models, it is highly recommended that multilayer perceptrons be run using hyperband.
You can create your own sklearn.pipeline.Pipeline instead of using one of the pre-defined ones. Create the object and pass it as the model parameter of the civis.ml.ModelPipeline. Your model must follow the scikit-learn API, and you will need to include any dependencies as custom dependencies if they are not already installed in CivisML. Preinstalled libraries available for your use include:
- scikit-learn v0.19.1
- glmnet v2.0.0
- xgboost v0.6a2
- muffnn v1.2.0
- civisml-extensions v0.1.6
When you're assembling your own model, remember that you must either add a missing value imputation step or make certain that your data doesn't have any missing values. If you're making a classification model, the model must have a predict_proba method. If the class you're using doesn't have a predict_proba method, you can add one by wrapping it in a sklearn.calibration.CalibratedClassifierCV.
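The two requirements above (handling missing values and exposing predict_proba) can be sketched locally with scikit-learn alone. This is a minimal illustration on synthetic data, not a CivisML call. It assumes a scikit-learn release that provides sklearn.impute.SimpleImputer (0.20 or later); under the v0.19.1 release listed above, the equivalent class was sklearn.preprocessing.Imputer.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Synthetic data with one missing value, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[0, 0] = np.nan

pipeline = Pipeline([
    # Replace missing values with the column mean.
    ("impute", SimpleImputer(strategy="mean")),
    # LinearSVC has no predict_proba; the calibration wrapper adds one.
    ("classify", CalibratedClassifierCV(LinearSVC(), cv=3)),
])

pipeline.fit(X, y)
probabilities = pipeline.predict_proba(X[:3])
print(probabilities.shape)
```

A pipeline built this way is the kind of object that could then be passed as the model parameter described above.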
By default, CivisML pre-processes data using the civismlext.preprocessing.DataFrameETL class, with cols_to_drop equal to the excluded_columns parameter. You can replace this with your own ETL by creating an object of class sklearn.base.BaseEstimator and passing it as the etl parameter during training.
By default, civismlext.preprocessing.DataFrameETL automatically one-hot encodes all categorical columns in the dataset. If you are passing a custom ETL estimator, you will have to ensure that no categorical columns remain after the transform method is called on the dataset.
You can tune hyperparameters using one of two methods: grid search or hyperband. CivisML will perform grid search if you pass a dictionary of hyperparameters to the cross_validation_parameters parameter, where the keys are hyperparameter names and the values are lists of hyperparameter values to search over. You can run hyperparameter tuning in parallel by setting the n_jobs parameter to the number of jobs you would like to run in parallel. By default, n_jobs is calculated dynamically based on the resources available on your cluster, such that a modeling job will never take up more than 90% of the cluster resources at once.
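To make the shape of a grid search dictionary concrete, here is a small sketch that builds one and enumerates the combinations a grid search would cover. The hyperparameter values are illustrative only, and expand_grid is a hypothetical helper written for this example; the dictionary itself is what would be passed as cross_validation_parameters.

```python
from itertools import product

# Keys are hyperparameter names; values are the candidate values to try.
cv_params = {
    "n_estimators": [100, 500],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.01, 0.1],
}


def expand_grid(grid):
    """List every hyperparameter combination a grid search would evaluate."""
    names = sorted(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


combinations = expand_grid(cv_params)
print(len(combinations))  # 2 * 3 * 2 = 12 combinations
print(combinations[0])
```

The number of combinations grows multiplicatively with each added hyperparameter, which is worth keeping in mind when choosing how many jobs to run in parallel.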
Hyperband is an efficient approach to hyperparameter optimization, and recommended over grid search where possible. CivisML will perform hyperband optimization for a pre-defined model if you pass the string 'hyperband' to cross_validation_parameters. Hyperband is currently only supported for the following models: gradient_boosting_classifier, random_forest_classifier, extra_trees_classifier, multilayer_perceptron_classifier, stacking_classifier, gradient_boosting_regressor, random_forest_regressor, extra_trees_regressor, multilayer_perceptron_regressor, and stacking_regressor. Although hyperband is supported for stacking models, stacking itself is a kind of model tuning, and the combination of stacking and hyperband is likely too computationally intensive to be useful in many cases.
Hyperband cannot be used to tune GLMs. For this reason, preset GLMs do not have a hyperband option. Similarly, when cross_validation_parameters='hyperband' and the model is stacking_classifier or stacking_regressor, only the GBT and random forest steps of the stacker are tuned using hyperband. Note that if you want to use hyperband with a custom model, you will need to wrap your estimator in a civismlext.hyperband.HyperbandSearchCV estimator yourself.
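One way to keep the hyperband restrictions straight in your own code is a small lookup like the sketch below. The set of model names is copied from the text above; the helper choose_cv_parameters is a hypothetical convenience for this example and is not part of the civis API.

```python
# Pre-defined models that the text above lists as supporting hyperband.
HYPERBAND_MODELS = frozenset({
    "gradient_boosting_classifier",
    "random_forest_classifier",
    "extra_trees_classifier",
    "multilayer_perceptron_classifier",
    "stacking_classifier",
    "gradient_boosting_regressor",
    "random_forest_regressor",
    "extra_trees_regressor",
    "multilayer_perceptron_regressor",
    "stacking_regressor",
})


def choose_cv_parameters(model_name, grid=None):
    """Return 'hyperband' when the model supports it, else the fallback grid."""
    if model_name in HYPERBAND_MODELS:
        return "hyperband"
    return grid


print(choose_cv_parameters("random_forest_classifier"))
print(choose_cv_parameters("sparse_logistic"))
```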
CivisML runs pre-defined models with hyperband using the following distributions:
Models | Cost Parameter | Hyperband Distributions |
---|---|---|
gradient_boosting_classifier, gradient_boosting_regressor, GBT step in stacking_classifier, GBT step in stacking_regressor | n_estimators (min = 100, max = 1000) | max_depth: randint(low=1, high=5); max_features: [None, 'sqrt', 'log2', 0.5, 0.3, 0.1, 0.05, 0.01]; learning_rate: truncexpon(b=5, loc=.0003, scale=1./167.) |
random_forest_classifier, random_forest_regressor, extra_trees_classifier, extra_trees_regressor, RF step in stacking_classifier, RF step in stacking_regressor | n_estimators (min = 100, max = 1000) | criterion: ['gini', 'entropy']; max_features: truncexpon(b=10., loc=.01, scale=1./10.11); max_depth: [1, 2, 3, 4, 6, 10] |
multilayer_perceptron_classifier, multilayer_perceptron_regressor | n_epochs (min = 5, max = 50) | keep_prob: uniform(); hidden_units: [(), (16,), (32,), (64,), (64, 64), (64, 64, 64), (128,), (128, 128), (128, 128, 128), (256,), (256, 256), (256, 256, 256), (512, 256, 128, 64), (1024, 512, 256, 128)]; learning_rate: [1e-2, 2e-2, 5e-2, 8e-2, 1e-3, 2e-3, 5e-3, 8e-3, 1e-4] |
The truncated exponential distribution for the learning_rate of the gradient boosting classifier and regressor was chosen to skew the distribution toward small values, ranging between .0003 and .03, with a mean close to .006. Similarly, the truncated exponential distribution for the max_features of the random forest and extra trees models skews toward small values, ranging between .01 and 1, with a mean close to .1.
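The ranges and means quoted above can be checked with a short closed-form calculation. This sketch assumes the scipy.stats.truncexpon parameterization, in which the standard density exp(-x) / (1 - exp(-b)) on [0, b] is shifted by loc and stretched by scale; truncexpon_summary is a hypothetical helper written for this check and uses only the standard library.

```python
import math


def truncexpon_summary(b, loc, scale):
    """Return (lower, upper, mean) of a truncated exponential distribution.

    Assumes the scipy.stats.truncexpon convention: support is
    [loc, loc + b * scale].
    """
    lower = loc
    upper = loc + b * scale
    # Mean of the standard form on [0, b] is 1 - b / (exp(b) - 1).
    standard_mean = 1.0 - b / math.expm1(b)
    return lower, upper, loc + scale * standard_mean


gbt_learning_rate = truncexpon_summary(b=5, loc=0.0003, scale=1.0 / 167.0)
forest_max_features = truncexpon_summary(b=10.0, loc=0.01, scale=1.0 / 10.11)

print("GBT learning_rate (low, high, mean):  ", gbt_learning_rate)
print("forest max_features (low, high, mean):", forest_max_features)
```

Under that assumption, the learning rate spans roughly .0003 to .03 with a mean near .006, and max_features spans roughly .01 to 1 with a mean near .1, consistent with the description above.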
Installing packages from PyPI is straightforward. You can specify a dependencies argument to civis.ml.ModelPipeline, which will install the dependencies in your runtime environment. VCS support is also enabled (see the pip documentation). Installing from a remote git repository hosted on, say, GitHub only requires passing the HTTPS URL in a form such as git+https://github.com/scikit-learn/scikit-learn.
CivisML will run pip install [your package here]. We strongly encourage you to pin package versions for consistency. Example code looks like:
from civis.ml import ModelPipeline
from pyearth import Earth
deps = ['git+https://github.com/scikit-learn-contrib/py-earth.git@da856e11b2a5d16aba07f51c3c15cef5e40550c7']
est = Earth()
model = ModelPipeline(est, dependent_variable='age', dependencies=deps)
train = model.train(table_name='donors.from_march', database_name='client')
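Because pinning is only encouraged rather than enforced, it can help to check a dependency list before submitting it. The following is a rough heuristic sketch: is_pinned is a hypothetical helper, and it cannot tell a commit hash or tag from a branch name, so a URL ending in a branch such as @master would still pass.

```python
import re


def is_pinned(requirement):
    """Return True if a pip requirement names an exact version or VCS revision."""
    if requirement.startswith("git+"):
        # A VCS URL counts as pinned when it ends with @<revision>.
        return re.search(r"@[^/@]+$", requirement) is not None
    # A PyPI requirement counts as pinned when it uses an exact == version.
    return re.search(r"==\s*[\w.]+", requirement) is not None


deps = [
    "git+https://github.com/scikit-learn-contrib/py-earth.git"
    "@da856e11b2a5d16aba07f51c3c15cef5e40550c7",
    "git+https://github.com/scikit-learn/scikit-learn",
    "scikit-learn==0.19.1",
    "pandas>=0.20",
]

for dep in deps:
    print(f"{'pinned  ' if is_pinned(dep) else 'unpinned'} {dep}")
```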
Additionally, you can store a remote git host's API token in the Civis Platform as a credential to use for installing private git repositories. For example, you can go to GitHub at https://github.com/settings/tokens, copy your token into the password field of a credential, and pass the credential name to the git_token_name argument of civis.ml.ModelPipeline. This also works with other hosting services. A simple example of how to do this with the API looks as follows:
import civis
password = 'abc123' # token copied from https://github.com/settings/tokens
username = 'user123' # Github username
git_token_name = 'Github credential'
client = civis.APIClient()
credential = client.credentials.post(password=password,
                                     username=username,
                                     name=git_token_name,
                                     type="Custom")
pipeline = civis.ml.ModelPipeline(..., git_token_name=git_token_name)
Note that installing private dependencies with submodules is not supported.
All calls to a civis.ml.ModelPipeline object are non-blocking, i.e. they immediately provide a result without waiting for the job in the Civis Platform to complete. Calls to civis.ml.ModelPipeline.train and civis.ml.ModelPipeline.predict return a civis.ml.ModelFuture object, which is a subclass of concurrent.futures.Future from the Python standard library. This behavior lets you train multiple models at once, or generate predictions from models, and keep doing other work while you wait for your jobs to complete.
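The non-blocking pattern is the same one used by any concurrent.futures.Future. The sketch below uses a thread pool and a sleeping function as a stand-in for remote jobs, purely to show the control flow; slow_job and the job names are hypothetical, and no Civis Platform calls are made.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_job(name, seconds):
    """Stand-in for a remote job: sleeps, then returns a label."""
    time.sleep(seconds)
    return f"{name} finished"


with ThreadPoolExecutor(max_workers=2) as executor:
    # submit() returns a Future right away, before the work completes.
    first = executor.submit(slow_job, "job-1", 0.2)
    second = executor.submit(slow_job, "job-2", 0.2)

    # Other work can happen here while both jobs run.

    # result() blocks until the corresponding job completes.
    results = [first.result(), second.result()]

print(results)
```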
The civis.ml.ModelFuture can find and retrieve outputs from your CivisML jobs, such as trained sklearn.pipeline.Pipeline objects or out-of-sample predictions. The civis.ml.ModelFuture only downloads outputs when you request them.
Civis Platform permanently stores all models, indexed by the job ID and the run ID (also called a "build") of the training job. If you wish to use an existing model, call civis.ml.ModelPipeline.from_existing with the job ID of the training job. You can find the job ID with the civis.ml.ModelFuture.train_job_id attribute of a civis.ml.ModelFuture, or by looking at the URL of your model on the Civis Platform models page. If the training job has multiple runs, you may also provide a run ID to select a run other than the most recent. You can list all model runs of a training job by calling civis.APIClient().jobs.get(train_job_id)['runs']. You may also store the civis.ml.ModelPipeline itself with the pickle module.
concurrent.futures.Future objects have the method concurrent.futures.Future.add_done_callback. The callback you register is called as soon as the run completes. It takes a single argument, the concurrent.futures.Future for the completed job. You can use this method to chain jobs together:
from concurrent import futures
from civis.ml import ModelPipeline
import pandas as pd
df = pd.read_csv('data.csv')
training, predictions = [], []
model = ModelPipeline('sparse_logistic', dependent_variable='type')
training.append(model.train(df))
training[-1].add_done_callback(lambda fut: predictions.append(model.predict(df)))
futures.wait(training) # Blocks until all training jobs complete
futures.wait(predictions) # Blocks until all prediction jobs complete
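The callback mechanics can be seen in isolation with a bare concurrent.futures.Future from the standard library. This sketch completes the future by hand with set_result, which stands in for a job finishing; the names record_result and job are hypothetical.

```python
from concurrent.futures import Future

completed = []


def record_result(future):
    """Callback: receives the completed Future as its only argument."""
    completed.append(future.result())


job = Future()
job.add_done_callback(record_result)

# The callback has not fired yet because the future is still pending.
pending_snapshot = list(completed)

# Completing the future triggers the callback with the future itself.
job.set_result("training complete")

print(pending_snapshot)
print(completed)
```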
You can create and train multiple models at once to find the best approach for solving a problem. For example:
from civis.ml import ModelPipeline
algorithms = ['gradient_boosting_classifier', 'sparse_logistic', 'random_forest_classifier']
pkey = 'person_id'
depvar = 'likes_cats'
models = [ModelPipeline(alg, primary_key=pkey, dependent_variable=depvar) for alg in algorithms]
train = [model.train(table_name='schema.name', database_name='My DB') for model in models]
aucs = [tr.metrics['roc_auc'] for tr in train]  # Execution blocks here until training completes
Instead of using CivisML to train your model, you may train any scikit-learn-compatible model outside of Civis Platform and use civis.ml.ModelPipeline.register_pretrained_model to register it as a CivisML model in Civis Platform. This will let you use Civis Platform to make predictions using your model, either to take advantage of distributed predictions on large datasets, or to create predictions as part of a workflow or service in Civis Platform.
When registering a model trained outside of Civis Platform, you are strongly advised to provide an ordered list of feature names used for training. This will allow CivisML to ensure that tables of data input for predictions have the correct features in the correct order. If your model has more than one output, you should also provide a list of output names so that CivisML knows how many outputs to expect and how to name them in the resulting table of model predictions.
If your model uses dependencies which aren't part of the default CivisML execution environment, you must provide them to the dependencies parameter of the civis.ml.ModelPipeline.register_pretrained_model function, just as with the civis.ml.ModelPipeline constructor.