[ADD] Extra visualization example #189

5 changes: 4 additions & 1 deletion autoPyTorch/api/base_task.py
@@ -576,6 +576,7 @@ def _do_traditional_prediction(self, time_left: int, func_eval_time_limit_secs:
         assert self._dask_client is not None

         self._logger.info("Starting to create traditional classifier predictions.")
+        starttime = time.time()

         # Initialise run history for the traditional classifiers
         run_history = RunHistory()
@@ -649,6 +650,7 @@ def _do_traditional_prediction(self, time_left: int, func_eval_time_limit_secs:
                 origin = additional_info['configuration_origin']
                 run_history.add(config=configuration, cost=cost,
                                 time=runtime, status=status, seed=self.seed,
+                                starttime=starttime, endtime=starttime + runtime,
                                 origin=origin)
             else:
                 if additional_info.get('exitcode') == -6:
@@ -1235,7 +1237,8 @@ def __del__(self) -> None:
         # When a multiprocessing work is done, the
         # objects are deleted. We don't want to delete run areas
         # until the estimator is deleted
-        self._backend.context.delete_directories(force=False)
+        if hasattr(self, '_backend'):
+            self._backend.context.delete_directories(force=False)

     @typing.no_type_check
     def get_incumbent_results(
41 changes: 40 additions & 1 deletion docs/manual.rst
@@ -6,4 +6,43 @@
Manual
======

-TODO
This manual shows how to get started with Auto-PyTorch. We recommend going over the examples first.
Additional recommendations on how to interact with the API are given further below in this manual.
You are also welcome to contribute to this documentation by opening a pull request.

In a nutshell, Auto-PyTorch searches for the best ensemble of both traditional machine learning models and neural networks for a given dataset. It does so via the `search()` method of the different supported task. Currently we support Tabular classification and Tabular Regression. We plan to also support image processing.
Review comment (Contributor):

Suggested correction:
The search starts by calling the search() method of each supported task.
Currently, we support tabular classification and tabular regression.
We will expand the support to image processing tasks in the future.

Great that you precisely say "method" rather than "function"!
Many people get this wrong :(
(In Python, a function looked up on an instance is a bound method, while the same attribute looked up on the class is a plain function.)
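
The distinction mentioned in this comment can be checked directly in Python 3. The snippet below is a minimal sketch; the `Task` class is a hypothetical stand-in and is not part of Auto-PyTorch.

```python
import inspect


class Task:
    """Hypothetical stand-in for a task class; not part of Auto-PyTorch."""

    def search(self) -> str:
        return "searching"


task = Task()

# Looked up on the class, ``search`` is a plain function.
print(inspect.isfunction(Task.search))  # True
# Looked up on an instance, it is a bound method tied to that instance.
print(inspect.ismethod(task.search))    # True
print(task.search.__self__ is task)     # True
```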


Examples
========
* `Classification <examples/tabular/20_basics/example_tabular_classification.html>`_
* `Regression <examples/tabular/20_basics/example_tabular_regression.html>`_
* `Customizing the search space <examples/tabular/40_advanced/example_custom_configuration_space.html>`_
* `Changing the resampling strategy <examples/tabular/40_advanced/example_resampling_strategy.html>`_
* `Visualizing the results <examples/tabular/40_advanced/example_visualization.html>`_

Resource Allocation
===================

Auto-PyTorch allows controlling the maximum resident set memory that an estimator can use. By providing the `memory_limit` argument to the `search()` method, one can make sure that neither the individual machine learning models fitted by SMAC nor the final ensemble consumes more than `memory_limit` megabytes.
Review comment (Contributor):

What is resident set memory?
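
For context on this question: the resident set is the part of a process's memory that is currently held in physical RAM, as opposed to memory that is only reserved or swapped out. The snippet below is a minimal sketch of how the peak resident set size of the current process can be read with the standard library on Unix-like systems (the `resource` module is not available on Windows). The helper name `peak_rss_megabytes` is made up for this illustration, and the snippet does not show how Auto-PyTorch enforces `memory_limit`.

```python
import resource
import sys


def peak_rss_megabytes() -> float:
    """Peak resident set size of the current process, in megabytes."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss / divisor


before = peak_rss_megabytes()
# Allocate and fill roughly 50 MB so that the pages are actually written.
payload = b"x" * (50 * 1024 * 1024)
after = peak_rss_megabytes()

print(f"peak RSS before: {before:.1f} MB, after: {after:.1f} MB")
```

Because the value is a peak, it never decreases during the lifetime of the process.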


Additionally, one can control the allocated time to search for a model, via the argument `total_walltime_limit` to the `search()` method. This argument controls the total time SMAC can use to search for new configurations. In general, the more time is allocated, the better the final estimator tends to be.
Review comment (Contributor):

no need for the comma

search for a model, via the argument

==> search for a model via the argument


Ensemble Building Process
=========================

Auto-PyTorch uses ensemble selection by `Caruana et al. (2004) <https://dl.acm.org/doi/pdf/10.1145/1015330.1015432>`_
to build an ensemble based on the models’ predictions for the validation set. The following hyperparameters control how the ensemble is constructed:

* ``ensemble_size`` determines the maximal size of the ensemble. If it is set to zero, no ensemble will be constructed.
* ``ensemble_nbest`` allows the user to directly specify the number of models considered for the ensemble. This hyperparameter can be an integer *n*, such that only the best *n* models are used in the final ensemble. If a float between 0.0 and 1.0 is provided, ``ensemble_nbest`` would be interpreted as a fraction suggesting the percentage of models to use in the ensemble building process (namely, if ensemble_nbest is a float, library pruning is implemented as described in `Caruana et al. (2006) <https://dl.acm.org/doi/10.1109/ICDM.2006.76>`_).
Review comment (Contributor):

Original: "This hyperparameter can be an integer n, such that only the best n models are used in the final ensemble"

Suggested: "When an integer is provided for this hyperparameter, the final ensemble chooses each predictor from only the best n models."

Original: "If a float between 0.0 and 1.0 is provided, ensemble_nbest would be interpreted as a fraction suggesting the percentage of models to use in the ensemble building process"

Suggested: "If a float between 0.0 and 1.0 is provided, ensemble_nbest will be interpreted as a fraction suggesting the percentage of models to use in the ensemble building process.
For example, when we train 10 models before the ensemble building process and the hyperparameter is 0.7, we build an ensemble by selecting models from only the best (?) 7 of the 10 models."

* ``max_models_on_disc`` defines the maximum number of models that are kept on the disc, as a mechanism to control the amount of disc space consumed by *auto-sklearn*. Throughout the automl process, different individual models are optimized, and their predictions (and other metadata) is stored on disc. The user can set the upper bound on how many models are acceptable to keep on disc, yet this variable takes priority in the definition of the number of models used by the ensemble builder (that is, the minimum of ``ensemble_size``, ``ensemble_nbest`` and ``max_models_on_disc`` determines the maximal amount of models used in the ensemble). If set to None, this feature is disabled.
Review comment (Contributor):

Original: "the maximum number of models that are kept on the disc"

Suggested: "the maximum number of models that can stay on the disc"

Original: "the amount of disc space consumed by auto-sklearn"

Should this be Auto-PyTorch?

Original: "their predictions (and other metadata) is stored on disc."

Suggested: "are stored"
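
To make the hyperparameters discussed in this thread more concrete, here is a minimal sketch of greedy ensemble selection with replacement in the spirit of Caruana et al. (2004), together with one possible way to turn `ensemble_nbest` and `max_models_on_disc` into a candidate count. This is not Auto-PyTorch's implementation: the function names, the squared-error loss on binary class probabilities, and the round-down rule for a float `ensemble_nbest` are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Union


def candidate_count(
    n_models: int,
    ensemble_nbest: Union[int, float],
    max_models_on_disc: Optional[int] = None,
) -> int:
    """How many of the best models are considered (illustrative rule only)."""
    if isinstance(ensemble_nbest, float):
        # Assumed rounding rule: round down, but keep at least one model.
        nbest = max(1, int(n_models * ensemble_nbest))
    else:
        nbest = ensemble_nbest
    limits = [n_models, nbest]
    if max_models_on_disc is not None:
        limits.append(max_models_on_disc)
    return min(limits)


def ensemble_selection(
    val_probs: Sequence[Sequence[float]],
    y_val: Sequence[int],
    ensemble_size: int,
) -> List[int]:
    """Greedily pick model indices, with replacement, by validation loss.

    ``val_probs[i][j]`` is model i's predicted probability of class 1 for
    validation sample j.
    """
    def loss(avg: Sequence[float]) -> float:
        return sum((p - y) ** 2 for p, y in zip(avg, y_val)) / len(y_val)

    chosen: List[int] = []
    running_sum = [0.0] * len(y_val)
    for _ in range(ensemble_size):
        best_idx, best_loss = -1, float("inf")
        size_after_adding = len(chosen) + 1
        for idx, probs in enumerate(val_probs):
            # Loss of the averaged prediction if this model were added.
            avg = [(s + p) / size_after_adding for s, p in zip(running_sum, probs)]
            current = loss(avg)
            if current < best_loss:
                best_idx, best_loss = idx, current
        chosen.append(best_idx)
        running_sum = [s + p for s, p in zip(running_sum, val_probs[best_idx])]
    return chosen


# Toy validation predictions from three hypothetical models.
toy_probs = [[1.0, 0.5], [0.4, 0.0], [0.0, 1.0]]
toy_labels = [1, 0]
print(ensemble_selection(toy_probs, toy_labels, ensemble_size=2))  # [0, 1]
print(candidate_count(n_models=10, ensemble_nbest=0.5))            # 5
```

In the toy run, the second pick is a different model because averaging it with the first lowers the validation loss more than adding the first model again. With `ensemble_size=0` the loop never runs and no ensemble is built, which matches the behaviour described for that hyperparameter.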


Inspecting the results
======================

Auto-PyTorch allows users to inspect the training results and statistics. The following example shows how different statistics can be printed for the inspection.

>>> from autoPyTorch.api.tabular_classification import TabularClassificationTask
>>> automl = TabularClassificationTask()
>>> automl.fit(X_train, y_train)
>>> automl.show_models()
4 changes: 2 additions & 2 deletions docs/releases.rst
@@ -12,6 +12,6 @@
 Releases
 ========

-Version 0.0.3
+Version 0.1.0
 ==============
-TODO
+[refactor] Initial version of the new scikit-learn compatible API.
166 changes: 166 additions & 0 deletions examples/tabular/40_advanced/example_visualization.py
@@ -0,0 +1,166 @@
"""
=======================
Visualizing the Results
=======================

Auto-PyTorch uses SMAC to fit individual machine learning algorithms
and then ensembles them together using `Ensemble Selection
<https://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml04.icdm06long.pdf>`_.

The following example shows how to visualize both the performance
of the individual models and their respective ensemble.

Additionally, as we are compatible with scikit-learn,
we show how to further interact with `Scikit-Learn Inspection
<https://scikit-learn.org/stable/inspection.html>`_ support.


"""
import os
import pickle
import tempfile as tmp
import time
import warnings

os.environ['JOBLIB_TEMP_FOLDER'] = tmp.gettempdir()
os.environ['OMP_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'
Review comment (Contributor):

What about those environment variables?
We need explanations for them.
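
As a possible answer to this request, the sketch below annotates the four variables with what they are commonly used for. The explanations reflect the general behaviour of joblib and of the OpenMP, OpenBLAS and MKL runtimes; that this is the exact motivation in this example is an assumption.

```python
import os
import tempfile

# joblib writes temporary memory-mapped arrays into this folder when it
# shares large data between worker processes.
os.environ["JOBLIB_TEMP_FOLDER"] = tempfile.gettempdir()

# Cap the thread pools of the numerical backends at one thread each, so that
# several models evaluated in parallel do not oversubscribe the CPU.
# These variables are typically read when the libraries are first loaded,
# so they are set before numpy, scikit-learn or torch are imported.
for variable in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[variable] = "1"
```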


warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

import matplotlib.pyplot as plt

import numpy as np

import pandas as pd


import sklearn.datasets
import sklearn.model_selection
from sklearn.inspection import permutation_importance

from smac.tae import StatusType


from autoPyTorch.api.tabular_classification import TabularClassificationTask
from autoPyTorch.metrics import accuracy


if __name__ == '__main__':

    ############################################################################
    # Data Loading
    # ============

    # We will use the iris dataset for this toy example
    seed = 42
    X, y = sklearn.datasets.fetch_openml(data_id=61, return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X,
        y,
        random_state=42,
    )

    ############################################################################
    # Build and fit a classifier
    # ==========================
    api = TabularClassificationTask(seed=seed)
    api.search(
        X_train=X_train,
        y_train=y_train,
        X_test=X_test.copy(),
        y_test=y_test.copy(),
        optimize_metric=accuracy.name,
        total_walltime_limit=200,
        func_eval_time_limit_secs=50
    )

    ############################################################################
    # One can also save the model for future inference
    # ================================================

    # For more details on how to deploy a model, please check
    # `Scikit-Learn persistence
    # <https://scikit-learn.org/stable/modules/model_persistence.html>`_ support.
    with open('estimator.pickle', 'wb') as handle:
        pickle.dump(api, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # Then let us read it back and use it for our analysis
    with open('estimator.pickle', 'rb') as handle:
        estimator = pickle.load(handle)

    ############################################################################
    # Plotting the model performance
    # ==============================

    # We will plot the search incumbent through time.

    # Collect the performance of individual machine learning algorithms
    # found by SMAC
    individual_performances = []
    for run_key, run_value in estimator.run_history.data.items():
        if run_value.status != StatusType.SUCCESS:
            # Ignore crashed runs
            continue
        individual_performances.append({
            'Timestamp': pd.Timestamp(
                time.strftime(
                    '%Y-%m-%d %H:%M:%S',
                    time.localtime(run_value.endtime)
                )
            ),
            'single_best_optimization_accuracy': accuracy._optimum - run_value.cost,
            'single_best_test_accuracy': np.nan if run_value.additional_info is None else
            accuracy._optimum - run_value.additional_info['test_loss'],
        })
    individual_performance_frame = pd.DataFrame(individual_performances)

    # Collect the performance of the ensemble through time
    # This ensemble is built from the machine learning algorithms
    # found by SMAC
    ensemble_performance_frame = pd.DataFrame(estimator.ensemble_performance_history)

    # As we are tracking the incumbent, we are interested in the cummax() performance
    ensemble_performance_frame['ensemble_optimization_accuracy'] = ensemble_performance_frame[
        'train_accuracy'
    ].cummax()
    ensemble_performance_frame['ensemble_test_accuracy'] = ensemble_performance_frame[
        'test_accuracy'
    ].cummax()
    ensemble_performance_frame.drop(columns=['test_accuracy', 'train_accuracy'], inplace=True)
    individual_performance_frame['single_best_optimization_accuracy'] = individual_performance_frame[
        'single_best_optimization_accuracy'
    ].cummax()
    individual_performance_frame['single_best_test_accuracy'] = individual_performance_frame[
        'single_best_test_accuracy'
    ].cummax()

    # Align both histories on the timestamp and carry the last known value forward
    pd.merge(
        ensemble_performance_frame,
        individual_performance_frame,
        on="Timestamp", how='outer'
    ).sort_values('Timestamp').ffill().plot(
        x='Timestamp',
        kind='line',
        legend=True,
        title='Auto-PyTorch accuracy over time',
        grid=True,
    )
    plt.show()

    # We can then understand the importance of each input feature using
    # a permutation importance analysis. This is done as a proof of concept, to
    # showcase that we can leverage the scikit-learn API.
    result = permutation_importance(estimator, X_train, y_train, n_repeats=5,
                                    scoring='accuracy',
                                    random_state=seed)
    sorted_idx = result.importances_mean.argsort()

    fig, ax = plt.subplots()
    ax.boxplot(result.importances[sorted_idx].T,
               vert=False, labels=X_test.columns[sorted_idx])
    ax.set_title("Permutation Importances (Train set)")
    fig.tight_layout()
    plt.show()
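
The incumbent tracking used for the plot in this example can be illustrated without pandas. The snippet below is a minimal sketch with made-up accuracies: a running maximum plays the role of `cummax()`, and a small hypothetical helper, `forward_fill`, mimics carrying the last known value forward after the outer merge on `Timestamp`.

```python
from itertools import accumulate
from typing import List, Optional

# Made-up accuracies of successive runs, in the order the runs finished.
run_accuracies = [0.81, 0.78, 0.90, 0.88, 0.93]

# Running maximum: the best accuracy seen up to and including each run.
incumbent = list(accumulate(run_accuracies, max))
print(incumbent)  # [0.81, 0.81, 0.9, 0.9, 0.93]


def forward_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing entry with the most recent known value."""
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# After an outer merge, timestamps present in only one history leave gaps.
ensemble_column = [None, 0.85, None, None, 0.91]
print(forward_fill(ensemble_column))  # [None, 0.85, 0.85, 0.85, 0.91]
```

The leading gap stays empty because no earlier value exists to carry forward, which is also why the plotted ensemble curve can start later than the single-best curve.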