
Implement stacked ensemble classes #1134

Merged
90 commits merged from 1129_ensemble_classes into main on Sep 30, 2020
aec7234
init
angela97lin Aug 31, 2020
c1af455
init
angela97lin Sep 2, 2020
cd7b13b
fixing but some tricky things to do
angela97lin Sep 2, 2020
c95c383
reeee more testing
angela97lin Sep 3, 2020
0267edc
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 3, 2020
02d49bc
clean up and more test
angela97lin Sep 3, 2020
8842cab
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 3, 2020
8d497f5
release note
angela97lin Sep 3, 2020
7ea1594
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 3, 2020
9b625e3
add test, add broken pipeline sadness
angela97lin Sep 4, 2020
bdb0c4e
updating estimators to be a kwarg
angela97lin Sep 8, 2020
3e76578
cleanup and remove feature importances
angela97lin Sep 8, 2020
2faf743
add empty hyperparameter ranges
angela97lin Sep 8, 2020
eab4524
merging
angela97lin Sep 9, 2020
0a4515c
fix some tests, serialization still broken
angela97lin Sep 9, 2020
1a5eec0
add separate tests for serialization
angela97lin Sep 9, 2020
72a3968
fix more tests
angela97lin Sep 9, 2020
1dd9b22
revert code
angela97lin Sep 9, 2020
76aecf9
cleanup of exception
angela97lin Sep 10, 2020
087a6eb
add tests
angela97lin Sep 10, 2020
4f217ea
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 13, 2020
4e2ed44
docstrings and test cleanup
angela97lin Sep 13, 2020
1b2e37f
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 14, 2020
0713148
clean up and add tests for stacked
angela97lin Sep 15, 2020
a0432a9
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 15, 2020
6760c24
fix model family updated test, still need to update for kwargs
angela97lin Sep 15, 2020
c4127fa
address PR comments
angela97lin Sep 16, 2020
ba29024
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 16, 2020
ac4a4ec
fixing tests and linting
angela97lin Sep 17, 2020
f203104
fixing more tests
angela97lin Sep 17, 2020
52a5b41
merging
angela97lin Sep 17, 2020
6bd0eee
updated estimators parameter to input_pipelines and breaking everything
angela97lin Sep 17, 2020
943b74f
move release notes
angela97lin Sep 17, 2020
fc409a7
adding local changes for now
angela97lin Sep 18, 2020
ac255d9
clean up tests
angela97lin Sep 18, 2020
e8f11aa
add progress on creating scikit wrapper util methods, still lots to do
angela97lin Sep 18, 2020
ad1ea35
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 21, 2020
7fc723f
empty for circleci
angela97lin Sep 21, 2020
cb9b6a5
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 21, 2020
6704514
updated but need to fix compat
angela97lin Sep 21, 2020
e5dd69b
remove extranenous tests
angela97lin Sep 22, 2020
ed67e6e
some cleanup
angela97lin Sep 22, 2020
73eb272
retrigger
angela97lin Sep 22, 2020
8a4b59c
add test for pipelines with different problem types
angela97lin Sep 22, 2020
d1079f3
add tests to check for pandas structures
angela97lin Sep 22, 2020
41aa3ac
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 22, 2020
605ed46
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 22, 2020
bf89169
lint
angela97lin Sep 22, 2020
d5af473
Merge branch '1129_ensemble_classes' of github.com:alteryx/evalml int…
angela97lin Sep 22, 2020
be1a86a
testing n jobs
angela97lin Sep 22, 2020
b28bfb5
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 22, 2020
2297891
revert
angela97lin Sep 22, 2020
2d16f33
Merge branch '1129_ensemble_classes' of github.com:alteryx/evalml int…
angela97lin Sep 22, 2020
0db129d
updating to test if scikit learn objects work directly
angela97lin Sep 23, 2020
2a9cd14
fix class to instance
angela97lin Sep 23, 2020
35103bf
all broken code
angela97lin Sep 23, 2020
67b7ef1
minor cleanup, still all broken
angela97lin Sep 23, 2020
7ce7202
revert to almost working state
angela97lin Sep 23, 2020
f0b6d18
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 23, 2020
7e62234
update n_jobs to -2
angela97lin Sep 23, 2020
17dca26
Merge branch '1129_ensemble_classes' of github.com:alteryx/evalml int…
angela97lin Sep 23, 2020
1407df3
try n_jobs=-3
angela97lin Sep 23, 2020
2d95000
try updating scipy version
angela97lin Sep 23, 2020
3821e91
n_jobs=-1
angela97lin Sep 24, 2020
261f769
-2
angela97lin Sep 24, 2020
d8455bd
-3
angela97lin Sep 24, 2020
84becbe
-4
angela97lin Sep 24, 2020
e90686b
revert requirements
angela97lin Sep 24, 2020
06fca30
testing
angela97lin Sep 24, 2020
03fa28f
try requirements update
angela97lin Sep 24, 2020
41d37ea
n_jobs=1
angela97lin Sep 24, 2020
33cdf6e
revert requirements
angela97lin Sep 24, 2020
786c4ea
revert
angela97lin Sep 24, 2020
4b2efcb
set directly in parameters
angela97lin Sep 24, 2020
ea19650
revert requirement
angela97lin Sep 24, 2020
a96ec86
set n_jobs = 1
angela97lin Sep 24, 2020
340fa23
fix test
angela97lin Sep 24, 2020
d637f12
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 24, 2020
aba6e87
add to api ref
angela97lin Sep 24, 2020
9c00abb
Merge branch '1129_ensemble_classes' of github.com:alteryx/evalml int…
angela97lin Sep 24, 2020
2ad139a
fix docs by adding custom default parameters impl
angela97lin Sep 24, 2020
bc809cd
fix docstr
angela97lin Sep 24, 2020
631e5e1
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 25, 2020
7f2374c
Merge branch 'main' into 1129_ensemble_classes
angela97lin Sep 28, 2020
e54984b
merging
angela97lin Sep 29, 2020
631ffa0
some cleanup
angela97lin Sep 29, 2020
db44af5
minor cleanup for tests and docstring addition
angela97lin Sep 29, 2020
d7a0aaa
linting, merging, cleanup
angela97lin Sep 29, 2020
92d7dc6
fix new test
angela97lin Sep 29, 2020
4f5f9f9
merge and move release notes
angela97lin Sep 29, 2020
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -4,6 +4,7 @@ Release Notes
**Future Releases**
* Enhancements
* Added `output_format` field to explain predictions functions :pr:`1107`
* Added stacked ensemble component classes (StackedEnsembleClassifier, StackedEnsembleRegressor) :pr:`1134`
* Modified `get_objective` and `get_objectives` to be able to return any objective in `evalml.objectives` :pr:`1132`
* Added a `return_instance` boolean parameter to `get_objective` :pr:`1132`
* Added `ClassImbalanceDataCheck` to determine whether target imbalance falls below a given threshold :pr:`1135`
5 changes: 5 additions & 0 deletions evalml/exceptions/exceptions.py
@@ -38,6 +38,11 @@ class AutoMLSearchException(Exception):
pass


class EnsembleMissingEstimatorsError(Exception):
"""An exception raised when an ensemble is missing `estimators` (list) as a parameter."""
pass


class PipelineScoreError(Exception):
"""An exception raised when a pipeline errors while scoring any objective in a list of objectives.

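The kwargs check that this exception guards can be sketched in isolation. Note the class and `build_ensemble` helper below are illustrative stand-ins, not the evalml source:

```python
# Minimal sketch (assumed names, not evalml code): enforcing that an
# `estimators` list is supplied via **kwargs, as the ensemble components do.

class EnsembleMissingEstimatorsError(Exception):
    """An exception raised when an ensemble is missing `estimators` (list) as a parameter."""
    pass


def build_ensemble(**kwargs):
    """Return the validated estimators list, or raise if it is absent."""
    if 'estimators' not in kwargs:
        raise EnsembleMissingEstimatorsError(
            "`estimators` must be passed to the constructor as a keyword argument")
    return kwargs['estimators']
```

Callers that forget the keyword argument fail fast with a descriptive error rather than a generic `KeyError` deep inside scikit-learn.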
4 changes: 3 additions & 1 deletion evalml/model_family/model_family.py
@@ -9,7 +9,8 @@ class ModelFamily(Enum):
LINEAR_MODEL = 'linear_model'
CATBOOST = 'catboost'
EXTRA_TREES = 'extra_trees'
BASELINE = 'baseline'
ENSEMBLE = 'ensemble'
NONE = 'none'

def __str__(self):
@@ -20,6 +21,7 @@ def __str__(self):
ModelFamily.CATBOOST.name: "CatBoost",
ModelFamily.EXTRA_TREES.name: "Extra Trees",
ModelFamily.BASELINE.name: "Baseline",
ModelFamily.ENSEMBLE.name: "Ensemble",
ModelFamily.NONE.name: "None"}
return model_family_dict[self.name]

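One detail worth flagging in enum diffs like this: a trailing comma after a member's value silently turns the value into a one-element tuple. A minimal demonstration (the `Family` enum here is hypothetical):

```python
from enum import Enum

# A trailing comma after an Enum member's value makes it a tuple,
# which is easy to miss in review.

class Family(Enum):
    BASELINE = 'baseline',   # trailing comma: value is the tuple ('baseline',)
    ENSEMBLE = 'ensemble'    # no comma: value is the plain string
```

Any code comparing `member.value == 'baseline'` would then fail, so trailing commas on enum members are worth catching in review or with a lint rule.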
4 changes: 4 additions & 0 deletions evalml/pipelines/components/__init__.py
@@ -36,3 +36,7 @@
TextFeaturizer,
LSA,
)
from .ensemble import (
StackedEnsembleClassifier,
StackedEnsembleRegressor
)
4 changes: 4 additions & 0 deletions evalml/pipelines/components/ensemble/__init__.py
@@ -0,0 +1,4 @@
# flake8:noqa
from .ensemble_base import EnsembleBase
from .stacked_ensemble_classifier import StackedEnsembleClassifier
from .stacked_ensemble_regressor import StackedEnsembleRegressor
5 changes: 5 additions & 0 deletions evalml/pipelines/components/ensemble/ensemble_base.py
@@ -0,0 +1,5 @@
from evalml.pipelines.components.estimators import Estimator


class EnsembleBase(Estimator):
pass
59 changes: 59 additions & 0 deletions evalml/pipelines/components/ensemble/stacked_ensemble_classifier.py
@@ -0,0 +1,59 @@
from sklearn.ensemble import StackingClassifier

from evalml.exceptions import EnsembleMissingEstimatorsError
from evalml.model_family import ModelFamily
from evalml.pipelines.components import LogisticRegressionClassifier
from evalml.pipelines.components.ensemble import EnsembleBase
from evalml.problem_types import ProblemTypes
from evalml.utils.gen_utils import _nonstackable_model_families


class StackedEnsembleClassifier(EnsembleBase):
"""Stacked Ensemble Classifier."""
name = "Stacked Ensemble Classifier"
model_family = ModelFamily.ENSEMBLE
supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]
hyperparameter_ranges = {}

def __init__(self, final_estimator=None, cv=None, n_jobs=-1, random_state=0, **kwargs):
"""Stacked ensemble classifier.

Arguments:
final_estimator (Estimator or subclass): The classifier used to combine the base estimators. If None, uses LogisticRegressionClassifier.
cv (int, cross-validation generator or an iterable): Determines the cross-validation splitting strategy used to train final_estimator.
For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
Possible inputs for cv are:
- None: 5-fold cross validation
- int: the number of folds in a (Stratified) KFold
- A scikit-learn cross-validation generator object
- An iterable yielding (train, test) splits
n_jobs (int or None): Integer describing level of parallelism used for pipelines.
None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used.
random_state (int, np.random.RandomState): seed for the random number generator
**kwargs: 'estimators' (a list of Estimator objects) must be passed as a keyword argument; otherwise EnsembleMissingEstimatorsError is raised
"""
if 'estimators' not in kwargs:
raise EnsembleMissingEstimatorsError("`estimators` must be passed to the constructor as a keyword argument")
estimators = kwargs.get('estimators')
parameters = {
"estimators": estimators,
"final_estimator": final_estimator,
"cv": cv,
"n_jobs": n_jobs
}
contains_non_stackable = [estimator for estimator in estimators if estimator.model_family in _nonstackable_model_families]
if contains_non_stackable:
raise ValueError("Classifiers with any of the following model families cannot be used as base estimators in StackedEnsembleClassifier: {}".format(_nonstackable_model_families))
sklearn_parameters = parameters.copy()
parameters.update(kwargs)
if final_estimator is None:
final_estimator = LogisticRegressionClassifier()
sklearn_parameters.update({"final_estimator": final_estimator._component_obj})
sklearn_parameters.update({"estimators": [(estimator.name + f"({idx})", estimator._component_obj) for idx, estimator in enumerate(estimators)]})
super().__init__(parameters=parameters,
component_obj=StackingClassifier(**sklearn_parameters),
random_state=random_state)

@property
def feature_importance(self):
raise NotImplementedError("feature_importance is not implemented for StackedEnsembleClassifier")
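The constructor's last step converts each evalml estimator into the `(name, object)` pairs that scikit-learn's `StackingClassifier` expects, suffixing an index so duplicate component names stay unique. A self-contained sketch of just that step, with `FakeEstimator` as a hypothetical stand-in for an evalml component:

```python
# Sketch of the name-mangling step above; only the tuple construction
# mirrors the real code, FakeEstimator is a hypothetical stand-in.

class FakeEstimator:
    def __init__(self, name, component_obj):
        self.name = name
        self._component_obj = component_obj


def to_sklearn_estimators(estimators):
    """Build the (unique_name, sklearn_obj) pairs StackingClassifier expects."""
    return [(estimator.name + f"({idx})", estimator._component_obj)
            for idx, estimator in enumerate(estimators)]


pairs = to_sklearn_estimators([FakeEstimator("Random Forest Classifier", object()),
                               FakeEstimator("Random Forest Classifier", object())])
# Two estimators of the same type get distinct names:
# "Random Forest Classifier(0)" and "Random Forest Classifier(1)"
```

Without the index suffix, scikit-learn would reject two base estimators of the same component type, since `StackingClassifier` requires unique estimator names.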
80 changes: 80 additions & 0 deletions evalml/pipelines/components/ensemble/stacked_ensemble_regressor.py
@@ -0,0 +1,80 @@
from sklearn.ensemble import StackingRegressor

from evalml.exceptions import EnsembleMissingEstimatorsError
from evalml.model_family import ModelFamily
from evalml.pipelines.components import LinearRegressor
from evalml.pipelines.components.ensemble import EnsembleBase
from evalml.problem_types import ProblemTypes
from evalml.utils.gen_utils import _nonstackable_model_families


class StackedEnsembleRegressor(EnsembleBase):
"""Stacked Ensemble Regressor."""
name = "Stacked Ensemble Regressor"
model_family = ModelFamily.ENSEMBLE
supported_problem_types = [ProblemTypes.REGRESSION]
hyperparameter_ranges = {}

def __init__(self, final_estimator=None, cv=None, n_jobs=-1, random_state=0, **kwargs):
"""Stacked ensemble regressor.

Arguments:
final_estimator (Estimator or subclass): The regressor used to combine the base estimators. If None, uses LinearRegressor.
cv (int, cross-validation generator or an iterable): Determines the cross-validation splitting strategy used to train final_estimator.
For int/None inputs, KFold is used.
Possible inputs for cv are:
- None: 5-fold cross validation
- int: the number of folds in a (Stratified) KFold
- A scikit-learn cross-validation generator object
- An iterable yielding (train, test) splits
n_jobs (int or None): Integer describing level of parallelism used for pipelines.
None and 1 are equivalent. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used.
random_state (int, np.random.RandomState): seed for the random number generator
**kwargs: 'estimators' (a list of Estimator objects) must be passed as a keyword argument; otherwise EnsembleMissingEstimatorsError is raised
"""
if 'estimators' not in kwargs:
raise EnsembleMissingEstimatorsError("`estimators` must be passed to the constructor as a keyword argument")
estimators = kwargs.get('estimators')
parameters = {
"estimators": estimators,
"final_estimator": final_estimator,
"cv": cv,
"n_jobs": n_jobs
}
contains_non_stackable = [estimator for estimator in estimators if estimator.model_family in _nonstackable_model_families]
if contains_non_stackable:
raise ValueError("Regressors with any of the following model families cannot be used as base estimators in StackedEnsembleRegressor: {}".format(_nonstackable_model_families))
sklearn_parameters = parameters.copy()
parameters.update(kwargs)
if final_estimator is None:
final_estimator = LinearRegressor()
sklearn_parameters.update({"final_estimator": final_estimator._component_obj})
sklearn_parameters.update({"estimators": [(estimator.name + f"({idx})", estimator._component_obj) for idx, estimator in enumerate(estimators)]})
super().__init__(parameters=parameters,
component_obj=StackingRegressor(**sklearn_parameters),
random_state=random_state)

@property
def feature_importance(self):
raise NotImplementedError("feature_importance is not implemented for StackedEnsembleRegressor")
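The `n_jobs` semantics described in both docstrings can be made concrete with a small helper (`effective_n_jobs` is illustrative only, not part of evalml or scikit-learn):

```python
# Hypothetical helper making the docstring's n_jobs rule concrete:
# -1 means all CPUs, and values below -1 use (n_cpus + 1 + n_jobs) CPUs.

def effective_n_jobs(n_jobs, n_cpus):
    if n_jobs is None or n_jobs == 1:
        return 1                      # None and 1 are equivalent
    if n_jobs == -1:
        return n_cpus                 # use all CPUs
    if n_jobs < -1:
        return n_cpus + 1 + n_jobs    # e.g. -2 -> all but one CPU
    return n_jobs                     # positive values are taken literally
```

So on an 8-core machine, `n_jobs=-2` leaves one core free and `n_jobs=-3` leaves two, which is why the commits above sweep -1 through -4 while debugging CI parallelism.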
54 changes: 42 additions & 12 deletions evalml/tests/component_tests/test_components.py
@@ -11,6 +11,7 @@

from evalml.exceptions import (
ComponentNotYetFittedError,
EnsembleMissingEstimatorsError,
MethodPropertyNotFoundError
)
from evalml.model_family import ModelFamily
@@ -36,6 +37,10 @@
Transformer,
XGBoostClassifier
)
from evalml.pipelines.components.ensemble import (
StackedEnsembleClassifier,
StackedEnsembleRegressor
)
from evalml.pipelines.components.utils import (
_all_estimators,
_all_estimators_used_in_search,
@@ -353,7 +358,11 @@ def test_component_parameters_getter(test_classes):
def test_component_parameters_init():
for component_class in all_components():
print('Testing component {}'.format(component_class.name))
component = component_class()
try:
component = component_class()
except EnsembleMissingEstimatorsError:
component = component_class(estimators=[])

parameters = component.parameters

component2 = component_class(**parameters)
@@ -403,7 +412,10 @@ def test_clone_fitted(X_y_binary):

def test_components_init_kwargs():
for component_class in all_components():
component = component_class()
try:
component = component_class()
except EnsembleMissingEstimatorsError:
continue
if component._component_obj is None:
continue

@@ -516,13 +528,12 @@ def test_estimator_predict_output_type(X_y_binary):
assert (predict_proba_output.columns == y_cols_expected).all()


@pytest.mark.parametrize("cls", all_components())
@pytest.mark.parametrize("cls", [cls for cls in all_components() if cls not in [StackedEnsembleRegressor, StackedEnsembleClassifier]])
def test_default_parameters(cls):

assert cls.default_parameters == cls().parameters, f"{cls.__name__}'s default parameters don't match __init__."


@pytest.mark.parametrize("cls", all_components())
@pytest.mark.parametrize("cls", [cls for cls in all_components() if cls not in [StackedEnsembleRegressor, StackedEnsembleClassifier]])
def test_default_parameters_raise_no_warnings(cls):
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
@@ -701,7 +712,8 @@ def test_all_transformers_check_fit(X_y_binary):

def test_all_estimators_check_fit(X_y_binary, test_estimator_needs_fitting_false):
X, y = X_y_binary
for component_class in _all_estimators() + [test_estimator_needs_fitting_false]:
estimators_to_check = [estimator for estimator in _all_estimators() if estimator not in [StackedEnsembleClassifier, StackedEnsembleRegressor]] + [test_estimator_needs_fitting_false]
for component_class in estimators_to_check:
if not component_class.needs_fitting:
continue

@@ -739,20 +751,38 @@ def test_no_fitting_required_components(X_y_binary, test_estimator_needs_fitting
def test_serialization(X_y_binary, tmpdir):
X, y = X_y_binary
path = os.path.join(str(tmpdir), 'component.pkl')

for component_class in all_components():
print('Testing serialization of component {}'.format(component_class.name))
try:
component = component_class()
except EnsembleMissingEstimatorsError:
if component_class == StackedEnsembleClassifier:
component = component_class(estimators=[RandomForestClassifier(n_estimators=2)])
elif component_class == StackedEnsembleRegressor:
component = component_class(estimators=[RandomForestRegressor(n_estimators=2)])

component.fit(X, y)

for pickle_protocol in range(cloudpickle.DEFAULT_PROTOCOL + 1):
component.save(path, pickle_protocol=pickle_protocol)
loaded_component = ComponentBase.load(path)
assert component.parameters == loaded_component.parameters
assert component.describe(return_dict=True) == loaded_component.describe(return_dict=True)
if issubclass(component_class, Estimator):
assert (component.feature_importance == loaded_component.feature_importance).all()
if isinstance(component, StackedEnsembleClassifier) or isinstance(component, StackedEnsembleRegressor):
# test all parameters except "estimators"
params_without_estimators = {key: value for key, value in component.parameters.items() if key != "estimators"}
loaded_params_without_estimators = {key: value for key, value in loaded_component.parameters.items() if key != "estimators"}
assert params_without_estimators == loaded_params_without_estimators
estimators = component.parameters['estimators']
loaded_estimators = loaded_component.parameters['estimators']
# test equality for each estimator in estimators
for est, loaded_est in zip(estimators, loaded_estimators):
assert est.parameters == loaded_est.parameters
assert est.describe(return_dict=True) == loaded_est.describe(return_dict=True)

else:
assert component.parameters == loaded_component.parameters
assert component.describe(return_dict=True) == loaded_component.describe(return_dict=True)
if issubclass(component_class, Estimator):
assert (component.feature_importance == loaded_component.feature_importance).all()


@patch('cloudpickle.dump')
2 changes: 1 addition & 1 deletion evalml/tests/component_tests/test_lgbm_classifier.py
@@ -2,7 +2,7 @@

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal, assert_series_equal
from pytest import importorskip

from evalml.model_family import ModelFamily