
Add ExtraTrees Pipeline #790

Merged
merged 38 commits into from May 29, 2020
Commits
b33bcb1
Add ExtraTrees as a ModelFamily
eccabay May 20, 2020
03b51d4
Add components for ExtraTrees classifier and regressor
eccabay May 20, 2020
b0c787d
Add Pipeline classes for ExtraTrees classification and regression
eccabay May 20, 2020
8e69c9d
Add ExtraTrees to init files
eccabay May 20, 2020
5050ffd
Modify current tests to include ExtraTrees pipeline and components
eccabay May 20, 2020
e4cc862
Merge branch 'master' into 771_extratrees_pipeline
eccabay May 20, 2020
1b077bf
Update ModelFamily and components tests to include ExtraTrees
eccabay May 20, 2020
77a71e5
Fix CatBoost learning rate bug - reorder _ALL_PIPELINES
eccabay May 21, 2020
30f6cfe
Add classification pipeline tests
eccabay May 21, 2020
6612f0b
Add ExtraTrees to the API reference
eccabay May 21, 2020
26fc87d
Fix pipeline tests to use correct estimators
eccabay May 21, 2020
6543fd4
Update changelog
eccabay May 21, 2020
af5f67c
Merge branch 'master' into 771_extratrees_pipeline
eccabay May 21, 2020
e44989e
Lint fixes
eccabay May 21, 2020
5dd78cc
More lint fixes
eccabay May 21, 2020
4207a8f
Merge branch 'master' into 771_extratrees_pipeline
eccabay May 21, 2020
23258d3
Merge branch 'master' into 771_extratrees_pipeline
eccabay May 22, 2020
07b15be
Remove explicit components/et_classifier feature_importances
eccabay May 22, 2020
e66bf80
Rename classification pipeline test_et to test_et_classification
eccabay May 26, 2020
131f541
Update ExtraTrees parameters to better reflect model
eccabay May 26, 2020
695ca3c
Remove ExtraTrees from automl (through _ALL_PIPELINES)
eccabay May 26, 2020
c7ef269
Preliminary ExtraTrees components tests
eccabay May 26, 2020
c26b2f4
Linting and a few test tweaks
eccabay May 27, 2020
af800f4
lint fixes
eccabay May 28, 2020
bb73610
Update et_regression pipeline tests to remove redundancy
eccabay May 28, 2020
ad65a49
Merge branch 'master' into 771_extratrees_pipeline
eccabay May 28, 2020
570d206
Update et_classification pipeline tests to remove redundancy
eccabay May 28, 2020
3b63c1e
Merge branch '771_extratrees_pipeline' of https://github.com/FeatureL…
eccabay May 28, 2020
c69cfac
Fix hidden failing components tests
eccabay May 28, 2020
bbd28a2
Fix hidden failing components tests
eccabay May 28, 2020
84f99ca
Merge branch '771_extratrees_pipeline' of https://github.com/FeatureL…
eccabay May 28, 2020
42f93d6
Expose max_depth as a hyperparameter
eccabay May 29, 2020
0280854
Remove feature selection from extratrees
eccabay May 29, 2020
34be3a2
Merge branch 'master' into 771_extratrees_pipeline
eccabay May 29, 2020
817395a
Split up et_classifier pipeline tests
eccabay May 29, 2020
8c03017
Merge branch '771_extratrees_pipeline' of https://github.com/FeatureL…
eccabay May 29, 2020
ea6d4f1
remove unnecessary test
eccabay May 29, 2020
94759e8
lint fix
eccabay May 29, 2020
5 changes: 5 additions & 0 deletions docs/source/api_reference.rst
@@ -76,6 +76,8 @@ Classification Pipelines

CatBoostBinaryClassificationPipeline
CatBoostMulticlassClassificationPipeline
ETBinaryClassificationPipeline
ETMulticlassClassificationPipeline
LogisticRegressionBinaryPipeline
LogisticRegressionMulticlassPipeline
RFBinaryClassificationPipeline
@@ -98,6 +100,7 @@ Regression Pipelines

RFRegressionPipeline
CatBoostRegressionPipeline
ETRegressionPipeline
LinearRegressionPipeline
XGBoostRegressionPipeline
BaselineRegressionPipeline
@@ -178,6 +181,7 @@ Classifiers are components that output a predicted class label.
:nosignatures:

CatBoostClassifier
ExtraTreesClassifier
RandomForestClassifier
LogisticRegressionClassifier
XGBoostClassifier
@@ -195,6 +199,7 @@ Regressors are components that output a predicted target value.

CatBoostRegressor
LinearRegressor
ExtraTreesRegressor
RandomForestRegressor
XGBoostRegressor
BaselineRegressor
2 changes: 2 additions & 0 deletions docs/source/changelog.rst
@@ -7,6 +7,7 @@ Changelog
* Added baseline models for classification and regression, add functionality to calculate baseline models before searching in AutoML :pr:`746`
* Port over highly-null guardrail as a data check and define `DefaultDataChecks` and `DisableDataChecks` classes :pr:`745`
* Update `Tuner` classes to work directly with pipeline parameters dicts instead of flat parameter lists :pr:`779`
* Added new Pipeline option `ExtraTrees` :pr:`790`
* Added precision-recall curve metrics and plot for binary classification problems in `evalml.pipeline.graph_utils` :pr:`794`
* Fixes
* Update pipeline `score` to return `nan` score for any objective which throws an exception during scoring :pr:`787`
@@ -36,6 +37,7 @@ Changelog
* Added unit tests for fraud cost, lead scoring, and standard metric objectives :pr:`741`
* Update codecov client :pr:`782`
* Updated AutoBase __str__ test to include no parameters case :pr:`783`
* Added unit tests for `ExtraTrees` pipeline :pr:`790`
* If codecov fails to upload, fail build :pr:`810`
* Updated Python version of dependency action :pr:`816`
* Update the dependency update bot to use a suffix when creating branches :pr:`817`
2 changes: 2 additions & 0 deletions evalml/model_family/model_family.py
@@ -7,6 +7,7 @@ class ModelFamily(Enum):
XGBOOST = 'xgboost'
LINEAR_MODEL = 'linear_model'
CATBOOST = 'catboost'
EXTRA_TREES = 'extra_trees'
Contributor:

Yep. Makes sense to have a separate family here for now. In the new automl algo this will mean the first round runs both RF and extra trees. Down the road we may want to group tree-based models differently, but this is a great starting point. @jeremyliweishih

BASELINE = 'baseline'
NONE = 'none'

@@ -15,6 +16,7 @@ def __str__(self):
ModelFamily.XGBOOST.name: "XGBoost",
ModelFamily.LINEAR_MODEL.name: "Linear",
ModelFamily.CATBOOST.name: "CatBoost",
ModelFamily.EXTRA_TREES.name: "Extra Trees",
ModelFamily.BASELINE.name: "Baseline",
ModelFamily.NONE.name: "None"}
return model_family_dict[self.name]
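The `__str__` pattern used above — an `Enum` whose display names live in a dict keyed by member name — can be sketched in a few self-contained lines (illustrative only; member names other than `EXTRA_TREES` are trimmed here):

```python
from enum import Enum


class ModelFamily(Enum):
    EXTRA_TREES = "extra_trees"
    BASELINE = "baseline"

    def __str__(self):
        # Map enum member names to human-readable display strings,
        # mirroring the model_family_dict approach in the diff above.
        names = {"EXTRA_TREES": "Extra Trees", "BASELINE": "Baseline"}
        return names[self.name]


print(str(ModelFamily.EXTRA_TREES))  # prints "Extra Trees"
```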
7 changes: 6 additions & 1 deletion evalml/pipelines/__init__.py
@@ -16,7 +16,9 @@
RFClassifierSelectFromModel,
RFRegressorSelectFromModel,
CatBoostClassifier,
CatBoostRegressor
CatBoostRegressor,
ExtraTreesClassifier,
ExtraTreesRegressor
)

from .pipeline_base import PipelineBase
@@ -28,6 +30,8 @@
from .classification import (
CatBoostBinaryClassificationPipeline,
CatBoostMulticlassClassificationPipeline,
ETBinaryClassificationPipeline,
ETMulticlassClassificationPipeline,
LogisticRegressionBinaryPipeline,
LogisticRegressionMulticlassPipeline,
RFBinaryClassificationPipeline,
@@ -45,6 +49,7 @@
RFRegressionPipeline,
CatBoostRegressionPipeline,
XGBoostRegressionPipeline,
ETRegressionPipeline,
BaselineRegressionPipeline,
MeanBaselineRegressionPipeline
)
2 changes: 2 additions & 0 deletions evalml/pipelines/classification/__init__.py
@@ -7,5 +7,7 @@
from .catboost_multiclass import CatBoostMulticlassClassificationPipeline
from .random_forest_binary import RFBinaryClassificationPipeline
from .random_forest_multiclass import RFMulticlassClassificationPipeline
from .extra_trees_binary import ETBinaryClassificationPipeline
from .extra_trees_multiclass import ETMulticlassClassificationPipeline
from .baseline_binary import BaselineBinaryPipeline, ModeBaselineBinaryPipeline
from .baseline_multiclass import BaselineMulticlassPipeline, ModeBaselineMulticlassPipeline
7 changes: 7 additions & 0 deletions evalml/pipelines/classification/extra_trees_binary.py
@@ -0,0 +1,7 @@
from evalml.pipelines import BinaryClassificationPipeline


class ETBinaryClassificationPipeline(BinaryClassificationPipeline):
"""Extra Trees Pipeline for binary classification"""
custom_name = "Extra Trees Binary Classification Pipeline"
component_graph = ['One Hot Encoder', 'Simple Imputer', 'Extra Trees Classifier']
7 changes: 7 additions & 0 deletions evalml/pipelines/classification/extra_trees_multiclass.py
@@ -0,0 +1,7 @@
from evalml.pipelines import MulticlassClassificationPipeline


class ETMulticlassClassificationPipeline(MulticlassClassificationPipeline):
"""Extra Trees Pipeline for multiclass classification"""
custom_name = "Extra Trees Multiclass Classification Pipeline"
component_graph = ['One Hot Encoder', 'Simple Imputer', 'Extra Trees Classifier']
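For readers unfamiliar with evalml's `component_graph` convention: the three-step graph above corresponds roughly to a plain scikit-learn pipeline, since the evalml components wrap the sklearn estimators of the same names. A minimal sketch (assuming scikit-learn and NumPy are installed; the one-hot encoding step is omitted because the toy data here is already numeric):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import ExtraTreesClassifier

# Rough sklearn analogue of the evalml component graph
# ['One Hot Encoder', 'Simple Imputer', 'Extra Trees Classifier'].
pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("clf", ExtraTreesClassifier(n_estimators=100, max_depth=6, random_state=0)),
])

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array([0, 1] * 10)

pipeline.fit(X, y)
preds = pipeline.predict(X)
```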
2 changes: 2 additions & 0 deletions evalml/pipelines/components/__init__.py
@@ -8,6 +8,8 @@
RandomForestRegressor,
XGBoostClassifier,
CatBoostClassifier,
ExtraTreesClassifier,
ExtraTreesRegressor,
CatBoostRegressor,
XGBoostRegressor,
BaselineClassifier,
2 changes: 2 additions & 0 deletions evalml/pipelines/components/estimators/__init__.py
@@ -4,9 +4,11 @@
RandomForestClassifier,
XGBoostClassifier,
CatBoostClassifier,
ExtraTreesClassifier,
BaselineClassifier)
from .regressors import (LinearRegressor,
RandomForestRegressor,
CatBoostRegressor,
XGBoostRegressor,
ExtraTreesRegressor,
BaselineRegressor)
@@ -3,4 +3,5 @@
from .rf_classifier import RandomForestClassifier
from .xgboost_classifier import XGBoostClassifier
from .catboost_classifier import CatBoostClassifier
from .et_classifier import ExtraTreesClassifier
from .baseline_classifier import BaselineClassifier
@@ -0,0 +1,40 @@
from sklearn.ensemble import ExtraTreesClassifier as SKExtraTreesClassifier
from skopt.space import Integer

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes


class ExtraTreesClassifier(Estimator):
"""Extra Trees Classifier"""
name = "Extra Trees Classifier"
hyperparameter_ranges = {
"n_estimators": Integer(10, 1000),
"max_features": ["auto", "sqrt", "log2"],
"max_depth": Integer(4, 10)
}
model_family = ModelFamily.EXTRA_TREES
supported_problem_types = [ProblemTypes.BINARY, ProblemTypes.MULTICLASS]

def __init__(self,
n_estimators=100,
max_features="auto",
max_depth=6,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_jobs=-1,
random_state=0):
parameters = {"n_estimators": n_estimators,
"max_features": max_features,
"max_depth": max_depth}
et_classifier = SKExtraTreesClassifier(n_estimators=n_estimators,
max_features=max_features,
max_depth=max_depth,
min_samples_split=min_samples_split,
min_weight_fraction_leaf=min_weight_fraction_leaf,
n_jobs=n_jobs,
random_state=random_state)
super().__init__(parameters=parameters,
component_obj=et_classifier,
random_state=random_state)
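Note that only `n_estimators`, `max_features`, and `max_depth` are recorded in the component's `parameters` dict (and exposed in `hyperparameter_ranges`); `min_samples_split`, `min_weight_fraction_leaf`, and `n_jobs` are forwarded to scikit-learn but not surfaced for tuning. A sketch of what the wrapped estimator receives — using `"sqrt"` rather than the then-default `"auto"`, which later scikit-learn releases removed:

```python
from sklearn.ensemble import ExtraTreesClassifier

# Mirror the constructor call the component makes to sklearn.
est = ExtraTreesClassifier(
    n_estimators=100,
    max_features="sqrt",   # "auto" was removed in scikit-learn 1.3
    max_depth=6,
    min_samples_split=2,
    min_weight_fraction_leaf=0.0,
    n_jobs=-1,
    random_state=0,
)

# Only this tunable subset would appear in the component's `parameters`.
tunable = {k: est.get_params()[k]
           for k in ("n_estimators", "max_features", "max_depth")}
```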
@@ -3,4 +3,5 @@
from .rf_regressor import RandomForestRegressor
from .catboost_regressor import CatBoostRegressor
from .xgboost_regressor import XGBoostRegressor
from .et_regressor import ExtraTreesRegressor
from .baseline_regressor import BaselineRegressor
40 changes: 40 additions & 0 deletions evalml/pipelines/components/estimators/regressors/et_regressor.py
@@ -0,0 +1,40 @@
from sklearn.ensemble import ExtraTreesRegressor as SKExtraTreesRegressor
from skopt.space import Integer

from evalml.model_family import ModelFamily
from evalml.pipelines.components.estimators import Estimator
from evalml.problem_types import ProblemTypes


class ExtraTreesRegressor(Estimator):
"""Extra Trees Regressor"""
name = "Extra Trees Regressor"
hyperparameter_ranges = {
"n_estimators": Integer(10, 1000),
"max_features": ["auto", "sqrt", "log2"],
"max_depth": Integer(4, 10)
}
model_family = ModelFamily.EXTRA_TREES
supported_problem_types = [ProblemTypes.REGRESSION]

def __init__(self,
n_estimators=100,
max_features="auto",
max_depth=6,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_jobs=-1,
random_state=0):
parameters = {"n_estimators": n_estimators,
"max_features": max_features,
"max_depth": max_depth}
et_regressor = SKExtraTreesRegressor(random_state=random_state,
n_estimators=n_estimators,
max_features=max_features,
max_depth=max_depth,
min_samples_split=min_samples_split,
min_weight_fraction_leaf=min_weight_fraction_leaf,
n_jobs=n_jobs)
super().__init__(parameters=parameters,
component_obj=et_regressor,
random_state=random_state)
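The regressor mirrors the classifier exactly, differing only in the wrapped sklearn class and supported problem type. A quick fit/predict sketch against the underlying estimator with the component's defaults (assuming scikit-learn and NumPy; `max_features` is left at its sklearn default since the `"auto"` value used here was later removed upstream):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)
X = rng.rand(50, 4)
y = X @ np.array([1.0, 2.0, 0.0, -1.0])  # simple linear target

# Same n_estimators / max_depth defaults the component passes through.
reg = ExtraTreesRegressor(n_estimators=100, max_depth=6, random_state=0)
reg.fit(X, y)
preds = reg.predict(X)
```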
2 changes: 2 additions & 0 deletions evalml/pipelines/components/utils.py
@@ -10,6 +10,8 @@
CatBoostClassifier,
CatBoostRegressor,
Estimator,
ExtraTreesClassifier,
ExtraTreesRegressor,
LinearRegressor,
LogisticRegressionClassifier,
RandomForestClassifier,
1 change: 1 addition & 0 deletions evalml/pipelines/regression/__init__.py
@@ -3,4 +3,5 @@
from .random_forest import RFRegressionPipeline
from .catboost import CatBoostRegressionPipeline
from .xgboost_regression import XGBoostRegressionPipeline
from .extra_trees import ETRegressionPipeline
from .baseline_regression import BaselineRegressionPipeline, MeanBaselineRegressionPipeline
7 changes: 7 additions & 0 deletions evalml/pipelines/regression/extra_trees.py
@@ -0,0 +1,7 @@
from evalml.pipelines import RegressionPipeline


class ETRegressionPipeline(RegressionPipeline):
"""Extra Trees Pipeline for regression problems"""
custom_name = "Extra Trees Regression Pipeline"
component_graph = ['One Hot Encoder', 'Simple Imputer', 'Extra Trees Regressor']
6 changes: 6 additions & 0 deletions evalml/tests/component_tests/test_components.py
@@ -6,6 +6,8 @@
from evalml.pipelines.components import (
ComponentBase,
Estimator,
ExtraTreesClassifier,
ExtraTreesRegressor,
LinearRegressor,
LogisticRegressionClassifier,
OneHotEncoder,
@@ -66,10 +68,14 @@ def test_describe_component():

# testing estimators
lr_classifier = LogisticRegressionClassifier()
et_classifier = ExtraTreesClassifier(n_estimators=10, max_features="auto")
et_regressor = ExtraTreesRegressor(n_estimators=10, max_features="auto")
rf_classifier = RandomForestClassifier(n_estimators=10, max_depth=3)
rf_regressor = RandomForestRegressor(n_estimators=10, max_depth=3)
linear_regressor = LinearRegressor()
assert lr_classifier.describe(return_dict=True) == {'name': 'Logistic Regression Classifier', 'parameters': {'C': 1.0, 'penalty': 'l2'}}
assert et_classifier.describe(return_dict=True) == {'name': 'Extra Trees Classifier', 'parameters': {'max_depth': 6, 'max_features': "auto", 'n_estimators': 10}}
assert et_regressor.describe(return_dict=True) == {'name': 'Extra Trees Regressor', 'parameters': {'max_depth': 6, 'max_features': "auto", 'n_estimators': 10}}
assert rf_classifier.describe(return_dict=True) == {'name': 'Random Forest Classifier', 'parameters': {'max_depth': 3, 'n_estimators': 10}}
assert rf_regressor.describe(return_dict=True) == {'name': 'Random Forest Regressor', 'parameters': {'max_depth': 3, 'n_estimators': 10}}
assert linear_regressor.describe(return_dict=True) == {'name': 'Linear Regressor', 'parameters': {'fit_intercept': True, 'normalize': False}}
82 changes: 82 additions & 0 deletions evalml/tests/component_tests/test_et_classifier.py
@@ -0,0 +1,82 @@
import numpy as np
import pytest
from sklearn.ensemble import ExtraTreesClassifier as SKExtraTreesClassifier

from evalml.exceptions import MethodPropertyNotFoundError
from evalml.model_family import ModelFamily
from evalml.pipelines import ExtraTreesClassifier
from evalml.problem_types import ProblemTypes


def test_model_family():
assert ExtraTreesClassifier.model_family == ModelFamily.EXTRA_TREES


def test_problem_types():
assert ProblemTypes.BINARY in ExtraTreesClassifier.supported_problem_types
assert ProblemTypes.MULTICLASS in ExtraTreesClassifier.supported_problem_types
assert len(ExtraTreesClassifier.supported_problem_types) == 2


def test_et_parameters():

clf = ExtraTreesClassifier(n_estimators=20, max_features="auto", max_depth=5, random_state=2)
expected_parameters = {
"n_estimators": 20,
"max_features": "auto",
"max_depth": 5
}

assert clf.parameters == expected_parameters
Contributor:

👍

def test_fit_predict_binary(X_y):
X, y = X_y

sk_clf = SKExtraTreesClassifier(max_depth=6, random_state=0)
sk_clf.fit(X, y)
y_pred_sk = sk_clf.predict(X)
y_pred_proba_sk = sk_clf.predict_proba(X)

clf = ExtraTreesClassifier()
clf.fit(X, y)
y_pred = clf.predict(X)
y_pred_proba = clf.predict_proba(X)

np.testing.assert_almost_equal(y_pred, y_pred_sk, decimal=5)
np.testing.assert_almost_equal(y_pred_proba, y_pred_proba_sk, decimal=5)


def test_fit_predict_multi(X_y_multi):
X, y = X_y_multi

sk_clf = SKExtraTreesClassifier(max_depth=6, random_state=0)
sk_clf.fit(X, y)
y_pred_sk = sk_clf.predict(X)
y_pred_proba_sk = sk_clf.predict_proba(X)

clf = ExtraTreesClassifier()
clf.fit(X, y)
y_pred = clf.predict(X)
y_pred_proba = clf.predict_proba(X)

np.testing.assert_almost_equal(y_pred, y_pred_sk, decimal=5)
np.testing.assert_almost_equal(y_pred_proba, y_pred_proba_sk, decimal=5)


def test_feature_importances(X_y):
X, y = X_y

# testing that feature importances can't be called before fit
clf = ExtraTreesClassifier()
with pytest.raises(MethodPropertyNotFoundError):
feature_importances = clf.feature_importances

sk_clf = SKExtraTreesClassifier(max_depth=6, random_state=0)
sk_clf.fit(X, y)
sk_feature_importances = sk_clf.feature_importances_

clf.fit(X, y)
feature_importances = clf.feature_importances

np.testing.assert_almost_equal(sk_feature_importances, feature_importances, decimal=5)
Contributor:
These tests are great!

One thing which you can add: check that calling estimator.parameters on the instance returns what you'd expect. This is partially covered in the components tests, but it would be helpful to have direct coverage here.

Contributor:

And same for the regressor of course lol