Stacked ensemble performing poorly #2093

Closed
dancuarini opened this issue Apr 6, 2021 · 11 comments · Fixed by #2260
Labels: bug (Issues tracking problems with existing features), performance (Issues tracking performance improvements)


dancuarini commented Apr 6, 2021

Steps to reproduce:

  1. Load the Happiness dataset into evalml.
  2. Run long enough to include ensembling.
  3. The baseline regressor shows as ranked higher than the stacked regressor.

Attachment: Happiness Data Full Set.csv.zip
@dsherry dsherry added the bug Issues tracking problems with existing features. label Apr 6, 2021
@dsherry dsherry changed the title Baseline regressor is performing better than the stacked regressor model Stacked ensemble shown as performing poorly Apr 6, 2021
@dsherry dsherry changed the title Stacked ensemble shown as performing poorly Stacked ensemble performing poorly Apr 6, 2021
@dsherry dsherry added the performance Issues tracking performance improvements. label Apr 6, 2021

angela97lin commented Apr 12, 2021

@dancuarini I tried to reproduce this locally but wasn't able to; it could be because of additional steps taken before running AutoMLSearch (e.g., data split size, dropping columns). Let's talk about the problem configuration!

Here's what I tried to run locally:

import evalml  # needed for evalml.preprocessing.split_data
from evalml.automl import AutoMLSearch
import pandas as pd
import woodwork as ww
from evalml.automl.callbacks import raise_error_callback

happiness_data_set = pd.read_csv("Happiness Data Full Set.csv")
y = happiness_data_set['Happiness']
X = happiness_data_set.drop(['Happiness'], axis=1)
# display(X.head())

X = ww.DataTable(X)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='regression', test_size=0.2, random_seed=0)
# print(X.types)

automl = AutoMLSearch(X, y, problem_type="regression", objective="MAE", error_callback=raise_error_callback, max_batches=20, ensembling=True)
automl.search()

This results in the following rankings:

[Screenshot: rankings table]

@angela97lin

Current progress: discussed with @dancuarini about not being able to repro locally; will keep in touch with @Cmancuso about repro-ing and next steps.


dsherry commented Apr 13, 2021

@angela97lin wait, are you sure you couldn't repro this? Here the stacked ensembler shows up in the middle of the rankings--I'd expect it to be at the top!

Thanks for sharing the reproducer :)

@angela97lin

@dsherry While it is a little suspicious that the stacked ensembler isn't at the top, the original issue was that the stacked ensembler was performing so poorly that it was ranked below the baseline regressor!


dsherry commented Apr 13, 2021

@angela97lin ah yes understood! I sent you some notes.

I think any evidence that our ensembles aren't always close to the top is a problem.


angela97lin commented Apr 15, 2021

Dug into this a bit more. I think there are some potential reasons why the ensembler performs poorly with this data set:

  1. The dataset is really small, and our current data splitting strategy means that the ensembler is trained and validated on a very small subset of the data. Right now, if we want to train a stacked ensembler, we split off some data (identified by ensembling_indices) for the ensembler to train on. This is to prevent overfitting the ensembler by training the metalearner on the same data that the input pipelines were already trained on. We then do one CV split, further splitting the data from the ensembling_indices. For this dataset of 128 rows, we train and validate on 17 and 8 rows, respectively (the split arithmetic is sketched below, after this list). I filed CV fold for ensembler after ensembling_indices split #2144 to discuss whether we want to do this additional CV split.

  2. Our ensembler is currently constructed by taking the best pipeline of each model family found and using that as the input pipelines for the stacked ensembler. However, if some of the input pipelines perform quite poorly, then the stacked ensembler may not perform as well as a high-performing individual pipeline.

For example, this is the final rankings table:
[Screenshot: final rankings table]

We notice that the stacked ensemble lands right in the middle--if we simplify and say that the stacked ensemble averages the predictions of its input pipelines, this makes sense. To test my hypothesis, I used only the model families that performed better than the stacked ensembler, rather than all of the model families, and the resulting score is much better than any individual pipeline's. This leads me to believe that the poorly performing individual pipelines dragged the stacked ensembler's score down.
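Here is a rough sketch of the split arithmetic from point 1 above (the 20% ensembling fraction and the single train/validation split inside it are assumptions chosen to make the numbers concrete; the exact fractions come from evalml's splitter defaults):

n_rows = 128                                 # rows in the happiness dataset
n_ensembling = int(n_rows * 0.2)             # ~25 rows reserved via ensembling_indices
n_validation = n_ensembling // 3             # ~8 rows left to score the ensembler
n_train = n_ensembling - n_validation        # ~17 rows to train the metalearner
print(n_ensembling, n_train, n_validation)   # 25 17 8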

Here's the repro code for testing this hypothesis:

From above:

import evalml
import pandas as pd
import woodwork as ww
from evalml.automl import AutoMLSearch
from evalml.automl.callbacks import raise_error_callback

happiness_data_set = pd.read_csv("Happiness Data Full Set.csv")
y = happiness_data_set['Happiness']
X = happiness_data_set.drop(['Happiness'], axis=1)

X = ww.DataTable(X)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='regression', test_size=0.25, random_seed=0)

automl = AutoMLSearch(X, y, problem_type="regression", objective="MAE", error_callback=raise_error_callback, max_batches=10, ensembling=True)
automl.search()

from evalml.automl.engine import train_and_score_pipeline
from evalml.automl.engine.engine_base import JobLogger
from evalml.model_family import ModelFamily
# _make_stacked_ensemble_pipeline is a private helper; its location may vary by version
from evalml.pipelines.utils import _make_stacked_ensemble_pipeline

# Get the pipelines fed into the ensemble, but only keep the model families that
# performed better than the stacked ensemble (Random Forest, XGBoost, Extra Trees)
input_pipelines = []
input_info = automl._automl_algorithm._best_pipeline_info

trimmed = dict()
trimmed.update({ModelFamily.RANDOM_FOREST: input_info[ModelFamily.RANDOM_FOREST]})
trimmed.update({ModelFamily.XGBOOST: input_info[ModelFamily.XGBOOST]})
trimmed.update({ModelFamily.EXTRA_TREES: input_info[ModelFamily.EXTRA_TREES]})

for pipeline_dict in trimmed.values():
    pipeline_class = pipeline_dict['pipeline_class']
    pipeline_params = pipeline_dict['parameters']
    input_pipelines.append(pipeline_class(parameters=automl._automl_algorithm._transform_parameters(pipeline_class, pipeline_params),
                                          random_seed=automl._automl_algorithm.random_seed))

# Build a stacked ensemble from just those pipelines and score it on the ensembling split
ensemble_pipeline = _make_stacked_ensemble_pipeline(input_pipelines, "regression")
X_train = X.iloc[automl.ensembling_indices]
y_train = ww.DataColumn(y.iloc[automl.ensembling_indices])
train_and_score_pipeline(ensemble_pipeline, automl.automl_config, X_train, y_train, JobLogger())

By just using these three model families, we get a MAE score of ~0.22, which is much better than any individual pipeline.

#output of train_and_score_pipeline(ensemble_pipeline, automl.automl_config, X_train, y_train, JobLogger())
{'scores': {'cv_data': [{'all_objective_scores': OrderedDict([('MAE',
                  0.22281276417465426),
                 ('ExpVariance', 0.9578811127332543),
                 ('MaxError', 0.3858477236606914),
                 ('MedianAE', 0.2790362808260225),
                 ('MSE', 0.0642654425375983),
                 ('R2', 0.9152119239698017),
                 ('Root Mean Squared Error', 0.2535062968401343),
                 ('# Training', 17),
                 ('# Validation', 9)]),
    'mean_cv_score': 0.22281276417465426,
    'binary_classification_threshold': None}],
  'training_time': 9.944366216659546,
  'cv_scores': 0    0.222813
  dtype: float64,
  'cv_score_mean': 0.22281276417465426},
 'pipeline': TemplatedPipeline(parameters={'Stacked Ensemble Regressor':{'input_pipelines': [GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Random Forest Regressor':{'n_estimators': 184, 'max_depth': 25, 'n_jobs': -1},}), GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'XGBoost Regressor':{'eta': 0.1, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 100},}), GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Extra Trees Regressor':{'n_estimators': 100, 'max_features': 'auto', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1},})], 'final_estimator': None, 'cv': None, 'n_jobs': -1},}),

This makes me wonder whether we need to rethink what input pipelines we should feed to our stacked ensembler.

  3. The metalearner we're using (LinearRegressor) is not the best. I tested this via the stacking_test branch I created, where I updated the default metalearner to RidgeCV (the scikit-learn default, which we don't have in EvalML), and the ensembler performs much better:
    [Screenshot: rankings table with the RidgeCV metalearner]
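A minimal sketch of the idea in plain scikit-learn rather than evalml's wrapper (the base estimators here are placeholders; the point is that StackingRegressor's final_estimator is the metalearner being swapped):

from sklearn.ensemble import StackingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression, ElasticNetCV

base_estimators = [
    ("rf", RandomForestRegressor(random_state=0)),
    ("et", ExtraTreesRegressor(random_state=0)),
]

# Current behavior: a plain, unregularized linear metalearner.
stack_linear = StackingRegressor(estimators=base_estimators, final_estimator=LinearRegression(), cv=3)

# Proposed: a regularized metalearner (RidgeCV is scikit-learn's default when
# final_estimator=None; ElasticNetCV is another regularized option discussed below).
stack_regularized = StackingRegressor(estimators=base_estimators, final_estimator=ElasticNetCV(), cv=3)

# stack_regularized.fit(X_train, y_train); stack_regularized.score(X_holdout, y_holdout)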

@angela97lin

Next steps after discussion with @dsherry:

Try #1 and #3 (using Elastic Net) on other datasets, run perf tests, and see if we can get better performance overall.


rpeck commented Apr 26, 2021

@angela97lin Your points about the splitting, for tiny datasets, are right on target. Eventually, we need to handle tiny datasets really differently than bigger ones, e.g. by only using high-fold-count xval on the entire dataset, even LOOCV, and making sure we construct the folds differently for the ensemble metalearner training.

I also agree that the metalearner needs to use strong regularization. I used Elastic Net in H2O-3 StackedEnsemble, and only remember one time that the ensemble came in second in the leaderboard. Every other time I tested, it was first. The regularization should never allow poor models to bring down the performance of the ensemble.

And this was feeding the entire leaderboard of even 50 models into the metalearner. :-)


angela97lin commented Apr 28, 2021

Just posting some extra updates on this:

Tested locally using all of the regression datasets. Results can be found here or just the charts here.

From this:

  • Agreed @rpeck! We should update the metalearner to use strong regularization for sure. ElasticNetCV seemed to perform better than our LinearRegressor on many datasets. This issue tracks this: Ability to use more meta-learner models for stacked ensembles #1739
  • @dsherry and I rediscussed our data splitting strategy. Right now, we split off data for the ensemble, under the assumption that the metalearner will be trained on these ensembling indices. With the scikit-learn implementation, when we train our StackedEnsembler on this ensembling split, we end up training both the input pipelines and the metalearner on this small set of data. This could likely be why we are not performing well. While the parameters for our input pipelines come from tuning on the rest of the data, the pipelines themselves are not fitted. In the long term, rolling our own implementation could allow us to pass trained pipelines into the ensembler, in which case we would have the behavior we want. For now, that is not the case.

Next step: test this hypothesis with the ensembler manually. Train the input pipelines on 80% of the data, create cross-validated predictions on the data set aside for ensembling, and train the metalearner on those held-out predictions (a rough sketch of this follows below).
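A rough sketch of that experiment in plain scikit-learn (not evalml's API; X_train/y_train, X_ensembling/y_ensembling, and X_holdout are assumed to come from earlier splits, the base models are placeholders, and the cross-validation detail is simplified to a single held-out split):

import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import ElasticNetCV

# Placeholder base models standing in for the tuned input pipelines.
base_models = [RandomForestRegressor(random_state=0), ExtraTreesRegressor(random_state=0)]

# 1. Train the input models on the larger (e.g. 80%) training split.
for model in base_models:
    model.fit(X_train, y_train)

# 2. Generate predictions on the rows set aside for ensembling; the base models never
#    saw these rows, so these are genuinely held-out predictions.
held_out_preds = np.column_stack([model.predict(X_ensembling) for model in base_models])

# 3. Train the metalearner on those held-out predictions only.
metalearner = ElasticNetCV().fit(held_out_preds, y_ensembling)

# At predict time: stack the base models' predictions and feed them to the metalearner.
ensemble_preds = metalearner.predict(np.column_stack([m.predict(X_holdout) for m in base_models]))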

@angela97lin

Results from experimentation look good: https://alteryx.quip.com/4hEyAaTBZDap/Ensembling-Performance-Using-More-Data

Next steps:

@angela97lin

After some digging around, we believe that the issue is not with how the ensemble performs, but rather with how we report the ensemble's performance. Currently, we do a separate ensembling split of 20% of the data, then do another train-validation split within it, and report the ensemble's score on that validation data. This means that in some cases the ensemble score is calculated on a very small number of rows (as with the happiness dataset above).

By removing the ensembling indices split and using our old method of calculating the CV training score for the ensemble (give it all the data, train and validate on one fold), we see that the ensemble is ranked higher in almost all cases and comes up as #1 in many more cases. Meanwhile, the validation score is the same or slightly better.
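As a back-of-the-envelope illustration of why the reported ensemble score was so noisy before (the 20% ensembling fraction and a one-third validation fold are assumptions for concreteness):

n_rows = 128
old_validation_rows = int(n_rows * 0.2) // 3   # a fold inside the reserved ensembling split: ~8 rows
new_validation_rows = n_rows // 3              # a fold over all the data given to AutoMLSearch: ~42 rows
print(old_validation_rows, new_validation_rows)  # 8 42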

Note that since we don't do any hyperparameter tuning, the input pipelines are not trained, and the ensemble only gets the out-of-fold predictions of the input pipelines as input, overfitting is not an issue. We can revisit implementing our own ensemble and updating the splitting strategy then, but for now, we're able to see improvements just by changing the data split strategy while keeping scikit-learn's implementation.

Note that this will cause an increase in fit time when ensembling is enabled: all pipelines see more data (no reserved ensemble indices), and the ensemble is trained on more data. I think this is fine.

Results tabulated here: https://alteryx.quip.com/jI2mArnWZfTU/Ensembling-vs-Best-Pipeline-Validation-Scores#MKWACADlCDt
