Stacked ensemble performing poorly #2093

Closed
dancuarini opened this issue Apr 6, 2021 · 11 comments · Fixed by #2260
Labels: bug (Issues tracking problems with existing features), performance (Issues tracking performance improvements)


dancuarini commented Apr 6, 2021

Steps to reproduce:

  1. Load the Happiness dataset into evalml.
  2. Run long enough to include ensembling.
  3. The baseline regressor shows as ranked higher than the stacked regressor.

Attachment: Happiness Data Full Set.csv.zip
@dsherry dsherry added the bug Issues tracking problems with existing features. label Apr 6, 2021
@dsherry dsherry changed the title Baseline regressor is performing better than the stacked regressor model Stacked ensemble shown as performing poorly Apr 6, 2021
@dsherry dsherry changed the title Stacked ensemble shown as performing poorly Stacked ensemble performing poorly Apr 6, 2021
@dsherry dsherry added the performance Issues tracking performance improvements. label Apr 6, 2021

angela97lin commented Apr 12, 2021

@dancuarini I tried to reproduce this locally but wasn't able to; it could be because of additional steps taken before running AutoMLSearch (e.g., data split size, dropping columns). Let's talk about the problem configuration!

Here's what I tried to run locally:

import evalml  # needed for evalml.preprocessing.split_data
from evalml.automl import AutoMLSearch
import pandas as pd
import woodwork as ww
from evalml.automl.callbacks import raise_error_callback

happiness_data_set = pd.read_csv("Happiness Data Full Set.csv")
y = happiness_data_set['Happiness']
X = happiness_data_set.drop(['Happiness'], axis=1)
# display(X.head())

X = ww.DataTable(X)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='regression', test_size=0.2, random_seed=0)
# print(X.types)

automl = AutoMLSearch(X, y, problem_type="regression", objective="MAE", error_callback=raise_error_callback, max_batches=20, ensembling=True)
automl.search()

This results in the following rankings:

[Screenshot: rankings table]

@angela97lin

Current progress: discussed with @dancuarini about not being able to repro locally; will keep in touch with @Cmancuso about repro-ing and next steps.


dsherry commented Apr 13, 2021

@angela97lin wait, are you sure you couldn't repro this? Here the stacked ensembler shows up in the middle of the rankings--I'd expect it to be at the top!

Thanks for sharing the reproducer :)

@angela97lin

@dsherry While it is a little suspicious that the stacked ensembler isn't at the top, the original issue was that the stacked ensembler was performing so poorly that it was ranked below the baseline regressor!


dsherry commented Apr 13, 2021

@angela97lin ah yes understood! I sent you some notes.

I think any evidence that our ensembles aren't always close to the top is a problem.


angela97lin commented Apr 15, 2021

Dug into this a bit more. I think there are some potential reasons why the ensembler performs poorly with this data set:

  1. The dataset is really small, and our current data splitting strategy means that the ensembler is trained and validated on a very small subset of the data. Right now, if we want to train a stacked ensembler, we split off some data (identified by ensembling_indices) for the ensembler to train on. This is to prevent overfitting the ensembler by training the metalearner on the same data that the input pipelines were already trained on. We then do one CV split, further splitting the data from the ensembling_indices. For this dataset of 128 rows, we train and validate on 17 and 8 rows, respectively (the split arithmetic is sketched below, after this list). I filed CV fold for ensembler after ensembling_indices split #2144 to discuss whether we want to do this additional CV split.

  2. Our ensembler is currently constructed by taking the best pipeline of each model family found and using that as the input pipelines for the stacked ensembler. However, if some of the input pipelines perform quite poorly, then the stacked ensembler may not perform as well as a high-performing individual pipeline.

For example, this is the final rankings table:
[Screenshot: final rankings table]

We notice that the stacked ensemble lands right in the middle--if we simplify and say that the stacked ensemble averages the predictions of its input pipelines, this makes sense. To test my hypothesis, I used only the model families that performed better than the stacked ensembler, rather than all of the model families, and the resulting score is much better than any individual pipeline's. This leads me to believe that the poorly performing individual pipelines dragged the stacked ensembler's score down.
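Here is a rough sketch of the split arithmetic from point 1 above (the 20% ensembling fraction and the single train/validation split inside it are assumptions chosen to make the numbers concrete; the exact fractions come from evalml's splitter defaults):

n_rows = 128                                 # rows in the happiness dataset
n_ensembling = int(n_rows * 0.2)             # ~25 rows reserved via ensembling_indices
n_validation = n_ensembling // 3             # ~8 rows left to score the ensembler
n_train = n_ensembling - n_validation        # ~17 rows to train the metalearner
print(n_ensembling, n_train, n_validation)   # 25 17 8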

Here's the repro code for testing this hypothesis:

From above:

import evalml
import pandas as pd
import woodwork as ww
from evalml.automl import AutoMLSearch
from evalml.automl.callbacks import raise_error_callback

happiness_data_set = pd.read_csv("Happiness Data Full Set.csv")
y = happiness_data_set['Happiness']
X = happiness_data_set.drop(['Happiness'], axis=1)

X = ww.DataTable(X)
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, problem_type='regression', test_size=0.25, random_seed=0)

automl = AutoMLSearch(X, y, problem_type="regression", objective="MAE", error_callback=raise_error_callback, max_batches=10, ensembling=True)
automl.search()

from evalml.automl.engine import train_and_score_pipeline
from evalml.automl.engine.engine_base import JobLogger
from evalml.model_family import ModelFamily
# _make_stacked_ensemble_pipeline is a private helper; its location may vary by version
from evalml.pipelines.utils import _make_stacked_ensemble_pipeline

# Get the pipelines fed into the ensemble, but only keep the model families that
# performed better than the stacked ensemble (Random Forest, XGBoost, Extra Trees)
input_pipelines = []
input_info = automl._automl_algorithm._best_pipeline_info

trimmed = dict()
trimmed.update({ModelFamily.RANDOM_FOREST: input_info[ModelFamily.RANDOM_FOREST]})
trimmed.update({ModelFamily.XGBOOST: input_info[ModelFamily.XGBOOST]})
trimmed.update({ModelFamily.EXTRA_TREES: input_info[ModelFamily.EXTRA_TREES]})

for pipeline_dict in trimmed.values():
    pipeline_class = pipeline_dict['pipeline_class']
    pipeline_params = pipeline_dict['parameters']
    input_pipelines.append(pipeline_class(parameters=automl._automl_algorithm._transform_parameters(pipeline_class, pipeline_params),
                                          random_seed=automl._automl_algorithm.random_seed))

# Build a stacked ensemble from just those pipelines and score it on the ensembling split
ensemble_pipeline = _make_stacked_ensemble_pipeline(input_pipelines, "regression")
X_train = X.iloc[automl.ensembling_indices]
y_train = ww.DataColumn(y.iloc[automl.ensembling_indices])
train_and_score_pipeline(ensemble_pipeline, automl.automl_config, X_train, y_train, JobLogger())

By just using these three model families, we get a MAE score of ~0.22, which is much better than any individual pipeline.

#output of train_and_score_pipeline(ensemble_pipeline, automl.automl_config, X_train, y_train, JobLogger())
{'scores': {'cv_data': [{'all_objective_scores': OrderedDict([('MAE',
                  0.22281276417465426),
                 ('ExpVariance', 0.9578811127332543),
                 ('MaxError', 0.3858477236606914),
                 ('MedianAE', 0.2790362808260225),
                 ('MSE', 0.0642654425375983),
                 ('R2', 0.9152119239698017),
                 ('Root Mean Squared Error', 0.2535062968401343),
                 ('# Training', 17),
                 ('# Validation', 9)]),
    'mean_cv_score': 0.22281276417465426,
    'binary_classification_threshold': None}],
  'training_time': 9.944366216659546,
  'cv_scores': 0    0.222813
  dtype: float64,
  'cv_score_mean': 0.22281276417465426},
 'pipeline': TemplatedPipeline(parameters={'Stacked Ensemble Regressor':{'input_pipelines': [GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'most_frequent', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Random Forest Regressor':{'n_estimators': 184, 'max_depth': 25, 'n_jobs': -1},}), GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'XGBoost Regressor':{'eta': 0.1, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 100},}), GeneratedPipeline(parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'Extra Trees Regressor':{'n_estimators': 100, 'max_features': 'auto', 'max_depth': 6, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_jobs': -1},})], 'final_estimator': None, 'cv': None, 'n_jobs': -1},}),

This makes me wonder whether we need to rethink what input pipelines we should feed to our stacked ensembler.

  3. The metalearner we're using (LinearRegressor) is not the best. I tested this via the stacking_test branch I created, where I updated the default metalearner to RidgeCV (the scikit-learn default, which we don't have in EvalML), and the ensembler performs much better:
    [Screenshot: rankings table with the RidgeCV metalearner]
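A minimal sketch of the idea in plain scikit-learn rather than evalml's wrapper (the base estimators here are placeholders; the point is that StackingRegressor's final_estimator is the metalearner being swapped):

from sklearn.ensemble import StackingRegressor, RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression, ElasticNetCV

base_estimators = [
    ("rf", RandomForestRegressor(random_state=0)),
    ("et", ExtraTreesRegressor(random_state=0)),
]

# Current behavior: a plain, unregularized linear metalearner.
stack_linear = StackingRegressor(estimators=base_estimators, final_estimator=LinearRegression(), cv=3)

# Proposed: a regularized metalearner (RidgeCV is scikit-learn's default when
# final_estimator=None; ElasticNetCV is another regularized option discussed below).
stack_regularized = StackingRegressor(estimators=base_estimators, final_estimator=ElasticNetCV(), cv=3)

# stack_regularized.fit(X_train, y_train); stack_regularized.score(X_holdout, y_holdout)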

@angela97lin

Next steps after discussion with @dsherry:

Try #1 and #3 (using Elastic Net) on other datasets, run perf tests, and see if we can get better performance overall.


rpeck commented Apr 26, 2021

@angela97lin Your points about the splitting, for tiny datasets, are right on target. Eventually, we need to handle tiny datasets really differently than bigger ones, e.g. by only using high-fold-count xval on the entire dataset, even LOOCV, and making sure we construct the folds differently for the ensemble metalearner training.

I also agree that the metalearner needs to use strong regularization. I used Elastic Net in H2O-3 StackedEnsemble, and only remember one time that the ensemble came in second in the leaderboard. Every other time I tested, it was first. The regularization should never allow poor models to bring down the performance of the ensemble.

And this was feeding the entire leaderboard of even 50 models into the metalearner. :-)


angela97lin commented Apr 28, 2021

Just posting some extra updates on this:

Tested locally using all of the regression datasets. Results can be found here or just the charts here.

From this:

  • Agreed @rpeck! We should update the metalearner to use strong regularization for sure. ElasticNetCV seemed to perform better than our LinearRegressor on many datasets. This issue tracks this: Ability to use more meta-learner models for stacked ensembles #1739
  • @dsherry and I rediscussed our data splitting strategy. Right now, we split off data for the ensemble, under the assumption that the metalearner will be trained on these ensembling indices. With the scikit-learn implementation, when we train our StackedEnsembler on this ensembling split, we end up training both the input pipelines and the metalearner on this small set of data. This could likely be why we are not performing well. While the parameters for our input pipelines come from tuning on the rest of the data, the pipelines themselves are not fitted. In the long term, rolling our own implementation could allow us to pass trained pipelines into the ensembler, in which case we would have the behavior we want. For now, that is not the case.

Next step: test this hypothesis with the ensembler manually. Train the input pipelines on 80% of the data, create cross-validated predictions on the data set aside for ensembling, and train the metalearner on those held-out predictions (a rough sketch of this follows below).
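A rough sketch of that experiment in plain scikit-learn (not evalml's API; X_train/y_train, X_ensembling/y_ensembling, and X_holdout are assumed to come from earlier splits, the base models are placeholders, and the cross-validation detail is simplified to a single held-out split):

import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import ElasticNetCV

# Placeholder base models standing in for the tuned input pipelines.
base_models = [RandomForestRegressor(random_state=0), ExtraTreesRegressor(random_state=0)]

# 1. Train the input models on the larger (e.g. 80%) training split.
for model in base_models:
    model.fit(X_train, y_train)

# 2. Generate predictions on the rows set aside for ensembling; the base models never
#    saw these rows, so these are genuinely held-out predictions.
held_out_preds = np.column_stack([model.predict(X_ensembling) for model in base_models])

# 3. Train the metalearner on those held-out predictions only.
metalearner = ElasticNetCV().fit(held_out_preds, y_ensembling)

# At predict time: stack the base models' predictions and feed them to the metalearner.
ensemble_preds = metalearner.predict(np.column_stack([m.predict(X_holdout) for m in base_models]))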

@angela97lin

Results from experimentation look good: https://alteryx.quip.com/4hEyAaTBZDap/Ensembling-Performance-Using-More-Data

Next steps:

@angela97lin

After some digging around, we believe that the issue is not with how the ensemble performs, but rather with how we report the ensemble's performance. Currently, we do a separate ensembling split of 20% of the data, then do another train-validation split within it, and report the ensemble's score on that validation data. This means that in some cases the ensemble score is calculated on a very small number of rows (as with the happiness dataset above).

By removing the ensembling indices split and using our old method of calculating the CV training score for the ensemble (give it all the data, train and validate on one fold), we see that the ensemble is ranked higher in almost all cases and comes up as #1 in many more cases. Meanwhile, the validation score is the same or slightly better.
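As a back-of-the-envelope illustration of why the reported ensemble score was so noisy before (the 20% ensembling fraction and a one-third validation fold are assumptions for concreteness):

n_rows = 128
old_validation_rows = int(n_rows * 0.2) // 3   # a fold inside the reserved ensembling split: ~8 rows
new_validation_rows = n_rows // 3              # a fold over all the data given to AutoMLSearch: ~42 rows
print(old_validation_rows, new_validation_rows)  # 8 42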

Note that since we don't do any hyperparameter tuning, the input pipelines are not trained, and the ensemble only gets the out-of-fold predictions of the input pipelines as input, overfitting is not an issue. We can revisit implementing our own ensemble and updating the splitting strategy then, but for now, we're able to see improvements just by changing the data split strategy while keeping scikit-learn's implementation.

Note that this will cause an increase in fit time when ensembling is enabled: all pipelines see more data (no reserved ensemble indices), and the ensemble is trained on more data. I think this is fine.

Results tabulated here: https://alteryx.quip.com/jI2mArnWZfTU/Ensembling-vs-Best-Pipeline-Validation-Scores#MKWACADlCDt
