Stacked ensemble performing poorly #2093
Comments
@dancuarini I tried to reproduce this locally but was not able to; could be because of additional steps before running AutoMLSearch (ex: data split size, dropping cols). Let's talk about the problem configuration! Here's what I tried to run locally:
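For context, here is a minimal sketch of that kind of local run. The file name, target column, and search settings below are assumptions for illustration, not the configuration from the original report:

```python
import pandas as pd

from evalml.automl import AutoMLSearch
from evalml.preprocessing import split_data

# Load the attached dataset; the target column name is a guess.
df = pd.read_csv("Happiness Data Full Set.csv")
y = df.pop("Happiness")
X = df

# Hold out a test split before searching.
X_train, X_test, y_train, y_test = split_data(X, y, problem_type="regression", test_size=0.2)

automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="regression",
    objective="MAE",
    ensembling=True,   # adds a stacked ensemble batch after the individual pipelines
    max_batches=10,    # enough batches so the ensembling batch actually runs
)
automl.search()
print(automl.rankings)
```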
This results in the following rankings:
Current progress: discussed with @dancuarini about not being able to repro locally; will keep in touch with @Cmancuso about repro-ing and next steps.
@angela97lin wait, are you sure you couldn't repro this? Here the stacked ensembler shows up in the middle of the rankings--I'd expect it to be at the top! Thanks for sharing the reproducer :)
@dsherry While it is a little suspicious that the stacked ensembler isn't at the top, the original issue was that the stacked ensembler was performing so poorly that it was ranked only just above the baseline regressor!
@angela97lin ah yes understood! I sent you some notes. I think any evidence that our ensembles aren't always close to the top is a problem.
Dug into this a bit more. I think there are some potential reasons why the ensembler performs poorly with this data set:
For example, this is the final rankings table: we notice that the stacked ensemble lands right smack in the middle. If we simplify and say that the stacked ensemble roughly averages the predictions of its input pipelines, this makes sense: a few weak pipelines pull the average down.

To test this hypothesis, I used only the model families that performed better than the stacked ensembler, rather than all of the model families, and the resulting ensemble scored much better than any individual pipeline. This leads me to believe that the poor-performing individual pipelines dragged the stacked ensembler down. Here's the repro code for this (continuing from the search above):
```python
import woodwork as ww

from evalml.automl.engine import train_and_score_pipeline
from evalml.automl.engine.engine_base import JobLogger
from evalml.model_family import ModelFamily
# Note: _make_stacked_ensemble_pipeline was not imported in the original snippet;
# assuming it lives in evalml.pipelines.utils.
from evalml.pipelines.utils import _make_stacked_ensemble_pipeline

# Get the pipelines fed into the ensemble, but only keep the model families
# that ranked better than the stacked ensemble.
input_pipelines = []
input_info = automl._automl_algorithm._best_pipeline_info
trimmed = dict()
trimmed.update({ModelFamily.RANDOM_FOREST: input_info[ModelFamily.RANDOM_FOREST]})
trimmed.update({ModelFamily.XGBOOST: input_info[ModelFamily.XGBOOST]})
trimmed.update({ModelFamily.EXTRA_TREES: input_info[ModelFamily.EXTRA_TREES]})
for pipeline_dict in trimmed.values():
    pipeline_class = pipeline_dict['pipeline_class']
    pipeline_params = pipeline_dict['parameters']
    input_pipelines.append(pipeline_class(parameters=automl._automl_algorithm._transform_parameters(pipeline_class, pipeline_params),
                                          random_seed=automl._automl_algorithm.random_seed))

# Build the ensemble from just these pipelines and score it on the rows
# AutoMLSearch reserved for ensembling.
ensemble_pipeline = _make_stacked_ensemble_pipeline(input_pipelines, "regression")
X_train = X.iloc[automl.ensembling_indices]
y_train = ww.DataColumn(y.iloc[automl.ensembling_indices])
train_and_score_pipeline(ensemble_pipeline, automl.automl_config, X_train, y_train, JobLogger())
```

By using just these three model families, we get an MAE of ~0.22, which is much better than any individual pipeline.
This makes me wonder whether we need to rethink what input pipelines we feed to our stacked ensembler.
@angela97lin Your points about the splitting for tiny datasets are right on target. Eventually we need to handle tiny datasets very differently from bigger ones, e.g. by only using high-fold-count xval on the entire dataset, even LOOCV, and by making sure we construct the folds differently for the ensemble metalearner training. I also agree that the metalearner needs to use strong regularization. I used Elastic Net in the H2O-3 StackedEnsemble, and I only remember one time that the ensemble came in second on the leaderboard; every other time I tested, it was first. The regularization should never allow poor models to bring down the performance of the ensemble. And this was feeding the entire leaderboard of even 50 models into the metalearner. :-)
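A minimal scikit-learn sketch of that idea (illustrative only; this is not EvalML's or H2O-3's implementation): with a regularized Elastic Net metalearner, the coefficients assigned to weak base models can be shrunk toward zero, so those models shouldn't drag the ensemble down.

```python
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import ElasticNetCV, LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Stacked ensemble with a regularized metalearner. The base estimators here
# are placeholders, not the pipelines from the AutoML run above.
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(random_state=0)),
        ("dt", DecisionTreeRegressor(random_state=0)),
        ("lr", LinearRegression()),
    ],
    final_estimator=ElasticNetCV(cv=5),  # L1/L2-regularized metalearner
    cv=5,  # metalearner is fit on out-of-fold predictions of the base models
)
# stack.fit(X_train, y_train); stack.score(X_test, y_test)
```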
Just posting some extra updates on this: Tested locally using all of the regression datasets. Results can be found here or just the charts here. From this:
Next step: test out this hypothesis with the ensembler manually. Try to manually train the input pipelines on 80% of the data, create cross-validated predictions on the data set aside for ensembling, and train the metalearner on those out-of-sample predictions (see the sketch below).
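A simplified illustration of the out-of-fold idea with scikit-learn (placeholder models and split sizes; not the exact experiment that was run):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeRegressor

# X, y: features/target from the dataset in this thread.
# Hold out 20% for scoring the finished ensemble.
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.2, random_state=0)

base_models = [RandomForestRegressor(random_state=0), DecisionTreeRegressor(random_state=0)]

# Out-of-fold predictions on the training data: every row is predicted by a
# model that did not see that row during fitting.
oof_preds = np.column_stack(
    [cross_val_predict(model, X_train, y_train, cv=5) for model in base_models]
)

# The metalearner is trained only on these out-of-fold predictions.
meta = ElasticNetCV(cv=5).fit(oof_preds, y_train)

# Refit the base models on all training data, then score the stack on the holdout.
holdout_preds = np.column_stack(
    [model.fit(X_train, y_train).predict(X_holdout) for model in base_models]
)
print(mean_absolute_error(y_holdout, meta.predict(holdout_preds)))
```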
Results from experimentation look good: https://alteryx.quip.com/4hEyAaTBZDap/Ensembling-Performance-Using-More-Data Next steps:
After some digging around, we believe the issue is not with how the ensemble performs, but rather with how we report the ensemble's performance. Currently, we carve out a separate ensembling split (20% of the data), then do another train-validation split within it, and report the ensemble's score on that validation data. This means that in some cases the ensemble score is calculated on a very small number of rows (as with the happiness dataset above).

By removing the ensembling indices split and going back to our old method of calculating the CV training score for the ensemble (give it all the data, train and validate on one fold), the ensemble is ranked higher in almost all cases and comes up as #1 in many more cases, while the validation score stays the same or gets slightly better. Overfitting is not an issue here, since we don't do any hyperparameter tuning, the input pipelines are not trained against the scoring fold, and the ensemble only gets the out-of-sample predictions of the input pipelines as input.

We can revisit implementing our own ensemble and update the splitting strategy then, but for now we see improvements by just changing the data split strategy while keeping scikit-learn's implementation. Note that this will increase fit time when ensembling is enabled: all pipelines see more data (no reserved ensembling indices), and the ensemble is trained on more data. I think this is fine. Results tabulated here: https://alteryx.quip.com/jI2mArnWZfTU/Ensembling-vs-Best-Pipeline-Validation-Scores#MKWACADlCDt
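To make the two schemes concrete, here is a rough sketch of the difference (split sizes and fold counts are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

indices = np.arange(len(X))  # row indices of the full training data

# Old scheme: reserve ~20% of rows for ensembling, then split that slice again
# into train/validation; the ensemble's reported score comes from the small
# validation piece, which can be only a handful of rows on a small dataset.
rest_idx, ensembling_idx = train_test_split(indices, test_size=0.2, random_state=0)
ens_train_idx, ens_val_idx = train_test_split(ensembling_idx, test_size=0.2, random_state=0)

# New scheme: no reserved ensembling indices. The ensemble gets all of the
# training data and, like every other pipeline, is trained and validated on one
# cross-validation fold.
train_idx, val_idx = next(KFold(n_splits=3, shuffle=True, random_state=0).split(indices))
```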
Steps to reproduce:
Happiness Data Full Set.csv.zip