
CV fold for ensembler after ensembling_indices split #2144

Closed
angela97lin opened this issue Apr 15, 2021 · 2 comments
Labels: enhancement (An improvement to an existing feature.), spike (To generate additional issues and kick off a sprint.)

Comments

angela97lin (Contributor) commented Apr 15, 2021
Currently in AutoML, if we want to train an ensemble, we create an ensembling split:

X_train, y_train = X.iloc[automl_config.ensembling_indices], y.iloc[automl_config.ensembling_indices]

This prevents overfitting by ensuring the ensemble is not trained on the same data the metalearners are trained on.

Then, in train_and_score_pipeline, we split the ensembling-indices data again with our data splitter and train/validate on a single fold:

if pipeline.model_family == ModelFamily.ENSEMBLE and i > 0:

Is this necessary? The ensemble already performs cross-validation internally. For small datasets, this means we're scoring on roughly 1/3 × 0.2 (the ensembling-indices fraction) of the data. For the happiness dataset with 128 rows (#2093), the mean_cv_score is calculated using just 8 rows. 😬 Perhaps we can remove these lines of code and just train the ensemble on the full ensembling_indices.
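To make the data-shrinkage concrete, here is a minimal sketch of the two-stage split described above. This is not evalml's actual code: the 20% ensembling fraction, the 3-fold splitter, and all variable names are assumptions chosen to mirror the issue's arithmetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({"feature": rng.normal(size=128)})  # e.g. a 128-row dataset
y = pd.Series(rng.integers(0, 2, size=128))

# Step 1: hold out ~20% of the rows as the ensembling split
# (analogous to automl_config.ensembling_indices).
_, ensembling_indices = train_test_split(X.index, test_size=0.2, random_state=0)
X_train, y_train = X.iloc[ensembling_indices], y.iloc[ensembling_indices]

# Step 2: the data splitter then CV-splits the ensembling data,
# and the ensemble is trained/validated on a single fold.
splitter = KFold(n_splits=3)
train_idx, valid_idx = next(iter(splitter.split(X_train)))

# The validation fold is ~1/3 of the ~20% ensembling split:
print(len(X), len(X_train), len(valid_idx))
```

Under these assumptions, 128 rows shrink to 26 ensembling rows and a validation fold of only 9 rows, which is the same order of magnitude as the 8-row score reported in the issue.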

@dsherry @bchen1116 @rpeck FYI

@angela97lin angela97lin added the enhancement label Apr 15, 2021
dsherry commented Apr 20, 2021

Marking this as blocked on #2093. It's possible the fix for #2093 includes fixing the code described here!

@dsherry dsherry added the spike label Apr 20, 2021
angela97lin (Contributor, author) commented:

Closed by #2260

@angela97lin angela97lin self-assigned this May 26, 2021