AutoMM: Add consistent seed, stratification, alignment with tabular for train-val split #3004

Merged (5 commits, Mar 23, 2023)

Conversation

@Innixma (Contributor) commented Mar 6, 2023:

Issue #, if available:

#3003

Description of changes:

  • Previously, AutoMM would split the same data differently across repeated fit calls, even with the same seed. Now it uses the same split when called with the same data and the same seed.
  • Changed the default seed from 123 to 0 to align with Tabular.
  • Changed the splitting logic to be identical to Tabular's, including adding stratification and other guardrails, such as ensuring that at least 1 sample per class exists in the train data after the split. Stratification should help improve AutoMM's test scores, especially when train_data has a small number of rows. (A sketch of this splitting approach is shown below.)
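
For illustration, here is a minimal sketch of a stratified split with a per-class guardrail, written with sklearn. This is a hypothetical simplification (the function name split_train_val is invented for the example); the PR reuses Tabular's existing split logic rather than this exact code:

import pandas as pd
from sklearn.model_selection import train_test_split

def split_train_val(data: pd.DataFrame, label: str, holdout_frac: float, seed: int = 0):
    """Deterministic split: the same data and the same seed yield the same split."""
    train, val = train_test_split(
        data,
        test_size=holdout_frac,
        random_state=seed,        # fixed seed -> reproducible split
        stratify=data[label],     # preserve class proportions in both parts
    )
    # Guardrail sketched from the PR description: every class keeps at least
    # 1 sample in train. (With stratify active, sklearn already errors on
    # singleton classes, so this loop is purely illustrative.)
    missing = set(data[label].unique()) - set(train[label].unique())
    for cls in missing:
        row = val[val[label] == cls].head(1)
        train = pd.concat([train, row])
        val = val.drop(row.index)
    return train, val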

Reproducing

Tabular & MultiModal Identical Result

Now the following code produces identical results for Tabular and AutoMM (previously it did not, due to (1) the seed difference and (2) the data split logic difference):

from autogluon.core.utils.loaders import load_pd

train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet')
test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet')
subsample_size = 1000  # subsample data for faster demo, try setting this to larger values
train_data = train_data.sample(n=subsample_size, random_state=0)

label = 'label'
eval_metric = 'accuracy'

from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label=label, eval_metric=eval_metric)
predictor.fit(
    train_data,
    hyperparameters={'AG_AUTOMM': {
        "optimization.max_epochs": 2,
    }},
    fit_weighted_ensemble=False,
)

test_score = predictor.evaluate(test_data)
print(test_score)

from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor(label=label, eval_metric=eval_metric)
predictor.fit(
    train_data,
    hyperparameters={
        "optimization.max_epochs": 2,
    }
)

test_score = predictor.evaluate(test_data)
print(test_score)

Output:

# TabularPredictor
{'accuracy': 0.8279816513761468, 'balanced_accuracy': 0.8292708596446914, 'mcc': 0.6639300307917014, 'roc_auc': 0.9188057800791447, 'f1': 0.8179611650485437, 'precision': 0.8868421052631579, 'recall': 0.759009009009009}
# MultiModalPredictor
{'accuracy': 0.8279816513761468}

MultiModalPredictor Continuous Training Identical Split

Now AutoMM uses the same train/val split when passed the same training data with the same seed across multiple fit calls. This avoids corrupting the validation score across multiple fit calls (#3003).

You can check the difference by running this code. (Note: you'll need to add print statements for the train/val data rows, or enter a debugger, to see that they are misaligned in mainline and aligned in this PR. A standalone determinism check is also sketched after the code.)

from autogluon.core.utils.loaders import load_pd

train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet')
test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet')
subsample_size = 1000  # subsample data for faster demo, try setting this to larger values
train_data = train_data.sample(n=subsample_size, random_state=0)

label = 'label'
eval_metric = 'accuracy'

from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor(label=label, eval_metric=eval_metric)

predictor.fit(
    train_data,
    hyperparameters={
        "optimization.max_epochs": 2,
    }
)

predictor.fit(
    train_data,
    hyperparameters={
        "optimization.max_epochs": 2,
    }
)
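
As a rough standalone check of the determinism property, split the same data twice with the same seed and compare indices. This is only an illustration, not AutoMM's internal split code; it assumes train_data and label from the snippet above:

from sklearn.model_selection import train_test_split

split_a = train_test_split(train_data, test_size=0.2, random_state=0, stratify=train_data[label])
split_b = train_test_split(train_data, test_size=0.2, random_state=0, stratify=train_data[label])
assert split_a[0].index.equals(split_b[0].index)  # identical train rows
assert split_a[1].index.equals(split_b[1].index)  # identical val rows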

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma Innixma added this to the 0.7.1 Release milestone Mar 6, 2023
@Innixma Innixma added labels: bug, enhancement, module: multimodal, priority: 1 (Mar 6, 2023)
@github-actions

github-actions bot commented Mar 6, 2023

Job PR-3004-9175664 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3004/9175664/index.html

Review thread on the new holdout_frac docstring:

    The ratio of data to use as validation.
    If 0.2, 20% of the data will be used for validation, and 80% for training.
    If None, the ratio is automatically determined,
    ranging from 0.2 for small row count to 0.01 for large row count.

Contributor commented:

Are we changing the train-test split ratio logic to match tabular here?

If we are, this might not be the right thing to do for text models. Generally you'll need more rows for validation with text models than with tabular models.

Specifically, in cases where you have 100k rows, giving 0.01 of that to validation might not be enough information for stable validation. On top of that, we generally don't train multiple multimodal models, unlike the defaults in tabular's top presets, which means stricter requirements for verifying generalization.

@Innixma (Contributor, Author) replied Mar 7, 2023:

The ratios are not changed; AutoMM was already matching Tabular. Also, 100k rows will not get 0.01, it will get 0.025 (2500 rows).
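
For context, the auto-determined ratio follows a shape like the sketch below. This is a hypothetical simplification consistent with the numbers quoted here (e.g., 100k rows -> 0.025, i.e. 2500 validation rows), not necessarily the exact implementation:

def default_holdout_frac(num_train_rows: int) -> float:
    if num_train_rows < 5000:
        # small datasets: between 10% and 20%, targeting ~500 validation rows
        return max(0.1, min(0.2, 500.0 / num_train_rows))
    # larger datasets: between 1% and 10%, targeting ~2500 validation rows
    return max(0.01, min(0.1, 2500.0 / num_train_rows))

print(default_holdout_frac(1_000))      # 0.2   -> 200 validation rows
print(default_holdout_frac(100_000))    # 0.025 -> 2500 validation rows
print(default_holdout_frac(1_000_000))  # 0.01  -> 10000 validation rows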

@@ -530,7 +529,7 @@ def fit(
     column_types: Optional[dict] = None,
     holdout_frac: Optional[float] = None,
     teacher_predictor: Union[str, MultiModalPredictor] = None,
-    seed: Optional[int] = 123,
+    seed: Optional[int] = 0,
Contributor commented:

We may need to compare performance on the AutoMM benchmark datasets before and after changing the seed. If there is no performance drop, we can make this change.

@Innixma (Contributor, Author) replied Mar 7, 2023:

If we have a performance drop purely from changing the seed, that just means we were overfit to the seed. It shouldn't be a blocker for merging.

@Innixma (Contributor, Author) added:

But yes, I would recommend a benchmark run on this PR, since there are other impactful changes, such as using stratification during splits.

Contributor commented:

@tonyhoo Do we need to test this PR on the vision and text benchmarks?

Collaborator commented:

Can we use the tutorial generated from this PR as an indicator of the performance impact? If any concerns are raised there, I am fine with rerunning the NLP and CV benchmarks.

@@ -718,14 +709,28 @@ def fit(
if self._config is not None: # continuous training
config = self._config

problem_type, output_shape = infer_problem_type_output_shape(
# FIXME: Align logic with Tabular,
# don't combine output_shape and problem_type detection, make them separate
Contributor commented:

Agree. We need to separate the logic. The data split inconsistency between two fit() calls is due to the fact that we tie them together and don't infer the problem type before splitting the data.
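
A hypothetical toy illustration of this kind of coupling (not AutoMM's actual code): if stratification is only applied once the problem type is known, then a call that skips problem-type inference stratifies differently and therefore splits differently, even with the same seed:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(10), "y": [0] * 5 + [1] * 5})

# First fit(): problem type known -> stratified split.
first = train_test_split(df, test_size=0.4, random_state=0, stratify=df["y"])
# Second fit(): problem type not re-inferred -> unstratified split.
second = train_test_split(df, test_size=0.4, random_state=0, stratify=None)

print(first[0].index.tolist())   # one set of train rows
print(second[0].index.tolist())  # typically a different set, despite the same seed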

@@ -687,15 +687,6 @@ def fit(
fit_called=fit_called,
)

train_data, tuning_data = split_train_tuning_data(
Contributor commented:

Moving split_train_tuning_data after infer_column_types would cause issues, since infer_column_types needs access to tuning_data during training. Otherwise, columns with one unique value can't be ignored. See https://github.com/autogluon/autogluon/blob/master/multimodal/src/autogluon/multimodal/data/infer_types.py#L504-L528
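
A hypothetical illustration (invented data) of why type inference wants to see the tuning split as well: a column that looks constant in the train split alone may still vary across splits, so the decision to ignore it can only be made with both splits available:

import pandas as pd

train = pd.DataFrame({"text": ["a", "b", "c"], "flag": [1, 1, 1]})
val = pd.DataFrame({"text": ["d", "e"], "flag": [1, 2]})

print(train["flag"].nunique())                    # 1 -> looks constant in train alone
print(pd.concat([train, val])["flag"].nunique())  # 2 -> actually varies across splits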

@Innixma (Contributor, Author) replied:

Updated to resolve this. Note that it is a bit awkward to resolve given how the logic is coupled together in the current functions, so I've added TODOs and FIXMEs where I think it can be streamlined.

@zhiqiangdon (Contributor) left a review comment:

infer_column_types requires tuning_data in training.

@Innixma (Contributor, Author) commented Mar 7, 2023:

> infer_column_types requires tuning_data in training.

Fixed, although I will note that the code as previously written made it very easy to do this incorrectly. We should look to refactor the code so that such bugs are harder to accidentally introduce.

@github-actions

github-actions bot commented Mar 7, 2023

Job PR-3004-633433c is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3004/633433c/index.html

@github-actions

github-actions bot commented Mar 7, 2023

Job PR-3004-cfde794 is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3004/cfde794/index.html

@zhiqiangdon (Contributor) left a review comment:

LGTM. We can refactor the logic of inferring problem type and output shape later.

@Innixma (Contributor, Author) commented Mar 17, 2023:

@zhiqiangdon rebased to resolve merge conflict. Please feel free to merge when you are ready.

@zhiqiangdon (Contributor) replied:

> @zhiqiangdon rebased to resolve merge conflict. Please feel free to merge when you are ready.

CI has one unit test error.

@Innixma (Contributor, Author) commented Mar 20, 2023:

> @zhiqiangdon rebased to resolve merge conflict. Please feel free to merge when you are ready.

> CI has one unit test error.

Should be fixed in the latest commit, thanks!

@github-actions

Job PR-3004-b084d3c is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3004/b084d3c/index.html
