Tabular: Enhance drop_duplicates, enable by default #3010

Merged 4 commits on May 25, 2023

Innixma
Contributor

@Innixma Innixma commented Mar 7, 2023

Issue #, if available:

Description of changes:

  • Previously, because of a pandas bug (or at least very surprising behavior), duplicate category features containing NaN crashed DropDuplicatesFeatureGenerator: Series.replace({np.nan: foobar}) does not actually replace NaN when the dtype is category, although it does for numeric and object dtypes.
    • Note: This bug did not crash standard usage, because DropDuplicatesFeatureGenerator was previously enabled only for TextSpecial features, which are numeric and never contain NaN.
  • This PR fixes the above issue by first converting category features to object dtype when checking for duplicates.
  • Also adds comprehensive unit tests for the edge-case scenarios involving NaN and mixed dtypes.
  • Enabled post_drop_duplicates=True by default in PipelineFeatureGenerator, so TabularPredictor now automatically drops duplicate columns (previously it did not). This helps significantly when the user passes data with many duplicate features, and it also filters out redundant ngrams when text is present. (The ngram filtering is visible in the changed AutoMLPipelineFeatureGenerator unit test outputs.)
  • Reduced the sample size used for duplicate detection to avoid overly long fit times for the generator. The chosen value (2000) should be high enough to still correctly de-dupe features in virtually all practical situations.
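
For illustration only, the duplicate check described above can be sketched in plain pandas. This is not the actual DropDuplicatesFeatureGenerator code: the function name `find_duplicate_columns`, the `_MISSING` sentinel, and the `seed` parameter are hypothetical, and only the category-to-object conversion and the 2000-row sample come from the description.

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel standing in for NaN, so missing values compare equal.
_MISSING = object()


def find_duplicate_columns(df, sample_size=2000, seed=0):
    """Return the names of columns whose values repeat an earlier column."""
    # Compare on a row sample to bound the cost on large frames.
    if len(df) > sample_size:
        df = df.sample(n=sample_size, random_state=seed)

    first_seen = {}
    duplicates = []
    for name in df.columns:
        col = df[name]
        # Convert category columns to object before handling NaN.
        if isinstance(col.dtype, pd.CategoricalDtype):
            col = col.astype(object)
        # Build a hashable fingerprint with NaN mapped to the sentinel.
        key = tuple(_MISSING if pd.isna(v) else v for v in col)
        if key in first_seen:
            duplicates.append(name)
        else:
            first_seen[key] = name
    return duplicates


df = pd.DataFrame({
    "a": pd.Series(["x", np.nan, "y"], dtype="category"),
    "b": pd.Series(["x", np.nan, "y"], dtype="category"),
    "c": pd.Series(["x", "z", "y"], dtype="category"),
    "d": [1.0, np.nan, 3.0],
    "e": [1.0, np.nan, 3.0],
})
print(find_duplicate_columns(df))  # ['b', 'e']
```

Note that this sketch compares values by plain equality, so an integer column and a float column holding the same numbers would also count as duplicates; a real implementation may want to compare only within a dtype group.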

TODO:

  • Benchmark on AMLB to ensure stability and performance

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@Innixma Innixma added this to the 0.7.1 Release milestone Mar 7, 2023
@Innixma Innixma added the enhancement (New feature or request), module: tabular, and priority: 1 (High priority) labels Mar 7, 2023
@Innixma
Contributor Author

Innixma commented Mar 7, 2023

@gradientsky This PR causes EDA unit tests to fail, probably because additional features are being pruned now in the feature generator.

@Innixma Innixma modified the milestones: 0.7.1 Release, 0.8 Release May 16, 2023
@gradientsky
Contributor

> @gradientsky This PR causes EDA unit tests to fail, probably because additional features are being pruned now in the feature generator.

Added fixes + merged main

@Innixma Innixma added the priority: 0 (Maximum priority) label and removed the priority: 1 (High priority) label May 25, 2023
@Innixma Innixma requested a review from gradientsky May 25, 2023 18:39
@github-actions

Job PR-3010-98ade7e is done.
Docs are uploaded to http://autogluon-staging.s3-website-us-west-2.amazonaws.com/PR-3010/98ade7e/index.html

@gradientsky
Contributor

LGTM

@Innixma Innixma merged commit 2f3835c into autogluon:master May 25, 2023
28 checks passed
Labels
enhancement (New feature or request), module: tabular, priority: 0 (Maximum priority)
2 participants