Tabular: Enhance drop_duplicates, enable by default #3010

Innixma · 2023-03-07T02:19:27Z

Issue #, if available:

Description of changes:

Previously, due to a bug in pandas (or otherwise very strange logic), duplicate category features containing NaN would crash DropDuplicatesFeatureGenerator: Series.replace({np.nan: foobar}) does not actually replace NaN when the dtype is category, unlike when the dtype is numeric or object.
- Note: This bug did not cause crashes for standard users, because DropDuplicatesFeatureGenerator was previously enabled only for TextSpecial features, which are numeric and never contain NaN.

This PR:
- Fixes the above issue by first converting category features to object when checking for duplicates.
- Adds comprehensive unit tests for the edge-case scenarios involving NaN and mixed dtypes.
- Enables post_drop_duplicates=True by default in PipelineFeatureGenerator, which means that TabularPredictor now automatically filters duplicate columns (previously it did not). This helps significantly when the user passes data with many duplicate features, and it also filters out many redundant ngrams when text is present. (You can see the ngram filtering in action in the changed AutoMLPipelineFeatureGenerator unit test outputs.)
- Reduces the sample size used for duplicate detection to avoid overly long fit times for the generator. The chosen value (2000) should be high enough to still correctly de-dupe features in virtually all practical situations.
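
For readers unfamiliar with the category/NaN issue, here is a minimal sketch of the idea behind the fix. It is not the code in DropDuplicatesFeatureGenerator. The function name find_duplicate_columns and the sentinel value are hypothetical, chosen only for illustration. The sketch casts every column to object (so category columns behave like the others), swaps NaN for a sentinel so that missing values in the same positions compare as equal, and then looks for columns with identical values.

```python
import numpy as np
import pandas as pd


def find_duplicate_columns(df: pd.DataFrame, sentinel: str = "__nan__") -> list:
    """Return the names of columns whose values repeat an earlier column.

    Hypothetical helper for illustration only; assumes `sentinel` never
    occurs as a real value in the data.
    """
    seen = set()
    duplicates = []
    for name in df.columns:
        col = df[name]
        # Cast to object first so category columns are handled the same way
        # as numeric and object columns when NaN is substituted.
        as_object = col.astype(object)
        # Replace missing values with the sentinel so NaN == NaN position-wise.
        filled = as_object.where(col.notna(), sentinel)
        key = tuple(filled.tolist())
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


df = pd.DataFrame(
    {
        "a": pd.Categorical(["x", np.nan, "y"]),
        "b": pd.Categorical(["x", np.nan, "y"]),  # duplicate of "a"
        "c": [1.0, np.nan, 3.0],
        "d": [1.0, np.nan, 3.0],  # duplicate of "c"
        "e": ["x", "z", "y"],
    }
)
print(find_duplicate_columns(df))
```

Note that this sketch compares values only, so an int column and a float column holding the same numbers would also be reported as duplicates; whether that is desirable depends on the use case.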

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Innixma · 2023-03-07T02:41:37Z

@gradientsky This PR causes EDA unit tests to fail, probably because additional features are being pruned now in the feature generator.

gradientsky · 2023-05-25T00:46:10Z

> @gradientsky This PR causes EDA unit tests to fail, probably because additional features are being pruned now in the feature generator.

Added fixes + merged main

gradientsky · 2023-05-25T21:18:27Z

LGTM

Innixma added this to the 0.7.1 Release milestone Mar 7, 2023

Innixma added the enhancement (New feature or request), module: tabular, and priority: 1 (High priority) labels Mar 7, 2023

Tabular: Enhance drop_duplicates, enable by default

6c81245

Innixma force-pushed the drop_duplicates_opt branch from ed287a7 to 6c81245 March 19, 2023 22:43

Innixma modified the milestones: 0.7.1 Release, 0.8 Release May 16, 2023

gradientsky added 2 commits May 24, 2023 17:41

Merge branch 'master' into drop_duplicates_opt

8b99910

Fixed eda tests

1491392

Fix unit test

98ade7e

Innixma added the priority: 0 (Maximum priority) label and removed the priority: 1 (High priority) label May 25, 2023

Innixma requested a review from gradientsky May 25, 2023 18:39

gradientsky approved these changes May 25, 2023

Innixma merged commit 2f3835c into autogluon:master May 25, 2023
28 checks passed