Add test to make sure OneHotEncoder top_n works with large number of categories #552

angela97lin merged 13 commits into master from 550_many_categories on Apr 8, 2020



@angela97lin angela97lin commented Mar 31, 2020

Add test to make sure OneHotEncoder top_n works with large number of categories.

@angela97lin angela97lin self-assigned this Mar 31, 2020

codecov bot commented Mar 31, 2020

Codecov Report

Merging #552 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #552   +/-   ##
  Coverage   98.87%   98.87%           
  Files         118      118           
  Lines        4439     4453   +14     
+ Hits         4389     4403   +14     
  Misses         50       50           
Impacted Files                    Coverage Δ
...ts/automl_tests/               100.00% <100.00%> (ø)
...alml/tests/component_tests/    100.00% <100.00%> (ø)
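
The report shows 14 new lines, all covered, yet a coverage change of 0.00%. A minimal sketch of the arithmetic, using the Hits and Lines figures from the diff above. It assumes the displayed percentages are truncated to two decimals rather than rounded (an assumption about how the report is rendered); `truncated_pct` is a hypothetical helper, not part of Codecov.

```python
import math


def truncated_pct(hits, lines):
    """Coverage as a percentage, truncated (not rounded) to two decimals.

    Truncation is an assumption about how the report displays values.
    """
    return math.floor(hits / lines * 10_000) / 100


# Figures taken from the coverage diff: master vs. #552.
master_hits, master_lines = 4389, 4439
pr_hits, pr_lines = 4403, 4453

master_cov = truncated_pct(master_hits, master_lines)
pr_cov = truncated_pct(pr_hits, pr_lines)

# Exact change in percentage points, before any truncation.
exact_delta = (pr_hits / pr_lines - master_hits / master_lines) * 100

print(master_cov, pr_cov, round(exact_delta, 4))
```

Both sides keep the same 50 misses, so adding 14 fully covered lines moves coverage by only a few thousandths of a percentage point, which does not show at two decimals.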

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0635980...88a64bf.

@angela97lin angela97lin requested a review from dsherry Mar 31, 2020

@dsherry dsherry left a comment

Great start! Suggestions:

  • Please have this test apply directly to OneHotEncoder instead of going through autobase/pipelines.
  • In that case, there's no need for any numerical features, just categorical.
  • Can we have more rows, say ~100k, with ~20k categories that have only a few rows each, plus top_n categories that have many more rows? That way, the test can expect a specific set of top_n categories as output. Having more rows than categories also simulates real-world data more closely.
  • I just merged #530 (🎉 ), so please update to use the get_random_state util for random number generation. And then all our tests will continue to be fully deterministic!
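
The data layout suggested above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library instead of evalml's `get_random_state` util or the real OneHotEncoder, and `make_categorical_column` and `most_frequent_categories` are hypothetical helpers. In an actual test, the column would be passed to the encoder and the resulting feature names compared against the expected set.

```python
import random
from collections import Counter


def make_categorical_column(n_rare=20_000, rare_rows_each=4, top_n=10,
                            frequent_rows_each=2_000, seed=0):
    """Build ~100k rows: many rare categories plus top_n frequent ones.

    All names and sizes are illustrative assumptions, not the values
    used in the merged test.
    """
    rng = random.Random(seed)  # seeded, so the data is deterministic
    frequent = [f"top_{i}" for i in range(top_n)]
    values = []
    for i, cat in enumerate(frequent):
        # Give each frequent category a distinct count to avoid ties.
        values.extend([cat] * (frequent_rows_each + i))
    for j in range(n_rare):
        values.extend([f"rare_{j}"] * rare_rows_each)
    rng.shuffle(values)
    return values, set(frequent)


def most_frequent_categories(values, top_n):
    """Stand-in for the selection a top_n encoder is expected to make."""
    return {cat for cat, _ in Counter(values).most_common(top_n)}


column, expected = make_categorical_column()
selected = most_frequent_categories(column, top_n=10)
print(len(column), len(set(column)), selected == expected)
```

Because the frequent categories outnumber every rare one by a wide margin, the expected output is a known, fixed set, and the seed keeps the row order reproducible across runs.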

Contributor Author

angela97lin commented Apr 1, 2020

@dsherry Thanks for the comments!

I initially added the test in automl to check whether the one-hot encoder would slow down the entire search process, but I guess you're right: it should be sufficient to test the performance of the encoder itself. Updated!

For the time being, I'm still using random, but I'll update when the random state PR is merged in :)

@angela97lin angela97lin requested a review from dsherry Apr 1, 2020
@angela97lin angela97lin requested a review from dsherry Apr 2, 2020
dsherry approved these changes Apr 8, 2020

@dsherry dsherry left a comment

👏 awesome, thank you! 🚢

@angela97lin angela97lin merged commit 56aa951 into master Apr 8, 2020
2 checks passed
@dsherry dsherry deleted the 550_many_categories branch Oct 29, 2020