-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add test to make sure OneHotEncoder
top_n works with large number of categories
#552
Conversation
Codecov Report
@@ Coverage Diff @@
## master #552 +/- ##
=======================================
Coverage 98.87% 98.87%
=======================================
Files 118 118
Lines 4439 4453 +14
=======================================
+ Hits 4389 4403 +14
Misses 50 50
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start! Suggestions:
- Please have this test apply directly to
OneHotEncoder
instead of going through autobase/pipelines. - In that case, there's no need for any numerical features, just categorical.
- Is there a way we can have more rows, like ~100k, and then ~20k categories with a few rows each, and then
top_n
categories with more? That way, the test can expect we get a specific set oftop_n
categories as output. And having more rows than categories simulates real world data more closely. - I just merged Support numpy.random.RandomState objects #530 (🎉 ), so please update to use the
get_random_state
util for random number generation. And then all our tests will continue to be fully deterministic!
@dsherry Thanks for the comments! I had added the test initially in automl to test if the one hot encoder would slow down the entire search process but I guess you're right, it should be sufficient to just test the performance of the encoder itself. Updated! For the time being, using random still but will update when the random state PR is merged in :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏 awesome, thank you! 🚢
Add test to make sure
OneHotEncoder
top_n works with large number of categories.