Add test to make sure OneHotEncoder top_n works with large number of categories #552

Merged: 13 commits, Apr 8, 2020

angela97lin
Contributor

Add test to make sure OneHotEncoder top_n works with large number of categories.

@angela97lin angela97lin self-assigned this Mar 31, 2020
codecov bot commented Mar 31, 2020

Codecov Report

Merging #552 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #552   +/-   ##
=======================================
  Coverage   98.87%   98.87%           
=======================================
  Files         118      118           
  Lines        4439     4453   +14     
=======================================
+ Hits         4389     4403   +14     
  Misses         50       50           
Impacted Files                                         Coverage Δ
...ts/automl_tests/test_auto_classification_search.py  100.00% <100.00%> (ø)
...alml/tests/component_tests/test_one_hot_encoder.py  100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0635980...88a64bf.

@angela97lin angela97lin requested a review from dsherry March 31, 2020 22:21
Contributor

@dsherry dsherry left a comment

Great start! Suggestions:

  • Please have this test apply directly to OneHotEncoder instead of going through autobase/pipelines.
  • In that case, there's no need for any numerical features, just categorical.
  • Is there a way we can have more rows, like ~100k, and then ~20k categories with a few rows each, and then top_n categories with more? That way, the test can expect we get a specific set of top_n categories as output. And having more rows than categories simulates real world data more closely.
  • I just merged #530, "Support numpy.random.RandomState objects" (🎉), so please update the test to use the get_random_state util for random number generation. That way all our tests will stay fully deterministic!
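
A rough sketch of the data shape suggested above, for illustration only. It assumes plain numpy and pandas rather than the project's own OneHotEncoder or get_random_state util, and the helper name `make_skewed_categorical`, the column name `cat`, and the `rare_`/`top_` labels are all hypothetical. The `pd.get_dummies` step is only a stand-in for the encoder: it shows the kind of output a test could expect when only the top_n categories are kept.

```python
import numpy as np
import pandas as pd


def make_skewed_categorical(n_rare=20000, rows_per_rare=4,
                            top_n=10, rows_per_top=2000, seed=0):
    """Build one categorical column: many rare categories, a few frequent ones.

    Hypothetical helper for illustration; not part of the PR.
    """
    # A seeded RandomState keeps the shuffle deterministic across runs.
    rng = np.random.RandomState(seed)
    # ~20k categories with only a few rows each.
    rare = np.repeat([f"rare_{i}" for i in range(n_rare)], rows_per_rare)
    # top_n categories with many more rows; distinct counts avoid ranking ties.
    top = np.concatenate(
        [np.repeat(f"top_{i}", rows_per_top + i) for i in range(top_n)]
    )
    values = np.concatenate([rare, top])
    rng.shuffle(values)  # in-place shuffle so frequent rows are not grouped
    return pd.DataFrame({"cat": values})


X = make_skewed_categorical()

# The frequent categories are known in advance, so a test can assert on them.
expected = {f"top_{i}" for i in range(10)}
observed = set(X["cat"].value_counts().head(10).index)

# Stand-in for the encoder: keep only the top categories, then one-hot encode.
# Rows belonging to rare categories become NaN and encode as all zeros.
encoded = pd.get_dummies(X["cat"].where(X["cat"].isin(observed)))
```

With the defaults this yields roughly 100k rows (80,000 rare rows plus 20,045 frequent ones), so the frame has more rows than categories, and the ten `top_` categories are the only columns that survive the encoding step.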

@angela97lin
Contributor Author

@dsherry Thanks for the comments!

I initially added the test in automl to check whether the one-hot encoder would slow down the entire search process, but I guess you're right: it should be sufficient to test the performance of the encoder itself. Updated!

For the time being I'm still using random, but I'll update once the random state PR is merged in :)

@angela97lin angela97lin requested a review from dsherry April 1, 2020 17:51
@angela97lin angela97lin requested a review from dsherry April 2, 2020 15:12
Contributor

@dsherry dsherry left a comment

👏 awesome, thank you! 🚢
