Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix test_describe_pipeline for pandas 1.2.4 #2129

Merged
merged 3 commits into from Apr 13, 2021
Merged

Conversation

angela97lin
Copy link
Contributor

Fixes test for latest dependency bot. Apparently each time the PR is created, it wipes any other changes so I needed to open my own PR 馃槵

Pandas' update to 1.2.4 broke test_describe_pipeline tests. I believe this is because of the way that pandas treats object columns with float precision via https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.2.4.html. We call logger.info on the pandas DataFrame, which calls to_string. I've updated this PR to update our test to conform to pandas 1.2.4 since this is simply a difference in printing/logging, but it's worth updating test_describe_pipeline to be tolerant to such changes with pandas behavior in the future (could look into setting some pandas options hardcoded on our end). Unfortunately, this doesn't work here since the float config has also changed here :d

@angela97lin angela97lin self-assigned this Apr 12, 2021
@codecov
Copy link

codecov bot commented Apr 12, 2021

Codecov Report

Merging #2129 (bcf23f9) into main (8f30b08) will not change coverage.
The diff coverage is 100.0%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2129   +/-   ##
=======================================
  Coverage   100.0%   100.0%           
=======================================
  Files         291      291           
  Lines       23809    23809           
=======================================
  Hits        23799    23799           
  Misses         10       10           
Impacted Files Coverage 螖
evalml/automl/automl_search.py 100.0% <100.0%> (酶)
evalml/tests/automl_tests/test_automl.py 100.0% <100.0%> (酶)

Continue to review full report at Codecov.

Legend - Click here to learn more
螖 = absolute <relative> (impact), 酶 = not affected, ? = missing data
Powered by Codecov. Last update 8f30b08...bcf23f9. Read the comment docs.

Copy link
Contributor

@freddyaboulton freddyaboulton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @angela97lin !

I think this is the smallest change we can make to fix our tests but I think its weird we're displaying the dataset size to three decimal places since it should always be an int. In "old pandas" we would display to one decimal place which is better but I still think its weird.

I wonder if the better thing to do is either display the fold size outside of the dataframe or round to ints?

@angela97lin
Copy link
Contributor Author

angela97lin commented Apr 13, 2021

@freddyaboulton Yeah, I think the logic here is frustrating because we're converting to object to handle the NaN values (since we can't convert to int with a series that has nan without pandas throwing).

One option is to use the nullable Int64--that can be okay for this internal structure that isn't expected to be modified by the user.

Also posting this: https://stackoverflow.com/questions/39690742/convert-float-to-int-and-leave-nulls, will try to see if any of these solutions float our boat :)

EDIT: Found a way to display "ints" aka floats with less precision 馃槑 waiting for tests to pass, then will merge!

@angela97lin angela97lin modified the milestone: Sprint 2021 Apr A Apr 13, 2021
@angela97lin angela97lin merged commit d035cd1 into main Apr 13, 2021
@angela97lin angela97lin deleted the ange_fix_pandas branch April 13, 2021 15:41
assert "0 1.000 66.000 34.000" in out
assert "1 1.000 67.000 33.000" in out
assert "2 1.000 67.000 33.000" in out
assert "0 1.000 66 34" in out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

馃憦 馃コ 馃挭

chukarsten added a commit that referenced this pull request Apr 20, 2021
鈥kEngine`` #1975.

- Added optional ``engine`` argument to ``AutoMLSearch`` #1975
- Added a warning about how time series support is still in beta when a user passes in a time series problem to ``AutoMLSearch`` #2118
- Added ``NaturalLanguageNaNDataCheck`` data check #2122
- Added ValueError to ``partial_dependence`` to prevent users from computing partial dependence on columns with all NaNs #2120
- Added standard deviation of cv scores to rankings table #2154
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use ``minority:majority`` ratio instead of ``majority:minority`` #2077
- Fixed bug where two-way partial dependence plots with categorical variables were not working correctly #2117
- Fixed bug where ``hyperparameters`` were not displaying properly for pipelines with a list ``component_graph`` and duplicate components #2133
- Fixed bug where ``pipeline_parameters`` argument in ``AutoMLSearch`` was not applied to pipelines passed in as ``allowed_pipelines`` #2133
- Fixed bug where ``AutoMLSearch`` was not applying custom hyperparameters to pipelines with a list ``component_graph`` and duplicate components #2133
- Removed ``hyperparameter_ranges`` from Undersampler and renamed ``balanced_ratio`` to ``sampling_ratio`` for samplers #2113
- Renamed ``TARGET_BINARY_NOT_TWO_EXAMPLES_PER_CLASS`` data check message code to ``TARGET_MULTICLASS_NOT_TWO_EXAMPLES_PER_CLASS`` #2126
- Modified one-way partial dependence plots of categorical features to display data with a bar plot #2117
- Renamed ``score`` column for ``automl.rankings`` as ``mean_cv_score`` #2135
- Fixed ``conf.py`` file #2112
- Added a sentence to the automl user guide stating that our support for time series problems is still in beta. #2118
- Fixed documentation demos #2139
- Update test badge in README to use GitHub Actions #2150
- Fixed ``test_describe_pipeline`` for ``pandas`` ``v1.2.4`` #2129
- Added a GitHub Action for building the conda package #1870 #2148
.. warning::
- Renamed ``balanced_ratio`` to ``sampling_ratio`` for the ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, ``BalancedClassficationSampler``, and Undersampler #2113
- Deleted the "errors" key from automl results #1975
- Deleted the ``raise_and_save_error_callback`` and the ``log_and_save_error_callback`` #1975
- Fixed ``BalancedClassificationDataCVSplit``, ``BalancedClassificationDataTVSplit``, and ``BalancedClassificationSampler`` to use minority:majority ratio instead of majority:minority #2077
@chukarsten chukarsten mentioned this pull request Apr 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants