Unpin pandas version #1708

Merged
merged 22 commits into from Mar 16, 2021
Conversation


@angela97lin angela97lin commented Jan 20, 2021

Closes #1618

This change will break the tests that run against an older pandas version, which in turn breaks our conda package build 📦 🤯

Many of the test changes come from how pandas now orders the values returned by value_counts(). If two values occur the same number of times, they are now kept in the order in which they first appear in the original data. Ex: [5, 1, 6, 5, 1] will be [5, 1, 6] :D
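A minimal sketch of the tie-breaking rule described above, written with the standard library rather than pandas itself. The helper name `value_counts_first_seen` is hypothetical and only models the ordering; it is not how pandas implements value_counts().

```python
from collections import Counter


def value_counts_first_seen(values):
    """Return (value, count) pairs sorted by descending count.

    Tied values keep the order in which they first appear in `values`.
    """
    # Counter preserves insertion order (Python 3.7+), and most_common()
    # sorts stably, so equal counts stay in first-seen order.
    return Counter(values).most_common()


counts = value_counts_first_seen([5, 1, 6, 5, 1])
print(counts)  # [(5, 2), (1, 2), (6, 1)]
```

Here 5 and 1 both occur twice, and 5 comes first because it appears first in the input.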

Re: the prediction explanation tests. I confirmed that this is the cause by checking the transformed columns produced by the one-hot encoder (OHE). With the new pandas version, the columns are:

Index(['amount', 'provider_American Express',
       'provider_Diners Club / Carte Blanche', 'provider_Discover',
       'provider_JCB 15 digit', 'provider_JCB 16 digit', 'provider_Maestro',
       'provider_Mastercard', 'provider_VISA 13 digit',
       'provider_VISA 16 digit', 'provider_VISA 19 digit', 'currency_CNY',
       'currency_EGP', 'currency_HTG', 'currency_IMP', 'currency_LAK',
       'currency_NAD', 'currency_PAB', 'currency_QAR', 'currency_TZS',
       'currency_XDR'],
      dtype='object')

Notably, currency_HTG, currency_NAD, and currency_PAB are now among the encoded columns.

Using pandas 1.1.5, we get:

Index(['amount', 'provider_American Express',
       'provider_Diners Club / Carte Blanche', 'provider_Discover',
       'provider_JCB 15 digit', 'provider_JCB 16 digit', 'provider_Maestro',
       'provider_Mastercard', 'provider_VISA 13 digit',
       'provider_VISA 16 digit', 'provider_VISA 19 digit', 'currency_CNY',
       'currency_EGP', 'currency_IMP', 'currency_LAK', 'currency_MOP',
       'currency_MUR', 'currency_NIS', 'currency_QAR', 'currency_TZS',
       'currency_XDR'],
      dtype='object')

This explains the test failures: the columns the OHE selects have changed because value_counts() now orders tied values differently.
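To illustrate why a tie-breaking change alters which categories get encoded, here is a small sketch that keeps only the top `n` categories by frequency. The function `top_n_categories`, its `tie_break` parameter, and the toy data are all hypothetical. The `"by_value"` rule is just one alternative ordering chosen for contrast; it is not a claim about what older pandas versions did.

```python
from collections import Counter


def top_n_categories(values, n, tie_break="first_seen"):
    """Pick the n most frequent categories under a given tie-breaking rule."""
    counts = Counter(values)
    if tie_break == "first_seen":
        # Stable sort: ties stay in order of first appearance.
        ranked = counts.most_common()
    elif tie_break == "by_value":
        # Alternative rule for contrast: ties ordered by the value itself.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    else:
        raise ValueError(f"unknown tie_break: {tie_break!r}")
    return [value for value, _ in ranked[:n]]


# Toy data: "x" is the clear winner, the other four categories are tied.
toy = ["x", "x", "d", "c", "b", "a"]
first_seen = top_n_categories(toy, 3)
by_value = top_n_categories(toy, 3, tie_break="by_value")
print(first_seen)  # ['x', 'd', 'c']
print(by_value)    # ['x', 'a', 'b']
```

The frequencies are identical in both calls. Only the categories sitting at the tied cutoff differ, which matches the kind of column swap shown in the two Index outputs above.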

@angela97lin angela97lin self-assigned this Jan 20, 2021
@angela97lin angela97lin added this to the Sprint 2021 Jan A milestone Jan 20, 2021
codecov bot commented Jan 20, 2021

Codecov Report

Merging #1708 (675e17f) into main (888dce8) will not change coverage.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##             main    #1708   +/-   ##
=======================================
  Coverage   100.0%   100.0%           
=======================================
  Files         273      273           
  Lines       22356    22356           
=======================================
  Hits        22350    22350           
  Misses          6        6           
| Impacted Files | Coverage Δ |
|---|---|
| ...components/transformers/encoders/onehot_encoder.py | 100.0% <ø> (ø) |
| ...rmers/preprocessing/delayed_feature_transformer.py | 100.0% <ø> (ø) |
| ...ta_checks_tests/test_class_imbalance_data_check.py | 100.0% <ø> (ø) |
| ...ta_checks_tests/test_invalid_targets_data_check.py | 100.0% <ø> (ø) |
| .../tests/pipeline_tests/test_time_series_pipeline.py | 100.0% <ø> (ø) |
| evalml/objectives/objective_base.py | 100.0% <100.0%> (ø) |
| evalml/tests/automl_tests/test_automl.py | 100.0% <100.0%> (ø) |
| evalml/tests/data_checks_tests/test_data_checks.py | 100.0% <100.0%> (ø) |
| ...s/prediction_explanations_tests/test_explainers.py | 100.0% <100.0%> (ø) |
| ...del_understanding_tests/test_partial_dependence.py | 100.0% <100.0%> (ø) |

... and 2 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 888dce8...675e17f. Read the comment docs.

@angela97lin angela97lin changed the title from "Integrate pandas version 1.2.0" to "Unpin pandas version" Mar 11, 2021
@angela97lin angela97lin marked this pull request as ready for review March 11, 2021 20:59

@bchen1116 bchen1116 left a comment

If I'm understanding this correctly, pandas' value_counts method now orders values by frequency first (in descending order) and then breaks ties by order of occurrence?

Does that mean in your example in the description, currency_NAD, currency_HTG, and currency_PAB occur as frequently as currency_MOP, currency_MUR, currency_NIS in the new encoding, but that the first 3 occur AFTER the second 3 in the original input data?

Just wanted to check I am understanding this properly.

@angela97lin

@bchen1116 Correct! value_counts has always sorted by frequency first, but it now breaks ties by order of occurrence in the original data, so the only test cases that needed updating were those where frequencies were tied.
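As a rough sketch of that last point, the check below flags inputs where a top-`n` selection depends on tie-breaking, meaning the n-th and (n+1)-th most frequent values have equal counts. The helper `has_tie_at_cutoff` is hypothetical and is not part of evalml or pandas.

```python
from collections import Counter


def has_tie_at_cutoff(values, n):
    """Return True if choosing the top n values requires breaking a tie."""
    ranked_counts = sorted(Counter(values).values(), reverse=True)
    # With n <= 0 or n covering every distinct value, nothing is cut off.
    if n <= 0 or n >= len(ranked_counts):
        return False
    # A tie straddles the cutoff when the last kept count equals
    # the first dropped count.
    return ranked_counts[n - 1] == ranked_counts[n]


tied = has_tie_at_cutoff(["x", "x", "d", "c", "b", "a"], 3)
clear = has_tie_at_cutoff(["a", "a", "a", "b", "b", "c"], 2)
print(tied, clear)  # True False
```

Test fixtures for which this returns True are the ones whose expected output can shift when the tie-breaking order changes.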


@bchen1116 bchen1116 left a comment

@angela97lin got it! looks good to me then!

@angela97lin angela97lin merged commit 9576d5d into main Mar 16, 2021
@angela97lin angela97lin deleted the 1618_pandas_1.2.0 branch March 16, 2021 16:39