[CI] account for changed pandas.concat sorting behavior #201

xiki-tempula · 2022-06-25T17:31:15Z

Fix #200 It seems that pandas 1.4.3 has some issue.

codecov · 2022-06-25T17:35:29Z

Codecov Report

Merging #201 (a651b3e) into master (fd08e1d) will increase coverage by 0.15%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #201      +/-   ##
==========================================
+ Coverage   98.04%   98.20%   +0.15%     
==========================================
  Files          24       24              
  Lines        1333     1340       +7     
  Branches      281      290       +9     
==========================================
+ Hits         1307     1316       +9     
  Misses          5        5              
+ Partials       21       19       -2

Impacted Files	Coverage Δ
src/alchemlyb/parsing/namd.py	`99.17% <100.00%> (+<0.01%)`	⬆️
src/alchemlyb/parsing/__init__.py	`100.00% <0.00%> (ø)`
src/alchemlyb/parsing/amber.py	`97.48% <0.00%> (+0.02%)`	⬆️
src/alchemlyb/estimators/mbar_.py	`98.24% <0.00%> (+0.03%)`	⬆️
src/alchemlyb/parsing/gmx.py	`98.18% <0.00%> (+0.60%)`	⬆️
src/alchemlyb/parsing/gomc.py	`93.20% <0.00%> (+1.12%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd08e1d...a651b3e. Read the comment docs.

orbeckst · 2022-06-27T04:07:23Z

Do we know what the issue with Pandas is? Is 1.4.3 the latest version and does, say, 1.4.4 work? Basically I am wondering if this is has been wrong the whole time and is fixed now and we need to change the test or if this is a temporary issue?

If it's temporary then we also need to change the pip and condas stuff so that version of pandas is never installed.

orbeckst

Is there a known bug in pandas 1.4.3 that explains this behavior?

I'd like to understand the reason before we just ban certain versions (which would also need to be done in the installation meta data).

xiki-tempula · 2022-06-28T08:10:12Z

@orbeckst Sorry for the delay, I will investigate this issue in the weekends.

orbeckst · 2022-06-28T17:25:23Z

Sure, no problem, thanks for donating your time!!!!

dotsdl · 2022-06-29T23:23:55Z

Investigating this now. Looks like in pandas 1.4.3 the columns of our u_nk dataframes are being sorted from smallest to highest lambda value. Digging into if this is intentional behavior in latest pandas.

dotsdl · 2022-06-30T00:06:50Z

Behavior of this test prior to pandas 1.4.3 gives a u_nk that looks like:

u_nk
                    1.0       0.9  0.8  0.7  0.6  0.5  0.4  0.3  0.2       0.1  0.0
time    fep-lambda                                                                 
4000.0  1.0         0.0  0.101483  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
4010.0  1.0         0.0  0.396537  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
4020.0  1.0         0.0  0.813706  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
4030.0  1.0         0.0  0.533748  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
4040.0  1.0         0.0 -0.238862  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
...                 ...       ...  ...  ...  ...  ...  ...  ...  ...       ...  ...
49960.0 0.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -2.880429  0.0
49970.0 0.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -2.901732  0.0
49980.0 0.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -3.063265  0.0
49990.0 0.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -2.727282  0.0
50000.0 0.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -1.969434  0.0

[30170 rows x 11 columns]

which gives for the resulting BAR estimator's delta_f_:

           1.0       0.9       0.8       0.7       0.6       0.5       0.4        0.3        0.2       0.1       0.0
1.0   0.000000 -1.296011 -2.914735 -4.520049 -5.732406 -6.495677 -8.836413 -10.094600 -10.228049 -8.583977 -4.184060
0.9   1.296011  0.000000 -1.618724 -3.224038 -4.436395 -5.199666 -7.540402  -8.798590  -8.932038 -7.287966 -2.888049
0.8   2.914735  1.618724  0.000000 -1.605314 -2.817671 -3.580942 -5.921678  -7.179866  -7.313314 -5.669242 -1.269325
0.7   4.520049  3.224038  1.605314  0.000000 -1.212357 -1.975628 -4.316364  -5.574552  -5.708000 -4.063928  0.335989
0.6   5.732406  4.436395  2.817671  1.212357  0.000000 -0.763271 -3.104007  -4.362195  -4.495643 -2.851571  1.548346
0.5   6.495677  5.199666  3.580942  1.975628  0.763271  0.000000 -2.340736  -3.598924  -3.732372 -2.088300  2.311617
0.4   8.836413  7.540402  5.921678  4.316364  3.104007  2.340736  0.000000  -1.258188  -1.391636  0.252436  4.652353
0.3  10.094600  8.798590  7.179866  5.574552  4.362195  3.598924  1.258188   0.000000  -0.133449  1.510624  5.910541
0.2  10.228049  8.932038  7.313314  5.708000  4.495643  3.732372  1.391636   0.133449   0.000000  1.644072  6.043989
0.1   8.583977  7.287966  5.669242  4.063928  2.851571  2.088300 -0.252436  -1.510624  -1.644072  0.000000  4.399917
0.0   4.184060  2.888049  1.269325 -0.335989 -1.548346 -2.311617 -4.652353  -5.910541  -6.043989 -4.399917  0.000000

This means that using delta_f_.iloc[0, -1] as we do for each test (and generally advise users) here gives the free energy difference going from lambda = 1 to lambda = 0.

With pandas 1.4.3, a fix to pd.concat's sort parameter was included, which changes the behavior of this line in the namd u_nk parser. We now get for u_nk:

                    0.0       0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8       0.9  1.0
time    fep-lambda                                                                 
4000.0  1.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  0.101483  0.0
4010.0  1.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  0.396537  0.0
4020.0  1.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  0.813706  0.0
4030.0  1.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  0.533748  0.0
4040.0  1.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -0.238862  0.0
...                 ...       ...  ...  ...  ...  ...  ...  ...  ...       ...  ...
49960.0 0.0         0.0 -2.880429  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
49970.0 0.0         0.0 -2.901732  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
49980.0 0.0         0.0 -3.063265  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
49990.0 0.0         0.0 -2.727282  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
50000.0 0.0         0.0 -1.969434  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN

[30170 rows x 11 columns]

which features sorted columns. The resulting BAR estimator's delta_f_ then gives:

          0.0       0.1        0.2        0.3       0.4       0.5       0.6       0.7       0.8       0.9        1.0
0.0  0.000000 -4.399917  -6.043989  -5.910541 -4.652353 -2.311617 -1.548346 -0.335989  1.269325  2.888049   4.184060
0.1  4.399917  0.000000  -1.644072  -1.510624 -0.252436  2.088300  2.851571  4.063928  5.669242  7.287966   8.583977
0.2  6.043989  1.644072   0.000000   0.133449  1.391636  3.732372  4.495643  5.708000  7.313314  8.932038  10.228049
0.3  5.910541  1.510624  -0.133449   0.000000  1.258188  3.598924  4.362195  5.574552  7.179866  8.798590  10.094600
0.4  4.652353  0.252436  -1.391636  -1.258188  0.000000  2.340736  3.104007  4.316364  5.921678  7.540402   8.836413
0.5  2.311617 -2.088300  -3.732372  -3.598924 -2.340736  0.000000  0.763271  1.975628  3.580942  5.199666   6.495677
0.6  1.548346 -2.851571  -4.495643  -4.362195 -3.104007 -0.763271  0.000000  1.212357  2.817671  4.436395   5.732406
0.7  0.335989 -4.063928  -5.708000  -5.574552 -4.316364 -1.975628 -1.212357  0.000000  1.605314  3.224038   4.520049
0.8 -1.269325 -5.669242  -7.313314  -7.179866 -5.921678 -3.580942 -2.817671 -1.605314  0.000000  1.618724   2.914735
0.9 -2.888049 -7.287966  -8.932038  -8.798590 -7.540402 -5.199666 -4.436395 -3.224038 -1.618724  0.000000   1.296011
1.0 -4.184060 -8.583977 -10.228049 -10.094600 -8.836413 -6.495677 -5.732406 -4.520049 -2.914735 -1.296011   0.000000

So calling delta_f_.iloc[0, -1] gives the negative of the previous result, since we are now getting the free energy difference going from lambda = 0 to lambda = 1.

dotsdl · 2022-06-30T00:14:33Z

As for what the fix should be, I'm not exactly sure. We generally do use delta_f_.iloc[0,-1] to access the top-right element of the free energy difference dataframe from an estimator, which in many cases corresponds to the difference between the state in which all lambdas = 0 and all lambdas = 1. However, I don't believe we've ever made this guarantee across the library.

dotsdl · 2022-06-30T00:19:36Z

Just checked the docs, and looks like I was wrong: we're advising users there to use delta_f_.loc and explicitly specifying lambda values for [from, to]. This makes our choice of sorting vs. not sorting less important here. Thoughts?

dotsdl · 2022-06-30T00:21:40Z

Restoring previous behavior for the namd parser involves switching to sort=False in this line, if that's what we'd like to do. I don't know if there are other repercussions to this choice.

orbeckst · 2022-06-30T00:22:01Z

Thank you for getting to the bottom of the problem.

Does pandas-dev/pandas#47206 restore the previous behavior? If so, we can just blacklist all versions of pandas with the changed sorting behavior. That would restore our status-quo and avoid breakage of existing scripts.

Should we use .loc in the tests then?

For the deeper question about guaranteeing sorting orders etc we should raise a separate issue. I am happy, though, that we're not giving wrong advice in our docs.

dotsdl · 2022-06-30T00:27:05Z

Thank you for getting to the bottom of the problem.

Does pandas-dev/pandas#47206 restore the previous behavior? If so, we can just blacklist all versions of pandas with the changed sorting behavior. That would restore our status-quo and avoid breakage of existing scripts.

My understanding is that this PR on pandas fixes what was broken sorting behavior, and that our call to sort=True never worked. But now it does work, so now we end up with a sorted-column-names u_nk from that parser where we didn't have one before.

Should we use .loc in the tests then?

Yes, probably. I can make this change.

For the deeper question about guaranteeing sorting orders etc we should raise a separate issue. I am happy, though, that we're not giving wrong advice in our docs.

Would you mind opening this? I'll finish up this and #197.

Previous `pandas` behavior prior to 1.4.3 [did not sort numeric column names](pandas-dev/pandas#47127), but this now occurs. We don't sort within other parsers, so switching this flag to be consistent with previous behavior. There is no clear reason sorting is necessary here.

dotsdl · 2022-06-30T00:57:54Z

@orbeckst if satisfied, please merge. Thanks!

orbeckst

I think that this will do it.

Fix the CI

04ca9b9

Fix #200

xiki-tempula added 2 commits June 25, 2022 19:17

Update environment.yml

2108829

Update environment.yml

d35d8f4

xiki-tempula marked this pull request as draft June 25, 2022 22:08

xiki-tempula added 3 commits June 26, 2022 10:52

Update test_env.yaml

21b83ce

Update test_env.yaml

7ca2ae6

Update environment.yml

9858e0e

xiki-tempula marked this pull request as ready for review June 26, 2022 10:08

xiki-tempula requested a review from orbeckst June 26, 2022 10:37

orbeckst requested changes Jun 27, 2022

View reviewed changes

dotsdl added 2 commits June 29, 2022 17:36

Reverting pandas version exclusion in conda envs

b8c5ab5

dotsdl mentioned this pull request Jun 30, 2022

Switch tests to use .loc for all delta_f_ comparisons #202

Closed

orbeckst mentioned this pull request Jun 30, 2022

Should alchemlyb guarantee layout of dataframes? #203

Closed

orbeckst mentioned this pull request Jun 30, 2022

new version of pandas >=1.4 breaks NAMD-based test due to change in DataFrame column order #200

Closed

orbeckst changed the title ~~Fix the CI~~ [CI] account for changed pandas.concat sorting behavior Jun 30, 2022

orbeckst approved these changes Jun 30, 2022

View reviewed changes

Merge branch 'master' into xiki-tempula-patch-1

a651b3e

orbeckst self-assigned this Jun 30, 2022

orbeckst merged commit e2f8c39 into master Jun 30, 2022

orbeckst deleted the xiki-tempula-patch-1 branch June 30, 2022 01:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CI] account for changed pandas.concat sorting behavior #201

[CI] account for changed pandas.concat sorting behavior #201

xiki-tempula commented Jun 25, 2022 •

edited

Loading

codecov bot commented Jun 25, 2022 •

edited

Loading

orbeckst commented Jun 27, 2022

orbeckst left a comment

xiki-tempula commented Jun 28, 2022

orbeckst commented Jun 28, 2022 via email

dotsdl commented Jun 29, 2022

dotsdl commented Jun 30, 2022 •

edited

Loading

dotsdl commented Jun 30, 2022

dotsdl commented Jun 30, 2022 •

edited

Loading

dotsdl commented Jun 30, 2022 •

edited

Loading

orbeckst commented Jun 30, 2022

dotsdl commented Jun 30, 2022

dotsdl commented Jun 30, 2022

orbeckst left a comment

[CI] account for changed pandas.concat sorting behavior #201

[CI] account for changed pandas.concat sorting behavior #201

Conversation

xiki-tempula commented Jun 25, 2022 • edited Loading

codecov bot commented Jun 25, 2022 • edited Loading

Codecov Report

orbeckst commented Jun 27, 2022

orbeckst left a comment

Choose a reason for hiding this comment

xiki-tempula commented Jun 28, 2022

orbeckst commented Jun 28, 2022 via email

dotsdl commented Jun 29, 2022

dotsdl commented Jun 30, 2022 • edited Loading

dotsdl commented Jun 30, 2022

dotsdl commented Jun 30, 2022 • edited Loading

dotsdl commented Jun 30, 2022 • edited Loading

orbeckst commented Jun 30, 2022

dotsdl commented Jun 30, 2022

dotsdl commented Jun 30, 2022

orbeckst left a comment

Choose a reason for hiding this comment

xiki-tempula commented Jun 25, 2022 •

edited

Loading

codecov bot commented Jun 25, 2022 •

edited

Loading

dotsdl commented Jun 30, 2022 •

edited

Loading

dotsdl commented Jun 30, 2022 •

edited

Loading

dotsdl commented Jun 30, 2022 •

edited

Loading