Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] account for changed pandas.concat sorting behavior #201

Merged
merged 9 commits into from
Jun 30, 2022

Conversation

xiki-tempula
Copy link
Collaborator

@xiki-tempula xiki-tempula commented Jun 25, 2022

Fix #200 It seems that pandas 1.4.3 has some issue.

@codecov
Copy link

codecov bot commented Jun 25, 2022

Codecov Report

Merging #201 (a651b3e) into master (fd08e1d) will increase coverage by 0.15%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #201      +/-   ##
==========================================
+ Coverage   98.04%   98.20%   +0.15%     
==========================================
  Files          24       24              
  Lines        1333     1340       +7     
  Branches      281      290       +9     
==========================================
+ Hits         1307     1316       +9     
  Misses          5        5              
+ Partials       21       19       -2     
Impacted Files Coverage Δ
src/alchemlyb/parsing/namd.py 99.17% <100.00%> (+<0.01%) ⬆️
src/alchemlyb/parsing/__init__.py 100.00% <0.00%> (ø)
src/alchemlyb/parsing/amber.py 97.48% <0.00%> (+0.02%) ⬆️
src/alchemlyb/estimators/mbar_.py 98.24% <0.00%> (+0.03%) ⬆️
src/alchemlyb/parsing/gmx.py 98.18% <0.00%> (+0.60%) ⬆️
src/alchemlyb/parsing/gomc.py 93.20% <0.00%> (+1.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd08e1d...a651b3e. Read the comment docs.

@xiki-tempula xiki-tempula marked this pull request as draft June 25, 2022 22:08
@xiki-tempula xiki-tempula marked this pull request as ready for review June 26, 2022 10:08
@orbeckst
Copy link
Member

Do we know what the issue with Pandas is? Is 1.4.3 the latest version and does, say, 1.4.4 work? Basically I am wondering if this is has been wrong the whole time and is fixed now and we need to change the test or if this is a temporary issue?

If it's temporary then we also need to change the pip and condas stuff so that version of pandas is never installed.

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a known bug in pandas 1.4.3 that explains this behavior?

I'd like to understand the reason before we just ban certain versions (which would also need to be done in the installation meta data).

@xiki-tempula
Copy link
Collaborator Author

@orbeckst Sorry for the delay, I will investigate this issue in the weekends.

@orbeckst
Copy link
Member

orbeckst commented Jun 28, 2022 via email

@dotsdl
Copy link
Member

dotsdl commented Jun 29, 2022

Investigating this now. Looks like in pandas 1.4.3 the columns of our u_nk dataframes are being sorted from smallest to highest lambda value. Digging into if this is intentional behavior in latest pandas.

@dotsdl
Copy link
Member

dotsdl commented Jun 30, 2022

Behavior of this test prior to pandas 1.4.3 gives a u_nk that looks like:

u_nk
                    1.0       0.9  0.8  0.7  0.6  0.5  0.4  0.3  0.2       0.1  0.0
time    fep-lambda                                                                 
4000.0  1.0         0.0  0.101483  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
4010.0  1.0         0.0  0.396537  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
4020.0  1.0         0.0  0.813706  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
4030.0  1.0         0.0  0.533748  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
4040.0  1.0         0.0 -0.238862  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
...                 ...       ...  ...  ...  ...  ...  ...  ...  ...       ...  ...
49960.0 0.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -2.880429  0.0
49970.0 0.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -2.901732  0.0
49980.0 0.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -3.063265  0.0
49990.0 0.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -2.727282  0.0
50000.0 0.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -1.969434  0.0

[30170 rows x 11 columns]

which gives for the resulting BAR estimator's delta_f_:

           1.0       0.9       0.8       0.7       0.6       0.5       0.4        0.3        0.2       0.1       0.0
1.0   0.000000 -1.296011 -2.914735 -4.520049 -5.732406 -6.495677 -8.836413 -10.094600 -10.228049 -8.583977 -4.184060
0.9   1.296011  0.000000 -1.618724 -3.224038 -4.436395 -5.199666 -7.540402  -8.798590  -8.932038 -7.287966 -2.888049
0.8   2.914735  1.618724  0.000000 -1.605314 -2.817671 -3.580942 -5.921678  -7.179866  -7.313314 -5.669242 -1.269325
0.7   4.520049  3.224038  1.605314  0.000000 -1.212357 -1.975628 -4.316364  -5.574552  -5.708000 -4.063928  0.335989
0.6   5.732406  4.436395  2.817671  1.212357  0.000000 -0.763271 -3.104007  -4.362195  -4.495643 -2.851571  1.548346
0.5   6.495677  5.199666  3.580942  1.975628  0.763271  0.000000 -2.340736  -3.598924  -3.732372 -2.088300  2.311617
0.4   8.836413  7.540402  5.921678  4.316364  3.104007  2.340736  0.000000  -1.258188  -1.391636  0.252436  4.652353
0.3  10.094600  8.798590  7.179866  5.574552  4.362195  3.598924  1.258188   0.000000  -0.133449  1.510624  5.910541
0.2  10.228049  8.932038  7.313314  5.708000  4.495643  3.732372  1.391636   0.133449   0.000000  1.644072  6.043989
0.1   8.583977  7.287966  5.669242  4.063928  2.851571  2.088300 -0.252436  -1.510624  -1.644072  0.000000  4.399917
0.0   4.184060  2.888049  1.269325 -0.335989 -1.548346 -2.311617 -4.652353  -5.910541  -6.043989 -4.399917  0.000000

This means that using delta_f_.iloc[0, -1] as we do for each test (and generally advise users) here gives the free energy difference going from lambda = 1 to lambda = 0.

With pandas 1.4.3, a fix to pd.concat's sort parameter was included, which changes the behavior of this line in the namd u_nk parser. We now get for u_nk:

                    0.0       0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8       0.9  1.0
time    fep-lambda                                                                 
4000.0  1.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  0.101483  0.0
4010.0  1.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  0.396537  0.0
4020.0  1.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  0.813706  0.0
4030.0  1.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  0.533748  0.0
4040.0  1.0         NaN       NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN -0.238862  0.0
...                 ...       ...  ...  ...  ...  ...  ...  ...  ...       ...  ...
49960.0 0.0         0.0 -2.880429  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
49970.0 0.0         0.0 -2.901732  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
49980.0 0.0         0.0 -3.063265  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
49990.0 0.0         0.0 -2.727282  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN
50000.0 0.0         0.0 -1.969434  NaN  NaN  NaN  NaN  NaN  NaN  NaN       NaN  NaN

[30170 rows x 11 columns]

which features sorted columns. The resulting BAR estimator's delta_f_ then gives:

          0.0       0.1        0.2        0.3       0.4       0.5       0.6       0.7       0.8       0.9        1.0
0.0  0.000000 -4.399917  -6.043989  -5.910541 -4.652353 -2.311617 -1.548346 -0.335989  1.269325  2.888049   4.184060
0.1  4.399917  0.000000  -1.644072  -1.510624 -0.252436  2.088300  2.851571  4.063928  5.669242  7.287966   8.583977
0.2  6.043989  1.644072   0.000000   0.133449  1.391636  3.732372  4.495643  5.708000  7.313314  8.932038  10.228049
0.3  5.910541  1.510624  -0.133449   0.000000  1.258188  3.598924  4.362195  5.574552  7.179866  8.798590  10.094600
0.4  4.652353  0.252436  -1.391636  -1.258188  0.000000  2.340736  3.104007  4.316364  5.921678  7.540402   8.836413
0.5  2.311617 -2.088300  -3.732372  -3.598924 -2.340736  0.000000  0.763271  1.975628  3.580942  5.199666   6.495677
0.6  1.548346 -2.851571  -4.495643  -4.362195 -3.104007 -0.763271  0.000000  1.212357  2.817671  4.436395   5.732406
0.7  0.335989 -4.063928  -5.708000  -5.574552 -4.316364 -1.975628 -1.212357  0.000000  1.605314  3.224038   4.520049
0.8 -1.269325 -5.669242  -7.313314  -7.179866 -5.921678 -3.580942 -2.817671 -1.605314  0.000000  1.618724   2.914735
0.9 -2.888049 -7.287966  -8.932038  -8.798590 -7.540402 -5.199666 -4.436395 -3.224038 -1.618724  0.000000   1.296011
1.0 -4.184060 -8.583977 -10.228049 -10.094600 -8.836413 -6.495677 -5.732406 -4.520049 -2.914735 -1.296011   0.000000

So calling delta_f_.iloc[0, -1] gives the negative of the previous result, since we are now getting the free energy difference going from lambda = 0 to lambda = 1.

@dotsdl
Copy link
Member

dotsdl commented Jun 30, 2022

As for what the fix should be, I'm not exactly sure. We generally do use delta_f_.iloc[0,-1] to access the top-right element of the free energy difference dataframe from an estimator, which in many cases corresponds to the difference between the state in which all lambdas = 0 and all lambdas = 1. However, I don't believe we've ever made this guarantee across the library.

@dotsdl
Copy link
Member

dotsdl commented Jun 30, 2022

Just checked the docs, and looks like I was wrong: we're advising users there to use delta_f_.loc and explicitly specifying lambda values for [from, to]. This makes our choice of sorting vs. not sorting less important here. Thoughts?

@dotsdl
Copy link
Member

dotsdl commented Jun 30, 2022

Restoring previous behavior for the namd parser involves switching to sort=False in this line, if that's what we'd like to do. I don't know if there are other repercussions to this choice.

@orbeckst
Copy link
Member

Thank you for getting to the bottom of the problem.

Does pandas-dev/pandas#47206 restore the previous behavior? If so, we can just blacklist all versions of pandas with the changed sorting behavior. That would restore our status-quo and avoid breakage of existing scripts.

Should we use .loc in the tests then?

For the deeper question about guaranteeing sorting orders etc we should raise a separate issue. I am happy, though, that we're not giving wrong advice in our docs.

@dotsdl
Copy link
Member

dotsdl commented Jun 30, 2022

Thank you for getting to the bottom of the problem.

Does pandas-dev/pandas#47206 restore the previous behavior? If so, we can just blacklist all versions of pandas with the changed sorting behavior. That would restore our status-quo and avoid breakage of existing scripts.

My understanding is that this PR on pandas fixes what was broken sorting behavior, and that our call to sort=True never worked. But now it does work, so now we end up with a sorted-column-names u_nk from that parser where we didn't have one before.

Should we use .loc in the tests then?

Yes, probably. I can make this change.

For the deeper question about guaranteeing sorting orders etc we should raise a separate issue. I am happy, though, that we're not giving wrong advice in our docs.

Would you mind opening this? I'll finish up this and #197.

Previous `pandas` behavior prior to 1.4.3 [did not sort numeric column
names](pandas-dev/pandas#47127), but this now
occurs. We don't sort within other parsers, so switching this flag to be
consistent with previous behavior. There is no clear reason sorting is
necessary here.
@dotsdl
Copy link
Member

dotsdl commented Jun 30, 2022

@orbeckst if satisfied, please merge. Thanks!

@orbeckst orbeckst changed the title Fix the CI [CI] account for changed pandas.concat sorting behavior Jun 30, 2022
Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this will do it.

@orbeckst orbeckst self-assigned this Jun 30, 2022
@orbeckst orbeckst merged commit e2f8c39 into master Jun 30, 2022
@orbeckst orbeckst deleted the xiki-tempula-patch-1 branch June 30, 2022 01:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

new version of pandas >=1.4 breaks NAMD-based test due to change in DataFrame column order
3 participants