-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add attr to the dataframe #129
Conversation
Codecov Report
@@ Coverage Diff @@
## master #129 +/- ##
==========================================
+ Coverage 97.48% 97.70% +0.21%
==========================================
Files 17 20 +3
Lines 996 1091 +95
Branches 220 230 +10
==========================================
+ Hits 971 1066 +95
Misses 5 5
Partials 20 20
Continue to review full report at Codecov.
|
@orbeckst Sorry for disturbing you. I wonder if you mind have a look and give some advice, please? Thank you. I have added the deprecation of py2 and py356 into this pr for the test to pass but they could be moved to another PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial comments
- make 2.7/3.5 removal separate PR
- need to keep 3.6 around (why does pd not publish packages for 3.6 ???????) so we need to work-around
.travis.yml
Outdated
@@ -7,9 +7,6 @@ env: | |||
- MPLBACKEND=agg | |||
|
|||
python: | |||
- "2.7" | |||
- "3.5" | |||
- "3.6" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to keep 3.6 for now, see #130 , don't do the removal here
setup.py
Outdated
'Programming Language :: Python :: 2.7', | ||
'Programming Language :: Python :: 3', | ||
'Programming Language :: Python :: 3.5', | ||
'Programming Language :: Python :: 3.6', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep 3.6 for now; don't do the removal here
src/alchemlyb/estimators/mbar_.py
Outdated
@@ -92,7 +92,10 @@ def fit(self, u_nk): | |||
columns=self.states_, | |||
index=self.states_) for i in out] | |||
|
|||
(self.delta_f_, self.d_delta_f_, self.theta_) = attrs | |||
(self.delta_f_, self.d_delta_f_, self.theta_) = attrs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we want to rename those attrs
to something else to avoid confusion. free_energy_differences
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be something that could be improved.
but delta_f_, d_delta_f_ are used in all estimators and in the pymbar module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant the use of the name attrs
on the RHS. Now that we are using df.attrs
, it is very confusing because these free energy diffs aren't the attrs that we are using for units.
src/alchemlyb/parsing/gmx.py
Outdated
u_k.attrs['temperature'] = T | ||
u_k.attrs['energy_unit'] = 'kT' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could wrap attribute setting so that we can use old pandas.
# remove once we can require pandas >= 1.2 (after Python 3.
if pd.__version__ >= "1.2": # do this version check properly
def pd_setattrs(df, data):
df.attrs = data
else:
def pd_setattrs(df, data):
if not isinstance(data, dict):
raise TypeError
with warnings.catch_warnings():
warnings.simplefilter("ignore")
setattr(df, 'attrs', data)
and then use pd_setattrs()
everywhere where you set the attributes for the first time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could live in util or somewhere similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orbeckst Sorry, it is a bit unclear to me of what this function is trying to achieve. There should be no problem if one were to only use the alchemlyb functions when pandas < 1.2, but if one were to manipulate the data frame by themselves, pandas < 1.2 would then drop the attrs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I tried to set df.attrs
, pandas complained that I was trying to add a column but set the attribute nevertheless. My idea was to just filter the warnings.
It was not clear to me what the problem with attrs for 1.0 ≤ pandas < 1.2 was: Are you saying that only for ≥ 1.2, slicing a df will retain the attrs on the sliced copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I tried to set df.attrs, pandas complained that I was trying to add a column but set the attribute nevertheless. My idea was to just filter the warnings.
For pandas == 1.0.5,
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.attrs
{}
>>> df.attrs = {1:1}
yields no warning for me.
It was not clear to me what the problem with attrs for 1.0 ≤ pandas < 1.2 was: Are you saying that only for ≥ 1.2, slicing a df will retain the attrs on the sliced copy?
Yes.
for pandas >=1.0 and < 1.2
>>> import pandas as pd
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.attrs = {1:1}
>>> df.attrs
{1: 1}
>>> df['col1'].attrs
{}
@xiki-tempula I am a bit busy over the next few days, sorry if I don't get to reviewing anything. Please feel free to ping me around Wed/Thu again. Thank you for your understanding & patience. I really appreciate the work you're putting in here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking already quite good. I had various comments and one important additional request: We need to document this change in behavior under https://alchemlyb.readthedocs.io/en/stable/parsing.html#standard-forms-of-raw-data so please add text that explains how we annotate units and what users have to do (namely, use pass_attrs
.
Furthermore, add documentation about the new postprocessing
module to the docs.
Update the visualization docs because now units are used for conversion.
Update CHANGES (these are big important ones).
We might need another round of reviews but this is already looking really nice. Great work @xiki-tempula !
assert 'temperature' in df.attrs, 'Attribute temperature not found in the' \ | ||
' input Dataframe.' | ||
assert 'energy_unit' in df.attrs, 'Attribute energy_unit not found in the' \ | ||
' input Dataframe.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace assert
with raising a TypeError
, saying that we need a df with attrs according to https://alchemlyb.readthedocs.io/en/stable/parsing.html#standard-forms-of-raw-data . Ultimately, the dfs are user input so we have to check them properly.
(assert
can be optimized away and should not be used to check user input.)
src/alchemlyb/tests/test_units.py
Outdated
assert new_dhdl.attrs['temperature'] == 310 | ||
assert new_dhdl.attrs['energy_unit'] == 'kT' | ||
|
||
class Test_estimator(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add as TestUnits
to the test_*_estimators.py
so that in the future it is easier to add tests when anything changes in estimators.
src/alchemlyb/tests/test_units.py
Outdated
assert mbar.d_delta_f_.attrs['temperature'] == 300 | ||
assert mbar.d_delta_f_.attrs['energy_unit'] == 'kT' | ||
|
||
class Test_visualisation(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add as TestUnits
to the test_visualisation.py
so that in the future it is easier to add tests when anything changes in viz.
dF.append(to_kcalmol(dhdl.delta_f_).iloc[i, i + 1]) | ||
error.append(to_kcalmol(dhdl.d_delta_f_).iloc[i, i + 1]) | ||
else: | ||
raise NameError('energy_unit {} can only be kT, kJ/mol ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ValueError – but isn't that already being raised with the conversion functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use the dispatch table approach, it's already included.
if units == 'kT': | ||
dF.append(to_kT(dhdl.delta_f_).iloc[i, i + 1]) | ||
error.append(to_kT(dhdl.d_delta_f_).iloc[i, i + 1]) | ||
elif units == 'kJ/mol': | ||
dF.append(to_kJmol(dhdl.delta_f_).iloc[i, i + 1]) | ||
error.append(to_kJmol(dhdl.d_delta_f_).iloc[i, i + 1]) | ||
elif units == 'kcal/mol': | ||
dF.append(to_kcalmol(dhdl.delta_f_).iloc[i, i + 1]) | ||
error.append(to_kcalmol(dhdl.d_delta_f_).iloc[i, i + 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too much code duplications. Use a dispatch table
_converters = {'kT': to_kT, 'kJ/mol': to_kJmol, 'kcal/mol': to_kcalmol}
try:
convert = _converters[units]
except KeyError:
raise ValueError(f"Energy unit {units} is not supported, choose one of {list(_converters.keys())}")
dF.append(convert(dhdl.delta_f_).iloc[i, i + 1])
error.append(convert(dhdl.d_delta_f_).iloc[i, i + 1])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could put the dispatch table into units
and import it.
for dhdl in dhdl_list: | ||
if units == 'kT': | ||
new_unit.append(to_kT(dhdl)) | ||
elif units == 'kJ/mol': | ||
new_unit.append(to_kJmol(dhdl)) | ||
elif units == 'kcal/mol': | ||
new_unit.append(to_kcalmol(dhdl)) | ||
else: | ||
raise NameError('energy_unit {} can only be kT, kJ/mol or ' \ | ||
'kcal/mol.'.format(units)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use dispatch table
@orbeckst Thanks for the review and the advice. I think I have addressed all the changes. Would you mind have a look, please? Thank you. |
I won't be able to review until middle of the week, sorry. Thanks for your patience. |
Can you please rebase against master so that the new CI runs? Thanks. |
@orbeckst Thanks for the advice. I have merged the master to this branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good, some minor comments.
I didn't manage to look at everything in as much detail as I wanted, If I see more I add it.
docs/parsing.rst
Outdated
>>> dhdl_coul = alchemlyb.concat([extract_dHdl(xvg, T=300) for xvg in | ||
dataset['Coulomb']]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't line break
docs/parsing.rst
Outdated
|
||
The metadata (such as the unit of the energy and temperature) are stored in | ||
:attr:`pandas.DataFrame.attrs`, a :class:`dict`. Functions in ``alchemlyb`` are | ||
aware |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reflow (aware is alone on its line)
'''Concatenate pandas objects along a particular axis with optional set | ||
logic along the other axes. If all pandas objects have the same attrs | ||
attribute, the new pandas objects would have this attrs attribute. A | ||
ValueError would be raised if any pandas object has a different attrs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a Parameters section (see numpydoc)
ValueError | ||
If not all pandas objects have the same attrs. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a See Also section for pandas.concat
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check if this renders correctly. If we don't have pandas in the intersphinx mapping, add it to sphinx conf.py.
Also make sure that there are at least two empty lines before the version added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is rendered correctly.
kJ2kcal = 1 / calorie | ||
R_kJmol = R / 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add doc as in
kJ2kcal = 1 / calorie | |
R_kJmol = R / 1000 | |
#: conversion factor from kJ to kcal, based on :data:`scipy.constants.calorie` | |
kJ2kcal = 1 / calorie | |
#: Gas constant :math:`R` in kJ/(mol K), based on :data:`scipy.constants.R` | |
R_kJmol = R / 1000 |
I like your approach here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add these constants to the documentation somewhere
.. autodata:: kJ2kcal
.. autodata:: R_kJmol
kt_df.attrs['energy_unit'] = 'kJ/mol' | ||
return kt_df | ||
|
||
def get_unit_converter(units): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it appear in docs
Returns | ||
------- | ||
func | ||
converter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no period; maybe link to the docs that list the possible converter functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a versionadded
@orbeckst Thanks for the review. I have updated the docs and checked it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second part of review.
- Major question: Why do the GMX tests now produce different answers? This is a serious change and we need to be very clear about what changed, why, and how much of a difference it makes — if the change is necessary and not a bug.
- There are tests where
attr
is explicitly assigned after use ofextract_dHdl
andalchemlyb.concat()
and I don't understand why: I thought that alchemlyb will seamlessly handle units when using only alchemlyb functions. - Mostly minor doc issues.
- Add
postprocessing
to docs/api_principles.rst
ValueError | ||
If not all pandas objects have the same attrs. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check if this renders correctly. If we don't have pandas in the intersphinx mapping, add it to sphinx conf.py.
Also make sure that there are at least two empty lines before the version added.
CHANGES
Outdated
|
||
Deprecations | ||
|
||
Fixes | ||
|
||
Changes | ||
- alchemlyb.concat added to replace pd.concat (PR #129). | ||
- postprocessors.units module for unit conversion (PR #129). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add under enhancements. (The "Changes" parts are listed separately, see other comments.)
Reference #125 (generally, the issue is more important than the PR, you can include both if you like, but definitely reference the issue)
CHANGES
Outdated
@@ -18,12 +18,18 @@ The rules for this file: | |||
* 0.5.0 | |||
|
|||
Enhancements | |||
- Visualisation module will change the data according to input unit (PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reference #125 (generally, the issue is more important than the PR, you can include both if you like, but definitely reference the issue)
CHANGES
Outdated
@@ -18,12 +18,18 @@ The rules for this file: | |||
* 0.5.0 | |||
|
|||
Enhancements | |||
- Visualisation module will change the data according to input unit (PR | |||
#129). | |||
- The parser output dataframe with metadata (PR #129). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would list this under Changes because it changes our data format.
>>> u_nk_coul = pd.concat([extract_u_nk(xvg, T=300) for xvg in bz['Coulomb']]) | ||
>>> u_nk_vdw = pd.concat([extract_u_nk(xvg, T=300) for xvg in bz['VDW']]) | ||
>>> u_nk_coul = alchemlyb.concat([extract_u_nk(xvg, T=300) for xvg in bz['Coulomb']]) | ||
>>> u_nk_vdw = alchemlyb.concat([extract_u_nk(xvg, T=300) for xvg in bz['VDW']]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume you did a git grep concat
and looked at every occurrence in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I did it within the pycharm IDE.
bz = load_benzene().data | ||
dHdl_coul = alchemlyb.concat( | ||
[extract_dHdl(xvg, T=300) for xvg in bz['Coulomb']]) | ||
dHdl_coul.attrs = extract_dHdl(load_benzene().data['Coulomb'][0], T=300).attrs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why attr assignment?
|
||
u_nk_coul = alchemlyb.concat( | ||
[extract_u_nk(xvg, T=300) for xvg in bz['Coulomb']]) | ||
u_nk_coul.attrs = extract_dHdl(load_benzene().data['Coulomb'][0], T=300).attrs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why attr assignment?
The units variable is for labelling only. Changing it doesn't change the | ||
unit of the underlying variable, which is in the unit of :math:`kT`. | ||
unit of the underlying variable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does plot_convergence()
not do unit conversion, even though the other plotting functions do?
Can you change it, not the least the CHANGES says that this is now working for all visualization functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or remove units
and take the label from the df attr
— but having the same behavior as the other plotting functions would be much better and more consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orbeckst There is currently no alchemlyb function that will generate the input for plot_convergence.
The user needs to generate the input file, which could be a simple python list.
@orbeckst Thank you for the review. I will do the changes when I got time.
The original parser uses the gromacs constant https://manual.gromacs.org/documentation/2019/reference-manual/definitions.html
The new R uses the scipy constant.
They differ by
Sorry, I initially did this when we only have pd.concat. |
Thanks for the explanation regarding R. Given that we can consistently convert back to the actual kJ/mol units that GROMACS provides, the change in the reference values in the tests is ok. Let me know when the PR is ready for final review. |
We will need to add a note to CHANGES and a Warning to the gmx parser page, though, because the results in kT that people produced previously will now be slightly off. |
@orbeckst Thanks for the review. I have addressed all the changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, big step forward.
I have added minor corrections, which I will add in batch.
'alchemtest': ('https://alchemtest.readthedocs.io/en/latest/', None)} | ||
'alchemtest': ( | ||
'https://alchemtest.readthedocs.io/en/latest/', None), | ||
'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None)} | |
'scipy': ('https://docs.scipy.org/doc/scipy/reference/', None)} |
- explain change in GROMACS output to ~1e-7 - minor doc fixes
Fix #125