Qcarchive update #187

chrisiacovella · 2023-09-21T03:29:16Z

This updates qcarchive_utils.py to be compatible with v0.5 of qcportal. Relates to issue #185.

This code reproduces the same behavior as the prior implementation.

mikemhenry · 2023-09-21T15:23:50Z

Awesome! This is good timing with #186

Once we get both in, we should cut a new release.

chrisiacovella · 2023-09-21T16:22:47Z

This PR implements the logic in effectively the same way as the old code, which is on a per-record basis (i.e., a function operates on a single record name). The new version of qcportal has iterators on records, which are substantially faster (like orders of magnitude, due to prefetching and caching). The next commit will include functions that operate on the entire record sets to avoid slow performance.
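A rough sketch of why the iterator route is cheaper than per-record lookups. This is not the code in this PR: `FakeDataset` is a hypothetical stand-in that only mimics a dataset with a per-record getter and a batched record iterator, and the batch size of 100 is an assumed value chosen for illustration.

```python
from typing import Dict, Iterator, List, Tuple


class FakeDataset:
    """Hypothetical stand-in for a remote dataset; counts simulated round trips."""

    def __init__(self, records: Dict[str, float], batch_size: int = 100) -> None:
        self._records = records
        self._batch_size = batch_size  # assumed prefetch size, illustration only
        self.round_trips = 0

    def get_record(self, name: str) -> float:
        # Per-record access: one simulated round trip for every record.
        self.round_trips += 1
        return self._records[name]

    def iterate_records(self) -> Iterator[Tuple[str, float]]:
        # Iterator access: records arrive in batches, one round trip per batch.
        names: List[str] = list(self._records)
        for start in range(0, len(names), self._batch_size):
            self.round_trips += 1
            for name in names[start:start + self._batch_size]:
                yield name, self._records[name]


def fetch_per_record(ds: FakeDataset, names: List[str]) -> Dict[str, float]:
    """Look up each record by name, one call at a time."""
    return {name: ds.get_record(name) for name in names}


def fetch_with_iterator(ds: FakeDataset) -> Dict[str, float]:
    """Walk the whole record set through the batched iterator."""
    return dict(ds.iterate_records())


records = {f"mol-{i}": float(i) for i in range(250)}

slow = FakeDataset(records)
per_record = fetch_per_record(slow, list(records))

fast = FakeDataset(records)
iterated = fetch_with_iterator(fast)

print(slow.round_trips, fast.round_trips)
```

Both routes return the same data; only the number of simulated round trips differs, which is the effect the record-set functions in the next commit are meant to exploit.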

codecov-commenter · 2023-09-21T16:30:32Z

Codecov Report

❗ No coverage uploaded for pull request base (main@2e61215).
The diff coverage is n/a.

mikemhenry · 2023-09-21T16:36:44Z

This line will need to get changed @chrisiacovella https://github.com/choderalab/espaloma/pull/187/files#diff-ba5d22563299549a389183418fe5786b83275382be592bf1ed06fae673b7d086L23 Sorry about that!

mikemhenry · 2023-09-21T16:39:27Z

We can probably remove that line since https://github.com/choderalab/espaloma/pull/187/files#diff-ba5d22563299549a389183418fe5786b83275382be592bf1ed06fae673b7d086R33 will pull in what we need (I think, I am not sure what the "main" qcarchive package is)

chrisiacovella · 2023-09-21T19:41:53Z

> This line will need to get changed @chrisiacovella https://github.com/choderalab/espaloma/pull/187/files#diff-ba5d22563299549a389183418fe5786b83275382be592bf1ed06fae673b7d086L23 Sorry about that!

Good catch.

mikemhenry

LGTM, had two non-blocking notes

espaloma/data/qcarchive_utils.py

mikemhenry · 2023-09-21T21:33:24Z

espaloma/data/qcarchive_utils.py

-        mol = final_molecules[angle]
+        # NOTE: this is calling the first index of the optimization array
+        # this gives the same value as the prior implementation, but I wonder if it
+        # should be molecule_optimization[angle][-1] in both cases


@kntkb or @yuanqing-wang thoughts?

chrisiacovella · 2023-09-21

So I've been trying to figure out the structure of the torsion drive datasets (I had not really looked at them before this). Considering the example dataset I used in the test (I'll put the code below), each angle has n unique initial conformations that are then optimized. In this case, there are 4 configurations (each with its own trajectory). So I suppose choosing the first vs. the last is somewhat irrelevant (I was initially thinking this was a set of chained optimizations, hence my comment... don't ask why I was thinking that).

Should each of these conformations be considered and added to the datasets rather than just arbitrarily picking one?

from espaloma.data import qcarchive_utils
import numpy as np
record_name = "[h]c1c(c(c(c([c:1]1[n:2]([c:3](=[o:4])c(=c([h])[h])[h])c([h])([h])[h])[h])[h])n(=o)=o)[h]"
name = "OpenFF Amide Torsion Set v1.0"
collection_type = "torsiondrive"
collection, record_names = qcarchive_utils.get_collection(qcarchive_utils.get_client(), collection_type, name)
record_info = collection.get_record(record_name, specification_name="default")
molecule_optimization = record_info.optimizations
angle_keys = list(molecule_optimization.keys())
angle = angle_keys[0]
mol = molecule_optimization[angle][0].final_molecule
result = molecule_optimization[angle][0].trajectory[-1].properties

Looking at the actual configurations:

for i in range(len(molecule_optimization[angle])):
    init = molecule_optimization[angle][i].initial_molecule.geometry
    final = molecule_optimization[angle][i].final_molecule.geometry
    print(init, "\n-\n", final, "\n--\n")

kntkb · 2023-09-21

> @kntkb or @yuanqing-wang thoughts?

I don't know off the top of my head, but I've played around with different QCArchive workflows in the past. I may have some notes left somewhere, so I'll catch up shortly (tomorrow?).

kntkb · 2023-09-22

Oh, the api and the way you access the data changed using qcportal v0.5...

chrisiacovella · 2023-09-22

I think in the older version of qcportal, "get_final_molecule()" just picked the first one in the array. The full array was still part of the data record; you just had to dig through the qcvars or something to access it. From conversations with Ben, there was a lot of trying to force records into a very rigid schema in the old version; he opted to break the schema in a lot of cases to just make it easier to access the relevant information (and make it clearer what information is available).

As I mentioned in an earlier comment, it seems that for each angle, multiple (in this case 4) independent starting configurations were used. It seems like it would be better to have the code return data for each replicate, but I'm not sure how this would impact any workflows that use this function.
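To make the options concrete, here is a minimal sketch of three ways to handle the replicates at each angle. It does not touch qcportal: `FakeOptimization` is a hypothetical stand-in for an optimization record, the tuple angle keys are an assumption about the shape of the `optimizations` mapping, and the geometries and energies are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class FakeOptimization:
    """Hypothetical stand-in for one optimization at a grid angle."""
    final_geometry: List[float]
    final_energy: float


Angle = Tuple[int, ...]
Optimizations = Dict[Angle, List[FakeOptimization]]


def first_replicate(optimizations: Optimizations) -> Dict[Angle, FakeOptimization]:
    """Mimic the behavior discussed above: keep index 0 at every angle."""
    return {angle: opts[0] for angle, opts in optimizations.items()}


def lowest_energy_replicate(optimizations: Optimizations) -> Dict[Angle, FakeOptimization]:
    """Alternative: keep the replicate with the lowest final energy."""
    return {
        angle: min(opts, key=lambda opt: opt.final_energy)
        for angle, opts in optimizations.items()
    }


def all_replicates(optimizations: Optimizations) -> List[Tuple[Angle, int, FakeOptimization]]:
    """Alternative: one (angle, replicate_index, optimization) row per replicate."""
    return [
        (angle, index, opt)
        for angle, opts in sorted(optimizations.items())
        for index, opt in enumerate(opts)
    ]


# Made-up geometries and energies, purely for illustration.
example: Optimizations = {
    (-90,): [FakeOptimization([0.0, 0.1], -1.0), FakeOptimization([0.0, 0.2], -1.5)],
    (0,): [FakeOptimization([1.0, 0.1], -2.0), FakeOptimization([1.0, 0.3], -1.8)],
}

picked_first = first_replicate(example)
picked_lowest = lowest_energy_replicate(example)
rows = all_replicates(example)
```

Returning every replicate changes the shape of the result (a list of rows instead of one entry per angle), which is exactly the downstream-workflow concern raised above.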

ijpulidos

This is great. I'm glad that we are now testing the behavior and have some documentation for these utils. I agree with the comments that have been made. Looks good to be merged, just a single non-blocking comment.

espaloma/data/tests/test_qcarchive.py

espaloma/data/qcarchive_utils.py

mikemhenry

From @jchodera

There are apparently some additional issues with the object model such that datasets beyond OptimizationDataset are not supported

chrisiacovella · 2023-09-22T16:43:29Z

> From @jchodera
>
> There are apparently some additional issues with the object model such that datasets beyond OptimizationDataset are not supported

Yes, the get_graph function in the initial code was only set up to work with the OptimizationDataset. I think it would be straightforward to support the SinglepointDataset objects and put in some checking in get_graph and get_graphs to give a descriptive failure message if a different set is tried.
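The kind of check I have in mind could look roughly like this. It is only a sketch: `SUPPORTED_COLLECTION_TYPES` and `check_collection_type` are hypothetical names, the set of supported types is an assumption, and `ValueError` is just a placeholder for whatever exception the module ends up raising.

```python
from typing import Tuple

# Assumption for this sketch: only optimization datasets are handled.
SUPPORTED_COLLECTION_TYPES: Tuple[str, ...] = ("optimization",)


def check_collection_type(collection_type: str) -> None:
    """Raise a descriptive error for dataset types the parser does not handle."""
    if collection_type not in SUPPORTED_COLLECTION_TYPES:
        supported = ", ".join(SUPPORTED_COLLECTION_TYPES)
        raise ValueError(
            f"Unsupported collection type '{collection_type}'; "
            f"supported types: {supported}."
        )


check_collection_type("optimization")  # passes silently

try:
    check_collection_type("singlepoint")
except ValueError as error:
    message = str(error)
    print(message)
```

Running the check before any schema conversion means an unsupported dataset fails with this message instead of an unrelated parsing error.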

kntkb · 2023-09-22T20:20:43Z

@chrisiacovella I remember when fetching the results from the SinglepointDataset that uses b3lyp-d3bj (the OpenFF default level of theory), you needed to combine the results from the DFT and the dispersion correction terms. This is not the case for OptimizationDataset and TorsionDriveDataset. I wonder if this behavior is the same for the latest QCArchive server and qcportal.

chrisiacovella · 2023-09-22T21:04:42Z

> @chrisiacovella I remember when fetching the results from the SinglepointDataset that uses b3lyp-d3bj (the OpenFF default level of theory), you needed to combine the results from the DFT and the dispersion correction terms. This is not the case for OptimizationDataset and TorsionDriveDataset. I wonder if this behavior is the same for the latest QCArchive server and qcportal.

@kntkb This is something I started looking at when switching from the old to the new version, but I can't seem to find my notes; for some reason I think one of the specifications does include the sum, but don't quote me on that. I'm actually trying to figure that out right now.
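If the two terms do still come back separately, the combination step itself is simple. A minimal sketch, assuming the DFT energies and dispersion corrections have already been pulled into two dictionaries keyed by entry name; `combine_energies` is a hypothetical helper and the numbers are placeholders, not real results. Whether the new server already stores the summed value is still the open question above.

```python
from typing import Dict


def combine_energies(dft: Dict[str, float], dispersion: Dict[str, float]) -> Dict[str, float]:
    """Sum the DFT energy and dispersion correction for every entry.

    Raises KeyError if an entry is present in one mapping but not the other,
    so a partially fetched dataset is not silently combined.
    """
    mismatched = sorted(set(dft) ^ set(dispersion))
    if mismatched:
        raise KeyError(f"Entries missing one of the two terms: {mismatched}")
    return {name: dft[name] + dispersion[name] for name in dft}


# Placeholder values, not real energies.
dft_energies = {"mol-a": -100.5, "mol-b": -200.25}
dispersion_corrections = {"mol-a": -0.25, "mol-b": -0.5}

total = combine_energies(dft_energies, dispersion_corrections)
print(total)
```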

Removed support for singlepoint dataset, as openff.molecule cannot parse the singlepoint records properly at this point. Other issues need to be resolved with singlepoint energy beyond this (i.e., summation of dispersion corrections). This PR should sufficiently reproduce the prior behavior, but with new qcportal.

chrisiacovella added 4 commits September 20, 2023 14:58

updated qcharcive code to fetch OpenFF Full Optimization Benchmark 1 to be compatible with qcportal >=0.5

88c66cb

Updated torsiondrive parsing. I'm not sure this has sufficient testing.

6f8198e

Adding in some testing of the torsion function

3e5a58f

Adding in some testing of the torsion function.

94add0c

chrisiacovella requested a review from ijpulidos September 21, 2023 03:32

chrisiacovella added 3 commits September 20, 2023 22:00

Updated collection type name.

908c35f

fixed import issue in test

2afcd1d

fixed parsing of the schema.

aa41891

mikemhenry mentioned this pull request Sep 21, 2023

Test newest DGL Version #186

Merged

fixing a typo that was causing failure of torsion test.

dc7b10c

Merge branch 'main' into qcarchive_update

bc3e983

chrisiacovella added 4 commits September 21, 2023 10:01

Slight change to code to add in a function that uses the iterate_records and iterates_entries functions

9cb3189

Merge remote-tracking branch 'origin/qcarchive_update' into qcarchive_update

59dd1a8

merged with updated dgl update; removing qcportal pinning to the old version.

cb28a48

Made spec_name be a variable

f48ae7c

chrisiacovella requested a review from mikemhenry September 21, 2023 17:57

chrisiacovella added 2 commits September 21, 2023 11:29

Adding in some docstrings

5f638fd

Addressed Mike's comment.s

86042f3

mikemhenry approved these changes Sep 21, 2023

ijpulidos reviewed Sep 21, 2023

espaloma/data/tests/test_qcarchive.py

chrisiacovella added 3 commits September 21, 2023 16:26

Added additional basic docstring.

3861bd7

Added additional basic docstring for torsion parsing

4821aad

Fixed bug in iterate function; added test to catch that bug .

a6a4cdf

jchodera reviewed Sep 22, 2023

espaloma/data/qcarchive_utils.py

mikemhenry self-requested a review September 22, 2023 16:31

mikemhenry requested changes Sep 22, 2023

Added support for singlepoint datasets

54ce464

fixing error in test.

a05bff7

chrisiacovella and others added 5 commits September 22, 2023 14:10

Changing the dataset for singlepoint testing as we need to ensure the dataset has the smiles encoded for converting to openff.molecule

87e9d5a

Changing the dataset for singlepoint testing as we need to ensure the dataset has the smiles encoded for converting to openff.molecule

384bcc1

Move the schema conversion to after checking if a dataset is supported so that it will raise the desired exception rather than failing.

cac1776

Removed support for singlepoint dataset, as openff.molecule cannot parse the singlepoint records properly at this point. Other issues need to be resolved with singlepoint energy beyond this (i.e., summation of dispersion corrections).

01c9de5
