[V2]: Overhaul MultiHotAtomFeaturizer #658

Merged
merged 31 commits into from Apr 2, 2024

Conversation

oscarwumit
Contributor

Description

This PR attempts to improve the initial atom featurization by limiting the default supported elements to common chemistry.

Example / Current workflow

The current setup in Chemprop, which allocates 101 bits for atomic number, assumes that most of the training sets likely contain chemistry involving the first 100 elements of the periodic table. This design choice, while comprehensive, may not be optimally aligned with the practical needs of most chemical property prediction tasks because these tasks typically involve a much narrower range of elements. As a result, the current encoding method tends to create a very sparse vector that is not necessary and also can negatively impact model training speed and memory requirement.

Bugfix / Desired workflow

This PR seeks to address the abovementioned issue by changing the default encoding of atomic number to elements that are commonly used in applications like pharmaceuticals and materials design. Specifically, the default is changed to the first 4 rows of the periodic table plus iodine and a zero padding for other elements. This design choice should be sufficient for most common use-cases, and the implementation can be easily adapted to include additional elements for special cases.
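For illustration only, the proposed default can be sketched as a one-hot encoding over elements 1-36 plus iodine, with one extra bit shared by every other element. This is not the PR's code: the names `DEFAULT_ATOMIC_NUMS` and `one_hot_atomic_num` are hypothetical, and placing the padding bit at the end is an assumption.

```python
# Hypothetical sketch: first four rows of the periodic table (Z = 1..36) plus iodine (Z = 53).
DEFAULT_ATOMIC_NUMS = list(range(1, 37)) + [53]


def one_hot_atomic_num(z, atomic_nums=DEFAULT_ATOMIC_NUMS):
    """One-hot encode atomic number z; the final bit is shared by all unlisted elements."""
    index_of = {num: i for i, num in enumerate(atomic_nums)}
    bits = [0] * (len(atomic_nums) + 1)  # +1 for the shared padding bit
    bits[index_of.get(z, len(atomic_nums))] = 1
    return bits


print(len(one_hot_atomic_num(6)))  # length of the atomic-number slice under this sketch
```

Under these assumptions the atomic-number slice has 38 entries rather than the 101 described above.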

I carried out some preliminary benchmarks using Chemprop v1. When training on ~300k bimolecular reactions to predict barrier heights, models with the new featurization strategy trained ~40% faster while achieving accuracy similar to the current implementation. Therefore, I think we should implement this change in Chemprop v2.

Questions

I also included more hybridization types supported by rdkit. I think the s hybridization makes sense for H atoms, especially when explicit H is used. The sp2d hybridization is less common, but I think it does not hurt to be included.

Relevant issues

This PR partially addresses the issue: #547
Further discussions are needed to decide if we want to change some features (e.g., formal charges, bond orders, num of Hs) from one-hot encoding to ordinal encoding.

Checklist

  • (if appropriate) unit tests added?
    All relevant unit tests have been updated and passed check.

@oscarwumit oscarwumit changed the title Improve initial atom featurization v2: Improve initial atom featurization Feb 22, 2024
@oscarwumit oscarwumit added the enhancement a new feature request label Feb 22, 2024
@oscarwumit oscarwumit added this to the v2.0.0 milestone Feb 22, 2024
@oscarwumit oscarwumit linked an issue Feb 22, 2024 that may be closed by this pull request
@davidegraff
Contributor

At the point we're altering the initial featurization scheme, is there a reason we don't just eliminate alkali, alkaline earth, and transition metals as well as noble gases? The frequency of these elements in typical inputs must be $\ll0.01$% of all atoms in a standard dataset. At that point, the ability of an MPNN to learn the impact of these atoms on a molecule's representation is almost certainly nonexistent and can likely be approximated by a simple padding bit. At the same time, if we're eliminating infrequent atom types to reduce sparsity in our representation, why would we add a hybridization bit that is exceedingly rare? We would only expect to observe $sp^xd^y$ in transition metal complexes and hypervalent S/P/Cl/Br/I, and I can't really think of datasets where these types of compounds are present in any appreciable amount.

I agree with the notion that we can improve the featurization scheme to improve density, but I think if we're doing so, then we should take it a step further:

  • encode only extended organic elements with unknown padding: H, B, C, N, O, F, S, P, Cl, Br, and I (possibly Si as well)
  • encode only typical organic hybridization states with unknown padding: $s, sp, sp^2, sp^3$

@oscarwumit
Contributor Author

oscarwumit commented Feb 22, 2024

Thanks for the comment. I think we all agree to limit the element types. It is worth a discussion regarding the specific elements to include as default. The elements you suggested are commonly seen in drug discovery applications, but I do think some 4th row metals are commonly seen in materials design datasets. Na and Mg are also commonly seen. I included the 4th row metals to keep some generalizability, but their inclusion is certainly open to discussion. Regarding the hybridization, I think adding s makes sense for H, and sp2d is there mainly because I chose to include 4th row elements that can be hybridized this way.

Another approach is to simply go through the dataset and only include elements that have appeared in the dataset. What is your opinion on this?

@davidegraff
Contributor

I think the question comes down to frequency, i.e., what fraction of total atoms in a total dataset are represented by these "less common" elements. If we can produce bar charts of atomic frequency for representative datasets, we can set a principled criterion for what to include or exclude. I don't doubt that Na and Mg are present in typical datasets, but if less than 1% of compounds have one of these atoms and these compounds only contain a single one, I'd be hard pressed to believe that an MPNN is learning anything about these atoms' contributions to molecular properties beyond just guessing the unconditional mean of an unknown atom type. More specifically, each of these atoms contributes an independent channel in the atomic featurization scheme: how many gradient updates do you expect these channels to receive relative to more conventional atom types in a standard organic dataset?
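One way such a frequency criterion could be prototyped is sketched below. It assumes the atomic numbers of each molecule have already been extracted (for example with RDKit's `Atom.GetAtomicNum()`). The function names, the toy data, and the 1% per-atom threshold are all hypothetical choices for illustration, not values proposed in this thread.

```python
from collections import Counter


def atom_frequencies(molecules):
    """Map atomic number -> fraction of all atoms, given per-molecule lists of atomic numbers."""
    counts = Counter(z for mol in molecules for z in mol)
    total = sum(counts.values())
    return {z: n / total for z, n in counts.items()}


def rare_elements(molecules, threshold=0.01):
    """Atomic numbers whose share of all atoms falls below the (assumed) threshold."""
    freqs = atom_frequencies(molecules)
    return sorted(z for z, f in freqs.items() if f < threshold)


# Toy dataset: 50 copies of a C2H6O-like atom list and one molecule containing sodium (Z = 11).
toy = [[6, 6, 8, 1, 1, 1, 1, 1, 1]] * 50 + [[6, 8, 8, 11]]
print(rare_elements(toy))
```

The resulting frequencies could then be plotted as a bar chart per dataset, and elements below the chosen cutoff would be candidates for the shared padding bit.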

I'm less familiar with materials datasets, so I'd be curious to hear more about chemical composition of these. I come to chemprop from a small molecule background, and my impression is that the large majority of users (e.g., MLPDS) fall into this camp.

Contributor

@KnathanM KnathanM left a comment

You included the 0 padding in the atomic_nums array. I am afraid that might cause problems for users who want to supply their atomic_nums (e.g. featurizer = MultiHotAtomFeaturizer(atomic_nums = [1, 6, 7, 8, 9])) because they would need to remember to include the 0 padding. (My example would throw an error if one of the inputs had a sulfur with the current configuration.) If having a 0 pad for atomic number is always a good idea, what do you think of reverting your changes that put it in atomic_nums? I've suggested these changes. I didn't make the corresponding required changes in test_atom.py though.
In any case you'll also need to update i = self.atomic_nums.get(a.GetAtomicNum() - 1, len(self.atomic_nums)) in num_only(), maybe to i = self.atomic_nums.get(a.GetAtomicNum(), len(self.atomic_nums)) .

+---------------------+-----------------+--------------+

NOTE: the above signature only applies for the default arguments, as the each slice (save for
the final two) can increase in size depending on the input arguments.
"""

max_atomic_num: InitVar[int] = 100
# all elements in the first 4 rows of periodic talbe plus iodine and 0 padding for other elements
Contributor

Suggested change
- # all elements in the first 4 rows of periodic talbe plus iodine and 0 padding for other elements
+ # all elements in the first 4 rows of periodic table plus iodine

Comment on lines 61 to 64
atomic_nums: Sequence[int] = field(default_factory=lambda: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
30, 31, 32, 33, 34, 35, 36, 53])
Contributor

Suggested change
- atomic_nums: Sequence[int] = field(default_factory=lambda: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
- 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
- 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
- 30, 31, 32, 33, 34, 35, 36, 53])
+ atomic_nums: Sequence[int] = field(default_factory=lambda: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
+ 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
+ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
+ 31, 32, 33, 34, 35, 36, 53])

@@ -88,7 +95,7 @@ def __post_init__(self, max_atomic_num: int = 100):
self.hybridizations,
]
subfeat_sizes = [
- 1 + len(self.atomic_nums),
+ len(self.atomic_nums),
Contributor

Suggested change
- len(self.atomic_nums),
+ 1 + len(self.atomic_nums),

@@ -109,18 +116,23 @@ def __call__(self, a: Atom | None) -> np.ndarray:
return x

feats = [
- a.GetAtomicNum() - 1,
+ a.GetAtomicNum() if a.GetAtomicNum() in self.atomic_nums else 0,
Contributor

Suggested change
- a.GetAtomicNum() if a.GetAtomicNum() in self.atomic_nums else 0,
+ a.GetAtomicNum(),

a.GetTotalDegree(),
a.GetFormalCharge(),
int(a.GetChiralTag()),
int(a.GetTotalNumHs()),
a.GetHybridization(),
]
i = 0
pad = False
Contributor

Suggested change
- pad = False

Comment on lines 131 to 135
if not pad:
i += len(choices)
pad = True
else:
i += len(choices) + 1
Contributor

Suggested change
- if not pad:
- i += len(choices)
- pad = True
- else:
- i += len(choices) + 1
+ i += len(choices) + 1

@davidegraff
Contributor

davidegraff commented Feb 23, 2024

I have been noticing this trend increasingly in the PRs, but I will not approve any PRs that use branching logic to achieve their desired outcome if the alternative branch will break the computation pipeline.

If you set pad=not pad for this featurizer, you will change the atom feature dimension by 1 and break the computation pipeline because the input feature dimension to your message passing scheme is now off by 1. Your object model fundamentally assumes that you only ever take one branch, and that is precisely the brittleness that the v2 rewrite was created to address.

edit: this was mistaken. I got confused by a variable that seemingly does nothing?

Contributor

@KnathanM KnathanM left a comment

Maybe this is outside the scope of your PR, but I see that the current tests in test_atom.py aren't very robust. The problem is that the list of atoms tested only has carbon, nitrogen, fluorine, and oxygen. I suggest that instead of:

SMI = "Cn1nc(CC(=O)Nc2ccc3oc4ccccc4c3c2)c2ccccc2c1=O"
@pytest.fixture(params=list(Chem.MolFromSmiles(SMI).GetAtoms())[:5])
def atom(request):
...
@pytest.mark.parametrize(
    "a,x_v_orig",
    zip(
        list(Chem.MolFromSmiles("Fc1cccc(C2(c3nnc(Cc4cccc5ccccc45)o3)CCOCC2)c1").GetAtoms()),

We use:

parser = Chem.SmilesParserParams()
parser.removeHs = False
SMI = "IC([Rb])([H])c1ccccc1"
@pytest.fixture(params=list(Chem.MolFromSmiles(SMI, parser).GetAtoms())[:5])
def atom(request):
...
@pytest.mark.parametrize(
    "a,x_v_orig",
    zip(
        list(Chem.MolFromSmiles(SMI, parser).GetAtoms())[:5],

This approach tests the featurizer on iodine, a normal carbon, rubidium (not in the default list), hydrogen, and an aromatic carbon.

]
# fmt: on
),
)
def test_x_orig(a, x_v_orig):
f = MultiHotAtomFeaturizer()
x_v_calc = f(a)
print(x_v_calc)
Contributor

Just a reminder to remove this print before the PR is finished.

@KnathanM
Contributor

KnathanM commented Feb 23, 2024

If you set pad=not pad for this featurizer ...

Could you clarify what you mean here David? The MultiHotAtomFeaturizer doesn't take pad as an argument.

@@ -36,44 +36,51 @@ class MultiHotAtomFeaturizer(AtomFeaturizer):
+---------------------+-----------------+--------------+
| slice [start, stop) | subfeature | unknown pad? |
Contributor

Suggested change
- | slice [start, stop) | subfeature | unknown pad? |
+ | slice [start, stop) | subfeature | pad for unknown? |

I just realized that "unknown pad" doesn't mean that we don't know how big the padding is, but that it means there is a (single) padding bit set aside for any values not explicitly in the subfeature. That is probably obvious for others, but I've been confused about this for a while. Perhaps "pad for unknown" is more clear. If you made this change, then all the following rows would also need to have this column made wider.

@davidegraff
Contributor

If you set pad=not pad for this featurizer ...

Could you clarify what you mean here David? The MultiHotAtomFeaturizer doesn't take pad as an argument.

I was confused by the pad variable (that doesn't do anything?) that is newly included. I thought it was an attribute that affected the length of the features.

@KnathanM
Contributor

If you set pad=not pad for this featurizer ...

Could you clarify what you mean here David? The MultiHotAtomFeaturizer doesn't take pad as an argument.

I was confused by the pad variable (that doesn't do anything?) that is newly included. I thought it was an attribute that affected the length of the features.

Okay, my understanding of why pad was added is that {atomic_nums, degrees, formal_charges, chiral_tags, num_Hs, and hybridizations} have their featurization bits set in a for loop. Each of these has a padding bit for unknowns and this padding bit is set using i += len(choices) + 1. But currently atomic_nums has its padding bit built in which means it doesn't need padding to be added and pad (which starts False) is a one time switch to account for that inconsistency. I think Oscar did that so the atomic number pad for unknown could be at the 0th element. In my review, I suggested atomic_nums is changed back to not build in a padding bit.
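To make the role of the one-time `pad` switch concrete, the sketch below compares the two bookkeeping schemes for the start offset of each subfeature. It is a simplified stand-in for the featurizer loop, not the PR's code: the function names are hypothetical, and the subfeature sizes other than the atomic-number list are made-up numbers.

```python
def offsets_with_switch(sizes):
    """Start offsets when the FIRST subfeature already contains its own padding entry."""
    starts, i, pad = [], 0, False
    for n in sizes:
        starts.append(i)
        if not pad:
            i += n      # padding is already counted inside the first subfeature
            pad = True
        else:
            i += n + 1  # reserve one extra bit for unknown values
    return starts, i


def offsets_uniform(sizes):
    """Start offsets when every subfeature gets one trailing padding bit."""
    starts, i = [], 0
    for n in sizes:
        starts.append(i)
        i += n + 1
    return starts, i


# 38 = 37 elements plus a built-in pad entry; the remaining sizes are illustrative only.
built_in_pad = [38, 7, 5, 4, 5, 5]
no_built_in_pad = [37, 7, 5, 4, 5, 5]
print(offsets_with_switch(built_in_pad) == offsets_uniform(no_built_in_pad))
```

Both schemes yield the same offsets and total length in this sketch, so the switch changes where the atomic-number padding bit sits rather than how many bits are used.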

@KnathanM
Contributor

Yesterday Oscar and I had discussions about this PR. I'll summarize a bit here and @oscarwumit /@kevingreenman can correct me if needed.

Defaults

We first focused on what the defaults of featurization should be. Generally we feel that the default length of the 1 hot encoding doesn't need to be very small and that a separate featurizer can be the "small" one. So the lengths as are currently in the PR seem good enough to move forward.

It is a bit unclear though whether there should even be a padding for unknown values for any features, including atomic number, degree, formal charge, chiral tag, #Hs, and hybridization. The added length isn't really the issue as much as the user experience. Here's a breakdown of arguments for padding vs not padding for unknown values:

| Feature | Why not pad | Why pad |
|---|---|---|
| atomic number | The user should know what elements are in their training dataset and can set the featurizer to include those elements. If there are elements that only appear a couple of times, perhaps those datapoints should be removed because the model can't learn much about those elements anyway. Later a user may try to run inference on elements not seen before, but this should probably throw an error as the model doesn't know anything about that. | We assume many Chemprop users will just download the software and run it without much real ML knowledge. It would be nice if the software just runs to completion for first-time users. When they get a couple of bad predictions later, they can look at those molecules and figure out that those molecules are different from what the model was trained on. |
| degree | Many of the same arguments. Also, molecules can reasonably bond to max 8 (?) things. We don't need a pad for unknown if we just include all possibilities in the default. | |
| formal charge | We wouldn't need a pad for unknown if we encoded this as a float. | Might be hard for a user to see what all formal charges are in their dataset (compared to atomic number). Having a pad for unknown would catch highly charged species, which may not be rare. |
| chiral tag | To be honest, I don't really understand what this is. | |
| #Hs | Similar thoughts to degree. | |
| hybridization | | I don't know enough about hybridization, but to me SP2D and SP3D are about the same. They could both map to an "everything" bit (unknown pad) along with all the other complicated hybridizations. |

A main argument for including the pad is that Chemprop v1 did this. Doesn't mean we have to, but also means that we should have a reason before changing it.

Implementation

As noted before, currently the PR treats atomic_nums and the other features differently when adding the padding bit for unknowns. The unknown pad is explicitly built in to atomic_nums while the rest of the features are padded by adding one to the length of options and using the index of that bit as the default in this line: j = choices.get(feat, len(choices)).

I feel it would be better to treat atomic_nums the same as the other features for two reasons. First, it requires fewer changes to the code and is simpler (see the summary of my suggestion below). Oscar's view is that the PR works as is. Second, if a user is supplying their own atomic_nums and e.g. formal_charges when creating an object of the MultiHotAtomFeaturizer class, they would need to remember to include an extra pad in atomic_nums and to not include it in formal_charges. Oscar's view is that users should not be passing arguments to MultiHotAtomFeaturizer and should instead make their own featurizer class that perhaps inherits from MultiHotAtomFeaturizer if they want to change anything. It would be over-engineering to make it easy for users to create a custom featurizer by (1) making MultiHotAtomFeaturizer able to take arguments and (2) having atomic_nums padding consistent with the other features.

Side note

The tests are failing due to CondensedGraphOfReactionFeaturizer using max_atomic_num to calculate feature dimensions. The four cases of max_atomic_num in featurizers/molgraph/reaction.py and the two cases in tests/unit/featurizers/test_cgr.py will also need to be changed in this PR. Also the num_only method needs to be either removed or updated as well (see my previous comments for more details.) Happy to help with that if Oscar doesn't have time.

Simpler method?

My main current concern with the PR is the way it pads atomic_nums is different than the other features which requires a new variable pad and makes the code a little more complex. Below are some of my more granular thoughts. I made a simple fake PR on my forked version to show that only a few changes are needed to change the size of the atomic type one hot encoding.

To make sure we are all on the same page, I want to give a quick summary of how the v1 code works. The purpose of the code is to map (atom type) to (bit index in one hot encoding). V1 does this via a user accessible variable max_atomic_num that is then expanded using range(max_atomic_num). self.atomic_nums = {i: i for i in range(max_atomic_num)} then creates a simple dictionary where the keys are (atomic number - 1) and the values are the bit indices. Later a.GetAtomicNum() - 1 converts the atomic numbers to (atomic number - 1) so that the feature can be used as the key down below in for feat, choices in zip(feats, self._subfeats). feat is the value of the key and choices is a dictionary mapping keys to unpadded bit indices. First the index is retrieved using j = choices.get(feat, len(choices)); if the key is not in the dictionary (i.e. it is an unknown), the default index is len(choices). This index of len(choices) would be a problem when we write the bit x[i + j] = 1 where x = np.zeros(self.__size), because an array created with np.zeros(length) is zero indexed and doesn't have an index array[length]. But the code handles that by adding 1 to the sizes of the one hot encodings with a pad for unknown values, subfeat_sizes = [ 1 + len(self.atomic_nums), ..., and then shifting i (the start location of the feature one hot encoding) by one more than the length of the unpadded feature, i += len(choices) + 1.
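The mechanism summarized above can be sketched in a few lines. This is a simplified stand-in rather than the Chemprop source: it uses plain lists instead of `np.zeros`, the function name `multi_hot` is hypothetical, and the `degrees` dictionary is an assumed example of a second subfeature.

```python
def multi_hot(feats, subfeats):
    """feats: one raw value per subfeature; subfeats: list of {value: bit index} dicts."""
    size = sum(len(choices) + 1 for choices in subfeats)  # one pad bit per subfeature
    x = [0] * size
    i = 0  # start location of the current subfeature's one-hot slice
    for feat, choices in zip(feats, subfeats):
        j = choices.get(feat, len(choices))  # unknown values fall back to the pad index
        x[i + j] = 1
        i += len(choices) + 1
    return x


max_atomic_num = 100
atomic_nums = {i: i for i in range(max_atomic_num)}  # keys are (atomic number - 1)
degrees = {d: d for d in range(6)}                   # illustrative second subfeature

carbon = multi_hot([6 - 1, 4], [atomic_nums, degrees])
unknown = multi_hot([118 - 1, 9], [atomic_nums, degrees])
```

In this sketch a carbon with degree 4 sets bits 5 and 105, while values outside both dictionaries land on the pad bits at indices 100 and 107.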

If we aren't using all consecutive atom types then self.atomic_nums = {i: i for i in range(max_atomic_num)} needs to be changed to not use the same variable for both keys and bit index values. I really like that Oscar changed this to be consistent with how the dictionaries for self.formal_charges and self.hybridizations are set up: self.atomic_nums = {j: i for i, j in enumerate(self.atomic_nums)}, where self.atomic_nums starts as a list of the atomic numbers the featurizer supports. The resulting self.atomic_nums is a dictionary mapping of atomic number -> bit index. This means a.GetAtomicNum() - 1 can just be changed to a.GetAtomicNum() as the dictionary keys are the atomic numbers. Then if an atomic number that isn't a dictionary key is given, j = choices.get(feat, len(choices)) will still send it to the pad for unknown at the end of the one hot encoding for that feature.
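A small side-by-side of the two dictionary styles may help here. The helper names `v1_bit` and `new_bit` are hypothetical, and the element list is the default discussed in this PR (elements 1-36 plus iodine).

```python
# v1 style: consecutive elements, keyed by (atomic number - 1)
v1_map = {i: i for i in range(100)}

# PR style: arbitrary element list, keyed by the atomic number itself
supported = list(range(1, 37)) + [53]
new_map = {j: i for i, j in enumerate(supported)}


def v1_bit(z):
    """Bit index for atomic number z in the v1-style mapping (pad index if unknown)."""
    return v1_map.get(z - 1, len(v1_map))


def new_bit(z):
    """Bit index for atomic number z in the enumerate-based mapping (pad index if unknown)."""
    return new_map.get(z, len(new_map))


for z in (6, 37, 53):
    print(z, v1_bit(z), new_bit(z))
```

With these definitions carbon keeps index 5 in both mappings, iodine moves from index 52 to index 36, and rubidium (Z = 37) falls through to the pad index 37 in the enumerate-based mapping.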

I think the added complexity of this PR stems from trying to put the pad for unknown atomic numbers at the beginning of the one hot encoding, while it goes at the end for the rest of the features.

A final thought about custom featurizers.

The question was raised about how users could use MultiHotAtomFeaturizer to create their own featurizer, for example if they wanted to include only copper, silver, and gold atoms in their model. The simplest way to do that is to pass the list of atomic numbers to the featurizer class when making the featurizer object: featurizer = MultiHotAtomFeaturizer(atomic_nums=[29, 47, 79]). I think this is the expected and preferred way to customize the featurizer because it doesn't require copying any code. MultiHotAtomFeaturizer is a dataclass, which is used to automatically create an __init__() method whose arguments are the lists to use for the features. This PR itself uses this functionality in the tests:

def featurizer(atomic_num, degree, formal_charge, chiral_tag, num_Hs, hybridization):
    return MultiHotAtomFeaturizer(atomic_num, degree, formal_charge, chiral_tag, num_Hs, hybridization)
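As a toy illustration of that dataclass behaviour, the sketch below defines a stripped-down class with a single field. `ToyAtomFeaturizer` is hypothetical and only handles atomic numbers; it is not the MultiHotAtomFeaturizer implementation.

```python
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class ToyAtomFeaturizer:
    # The generated __init__ accepts atomic_nums as a keyword argument.
    atomic_nums: Sequence[int] = field(default_factory=lambda: list(range(1, 37)) + [53])

    def __post_init__(self):
        # Turn the user-facing list into an {atomic number: bit index} lookup.
        self._index_of = {z: i for i, z in enumerate(self.atomic_nums)}

    def __len__(self):
        return len(self._index_of) + 1  # +1 for the trailing pad-for-unknown bit

    def __call__(self, z):
        x = [0] * len(self)
        x[self._index_of.get(z, len(self._index_of))] = 1
        return x


coinage = ToyAtomFeaturizer(atomic_nums=[29, 47, 79])
print(coinage(47), coinage(6))
```

Because the padding bit is appended by the class rather than listed in `atomic_nums`, a user supplying `[29, 47, 79]` does not have to remember to add a pad entry.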

@oscarwumit
Contributor Author

Thanks for the insightful comments. I have modified the implementation as discussed in the meeting. Currently the test on atoms will pass but not for the CGRs, and resolving this could take some time. Help from someone familiar with the CGR code is appreciated.

@KnathanM
Contributor

Agreed that resolving those CGR tests will take some time. So we can plan to include this in the 2.0 formal release and not the release candidate.

@KnathanM KnathanM modified the milestones: v2.0.0-rc.1, v2.0.0 Mar 1, 2024
davidegraff
davidegraff previously approved these changes Mar 4, 2024
Contributor

@davidegraff davidegraff left a comment

I am going to approve this PR but would like to add food for thought: we should consider refactoring the "setup"s into separate @classmethods. That is, we define a MultiHotAtomFeaturizer class with no default argument values and instead rely on separate constructors that set the defaults, e.g.,:

class MultiHotAtomFeaturizer:
    ... # dataclass fields go here but WITHOUT the `default_factory` values

    @classmethod
    def yang2020(cls, condensed: bool=True):
        r"""build the atom featurizer used in [1]_

        Parameters
        -----------
        condensed : bool, default=False
            whether to use a condensed list of atom types. If `False`, use all atomic numbers :math:`z \in [1, 100]`. Otherwise, use atomic numbers :math:`z \in [1, 37] \cup \{53\}`

        References
        -----------
        .. [1] REF TO OG CHEMPROP PAPER
        """

    @classmethod
    def organic(cls):
        r"""build a minimal featurizer with atom types for typical organic elements, i.e., :math:`z \in \{1, 5, 6, 7, 8, 9, 15, 16, 17, 35, 53\}`"""

This would be more idiomatic because users would now build their atom featurizers like so:

af = MultiHotAtomFeaturizer.yang2020(condensed=False)

rather than just assuming that the initializer is doing one thing when in reality the documentation/init has changed in a previous commit.
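A runnable toy version of this named-constructor pattern is sketched below. The class `NamedSetupFeaturizer` is hypothetical and carries only an `atomic_nums` field. The constructor names `v1`, `default`, and `organic` mirror the modes that appear later in this thread, but their bodies here are assumptions based on the element sets described in the conversation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class NamedSetupFeaturizer:
    atomic_nums: Sequence[int]  # no default: callers must choose a named setup or pass a list

    @classmethod
    def v1(cls, max_atomic_num=100):
        """First max_atomic_num elements, as in the v1-style encoding."""
        return cls(atomic_nums=list(range(1, max_atomic_num + 1)))

    @classmethod
    def default(cls):
        """First four rows of the periodic table plus iodine."""
        return cls(atomic_nums=list(range(1, 37)) + [53])

    @classmethod
    def organic(cls):
        """Typical organic elements: H, B, C, N, O, F, P, S, Cl, Br, I."""
        return cls(atomic_nums=[1, 5, 6, 7, 8, 9, 15, 16, 17, 35, 53])


af = NamedSetupFeaturizer.organic()
print(len(af.atomic_nums))
```

With no default values on the field, calling `NamedSetupFeaturizer()` without arguments raises a `TypeError`, so the chosen setup is always visible at the call site.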

@shihchengli shihchengli self-requested a review March 6, 2024 20:02
@oscarwumit
Contributor Author

oscarwumit commented Mar 6, 2024

Thanks for the comment. @shihchengli and I will work on making sure the CGR tests pass for this PR before merging. And I will incorporate the comments. Please do not merge this PR yet.

Contributor

@shihchengli shihchengli left a comment

Thanks for making this PR. The changes look good to me. Some minor suggestions are left. I will work on the CGR tests.

choices=list(RxnMode.keys()),
help="""Choices for multi-hot atom featurization scheme. This will affect both non-reatction and reaction feturization (case insensitive):
- 'default': Includes all elements in the first 4 rows of the periodic talbe plus iodine and an 0 padding for other elements (default in Chemprop v2).
- 'v1': Same implementation as Chemprop v1 default.
Contributor

Suggested change
- 'v1': Same implementation as Chemprop v1 default.
- 'v1': Includes the first 100 elements in the periodic table (same implementation as Chemprop v1 default).

Comment on lines 98 to 104
if n == 53: # special check for Iodine
assert x[len(atomic_num) - 1] == 1
Contributor

I suggest using a SMILES containing iodine as a test case to avoid these two lines.

Contributor Author

I do not think using an iodine example will eliminate these two lines. Atomic numbers 1-36 map directly to entries 0 to 35 of the atomic number feature vector, but iodine has atomic number 53 and is mapped to index 36, so it needs a special check anyway.

- def test_x_orig(a, x_v_orig):
- f = MultiHotAtomFeaturizer()
+ def test_x_orig_default(a, x_v_orig):
+ f = MultiHotAtomFeaturizer.default()
Contributor

Also, test the other two methods. It would also be good to make the test code as similar as possible to test_bond.py. The zip is used here to extract the first 4 atoms to compare, whereas test_bond.py uses the index instead.


case "ORGANIC":
atom_featurizer=MultiHotAtomFeaturizer.organic()

Contributor

Do we need to raise an error for an unknown multi_hot_atom_featurizer_mode?

Contributor

There should be a new enum:

class AtomFeatureMode(EnumMapping):
    DEFAULT = auto()
    V1 = auto()
    ORGANIC = auto()

so that the unknown case will be handled by that, and we can throw a RuntimeError if it falls through the match statement.

oscarwumit and others added 3 commits April 1, 2024 23:13
Various updates based on PR review comments.
Default behavior for atom featurizer is set in mixins.py, so no need to specify here.
@shihchengli
Contributor

After investigation, the failure of the tests is due to the fact that the output scalers are not saved in the checkpoint files. I have manually updated the values in the checkpoint files so that we can pass the tests. The issue with the output scalers has been mentioned in #694 and will be resolved in #726.

@oscarwumit
Contributor Author

Thanks for the update. @shihchengli Can you rebase to consolidate similar commits together? After that, we can merge this in.

@oscarwumit
Contributor Author

Thanks everyone for the good work. I will merge.

@oscarwumit oscarwumit merged commit edf7f2c into v2/dev Apr 2, 2024
9 checks passed
@oscarwumit oscarwumit deleted the abridged_atom_num branch April 2, 2024 17:29
Labels
enhancement a new feature request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[TODO]: v2 Improve initial featurization
4 participants