[V2]: Overhaul `MultiHotAtomFeaturizer` #658

oscarwumit · 2024-02-22T08:23:03Z

Description

This PR attempts to improve the initial atom featurization by limiting the default supported elements to common chemistry.

Example / Current workflow

The current setup in Chemprop allocates 101 bits to the atomic number, on the assumption that training sets may contain any of the first 100 elements of the periodic table. This design is comprehensive, but it is poorly aligned with most chemical property prediction tasks, which typically involve a much narrower range of elements. As a result, the current encoding produces an unnecessarily sparse vector, which can slow model training and increase memory use.

Bugfix / Desired workflow

This PR seeks to address the abovementioned issue by changing the default encoding of atomic number to elements that are commonly used in applications like pharmaceuticals and materials design. Specifically, the default is changed to the first 4 rows of the periodic table plus iodine and a zero padding for other elements. This design choice should be sufficient for most common use-cases, and the implementation can be easily adapted to include additional elements for special cases.
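To make the size difference concrete, here is a minimal sketch of the atomic-number slice under the old and the proposed defaults. The names `V1_ATOMIC_NUMS`, `PROPOSED_ATOMIC_NUMS`, and `slice_length` are illustrative and not taken from the Chemprop code base; one extra bit for unknown elements is assumed in both cases.

```python
# Illustrative only: these names are hypothetical, not Chemprop's.
# Old default: atomic numbers 1..100, plus one bit for anything else.
V1_ATOMIC_NUMS = list(range(1, 101))

# Proposed default: rows 1-4 of the periodic table (Z = 1..36) plus iodine (Z = 53).
PROPOSED_ATOMIC_NUMS = list(range(1, 37)) + [53]


def slice_length(atomic_nums):
    """Length of the atomic-number slice: one bit per element plus one unknown bit."""
    return len(atomic_nums) + 1


print(slice_length(V1_ATOMIC_NUMS))        # 101
print(slice_length(PROPOSED_ATOMIC_NUMS))  # 38
```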

I carried out some preliminary benchmarks using Chemprop v1. When training on ~300k bimolecular reactions to predict barrier heights, models with the new featurization strategy trained ~40% faster while achieving accuracy similar to the current implementation. Therefore, I think we should implement this change in Chemprop v2.

Questions

I also included more hybridization types supported by rdkit. I think the s hybridization makes sense for H atoms, especially when explicit H is used. The sp2d hybridization is less common, but I think it does not hurt to be included.

Relevant issues

This PR partially addresses the issue: #547
Further discussions are needed to decide if we want to change some features (e.g., formal charges, bond orders, num of Hs) from one-hot encoding to ordinal encoding.
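For reference, the two encodings under discussion could look roughly like this. This is a sketch only: `one_hot`, `ordinal`, and the `formal_charges` list are hypothetical names and values chosen for illustration, not the Chemprop implementation.

```python
def one_hot(value, choices):
    """One bit per allowed value, plus a trailing bit for anything else."""
    x = [0] * (len(choices) + 1)
    idx = choices.index(value) if value in choices else len(choices)
    x[idx] = 1
    return x


def ordinal(value):
    """A single numeric entry that preserves the ordering of the values."""
    return [float(value)]


formal_charges = [-2, -1, 0, 1, 2]  # hypothetical list of supported charges

print(one_hot(1, formal_charges))  # [0, 0, 0, 1, 0, 0]
print(one_hot(3, formal_charges))  # [0, 0, 0, 0, 0, 1]  (unknown bit)
print(ordinal(1))                  # [1.0]
```

The ordinal form needs one entry instead of six and has no "unknown" case, at the cost of imposing a linear ordering on the feature.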

Checklist

(if appropriate) unit tests added?
All relevant unit tests have been updated and passed check.

davidegraff · 2024-02-22T14:49:07Z

At the point we're altering the initial featurization scheme, is there a reason we don't just eliminate alkali, alkaline earth, and transition metals as well as noble gases? The frequency of these elements in typical inputs must be $\ll0.01$% of all atoms in a standard dataset. At that point, the ability of an MPNN to learn the impact of these atoms on a molecule's representation is almost certainly nonexistent and can likely be approximated by a simple padding bit. At the same time, if we're eliminating infrequent atom types to reduce sparsity in our representation, why would we add a hybridization bit that is exceedingly rare? We would only expect to observe $sp^xd^y$ in transition metal complexes and hypervalent S/P/Cl/Br/I, and I can't really think of datasets where these types of compounds are present in any appreciable amount.

I agree with the notion that we can improve the featurization scheme to improve density, but I think if we're doing so, then we should take it a step further:

- encode only extended organic elements with unknown padding: H, B, C, N, O, F, S, P, Cl, Br, and I (possibly Si as well)
- encode only typical organic hybridization states with unknown padding: $s, sp, sp^2, sp^3$

oscarwumit · 2024-02-22T17:16:47Z

Thanks for the comment. I think we all agree to limit the element types. The specific elements to include by default are worth discussing. The elements you suggested are commonly seen in drug discovery applications, but I do think some 4th-row metals are common in materials design datasets. Na and Mg are also common. I included the 4th-row metals to keep some generalizability, but their inclusion is certainly open to discussion. Regarding the hybridization, I think adding s makes sense for H, and sp2d is there mainly because I chose to include 4th-row elements that can hybridize this way.

Another approach is to simply go through the dataset and only include elements that have appeared in the dataset. What is your opinion on this?

davidegraff · 2024-02-22T17:34:22Z

I think the question comes down to frequency, i.e., what fraction of the total atoms in a dataset are represented by these "less common" elements. If we can produce bar charts of atomic frequency for representative datasets, we can set a principled criterion for what to include or exclude. I don't doubt that Na and Mg are present in typical datasets, but if less than 1% of compounds have one of these atoms and these compounds only contain a single one, I'd be hard pressed to believe that an MPNN is learning anything about these atoms' contributions to molecular properties beyond just guessing the unconditional mean of an unknown atom type. More specifically, each of these atoms contributes an independent channel in the atomic featurization scheme. How many gradient updates do you expect these channels to receive relative to more conventional atom types in a standard organic dataset?
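A frequency criterion of this kind could be computed along the following lines. This is a rough sketch: `element_frequencies` and `elements_above` are hypothetical helpers, the toy data is made up, and each molecule is assumed to be available already as a list of atomic numbers (for example, collected with RDKit's `Atom.GetAtomicNum()`).

```python
from collections import Counter


def element_frequencies(molecules):
    """Return {atomic_number: fraction of all atoms} for an iterable of molecules.

    Each molecule is a list of atomic numbers; extracting them from
    structures is assumed to happen elsewhere.
    """
    counts = Counter(z for mol in molecules for z in mol)
    total = sum(counts.values())
    return {z: n / total for z, n in counts.items()}


def elements_above(freqs, threshold):
    """Atomic numbers whose share of all atoms meets the threshold, sorted."""
    return sorted(z for z, f in freqs.items() if f >= threshold)


# Toy data (made up): heavy-atom atomic numbers of three small molecules.
toy = [[6, 6, 8], [6, 7, 6, 6], [6, 11, 8]]
freqs = element_frequencies(toy)
print(elements_above(freqs, 0.15))  # [6, 8]
```

Elements that fall below the chosen threshold would then share the unknown bit instead of receiving their own channel.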

I'm less familiar with materials datasets, so I'd be curious to hear more about chemical composition of these. I come to chemprop from a small molecule background, and my impression is that the large majority of users (e.g., MLPDS) fall into this camp.

KnathanM

You included the 0 padding in the atomic_nums array. I am afraid that might cause problems for users who want to supply their atomic_nums (e.g. featurizer = MultiHotAtomFeaturizer(atomic_nums = [1, 6, 7, 8, 9])) because they would need to remember to include the 0 padding. (My example would throw an error if one of the inputs had a sulfur with the current configuration.) If having a 0 pad for atomic number is always a good idea, what do you think of reverting your changes that put it in atomic_nums? I've suggested these changes. I didn't make the corresponding required changes in test_atom.py though.
In any case you'll also need to update i = self.atomic_nums.get(a.GetAtomicNum() - 1, len(self.atomic_nums)) in num_only(), maybe to i = self.atomic_nums.get(a.GetAtomicNum(), len(self.atomic_nums)) .

KnathanM · 2024-02-23T17:22:32Z

chemprop/featurizers/atom.py

    +---------------------+-----------------+--------------+

    NOTE: the above signature only applies for the default arguments, as the each slice (save for
    the final two) can increase in size depending on the input arguments.
    """

-    max_atomic_num: InitVar[int] = 100
+    # all elements in the first 4 rows of periodic talbe plus iodine and 0 padding for other elements


Suggested change

-    # all elements in the first 4 rows of periodic talbe plus iodine and 0 padding for other elements
+    # all elements in the first 4 rows of periodic table plus iodine

KnathanM · 2024-02-23T17:42:32Z

chemprop/featurizers/atom.py

+    atomic_nums: Sequence[int] = field(default_factory=lambda: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
+                                                                10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
+                                                                20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 
+                                                                30, 31, 32, 33, 34, 35, 36, 53])


Suggested change

-    atomic_nums: Sequence[int] = field(default_factory=lambda: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
-                                                                10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
-                                                                20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
-                                                                30, 31, 32, 33, 34, 35, 36, 53])
+    atomic_nums: Sequence[int] = field(default_factory=lambda: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
+                                                                11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
+                                                                21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
+                                                                31, 32, 33, 34, 35, 36, 53])

KnathanM · 2024-02-23T17:42:51Z

chemprop/featurizers/atom.py

@@ -88,7 +95,7 @@ def __post_init__(self, max_atomic_num: int = 100):
            self.hybridizations,
        ]
        subfeat_sizes = [
-            1 + len(self.atomic_nums),
+            len(self.atomic_nums),


Suggested change

-            len(self.atomic_nums),
+            1 + len(self.atomic_nums),

KnathanM · 2024-02-23T17:43:03Z

chemprop/featurizers/atom.py

@@ -109,18 +116,23 @@ def __call__(self, a: Atom | None) -> np.ndarray:
            return x

        feats = [
-            a.GetAtomicNum() - 1,
+            a.GetAtomicNum() if a.GetAtomicNum() in self.atomic_nums else 0,


Suggested change

-            a.GetAtomicNum() if a.GetAtomicNum() in self.atomic_nums else 0,
+            a.GetAtomicNum(),

KnathanM · 2024-02-23T17:43:11Z

chemprop/featurizers/atom.py

            a.GetTotalDegree(),
            a.GetFormalCharge(),
            int(a.GetChiralTag()),
            int(a.GetTotalNumHs()),
            a.GetHybridization(),
        ]
        i = 0
+        pad = False


Suggested change

-        pad = False

KnathanM · 2024-02-23T17:43:25Z

chemprop/featurizers/atom.py

+            if not pad:
+                i += len(choices)
+                pad = True
+            else:
+                i += len(choices) + 1


Suggested change

-            if not pad:
-                i += len(choices)
-                pad = True
-            else:
-                i += len(choices) + 1
+            i += len(choices) + 1

davidegraff · 2024-02-23T18:43:16Z

I have been noticing this trend increasingly in the PRs, but I will not approve any PRs that use branching logic to achieve their desired outcome if the alternative branch will break the computation pipeline.

If you set pad=not pad for this featurizer, you will change the atom feature dimension by 1 and break the computation pipeline because the input feature dimension to your message passing scheme is now off by 1. Your object model fundamentally assumes that you only ever take one branch, and that is precisely the brittleness that the v2 rewrite was created to address.

edit: this was mistaken. I got confused by a variable that seemingly does nothing?

KnathanM

Maybe this is outside the scope of your PR, but I see that the current tests in test_atom.py aren't very robust. The problem is that the list of atoms tested only has carbon, nitrogen, fluorine, and oxygen. I suggest that instead of:

SMI = "Cn1nc(CC(=O)Nc2ccc3oc4ccccc4c3c2)c2ccccc2c1=O"
@pytest.fixture(params=list(Chem.MolFromSmiles(SMI).GetAtoms())[:5])
def atom(request):
...
@pytest.mark.parametrize(
    "a,x_v_orig",
    zip(
        list(Chem.MolFromSmiles("Fc1cccc(C2(c3nnc(Cc4cccc5ccccc45)o3)CCOCC2)c1").GetAtoms()),

We use:

parser = Chem.SmilesParserParams()
parser.removeHs = False
SMI = "IC([Rb])([H])c1ccccc1"
@pytest.fixture(params=list(Chem.MolFromSmiles(SMI, parser).GetAtoms())[:5])
def atom(request):
...
@pytest.mark.parametrize(
    "a,x_v_orig",
    zip(
        list(Chem.MolFromSmiles(SMI, parser).GetAtoms())[:5],

This approach tests the featurizer on iodine, normal carbon, rubidium (not in default), hydrogen, and aromatic carbon.

KnathanM · 2024-02-23T18:33:27Z

tests/unit/featurizers/test_atom.py

        ]
        # fmt: on
    ),
 )
 def test_x_orig(a, x_v_orig):
    f = MultiHotAtomFeaturizer()
    x_v_calc = f(a)
+    print(x_v_calc)


Just a reminder to remove this print before the PR is finished.

KnathanM · 2024-02-23T19:51:41Z

> If you set pad=not pad for this featurizer ...

Could you clarify what you mean here David? The MultiHotAtomFeaturizer doesn't take pad as an argument.

KnathanM · 2024-02-23T19:55:48Z

chemprop/featurizers/atom.py

@@ -36,44 +36,51 @@ class MultiHotAtomFeaturizer(AtomFeaturizer):
    +---------------------+-----------------+--------------+
    | slice [start, stop) | subfeature      | unknown pad? |


Suggested change

-    | slice [start, stop) | subfeature      | unknown pad? |
+    | slice [start, stop) | subfeature      | pad for unknown? |

I just realized that "unknown pad" doesn't mean that we don't know how big the padding is, but that it means there is a (single) padding bit set aside for any values not explicitly in the subfeature. That is probably obvious for others, but I've been confused about this for a while. Perhaps "pad for unknown" is more clear. If you made this change, then all the following rows would also need to have this column made wider.

davidegraff · 2024-02-23T20:14:08Z

> > If you set pad=not pad for this featurizer ...
>
> Could you clarify what you mean here David? The MultiHotAtomFeaturizer doesn't take pad as an argument.

I was confused by the pad variable (that doesn't do anything?) that is newly included. I thought it was an attribute that affected the length of the features.

KnathanM · 2024-02-23T20:23:53Z

> > > If you set pad=not pad for this featurizer ...
> >
> > Could you clarify what you mean here David? The MultiHotAtomFeaturizer doesn't take pad as an argument.
>
> I was confused by the pad variable (that doesn't do anything?) that is newly included. I thought it was an attribute that affected the length of the features.

Okay, my understanding of why pad was added is that {atomic_nums, degrees, formal_charges, chiral_tags, num_Hs, and hybridizations} have their featurization bits set in a for loop. Each of these has a padding bit for unknowns and this padding bit is set using i += len(choices) + 1. But currently atomic_nums has its padding bit built in which means it doesn't need padding to be added and pad (which starts False) is a one time switch to account for that inconsistency. I think Oscar did that so the atomic number pad for unknown could be at the 0th element. In my review, I suggested atomic_nums is changed back to not build in a padding bit.

KnathanM · 2024-02-27T13:16:58Z

Yesterday Oscar and I had discussions about this PR. I'll summarize a bit here and @oscarwumit /@kevingreenman can correct me if needed.

Defaults

We first focused on what the defaults of featurization should be. Generally we feel that the default length of the one-hot encoding doesn't need to be very small and that a separate featurizer can be the "small" one. So the lengths currently in the PR seem good enough to move forward.

It is a bit unclear though whether there should even be a padding for unknown values for any features, including atomic number, degree, formal charge, chiral tag, #Hs, and hybridization. The added length isn't really the issue as much as the user experience. Here's a breakdown of arguments for padding vs not padding for unknown values:

| Feature | Why not pad | Why pad |
| --- | --- | --- |
| atomic number | The user should know what elements are in their training dataset and can set the featurizer to include those elements. If there are elements that only appear a couple of times, perhaps those datapoints should be removed because the model can't learn much about those elements anyway. Later a user may try to run inference on elements not seen before, but this should probably throw an error as the model doesn't know anything about them. | We assume many Chemprop users will just download the software and run it without much real ML knowledge. It would be nice if the software just runs to completion for first-time users. When they get a couple of bad predictions later, they can look at those molecules and figure out that they are different from what the model was trained on. |
| degree | Many of the same arguments. Also, an atom can reasonably bond to at most 8 (?) neighbors. We don't need a pad for unknown if we just include all possibilities in the default. | |
| formal charge | We wouldn't need a pad for unknown if we encoded this as a float. | It might be hard for a user to see which formal charges are in their dataset (compared to atomic number). Having a pad for unknown would catch highly charged species, which may not be rare. |
| chiral tag | To be honest, I don't really understand what this is. | |
| #Hs | Similar thoughts to degree. | |
| hybridization | | I don't know enough about hybridization, but to me SP2D and SP3D are about the same. They could both map to an "everything" bit (unknown pad) along with all the other complicated hybridizations. |

A main argument for including the pad is that Chemprop v1 did this. Doesn't mean we have to, but also means that we should have a reason before changing it.

Implementation

As noted before, currently the PR treats atomic_nums and the other features differently when adding the padding bit for unknowns. The unknown pad is explicitly built in to atomic_nums while the rest of the features are padded by adding one to the length of options and using the index of that bit as the default in this line: j = choices.get(feat, len(choices)).

I feel it would be better to treat atomic_nums the same as the other features for two reasons. First, it requires fewer changes to the code and is simpler (see the summary of my suggestion below). Oscar's view is that the PR works as is. Second, if a user supplies their own atomic_nums and, e.g., formal_charges when creating an object of the MultiHotAtomFeaturizer class, they would need to remember to include an extra pad in atomic_nums and to not include it in formal_charges. Oscar's view is that users should not be passing arguments to MultiHotAtomFeaturizer and should instead make their own featurizer class, perhaps inheriting from MultiHotAtomFeaturizer, if they want to change anything. In his view it would be over-engineering to make it easy for users to create a custom featurizer by (1) making MultiHotAtomFeaturizer able to take arguments and (2) making the atomic_nums padding consistent with the other features.

Side note

The tests are failing due to CondensedGraphOfReactionFeaturizer using max_atomic_num to calculate feature dimensions. The four cases of max_atomic_num in featurizers/molgraph/reaction.py and the two cases in tests/unit/featurizers/test_cgr.py will also need to be changed in this PR. Also the num_only method needs to be either removed or updated as well (see my previous comments for more details.) Happy to help with that if Oscar doesn't have time.

Simpler method?

My main current concern with the PR is the way it pads atomic_nums is different than the other features which requires a new variable pad and makes the code a little more complex. Below are some of my more granular thoughts. I made a simple fake PR on my forked version to show that only a few changes are needed to change the size of the atomic type one hot encoding.

To make sure we are all on the same page, I want to give a quick summary of how the v1 code works. The purpose of the code is to map (atom type) to (bit index in one-hot encoding). V1 does this via a user-accessible variable max_atomic_num that is then expanded using range(max_atomic_num). self.atomic_nums = {i: i for i in range(max_atomic_num)} then creates a simple dictionary where the keys are (atomic number - 1) and the values are the bit indices. Later, a.GetAtomicNum() - 1 converts the atomic numbers to (atomic number - 1) so that the feature can be used as the key in the loop for feat, choices in zip(feats, self._subfeats). Here feat is the key and choices is a dictionary mapping keys to unpadded bit indices.

First, the index is retrieved using j = choices.get(feat, len(choices)); if the key is not in the dictionary (i.e., it is an unknown), the default index is len(choices). This index would be a problem when we write the bit x[i + j] = 1, where x = np.zeros(self.__size), because an array created with np.zeros(length) is zero-indexed and has no element array[length]. The code handles that by adding 1 to the size of each one-hot encoding that has a pad for unknown values (subfeat_sizes = [1 + len(self.atomic_nums), ...) and then advancing i (the start location of the feature's one-hot encoding) by one more than the length of the unpadded feature: i += len(choices) + 1.
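The loop described above can be sketched in a few lines. This is a simplified stand-in rather than the actual Chemprop code: the function name `multi_hot` is hypothetical, plain lists replace `np.zeros`, only two subfeatures are modelled, and the trailing entries that are not one-hot encoded are left out.

```python
def multi_hot(feats, subfeats):
    """Encode one atom from its raw feature values and per-feature lookup dicts.

    `subfeats` is a list of dicts mapping a feature value to its bit index
    within that feature's slice. Every slice gets one extra trailing bit,
    which is set when the value is not a key of the dict.
    """
    size = sum(len(choices) + 1 for choices in subfeats)
    x = [0] * size
    i = 0  # start of the current slice
    for feat, choices in zip(feats, subfeats):
        j = choices.get(feat, len(choices))  # unknown value -> trailing pad bit
        x[i + j] = 1
        i += len(choices) + 1  # skip this slice, including its pad bit
    return x


# Hypothetical two-feature setup: atomic number and total degree.
atomic_nums = {z: k for k, z in enumerate([1, 6, 7, 8])}
degrees = {d: k for k, d in enumerate([0, 1, 2, 3, 4])}

carbon = multi_hot([6, 4], [atomic_nums, degrees])
sulfur = multi_hot([16, 2], [atomic_nums, degrees])  # 16 is not in atomic_nums

print(carbon)  # [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
print(sulfur)  # [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
```

Because every slice reserves its own trailing bit, an unknown atomic number lands at index `len(atomic_nums)` of the first slice and never spills into the degree slice.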

If we aren't using all consecutive atom types, then self.atomic_nums = {i: i for i in range(max_atomic_num)} needs to be changed so that it does not use the same variable for both the keys and the bit index values. I really like that Oscar changed this to be consistent with how the dictionaries for self.formal_charges and self.hybridizations are set up: self.atomic_nums = {j: i for i, j in enumerate(self.atomic_nums)}, where self.atomic_nums starts as a list of the atomic numbers the featurizer supports. The resulting self.atomic_nums is a dictionary mapping atomic number -> bit index. This means a.GetAtomicNum() - 1 can just be changed to a.GetAtomicNum(), as the dictionary keys are the atomic numbers. Then if an atomic number that isn't a dictionary key is given, j = choices.get(feat, len(choices)) will still send it to the pad for unknown at the end of the one-hot encoding for that feature.
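The difference between the two dictionary styles is easy to see side by side. In this sketch the names `old_map`, `new_map`, `supported`, and `bit_index` are illustrative only.

```python
# Old style: keys are (atomic number - 1), so lookups need the "- 1" shift
# and only a consecutive range of elements can be represented.
max_atomic_num = 100
old_map = {i: i for i in range(max_atomic_num)}

# New style: keys are the atomic numbers themselves, values are bit indices.
supported = list(range(1, 37)) + [53]
new_map = {z: k for k, z in enumerate(supported)}


def bit_index(mapping, key):
    """Bit index for `key`, falling back to the pad bit at len(mapping)."""
    return mapping.get(key, len(mapping))


iodine = 53
print(bit_index(old_map, iodine - 1))  # 52
print(bit_index(new_map, iodine))      # 36
print(bit_index(new_map, 37))          # 37 (rubidium falls on the pad bit)
```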

I think the added complexity of this PR stems from trying to put the pad for unknown atomic numbers at the beginning of the one hot encoding, while it goes at the end for the rest of the features.

A final thought about custom featurizers.

The question was raised about how users could use MultiHotAtomFeaturizer to create their own featurizer, for example if they wanted to include only copper, silver, and gold atoms in their model. The simplest way to do that is to pass the list of atomic numbers to the featurizer class when making the featurizer object: featurizer = MultiHotAtomFeaturizer(atomic_nums=[29, 47, 79]). I think this is the expected and preferred way to customize the featurizer because it doesn't require copying any code. MultiHotAtomFeaturizer is a dataclass, which automatically creates an __init__() method whose arguments are the lists to use for the features. This PR itself uses this functionality in the tests:

def featurizer(atomic_num, degree, formal_charge, chiral_tag, num_Hs, hybridization):
    return MultiHotAtomFeaturizer(atomic_num, degree, formal_charge, chiral_tag, num_Hs, hybridization)
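As a rough illustration of why the dataclass route needs no copied code, consider the toy class below. `ToyAtomFeaturizer` is a hypothetical stand-in that models only the atomic-number slice; it is not the Chemprop class, and the trailing unknown bit follows the convention discussed in this thread.

```python
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class ToyAtomFeaturizer:
    """Hypothetical stand-in; only the atomic-number slice is modelled."""

    atomic_nums: Sequence[int] = field(
        default_factory=lambda: list(range(1, 37)) + [53]
    )

    def __post_init__(self):
        # Map atomic number -> bit index within the slice.
        self._index = {z: k for k, z in enumerate(self.atomic_nums)}

    def __len__(self):
        return len(self._index) + 1  # + 1 for the unknown-element bit

    def __call__(self, atomic_num):
        x = [0] * len(self)
        x[self._index.get(atomic_num, len(self._index))] = 1
        return x


default = ToyAtomFeaturizer()
coinage = ToyAtomFeaturizer(atomic_nums=[29, 47, 79])  # Cu, Ag, Au

print(len(default), len(coinage))  # 38 4
print(coinage(47))                 # [0, 1, 0, 0]
print(coinage(6))                  # [0, 0, 0, 1]  (carbon is unknown here)
```

With the pad appended automatically, the user only lists the elements they care about and never has to remember a placeholder entry.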

oscarwumit · 2024-02-29T21:07:39Z

Thanks for the insightful comments. I have modified the implementation as discussed in the meeting. Currently the test on atoms will pass but not for the CGRs, and resolving this could take some time. Help from someone familiar with the CGR code is appreciated.

KnathanM · 2024-02-29T21:41:44Z

Agreed that resolving those CGR tests will take some time. So we can plan to include this in the 2.0 formal release and not the release candidate.

davidegraff

I am going to approve this PR but would like to add food for thought: we should consider refactoring the "setup"s into separate @classmethods. That is, we define a MultiHotAtomFeaturizer class with no default argument values and instead rely on separate constructors that set the defaults, e.g.,:

class MultiHotAtomFeaturizer:
    ... # dataclass fields go here but WITHOUT the `default_factory` values

    @classmethod
    def yang2020(cls, condensed: bool=True):
        r"""build the atom featurizer used in [1]_

        Parameters
        -----------
        condensed : bool, default=False
            whether to use a condensed list of atom types. If `False`, use all atomic numbers :math:`z \in [1, 100]`. Otherwise, use atomic numbers :math:`z \in [1, 36] \cup \{53\}`

        References
        -----------
        .. [1] REF TO OG CHEMPROP PAPER
        """

    @classmethod
    def organic(cls):
        r"""build a minimal featurizer with atom types for typical organic elements, i.e., :math:`z \in {1, 5, 6, 7, 8, 9, 15, 16, 17, 35, 53}`"""

This would be more idiomatic because users would now build their atom featurizers like so:

af = MultiHotAtomFeaturizer.yang2020(condensed=False)

rather than just assuming that the initializer is doing one thing when in reality the documentation/init has changed in a previous commit.
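A runnable miniature of this pattern might look as follows. `ToyFeaturizer` is hypothetical and tracks only the list of atomic numbers; the constructor names `v1`, `v2`, and `organic` mirror the names that appear later in this PR's commit log and are otherwise an assumption.

```python
class ToyFeaturizer:
    """Hypothetical sketch; only the list of atomic numbers is modelled."""

    def __init__(self, atomic_nums):
        # No default value: callers must go through a named constructor
        # or pass an explicit list.
        self.atomic_nums = list(atomic_nums)

    @classmethod
    def v1(cls):
        """Atomic numbers 1 through 100, as in the Chemprop v1 default."""
        return cls(range(1, 101))

    @classmethod
    def v2(cls):
        """First four rows of the periodic table plus iodine."""
        return cls(list(range(1, 37)) + [53])

    @classmethod
    def organic(cls):
        """Typical organic elements: H, B, C, N, O, F, P, S, Cl, Br, I."""
        return cls([1, 5, 6, 7, 8, 9, 15, 16, 17, 35, 53])


af = ToyFeaturizer.organic()
print(len(af.atomic_nums))  # 11
```

Each constructor name documents which scheme is in use at the call site, so a later change to one scheme cannot silently alter code that asked for another.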

oscarwumit · 2024-03-06T20:57:20Z

Thanks for the comment. @shihchengli and I will work on making sure the CGR tests pass for this PR before merging. And I will incorporate the comments. Please do not merge this PR yet.

shihchengli

Thanks for making this PR. The changes look good to me. Some minor suggestions are left. I will work on the CGR tests.

shihchengli · 2024-03-11T16:07:29Z

chemprop/cli/common.py

+        choices=list(RxnMode.keys()),
+        help="""Choices for multi-hot atom featurization scheme. This will affect both non-reatction and reaction feturization (case insensitive):
+- 'default': Includes all elements in the first 4 rows of the periodic talbe plus iodine and an 0 padding for other elements (default in Chemprop v2).
+- 'v1': Same implementation as Chemprop v1 default.


Suggested change

-- 'v1': Same implementation as Chemprop v1 default.
+- 'v1': Includes the first 100 elements in the periodic table (same implementation as Chemprop v1 default).

shihchengli · 2024-03-11T18:55:22Z

tests/unit/featurizers/test_atom.py

+    if n == 53: # special check for Iodine
+        assert x[len(atomic_num) - 1] == 1


I suggest using a SMILES containing iodine as a test case to avoid these two lines.

oscarwumit · 2024-03-13

I do not think using an iodine example will eliminate these two lines. Atomic numbers 1-36 correspond directly to entries 0 to 35 of the atomic number feature vector, but iodine has atomic number 53 and is mapped to index 36, so it needs a special check anyway.

shihchengli · 2024-03-11T19:04:25Z

tests/unit/featurizers/test_atom.py

-def test_x_orig(a, x_v_orig):
-    f = MultiHotAtomFeaturizer()
+def test_x_orig_default(a, x_v_orig):
+    f = MultiHotAtomFeaturizer.default()


Also, test the other two methods. It would also be good to make the test code as similar as possible to test_bond.py. The zip is used here to extract the first 4 atoms to compare, but test_bond.py uses the index instead.

shihchengli · 2024-03-11T19:23:48Z

chemprop/cli/utils/parsing.py

+
+        case "ORGANIC":
+            atom_featurizer=MultiHotAtomFeaturizer.organic()
+


Do we need to raise an error for an unknown multi_hot_atom_featurizer_mode?

davidegraff · 2024-03-12

There should be a new enum:

class AtomFeatureMode(EnumMapping):
    DEFAULT = auto()
    V1 = auto()
    ORGANIC = auto()

so that the unknown case will be handled by that, and we can throw a RuntimeError if it falls through the match-statement.

shihchengli · 2024-04-02T03:31:04Z

After investigation, the failure of the tests is due to the fact that the output scalers are not saved in the checkpoint files. I have manually updated the values in the checkpoint files so that we can pass the tests. The issue with the output scalers has been mentioned in #694 and will be resolved in #726.

oscarwumit · 2024-04-02T06:22:04Z

Thanks for the update. @shihchengli Can you rebase to consolidate similar commits together? After that, we can merge this in.

oscarwumit · 2024-04-02T17:29:06Z

Thanks everyone for the good work. I will merge.

oscarwumit requested a review from kevingreenman February 22, 2024 08:23

oscarwumit changed the title ~~Improve initial atom featurization~~ v2: Improve initial atom featurization Feb 22, 2024

oscarwumit added the enhancement a new feature request label Feb 22, 2024

oscarwumit added this to the v2.0.0 milestone Feb 22, 2024

oscarwumit linked an issue Feb 22, 2024 that may be closed by this pull request

[TODO]: v2 Improve initial featurization #547

Closed

oscarwumit requested a review from davidegraff February 22, 2024 17:17

KnathanM reviewed Feb 23, 2024

KnathanM mentioned this pull request Feb 29, 2024

reduce num of atoms in default featurizer KnathanM/chemprop#1

Closed

oscarwumit force-pushed the abridged_atom_num branch from 2d94a81 to b4fbe07 Compare February 29, 2024 21:05

oscarwumit requested a review from KnathanM February 29, 2024 21:07

KnathanM modified the milestones: v2.0.0-rc.1, v2.0.0 Mar 1, 2024

davidegraff previously approved these changes Mar 4, 2024

shihchengli self-requested a review March 6, 2024 20:02

oscarwumit force-pushed the abridged_atom_num branch from b4fbe07 to a4f237a Compare March 7, 2024 22:57

shihchengli reviewed Mar 11, 2024

oscarwumit and others added 3 commits April 1, 2024 23:13

Overhaul: address the pr comments

a52d0de

Various updates based on PR review comments.

Fix: SimpleMoleculeMolGraphFeaturizer default call

f7c82cc

Default behavior for atom featurizer is set in mixins.py, so no need to specify here.

define MultiHotAtomFeaturizer as a regular class

2edbb25

shihchengli force-pushed the abridged_atom_num branch from e1d4224 to 053b017 Compare April 2, 2024 03:13

shihchengli force-pushed the abridged_atom_num branch from 138c516 to e73e85b Compare April 2, 2024 15:04

shihchengli and others added 18 commits April 2, 2024 11:06

improve docstring

88e1128

Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com> remove scheme tables in MultiHotAtomFeaturizer

make MultiHotAtomFeaturizer as a parent class

cde57b3

add a new enum class AtomFeaturesType(EnumMapping)

68be359

correct file name for multiclass test

67a58c4

update model and checkpoint files

9767c10

change back to use classmethod for MultiHotAtomFeaturizer

5e192dc

no default values for MultiHotAtomFeaturizer

da04a46

update the error message for match function

a28842e

rename get_MultiHotAtomFeaturizer as get_multi_hot_atom_featurizer

6e34a5d

improve the docstring in MultiHotAtomFeaturizer()

a5d985e

rename the default to v2 in MultiHotAtomFeaturizer()

438f7f9

update test

c8ef8ce

update --multi-hot-atom-featurizer-mode CLI

3540160

Apply suggestions from code review

b93c96b

Co-authored-by: david graff <60193893+davidegraff@users.noreply.github.com>

use most targeted imports in conf.py

bf0a463

update model and checkpoint files

fbbc778

set the ap_location as cpu for model loading in tests

a0910ad

formatting

e972c33

shihchengli force-pushed the abridged_atom_num branch from e73e85b to e972c33 Compare April 2, 2024 15:06

oscarwumit merged commit edf7f2c into v2/dev Apr 2, 2024
9 checks passed

oscarwumit deleted the abridged_atom_num branch April 2, 2024 17:29

KnathanM mentioned this pull request Apr 2, 2024

[BUG]: AttributeError: Can't get attribute 'AttributeDict' on <module 'lightning.fabric.utilities.data' #753

Closed
