
[FEATURE]: Extract atomic encodings before aggregating them into molecular encodings #722

Open
zarkoivkovicc opened this issue Mar 12, 2024 · 3 comments
Assignees
Labels
enhancement a new feature request
Milestone

Comments

@zarkoivkovicc

It would be nice to have an interface for extracting the atomic encodings from the last message-passing layer, before they are aggregated into the molecular fingerprint.

Desired solution/workflow
Provide a clear interface, similar to chemprop.fingerprint, for getting atomic encodings. Perhaps something like chemprop.atomic_encodings(molecule) -> list of vectors (one atomic encoding per atom).

Discussion
This would make transfer learning with chemprop for atomistic predictions much easier. It should be almost trivial to implement, because we already have a function that returns the aggregated atomic encodings as the molecular fingerprint.

Additional context
Some other libraries already provide this feature, and it has proven useful. The community would benefit greatly from it.

@zarkoivkovicc zarkoivkovicc added the enhancement a new feature request label Mar 12, 2024
@JacksonBurns
Member

Hi @zarkoivkovicc, could you link to some of the other libraries you mention that implement this feature, so that we can learn from them? I will also add that we are not accepting feature requests for v1, so this feature would be implemented in v2.

@zarkoivkovicc
Author

Hi @JacksonBurns, thanks for the fast reply. Sure, here are some examples:

unimol
MACE

Is there any potential release date for v2? I am currently working on a project that compares atomic representations from the latent spaces of different models. This should be relatively easy to implement, but I can't do it alone: the code base is too large and I don't have much time.

@davidegraff
Contributor

You can already do this in v2 without any new features:

import chemprop
import lightning as L
import torch
from torch.utils.data import DataLoader
from torch_scatter import scatter_sum

trainer: L.Trainer = ...
model: chemprop.MPNN = ...
train: DataLoader = ...
val: DataLoader = ...

trainer.fit(model, train, val)  # fit your MPNN

H_vs = []
for batch in train:  # could use any dataloader here
    bmg, V_d, *_ = batch
    H_v = model.message_passing(bmg, V_d)  # per-atom hidden representations
    # count the atoms belonging to each molecule in the batch...
    split_sizes = scatter_sum(
        torch.ones_like(bmg.batch), bmg.batch, dim=0, dim_size=len(bmg)
    ).tolist()
    # ...and split the flat atom dimension back into per-molecule tensors
    H_vs.extend(H_v.split(split_sizes))

H_vs is a list[Tensor] of length $n$, where each tensor has shape $\ast \times d_h$: $n$ is the number of molecules in the dataloader, $\ast$ is the number of atoms in the corresponding molecule, and $d_h$ is the encoding dimensionality.
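For readers unfamiliar with the scatter/split idiom, the grouping step above can be sketched in plain Python. This is a minimal illustration of the same logic, with made-up atom encodings and batch indices standing in for the real H_v and bmg.batch tensors:

```python
# Flat list of per-atom encodings for a batch of 3 molecules, plus a
# parallel index list saying which molecule each atom belongs to
# (the roles played by H_v and bmg.batch above).
atom_encodings = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
batch_index = [0, 0, 1, 2, 2, 2]  # molecule id per atom

n_mols = max(batch_index) + 1

# Equivalent of scatter_sum over a tensor of ones: count atoms per molecule.
split_sizes = [0] * n_mols
for mol_id in batch_index:
    split_sizes[mol_id] += 1

# Equivalent of Tensor.split: slice the flat atom axis into
# per-molecule chunks.
per_molecule = []
start = 0
for size in split_sizes:
    per_molecule.append(atom_encodings[start:start + size])
    start += size

print(split_sizes)      # [2, 1, 3]
print(per_molecule[1])  # [[0.3]]
```

The key assumption (true of chemprop v2's BatchMolGraph) is that atoms are stored contiguously per molecule, so counting followed by sequential slicing recovers the per-molecule grouping.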

@kevingreenman kevingreenman added this to the v2.0.0 milestone Mar 12, 2024
@KnathanM KnathanM modified the milestones: v2.0.0, v2.1.0 Apr 4, 2024
@KnathanM KnathanM mentioned this issue May 9, 2024
2 tasks
@KnathanM KnathanM self-assigned this May 9, 2024