Prediction of H2 using the pre-trained DimeNet++ model #13

Closed
qikuizhu opened this issue Feb 4, 2021 · 1 comment

Comments

qikuizhu commented Feb 4, 2021

I have trained the DimeNet++ model on the QM9 dataset and then used the pre-trained model to predict the atomization energy of the hydrogen molecule, H2, the simplest molecule, as a benchmark. However, the result for H2 was poor. I may have made a mistake somewhere, so please try it yourself and report your result.

Because the interatomic distance (i.e., the bond length) of H2 is 0.74 Å, the 3D structure of the hydrogen molecule can be written (with coordinates in Å) as

atom x y z
H 0.0 0.0 0.0
H 0.74 0.0 0.0

and the atomization energy is 4.54 eV (see https://wiki.fysik.dtu.dk/gpaw/dev/tutorials/H2/atomization.html). The pre-trained DimeNet++ model, however, predicted about 9.79 eV, an error of 9.79 - 4.54 = 5.25 eV ≈ 121 kcal/mol. Such a poor result for the simplest molecule suggests that something is wrong, because the DimeNet++ model learned and predicted the atomization energies of molecules in the QM9 dataset with an MAE below 0.01 eV = 0.23 kcal/mol.
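The arithmetic above can be re-checked with a few lines of Python. This is only a sketch of the unit conversion, not of the model evaluation: the two energies are the values quoted in this issue, the conversion factor 1 eV ≈ 23.0605 kcal/mol is the standard one, and the variable names are made up for illustration.

```python
import math

EV_TO_KCAL_PER_MOL = 23.0605  # 1 eV per molecule is about 23.0605 kcal/mol

# Geometry from the table above (coordinates in Å).
positions = [(0.0, 0.0, 0.0), (0.74, 0.0, 0.0)]
bond_length = math.dist(positions[0], positions[1])

reference_ev = 4.54  # atomization energy quoted from the GPAW tutorial
predicted_ev = 9.79  # DimeNet++ prediction reported in this issue

# Absolute error of the prediction, in both units.
error_ev = predicted_ev - reference_ev
error_kcal = error_ev * EV_TO_KCAL_PER_MOL

# The QM9 MAE of 0.01 eV, converted the same way for comparison.
qm9_mae_kcal = 0.01 * EV_TO_KCAL_PER_MOL

print(f"bond length: {bond_length:.2f} Å")
print(f"H2 error:    {error_ev:.2f} eV = {error_kcal:.1f} kcal/mol")
print(f"QM9 MAE:     {qm9_mae_kcal:.2f} kcal/mol")
```

With this conversion factor the H2 error comes out slightly above 120 kcal/mol, roughly 500 times the quoted QM9 MAE.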

Probably, the main reason is that the QM9 dataset does not include diatomic molecules such as H2, N2, and O2. Even if a machine learning model achieves a low MAE on QM9, can we say that it has captured the molecular energy if its error for the simplest molecule, H2, is over 100 kcal/mol?

Owner

gasteigerjo commented Feb 8, 2021

This is indeed a very interesting point. In my quick experiment, the pretrained model in this repository yields a smaller but still large error of 2.47 eV.

From a physical perspective, the H2 molecule is certainly the simplest molecule there is, but from the perspective of the QM9 data it is an extreme outlier. There are no H-H bonds in the QM9 dataset, and the model will predict something similar to the data and bonds it has seen.
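One way to make the coverage argument concrete is to count which element pairs occur at bonding distance in the training set and then look up the pair in the query molecule. The sketch below does this on a toy "dataset" containing a single water molecule. It does not load QM9, and the helper name `bonded_pair_counts`, the 1.0 Å cutoff, and the water coordinates are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def bonded_pair_counts(molecules, cutoff=1.0):
    """Count element pairs closer than `cutoff` (in Å) across all molecules.

    `molecules` is a list of (symbols, coords) tuples. The cutoff is a crude,
    hypothetical stand-in for a real bond-detection rule.
    """
    counts = Counter()
    for symbols, coords in molecules:
        for i, j in combinations(range(len(symbols)), 2):
            if math.dist(coords[i], coords[j]) < cutoff:
                pair = tuple(sorted((symbols[i], symbols[j])))
                counts[pair] += 1
    return counts


# Toy training set: one water molecule (approximate geometry, Å).
water = (
    ["O", "H", "H"],
    [(0.0, 0.0, 0.0), (0.9572, 0.0, 0.0), (-0.2397, 0.9267, 0.0)],
)
train_counts = bonded_pair_counts([water])

# Query molecule: H2 with the 0.74 Å bond length from this issue.
h2 = (["H", "H"], [(0.0, 0.0, 0.0), (0.74, 0.0, 0.0)])
query_counts = bonded_pair_counts([h2])

# Pairs that appear in the query but never in the training data.
unseen = [pair for pair in query_counts if train_counts[pair] == 0]
print("training pairs:", dict(train_counts))
print("unseen query pairs:", unseen)
```

In this toy setup the two hydrogens in water are about 1.5 Å apart, so no H-H pair falls under the cutoff and the H2 query is flagged as unseen. Running the same kind of check over the real training data would be one way to test the claim about H-H bonds in QM9.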

I would say this beautifully shows the weaknesses of a purely data-driven approach: by mostly decoupling the model from the physical ground truth, you can get extreme outliers in cases where you would not expect them, and your extrapolation abilities are severely limited. This is certainly something that can be improved, but I think we should always be aware of the chemical space our data covers.

Remember that the model only knows the atom types and geometry. It knows only those things about the wave function that it learned from this data. Atom types and direct interactions that it has never seen before will be nigh impossible for it to predict.
