# ToyMix Baseline

One can observe that the smaller datasets (`Zinc12k` and `Tox21`) benefit from the multi-task setting.

| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ | MAE ↓ | Pearson ↑ | R² ↑ |
|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
| | | | | | | | |
| **QM9** | GCN | 0.102 ± 0.0003 | 0.958 ± 0.0007 | 0.920 ± 0.002 | 0.119 ± 0.01 | 0.955 ± 0.001 | 0.915 ± 0.001 |
| | GIN | 0.0976 ± 0.0006 | **0.959 ± 0.0002** | **0.922 ± 0.0004** | 0.117 ± 0.01 | 0.950 ± 0.002 | 0.908 ± 0.003 |
| | GINE | **0.0959 ± 0.0002** | 0.955 ± 0.002 | 0.918 ± 0.004 | 0.102 ± 0.01 | 0.956 ± 0.0009 | 0.918 ± 0.002 |
| | | | | | | | |
| **Zinc12k** | GCN | 0.348 ± 0.02 | 0.941 ± 0.002 | 0.863 ± 0.01 | 0.226 ± 0.004 | 0.973 ± 0.0005 | 0.940 ± 0.003 |
| | GIN | 0.303 ± 0.007 | 0.950 ± 0.003 | 0.889 ± 0.003 | 0.189 ± 0.004 | 0.978 ± 0.006 | 0.953 ± 0.002 |
| | GINE | 0.266 ± 0.02 | 0.961 ± 0.003 | 0.915 ± 0.01 | **0.147 ± 0.009** | **0.987 ± 0.001** | **0.971 ± 0.003** |

| Dataset | Model | BCE ↓ | AUROC ↑ | AP ↑ | BCE ↓ | AUROC ↑ | AP ↑ |
|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
| | | | | | | | |
| **Tox21** | GCN | 0.202 ± 0.005 | 0.773 ± 0.006 | 0.334 ± 0.03 | **0.176 ± 0.001** | **0.850 ± 0.006** | 0.446 ± 0.01 |
| | GIN | 0.200 ± 0.002 | 0.789 ± 0.009 | 0.350 ± 0.01 | 0.176 ± 0.001 | 0.841 ± 0.005 | 0.454 ± 0.009 |
| | GINE | 0.201 ± 0.007 | 0.783 ± 0.007 | 0.345 ± 0.02 | 0.177 ± 0.0008 | 0.836 ± 0.004 | **0.455 ± 0.008** |
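
For reference, all of these are standard metrics and can be computed with `torchmetrics`. Below is a minimal sketch with synthetic tensors; the tensor names and shapes are illustrative only, and this is not Graphium's evaluation code.

```python
import torch
from torchmetrics import MeanAbsoluteError, PearsonCorrCoef, R2Score
from torchmetrics.classification import BinaryAUROC, BinaryAveragePrecision

# Regression metrics (QM9, Zinc12k): MAE, Pearson correlation, R².
preds = torch.randn(128)                 # model predictions (synthetic here)
target = preds + 0.1 * torch.randn(128)  # ground-truth labels (synthetic here)
mae = MeanAbsoluteError()(preds, target)
pearson = PearsonCorrCoef()(preds, target)
r2 = R2Score()(preds, target)

# Classification metrics (Tox21): BCE, AUROC, average precision.
probs = torch.rand(64)                   # predicted probabilities
labels = torch.randint(0, 2, (64,))      # binary labels
bce = torch.nn.functional.binary_cross_entropy(probs, labels.float())
auroc = BinaryAUROC()(probs, labels)
ap = BinaryAveragePrecision()(probs, labels)

print(f"MAE={mae:.4f}  Pearson={pearson:.4f}  R2={r2:.4f}")
print(f"BCE={bce:.4f}  AUROC={auroc:.4f}  AP={ap:.4f}")
```
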
# LargeMix Baseline

While `PCQM4M_G25` shows no noticeable change, the node predictions of `PCQM4M_N4` degrade in the multi-task setting.

| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ | MAE ↓ | Pearson ↑ | R² ↑ |
|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
| | | | | | | | |
| **Pcqm4m_g25** | GCN | 0.2362 ± 0.0003 | 0.8781 ± 0.0005 | 0.7803 ± 0.0006 | 0.2458 ± 0.0007 | 0.8701 ± 0.0002 | 0.8189 ± 0.0004 |
| | GIN | 0.2270 ± 0.0003 | 0.8854 ± 0.0004 | 0.7912 ± 0.0006 | 0.2352 ± 0.0006 | 0.8802 ± 0.0007 | 0.7827 ± 0.0005 |
| | GINE| **0.2223 ± 0.0007** | **0.8874 ± 0.0003** | **0.7949 ± 0.0001** | 0.2315 ± 0.0002 | 0.8823 ± 0.0002 | 0.7864 ± 0.0008 |
| **Pcqm4m_n4** | GCN | 0.2080 ± 0.0003 | 0.5497 ± 0.0010 | 0.2942 ± 0.0007 | 0.2040 ± 0.0001 | 0.4796 ± 0.0006 | 0.2185 ± 0.0002 |
| | GIN | 0.1912 ± 0.0027 | **0.6138 ± 0.0088** | **0.3688 ± 0.0116** | 0.1966 ± 0.0003 | 0.5198 ± 0.0008 | 0.2602 ± 0.0012 |
| | GINE| **0.1910 ± 0.0001** | 0.6127 ± 0.0003 | 0.3666 ± 0.0008 | 0.1941 ± 0.0003 | 0.5303 ± 0.0023 | 0.2701 ± 0.0034 |

| Dataset | Model | BCE ↓ | AUROC ↑ | AP ↑ | BCE ↓ | AUROC ↑ | AP ↑ |
|-----------|-------|-----------|-----------|-----------|---------|-----------|---------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
| | | | | | | | |
| **Pcba\_1328** | GCN | **0.0316 ± 0.0000** | **0.7960 ± 0.0020** | **0.3368 ± 0.0027** | 0.0349 ± 0.0002 | 0.7661 ± 0.0031 | 0.2527 ± 0.0041 |
| | GIN | 0.0324 ± 0.0000 | 0.7941 ± 0.0018 | 0.3328 ± 0.0019 | 0.0342 ± 0.0001 | 0.7747 ± 0.0025 | 0.2650 ± 0.0020 |
| | GINE | 0.0320 ± 0.0001 | 0.7944 ± 0.0023 | 0.3337 ± 0.0027 | 0.0341 ± 0.0001 | 0.7737 ± 0.0007 | 0.2611 ± 0.0043 |
| **L1000\_vcap** | GCN | 0.1900 ± 0.0002 | 0.5788 ± 0.0034 | 0.3708 ± 0.0007 | 0.1872 ± 0.0020 | 0.6362 ± 0.0012 | 0.4022 ± 0.0008 |
| | GIN | 0.1909 ± 0.0005 | 0.5734 ± 0.0029 | 0.3731 ± 0.0014 | 0.1870 ± 0.0010 | 0.6351 ± 0.0014 | 0.4062 ± 0.0001 |
| | GINE | 0.1907 ± 0.0006 | 0.5708 ± 0.0079 | 0.3705 ± 0.0015 | **0.1862 ± 0.0007** | **0.6398 ± 0.0043** | **0.4068 ± 0.0023** |
| **L1000\_mcf7** | GCN | 0.1869 ± 0.0003 | 0.6123 ± 0.0051 | 0.3866 ± 0.0010 | 0.1863 ± 0.0011 | **0.6401 ± 0.0021** | 0.4194 ± 0.0004 |
| | GIN | 0.1862 ± 0.0003 | 0.6202 ± 0.0091 | 0.3876 ± 0.0017 | 0.1874 ± 0.0013 | 0.6367 ± 0.0066 | **0.4198 ± 0.0036** |
| | GINE | **0.1856 ± 0.0005** | 0.6166 ± 0.0017 | 0.3892 ± 0.0035 | 0.1873 ± 0.0009 | 0.6347 ± 0.0048 | 0.4177 ± 0.0024 |

This is not surprising as they contain two orders of magnitude more datapoints and labels.

| Dataset | Model | CE or MSE loss in single-task ↓ | CE or MSE loss in multi-task ↓ |
|------------|-------|-----------------------------------------|-----------------------------------------|
| | | | |
| **Pcqm4m\_g25** | GCN | **0.2660 ± 0.0005** | 0.2767 ± 0.0015 |
| | GIN | **0.2439 ± 0.0004** | 0.2595 ± 0.0016 |
| | GINE | **0.2424 ± 0.0007** | 0.2568 ± 0.0012 |
| | | | |
| **Pcqm4m\_n4** | GCN | **0.2515 ± 0.0002** | 0.2613 ± 0.0008 |
| | GIN | **0.2317 ± 0.0003** | 0.2512 ± 0.0008 |
| | GINE | **0.2272 ± 0.0001** | 0.2483 ± 0.0004 |
| | | | |
| **Pcba\_1328** | GCN | **0.0284 ± 0.0010** | 0.0382 ± 0.0005 |
| | GIN | **0.0249 ± 0.0017** | 0.0359 ± 0.0011 |
| | GINE | **0.0258 ± 0.0017** | 0.0361 ± 0.0008 |
| | | | |
| **L1000\_vcap** | GCN | 0.1906 ± 0.0036 | **0.1854 ± 0.0148** |
| | GIN | 0.1854 ± 0.0030 | **0.1833 ± 0.0185** |
| | GINE | **0.1860 ± 0.0025** | 0.1887 ± 0.0200 |
| | | | |
| **L1000\_mcf7** | GCN | 0.1902 ± 0.0038 | **0.1829 ± 0.0095** |
| | GIN | 0.1873 ± 0.0033 | **0.1701 ± 0.0142** |
| | GINE | 0.1883 ± 0.0039 | **0.1771 ± 0.0010** |
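
For context on how the single-task and multi-task losses above relate, the sketch below shows one way a multi-task objective over a shared graph embedding might combine CE and MSE terms. The head sizes, task dimensions, and the unweighted sum are assumptions for illustration, not Graphium's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two task heads on top of a shared graph embedding: a regression head
# trained with MSE and a multi-label classification head trained with BCE.
hidden_dim = 64
regression_head = nn.Linear(hidden_dim, 25)        # e.g. 25 graph-level targets
classification_head = nn.Linear(hidden_dim, 1328)  # e.g. 1328 assay labels

emb = torch.randn(32, hidden_dim, requires_grad=True)  # stand-in for GNN output
reg_target = torch.randn(32, 25)
cls_target = torch.randint(0, 2, (32, 1328)).float()

# Multi-task objective: an unweighted sum of the per-task losses.
loss = F.mse_loss(regression_head(emb), reg_target) \
     + F.binary_cross_entropy_with_logits(classification_head(emb), cls_target)
loss.backward()  # gradients flow through both heads into the shared embedding
```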

# UltraLarge Baseline

## UltraLarge test set metrics

For `UltraLarge`, we provide results for the same GNN baselines as for
`LargeMix`. Each model is trained for 50 epochs and results are averaged over 3 seeds. The remaining setup is the same as for `ToyMix` (Section E.1 of the paper): we report the same performance metrics for both the single-task and multi-task models, and we use the same model sizes as for `LargeMix`.
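
As a reminder of what the three baseline architectures are, the sketch below instantiates the corresponding convolution layers with PyTorch Geometric; the hidden size, edge dimension, and random graph are placeholders, not the configurations used for these baselines.

```python
import torch
from torch import nn
from torch_geometric.nn import GCNConv, GINConv, GINEConv

hidden = 64

def mlp() -> nn.Sequential:
    # Small MLP applied after aggregation inside GIN/GINE layers.
    return nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

gcn = GCNConv(hidden, hidden)        # GCN: degree-normalized neighbor aggregation
gin = GINConv(mlp())                 # GIN: sum aggregation followed by an MLP
gine = GINEConv(mlp(), edge_dim=16)  # GINE: GIN augmented with edge features

x = torch.randn(10, hidden)                 # 10 node embeddings
edge_index = torch.randint(0, 10, (2, 30))  # 30 random directed edges
edge_attr = torch.randn(30, 16)             # 16-dim edge features

h = gcn(x, edge_index)              # (10, 64)
h = gin(h, edge_index)              # (10, 64)
h = gine(h, edge_index, edge_attr)  # (10, 64)
```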

For now, we report results only for a subset representing 5% of the total dataset due to computational constraints, but we aim to provide the full results soon.

**Results discussion.** `UltraLarge` results can be found in the table below. Interestingly, on both graph- and node-level tasks, we observe no advantage of multi-tasking in terms of performance. We expect that significantly larger models are needed to successfully leverage the multi-task setup on this ultra-large dataset. This could be attributed to underfitting, as already demonstrated for `LargeMix`. Nonetheless, our baselines set the stage for large-scale pre-training on `UltraLarge`.

The results presented here used approximately 500 GPU hours of compute, with additional compute used for development and hyperparameter search.

We further note that the graph-level task results are very strong. The node-level tasks, in contrast, are expected to underperform in the low-parameter regime due to clear signs of underfitting, the very large number of labels to learn, and the susceptibility of traditional GNNs to over-smoothing.


| Dataset | Model | MAE ↓ | Pearson ↑ | R² ↑ | MAE ↓ | Pearson ↑ | R² ↑ |
|------------------|-------|-------------------|-------------------|-------------------|-------------------|-------------------|-------------------|
| | <th colspan="3" style="text-align: center;">Single-Task Model</th> <th colspan="3" style="text-align: center;">Multi-Task Model</th> |
| | | | | | | | |
| **Pm6_83m_g62** | GCN | .2606 ± .0011 | .9004 ± .0003 | .7997 ± .0009 | .2625 ± .0011 | .8896 ± .0001 | .7982 ± .0001 |
| | GIN | .2546 ± .0021 | .9051 ± .0019 | .8064 ± .0037 | .2562 ± .0000 | .8901 ± .0000 | .8060 ± .0000 |
| | GINE | **.2538 ± .0006** | **.9059 ± .0010** | **.8082 ± .0015** | .2580 ± .0011 | .9040 ± .0000 | .8048 ± .0001 |
| | | | | | | | |
| **Pm6_83m_n7** | GCN | .5803 ± .0001 | .3372 ± .0004 | .1191 ± .0002 | .5971 ± .0002 | .3164 ± .0001 | .1019 ± .0011 |
| | GIN | .573 ± .0002 | .3478 ± .0001 | **.1269 ± .0002** | .5831 ± .0001 | .3315 ± .0005 | .1141 ± .0000 |
| | GINE | **.572 ± .0004** | **.3487 ± .0002** | .1266 ± .0001 | .5839 ± .0004 | .3294 ± .0002 | .1104 ± .0000 |

## UltraLarge training set loss

In the table below, we observe that the multi-task model slightly underfits relative to the single-task model, indicating that parameters can be shared efficiently between the node-level and graph-level tasks. We further note that the training loss and the test MAE are almost equal for all tasks, suggesting further benefits as we scale both the model and the data.

| Dataset | Model | MAE loss in single-task ↓ | MAE loss in multi-task ↓ |
|------------------|-------|---------------------------|--------------------------|
| | | | |
| **Pm6_83m_g62** | GCN | **.2679 ± .0020** | .2713 ± .0017 |
| | GIN | **.2582 ± .0018** | .2636 ± .0014 |
| | GINE | **.2567 ± .0036** | .2603 ± .0021 |
| | | | |
| **Pm6_83m_n7** | GCN | **.5818 ± .0021** | .5955 ± .0023 |
| | GIN | **.5707 ± .0019** | .5851 ± .0038 |
| | GINE | **.5724 ± .0015** | .5832 ± .0027 |
