# Comparison with Multview

To benchmark our approach fairly against Multview by Janghoon Ock et al. ([paper available here](https://arxiv.org/pdf/2401.07408v4)), we have to select the same test subset they used.

For this we need the following files from multiple repositories:
`ml_relaxed_dft_targets.pkl`, available at [https://github.com/Open-Catalyst-Project/AdsorbML/tree/main/adsorbml/2023_neurips_challenge](https://github.com/Open-Catalyst-Project/AdsorbML/tree/main/adsorbml/2023_neurips_challenge)

We also need the predictions of each of the three ML approaches used to go from the initial structure to the relaxed structure (GemNet, SCN, eSCN), taken from [OC2023](https://opencatalystproject.org/challenge_2023):

[GemNet-OC-S2EF-2M](https://dl.fbaipublicfiles.com/opencatalystproject/data/neurips_2023/gemnet_oc_2M_oc20dense_val_ood.tar.gz)

[SCN-S2EF-2M](https://dl.fbaipublicfiles.com/opencatalystproject/data/neurips_2023/scn_2M_oc20dense_val_ood.tar.gz)

[eSCN-S2EF-2M](https://dl.fbaipublicfiles.com/opencatalystproject/data/neurips_2023/escn_2M_oc20dense_val_ood.tar.gz)
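
The three archives above can be fetched and unpacked by hand. As a minimal sketch, the helper below does the same with the Python standard library. The function name `download_and_extract` and the target folder `../../S2EF-models` are assumptions made for illustration, and the folder layout inside each archive is not verified here.

```python
import tarfile
import urllib.request
from pathlib import Path

# Archive URLs as listed above.
ARCHIVE_URLS = [
    "https://dl.fbaipublicfiles.com/opencatalystproject/data/neurips_2023/gemnet_oc_2M_oc20dense_val_ood.tar.gz",
    "https://dl.fbaipublicfiles.com/opencatalystproject/data/neurips_2023/scn_2M_oc20dense_val_ood.tar.gz",
    "https://dl.fbaipublicfiles.com/opencatalystproject/data/neurips_2023/escn_2M_oc20dense_val_ood.tar.gz",
]


def download_and_extract(url, dest_dir):
    """Download a tar archive into dest_dir (hypothetical helper) and unpack it there."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive = dest / url.rsplit("/", 1)[-1]
    # Skip the download if the archive is already present.
    if not archive.exists():
        urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive) as tar:
        tar.extractall(dest)
    return dest
```

Calling `download_and_extract(url, "../../S2EF-models")` for each entry of `ARCHIVE_URLS` would place the unpacked folders next to each other. The archives are large, so the existence check avoids downloading them twice.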


In [1]:
### Load packages:
import os
import pandas as pd
from fairchem.core.datasets import LmdbDataset

### Find ids
For each downloaded ML model (GemNet, SCN, eSCN), we go through the folder and collect the system_id and config_id of every sample.

In [2]:
model_path = '../../S2EF-models/escn_2M_oc20dense_val_ood/escn_2M/oc20_dense_ood_val'
model_name = 'escn-2M'

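# Collect the metadata text files shipped with the model predictions
txt_files = [f for f in os.listdir(model_path) if f.endswith('.txt')]
# Keep columns 0, 1 and 3 (id, config, system_id) of every file
temp = [pd.read_csv(os.path.join(model_path, f), usecols=[0, 1, 3], header=None).values.tolist() for f in txt_files]
# Flatten the per-file lists into one list of rows
configs_lmdb = [item for sublist in temp for item in sublist]

# Each row: id, config, system_id (the system_id is unique)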
print(configs_lmdb[0:2])

[['17_405_22', 'rand73', 64193], ['17_405_22', 'heur54', 60274]]


Iterate over the configurations, look up the ML-relaxed energy of each one, and keep it only if the energy is smaller than 100 eV (the same condition the original authors used).

In [3]:
energies_ml = pd.read_pickle('../../S2EF-models/ml_relaxed_dft_targets.pkl')[model_name]

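correct_samples = []  # rows that pass the energy filter
sids = []             # their unique system_ids
for row in configs_lmdb:
    # row = [id, config, system_id]
    energy = energies_ml[row[0]][row[1]]
    if energy < 100:  # threshold in eV
        correct_samples.append(row)
        sids.append(row[2])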


Read the LMDB file and collect the matching samples.

In [4]:
dataset = LmdbDataset({"src": model_path})
lmdb_samples = []
for data in dataset:
    if data.sid in sids:
        lmdb_samples.append(data)

In [5]:
len(lmdb_samples)

919

In [6]:
lmdb_samples[10]

Data(y=-1.3641252517700195, pos=[158, 3], cell=[1, 3, 3], atomic_numbers=[158], natoms=158, tags=[158], force=[158, 3], fixed=[158], sid=50646, fid=130, id='1_110')

In [9]:
import torch
from tqdm import tqdm
from transformers import BertForSequenceClassification

from textcat.ml.tokenizer.adsorption_tokenizer import AdsorptionTokenizer
from textcat.ml.tokenizer.bert_tokenizer import GeneralTokenizer
from textcat.utils import lmdb_to_atoms, adsorption2smiles

# Fine-tuned regression checkpoint, its vocabulary, and the target scaler
model_dir = "../../models/regression/fs/ft/hetsmiles_fs_o1_big_twall_notilde"
model = BertForSequenceClassification.from_pretrained(f"{model_dir}/checkpoint-345000")
tokenizer = GeneralTokenizer(f"{model_dir}/vocabulary.txt",
                             AdsorptionTokenizer(True, False))

scaler = torch.load(f"{model_dir}/scaler.pt")





In [8]:
predictions, trues = [], []
with torch.no_grad():
    for data in tqdm(lmdb_samples):
        atoms = lmdb_to_atoms(data, False)  # False as OC20 dense only contains coordinates for relaxed state
        smiles = adsorption2smiles(atoms,
                                   data.tags,
                                   order=1
                                   )[0]
        trues.append(data.y)
        x = tokenizer(smiles,
                      None,
                      add_special_tokens=True,
                      padding='max_length',
                      max_length=176,
                      return_token_type_ids=True,
                      truncation=True,
                      )
        # Add a batch dimension of size 1 before calling the model
        x = {'input_ids': torch.tensor([x["input_ids"]], dtype=torch.long),
             'attention_mask': torch.tensor([x["attention_mask"]], dtype=torch.long)}
        predictions.append(model(**x))

100%|██████████| 919/919 [03:34<00:00,  4.28it/s]


In [11]:
predictions[0]

SequenceClassifierOutput(loss=None, logits=tensor([[-0.2783]]), hidden_states=None, attentions=None)

In [20]:
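# Undo the target standardization to get predictions in eV
std = scaler.scale_
mean = scaler.mean_
preds_eV = [y['logits'] * std + mean for y in predictions]

# MAE against the targets stored in the LMDB samples (data.y)
mae = sum([abs(i - j) for i, j in zip(preds_eV, trues)]) / len(trues)
mae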

tensor([[0.4789]], dtype=torch.float64)
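
The cell above combines two steps: mapping the standardized model outputs back to eV and averaging the absolute errors. As a minimal sketch with made-up numbers (not values from this notebook), the same arithmetic in plain Python looks as follows. It assumes the scaler behaves like a standard scaler, i.e. the original value is `z * std + mean`; the function names are chosen for illustration only.

```python
def inverse_standardize(z, mean, std):
    """Map a standardized value z back to the original units (here: eV)."""
    return z * std + mean


def mean_absolute_error(predictions, targets):
    """Average absolute difference between paired predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Toy values for illustration only.
scaled_outputs = [-0.5, 0.0, 1.0]
mean, std = -1.0, 2.0
preds_eV = [inverse_standardize(z, mean, std) for z in scaled_outputs]
targets_eV = [-1.5, -1.0, 0.0]
mae = mean_absolute_error(preds_eV, targets_eV)
print(preds_eV, mae)  # [-2.0, -1.0, 1.0] 0.5
```

The length check is a small safeguard: `zip` silently truncates to the shorter input, which would hide a mismatch between the number of predictions and targets.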

In [21]:
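# Collect the ML-relaxed energies that pass the same 100 eV filter
ml_energies = []
for row in configs_lmdb:
    energy = energies_ml[row[0]][row[1]]
    if energy < 100:
        ml_energies.append(energy)
# MAE against the ML-relaxed energies. zip pairs items by position, so this
# assumes configs_lmdb and lmdb_samples list the samples in the same order.
mae = sum(abs(i - j) for i, j in zip(preds_eV, ml_energies)) / len(ml_energies)
mae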

tensor([[0.4955]], dtype=torch.float64)
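
The predictions follow the iteration order of the LMDB dataset, while the reference energies follow the order of the metadata rows. If those two orders differ, pairing by position compares the wrong samples. A minimal sketch of one way to guard against this is to key the energies by `system_id` and read them back in sample order. The helper names and the toy energies below are assumptions for illustration; only the row layout `[id, config, system_id]` comes from the cells above.

```python
from types import SimpleNamespace


def build_energy_lookup(config_rows, energies, threshold=100.0):
    """Map system_id -> energy for rows whose energy is below the threshold (eV)."""
    lookup = {}
    for system, config, sid in config_rows:
        energy = energies[system][config]
        if energy < threshold:
            lookup[sid] = energy
    return lookup


def energies_in_sample_order(samples, lookup):
    """Return the reference energy of each sample, in the order of `samples`."""
    return [lookup[int(sample.sid)] for sample in samples]


# Toy stand-ins for configs_lmdb, energies_ml and lmdb_samples.
config_rows = [
    ["17_405_22", "rand73", 64193],
    ["17_405_22", "heur54", 60274],
    ["1_110", "rand1", 50646],
]
energies = {
    "17_405_22": {"rand73": -0.8, "heur54": 250.0},
    "1_110": {"rand1": -1.4},
}
samples = [SimpleNamespace(sid=50646), SimpleNamespace(sid=64193)]

lookup = build_energy_lookup(config_rows, energies)
aligned = energies_in_sample_order(samples, lookup)
print(aligned)  # [-1.4, -0.8]
```

The `int(...)` conversion is there because the sample id may be stored as a zero-dimensional tensor rather than a plain integer, which would not work as a dictionary key.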