# Dataset migration from v3 to Mnova ablation studies

We are going to remove the existing 1D NMR data and replace it with Mnova-simulated data.

`v4`: Same data as v3, with Mnova-simulated 1D data replacing the existing 1D data in train/val/test

`v4_large`: Train on whole retrieval set with Mnova simulations

### Step 1: Cleaning Dataset for Mnova Simulation

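There are many repeated SMILES strings, so we simulate each unique SMILES only once. We also drop anything with a very large molecular weight. The statistics below report both an 1800 Da and a 1000 Da cutoff; the code in this section applies the 1000 Da cutoff.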

Statistics:
- 216,586 entries in train/val/test
- 190,164 unique SMILES across train/val/test
- 189,691 unique SMILES kept for simulation (<=1800 Da)
- 182,637 unique SMILES kept for simulation (<=1000 Da)
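
As a quick sanity check, the differences between the counts above show how much each step removes. This is only arithmetic on the listed numbers; the "dropped" figures include both invalid SMILES and molecules over the cutoff, since the list does not separate them.

```python
# Counts copied from the statistics above.
total_entries = 216_586
unique_smiles = 190_164
kept_1800 = 189_691
kept_1000 = 182_637

# Entries whose SMILES already appears elsewhere in train/val/test.
duplicate_entries = total_entries - unique_smiles
# Unique SMILES removed at each cutoff (invalid or too heavy).
dropped_1800 = unique_smiles - kept_1800
dropped_1000 = unique_smiles - kept_1000

print(duplicate_entries, dropped_1800, dropped_1000)  # 26422 473 7527
```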

In [6]:
from rdkit import Chem
from rdkit.Chem import Descriptors

def check_invalid_mol(smiles: str) -> bool:
    """
    Returns True if `smiles` is empty, cannot be parsed by RDKit, or
    describes a molecule with molecular weight > 1000 Da. Returns False otherwise.
    """
    if not smiles:
        return True

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return True

    mw = Descriptors.MolWt(mol)
    return mw > 1000.0

In [7]:
import pickle
import os
from tqdm import tqdm

DATASET_ROOT = '/data/nas-gpu/wang/atong/MoonshotDatasetv3'
OUTPUT_ROOT = '/data/nas-gpu/wang/atong/MoonshotDatasetv4/WorkingDir'
os.makedirs(OUTPUT_ROOT, exist_ok=True)

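# Load the v3 index with a context manager so the file handle is closed
with open(os.path.join(DATASET_ROOT, 'index.pkl'), 'rb') as f:
    index: dict[int, dict] = pickle.load(f)
# Deduplicate SMILES so each molecule is simulated only once
all_smiles: list[str] = list({e['smiles'] for e in index.values()})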
valid_smiles: list[str] = [s for s in tqdm(all_smiles) if not check_invalid_mol(s)]

valid_smiles_path = os.path.join(OUTPUT_ROOT, 'smiles_1000.txt')
with open(valid_smiles_path, 'w') as f:
    f.write('\n'.join(valid_smiles))

# The denominator is the number of index entries, not the number of unique SMILES
print(f'Wrote {len(valid_smiles)}/{len(index)} valid SMILES to {valid_smiles_path}')

100%|██████████| 190164/190164 [00:18<00:00, 10218.07it/s]


Wrote 182637/216586 valid SMILES to /data/nas-gpu/wang/atong/MoonshotDatasetv4/WorkingDir/smiles_1000.txt


# JSONL Format for MoonshotDataset
```json
{
    "idx": 0,
    "smiles": "C=CC1CN2CCC1CC2C(O)c1ccnc2ccc(OC)cc12",
    "split": "train",
    "has_hsqc": true,
    "has_c_nmr": false,
    "has_h_nmr": false,
    "has_mass_spec": true,
    "has_iso_dist": true,
    "mw": 404.1099,
    "name": "quinine hydrobromide",
    "has_mw": true,
    "formula": "C20H25BrN2O2",
    "has_formula": true,
    "np_pathway": ["Alkaloids"],
    "np_superclass": ["Tryptophan alkaloids"],
    "np_class": [],
    "hsqc": [
        [54.89, 3.077, -1.0], 
        ...
    ],
    "mass_spec": [
        [93.06987762451172, 1.3903098106384277], 
        ...
    ],
    "fragidx": [
        1, 3, 4, 5, 6, 8, 9, 14, 15, 17, 19, 20, 29, 32, 35, 46, 50, 59, 60, 62, 69, 80, 81, 125, 184, 210, 240, 378, 382, 392, 472, 491, 698, 891, 933, 1439, 1486, 1639, 2185, 2792, 5479, 5610, 6182, 7318, 7903, 11697, 14346
    ]
}
```
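
To make the record layout concrete, here is a minimal sketch of reading such a JSONL file and checking that each `has_*` flag agrees with the data present. The field names come from the example record; the key names `c_nmr` and `h_nmr` do not appear in the example and are assumptions, as are the helper names `iter_records` and `check_record`. The two sample lines are toy data, not real dataset entries.

```python
import io
import json
from typing import Iterable, Iterator

# Keys treated as mandatory here; that choice is an assumption of this sketch.
REQUIRED_KEYS = ("idx", "smiles", "split")
# "hsqc" and "mass_spec" appear in the example; "c_nmr"/"h_nmr" are assumed names.
SPECTRUM_KEYS = ("hsqc", "c_nmr", "h_nmr", "mass_spec")


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def check_record(rec: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in rec]
    for key in SPECTRUM_KEYS:
        # A true `has_<key>` flag should come with a non-empty `<key>` list.
        if rec.get(f"has_{key}") and not rec.get(key):
            problems.append(f"has_{key} is true but '{key}' is missing or empty")
    return problems


# Toy input standing in for a real file opened with open(path).
sample = io.StringIO(
    '{"idx": 0, "smiles": "CCO", "split": "train", "has_hsqc": true, "hsqc": [[54.89, 3.077, -1.0]]}\n'
    '{"idx": 1, "smiles": "CC", "split": "val", "has_hsqc": true}\n'
)
records = list(iter_records(sample))
for rec in records:
    print(rec["idx"], check_record(rec))
```

The second toy record is flagged because it sets `has_hsqc` without providing an `hsqc` list.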

# JSONL Format for Predictions

```json
{
    "idx":0,
    "smiles":"COc1ccc2cc1Oc1cc(ccc1O)C(O)C13SSSC4(C(=O)N1C)C(O)C1=COC=CC(OC2=O)C1N4C3=O",
    "status":"SUCCESS",
    "error":null,
    "atoms":[
        {"number":"1","name":"CH3"},
        {"number":"2","name":"O"},
        ...
    ],
    "predictions":{
        "hsqc":{
            "status":"SUCCESS",
            "H":[
                {
                    "atom":[{"index":1}],
                    "shift":{"value":3.9237684493895095,"error":0.1},
                    "js":[{"atom":[{"index":1}],"j":{"value":5.46,"error":1.72}}]
                },
                {
                    "atom":[{"index":4}],
                    "shift":{"value":7.14351619175921,"error":0.1},
                    "js":[{"atom":[{"index":5}],"j":{"value":8.65,"error":0.31}},{"atom":[{"index":7}],"j":{"value":0.1,"error":0.1}}]
                },
                {
                    "atom":[{"index":5}],
                    "shift":{"value":7.5815223902024345,"error":0.1},
                    "js":[{"atom":[{"index":4}],"j":{"value":8.65,"error":0.31}},{"atom":[{"index":7}],"j":{"value":2.03,"error":0.48}}]
                },
                ...
            ],
            "C":[
                {
                    "atom":[{"index":1}],
                    "shift":{"value":56.0195331969619,"error":3},
                    "js":[{"atom":[{"index":1}],"j":{"value":143.96,"error":3.91}},{"atom":[{"index":4}],"j":{"value":0.4,"error":1.18}}]
                },
                {
                    "atom":[{"index":3}],
                    "shift":{"value":154.3666380106334,"error":3},
                    "js":[{"atom":[{"index":1}],"j":{"value":3.97,"error":12.36}},{"atom":[{"index":4}],"j":{"value":2.52,"error":4.27}},{"atom":[{"index":5}],"j":{"value":7.17,"error":2.84}},{"atom":[{"index":7}],"j":{"value":6.43,"error":2.84}}]
                },
                ...
            ],
            "error":null
        }
    }
}
```
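
One way to turn a prediction record into HSQC-style peaks is to pair the `H` and `C` entries that share an atom index. That pairing rule is an assumption based on the example (atom 1 appears in both lists, with a one-bond coupling of about 144 Hz on the carbon side); it is not confirmed as the Mnova convention. The helper names `shifts_by_atom` and `hsqc_peaks` are hypothetical.

```python
def shifts_by_atom(entries) -> dict[int, float]:
    """Map atom index -> predicted shift for one nucleus list ("H" or "C")."""
    out = {}
    # Failed records store None instead of a list, so fall back to empty.
    for entry in entries or []:
        value = entry["shift"]["value"]
        for atom in entry["atom"]:
            out[atom["index"]] = value
    return out


def hsqc_peaks(pred: dict) -> list[tuple[float, float]]:
    """Return (C shift, H shift) pairs for atom indices present in both lists."""
    hsqc = pred["predictions"]["hsqc"]
    if hsqc["status"] != "SUCCESS":
        return []
    h = shifts_by_atom(hsqc["H"])
    c = shifts_by_atom(hsqc["C"])
    # Assumption: a shared atom index means the H is attached to that C.
    return [(c[i], h[i]) for i in sorted(h.keys() & c.keys())]


# Trimmed copy of the example record (couplings and most atoms omitted).
example = {
    "predictions": {
        "hsqc": {
            "status": "SUCCESS",
            "H": [
                {"atom": [{"index": 1}], "shift": {"value": 3.9237684493895095, "error": 0.1}},
                {"atom": [{"index": 4}], "shift": {"value": 7.14351619175921, "error": 0.1}},
            ],
            "C": [
                {"atom": [{"index": 1}], "shift": {"value": 56.0195331969619, "error": 3}},
                {"atom": [{"index": 3}], "shift": {"value": 154.3666380106334, "error": 3}},
            ],
            "error": None,
        }
    }
}
peaks = hsqc_peaks(example)
print(peaks)
```

In this trimmed input only atom 1 has both a proton and a carbon entry, so a single peak comes back; atom 3 has no proton entry and is skipped.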

# Dataset Forms

We will reduce the dataset to the following forms:

Form 1

- **MARINABase1**:
    - MoonshotDatasetv3, excluding any molecule that errored during prediction and keeping only molecules <= 1000 Da
- **MARINADataset1**:
    - MARINABase1, with all existing C/H NMRs replaced by simulated ones
- **MARINADataset2**:
    - MARINABase1, with simulated C/H NMRs added wherever a simulation exists
- **MARINADataset3**:
    - MARINABase1, with all existing C/H/HSQC NMRs replaced by simulated ones
- **MARINADataset4**:
    - MARINABase1, with simulated C/H/HSQC NMRs added wherever a simulation exists

Form 2

- **MARINABase2**:
    - MoonshotDatasetv3, excluding any molecule that errored during prediction and keeping only molecules within [100 Da, 1000 Da]
- **MARINAMedDataset1**:
    - MARINABase2, with all existing C/H NMRs replaced by simulated ones
- **MARINAMedDataset2**:
    - MARINABase2, with simulated C/H NMRs added wherever a simulation exists
- **MARINAMedDataset3**:
    - MARINABase2, with all existing C/H/HSQC NMRs replaced by simulated ones
- **MARINAMedDataset4**:
    - MARINABase2, with simulated C/H/HSQC NMRs added wherever a simulation exists

Form 3

- **MARINABaseNoDup**:
    - MoonshotDatasetv3, excluding any molecule that errored during prediction, keeping only molecules <= 1000 Da, and keeping a single entry per SMILES
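
The three base forms differ only in their filters, so they could be produced by one function. The sketch below is a hypothetical helper (`build_base` and its parameters are not from the notebook). It assumes failed predictions are matched by SMILES rather than by `idx`, since the prediction records appear to be indexed separately from the dataset, and it reads molecular weight from the record's `mw` field by default. To mirror the RDKit-based filter used earlier, pass a different `mw_of` callable. The records are toy data with placeholder SMILES.

```python
from typing import Callable, Iterable, Optional


def build_base(
    records: Iterable[dict],
    failed_smiles: set[str],
    max_mw: float = 1000.0,
    min_mw: Optional[float] = None,
    dedupe_smiles: bool = False,
    mw_of: Callable[[dict], float] = lambda rec: rec["mw"],
) -> list[dict]:
    """Keep records that pass the prediction, weight, and duplicate filters."""
    kept, seen = [], set()
    for rec in records:
        if rec["smiles"] in failed_smiles:
            continue  # prediction errored for this molecule
        mw = mw_of(rec)
        if mw > max_mw or (min_mw is not None and mw < min_mw):
            continue  # outside the inclusive [min_mw, max_mw] window
        if dedupe_smiles:
            if rec["smiles"] in seen:
                continue  # keep only the first entry per SMILES
            seen.add(rec["smiles"])
        kept.append(rec)
    return kept


# Toy records; "A".."D" are placeholders, not real SMILES.
toy = [
    {"idx": 0, "smiles": "A", "mw": 50.0},
    {"idx": 1, "smiles": "B", "mw": 500.0},
    {"idx": 2, "smiles": "B", "mw": 500.0},
    {"idx": 3, "smiles": "C", "mw": 1200.0},
    {"idx": 4, "smiles": "D", "mw": 300.0},
]
failed = {"D"}

base1 = build_base(toy, failed)                       # MARINABase1-style
base2 = build_base(toy, failed, min_mw=100.0)         # MARINABase2-style
no_dup = build_base(toy, failed, dedupe_smiles=True)  # MARINABaseNoDup-style
print([r["idx"] for r in base1], [r["idx"] for r in base2], [r["idx"] for r in no_dup])
```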


In [7]:
import glob
import json
import os

from tqdm import tqdm

PRED_ROOT = '/data/nas-gpu/wang/atong/Datasets/MnovaPredictions/jsonl'
files = sorted(glob.glob(os.path.join(PRED_ROOT, '*.jsonl')))


def process_file(file: str) -> int:
    """Print every record whose HSQC prediction failed; return the failure count."""
    n_failed = 0
    with open(file, 'r') as f:
        for line in f:
            data = json.loads(line)
            if data['predictions']['hsqc']['status'] != 'SUCCESS':
                print(data)
                n_failed += 1
    return n_failed


# Scan every prediction file and tally the failures
n_failed = 0
for file in tqdm(files):
    n_failed += process_file(file)


 72%|███████▏  | 36/50 [00:05<00:02,  6.66it/s]

{'idx': 35190, 'smiles': 'Oc1ccc(C2Oc3cc(O)c4c(c3C2c2cc(O)cc(O)c2)C2C(c3ccc(O)cc3)c3c(O)cc5c(c3C4~C2c2ccc(O)cc2)C(c2cc(O)cc(O)c2)C(c2ccc(O)cc2)O5)cc1', 'status': 'FAILED', 'error': "atom_info_failed: TypeError: Result of expression 'mol.atoms()' [undefined] is not an object. | hsqc_failed: Error: Incorrect Arguments", 'atoms': [], 'predictions': {'hsqc': {'status': 'FAILED', 'H': None, 'C': None, 'error': 'Error: Incorrect Arguments'}}}


100%|██████████| 50/50 [00:07<00:00,  6.54it/s]
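
The scan above only prints failures. To exclude errored molecules when building the base forms, the failed SMILES could be collected into a set. This is a minimal sketch: `collect_failed_smiles` is a hypothetical helper, and the two input lines are toy records that follow the prediction format, not real output.

```python
import json
from typing import Iterable


def collect_failed_smiles(lines: Iterable[str]) -> set[str]:
    """Return the SMILES of every record whose HSQC prediction did not succeed."""
    failed = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        data = json.loads(line)
        if data["predictions"]["hsqc"]["status"] != "SUCCESS":
            failed.add(data["smiles"])
    return failed


# Toy lines; in practice, iterate over each open *.jsonl prediction file.
toy_lines = [
    '{"idx": 0, "smiles": "CCO", "predictions": {"hsqc": {"status": "SUCCESS"}}}',
    '{"idx": 1, "smiles": "CC", "predictions": {"hsqc": {"status": "FAILED"}}}',
]
failed_smiles = collect_failed_smiles(toy_lines)
print(failed_smiles)
```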
