# Dataset migration from v3 to v4

The plan is to remove the existing 1D data and replace it with Mnova-simulated data.

`v4`: Same data as v3, with Mnova-simulated 1D data replacing the existing 1D data in train/val/test

`v4_large`: Train on the whole retrieval set with Mnova simulations

### Step 1: Cleaning Dataset for Mnova Simulation

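Many SMILES strings are repeated, so we simulate each unique SMILES only once. We also drop invalid molecules and anything with a very large molecular weight. Counts for both an 1800 Da and a 1000 Da cutoff are listed below; the code in this step uses the 1000 Da cutoff.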

Statistics:
- 216,586 entries in train/val/test
- 190,164 unique SMILES across train/val/test
- 189,691 unique SMILES kept for simulation (<=1800 Da)
- 182,637 unique SMILES kept for simulation (<=1000 Da)
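
The counts above can be reproduced with a small helper along the following lines. This is a minimal sketch, not the code that produced the numbers: `summarize_index` and the `mol_weight` callable are hypothetical names, and the toy weight table stands in for a real molecular-weight function (for example one built on RDKit) that returns `None` for SMILES it cannot parse.

```python
from typing import Callable, Optional


def summarize_index(
    index: dict[int, dict],
    mol_weight: Callable[[str], Optional[float]],
    cutoffs: tuple[float, ...] = (1800.0, 1000.0),
) -> dict[str, int]:
    """Count entries, unique SMILES, and unique SMILES at or below each cutoff."""
    unique = {e['smiles'] for e in index.values()}
    # Compute each weight once; None marks a SMILES that could not be parsed.
    weights = {s: mol_weight(s) for s in unique}
    stats = {'entries': len(index), 'unique': len(unique)}
    for cutoff in cutoffs:
        stats[f'kept_le_{cutoff:g}'] = sum(
            1 for w in weights.values() if w is not None and w <= cutoff
        )
    return stats


# Toy data: the SMILES keys and weights below are placeholders (assumptions),
# chosen only to exercise the duplicate, over-weight, and invalid cases.
toy_index = {
    0: {'smiles': 'CCO'},
    1: {'smiles': 'CCO'},   # duplicate entry
    2: {'smiles': 'BIG'},   # placeholder for a 1500 Da molecule
    3: {'smiles': 'HUGE'},  # placeholder for a 2500 Da molecule
    4: {'smiles': 'bad'},   # placeholder for an unparseable string
}
toy_weights = {'CCO': 46.07, 'BIG': 1500.0, 'HUGE': 2500.0}
toy_stats = summarize_index(toy_index, toy_weights.get)
print(toy_stats)
```

Passing the weight function in as an argument keeps the counting logic independent of the chemistry toolkit, so the same helper can be pointed at the real index with a real weight function.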

In [6]:
from rdkit import Chem
from rdkit.Chem import Descriptors

def check_invalid_mol(smiles: str) -> bool:
    """
    Returns True if `smiles` is empty, cannot be parsed by RDKit, or
    describes a molecule with molecular weight > 1000 Da. Returns False otherwise.
    """
    if not smiles:
        return True

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return True

    mw = Descriptors.MolWt(mol)
    return mw > 1000.0

In [7]:
import pickle
import os
from tqdm import tqdm

DATASET_ROOT = '/data/nas-gpu/wang/atong/MoonshotDatasetv3'
OUTPUT_ROOT = '/data/nas-gpu/wang/atong/MoonshotDatasetv4/WorkingDir'
os.makedirs(OUTPUT_ROOT, exist_ok=True)

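# Use a context manager so the index file handle is closed after loading
with open(os.path.join(DATASET_ROOT, 'index.pkl'), 'rb') as f:
    index: dict[int, dict] = pickle.load(f)
# Deduplicate so that each SMILES is simulated only once
all_smiles: list[str] = list(set(e['smiles'] for e in index.values()))
valid_smiles: list[str] = [s for s in tqdm(all_smiles) if not check_invalid_mol(s)]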

valid_smiles_path = os.path.join(OUTPUT_ROOT, 'smiles_1000.txt')
with open(valid_smiles_path, 'w') as f:
    f.write('\n'.join(valid_smiles))

# Note: the denominator is the number of index entries, not the number of unique SMILES
print(f'Wrote {len(valid_smiles)}/{len(index)} valid SMILES to {valid_smiles_path}')

100%|██████████| 190164/190164 [00:18<00:00, 10218.07it/s]


Wrote 182637/216586 valid SMILES to /data/nas-gpu/wang/atong/MoonshotDatasetv4/WorkingDir/smiles_1000.txt
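
Before handing the file to the simulator, it may be worth reading it back and confirming it has the expected shape: one SMILES per line, no blank lines, no duplicates. The sketch below is an optional sanity check and was not part of the original run; `check_smiles_file` is a hypothetical helper, and the demo writes a small temporary file instead of touching the real output path.

```python
import os
import tempfile
from collections import Counter


def check_smiles_file(path: str) -> dict[str, int]:
    """Report line, blank-line, and duplicate counts for a one-SMILES-per-line file."""
    with open(path) as f:
        lines = f.read().split('\n')
    # Count only non-blank lines when looking for repeated SMILES.
    counts = Counter(line for line in lines if line)
    return {
        'lines': len(lines),
        'blank': sum(1 for line in lines if not line),
        'duplicates': sum(c - 1 for c in counts.values()),
    }


# Demo on a temporary file containing one deliberate duplicate.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'smiles_demo.txt')
    with open(demo_path, 'w') as f:
        f.write('\n'.join(['CCO', 'CCC', 'CCO']))
    demo_report = check_smiles_file(demo_path)

print(demo_report)
```

For the file written above, the expectation would be a `lines` value equal to the number of valid SMILES reported, with `blank` and `duplicates` both zero.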
