# Dataset Preparation Notebook
This notebook describes how to prepare the datasets used for both pre-training and fine-tuning of the chemical language models.  


## 1. Download Datasets

For reproducibility, the curated compound ID lists (i.e., **ChEMBL_ID**, **PubChem_CID**, or **ZINC_ID**) are provided in CSV format.  
Please retrieve the corresponding SMILES directly from the original databases using these IDs:

- **ChEMBL**: https://www.ebi.ac.uk/chembl/  
- **PubChem**: https://pubchem.ncbi.nlm.nih.gov/  
- **ZINC**: https://zinc15.docking.org/  
  
After downloading the SMILES data, standardize the molecular representations using RDKit canonicalization before saving.  
An example function for canonicalization is shown below:
```python
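from rdkit import Chem


def canonicalize_smiles(smi: str):
    """Return the RDKit canonical SMILES, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol)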
```
Save the canonicalized SMILES in **TSV** format with a column named `rdkit_smiles`, and place the data in the following directories:

- **Pre-training datasets** → `data/pretrain/sampled_datasets/`  
- **Fine-tuning datasets** → `data/finetune/target_actives/`  
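
The saving step can be sketched as follows. This is a minimal illustration, not the notebook's own code: `write_canonical_tsv` is a hypothetical helper, and dropping duplicates is an assumption that the text above does not state. It takes the canonicalization function as an argument, so `canonicalize_smiles` can be passed in directly.

```python
import csv
from typing import Callable, Iterable, Optional


def write_canonical_tsv(
    smiles: Iterable[str],
    out_path: str,
    canonicalize: Callable[[str], Optional[str]],
) -> int:
    """Write unique canonical SMILES to a one-column TSV and return the row count."""
    seen = set()
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["rdkit_smiles"])  # column name read by the later cells
        for smi in smiles:
            canonical = canonicalize(smi)
            # Skip entries that failed to parse (None) and duplicates (assumption).
            if canonical is None or canonical in seen:
                continue
            seen.add(canonical)
            writer.writerow([canonical])
    return len(seen)
```

For example, `write_canonical_tsv(smiles_list, 'data/pretrain/sampled_datasets/zinc.tsv', canonicalize_smiles)` would produce a file that the pre-training cell can load with `pd.read_table`.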

## 2. Pre-training Datasets

### 2-1. Dataset SMILES Randomization

In [None]:
import sys
sys.path.append('../')  # make the `src` package importable, as in the fine-tuning cell
import pandas as pd
from src.dataset_curation import SmilesToRandomSmiles
from src.paths import DATA_DIR

filename_list = ['pubchem_filtered_ac',
                 'pubchem_unfiltered_ac', 
                 'pubchem_inac', 
                 'chembl_filtered', 
                 'chembl_unfiltered', 
                 'zinc']

for filename in filename_list:
    # Load the canonical SMILES saved in Section 1
    original_smiles = pd.read_table(f'{DATA_DIR}/pretrain/sampled_datasets/{filename}.tsv')
    print(f'{filename}')

    uniq_rdsmiles_list = SmilesToRandomSmiles(original_smiles['rdkit_smiles'], num=3, seed=42)
    uniq_rdsmiles_df   = pd.DataFrame(uniq_rdsmiles_list, columns=['rdkit_smiles'])
    # index=False keeps the row index out of the TSV, matching the fine-tuning outputs
    uniq_rdsmiles_df.to_csv(f'{DATA_DIR}/pretrain/{filename}_rdsmi3.tsv', sep='\t', index=False)
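
The internals of `SmilesToRandomSmiles` are not shown in this notebook. As a rough illustration of what SMILES randomization means, the sketch below uses one common approach: shuffle the atom order with RDKit and write a non-canonical SMILES for each ordering. The function `randomize_smiles` and its retry cap are hypothetical and may differ from the project's implementation.

```python
import random
from typing import List

from rdkit import Chem


def randomize_smiles(smi: str, num: int = 3, seed: int = 42) -> List[str]:
    """Return up to `num` distinct non-canonical SMILES for one molecule."""
    mol = Chem.MolFromSmiles(smi)
    if mol is None or mol.GetNumAtoms() == 0:
        return []
    rng = random.Random(seed)  # local generator keeps the output reproducible
    atom_order = list(range(mol.GetNumAtoms()))
    variants = set()
    # Small molecules have few distinct strings, so cap the number of attempts.
    for _ in range(num * 10):
        rng.shuffle(atom_order)
        shuffled = Chem.RenumberAtoms(mol, atom_order)
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
        if len(variants) == num:
            break
    return sorted(variants)
```

Every string returned this way encodes the same molecule, so passing any of them back through `Chem.MolToSmiles(Chem.MolFromSmiles(...))` should recover the original canonical SMILES.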

## 3. Fine-tuning Datasets


In [None]:
import os
import sys
sys.path.append('../')
import pandas as pd
from sklearn.model_selection import train_test_split
from src.paths import DATA_DIR
from src.dataset_curation import filter_structural_alerts, SmilesToRandomSmiles

filtered_dir = f'{DATA_DIR}/finetune/filtered'
unfiltered_dir = f'{DATA_DIR}/finetune/unfiltered'

os.makedirs(filtered_dir, exist_ok=True)
os.makedirs(unfiltered_dir, exist_ok=True)

f_data_list = ['CHEMBL4005', 'CHEMBL1908389', 'CHEMBL284', 'CHEMBL214', 'CHEMBL253']

for f_data in f_data_list:
    data = pd.read_table(f'{DATA_DIR}/finetune/target_actives/{f_data}_actives.tsv')

    # unfiltered fine-tuning datasets
    # Note: train_size=0.2 places 20% of the actives in the training split and 80% in the test split
    train, test = train_test_split(data['rdkit_smiles'], train_size=0.2, random_state=42)
    train.to_csv(f'{unfiltered_dir}/unfiltered-{f_data}_train.tsv', sep='\t', index=False)
    test.to_csv(f'{unfiltered_dir}/unfiltered-{f_data}_test.tsv', sep='\t', index=False)

    rd3_train = SmilesToRandomSmiles(train, num=3, seed=42)
    rd3_train_df = pd.DataFrame(rd3_train, columns=['rd3_smiles'])
    rd3_train_df.to_csv(f'{unfiltered_dir}/unfiltered-{f_data}_train_rdsmi3.tsv', sep='\t', index=False)

    # filtered fine-tuning datasets
    rdkit_smi_filtered, pass_rate, _ = filter_structural_alerts(data, 'rdkit_smiles')
    print(f'{f_data} pass rate after RDKit filters: {pass_rate:.3f}')
    
    train_f, test_f = train_test_split(rdkit_smi_filtered['rdkit_smiles'], train_size=0.2, random_state=42)
    train_f.to_csv(f'{filtered_dir}/filtered-{f_data}_train.tsv', sep='\t', index=False)
    test_f.to_csv(f'{filtered_dir}/filtered-{f_data}_test.tsv', sep='\t', index=False)

    rd3_train_f = SmilesToRandomSmiles(train_f, num=3, seed=42)
    rd3_train_f_df = pd.DataFrame(rd3_train_f, columns=['rd3_smiles'])
    rd3_train_f_df.to_csv(f'{filtered_dir}/filtered-{f_data}_train_rdsmi3.tsv', sep='\t', index=False)
