Author: Yu Hamakawa

Created: September 9, 2024

# Data Curation
This tutorial walks through the curation of the data used in the analysis. Two types of datasets are used: the PQC dataset and the APTCs dataset. The PQC dataset was created by the author.

---


# Data Curation Flow


## Data Curation Flow for PQC dataset


#### Make SDF files consisting of compounds with high conformational diversity.
  - Step 1. Randomly download JSON files from the PubChemQC PM6 dataset [here](https://chibakoudai.sharepoint.com/sites/stair01/Shared%20Documents/Forms/AllItems.aspx?ga=1&id=%2Fsites%2Fstair01%2FShared%20Documents%2Fdata%2FPubChemQC%2FPM6%2Fpm6opt%5Fver2%2E0%2E0%2Fjson%2FCHON500noSalt&viewid=2a7fb7f8%2Df3f8%2D4ad2%2D931e%2Dfc786e938ea8) (509,453 compounds are downloaded). The CHON500noSalt sub-dataset is used.
  - Step 2. Convert the JSON files to CSV format.
  - Step 3. Calculate the number of rotatable bonds using RDKit and extract the top 100,000 compounds (those with the most rotatable bonds) to ensure a variety of conformations for each compound.
  - Step 4. Convert the CSV file to SDF format using RDKit. The coordinates of all hydrogens, including deuterium, are deleted, and only the coordinates of the heavy atoms are recorded in the SDF (this SDF will be referred to as the "Ground-truth conformation SDF").
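
The selection in Step 3 can be sketched in a few lines. This is only an illustration under assumptions: the compound IDs and counts below are hypothetical, and in practice the counts could come from RDKit (for example `rdMolDescriptors.CalcNumRotatableBonds`). The released `rdkit_rotatableBond.py` script may work differently.

```python
import heapq

# Hypothetical records: (compound ID, rotatable-bond count).
# In practice the counts could be computed with RDKit,
# e.g. rdMolDescriptors.CalcNumRotatableBonds(mol).
records = [
    ("cid_1", 3),
    ("cid_2", 12),
    ("cid_3", 7),
    ("cid_4", 0),
    ("cid_5", 9),
]


def select_most_flexible(rows, n):
    """Return the n rows with the largest rotatable-bond counts, largest first."""
    return heapq.nlargest(n, rows, key=lambda row: row[1])


# The tutorial keeps 100,000 compounds; 3 is used here for the toy data.
top = select_most_flexible(records, 3)
print(top)
```

`heapq.nlargest` avoids sorting the full table, which helps when only a fraction of several hundred thousand compounds is kept.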

From here, the curation flow branches into three parts (2D descriptors, 3D descriptors, and Uni-Mol input), each producing a different kind of data. Note: At this stage the SDF contains 100,000 compounds, but conformers were successfully generated for only 97,696 of them. So that the validation is consistent across datasets, all datasets below use the same 97,696 compounds.
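
Restricting every dataset to the compounds that have conformers amounts to filtering by a shared ID set. A minimal sketch, assuming each table is keyed by a compound ID (all names and values below are hypothetical):

```python
# Hypothetical compound IDs in the sampled SDF, in file order.
sampled_ids = ["cid_1", "cid_2", "cid_3", "cid_4"]

# Hypothetical IDs for which conformer generation succeeded (cid_2 failed).
conformer_ok_ids = {"cid_1", "cid_3", "cid_4"}

# Hypothetical 2D descriptor table keyed by compound ID.
ecfp_table = {
    "cid_1": [0, 1],
    "cid_2": [1, 1],
    "cid_3": [2, 0],
    "cid_4": [0, 0],
}

# Keep only compounds that have conformers, preserving the original order.
shared_ids = [cid for cid in sampled_ids if cid in conformer_ok_ids]

# Apply the same ID list to every dataset so that all of them align.
ecfp_shared = {cid: ecfp_table[cid] for cid in shared_ids}
print(shared_ids)
```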

#### 1. 2D descriptors
  - Step 1. Get canonical SMILES from the Ground-truth conformation SDF files.
  - Step 2. Calculate ECFP4 using the OpenEye toolkit.
#### 2. 3D descriptors
   - Step 1. Generate conformers for descriptor calculation from the Ground-truth conformation SDF files using OMEGA from OpenEye (RMSD threshold = 2 Å, maximum number of generated conformers = 40). 644,648 conformers are generated.
   - Step 2. Run structural optimization of the generated conformers using the PM6 Hamiltonian in MOPAC.
   - Step 3. Convert the MOPAC output files to SDF (this SDF will be referred to as the "Generated conformers SDF").
   - Step 4. Calculate two kinds of 3D descriptors (MOE and Pmapper) from both the Generated conformers SDF and the Ground-truth conformation SDF: MOE descriptors using the MOE software, and Pmapper descriptors using [the code](https://github.com/Laboratoire-de-Chemoinformatique/3D-MIL-QSSR/blob/main/miqssr/descriptor_calculation/pmapper_3d.py).
   - Step 5. Aggregate the MOE descriptors calculated from the Generated conformers SDF using six aggregation methods: Boltzmann Weight, Mean, Global Minimum, Random, RMSD Max, and RMSD Min.
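
To make the aggregation step concrete, here is a minimal sketch of four of the six methods (Boltzmann Weight, Mean, Global Minimum, Random). It is not the released `calc_aggregation.py`. The energy unit (kcal/mol), the temperature (298.15 K), and the function names are assumptions made for illustration. RMSD Max and RMSD Min are left out because they need a reference geometry.

```python
import math
import random

# Boltzmann constant in kcal/(mol*K); the energy unit is an assumption.
K_B = 0.0019872041


def boltzmann_weights(energies, temperature=298.15):
    """Weights proportional to exp(-(E_i - E_min) / (k_B * T)), summing to 1."""
    e_min = min(energies)
    factors = [math.exp(-(e - e_min) / (K_B * temperature)) for e in energies]
    total = sum(factors)
    return [f / total for f in factors]


def aggregate(descriptors, energies, method, seed=0):
    """Collapse per-conformer descriptor rows into one row per compound."""
    n_features = len(descriptors[0])
    if method == "boltzmann":
        weights = boltzmann_weights(energies)
        return [sum(w * row[j] for w, row in zip(weights, descriptors))
                for j in range(n_features)]
    if method == "mean":
        return [sum(row[j] for row in descriptors) / len(descriptors)
                for j in range(n_features)]
    if method == "global_minimum":
        # Use the descriptors of the lowest-energy conformer.
        return list(descriptors[energies.index(min(energies))])
    if method == "random":
        # Pick one conformer reproducibly.
        return list(random.Random(seed).choice(descriptors))
    raise ValueError(f"unknown method: {method}")


# Hypothetical data: three conformers, two descriptors each, energies in kcal/mol.
descriptors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
energies = [0.0, 1.0, 2.0]

weights = boltzmann_weights(energies)
mean_row = aggregate(descriptors, energies, "mean")
gmin_row = aggregate(descriptors, energies, "global_minimum")
boltz_row = aggregate(descriptors, energies, "boltzmann")
print(mean_row, gmin_row, boltz_row)
```

Subtracting the minimum energy before exponentiating keeps the weights numerically stable and does not change their normalized values.
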
#### 3. Uni-Mol input
- Step 1. Extract atom elements and coordinates from the Ground-truth conformation SDF.
- Step 2. Extract atom elements and coordinates from the Generated conformers SDF. The Global Minimum and RMSD Max conformations are selected.
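
Extracting elements and coordinates from an SDF record can be sketched without any toolkit by reading the fixed-width V2000 atom block. The molecule below is a hypothetical heavy-atom example, and the released `extract_coords.py` may use RDKit or another parser instead.

```python
# Hypothetical heavy-atom-only record in V2000 format (coordinates are illustrative).
MOL_BLOCK = """\
example_heavy_atoms
  hypothetical

  3  2  0  0  0  0  0  0  0  0999 V2000
   -0.8883    0.1670   -0.0273 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.4658   -0.5116   -0.0368 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4311    0.3229    0.5867 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
M  END
$$$$
"""


def extract_atoms(mol_block):
    """Return (symbols, coords) from the atom block of a V2000 record."""
    lines = mol_block.splitlines()
    # Line 4 is the counts line; its first three characters hold the atom count.
    n_atoms = int(lines[3][0:3])
    symbols, coords = [], []
    for line in lines[4:4 + n_atoms]:
        # x, y, z occupy three 10-character columns; the symbol follows a space.
        x, y, z = (float(line[i:i + 10]) for i in (0, 10, 20))
        coords.append((x, y, z))
        symbols.append(line[31:34].strip())
    return symbols, coords


symbols, coords = extract_atoms(MOL_BLOCK)
print(symbols)
print(coords)
```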

The 2D and 3D descriptors are used as input to the random forest model and the multi-instance learning model.
The Uni-Mol input is used as input to the Uni-Mol model.

---

The final list of datasets created is as follows:
- ECFP4 count
- Aggregated MOE/Pmapper descriptors using 6 different aggregation methods
- Non-aggregated MOE/Pmapper descriptors
- Ground-truth MOE/Pmapper descriptors
- Uni-Mol input (Ground-truth, Global minimum, RMSD Max)

As mentioned above, all datasets consist of the same 97,696 compounds. In the non-aggregated dataset, the number of conformers is 644,648.

## Data Curation Flow for APTCs dataset (APTC-1 and APTC-2 dataset)



- Get the datasets from [here](https://github.com/Laboratoire-de-Chemoinformatique/3D-MIL-QSSR/tree/main/datasets). APTC-1 contains experimental data for 88 catalysts, and APTC-2 for 40 catalysts.
- Generate conformers using RDKit and save them in pkl format (using [the code from previous research](https://github.com/Laboratoire-de-Chemoinformatique/3D-MIL-QSSR/blob/main/miqssr/conformer_generation/gen_conformers.py)). 4,371 conformers are generated for APTC-1 and 1,864 for APTC-2.

- Calculate 2D descriptors (ECFP bit/count, 2D pharmacophore fingerprint (2D PFP)) using RDKit.
- Calculate 3D descriptors (MOE using the MOE software, Pmapper) from the generated conformers.
- Aggregate the 3D descriptors using 4 aggregation methods: Boltzmann Weight, Mean, Global Minimum, Random.
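
The difference between the bit and count variants of a hashed fingerprint can be shown with a toy example. This is a conceptual sketch only: the integer identifiers stand in for hashed substructure identifiers, the vector length of 16 is arbitrary, and real toolkits such as RDKit handle hashing and folding internally.

```python
from collections import Counter


def fold_counts(feature_ids, n_bits=16):
    """Fold integer substructure identifiers into a fixed-length count vector."""
    counts = [0] * n_bits
    for feature_id, occurrences in Counter(feature_ids).items():
        counts[feature_id % n_bits] += occurrences
    return counts


def to_bits(count_vector):
    """Binarize a count vector: 1 where a feature occurs at least once."""
    return [1 if c > 0 else 0 for c in count_vector]


# Hypothetical identifiers for one molecule; 5 and 21 collide after folding.
feature_ids = [5, 21, 21, 40, 5, 5]

count_fp = fold_counts(feature_ids, n_bits=16)
bit_fp = to_bits(count_fp)
print(count_fp)
print(bit_fp)
```

The count vector keeps how often each folded feature occurs, while the bit vector only records presence.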

---

The final list of datasets created is as follows:
- ECFP4 bit/count
- 2D PFP
- Aggregated MOE/Pmapper descriptors using 4 different aggregation methods
- Non-aggregated MOE/Pmapper descriptors



# Run Data Curation Code

Because OpenEye and MOE are commercial software, licensing restrictions prevent us from releasing the code that depends on them. Therefore, please note that running the code below will not reproduce the data automatically.

## Data Curation Code for PQC dataset

In [1]:
from pathlib import Path

BASE_DIR = Path.cwd().parent
DATA_DIR = BASE_DIR / 'data' / 'PQC_dataset'
CURATION_DIR = BASE_DIR / 'src' / 'data_curation' / 'PQC_dataset'

Convert the downloaded JSON files to CSV format.

In [2]:
FILE_PATH = CURATION_DIR / 'sampling' / 'json2csv.py'
# %run $FILE_PATH
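
For readers without the original files, the JSON-to-CSV conversion can be sketched as follows. The field names `cid` and `smiles` are hypothetical. The real PubChemQC PM6 files use their own schema, and `json2csv.py` may extract different fields.

```python
import csv
import io
import json

# Hypothetical JSON records; the actual schema of the downloaded files differs.
raw = '[{"cid": 1, "smiles": "CCO"}, {"cid": 2, "smiles": "CCN"}]'


def json_to_csv(json_text, fieldnames):
    """Serialize a JSON array of flat objects as CSV text."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # drop keys that are not requested
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = json_to_csv(raw, ["cid", "smiles"])
print(csv_text)
```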

Calculate the number of rotatable bonds using RDKit and extract the top 100,000 compounds to ensure a variety of conformations for each compound.

In [3]:
FILE_PATH = CURATION_DIR / 'sampling' / 'rdkit_rotatableBond.py'
# %run $FILE_PATH

Convert the CSV file to SDF format using RDKit.

In [4]:
FILE_PATH = CURATION_DIR / 'sampling' / 'rdkit_csv2sdf.py'
# %run $FILE_PATH

**The code for ECFP calculation, conformer generation, MOPAC result parsing, and MOE descriptor generation cannot be released.**

Pmapper descriptors are calculated with [the code](https://github.com/Laboratoire-de-Chemoinformatique/3D-MIL-QSSR/blob/main/miqssr/descriptor_calculation/pmapper_3d.py).


Aggregate MOE descriptors

In [5]:
FILE_PATH = CURATION_DIR / '3d_descriptor' / 'calc_aggregation.py'
# %run $FILE_PATH

Make Uni-Mol input data

In [6]:
FILE_PATH = CURATION_DIR / 'unimol_input' / 'extract_coords.py'
# %run $FILE_PATH

## Data Curation Code for APTCs dataset

In [7]:
from pathlib import Path

BASE_DIR = Path.cwd().parent
DATA_DIR = BASE_DIR / 'data' / 'APTCs_dataset'
CURATION_DIR = BASE_DIR / 'src' / 'data_curation' / 'APTCs_dataset'

Calculate 2D descriptors (ECFP4 bit/count, 2D PFP)

In [8]:
FILE_PATH = CURATION_DIR / '2d_descriptor' / 'calc_2d.py'
# %run $FILE_PATH

**The code for MOE descriptor generation cannot be released.**

Pmapper descriptors are calculated with [the code](https://github.com/Laboratoire-de-Chemoinformatique/3D-MIL-QSSR/blob/main/miqssr/descriptor_calculation/pmapper_3d.py).


Aggregate MOE descriptors

In [9]:
FILE_PATH = CURATION_DIR / '3d_descriptor' / 'calc_agg_MOE.py'
# %run $FILE_PATH

Preprocess Pmapper descriptors

In [10]:
FILE_PATH = CURATION_DIR / '3d_descriptor' / 'merge_pmapper.py'
# %run $FILE_PATH

Aggregate Pmapper descriptors

In [11]:
FILE_PATH = CURATION_DIR / '3d_descriptor' / 'calc_agg_pmapper.py'
# %run $FILE_PATH

Because the accuracy on the APTC data was found to vary significantly depending on the splitting method,
we concatenate the data first and then split it again for training.

In [12]:
FILE_PATH = CURATION_DIR / 'concat_split.py'
# %run $FILE_PATH
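
The concatenate-then-resplit idea can be sketched as follows. This is an illustration under assumptions, not the contents of `concat_split.py`: the catalyst names, the 20% test fraction, and the single seeded shuffle are all hypothetical choices.

```python
import random


def concat_and_resplit(train, test, test_fraction=0.2, seed=0):
    """Pool two existing splits, shuffle reproducibly, and split again."""
    pooled = list(train) + list(test)
    rng = random.Random(seed)
    rng.shuffle(pooled)
    n_test = max(1, round(len(pooled) * test_fraction))
    # The first n_test shuffled items form the new test set.
    return pooled[n_test:], pooled[:n_test]


# Hypothetical catalyst IDs from a pre-existing train/test split.
old_train = [f"cat_{i}" for i in range(8)]
old_test = [f"cat_{i}" for i in range(8, 10)]

new_train, new_test = concat_and_resplit(old_train, old_test, test_fraction=0.2, seed=42)
print(len(new_train), len(new_test))
```

Repeating the call with several seeds gives a set of splits over which accuracy can be averaged, which reduces the dependence on any single split.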