# FS-Mol: Preprocessed data

This notebook gives an overview of the files available in the preprocessed data folder.

In [1]:
import numpy as np
import sklearn
import pickle

import random

In [4]:
# Define root path to your data

root_path = '/system/user/publicdata/FS-Mol/'  # This has to be adjusted
data_folder = 'preprocessed/'

The FS-Mol authors already split the data into training, validation and test sets. The corresponding folds are stored in the folders 'training', 'validation' and 'test'.

In [110]:
train_dir = 'training/'
val_dir = 'validation/'
test_dir = 'test/'
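
The cells in this notebook build file paths by string concatenation. As a small convenience, the same paths could be assembled with `os.path.join`. This is only a sketch: the helper name `split_file` and the dictionary `split_dirs` are hypothetical and not part of the preprocessed data or the FS-Mol code.

```python
import os

# Values copied from the cells above; adjust root_path to your system.
root_path = '/system/user/publicdata/FS-Mol/'
data_folder = 'preprocessed/'
split_dirs = {'train': 'training/', 'val': 'validation/', 'test': 'test/'}

def split_file(split, filename):
    """Return the full path of `filename` inside the folder of the given split."""
    return os.path.join(root_path, data_folder, split_dirs[split], filename)

print(split_file('train', 'labels.npy'))
```

With this helper, `np.load(split_file('val', 'mol_ids.npy'))` would load the validation molecule IDs, assuming the file exists at that location.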

In the following, we will guide you through the preprocessed data, using the training data. 

# Triplet data format

The data is stored as "triplets": each data point consists of a molecule ID, a task ID and a label.

E.g., (34, 7, True) means: Molecule with ID 34 is active (True) regarding task 7. 

These triplets are stored in three different files, i.e. 'mol_ids.npy', 'task_ids.npy' and 'labels.npy'. The i-th entries of the three files together form the i-th triplet.
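
To make the layout concrete, here is a minimal sketch with made-up toy arrays standing in for the three files. The helper `get_triplet` is hypothetical and only illustrates that one triplet is read from the same position in all three arrays.

```python
import numpy as np

# Toy stand-ins for the three files (values are made up for illustration).
mol_ids = np.array([34, 12, 34])
task_ids = np.array([7, 7, 2])
labels = np.array([True, False, False])

def get_triplet(i):
    """Read the i-th triplet from the same position in all three arrays."""
    return int(mol_ids[i]), int(task_ids[i]), bool(labels[i])

print(get_triplet(0))  # (34, 7, True)
```

Note that the same molecule ID (34 in the toy data) can appear in several triplets, once for each task it was measured on.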

### Labels

In [31]:
labels = np.load(root_path + data_folder + train_dir + 'labels.npy').flatten()

In [32]:
labels.shape   # 426,138 binary labels

(426138,)

In [33]:
labels[:10]

array([False, False, False, False, False,  True, False,  True, False,
        True])

### Molecule IDs

In [34]:
mol_ids = np.load(root_path + data_folder + train_dir + 'mol_ids.npy').flatten()

In [35]:
mol_ids.shape  # 426,138 mol ID entries

(426138,)

In [36]:
# Number of different molecules in training set
len(np.unique(mol_ids))

216827

In [37]:
mol_ids[:10]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

### Task IDs

In [38]:
task_ids = np.load(root_path + data_folder + train_dir + 'task_ids.npy').flatten()

In [39]:
task_ids.shape  # 426,138 task ID entries

(426138,)

In [40]:
# Number of different tasks in training set
len(np.unique(task_ids))

4938

In [41]:
task_ids[:10]

For example, the first two data points are the triplets:

In [42]:
print(f'First data point: ({mol_ids[0]}, {task_ids[0]}, {labels[0]})')
print(f'Second data point: ({mol_ids[1]}, {task_ids[1]}, {labels[1]})')


The input for our machine learning models is still missing. It is stored in the descriptor matrix.

## Descriptor matrix

The FS-Mol authors precomputed features for every molecule, i.e. extended-connectivity fingerprints plus additional molecular descriptors. We concatenated these features, normalized them with a scaler fitted on all training data, and stored the result in a descriptor matrix that covers all molecules in the training split. Row i of the matrix holds the features of the molecule with ID i.
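
The idea of fitting the scaler on the training data only can be sketched as follows. This is an illustration under assumptions: the text does not state which scaler was used, so the sketch assumes plain standardization (zero mean, unit variance per feature) and uses random toy features instead of the real fingerprints and descriptors.

```python
import numpy as np

rng = np.random.default_rng(0)
train_features = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # toy training features
val_features = rng.normal(loc=5.0, scale=2.0, size=(20, 4))     # toy validation features

# Fit: the statistics come from the training data only.
mean = train_features.mean(axis=0)
std = train_features.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns

# Transform: the same training statistics are applied to every split.
train_scaled = (train_features - mean) / std
val_scaled = (val_features - mean) / std
```

Reusing the training statistics for the other splits keeps information about validation and test molecules out of the preprocessing step.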

Attention: The file is quite large (~4 GB).
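
If the matrix does not fit comfortably into memory, one option is to memory-map the file with the `mmap_mode` argument of `np.load` and to copy only the rows that are needed. The sketch below uses a small temporary toy file (hypothetical name `toy_mol_inputs.npy`) so that it runs without the real data.

```python
import os
import tempfile
import numpy as np

# Toy matrix written to a temporary file so the example runs without the real data.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, 'toy_mol_inputs.npy')
np.save(toy_path, np.arange(20, dtype=np.float64).reshape(5, 4))

# mmap_mode='r' maps the file read-only instead of loading it fully into memory.
toy_matrix = np.load(toy_path, mmap_mode='r')

# Indexing with a list of rows copies only those rows into a regular in-memory array.
rows = np.asarray(toy_matrix[[0, 3], :])
print(rows.shape)  # (2, 4)
```

Whether this is faster than loading the full matrix depends on the storage system and the access pattern, so it is worth measuring on your own setup.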

In [8]:
descriptor_matrix = np.load(root_path + data_folder + train_dir + 'mol_inputs.npy')

In [9]:
descriptor_matrix.shape  # 216,827 unique molecules in data set, 2,248 features per molecule

(216827, 2248)

In [11]:
descriptor_matrix[0:10, 0:10]  # contains already normalized values

array([[-0.0557684 , -0.40804977, -0.1993379 , -0.09619749, -0.1376988 ,
        -0.134385  , -0.04979712, -0.07441557, -0.14601174, -0.15363863],
       [-0.0557684 , -0.40804977, -0.1993379 , -0.09619749, -0.1376988 ,
        -0.134385  , -0.04979712, -0.07441557, -0.14601174, -0.15363863],
       [-0.0557684 , -0.40804977, -0.1993379 , -0.09619749, -0.1376988 ,
        -0.134385  , -0.04979712, -0.07441557, -0.14601174, -0.15363863],
       [-0.0557684 , -0.40804977, -0.1993379 , -0.09619749, -0.1376988 ,
        -0.134385  , -0.04979712, -0.07441557, -0.14601174, -0.15363863],
       [-0.0557684 , -0.40804977, -0.1993379 , -0.09619749, -0.1376988 ,
        -0.134385  , -0.04979712, -0.07441557, -0.14601174, -0.15363863],
       [-0.0557684 , -0.40804977, -0.1993379 , -0.09619749, -0.1376988 ,
        -0.134385  , -0.04979712, -0.07441557, -0.14601174, -0.15363863],
       [-0.0557684 , -0.40804977, -0.1993379 , -0.09619749, -0.1376988 ,
        -0.134385  , -0.04979712, -0.07441557

E.g., for the first triplet (molecule ID 0), we get the descriptors that can be fed into the machine learning model by:

In [44]:
descr_triplett_0 = descriptor_matrix[[0],:]  # descriptor_matrix[mol_id, :]
descr_triplett_0.shape

(1, 2248)

This is basically all data you need for training. For the validation and test set you can access the data in the same way.
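
Since every triplet carries a task ID, all data points of a single task can be collected with a boolean mask. The following is a minimal sketch on made-up toy arrays; the helper `task_data` is hypothetical.

```python
import numpy as np

# Toy triplets and descriptor matrix (made-up values; 5 molecules, 3 features).
mol_ids = np.array([0, 1, 2, 3, 1, 4])
task_ids = np.array([0, 0, 1, 1, 1, 0])
labels = np.array([True, False, False, True, True, False])
descriptor_matrix = np.arange(15, dtype=float).reshape(5, 3)

def task_data(task_id):
    """Return (features, labels) of all triplets that belong to one task."""
    mask = task_ids == task_id
    return descriptor_matrix[mol_ids[mask], :], labels[mask]

features_t1, labels_t1 = task_data(1)
print(features_t1.shape, labels_t1)
```

The molecule IDs selected by the mask are used as row indices into the descriptor matrix, exactly as for a single triplet.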

## Epochs and Mini-batches

For every epoch, it is a good idea to shuffle the triplets:

In [72]:
numb_rows = mol_ids.shape[0]
shuffled_rows = random.sample(range(numb_rows), numb_rows)

mol_ids_shuffled = mol_ids[shuffled_rows]   # Attention: Take care not to shuffle the arrays differently!
task_ids_shuffled = task_ids[shuffled_rows]
labels_shuffled = labels[shuffled_rows]
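
An alternative sketch uses NumPy's random generator with a fixed seed, which makes the shuffling reproducible across runs. The toy arrays and the seed value are made up; the important point is that one permutation is applied to all three arrays.

```python
import numpy as np

# Toy triplets (made-up values).
mol_ids = np.array([10, 11, 12, 13, 14])
task_ids = np.array([0, 0, 1, 1, 2])
labels = np.array([True, False, True, False, True])

rng = np.random.default_rng(seed=42)   # fixed seed -> same order on every run
perm = rng.permutation(len(mol_ids))   # one permutation shared by all arrays

mol_ids_shuffled = mol_ids[perm]
task_ids_shuffled = task_ids[perm]
labels_shuffled = labels[perm]
```

Because all arrays are indexed with the same `perm`, each shuffled position still holds one original triplet.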

In [73]:
mol_ids_shuffled[:10]

array([   850,  21361, 166473,  54741,  47225, 115005,  71261, 134852,
          842, 140591])

In [74]:
task_ids_shuffled[:10]

array([ 302, 4809, 1641,  508, 2072, 1662, 4050, 2029, 4550, 4726])

In [75]:
labels_shuffled[:10]

array([ True, False, False,  True, False,  True, False,  True,  True,
       False])

Let's assume the first 50 triplets form a mini-batch. Then, we can easily get the features and labels:

In [78]:
batch_labels = labels_shuffled[:50].reshape(-1,1)  # separate name, so the full `labels` array is not overwritten
batch_labels.shape

(50, 1)

In [79]:
batch_labels

array([[ True],
       [False],
       [False],
       [ True],
       [False],
       [ True],
       [False],
       [ True],
       [ True],
       [False],
       [ True],
       [ True],
       [ True],
       [ True],
       [ True],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [ True],
       [ True],
       [ True],
       [ True],
       [False],
       [False],
       [ True],
       [ True],
       [ True],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [ True],
       [ True],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [ True]])

Features:

In [80]:
batch_mol_ids = mol_ids_shuffled[:50]

batch_descriptors = descriptor_matrix[batch_mol_ids, :]

batch_descriptors.shape

(50, 2248)

In [82]:
# Task IDs
batch_task_ids = task_ids_shuffled[:50]
batch_task_ids

array([ 302, 4809, 1641,  508, 2072, 1662, 4050, 2029, 4550, 4726, 2211,
       4402, 3677, 2581, 2512, 2354, 3194, 4512,  141, 4923, 3956, 1173,
       4253, 3881, 4778,  585, 2289, 1300,  506, 1199, 3109, 3534, 1193,
       4167,  532, 3538, 3855, 1305, 3718, 1058, 3021, 1267,  675, 1514,
        104, 3109, 4166, 4328,  398,  789])
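
The shuffling and slicing steps above can be combined into one generator that yields all mini-batches of an epoch. This is a sketch on toy data, not a prescribed training loop; the function name `iterate_minibatches` and its parameters are hypothetical.

```python
import numpy as np

def iterate_minibatches(mol_ids, task_ids, labels, descriptor_matrix, batch_size, rng):
    """Yield (descriptors, task_ids, labels) mini-batches for one shuffled epoch."""
    perm = rng.permutation(len(mol_ids))
    for start in range(0, len(perm), batch_size):
        idx = perm[start:start + batch_size]
        yield (descriptor_matrix[mol_ids[idx], :],
               task_ids[idx],
               labels[idx].reshape(-1, 1))

# Toy data (made-up values; 7 triplets, 5 molecules, 3 features).
mol_ids = np.array([0, 1, 2, 3, 1, 4, 0])
task_ids = np.array([0, 0, 1, 1, 1, 0, 2])
labels = np.array([True, False, False, True, True, False, True])
descriptor_matrix = np.arange(15, dtype=float).reshape(5, 3)

batches = list(iterate_minibatches(mol_ids, task_ids, labels, descriptor_matrix,
                                   batch_size=3, rng=np.random.default_rng(0)))
print([b[0].shape for b in batches])  # [(3, 3), (3, 3), (1, 3)]
```

The last mini-batch is smaller when the number of triplets is not a multiple of the batch size. Calling the generator again with the same `rng` object yields a new shuffling for the next epoch.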

## Additional data available in preprocessed data folder

These files are probably not needed for experiments.

SMILES strings of the molecules:

In [85]:
with open(root_path + data_folder + train_dir + 'dict_mol_smiles_id.pkl', 'rb') as f:
    dict_mol_smiles = pickle.load(f)

In [89]:
dict_mol_smiles  # Dict; Keys: SMILES string of the molecule, Values: Molecule ID (used in triplets)

{'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2nc(C)c(-c3ccc(OC)cc3)c12': 0,
 'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2nc(C)c(-c3ccccc3)c12': 1,
 'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2nc(C)c(Br)c12': 2,
 'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2nc(-c3ccc(C)s3)cc12': 3,
 'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2nc(-c3cc(C)cs3)cc12': 4,
 'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2nc(-c3cccs3)cc12': 5,
 'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2nc(-c3ccoc3)cc12': 6,
 'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2nc(-c3ccco3)cc12': 7,
 'CCCCCn1cc(C(=O)NC2CCCCCC2)c(=O)n2nc(-c3ccco3)cc12': 8,
 'CCCCCn1cc(C(=O)NC2CCCCC2)c(=O)n2nc(-c3ccco3)cc12': 9,
 'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2nc(-c3c(Cl)cccc3Cl)cc12': 10,
 'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2nc(-c3ccc(Cl)cc3Cl)cc12': 11,
 'CCCCCn1cc(C(=O)NC2CCCCCC2)c(=O)n2nc(-c3ccc(Cl)cc3Cl)cc12': 12,
 'CCCCCn1cc(C(=O)NC2CCCCC2)c(=O)n2nc(-c3ccc(Cl)cc3Cl)cc12': 13,
 'CCCCCn1cc(C(=O)NC23CC4CC(CC(C4)C2)C3)c(=O)n2n
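
The dictionary maps SMILES strings to molecule IDs, while a triplet provides the molecule ID. To look up the SMILES string for an ID, the dictionary can be inverted. The sketch below uses a small toy dictionary with the same layout; the names `toy_mol_smiles` and `id_to_smiles` are hypothetical.

```python
# Toy dictionary in the same layout as dict_mol_smiles (SMILES string -> molecule ID).
toy_mol_smiles = {'CCO': 0, 'c1ccccc1': 1, 'CC(=O)O': 2}

# Invert it so a molecule ID from a triplet can be mapped back to its SMILES string.
id_to_smiles = {mol_id: smiles for smiles, mol_id in toy_mol_smiles.items()}

# The inversion is only lossless if every ID occurs exactly once.
assert len(id_to_smiles) == len(toy_mol_smiles)

print(id_to_smiles[1])  # c1ccccc1
```

The same pattern would apply to the task dictionary if a task name is needed for a given task ID.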

Task information:

In [90]:
with open(root_path + data_folder + train_dir + 'dict_task_names_id.pkl', 'rb') as f:
    dict_task_names = pickle.load(f)

In [92]:
dict_task_names # Dict; Keys: Task names, Values: Task ID (used in tripletts)

{'CHEMBL3226104': 0,
 'CHEMBL1769274': 1,
 'CHEMBL1118313': 2,
 'CHEMBL3871852': 3,
 'CHEMBL4037498': 4,
 'CHEMBL2071936': 5,
 'CHEMBL882039': 6,
 'CHEMBL823919': 7,
 'CHEMBL4177855': 8,
 'CHEMBL3858698': 9,
 'CHEMBL660806': 10,
 'CHEMBL618129': 11,
 'CHEMBL861854': 12,
 'CHEMBL2043890': 13,
 'CHEMBL1908562': 14,
 'CHEMBL1908838': 15,
 'CHEMBL4137913': 16,
 'CHEMBL3887205': 17,
 'CHEMBL1614533': 18,
 'CHEMBL862877': 19,
 'CHEMBL1006441': 20,
 'CHEMBL4015671': 21,
 'CHEMBL1051342': 22,
 'CHEMBL4004372': 23,
 'CHEMBL3374323': 24,
 'CHEMBL3424809': 25,
 'CHEMBL4189611': 26,
 'CHEMBL761443': 27,
 'CHEMBL854265': 28,
 'CHEMBL2045992': 29,
 'CHEMBL1074684': 30,
 'CHEMBL910036': 31,
 'CHEMBL702708': 32,
 'CHEMBL3583832': 33,
 'CHEMBL1062797': 34,
 'CHEMBL2173194': 35,
 'CHEMBL863825': 36,
 'CHEMBL4154650': 37,
 'CHEMBL808179': 38,
 'CHEMBL997737': 39,
 'CHEMBL3888896': 40,
 'CHEMBL1260853': 41,
 'CHEMBL3721022': 42,
 'CHEMBL1963755': 43,
 'CHEMBL1641189': 44,
 'CHEMBL4040377': 45,
 'CHEMBL190

Information about active molecules per task:

In [93]:
with open(root_path + data_folder + train_dir + 'dict_task_id_activeMolecules.pkl', 'rb') as f:
    dict_task_activeMols = pickle.load(f)

In [94]:
dict_task_activeMols # Dict; Keys: Task ids, Values: mol_ids where label=True

{0: [283,
  127909,
  127910,
  37208,
  127911,
  127912,
  37212,
  127913,
  37211,
  37213,
  127914,
  37198,
  37209,
  37228,
  37230,
  37200,
  37236,
  37210],
 1: [29946,
  29947,
  29951,
  29952,
  29953,
  29958,
  29959,
  29960,
  29961,
  29962,
  29963,
  29964,
  29969,
  29970,
  29971,
  29972,
  29979],
 2: [183356,
  183362,
  183368,
  183374,
  183375,
  183376,
  183377,
  183378,
  183379,
  183380,
  183381,
  183382,
  183383,
  183384,
  183385,
  183386,
  183387,
  183388],
 3: [178177,
  178178,
  178179,
  178181,
  178182,
  178187,
  178188,
  178190,
  178195,
  178196,
  178197,
  178199,
  178202,
  178209,
  178212,
  178213,
  202516],
 4: [63950,
  63952,
  78954,
  63954,
  63957,
  63970,
  63971,
  102911,
  102918,
  63949,
  63948,
  63955,
  63980,
  63956,
  63979,
  63960,
  63961,
  78957,
  63964,
  63965,
  63966,
  63967,
  63968,
  63969,
  78966],
 5: [1162,
  1163,
  1165,
  1166,
  1167,
  1168,
  1169,
  1171,
  1172,
  1173,
 

Information about inactive molecules per task:

In [96]:
with open(root_path + data_folder + train_dir + 'dict_task_id_inactiveMolecules.pkl', 'rb') as f:
    dict_task_inactiveMols = pickle.load(f)

In [97]:
dict_task_inactiveMols # Dict; Keys: Task ids, Values: mol_ids where label=False

{0: [37204,
  37207,
  37214,
  127915,
  37216,
  37203,
  37238,
  37202,
  37218,
  37205,
  37201,
  37233,
  37217,
  127916,
  127917,
  303,
  127918,
  127919,
  127920],
 1: [29945,
  29948,
  29949,
  29950,
  29954,
  29955,
  29956,
  29957,
  29965,
  29966,
  29967,
  29968,
  29973,
  29974,
  29975,
  29976,
  29977,
  29978],
 2: [183353,
  183354,
  183355,
  183357,
  183358,
  183359,
  183360,
  183361,
  183363,
  183364,
  183365,
  183366,
  183367,
  183369,
  183370,
  183371,
  183372,
  183373,
  44925],
 3: [66322,
  178180,
  178183,
  178184,
  178185,
  178186,
  178189,
  178191,
  178192,
  178193,
  178194,
  178198,
  178200,
  178201,
  178203,
  178204,
  178205,
  178206,
  178207,
  178208],
 4: [63951,
  78951,
  102919,
  63953,
  63977,
  78970,
  63962,
  63972,
  63978,
  102910,
  102912,
  102913,
  102914,
  126010,
  102915,
  102916,
  102917,
  63976,
  102920,
  63958,
  63963,
  102921,
  102922,
  102923,
  102924,
  63974],
 5: [11
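
Judging from the comments in the cells above, both dictionaries group the molecule IDs of the triplets by task and label. Under that assumption, the same grouping can be rebuilt from the three triplet arrays, as sketched here on made-up toy data. The order of the molecule IDs inside the stored lists may differ from what this loop produces.

```python
from collections import defaultdict
import numpy as np

# Toy triplets (made-up values).
mol_ids = np.array([0, 1, 2, 3, 1, 4])
task_ids = np.array([0, 0, 1, 1, 1, 0])
labels = np.array([True, False, False, True, True, False])

# Group molecule IDs by task, separately for label=True and label=False.
active, inactive = defaultdict(list), defaultdict(list)
for mol_id, task_id, label in zip(mol_ids.tolist(), task_ids.tolist(), labels.tolist()):
    (active if label else inactive)[task_id].append(mol_id)

print(dict(active))    # {0: [0], 1: [3, 1]}
print(dict(inactive))  # {0: [1, 4], 1: [2]}
```

Such a reconstruction could also serve as a consistency check of the loaded dictionaries against the triplet files, for example by comparing the sets of molecule IDs per task.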

Scaler that was used to normalize the features during preprocessing (fitted on the training data). It can be used to transform new features in the same way:

In [98]:
with open(root_path + data_folder + 'scaler_trainFitted.pkl', 'rb') as f:
    scaler = pickle.load(f)

In [108]:
# toy data
new_features = np.random.random(2248).reshape(1,-1)
new_features

array([[0.40102333, 0.86074821, 0.36505494, ..., 0.53907576, 0.07462755,
        0.6799911 ]])

In [109]:
new_features_scaled = scaler.transform(new_features)
new_features_scaled 

array([[ 4.50176202,  0.7017201 ,  1.10909115, ...,  1.76731497,
        -0.14423922,  2.64454121]])