# **1. Creation of the Alphabets linking Morgan bits to atomic signatures**

In this notebook we show how to build the Alphabets for MetaNetX, eMolecules and DrugBank, and how to merge them. The notebook also builds Alphabets for ChEMBL and MolForge. The MetaNetX, eMolecules and DrugBank datasets are available at https://doi.org/10.5281/zenodo.15682264.

In [2]:
import pandas as pd

from molsig.SignatureAlphabet import compatible_alphabets, load_alphabet, merge_alphabets, SignatureAlphabet

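We set the parameters shared by all Alphabets: the Morgan radius, the number of Morgan bits, and whether stereochemistry is taken into account.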

In [3]:
radius = 2
nBits = 2048
use_stereo = True

# Path to the datasets

In [1]:
path_datasets = "C:/Users/meyerp/Documents/INRAE/Datasets/"
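
The hard-coded path above is specific to one machine. A minimal sketch of a more portable setup follows, assuming a hypothetical environment variable `MOLSIG_DATASETS` (an illustration, not part of molsig) that points to the folder holding the downloaded datasets. The helper name `dataset_path` is also hypothetical.

```python
import os
from pathlib import Path


def dataset_path(*parts, root=None):
    """Build a path below the datasets folder.

    MOLSIG_DATASETS is a hypothetical environment variable used for
    illustration; without it we fall back to a local "Datasets" folder.
    """
    base = Path(root or os.environ.get("MOLSIG_DATASETS", "Datasets"))
    return base.joinpath(*parts)


# Example: location of the MetaNetX test split below an explicit root
path_metanetx_test = dataset_path("metanetx", "test.tsv", root="Datasets")
print(path_metanetx_test.as_posix())
```

`pd.read_csv` accepts `Path` objects, so such a path can be passed to it directly instead of a concatenated string.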

### MetaNetX

We import the MetaNetX data and select the precomputed molecular signatures, in which atomic signatures are associated with Morgan bits. To obtain the full dataset, we concatenate the test, train_fold0 and valid_fold0 splits.

In [None]:
path_metanetx_0 = path_datasets + "metanetx/test.tsv"
df_metanetx_0 = pd.read_csv(path_metanetx_0, sep='\t', usecols = ["SIGNATURE_MORGANS"])
signatures_metanetx_0 = list(df_metanetx_0["SIGNATURE_MORGANS"])

path_metanetx_1 = path_datasets + "metanetx/train_fold0.tsv"
df_metanetx_1 = pd.read_csv(path_metanetx_1, sep='\t', usecols = ["SIGNATURE_MORGANS"])
signatures_metanetx_1 = list(df_metanetx_1["SIGNATURE_MORGANS"])

path_metanetx_2 = path_datasets + "metanetx/valid_fold0.tsv"
df_metanetx_2 = pd.read_csv(path_metanetx_2, sep='\t', usecols = ["SIGNATURE_MORGANS"])
signatures_metanetx_2 = list(df_metanetx_2["SIGNATURE_MORGANS"])

signatures_metanetx = signatures_metanetx_0 + signatures_metanetx_1 + signatures_metanetx_2
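
The three near-identical blocks above can be folded into one loop. The sketch below shows the idea with the standard `csv` module and in-memory files, so it runs without the datasets. The helper `read_column`, the extra `SMILES` column and the signature strings are placeholders chosen for illustration.

```python
import csv
import io


def read_column(handle, column, sep="\t"):
    """Return all values of `column` from an open tab-separated file."""
    reader = csv.DictReader(handle, delimiter=sep)
    return [row[column] for row in reader]


# Tiny in-memory stand-ins for test.tsv, train_fold0.tsv and valid_fold0.tsv.
# The values are placeholders, not real molecular signatures.
splits = [
    io.StringIO("SMILES\tSIGNATURE_MORGANS\nC\tsig_a\n"),
    io.StringIO("SMILES\tSIGNATURE_MORGANS\nCC\tsig_b\nCCC\tsig_c\n"),
    io.StringIO("SMILES\tSIGNATURE_MORGANS\nCO\tsig_d\n"),
]

# Concatenate the selected column of every split, preserving file order
signatures = []
for split in splits:
    signatures.extend(read_column(split, "SIGNATURE_MORGANS"))

print(len(signatures))
```

With pandas, the same pattern would be a loop over the three file names that calls `pd.read_csv(..., usecols=["SIGNATURE_MORGANS"])` and extends a single list.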

We compute the MetaNetX Alphabet.

In [None]:
Alphabet_metanetx = SignatureAlphabet(radius=radius, nBits=nBits, use_stereo=use_stereo)
Alphabet_metanetx.fill_from_signatures(signatures_metanetx, atomic=False)
Alphabet_metanetx.print_out()

In [None]:
Alphabet_metanetx.save("metanetx_alphabet")

### eMolecules

We import the eMolecules data and select the precomputed molecular signatures, in which atomic signatures are associated with Morgan bits. Because the training file is large, we read it in chunks and fill the Alphabet batch by batch.

In [None]:
Alphabet_emolecules = SignatureAlphabet(radius=radius, nBits=nBits, use_stereo=use_stereo)

In [None]:
path_emolecules_0 = path_datasets + "emolecules/test.tsv"
df_emolecules_0 = pd.read_csv(path_emolecules_0, sep='\t', usecols = ["SIGNATURE_MORGANS"])
signatures_emolecules_0 = list(df_emolecules_0["SIGNATURE_MORGANS"])

Alphabet_emolecules.fill_from_signatures(signatures_emolecules_0, atomic=False)

In [None]:
path_emolecules_1 = path_datasets + "emolecules/valid.tsv"
df_emolecules_1 = pd.read_csv(path_emolecules_1, sep='\t', usecols = ["SIGNATURE_MORGANS"])
signatures_emolecules_1 = list(df_emolecules_1["SIGNATURE_MORGANS"])

Alphabet_emolecules.fill_from_signatures(signatures_emolecules_1, atomic=False)

In [None]:
path_emolecules_2 = path_datasets + "emolecules/train.tsv"
df_emolecules_2_chunks = pd.read_csv(path_emolecules_2, sep='\t', usecols=["SIGNATURE_MORGANS"], chunksize=100000)

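for chunk in df_emolecules_2_chunks:
    # Print the row range of the current chunk to follow progress
    print(chunk.index)
    # Drop duplicate signatures within the chunk before filling the Alphabet
    signatures_emolecules_chunk = set(chunk["SIGNATURE_MORGANS"])
    Alphabet_emolecules.fill_from_signatures(signatures_emolecules_chunk, atomic=False)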

Alphabet_emolecules.print_out()
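
The batch strategy above can be illustrated without the datasets. The sketch below reads a file in fixed-size chunks with the standard `csv` module and accumulates unique values in a plain `set`, which stands in for the growing Alphabet. The helper `iter_chunks` and the signature strings are placeholders; this is an illustration of the pattern, not molsig's implementation.

```python
import csv
import io
from itertools import islice


def iter_chunks(handle, column, chunksize, sep="\t"):
    """Yield lists of at most `chunksize` values of `column`."""
    reader = csv.DictReader(handle, delimiter=sep)
    while True:
        chunk = [row[column] for row in islice(reader, chunksize)]
        if not chunk:
            return
        yield chunk


# In-memory placeholder for a large train.tsv (values are not real signatures)
rows = ["sig_a", "sig_b", "sig_a", "sig_c", "sig_b"]
data = io.StringIO("SIGNATURE_MORGANS\n" + "\n".join(rows) + "\n")

seen = set()  # stands in for the Alphabet being filled
n_chunks = 0
for chunk in iter_chunks(data, "SIGNATURE_MORGANS", chunksize=2):
    seen.update(set(chunk))  # deduplicate within the chunk, then accumulate
    n_chunks += 1

print(n_chunks, sorted(seen))
```

Only one chunk is held in memory at a time, so the peak memory use depends on `chunksize` rather than on the size of the file.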

In [None]:
Alphabet_emolecules.save("emolecules_alphabet")

### DrugBank

We import the DrugBank molecules as SMILES strings.

In [None]:
path_drugbank = path_datasets + "drugbank/drugbank_500_no_duplicates.tsv"
df_drugbank = pd.read_csv(path_drugbank, sep='\t')
smiles_drugbank = list(df_drugbank["SMILES_STEREO"])

We compute the DrugBank Alphabet directly from the SMILES, since no precomputed signatures are loaded here.

In [None]:
Alphabet_drugbank = SignatureAlphabet(radius=radius, nBits=nBits, use_stereo=use_stereo)
Alphabet_drugbank.fill(smiles_drugbank)
Alphabet_drugbank.print_out()

In [None]:
Alphabet_drugbank.save("drugbank_alphabet")

### ChEMBL

In [None]:
path_chembl = path_datasets + "chembl/chembl.tsv"
df_chembl = pd.read_csv(path_chembl, sep='\t', usecols = ["SIGNATURE_MORGANS"])
signatures_chembl = list(df_chembl["SIGNATURE_MORGANS"])

Alphabet_chembl = SignatureAlphabet(radius=radius, nBits=nBits, use_stereo=use_stereo)
Alphabet_chembl.fill_from_signatures(signatures_chembl, atomic=False)
Alphabet_chembl.print_out()

Alphabet_chembl.save("chembl_alphabet")

### MolForge

In [None]:
path_molforge = path_datasets + "molforge/molforge.tsv"
df_molforge = pd.read_csv(path_molforge, sep='\t', usecols = ["SIGNATURE_MORGANS"])
signatures_molforge = list(df_molforge["SIGNATURE_MORGANS"])

In [None]:
Alphabet_molforge = SignatureAlphabet(radius=radius, nBits=nBits, use_stereo=use_stereo)
Alphabet_molforge.fill_from_signatures(signatures_molforge, atomic=False)
Alphabet_molforge.print_out()

In [None]:
Alphabet_molforge.save("molforge_alphabet")

# Merge alphabets

We now merge the Alphabets. If they are no longer in memory, we first reload them from the saved files.

In [None]:
Alphabet_metanetx = load_alphabet("metanetx_alphabet")
Alphabet_metanetx.print_out()

In [None]:
Alphabet_emolecules = load_alphabet("emolecules_alphabet")
Alphabet_emolecules.print_out()

In [None]:
Alphabet_chembl = load_alphabet("chembl_alphabet")
Alphabet_chembl.print_out()

In [None]:
Alphabet_drugbank = load_alphabet("drugbank_alphabet")
Alphabet_drugbank.print_out()

We verify that the Alphabets are compatible.

In [None]:
(
    compatible_alphabets(Alphabet_metanetx, Alphabet_emolecules),
    compatible_alphabets(Alphabet_metanetx, Alphabet_drugbank),
    compatible_alphabets(Alphabet_drugbank, Alphabet_chembl),
)
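
As an intuition for this check, the sketch below assumes that two Alphabets are compatible when they were built with the same `radius`, `nBits` and `use_stereo`. That criterion is an assumption for illustration; the actual rule is whatever `compatible_alphabets` implements. `AlphabetParams` and `all_compatible` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class AlphabetParams:
    """Hypothetical container for the settings chosen at the top of the notebook."""
    radius: int
    nBits: int
    use_stereo: bool


def all_compatible(params):
    """Illustrative check: every pair must have been built with identical settings."""
    return all(a == b for a, b in combinations(params, 2))


metanetx = AlphabetParams(radius=2, nBits=2048, use_stereo=True)
emolecules = AlphabetParams(radius=2, nBits=2048, use_stereo=True)
other = AlphabetParams(radius=3, nBits=2048, use_stereo=True)

print(all_compatible([metanetx, emolecules]))
print(all_compatible([metanetx, emolecules, other]))
```

Checking every pair, rather than a chain of three pairs, guards against forgetting one Alphabet when more are added to the merge.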

We merge the Alphabets.

In [None]:
Alphabet_merged = merge_alphabets(Alphabet_metanetx, Alphabet_emolecules)
Alphabet_merged.print_out()
Alphabet_merged = merge_alphabets(Alphabet_merged, Alphabet_drugbank)
Alphabet_merged.print_out()
Alphabet_merged = merge_alphabets(Alphabet_merged, Alphabet_chembl)
Alphabet_merged.print_out()
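
The successive pairwise merges above amount to a left fold over the list of Alphabets. The sketch below shows that pattern with `functools.reduce`, using plain sets of placeholder strings and a stand-in `merge` function that takes their union. Whether `merge_alphabets` behaves exactly like a set union is an assumption made for illustration.

```python
from functools import reduce


def merge(a, b):
    """Stand-in for a pairwise merge: union of two sets of signature strings."""
    return a | b


# Placeholder alphabets; the entries are not real atomic signatures
alphabets = {
    "metanetx": {"sig_a", "sig_b"},
    "emolecules": {"sig_b", "sig_c"},
    "drugbank": {"sig_c", "sig_d"},
    "chembl": {"sig_a", "sig_e"},
}

# Fold the pairwise merge over all alphabets, left to right
merged = reduce(merge, alphabets.values())
print(len(merged))
```

Under the union assumption the result does not depend on the merge order, and entries shared between datasets are counted once.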

We export the merged Alphabet.

In [None]:
Alphabet_merged.save("metanetx_emolecules_drugbank_chembl_merged_alphabet")