# **How-To: create an Alphabet linking Morgan bits to atomic signatures**

In this notebook we show how to compute an Alphabet from a list of SMILES. This Alphabet links atomic signatures to their Morgan bits and is essential to the algorithms that enumerate molecules from ECFPs (see the `enumeration_basics` notebook).

For a large list of SMILES, computing the Alphabet can take a significant amount of time. For the MetaNetX database, which contains around 200,000 molecules, the computation takes around 1.5 hours.
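
As a rough planning aid, the MetaNetX figure above can be turned into an average per-molecule cost and extrapolated to other dataset sizes. The sketch below assumes that the runtime scales linearly with the number of SMILES and that the hardware is comparable, neither of which is guaranteed. `estimate_hours` is a hypothetical helper, not part of the `signature` package.

```python
# Reference point quoted above: about 200,000 molecules in about 1.5 hours.
REFERENCE_MOLECULES = 200_000
REFERENCE_HOURS = 1.5


def estimate_hours(n_molecules: int) -> float:
    """Extrapolate the Alphabet computation time, assuming linear scaling."""
    return REFERENCE_HOURS * n_molecules / REFERENCE_MOLECULES


# Average cost per molecule implied by the reference point, in milliseconds.
ms_per_molecule = REFERENCE_HOURS * 3600 * 1000 / REFERENCE_MOLECULES
print(f"{ms_per_molecule:.0f} ms per molecule")
print(f"{estimate_hours(1_000_000):.1f} hours for 1,000,000 molecules")
```

Molecule size and complexity vary between datasets, so such an estimate should be read as an order of magnitude only.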

Methods for merging Alphabets, which make it easier to compute them in batches, are presented at the end of the notebook.

In [1]:
from signature.signature_alphabet import compatible_alphabets, load_alphabet, merge_alphabets, SignatureAlphabet

## **Creation of an Alphabet**

We first define a list of SMILES.

In [2]:
list_smiles = [
    'O=C(O)[C@@H]1O[C@H](Oc2cccc3c(=O)oc(/C=C/CCO)cc23)[C@H](O)[C@H](O)[C@@H]1O',
    'Cc1ccc(Cn2nc(C)cc2C(=O)Nc2ccc(Cl)cc2)cc1',
    'COc1ccc([C@@H]2NC(=O)c3ccccc3O2)c(OC)c1OC',
    'COc1ccc([C@@H]2CCC[C@H](CCc3ccc(O)cc3)O2)cc1',
    'C[C@]12CC[C@H]3[C@@](O)(CCC4=CC(=O)CC[C@@]43C)[C@@H]1CC[C@@H]2C(=O)CO',
    'C=C1/C(=C\\C=C2/CCC[C@]3(C)[C@@H]([C@@H](C)[C@@H](C#CC(O)(CC)CC)OCC)CC[C@@H]23)C[C@@H](O)C[C@@H]1O',
    'CC[C@H](C)[C@H](N)C(=O)N[C@H](C(=O)N[C@@H](CO)C(=O)O)[C@@H](C)O',
    'CSCCN=C=S',
    'O=C1C[C@H](O)[C@](O)([C@@H]2C(=O)[C@]3(Cl)[C@H](Cl)C[C@]2(Cl)C3(Cl)Cl)[C@@H]1O',
    'CN(C)[C@@H]1C(=O)[C@H](C(N)=O)[C@H](O)[C@]2(O)C(=O)[C@@H]3C(=O)c4c(O)ccc(Cl)c4[C@](C)(O)[C@@H]3C[C@H]12',
    'O[C@@]12[C@@H]3C[C@@](O)(C(Cl)=C3Cl)[C@H]1[C@@H]1C[C@]2(O)[C@@H]2O[C@@H]12',
    'CC(=O)NCC/C(=C\\N)C(=O)OC(=O)C(=O)C(=O)[O-]',
    'CSCC[C@H](NC(=O)[C@@H](N)CO)C(=O)N1CCC[C@@H]1C(=O)O',
    'C[C@@H](O)[C@H](NC(=O)CNC(=O)[C@@H](N)CC(=O)O)C(=O)O',
    'CCc1ccccc1OC[C@@H](O)CN[C@H]1CCCc2ccccc21',
    'CO[C@@H]1CN(C)C(=O)c2ccc(NC(C)=O)cc2OC[C@@H](C)N(Cc2cc(F)ccc2F)C[C@@H]1C',
    'NC(=O)CCCCCC[C@H](O)/C=C/CCCCCCCCO',
    'CC[C@H](C)[C@H](N)C(=O)N[C@@H](CC(N)=O)C(=O)N[C@@H](CCSC)C(=O)O',
    'CCC[C@H](O)C(=O)NCC(=O)[O-]',
    'O=C[C@@H]1[C@H](O)[C@](O)(C(=O)[O-])[C@@H]2[C@]3(Cl)C(Cl)=C(Cl)[C@](Cl)([C@@H]3Cl)[C@]12O'
]

We select the parameters of the Alphabet:
- `radius` is the radius used to compute the ECFP and the molecular signature representations;
- `nBits` is the number of bits used to compute the ECFP representation;
- `use_stereo` is a boolean indicating whether stereochemistry information is included in the computation of the ECFP representation.

In [3]:
radius = 2
nBits = 2048
use_stereo = True

We now initialize an empty Alphabet with these parameters.

In [4]:
Alphabet = SignatureAlphabet(radius=radius, nBits=nBits, use_stereo=use_stereo)

We now fill the Alphabet from the SMILES strings in `list_smiles`. Once the computation is done, the `print_out` method reports the parameters of the Alphabet and the number of fragments associating Morgan bits with atomic signatures that were obtained (`alphabet length`).

In [5]:
Alphabet.fill(list_smiles, verbose=True)
Alphabet.print_out()

... processing alphabet iteration: 0 size: 0 time: 0.000003
filename: 
radius: 2
nBits: 2048
use_stereo: True
alphabet length: 381


Finally, we export the Alphabet with the `save` method.

In [None]:
path_alphabet = "YOUR_PATH_HERE"
Alphabet.save(path_alphabet)

## **Loading of an Alphabet**

We now show how to import a precomputed Alphabet.

In [None]:
path_alphabet = "YOUR_PATH_HERE"
Alphabet = load_alphabet(path_alphabet, verbose=True)

## **Merging of Alphabets**

We first load two precomputed Alphabets.

In [6]:
path_alphabet_1 = "YOUR_PATH_HERE"
Alphabet_1 = load_alphabet(path_alphabet_1, verbose=True)

filename: C:/Users/meyerp/Documents/INRAE/Datasets/new/alphabets/metanetx_alphabet.npz
radius: 2
nBits: 2048
use_stereo: True
alphabet length: 227717


In [7]:
path_alphabet_2 = "YOUR_PATH_HERE"
Alphabet_2 = load_alphabet(path_alphabet_2, verbose=True)

filename: C:/Users/meyerp/Documents/INRAE/Datasets/new/alphabets/emolecules_alphabet.npz
radius: 2
nBits: 2048
use_stereo: True
alphabet length: 570421


We check that the parameters of the two Alphabets are compatible.

In [8]:
compatible_alphabets(Alphabet_1, Alphabet_2)

True
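
The compatibility check is most useful as a guard that fails early, before any merge is attempted. The sketch below illustrates such a guard with a stand-in `AlphabetParams` dataclass and a hypothetical `same_parameters` helper that compares `radius`, `nBits`, and `use_stereo`. This criterion is an assumption based on the parameters printed by `print_out`; it is not the implementation of `compatible_alphabets`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlphabetParams:
    """Stand-in holding the three parameters shown by `print_out` (hypothetical)."""
    radius: int
    nBits: int
    use_stereo: bool


def same_parameters(a: AlphabetParams, b: AlphabetParams) -> bool:
    """Return True when both parameter sets are identical (assumed criterion)."""
    return (a.radius, a.nBits, a.use_stereo) == (b.radius, b.nBits, b.use_stereo)


def check_before_merge(a: AlphabetParams, b: AlphabetParams) -> None:
    """Raise a descriptive error instead of merging mismatched Alphabets."""
    if not same_parameters(a, b):
        raise ValueError(f"Incompatible parameters: {a} vs {b}")


params_1 = AlphabetParams(radius=2, nBits=2048, use_stereo=True)
params_2 = AlphabetParams(radius=2, nBits=2048, use_stereo=True)
params_3 = AlphabetParams(radius=3, nBits=2048, use_stereo=True)

check_before_merge(params_1, params_2)  # identical parameters: no error
print(same_parameters(params_1, params_3))  # radius differs
```

With the real objects, the equivalent guard would be to raise an error when `compatible_alphabets(Alphabet_1, Alphabet_2)` returns `False`, before calling `merge_alphabets`.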

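We merge the Alphabets. The merged length reported below (712,930) is smaller than the sum of the two input lengths (227,717 + 570,421 = 798,138), which indicates that the 85,208 entries common to both Alphabets are kept only once.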

In [9]:
Alphabet_merged = merge_alphabets(Alphabet_1, Alphabet_2)
Alphabet_merged.print_out()

filename: C:/Users/meyerp/Documents/INRAE/Datasets/new/alphabets/metanetx_alphabet.npz
radius: 2
nBits: 2048
use_stereo: True
alphabet length: 712930
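
For batch computation, one possible workflow is to split the SMILES list into chunks, build one Alphabet per chunk, and merge the results. The sketch below only implements the chunking step in plain Python. `chunked` is a hypothetical helper, the batch size is arbitrary, and the commented lines show how it might be combined with the `SignatureAlphabet`, `fill`, and `merge_alphabets` calls used in this notebook.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Toy input; in practice this would be the full list of SMILES.
smiles = ["CCO", "CSCCN=C=S", "CCC", "c1ccccc1", "CC(=O)O"]
batches = list(chunked(smiles, batch_size=2))
print([len(batch) for batch in batches])

# Possible use with the notebook's API (not executed here):
# merged = None
# for batch in chunked(list_smiles, batch_size=10_000):
#     partial = SignatureAlphabet(radius=radius, nBits=nBits, use_stereo=use_stereo)
#     partial.fill(batch, verbose=True)
#     merged = partial if merged is None else merge_alphabets(merged, partial)
```

Saving each partial Alphabet with `save` before merging is one way to avoid recomputing finished batches if a long run is interrupted.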
