bp-kelley/molvector

molvector

Alternate molecule representation with some interesting properties

This is a set of simple functions that convert molecules to and from a vector representation.

Representation

The representation is quite simple: it consists of blocks, each of which contains an atom record and its relevant bond records.

 [atom_record][bond_records]

These blocks are fixed size, and they can be moved and mutated between vectors while still always forming a graph (with one exception). Note that the graph may not be chemical.

The only graphs that are not generated are those containing self-references, i.e. an atom that has a bond to itself.

This property follows from how the bonds are encoded in the vector.

A bond record has a bond type and an offset. The offset is the relative index to the atom to which it is bonded.

For example, a single bond between atom index 1 and atom index 3 is encoded twice: with bond_type=1, bond_offset=2 from atom 1 to atom 3, and with bond_type=1, bond_offset=-2 from atom 3 to atom 1.

When the graph is decoded, the atom index generated from the offset is taken modulo the number of atoms, i.e. the generated index can wrap around the vector.
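The offset scheme can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `encode_bonds` and `decode_bonds`, the list-of-tuples block layout, and the choice to skip self-references while decoding are hypothetical and are not taken from molvector's actual code, which uses fixed-size blocks in a flat vector.

```python
from typing import List, Set, Tuple

# Hypothetical toy layout: one block per atom, holding an atom label and
# a list of (bond_type, bond_offset) records.
Block = Tuple[str, List[Tuple[int, int]]]


def encode_bonds(atoms: List[str], bonds: List[Tuple[int, int, int]]) -> List[Block]:
    """Build blocks from atom labels and (i, j, bond_type) triples."""
    blocks: List[Block] = [(label, []) for label in atoms]
    for i, j, bond_type in bonds:
        blocks[i][1].append((bond_type, j - i))  # record from i to j
        blocks[j][1].append((bond_type, i - j))  # mirrored record from j to i
    return blocks


def decode_bonds(blocks: List[Block]) -> Set[Tuple[int, int, int]]:
    """Recover bonds; target indices wrap modulo the number of atoms."""
    n = len(blocks)
    found = set()
    for i, (_, records) in enumerate(blocks):
        for bond_type, offset in records:
            j = (i + offset) % n
            if j == i:
                continue  # assumption: a self-reference is simply skipped
            found.add((min(i, j), max(i, j), bond_type))
    return found


atoms = ["C", "C", "O", "N"]
blocks = encode_bonds(atoms, [(1, 3, 1)])
print(blocks[1])             # ('C', [(1, 2)])
print(blocks[3])             # ('N', [(1, -2)])
print(decode_bonds(blocks))  # {(1, 3, 1)}

# An offset that runs past the end wraps around: (3 + 2) % 4 == 1.
wrapped = [("C", []), ("C", []), ("O", []), ("N", [(1, 2)])]
print(decode_bonds(wrapped))  # {(1, 3, 1)}
```

Because every offset is reduced modulo the atom count, any block placed at any position still points at a valid atom index, which is what makes blocks movable between vectors.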

Random mutation example:

Note that a mutation will always generate a graph, but the graph may not be chemical, so we may have to try a few times.

The mutation code is really quite random, and exists only to show how to swap pieces from one molecule to another.

from rdkit import Chem
from molvector import encode, decode, canonical_order, mutate
test3 = "NCCCCCOCC1OC(OCCc2c[nH]c3ccccc23)C(OCc2ccccc2)C(OCc2ccccc2)C1OCc1ccccc1"
test4 = "NCCCCC(C(=O)NCCc1ccccc1)N1Cc2[nH]c3ccccc3c2CC(NC(=O)Cc2ccccc2)C1=O.O=C(O)C(F)(F)F"
m = Chem.MolFromSmiles(test3)
m2 = Chem.MolFromSmiles(test4)
# Take the first vector produced for each molecule in canonical order.
v = encode(m, canonical_order)[0]
v2 = encode(m2, canonical_order)[0]
# Retry until the mutated vector decodes to a usable molecule.
while True:
  r = mutate(v, v2)
  mol = decode(r)
  if mol:
    smi = Chem.MolToSmiles(mol)
    print(smi)
    break
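The reason swapped pieces always decode to a graph can be shown without RDKit. The sketch below is hypothetical throughout: `swap_blocks` is not molvector's `mutate`, the blocks are variable-length toy tuples rather than the real fixed-size records, and the decoder assumes that any single bond record is enough to produce an undirected edge.

```python
import random
from typing import List, Set, Tuple

# Toy block: (atom label, [(bond_type, bond_offset), ...])
Block = Tuple[str, List[Tuple[int, int]]]


def decode_edges(blocks: List[Block]) -> Set[Tuple[int, int]]:
    """Turn offset records into undirected edges, wrapping modulo the atom count."""
    n = len(blocks)
    edges = set()
    for i, (_, records) in enumerate(blocks):
        for _bond_type, offset in records:
            j = (i + offset) % n
            if j != i:  # self-references are dropped
                edges.add((min(i, j), max(i, j)))
    return edges


def swap_blocks(a: List[Block], b: List[Block], rng: random.Random) -> List[Block]:
    """Hypothetical crossover: a random prefix of a followed by a random suffix of b."""
    cut_a = rng.randint(1, len(a))      # keep at least one block of a
    cut_b = rng.randint(0, len(b) - 1)  # keep at least one block of b
    return a[:cut_a] + b[cut_b:]


# A three-atom chain and a three-atom ring, written as offset records.
chain = [("C", [(1, 1)]), ("C", [(1, -1), (1, 1)]), ("O", [(1, -1)])]
ring = [("N", [(1, 1), (1, -1)]), ("C", [(1, -1), (1, 1)]), ("C", [(1, -1), (1, 1)])]

rng = random.Random(0)
for _ in range(5):
    child = swap_blocks(chain, ring, rng)
    edges = decode_edges(child)
    # Every decoded edge joins two distinct, in-range atoms.
    assert all(0 <= i < j < len(child) for i, j in edges)
    print(len(child), sorted(edges))
```

Whether the resulting graph is a sensible molecule is a separate question, which is why the RDKit example above retries until decoding succeeds.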

Generating ensembles for learning

By default, N = 100,000 random samples are tested. Many of them will not produce a unique SMILES during SMILES traversal.

from molvector import encode
from rdkit.Chem import MolFromSmiles
test = "NCCCCCOCC1OC(OCCc2c[nH]c3ccccc23)C(OCc2ccccc2)C(OCc2ccccc2)C1OCc1ccccc1"
m = MolFromSmiles(test)
# With no ordering function given, encode returns an ensemble of vectors.
vectors = encode(m)

To control N

import functools
from molvector import encode, generate_random_smiles_orders
# m is the molecule built in the previous example.
vectors = encode(m, functools.partial(generate_random_smiles_orders, N=100))
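The role of `functools.partial` here can be illustrated with stand-ins. The functions `toy_random_orders` and `toy_encode` below are hypothetical and only mimic the calling pattern; in particular, it is an assumption that the encoder calls the ordering function with the molecule as its only argument. The sketch also shows why many random samples yield few unique orders.

```python
import functools
import random
from typing import Callable, List, Sequence, Tuple


def toy_random_orders(atoms: Sequence[str], N: int = 1000, seed: int = 0) -> List[Tuple[int, ...]]:
    """Stand-in ordering generator: draw N shuffles and keep only the unique ones."""
    rng = random.Random(seed)
    unique = set()
    for _ in range(N):
        order = list(range(len(atoms)))
        rng.shuffle(order)
        unique.add(tuple(order))
    return sorted(unique)


def toy_encode(atoms: Sequence[str], order_fn: Callable = toy_random_orders) -> List[List[str]]:
    """Stand-in encoder: one output per ordering returned by order_fn(atoms)."""
    return [[atoms[i] for i in order] for order in order_fn(atoms)]


atoms = ["N", "C", "O"]

# partial pre-binds N, so the encoder can still call order_fn(atoms).
few = toy_encode(atoms, functools.partial(toy_random_orders, N=2))
many = toy_encode(atoms)

# Three atoms have only 3! = 6 distinct orders, however many samples are drawn.
print(len(few), len(many))
```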

Notes

This encoding may turn out not to be useful. However, I have noticed that the MAE in my training sessions drops more quickly, since (I believe) there is less to learn than when using SMILES strings as inputs.

Additionally, stereochemistry can be encoded correctly, although this is not yet done.

By default the encoder generates an ensemble of molvectors in random but unique SMILES orders. This can take a bit of time but is easy to parallelize.

Future work:

Some heuristic is needed for how many random SMILES to check for a given input size. Currently we try 10,000 times to generate random SMILES regardless of input size.
