molvector

Alternate molecule representation with some interesting properties

This is a set of simple functions that convert molecules to and from a vector representation.

Representation

The representation is quite simple: it consists of blocks, each of which contains an atom record and its relevant bond records.

 [atom_record][bond_records]

These blocks are of fixed size. Each block can be moved or mutated between vectors, and the result will always form a graph (with one exception). Note that the graph may not be chemically valid.

The only graphs that are not generated are those containing self references, i.e. an atom that has a bond to itself.

This property follows from how the bonds are encoded in the vector.

A bond record has a bond type and an offset. The offset is the relative index to the atom to which it is bonded.

For example, a single bond between atom index 1 and atom index 3 is encoded twice: as bond_type=1, bond_offset=2 on atom 1 (pointing to atom 3), and as bond_type=1, bond_offset=-2 on atom 3 (pointing to atom 1).

When the graph is decoded, the atom index computed from the offset is taken modulo the number of atoms, i.e. the generated index can wrap around the vector.
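
The offset scheme can be sketched in plain Python. This is a minimal illustration, not molvector's actual implementation: the function names `encode_offsets` and `decode_offsets`, the `(bond_type, offset)` tuple layout, and the choice to skip self references on decoding are all assumptions made for the example.

```python
def encode_offsets(num_atoms, bonds):
    """Turn absolute bonds (i, j, bond_type) into per-atom (bond_type, offset) records."""
    records = [[] for _ in range(num_atoms)]
    for i, j, bond_type in bonds:
        records[i].append((bond_type, j - i))  # record on atom i, pointing to j
        records[j].append((bond_type, i - j))  # record on atom j, pointing to i
    return records


def decode_offsets(records):
    """Recover the bond set; offsets wrap modulo the number of atoms."""
    n = len(records)
    bonds = set()
    for i, atom_bonds in enumerate(records):
        for bond_type, offset in atom_bonds:
            j = (i + offset) % n  # wrap around the vector
            if j == i:
                continue  # assumption: a self reference is simply skipped here
            bonds.add((min(i, j), max(i, j), bond_type))
    return bonds


# The single bond between atoms 1 and 3 from the text, in a 4-atom vector.
records = encode_offsets(4, [(1, 3, 1)])
print(records[1], records[3])
print(decode_offsets(records))
```

Because the decoded index is always reduced modulo the vector length, any offset, however large, still lands on an existing atom position.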

Random mutation example:

Note that a mutation will always generate a graph, but the graph may not be chemically valid, so we may have to try a few times.

The mutation code is really quite random, and exists only to show how to swap pieces from one molecule to another.

from rdkit import Chem
from molvector import encode, decode, canonical_order, mutate

test3 = "NCCCCCOCC1OC(OCCc2c[nH]c3ccccc23)C(OCc2ccccc2)C(OCc2ccccc2)C1OCc1ccccc1"
test4 = "NCCCCC(C(=O)NCCc1ccccc1)N1Cc2[nH]c3ccccc3c2CC(NC(=O)Cc2ccccc2)C1=O.O=C(O)C(F)(F)F"
m = Chem.MolFromSmiles(test3)
m2 = Chem.MolFromSmiles(test4)
# Take the first vector returned for each molecule
v = encode(m, canonical_order)[0]
v2 = encode(m2, canonical_order)[0]
# Keep mutating until the result decodes to a molecule
while True:
  r = mutate(v, v2)
  mol = decode(r)
  if mol:
    smi = Chem.MolToSmiles(mol)
    print(smi)
    break
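
To show why swapping blocks cannot produce an out-of-range bond, here is a small standalone sketch. It does not use molvector's `mutate`; the block layout `(atom, [(bond_type, offset), ...])` and the helper names `swap_block` and `decoded_edges` are hypothetical and exist only for this illustration.

```python
import random


def swap_block(target, donor, rng=random):
    """Copy one randomly chosen block from donor into a random position of target."""
    child = list(target)
    child[rng.randrange(len(child))] = donor[rng.randrange(len(donor))]
    return child


def decoded_edges(blocks):
    """Decode blocks into (from_index, to_index, bond_type) with wraparound."""
    n = len(blocks)
    edges = []
    for i, (_atom, bond_records) in enumerate(blocks):
        for bond_type, offset in bond_records:
            edges.append((i, (i + offset) % n, bond_type))
    return edges


# Two toy vectors of hypothetical blocks.
a = [("C", [(1, 1)]), ("C", [(1, -1), (1, 1)]), ("O", [(1, -1)])]
b = [("N", [(1, 2)]), ("C", []), ("C", [(1, -2)])]

child = swap_block(a, b, random.Random(0))
for edge in decoded_edges(child):
    print(edge)
```

Whatever block is swapped in, every decoded target index is a valid position in the child vector. A swapped-in block may still point at its own position or leave a bond recorded on one side only, which is one reason a mutated vector can fail to decode to a chemically valid molecule.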

Generating ensembles for learning

By default, N = 100,000 random samples are tested. Many of them will not yield a unique SMILES during SMILES traversal.

from molvector import encode
from rdkit.Chem import MolFromSmiles

test = "NCCCCCOCC1OC(OCCc2c[nH]c3ccccc23)C(OCc2ccccc2)C(OCc2ccccc2)C1OCc1ccccc1"
m = MolFromSmiles(test)
# With no order function given, encode uses its default ensemble generation
vectors = encode(m)

To control N

import functools
from molvector import encode, generate_random_smiles_orders

# m is the molecule built in the previous example
vectors = encode(m, functools.partial(generate_random_smiles_orders, N=100))

Notes

This encoding may turn out not to be useful. However, I have noted that the MAE in my training sessions drops more quickly, since (I believe) there is less to learn than when using SMILES strings as inputs.

Additionally, stereochemistry can be encoded correctly (this is not yet done).

By default the encoder generates an ensemble of molvectors in random but unique SMILES orders. This can take a bit of time but is easy to parallelize.

Future work:

Some heuristic is needed for how many random SMILES we should check for a given input size. Currently we try 10,000 times to generate random SMILES for any input size.
