Alternate molecule representation with some interesting properties
This is a set of simple functions that convert molecules to and from a vector representation.
The representation is quite simple: it consists of blocks, each of which contains an atom record and its relevant bond records.
[atom_record][bond_records]
These blocks are of fixed size, and each block can be moved and mutated between vectors and will always form a graph (with one exception). Note that the graph may not be chemical.
The only graphs that are not generated contain self references, i.e. an atom that has a bond to itself.
This property follows from how the bonds are encoded in the vector.
A bond record has a bond type and an offset. The offset is the relative index to the atom to which it is bonded.
For example, a single bond between atom index 1 and atom index 3 is encoded as bond_type=1, bond_offset=2 on atom 1 (pointing to atom 3) and as bond_type=1, bond_offset=-2 on atom 3 (pointing to atom 1).
When the graph is decoded, the atom index generated from the offset is taken modulo the number of atoms, i.e. the generated index can wrap around the vector.
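The offset scheme can be illustrated with a minimal pure-Python sketch. This is not the molvector implementation: the block layout (a symbol plus a list of (bond_type, bond_offset) records), the function names encode_bonds and decode_bonds, and the choice to silently skip self references are all assumptions made for illustration. Fixed-size padding of the blocks is not modeled.

```python
# Hypothetical layout for illustration only: one block per atom, holding an
# element symbol and a list of (bond_type, bond_offset) records.

def encode_bonds(atoms, bonds):
    """Build blocks from atom symbols and (i, j, bond_type) bonds."""
    blocks = [(symbol, []) for symbol in atoms]
    for i, j, bond_type in bonds:
        blocks[i][1].append((bond_type, j - i))  # record on atom i, pointing to j
        blocks[j][1].append((bond_type, i - j))  # record on atom j, pointing to i
    return blocks

def decode_bonds(blocks):
    """Recover undirected bonds; each offset wraps modulo the atom count."""
    n = len(blocks)
    bonds = set()
    for i, (_symbol, records) in enumerate(blocks):
        for bond_type, offset in records:
            j = (i + offset) % n  # wrap around the vector
            if j == i:
                continue  # assumption: a self reference is simply skipped
            bonds.add((min(i, j), max(i, j), bond_type))
    return bonds

# The example from the text: a single bond between atom 1 and atom 3.
blocks = encode_bonds(["C", "C", "O", "N"], [(1, 3, 1)])
print(blocks[1])  # atom 1 carries bond_type=1, bond_offset=2
print(blocks[3])  # atom 3 carries bond_type=1, bond_offset=-2
print(decode_bonds(blocks))
```

Because the index is reduced modulo the number of atoms, any integer offset decodes to some atom in the vector, which is why a moved or mutated block still yields a graph.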
Note a mutation will always generate a graph, but it may not be chemical, so we may have to try a few times.
The mutation code is really quite random, and exists only to show how to swap pieces from one molecule to another.
from rdkit import Chem
from molvector import encode, decode, canonical_order, mutate
test3 = "NCCCCCOCC1OC(OCCc2c[nH]c3ccccc23)C(OCc2ccccc2)C(OCc2ccccc2)C1OCc1ccccc1"
test4 = "NCCCCC(C(=O)NCCc1ccccc1)N1Cc2[nH]c3ccccc3c2CC(NC(=O)Cc2ccccc2)C1=O.O=C(O)C(F)(F)F"
m = Chem.MolFromSmiles(test3)
m2 = Chem.MolFromSmiles(test4)
v = encode(m, canonical_order)[0]
v2 = encode(m2, canonical_order)[0]
while True:
    # Mutation is random, so retry until the result decodes to a molecule.
    r = mutate(v, v2)
    mol = decode(r)
    if mol:
        smi = Chem.MolToSmiles(mol)
        print(smi)
        break
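To make the block-swapping idea concrete without RDKit, here is a small self-contained sketch. It is not the library's mutate function: swap_block, decode_bonds, the toy vectors, and the block layout (a symbol plus (bond_type, bond_offset) records) are hypothetical names and structures chosen for illustration, and dropping self references is an assumption.

```python
import random

def decode_bonds(blocks):
    """Decode blocks into undirected bonds; offsets wrap modulo the atom count."""
    n = len(blocks)
    bonds = set()
    for i, (_symbol, records) in enumerate(blocks):
        for bond_type, offset in records:
            j = (i + offset) % n
            if j != i:  # assumption: self references are simply dropped
                bonds.add((min(i, j), max(i, j), bond_type))
    return bonds

def swap_block(target, donor, rng):
    """Return a copy of target with one random block replaced by a random donor block."""
    child = list(target)
    child[rng.randrange(len(child))] = donor[rng.randrange(len(donor))]
    return child

# Toy vectors, one block per atom: C-C-O and N-C=O.
ethanol_like = [("C", ((1, 1),)), ("C", ((1, -1), (1, 1))), ("O", ((1, -1),))]
amide_like = [("N", ((1, 1),)), ("C", ((1, -1), (2, 1))), ("O", ((2, -1),))]

rng = random.Random(0)
child = swap_block(ethanol_like, amide_like, rng)
print(child)
print(sorted(decode_bonds(child)))
```

Every child decodes to bonds between valid atom indices, but the result can be chemically meaningless (for example, two records may disagree on the bond type between the same pair of atoms), which is why the loop above retries until decoding succeeds.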
By default, N = 100,000 random samples are tested. Many of them will not yield a unique smiles string during smiles traversal.
from molvector import encode
from rdkit.Chem import MolFromSmiles
test = "NCCCCCOCC1OC(OCCc2c[nH]c3ccccc23)C(OCc2ccccc2)C(OCc2ccccc2)C1OCc1ccccc1"
m = MolFromSmiles(test)
vectors = encode(m)
To control N, pass a partially applied ordering function:
import functools
from molvector import generate_random_smiles_orders
vectors = encode(m, functools.partial(generate_random_smiles_orders, N=100))
It may turn out that this encoding is not useful. However, I have noted that the MAE in my training sessions drops more quickly, as (I believe) there is less to learn than when using smiles strings as inputs.
Additionally, stereochemistry could be encoded correctly, although this is not yet done.
By default the encoder generates an ensemble of molvectors in random but unique smiles orders. This can take a bit of time but is easy to parallelize.
There needs to be some heuristic on how many random smiles we should check for a given size of input. Currently we try 10,000 times to generate random smiles for any input size.
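The fixed-budget sampling described above can be sketched as follows. This is an illustration only, not the code behind generate_random_smiles_orders: collect_unique and the stand-in sampler are hypothetical. With a real molecule the sampler might wrap RDKit's Chem.MolToSmiles(mol, doRandom=True); here a random permutation of four atom indices stands in so the example runs without RDKit.

```python
import random

def collect_unique(sample, n_attempts):
    """Call sample() n_attempts times and keep each distinct result once, in first-seen order."""
    seen = set()
    unique = []
    for _ in range(n_attempts):
        candidate = sample()
        if candidate not in seen:
            seen.add(candidate)
            unique.append(candidate)
    return unique

# Stand-in sampler (assumption): a random ordering of 4 atom indices.
rng = random.Random(42)

def random_order():
    return tuple(rng.sample(range(4), 4))

orders = collect_unique(random_order, n_attempts=1000)
print(len(orders), "unique orders from 1000 attempts")
```

A small input has only a handful of distinct orders (at most 24 in this toy case), so most attempts are duplicates. That is the motivation for scaling the number of attempts with the size of the input rather than using one fixed budget.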