# Tutorial Part 4: Going Deeper On Molecular Featurizations

One of the most important steps of doing machine learning on molecular data is transforming this data into a form amenable to the application of learning algorithms. This process is broadly called "featurization" and involves turning a molecule into a vector or tensor of some sort. There are a number of different ways of doing such transformations, and the choice of featurization often depends on the problem at hand.

In this tutorial, we explore the different featurization methods available for molecules. These featurization methods include:

1. `ConvMolFeaturizer`
2. `WeaveFeaturizer`
3. `CircularFingerprint`
4. `RDKitDescriptors`
5. `BPSymmetryFunctionInput`
6. `CoulombMatrix`
7. `CoulombMatrixEig`
8. `AdjacencyFingerprint`

Let's start with some basic imports:

In [6]:
from __future__ import print_function
from __future__ import division
from __future__ import unicode_literals

import numpy as np
from rdkit import Chem

from deepchem.feat import ConvMolFeaturizer, WeaveFeaturizer, CircularFingerprint
from deepchem.feat import AdjacencyFingerprint, RDKitDescriptors
from deepchem.feat import BPSymmetryFunctionInput, CoulombMatrix, CoulombMatrixEig
from deepchem.utils import conformers

We use propane ($CH_3 CH_2 CH_3$) as a running example throughout this tutorial. Many of the featurization methods use conformers of the molecule. A conformer can be generated using the `ConformerGenerator` class in `deepchem.utils.conformers`.

### RDKitDescriptors

`RDKitDescriptors` featurizes a molecule by computing descriptor values for a specified set of descriptors. Intrinsic to the featurizer is a set of allowed descriptors, which can be accessed using `RDKitDescriptors.allowedDescriptors`.

The featurizer uses the descriptors in `rdkit.Chem.Descriptors.descList`, checks if they are in the list of allowed descriptors and computes the descriptor value for the molecule.
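
The filtering step described above can be sketched without RDKit. This is a minimal illustration, not the DeepChem implementation: `toy_desc_list` is a hypothetical stand-in that only mimics the shape of `rdkit.Chem.Descriptors.descList` (a list of `(name, function)` pairs), and the toy functions operate on a SMILES string rather than an RDKit molecule.

```python
# Hypothetical stand-in for rdkit.Chem.Descriptors.descList: (name, function) pairs.
# These toy functions are NOT real RDKit descriptors; they only work for
# simple SMILES made of single-letter elements and ring-closure digits.
toy_desc_list = [
    ("HeavyAtomCount", lambda smiles: sum(ch.isalpha() for ch in smiles)),
    ("RingCount", lambda smiles: sum(ch.isdigit() for ch in smiles) // 2),
    ("NotAllowed", lambda smiles: -1),
]

# Hypothetical allow-list, playing the role of RDKitDescriptors.allowedDescriptors.
allowed_descriptors = {"HeavyAtomCount", "RingCount"}


def featurize(smiles):
    """Compute every descriptor whose name is in the allow-list."""
    return [
        float(function(smiles))
        for name, function in toy_desc_list
        if name in allowed_descriptors
    ]


features = featurize("CCC")
print(features)
```

The descriptor named `NotAllowed` is skipped because it is absent from the allow-list, so the feature vector has one entry per allowed descriptor.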

In [7]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)

Let's check the allowed list of descriptors. As you will see shortly, there's a wide range of chemical properties that RDKit computes for us.

In [8]:
for descriptor in RDKitDescriptors.allowedDescriptors:
    print(descriptor)

NumAromaticHeterocycles
EState_VSA7
EState_VSA6
MolMR
BertzCT
SMR_VSA10
NHOHCount
MinPartialCharge
HallKierAlpha
MinEStateIndex
Chi1n
Chi4n
ExactMolWt
VSA_EState8
SMR_VSA5
SMR_VSA9
NumAliphaticCarbocycles
VSA_EState2
SlogP_VSA6
VSA_EState7
PEOE_VSA7
NumHeteroatoms
Chi1v
PEOE_VSA2
SMR_VSA4
PEOE_VSA9
HeavyAtomCount
NumRadicalElectrons
EState_VSA3
NumValenceElectrons
EState_VSA5
PEOE_VSA10
EState_VSA11
EState_VSA10
SMR_VSA7
Chi1
RingCount
NumHDonors
LabuteASA
VSA_EState1
Chi2v
NumSaturatedCarbocycles
SMR_VSA8
Chi3v
EState_VSA9
Kappa2
NumAliphaticHeterocycles
Chi0
SMR_VSA1
SMR_VSA2
PEOE_VSA1
MolLogP
NumAliphaticRings
MinAbsPartialCharge
BalabanJ
Kappa1
PEOE_VSA13
EState_VSA4
SlogP_VSA11
MolWt
SMR_VSA3
Chi2n
VSA_EState3
MaxEStateIndex
PEOE_VSA11
Ipc
MaxAbsPartialCharge
Chi0n
VSA_EState10
VSA_EState5
EState_VSA1
FractionCSP3
Kappa3
MaxPartialCharge
PEOE_VSA6
SlogP_VSA7
NumHAcceptors
NumAromaticCarbocycles
SMR_VSA6
Chi3n
HeavyAtomMolWt
SlogP_VSA8
VSA_EState9
PEOE_VSA3
SlogP_VSA5
NumRotatableB

In [9]:
rdkit_desc = RDKitDescriptors()
features = rdkit_desc._featurize(example_mol)

print('The number of descriptors present are: ', len(features))

The number of descriptors present are:  111


### BPSymmetryFunction

The Behler-Parrinello symmetry function featurizer, `BPSymmetryFunctionInput`, featurizes a molecule by computing the atomic number and coordinates of each atom in the molecule. The features can be used as input for symmetry functions such as `RadialSymmetry`, `DistanceMatrix` and `DistanceCutoff`. More details on these symmetry functions can be found in [this paper](https://journals.aps.org/prl/pdf/10.1103/PhysRevLett.98.146401). These functions can be found in `deepchem.feat.coulomb_matrices`.

The featurizer takes in `max_atoms` as an argument. As input, it takes in a conformer of the molecule and computes:

1. coordinates of every atom in the molecule (in Bohr units)
2. the atomic numbers for all atoms. 

These features are concatenated and padded with zeros to account for the different numbers of atoms across molecules.
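
The concatenate-and-pad step can be sketched in plain NumPy. This is a minimal sketch under stated assumptions, not the DeepChem code: the helper name `pad_atom_features`, the three-atom fragment, and its coordinates are all made up for illustration, and the row layout `[atomic number, x, y, z]` is assumed.

```python
import numpy as np

# Conversion factor from Angstrom to Bohr (1 Bohr = 0.52917721092 Angstrom).
ANGSTROM_TO_BOHR = 1.0 / 0.52917721092


def pad_atom_features(atomic_numbers, coords_angstrom, max_atoms):
    """Hypothetical helper: build a (max_atoms, 4) array of [Z, x, y, z] rows."""
    n_atoms = len(atomic_numbers)
    if n_atoms > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    features = np.zeros((max_atoms, 4))
    features[:n_atoms, 0] = atomic_numbers
    # Coordinates are converted to Bohr before being stored.
    features[:n_atoms, 1:] = np.asarray(coords_angstrom) * ANGSTROM_TO_BOHR
    return features


# Made-up three-atom fragment with coordinates in Angstrom.
atomic_numbers = [6, 1, 1]
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
padded = pad_atom_features(atomic_numbers, coords, max_atoms=5)
print(padded.shape)
```

Rows beyond the real atoms stay all-zero, which is why an atomic number of 0 can be read as a padding row.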

In [10]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)
engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)

Let's now take a look at the actual featurized matrix that comes out.

In [12]:
bp_sym = BPSymmetryFunctionInput(max_atoms=20)
features = bp_sym._featurize(mol=example_mol)
features

array([[ 6.        ,  2.33166293, -0.52962788, -0.48097309],
       [ 6.        ,  0.0948792 ,  1.07597567, -1.33579553],
       [ 6.        , -2.40436371, -0.29483572, -0.90388318],
       [ 1.        ,  2.18166462, -0.95639011,  1.569049  ],
       [ 1.        ,  4.1178375 ,  0.51816193, -0.81949623],
       [ 1.        ,  2.39319787, -2.32844253, -1.56157176],
       [ 1.        ,  0.29919987,  1.51730566, -3.37889252],
       [ 1.        ,  0.08875543,  2.88229706, -0.26437996],
       [ 1.        , -3.99100651,  0.92016315, -1.54358853],
       [ 1.        , -2.66167993, -0.71627602,  1.136556  ],
       [ 1.        , -2.45014726, -2.08833123, -1.99406318],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.

A simple check for the featurization would be to count the different atomic numbers present in the features.

In [13]:
from collections import Counter

# The first column holds the atomic numbers (0 marks a padding row).
atomic_numbers = features[:, 0]
unique_numbers = Counter(atomic_numbers)
print(unique_numbers)

Counter({0.0: 9, 1.0: 8, 6.0: 3})


For propane, we have $3$ `C-atoms` and $8$ `H-atoms`, and these numbers are in agreement with the results shown above. There's also the additional padding of 9 atoms, to equalize with `max_atoms`.

### CoulombMatrix

`CoulombMatrix` featurizes a molecule by computing the Coulomb matrices for different conformers of the molecule and returning them as a list.

A Coulomb matrix tries to encode the energy structure of a molecule. The matrix is symmetric, with the off-diagonal elements capturing the Coulombic repulsion between pairs of atoms and the diagonal elements capturing atomic energies using the atomic numbers. More information on the functional forms used can be found [here](https://journals.aps.org/prl/pdf/10.1103/PhysRevLett.108.058301).
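
To make the structure concrete, here is a minimal NumPy sketch of the functional form given in the linked paper: diagonal entries $0.5 Z_i^{2.4}$ and off-diagonal entries $Z_i Z_j / |R_i - R_j|$. The function name and the three-atom geometry are made up for illustration, and details of the DeepChem implementation (units, padding, conformer handling) may differ.

```python
import numpy as np


def coulomb_matrix(atomic_numbers, coords):
    """Illustrative Coulomb matrix for a single geometry (not the DeepChem code)."""
    z = np.asarray(atomic_numbers, dtype=float)
    r = np.asarray(coords, dtype=float)
    # Pairwise interatomic distances, shape (n_atoms, n_atoms).
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    # Placeholder on the diagonal to avoid dividing by zero; overwritten below.
    np.fill_diagonal(dist, 1.0)
    matrix = np.outer(z, z) / dist
    # Diagonal: atomic self-energy term 0.5 * Z**2.4.
    np.fill_diagonal(matrix, 0.5 * z ** 2.4)
    return matrix


# Made-up geometry: one oxygen and two hydrogens, arbitrary coordinates.
m = coulomb_matrix([8, 1, 1], [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
print(np.allclose(m, m.T))
```

The matrix depends only on atomic numbers and interatomic distances, so rotating or translating the coordinates leaves it unchanged.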

The featurizer takes `max_atoms` as an argument and also has options for removing hydrogens from the molecule (`remove_hydrogens`), generating additional random Coulomb matrices (`randomize`), and returning only the upper triangular part of the matrix (`upper_tri`).

In [14]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)

engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)

print("Number of available conformers for propane: ", len(example_mol.GetConformers()))

Number of available conformers for propane:  1


In [15]:
coulomb_mat = CoulombMatrix(max_atoms=20, randomize=False, remove_hydrogens=False, upper_tri=False)
features = coulomb_mat._featurize(mol=example_mol)

A simple check for the featurization is to see whether the feature list has the same length as the number of conformers:

In [16]:
print(len(example_mol.GetConformers()) == len(features))

True


### CoulombMatrixEig

`CoulombMatrix` is invariant to molecular rotation and translation, since neither the interatomic distances nor the atomic numbers change. However, the matrix is not invariant to permutations of the atom indices. To deal with this, the `CoulombMatrixEig` featurizer was introduced, which uses the eigenvalue spectrum of the Coulomb matrix and is invariant to permutations of the atom indices.

`CoulombMatrixEig` inherits from `CoulombMatrix` and featurizes a molecule by first computing the coulomb matrices for different conformers of the molecule and then computing the eigenvalues for each coulomb matrix. These eigenvalues are then padded to account for variation in number of atoms across molecules.
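
The permutation argument can be checked numerically. The sketch below is an illustration rather than the DeepChem implementation: the matrix values are made up, `padded_eigenvalues` is a hypothetical helper, and sorting eigenvalues by decreasing absolute value is an assumed convention.

```python
import numpy as np


def padded_eigenvalues(matrix, max_atoms):
    """Hypothetical helper: sorted eigenvalue spectrum, zero-padded to max_atoms."""
    eigvals = np.linalg.eigvalsh(matrix)  # the matrix is symmetric
    # Assumed convention: order by decreasing absolute value.
    eigvals = eigvals[np.argsort(-np.abs(eigvals))]
    n_pad = max_atoms - len(eigvals)
    if n_pad < 0:
        raise ValueError("matrix is larger than max_atoms")
    return np.pad(eigvals, (0, n_pad), mode="constant")


# Made-up symmetric matrix standing in for a three-atom Coulomb matrix.
m = np.array([[73.5, 4.0, 4.0],
              [4.0, 0.5, 0.35],
              [4.0, 0.35, 0.5]])

# Relabel the atoms: swap atoms 0 and 1 in both rows and columns.
perm = [1, 0, 2]
m_perm = m[np.ix_(perm, perm)]

eig_original = padded_eigenvalues(m, max_atoms=5)
eig_permuted = padded_eigenvalues(m_perm, max_atoms=5)

print(np.allclose(m, m_perm))
print(np.allclose(eig_original, eig_permuted))
```

Relabeling the atoms changes the matrix itself but not its eigenvalues, because the permuted matrix is similar to the original one.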

The featurizer takes `max_atoms` as an argument and also has options for removing hydrogens from the molecule (`remove_hydrogens`) and generating additional random Coulomb matrices (`randomize`).

In [17]:
example_smile = "CCC"
example_mol = Chem.MolFromSmiles(example_smile)

engine = conformers.ConformerGenerator(max_conformers=1)
example_mol = engine.generate_conformers(example_mol)

print("Number of available conformers for propane: ", len(example_mol.GetConformers()))

Number of available conformers for propane:  1


In [18]:
coulomb_mat_eig = CoulombMatrixEig(max_atoms=20, randomize=False, remove_hydrogens=False)
features = coulomb_mat_eig._featurize(mol=example_mol)

In [19]:
print(len(example_mol.GetConformers()) == len(features))

True


### Adjacency Fingerprints