## Atom featurizers

In [1]:
from chemprop.featurizers.atom import MultiHotAtomFeaturizer

This is an example atom to featurize.

In [2]:
from rdkit import Chem

atom_to_featurize = Chem.MolFromSmiles("CC").GetAtoms()[0]

### Atom features

The following atom features are generated by `rdkit` and cast to one-hot vectors (except for mass, which is divided by 100). These vectors are concatenated into a single multi-hot feature vector, with the scaled mass as the final floating-point entry. Every feature except aromaticity and mass is padded with one extra bit that is set for any unknown value.

 - atomic number
 - degree
 - formal charge
 - chiral tag
 - number of hydrogens
 - hybridization
 - aromaticity
 - mass
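
As a rough illustration of the encoding described above, the following sketch builds a small multi-hot vector by hand with NumPy. The helper `one_hot_with_unknown` and the shortened choice lists are hypothetical and chosen for brevity. Placing the unknown bit last in each block is also an assumption of this sketch, not a statement about Chemprop's implementation.

```python
import numpy as np


def one_hot_with_unknown(value, choices):
    """One-hot encode `value` over `choices`, with one extra trailing bit for unknowns."""
    vec = np.zeros(len(choices) + 1)
    # Values that are not in `choices` fall through to the final "unknown" slot.
    idx = choices.index(value) if value in choices else len(choices)
    vec[idx] = 1.0
    return vec


# Hypothetical, shortened choice lists (illustration only).
atomic_nums = [1, 6, 7, 8]
degrees = [0, 1, 2, 3, 4]

# A carbon-like atom: atomic number 6, degree 4, not aromatic, mass 12.011.
features = np.concatenate(
    [
        one_hot_with_unknown(6, atomic_nums),
        one_hot_with_unknown(4, degrees),
        [0.0],           # aromaticity: a single bit, no unknown slot
        [12.011 / 100],  # mass: divided by 100, not one-hot encoded
    ]
)
print(features)

# An element outside the choices (sulfur, 16) sets the trailing unknown bit.
unknown = one_hot_with_unknown(16, atomic_nums)
print(unknown)
```

The resulting vector has 5 + 6 + 1 + 1 = 13 entries: two one-hot blocks (each with its unknown bit), one aromaticity bit, and the scaled mass.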

### v2

The v2 atom featurizer is the default. It provides bits in the feature vector for:

 - atomic number
    - first four rows of the periodic table plus iodine
 - degree
    - 0 to 5 bonds
 - formal charge
    - -2, -1, 0, 1, 2
 - chiral tag
    - 0, 1, 2, 3 - See `rdkit.Chem.rdchem.ChiralType` for more details
 - number of hydrogens
    - 0 to 4
 - hybridization
    - S, SP, SP2, SP2D, SP3, SP3D, SP3D2

In [3]:
featurizer = MultiHotAtomFeaturizer.v2()
featurizer(atom_to_featurize)

array([0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       1.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       1.     , 0.     , 1.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
       0.     , 0.12011])
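
To read the output above, it can help to map each nonzero position back to its feature block. The sketch below does this in plain Python. The block sizes are derived by hand from the v2 list (listed choices plus one unknown bit each), and the block order is assumed to follow the order of that list; neither is read from Chemprop itself, and `locate` is a hypothetical helper.

```python
# Block sizes for v2: listed choices plus one unknown bit each (hand-derived).
blocks = [
    ("atomic number", 37 + 1),       # elements 1-36 plus iodine
    ("degree", 6 + 1),               # 0 to 5
    ("formal charge", 5 + 1),        # -2 to 2
    ("chiral tag", 4 + 1),           # 0 to 3
    ("number of hydrogens", 5 + 1),  # 0 to 4
    ("hybridization", 7 + 1),        # S, SP, SP2, SP2D, SP3, SP3D, SP3D2
    ("aromaticity", 1),              # single bit, no unknown slot
    ("mass", 1),                     # scaled float, no unknown slot
]

# Positions of the nonzero entries in the printed v2 output.
nonzero = [5, 42, 49, 51, 59, 66, 71]


def locate(index):
    """Return (feature name, offset within that feature's block) for a vector index."""
    start = 0
    for name, size in blocks:
        if index < start + size:
            return name, index - start
        start += size
    raise IndexError(index)


total = sum(size for _, size in blocks)
print(total)  # 72, the number of entries in the printed array
for i in nonzero:
    print(i, locate(i))
```

Each one-hot block contributes exactly one nonzero entry, and the last entry is the scaled mass. An offset is a position in the featurizer's internal choice order, which need not match the order in which the values are listed above.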

### v1

The v1 atom featurizer is the one used in Chemprop v1. It matches the v2 atom featurizer except for:

 - atomic number
    - first 100 elements (customizable)
 - hybridization
    - SP, SP2, SP3, SP3D, SP3D2

In [4]:
featurizer = MultiHotAtomFeaturizer.v1()
featurizer(atom_to_featurize)

array([0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.  

In [5]:
featurizer = MultiHotAtomFeaturizer.v1(max_atomic_num=53)
featurizer(atom_to_featurize)

array([0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 1.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 1.     , 0.     , 1.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
       0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
       0.     , 0.12011])
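
Because the atomic number block has one bit per element plus an unknown bit, `max_atomic_num` directly sets the length of the feature vector. The sketch below works out that relationship by hand, assuming the remaining v1 blocks have the sizes listed in this section. The function `v1_length` is hypothetical and not part of Chemprop.

```python
def v1_length(max_atomic_num: int = 100) -> int:
    """Expected v1 feature-vector length for a given `max_atomic_num` (hand-derived)."""
    one_hot_choices = [
        max_atomic_num,  # atomic number: 1..max_atomic_num
        6,               # degree: 0 to 5
        5,               # formal charge: -2 to 2
        4,               # chiral tag: 0 to 3
        5,               # number of hydrogens: 0 to 4
        5,               # hybridization: SP, SP2, SP3, SP3D, SP3D2
    ]
    # One unknown bit per one-hot block, plus one entry each for aromaticity and mass.
    return sum(n + 1 for n in one_hot_choices) + 2


print(v1_length(53))  # 86, the number of entries in the output above
print(v1_length())    # 133 for the default of 100 elements
```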

### organic

The organic atom featurizer is optimized to reduce feature vector size for organic molecules. It is the same as the v2 atom featurizer except for:

 - atomic number
    - H, B, C, N, O, F, Si, P, S, Cl, Br, and I atoms
 - hybridization
    - S, SP, SP2, SP3

In [6]:
featurizer = MultiHotAtomFeaturizer.organic()
featurizer(atom_to_featurize)

array([0.     , 0.     , 1.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 1.     , 0.     , 1.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 1.     ,
       0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
       0.     , 0.12011])

### Custom

Custom atom featurizers can also be created by specifying the choices for atomic number, degree, formal charge, chiral tag, number of hydrogens, and hybridization. Aromaticity is always featurized as True/False.

In [7]:
from rdkit.Chem.rdchem import HybridizationType

atomic_nums = [1, 6, 7, 8]
degrees = [0, 1, 2, 3, 4]
formal_charges = [-2, -1, 0, 1, 2]
chiral_tags = [0, 1, 2, 3]
num_Hs = [0, 1, 2, 3, 4]
hybridizations = [HybridizationType.SP, HybridizationType.SP2, HybridizationType.SP3]
featurizer = MultiHotAtomFeaturizer(
    atomic_nums, degrees, formal_charges, chiral_tags, num_Hs, hybridizations
)
featurizer(atom_to_featurize)

array([0.     , 1.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 1.     , 0.     , 0.     , 0.     , 1.     ,
       0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
       0.     , 0.     , 0.     , 0.     , 1.     , 0.     , 0.     ,
       0.     , 0.     , 1.     , 0.     , 0.     , 0.12011])

### Generic

Any class whose instances have a length (`__len__`) and return a NumPy array when called with an `rdkit.Chem.rdchem.Atom` can be used as an atom featurizer.

In [8]:
from rdkit.Chem.rdchem import Atom
import numpy as np


class MyAtomFeaturizer:
    def __len__(self):
        return 1

    def __call__(self, a: Atom):
        return np.array([a.GetAtomicNum()], dtype=float)


featurizer = MyAtomFeaturizer()
featurizer(atom_to_featurize)

array([6.])
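
Since the interface is only a length plus a call that returns an array, small featurizers can be combined into larger ones. The following is a minimal sketch of that idea. The names `AtomFeaturizerLike`, `AtomicNumFeaturizer`, `FormalChargeFeaturizer`, `ConcatAtomFeaturizer`, and `StandInAtom` are all hypothetical. `StandInAtom` is a stand-in that mimics the two RDKit `Atom` methods used here, so that the sketch runs without RDKit.

```python
from typing import Protocol, Sequence

import numpy as np


class AtomFeaturizerLike(Protocol):
    """Structural type for the interface described above (hypothetical name)."""

    def __len__(self) -> int: ...

    def __call__(self, a) -> np.ndarray: ...


class AtomicNumFeaturizer:
    def __len__(self):
        return 1

    def __call__(self, a):
        return np.array([a.GetAtomicNum()], dtype=float)


class FormalChargeFeaturizer:
    def __len__(self):
        return 1

    def __call__(self, a):
        return np.array([a.GetFormalCharge()], dtype=float)


class ConcatAtomFeaturizer:
    """Concatenate the outputs of several atom featurizers into one vector."""

    def __init__(self, featurizers: Sequence[AtomFeaturizerLike]):
        self.featurizers = list(featurizers)

    def __len__(self):
        # The combined length is the sum of the parts.
        return sum(len(f) for f in self.featurizers)

    def __call__(self, a):
        return np.concatenate([f(a) for f in self.featurizers])


class StandInAtom:
    """Minimal stand-in exposing the two RDKit `Atom` methods used in this sketch."""

    def GetAtomicNum(self):
        return 6

    def GetFormalCharge(self):
        return 0


combined = ConcatAtomFeaturizer([AtomicNumFeaturizer(), FormalChargeFeaturizer()])
vec = combined(StandInAtom())
print(len(combined), vec)
```

With RDKit available, calling `combined(atom_to_featurize)` on a real atom should work the same way, because the wrapper relies only on `__len__` and `__call__`.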