In [1]:
from docking_benchmark.data.results import OptimizedMolecules



# Optimized Molecules

`OptimizedMolecules` is an immutable wrapper around a pandas DataFrame that serves as a container for experiment results. It also provides some commonly used statistics and methods for serialization. Instances are created with `OptimizedMolecules.Builder`. The example below walks through building an `OptimizedMolecules` object with the builder and then using its methods.

In [2]:
builder = OptimizedMolecules.Builder()

New molecules are added to the container with the `append` method. The first parameter is the molecule's SMILES string; the remaining keyword arguments are the features to store for it. The method returns `True` if the molecule was appended successfully.

In [3]:
builder.append('C', docking_score=-7.2, gauss=23)
builder.append('Cl', docking_score=-2.3, gauss=123)

True

Molecules in the builder must be unique. If you try to add a molecule that is already present, `append` returns `False` and **does not override the existing entry**.

In [4]:
# Molecule already in the builder
builder.append('C')

False
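
The uniqueness rule can be pictured with a small stand-in class. This is only a sketch of the behaviour described above, not the library's implementation: `SketchBuilder` and its internal dictionary are hypothetical names introduced for illustration, and the sketch compares raw SMILES strings, which may or may not be what the real builder does.

```python
class SketchBuilder:
    """Hypothetical stand-in that mimics the append semantics described above."""

    def __init__(self):
        # Maps SMILES -> dict of features; dict keys are unique by construction.
        self._entries = {}

    def append(self, smiles, **features):
        # Refuse to overwrite: the first entry for a SMILES string wins.
        if smiles in self._entries:
            return False
        self._entries[smiles] = features
        return True

    @property
    def size(self):
        return len(self._entries)


sketch = SketchBuilder()
first = sketch.append('C', docking_score=-7.2, gauss=23)
second = sketch.append('C')  # duplicate SMILES, rejected
```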

If you want to add features whose names are not legal Python keyword argument names, you can use kwargs unpacking.

In [5]:
properties = { '1-illegal-keyword': 1, 'docking_score': -5.2, 'gauss': 356, 'another_feature': [1, 2, 3] }
builder.append('N', **properties)

True
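
This works because keyword names only have to be valid identifiers when they are written literally in the call; keys supplied through `**` unpacking merely have to be strings. A minimal illustration, using a hypothetical `collect` function in place of `append`:

```python
def collect(smiles, **features):
    # Hypothetical helper: returns whatever keyword arguments it received.
    return features


properties = {'1-illegal-keyword': 1, 'docking_score': -5.2}

# Writing collect('N', 1-illegal-keyword=1) would be a SyntaxError,
# but unpacking a dict with the same key is accepted.
received = collect('N', **properties)
```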

As you can see, molecules are not required to share the same set of properties; however, some metrics cannot be calculated if NaN entries are present. Features may be of any type that fits in a pandas DataFrame.

You can check how many molecules are stored in the builder through its `size` property.

In [6]:
builder.size

3

You may also store the number of valid samples in the `total_samples` property. If `total_samples` is greater than 0, the proportion of valid samples is calculated automatically when `OptimizedMolecules` is built.

In [7]:
builder.total_samples = 2

When you are finished generating molecules, you can create the `OptimizedMolecules` object with the builder's `build` method. Afterwards, all the saved data is available as a pandas DataFrame through the `molecules` property.

In [8]:
result = builder.build()
result.molecules

|  | docking_score | gauss | 1-illegal-keyword | another_feature |
|---|---|---|---|---|
| C | -7.2 | 23 | NaN | NaN |
| Cl | -2.3 | 123 | NaN | NaN |
| N | -5.2 | 356 | 1.0 | [1, 2, 3] |


We will now go through the metrics available in `OptimizedMolecules`. The first is the root mean squared error (RMSE) calculated between two columns of the DataFrame.

In [9]:
result.rmse('docking_score', 'gauss')

221.41783276571618
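
For reference, the value above can be reproduced by hand with the usual definition, $\sqrt{\frac{1}{n}\sum_i (a_i - b_i)^2}$. The sketch below assumes `rmse` follows that definition; `rmse_sketch` is a hypothetical name, and the numbers are the `docking_score` and `gauss` values appended earlier.

```python
import math


def rmse_sketch(a, b):
    # Root of the mean squared difference between paired values.
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    return math.sqrt(sum(squared) / len(squared))


docking_score = [-7.2, -2.3, -5.2]
gauss = [23, 123, 356]

value = rmse_sketch(docking_score, gauss)  # approximately 221.418
```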

This metric is not calculated if the compared columns contain NaNs; a `ValueError` is raised instead.

In [10]:
try:
    result.rmse('docking_score', '1-illegal-keyword')
except ValueError as e:
    print(e)

Selected columns contain NaNs


The second is internal diversity, defined as the mean of the Tanimoto diversity values calculated pairwise between all molecules.

In [11]:
result.internal_diversity()

1.0
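
To make the definition concrete, here is a sketch that treats each fingerprint as a plain Python set of feature identifiers. It assumes that Tanimoto diversity means one minus Tanimoto similarity and that the mean runs over unordered pairs of distinct molecules; the fingerprints are made-up stand-ins, not what the library computes.

```python
from itertools import combinations


def tanimoto_similarity(a, b):
    # |A ∩ B| / |A ∪ B|; two empty sets are treated as identical.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def internal_diversity_sketch(fingerprints):
    # Mean of (1 - similarity) over all unordered pairs of distinct items.
    pairs = list(combinations(fingerprints, 2))
    return sum(1.0 - tanimoto_similarity(a, b) for a, b in pairs) / len(pairs)


# Made-up fingerprints with no shared features, so every pair is maximally diverse.
disjoint = [{1}, {2}, {3}]
diversity = internal_diversity_sketch(disjoint)
```

With fewer than two fingerprints there are no pairs, so this sketch would divide by zero; how the library handles that case is not shown here.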

We can retrieve the first $n$ results, sorted by a given column.

In [12]:
result.get_first_n(2, by_column='docking_score')

|  | docking_score | gauss | 1-illegal-keyword | another_feature |
|---|---|---|---|---|
| C | -7.2 | 23 | NaN | NaN |
| N | -5.2 | 356 | 1.0 | [1, 2, 3] |


By default, results are sorted in ascending order. You can change this by setting the `sort_ascending` keyword argument to `False`.
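
Conceptually, the selection is a sort followed by taking the first $n$ rows. The sketch below mirrors that on a list of dictionaries rather than a DataFrame; `first_n_sketch` is a hypothetical helper, and only the `by_column` and `sort_ascending` names are taken from the text.

```python
rows = [
    {'smiles': 'C', 'docking_score': -7.2},
    {'smiles': 'Cl', 'docking_score': -2.3},
    {'smiles': 'N', 'docking_score': -5.2},
]


def first_n_sketch(rows, n, by_column, sort_ascending=True):
    # Sort by the chosen column, then keep the first n rows.
    ordered = sorted(rows, key=lambda row: row[by_column],
                     reverse=not sort_ascending)
    return ordered[:n]


lowest = [r['smiles'] for r in first_n_sketch(rows, 2, 'docking_score')]
highest = [r['smiles'] for r in first_n_sketch(rows, 2, 'docking_score',
                                               sort_ascending=False)]
```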

A method similar to `get_first_n` is `get_first_fraction`; as the name suggests, it returns a fraction of the top results rather than a fixed number.

In [13]:
result.get_first_fraction(0.2, by_column='docking_score')

|  | docking_score | gauss | 1-illegal-keyword | another_feature |
|---|---|---|---|---|
| C | -7.2 | 23 | NaN | NaN |
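
In the example above, a fraction of 0.2 applied to three molecules yields one row. One rounding rule consistent with that output is to round the row count up. This is an assumption: the rule is not documented here, and other rules, such as rounding to the nearest integer, would give the same result in this case. `rows_for_fraction` is a hypothetical helper.

```python
import math


def rows_for_fraction(total_rows, fraction):
    # Assumed rule: round up, so any positive fraction keeps at least one row.
    return math.ceil(total_rows * fraction)


count = rows_for_fraction(3, 0.2)  # 0.6 rounded up
```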


You can also find, for each molecule, the most similar molecule from a provided set of SMILES strings according to Tanimoto similarity.

In [14]:
result.most_similar_tanimoto(['F', 'Cl'])

|  | tanimoto_similarity | most_similar_smiles |
|---|---|---|
| C | 0.0 | F |
| Cl | 1.0 | Cl |
| N | 0.0 | F |
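
The lookup can be sketched with sets standing in for fingerprints. The `fingerprints` table is invented for illustration, and ties are resolved in favour of the first reference, which matches the `F` entries in the output above but is an assumption about the library.

```python
def tanimoto_similarity(a, b):
    # |A ∩ B| / |A ∪ B|; two empty sets are treated as identical.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Invented stand-in fingerprints, keyed by SMILES.
fingerprints = {'C': {1}, 'Cl': {2}, 'N': {3}, 'F': {4}}


def most_similar_sketch(smiles, references):
    # max() returns the first maximal element, so earlier references win ties.
    best = max(references,
               key=lambda ref: tanimoto_similarity(fingerprints[smiles],
                                                   fingerprints[ref]))
    return tanimoto_similarity(fingerprints[smiles], fingerprints[best]), best


matches = {s: most_similar_sketch(s, ['F', 'Cl']) for s in ['C', 'Cl', 'N']}
```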


Finally, you can use the `save` method to pickle your results and later load them with the static method `OptimizedMolecules.load`. If you prefer a text format over the binary representation, use the `to_csv` method instead.
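
As a rough picture of what a pickle round trip involves, the sketch below uses the standard `pickle` module directly on a plain dictionary. It does not call `save` or `load`, whose signatures are not shown here; the payload and the file name are arbitrary.

```python
import os
import pickle
import tempfile

payload = {'C': {'docking_score': -7.2}, 'N': {'docking_score': -5.2}}

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'results.pkl')

    # Serialize to a binary file...
    with open(path, 'wb') as handle:
        pickle.dump(payload, handle)

    # ...and read an equal copy of the object back.
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)
```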