In [3]:
from docking_benchmark.data.proteins import Protein, get_proteins

# Protein

The `Protein` class is a container for protein-related data. It also provides an easy-to-use interface that abstracts away file handling.

Some proteins, namely 5HT1B, 5HT2B and ACM2, are already provided in the benchmark. For your experiments, however, you may want to use your own proteins. This notebook explains how protein data must be structured so that `Protein` instances can be created from it, and how to use those instances.

## Data structure

### Protein

The class's `__init__` method requires a path to a directory containing the protein data. The most important file in that directory is the protein itself in `.pdbqt` format. **The directory must contain exactly one file with this extension**; otherwise the class will fail to initialize.
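The one-file rule can be checked before constructing a `Protein`. The sketch below is not part of the benchmark: `find_pdbqt` is a hypothetical helper that only mirrors the constraint described above, and the real class may report the problem differently.

```python
from pathlib import Path


def find_pdbqt(directory):
    """Return the single .pdbqt file in `directory` (hypothetical helper).

    Raises ValueError when the directory holds zero or several such files,
    which is the situation in which `Protein` is described as failing.
    """
    candidates = sorted(Path(directory).glob('*.pdbqt'))
    if len(candidates) != 1:
        raise ValueError(
            f'expected exactly one .pdbqt file in {directory}, '
            f'found {len(candidates)}'
        )
    return candidates[0]
```

Calling `find_pdbqt` on your own protein directory before passing it to `Protein` gives an explicit error message if a stray second `.pdbqt` file is present.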

### Metadata

Apart from the protein file, `__init__` requires a `metadata.json` file to be present in the given directory. `metadata.json` describes various properties of the protein and should contain a single JSON object.

Only one field is required: `pocket_center`, which must be a list of three floating-point numbers. Without it docking would be impossible, because we would not know where the molecule should be docked.

One more field is currently supported: `datasets`. It maps dataset names to dataset objects describing the datasets available for the protein. Each dataset object must provide `path`, the path to a `.csv` dataset file relative to the protein directory. It may also specify `smiles_column`, the name of the column containing the SMILES representation of each molecule, and `score_column`, the name of the column with the score that the model is supposed to optimize.

See the example `metadata.json` file below for details.

```json
{
  "pocket_center": [
    -16.210,
    -15.874,
    5.523
  ],
  "datasets": {
    "default": {
      "path": "datasets/default.csv"
    },
    "second": {
      "path": "datasets/second.csv",
      "smiles_column": "smi"
    },
    "third": {
      "path": "datasets/third.csv",
      "score_column": "score"
    }
  }
}
```
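A `metadata.json` file can be sanity-checked against the rules above before it is handed to `Protein`. This is a minimal sketch, not the benchmark's own validation: `check_metadata` is a hypothetical helper, and accepting integers as coordinates is a choice made here, not something the source specifies.

```python
import json


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_metadata(text):
    """Parse metadata.json content and check its fields (hypothetical helper)."""
    metadata = json.loads(text)
    if not isinstance(metadata, dict):
        raise ValueError('metadata.json must contain a single JSON object')

    # `pocket_center` is the only required field: three numbers.
    center = metadata.get('pocket_center')
    if not (isinstance(center, list) and len(center) == 3
            and all(_is_number(v) for v in center)):
        raise ValueError('pocket_center must be a list of three numbers')

    # `datasets` is optional, but every entry needs at least a `path`.
    for name, entry in metadata.get('datasets', {}).items():
        if 'path' not in entry:
            raise ValueError(f'dataset {name!r} has no path')
    return metadata


example = '''
{
  "pocket_center": [-16.210, -15.874, 5.523],
  "datasets": {"default": {"path": "datasets/default.csv"}}
}
'''
metadata = check_metadata(example)
print(metadata['pocket_center'])
```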

## Usage

To demonstrate the usage of the `Protein` class, we will use one of the built-in proteins.

In [4]:
protein = get_proteins()['5ht1b']

If we want, we may access the protein's path and metadata directly.

In [5]:
protein.directory

'/home/tobi/edu/magisterka/smina-docking-benchmark//data/proteins_data/5ht1b'

In [6]:
protein.metadata

{'datasets': {'default': {'path': 'datasets/sabina.csv'},
  'sabina_gauss1': {'path': 'datasets/sabina_actives_inactives_decomposed.csv',
   'score_column': 'gauss(o=0__w=0.5__c=8)'},
  'sabina_gauss2': {'path': 'datasets/sabina_actives_inactives_decomposed.csv',
   'score_column': 'gauss(o=3__w=2__c=8)'},
  'sabina_hydrophobic': {'path': 'datasets/sabina_actives_inactives_decomposed.csv',
   'score_column': 'hydrophobic(g=0.5__b=1.5__c=8)'},
  'sabina_non_dir_h_bond': {'path': 'datasets/sabina_actives_inactives_decomposed.csv',
   'score_column': 'non_dir_h_bond(g=-0.7__b=0__c=8)'},
  'sabina_repulsion': {'path': 'datasets/sabina_actives_inactives_decomposed.csv',
   'score_column': 'repulsion(o=0__c=8)'},
  'sabina_physics_plain_gauss': {'path': 'datasets/sabina_different_physics.csv',
   'score_column': 'plain_gauss_min'},
  'sabina_physics_plain_vdw': {'path': 'datasets/sabina_different_physics.csv',
   'score_column': 'plain_vdw_min'},
  'sabina_physics_medium_gauss': {'path': 'da

However, for most use cases we will probably want to use the `dock_smiles_to_protein` method, which in its basic form takes only the SMILES string to be docked and returns its docking score together with all the components calculated by SMINA.

In [7]:
protein.dock_smiles_to_protein('CCCOc1ccc2[nH]cc(CCN)c2c1')

{'intramolecular_energy': -0.31534,
 'docking_score': -7.34047,
 'gauss(o=0__w=0.5__c=8)': 75.73586,
 'gauss(o=3__w=2__c=8)': 980.5655,
 'repulsion(o=0__c=8)': 1.16258,
 'hydrophobic(g=0.5__b=1.5__c=8)': 42.38931,
 'non_dir_h_bond(g=-0.7__b=0__c=8)': 2.08701,
 'num_tors_div': 0.0}

# Datasets

The `Protein` class also provides a `datasets` field, which exposes the datasets listed in the `metadata.json` file.

In [10]:
datasets = protein.datasets

We may retrieve datasets by using names defined in `metadata.json`.

In [12]:
smiles, scores = datasets['default']

In [13]:
smiles[:5]

['O=C1CN(Cc2ccccc2)CCN1CCCc3c[nH]c4ccc(cc34)n5cnnc5',
 'CN(C)CCO\\N=C/1\\c2cccn2c3c(C)csc13',
 'CCC(CC(=O)c1ccc(Cl)c(Cl)c1)N2CCCC2',
 'CCC(CC(=O)c1ccc(C)cc1)N2CCCC2',
 'Fc1ccc(cc1)C(=O)C2CCN(CCN3C(=O)Nc4ccccc4C3=O)CC2']

In [14]:
scores[:5]

array([-10.1,  -7. ,  -7.6,  -7.5, -10.4])

If we don't want to use the score predefined in `metadata.json`, we can use the `with_linear_combination_score` method. It allows us to retrieve the dataset scored with any linear combination of columns in the `.csv` file.

In [16]:
components = {
    'gauss(o=0__w=0.5__c=8)': 0.5,
    'hydrophobic(g=0.5__b=1.5__c=8)': 0.7
}
smiles, scores = datasets.with_linear_combination_score('sabina_gauss1', **components)

In [17]:
smiles[:5]

0    Cn1nccc1c2ccc3c(c2)c(cn3c4ccc(F)cc4)C5CCN(CCN6...
1                                 CCNCCOc1cccc2ccccc12
2    C(CN1CCC(CNc2ccccn2)CC1)Cc3c[nH]c4ccc(cc34)n5c...
3    OC1(CNCc2ccccn2)CCN(CCCc3c[nH]c4ccc(cc34)n5cnn...
4    Cc1ccc2c(cccc2n1)N3CCN(CCc4cccc(NC(=O)c5cccc(F...
Name: SMILES, dtype: object

In [18]:
scores[:5]

array([103.629411,  72.859272,  88.903178,  87.801484, 125.000313])
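Conceptually, a linear combination score is a weighted sum of the selected columns for each row. The sketch below illustrates only that arithmetic on a plain dictionary; `linear_combination` is a hypothetical function, and the benchmark's actual implementation may differ. The row reuses two component values from the docking output shown earlier in this notebook, so its result is unrelated to the dataset scores printed above.

```python
def linear_combination(row, weights):
    """Weighted sum of the entries of `row` named in `weights` (hypothetical)."""
    return sum(weight * row[column] for column, weight in weights.items())


# Component values copied from the `dock_smiles_to_protein` output above.
row = {
    'gauss(o=0__w=0.5__c=8)': 75.73586,
    'hydrophobic(g=0.5__b=1.5__c=8)': 42.38931,
}
weights = {
    'gauss(o=0__w=0.5__c=8)': 0.5,
    'hydrophobic(g=0.5__b=1.5__c=8)': 0.7,
}

# 0.5 * gauss component + 0.7 * hydrophobic component
score = linear_combination(row, weights)
print(score)
```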

If column names contain characters that are not allowed in keyword argument names, as in the `with_linear_combination_score` call above, put the weights in a dictionary and unpack it with `**`.
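The reason unpacking works is that Python only requires keyword names to be valid identifiers when they are written literally in the call; keys supplied through `**` merely have to be strings. The sketch below shows this with a stand-in function, `collect`, which is hypothetical and simply returns the keyword arguments it receives.

```python
def collect(**kwargs):
    """Return the received keyword arguments unchanged (stand-in function)."""
    return kwargs


# A column name that is a valid identifier can be passed directly.
simple = collect(plain_gauss_min=1.0)

# This name contains parentheses, '=' and '.', so writing it literally as a
# keyword would be a syntax error. Unpacking a dictionary avoids the problem.
unpacked = collect(**{'gauss(o=0__w=0.5__c=8)': 0.5})

print(simple)
print(unpacked)
```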