# SAR Tutorial

This tutorial shows some of the features of the SAR module. For installation of the tools, please refer to the main [README.md](../README.md). The same data set as for the tools demonstration is used.

The code is based on the [RDKit Cookbook](http://www.rdkit.org/docs/Cookbook.html) and the [paper](https://jcheminf.springeropen.com/articles/10.1186/1758-2946-5-43) on similarity maps by Sereina Riniker and Greg Landrum.

An example similarity map output is shown [here](./sim_map.html).

**Note:** Some features (such as the progress bar) are displayed correctly only when the notebook is actually run, not when it is merely rendered in NBviewer.

In [84]:
%reload_ext autoreload
%autoreload 2
def warn(*args, **kwargs):
    pass  # to suppress sklearn warnings

import warnings
warnings.warn = warn

from rdkit.Chem import AllChem as Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole

# The next two lines are purely cosmetic (label font and size). They can be safely removed.
Draw.DrawingOptions.atomLabelFontFace = "DejaVu Sans"
Draw.DrawingOptions.atomLabelFontSize = 18

from rdkit_ipynb_tools import tools, pipeline as p, sar

## Example Data Set
Endothelin Receptor A (ET-A) antagonists from [ChEMBL](https://www.ebi.ac.uk/chembl/), downloaded as a tab-separated file on 31-Aug-2016 and then gzipped.

### Preparation
Count the lines and display the first line (the header):

In [5]:
!zcat chembl_et-a_antagonists.txt.gz | wc -l
print()
!zcat chembl_et-a_antagonists.txt.gz | head -n 1

2324

CMPD_CHEMBLID	MOLREGNO	PARENT_CMPD_CHEMBLID	PARENT_MOLREGNO	MOL_PREF_NAME	COMPOUND_KEY	MOLWEIGHT	ALOGP	PSA	NUM_RO5_VIOLATIONS	CANONICAL_SMILES	ACTIVITY_ID	STANDARD_TYPE	RELATION	STANDARD_VALUE	STANDARD_UNITS	PCHEMBL_VALUE	ACTIVITY_COMMENT	DATA_VALIDITY_COMMENT	POTENTIAL_DUPLICATE	BAO_ENDPOINT	UO_UNITS	QUDT_UNITS	ASSAY_ID	ASSAY_CHEMBLID	ASSAY_TYPE	DESCRIPTION	ASSAY_SRC_ID	ASSAY_SRC_DESCRIPTION	ASSAY_ORGANISM	ASSAY_STRAIN	ASSAY_TAX_ID	CURATED_BY	BAO_FORMAT	TID	TARGET_CHEMBLID	TARGET_TYPE	PROTEIN_ACCESSION	PREF_NAME	ORGANISM	CONFIDENCE_SCORE	TARGET_MAPPING	APD_NAME	APD_CONFIDENCE	DOC_ID	DOC_CHEMBLID	PUBMED_ID	JOURNAL	YEAR	VOLUME	ISSUE	FIRST_PAGE	CELL_ID	CELL_CHEMBL_ID	CELL_NAME

gzip: stdout: Broken pipe


We have 2323 records (2324 lines minus the header line) and a large number of fields, of which we will need only a few.

### Curating the Data Set with Pipelines
Pipelines are the part of the tools that handles arbitrarily large data sets in a compound-aware way. They are built on Python generators, so records can be passed from one step to the next as a stream.<br>
We will now use a pipeline to curate the data set for our needs:
1. Read in the data set *(here directly as gzipped file, reading from multiple files is also possible)*
1. Transform the IC50 into a pIC50 *(personal pet peeve, ask me about it ;-) )*
1. Keep only the fields that we are interested in
1. Rename a field
1. Generate the structures from SMILES
1. Calculate some physicochemical properties
1. Finally, write everything to an SD file
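The generator idea behind these steps can be sketched in plain Python. This is only an illustration of the chaining pattern, not the implementation used by the pipeline module: the function names `read_records`, `keep_props`, `rename_prop` and `run_pipeline` are hypothetical, and the two sample rows are made up.

```python
import csv
import io

# Made-up sample data with the same column names as the ChEMBL export.
SAMPLE = (
    "CMPD_CHEMBLID\tCANONICAL_SMILES\tSTANDARD_VALUE\n"
    "EXAMPLE_1\tCCO\t10\n"
    "EXAMPLE_2\tc1ccccc1\t250\n"
)


def read_records(handle):
    """Yield one dict per row of a tab-separated file (hypothetical helper)."""
    yield from csv.DictReader(handle, delimiter="\t")


def keep_props(records, props):
    """Pass each record on with only the requested fields."""
    for rec in records:
        yield {key: val for key, val in rec.items() if key in props}


def rename_prop(records, old, new):
    """Rename a single field in each record that has it."""
    for rec in records:
        if old in rec:
            rec[new] = rec.pop(old)
        yield rec


def run_pipeline(handle):
    """Chain the steps; nothing is read until the result is iterated."""
    stream = read_records(handle)
    stream = keep_props(stream, ["CMPD_CHEMBLID", "CANONICAL_SMILES"])
    stream = rename_prop(stream, "CMPD_CHEMBLID", "Chembl_Id")
    return stream


result = list(run_pipeline(io.StringIO(SAMPLE)))
print(result)
```

Because every step is a generator, each record travels through all steps before the next one is read, so memory use does not grow with the size of the input file.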

In [64]:
s = p.Summary()  # optional, used for logging what the individual components do

# code for IC50 --> pIC50 conversion
run_code = """
if "STANDARD_VALUE" in rec:
    rec["ETA_pIC50"] = tools.pic50(rec["STANDARD_VALUE"], "nM")"""  

# define the start of the pipeline, can work directly with gzipped files
rd = p.start_csv_reader("chembl_et-a_antagonists.txt.gz", summary=s)

et_a_list = p.pipe(rd,
             (p.pipe_custom_man, run_code),
             (p.pipe_keep_props, ["CMPD_CHEMBLID", "CANONICAL_SMILES", "ETA_pIC50"]),
             (p.pipe_rename_prop, "CMPD_CHEMBLID", "Chembl_Id"),
             (p.pipe_mol_from_smiles, "CANONICAL_SMILES"),
             (p.pipe_calc_props, ["2d", "LogP"]),
             (p.stop_mol_list_from_stream, {"max": 3000, "summary": s})
            )
s.update(True)

**Pipeline finished.**

| Component | # Records |
|---|---|
| start_csv_reader | 2323 |
| stop_mol_list_from_stream | 2292 |
| Time elapsed | 00h 00m 5.81s |
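The `run_code` snippet in the pipeline calls `tools.pic50(value, "nM")` to convert an IC50 into a pIC50. The internals of that function are not shown here, so the following is a minimal sketch of the standard definition, pIC50 = -log10(IC50 in mol/L). The name `pic50_sketch`, the unit table and the rounding to three decimals are assumptions made for illustration.

```python
import math

# Conversion factors to mol/L (assumed unit spellings, for illustration only).
UNIT_FACTORS = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def pic50_sketch(value, unit="nM"):
    """Return -log10 of the IC50 expressed in mol/L, rounded to 3 decimals.

    `value` may be a number or a numeric string, as read from a CSV file.
    Empty or non-numeric values raise ValueError and are not handled here.
    """
    molar = float(value) * UNIT_FACTORS[unit]
    return round(-math.log10(molar), 3)


print(pic50_sketch("1", "nM"))    # 9.0
print(pic50_sketch(100, "nM"))    # 7.0
print(pic50_sketch(1, "uM"))      # 6.0
```

On this scale a higher value means a more potent compound, and a pIC50 of 7.0 corresponds to an IC50 of 100 nM.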


## Splitting the Data Set and Defining Activity Classes

In [73]:
mol_list = et_a_list.has_prop_filter("ETA_pIC50").sample(750)

> processed:    2292   found:   1453


In [74]:
mol_list.summary()

Summary (750 records)

| Property | Type | Num Values | Min | Max | Mean | Median |
|---|---|---|---|---|---|---|
| Chembl_Id | str | 750 | | | | |
| LogP | number | 750 | -4.721 | 8.311 | 4.048 | 4.169 |
| ETA_pIC50 | number | 750 | 3.000 | 10.959 | 7.041 | 7.069 |


In [75]:
for mol in mol_list:
    if float(mol.GetProp("ETA_pIC50")) >= 7.0:
        mol.SetProp("ActClass", "1")
    else:
        mol.SetProp("ActClass", "0")

In [76]:
train_list, test_list = mol_list.split(0.25)
print(len(train_list), len(test_list))

185 565
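The first list holds roughly, but not exactly, a quarter of the 750 records. One way such a ratio-based split can behave like this is to assign each record independently at random. The sketch below illustrates that idea; it is an assumption about the behaviour, not the actual implementation of `split`, and `split_records` and its `seed` parameter are hypothetical.

```python
import random


def split_records(records, ratio, seed=None):
    """Randomly distribute records over two lists.

    Each record lands in the first list with probability `ratio`,
    so the list sizes only approximate the requested ratio.
    """
    rng = random.Random(seed)
    first, second = [], []
    for rec in records:
        if rng.random() < ratio:
            first.append(rec)
        else:
            second.append(rec)
    return first, second


records = list(range(750))  # stand-ins for the 750 molecules
train, test = split_records(records, 0.25, seed=42)
print(len(train), len(test))
```

With a fixed seed the split is reproducible, which is convenient when comparing models trained on the same training set.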


## Training the Model

In [77]:
model = sar.train(train_list)

## Prediction
The predicted activity class is stored in the property `ActClass_Pred`, the corresponding probability in `ActClass_Prob`.

In [78]:
test_list = sar.SAR_List(test_list)
test_list.order = ["Chembl_Id", "ETA_pIC50", "ActClass", "LogP"]

test_list.model = model

In [79]:
test_list.predict()

In [80]:
_ = test_list.analyze()

Number of correctly predicted molecules: 435 / 565    (76.99%)

Correctly predicted molecules per Activity Class:
  0:  66.67
  1:  86.85
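The figures above can be reproduced from pairs of actual and predicted classes. The sketch below assumes that the per-class value is the percentage of molecules of each *actual* class that were predicted correctly; the function name `accuracy_report` and the four sample pairs are made up for illustration and are not part of the SAR module.

```python
from collections import Counter


def accuracy_report(pairs):
    """Return overall and per-class accuracy in percent.

    `pairs` is an iterable of (actual_class, predicted_class) tuples.
    """
    total = Counter()
    correct = Counter()
    for actual, predicted in pairs:
        total[actual] += 1
        if actual == predicted:
            correct[actual] += 1
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    per_class = {cls: 100.0 * correct[cls] / total[cls] for cls in sorted(total)}
    return overall, per_class


# Made-up example: two molecules per class, one active mispredicted.
pairs = [("1", "1"), ("1", "0"), ("0", "0"), ("0", "0")]
overall, per_class = accuracy_report(pairs)
print(overall, per_class)

# The overall figure reported above is the same arithmetic: 435 of 565 correct.
reported = round(100.0 * 435 / 565, 2)
print(reported)  # 76.99
```

Looking at the classes separately matters here because the overall number hides the fact that actives (class 1) are recognised considerably better than inactives (class 0).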


## Display as Similarity Maps

In [94]:
sample_list = test_list[:10]
sample_list.sort_list("ETA_pIC50")
sample_list.sim_map()


| # | Molecule | SimMap | Chembl_Id | ETA_pIC50 | ActClass_Prob | ActClass_Pred | ActClass | LogP |
|---|---|---|---|---|---|---|---|---|
| 0 | | | CHEMBL445101 | 9.013 | 0.75 | 1 | 1 | 4.541 |
| 1 | | | CHEMBL2163693 | 8.658 | 0.83 | 1 | 1 | 4.068 |
| 2 | | | CHEMBL91873 | 8.602 | 0.83 | 1 | 1 | 4.904 |
| 3 | | | CHEMBL307330 | 8.553 | 0.98 | 1 | 1 | 3.78 |
| 4 | | | CHEMBL273660 | 8.0 | 0.69 | 1 | 1 | 0.066 |
| 5 | | | CHEMBL266594 | 7.672 | 0.55 | 1 | 1 | 3.134 |
| 6 | | | CHEMBL10374 | 7.301 | 0.92 | 1 | 1 | 4.131 |
| 7 | | | CHEMBL274758 | 7.215 | 0.86 | 1 | 1 | 4.539 |
| 8 | | | CHEMBL150719 | 6.886 | 0.92 | 0 | 0 | 1.754 |


In [96]:
sample_list.write_sim_map()

Using cached HTML content...
Set property `html` to `None` to re-generate.
