# Standardization Methods
To build equivalence classes of compounds, six standardization methods are applied to each compound repeatedly, until further application no longer changes it. Prior to standardization, the SMILES of all compounds are normalized by the same routine.
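The "apply until nothing changes" loop can be pictured as a fixed-point iteration. The following is only a minimal sketch of that idea, not the code of the Java library: the function name `apply_until_stable`, the `max_rounds` guard and the two stand-in methods are assumptions made for illustration, and it assumes that the methods are applied one after another within each pass.

```python
def apply_until_stable(smiles, methods, max_rounds=100):
    """Apply every method in turn and repeat until a full pass changes nothing."""
    for _ in range(max_rounds):
        previous = smiles
        for method in methods:
            smiles = method(smiles)
        if smiles == previous:
            # A complete pass had no effect: the fixed point is reached
            return smiles
    raise RuntimeError("standardization did not converge")


# Hypothetical stand-ins for two string-based methods
def strip_at(smiles):
    return smiles.replace("@", "")


def strip_slashes(smiles):
    return smiles.replace("/", "").replace("\\", "")


result = apply_until_stable("C[C@H](N)/C=C/C", [strip_at, strip_slashes])
print(result)
```

The `max_rounds` guard is there because a set of rewrite rules could in principle oscillate between two forms; whether the real routine needs such a guard is not stated here.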

### Implementation
The standardization routine is implemented in the Java library PPS-tool-box-0.3.0.jar.

In [1]:
import yaml
with open("config.yaml", 'r') as stream:
    config = yaml.safe_load(stream)
DATA = config['datadir']['stds']
BIN = config['binaries']

In [2]:
import subprocess
STANDARDIZE = f'{BIN}/standardize'

command = f'{STANDARDIZE} -h'
ran = subprocess.run(command.split(), capture_output=True)
print(ran.stdout.decode())

Usage: ../bin/standardize [-f fields] [-o outfile] [-s standardizer] compounds
where
- compounds: tab separated file by default id and smiles as columns

- fields: comma separated column number list,
  e.g. '-f 1,3' for normalization of the first and third column
  if omitted the second column will be normalized

- outfile: a copy of compounds having normalized smiles 
  problematic rows are skipped (non normalizable smiles, too few columns)
  if omitted 'normalized_compounds.csv' is used

- standardizer: for serious standardization 
  can be a single smirks string or a key word:
  - 'basic' -> default standardization
  if omitted the compounds will be just normalized
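Based on the usage text above, a call with all three options could be assembled as in the following sketch. The helper `build_standardize_command` and the file names `compounds.tsv` and `standardized.tsv` are hypothetical; only the flags `-f`, `-o` and `-s` are taken from the help output.

```python
import shlex


def build_standardize_command(binary, compounds, fields=None, outfile=None, standardizer=None):
    """Assemble the argument list for the standardize wrapper script."""
    command = [binary]
    if fields:
        # Column numbers are passed as one comma separated list, e.g. '1,3'
        command += ["-f", ",".join(str(field) for field in fields)]
    if outfile:
        command += ["-o", outfile]
    if standardizer:
        # Either a key word such as 'basic' or a single SMIRKS string
        command += ["-s", standardizer]
    command.append(compounds)
    return command


command = build_standardize_command(
    "../bin/standardize",
    "compounds.tsv",          # hypothetical input file
    fields=[1, 3],
    outfile="standardized.tsv",  # hypothetical output file
    standardizer="basic",
)
print(shlex.join(command))
```

Keeping the command as a list and passing it directly to `subprocess.run` avoids splitting a string on whitespace, which matters once a SMIRKS string or a path with spaces is passed as an argument.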




### Normalization
The _normalization_ routine essentially removes coenzyme A (CoA) from the compound. This avoids a multitude of predicted reactions starting from CoA, which are of little interest for linking bt-rules to database reactions. Unlike the standardization methods listed below, _normalization_ is not used to build equivalence relationships between compounds but to generate a unique identifier for each compound in the first place.

In [3]:
from pandas import read_csv, set_option
from IPython.display import Image, HTML
from urllib.parse import quote

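def show_molecules(mol):
    # A list of SMILES/SMIRKS strings is joined into one dot-separated string
    if isinstance(mol, list):
        mol = ".".join(mol)
    # URL-encode the string so that characters such as '#' and '+' survive the query
    return f'<img src="https://o3pps.ethz.ch/o3pps-server/convert/?smiles={quote(mol)}"/>'

cut_coa_rules = read_csv(f'{DATA}/cut_coa_rules.csv')
# None disables column truncation (the former value -1 is deprecated since pandas 1.0)
set_option('display.max_colwidth', None)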
cut_coa_rules['img'] = cut_coa_rules.apply(lambda row: show_molecules(row['SMIRKS representation']), axis=1)
HTML(cut_coa_rules.to_html(escape=False))

Unnamed: 0,rule,SMIRKS representation,img
0,CutCoEnzymeAOff,CC(C)(COP(O)(=O)OP(O)(=O)OCC1OC(C(O)C1OP(O)(O)=O)n1cnc2c(N)ncnc12)C(O)C(=O)NCCC(=O)NCCS[$(*):1]>>[O-][$(*):1],


## The Six Standardization Methods
### 1. basic
The _basic_ standardization consists of a collection of SMIRKS-based rules. Each rule is applied to a given compound by ultimately running the ```ambit2.smarts.SMIRKSReaction``` on an ```org.openscience.cdk.interfaces.IAtomContainer``` built from the compound's SMILES representation.<br>
These are the individual rules:

In [4]:
basic_standardization_rules = read_csv(f'{DATA}/basic_standardization_rules.csv')
basic_standardization_rules['img'] = basic_standardization_rules.apply(lambda row: show_molecules(row['SMIRKS representation']), axis=1)
HTML(basic_standardization_rules.to_html(escape=False))

Unnamed: 0,rule,SMIRKS representation,img
0,ammoniumstandardization,[H][N+:1]([H])([H])[#6:2]>>[H][#7:1]([H])-[#6:2],
1,cyanate,[H][#8:1][C:2]#[N:3]>>[#8-:1][C:2]#[N:3],
2,deprotonatecarboxyls,[H][#8:1]-[#6:2]=[O:3]>>[#8-:1]-[#6:2]=[O:3],
3,forNOOH,[H][#8:1]-[#7+:2](-[*:3])=[O:4]>>[#8-:1]-[#7+:2](-[*:3])=[O:4],
4,Hydroxylprotonation,[#6;A:1][#6:2](-[#8-:3])=[#6;A:4]>>[#6:1]-[#6:2](-[#8:3][H])=[#6;A:4],
5,phosphatedeprotonation,[H][#8:1]-[$([#15]);!$(P([O-])):2]>>[#8-:1]-[#15:2],
6,PicricAcid,[H][#8:1]-[c:2]1[c:3][c:4][c:5]([c:6][c:7]1-[#7+:8](-[#8-:9])=[O:10])-[#7+:11](-[#8-:12])=[O:13]>>[#8-:1]-[c:2]1[c:3][c:4][c:5]([c:6][c:7]1-[#7+:8](-[#8-:9])=[O:10])-[#7+:11](-[#8-:12])=[O:13],
7,Sulfate1,[H][#8:1][S:2]([#8:3][H])(=[O:4])=[O:5]>>[#8-:1][S:2]([#8-:3])(=[O:4])=[O:5],
8,Sulfate2,[#6:1]-[#8:2][S:3]([#8:4][H])(=[O:5])=[O:6]>>[#6:1]-[#8:2][S:3]([#8-:4])(=[O:5])=[O:6],
9,Sulfate3,[H][#8:3][S:2]([#6:1])(=[O:4])=[O:5]>>[#6:1][S:2]([#8-:3])(=[O:4])=[O:5],


### 2. flatten
The _flatten_ standardization removes all '@' characters, which encode the stereo configuration at chiral centers, from the molecule's SMILES string and thus projects the molecule onto its 2-D representation.
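Taken literally, this description amounts to a plain string replacement. The sketch below illustrates it in Python; the function name `flatten` is chosen here for illustration, and the Java library may implement the step differently.

```python
def flatten(smiles: str) -> str:
    """Drop every '@' so that chiral centers lose their stereo annotation."""
    return smiles.replace("@", "")


# Both enantiomers of alanine collapse onto the same flattened string
l_alanine = "C[C@H](N)C(=O)O"
d_alanine = "C[C@@H](N)C(=O)O"
print(flatten(l_alanine))
print(flatten(l_alanine) == flatten(d_alanine))
```

After flattening, stereoisomers that differ only in their '@' annotations share one string, which is what places them in the same equivalence class.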

### 3. enhanced
The _enhanced_ standardization is rule-based, like _basic_. It consists of the _basic_ rules and an additional rule for complete phosphate deprotonation.

In [5]:
enhanced_standardization_rules = read_csv(f'{DATA}/enhanced_standardization_rules.csv')
enhanced_standardization_rules['img'] = enhanced_standardization_rules.apply(lambda row: show_molecules(row['SMIRKS representation']), axis=1)
HTML(enhanced_standardization_rules.to_html(escape=False))

Unnamed: 0,rule,SMIRKS representation,img
0,fullPhosphatedeprotonation,[H][#8:1]-[#15:2]>>[#8-:1]-[#15:2],


### 4. cistrans
The _cistrans_ standardization removes all cis-trans stereochemical information ('/' and '\\') from the molecule's SMILES string.
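As a minimal sketch of this step, assuming it is a plain character removal, the two bond direction symbols can be stripped as follows. The function name `remove_cistrans` is an assumption for illustration and is not taken from the library.

```python
def remove_cistrans(smiles: str) -> str:
    """Drop the '/' and '\\' bond direction symbols from a SMILES string."""
    return smiles.replace("/", "").replace("\\", "")


# trans- and cis-1,2-difluoroethene end up as the same string
trans_isomer = "F/C=C/F"
cis_isomer = "F/C=C\\F"  # the Python literal "\\" is a single backslash
print(remove_cistrans(trans_isomer))
print(remove_cistrans(trans_isomer) == remove_cistrans(cis_isomer))
```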

### 5. exotic
The _exotic_ standardization is again rule-based. It consists of the _enhanced_ rules and an additional rule for the deprotonation of thiophosphates.

In [6]:
exotic_standardization_rules = read_csv(f'{DATA}/exotic_standardization_rules.csv')
exotic_standardization_rules['img'] = exotic_standardization_rules.apply(lambda row: show_molecules(row['SMIRKS representation']), axis=1)
HTML(exotic_standardization_rules.to_html(escape=False))

Unnamed: 0,rule,SMIRKS representation,img
0,ThioPhosphate1,"[H][S:1]-[#15:2]=[$([#16]),$([#8]):3]>>[S-:1]-[#15:2]=[$([#16]),$([#8]):3]",
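The three rule-based sets build on each other: _enhanced_ extends _basic_, and _exotic_ extends _enhanced_. The sketch below spells out this layering using the rule names from the tables in this notebook. The variable names are hypothetical, and the order in which the library applies the rules is an assumption.

```python
# Rule names as listed in the tables of this notebook
BASIC_RULES = [
    "ammoniumstandardization", "cyanate", "deprotonatecarboxyls", "forNOOH",
    "Hydroxylprotonation", "phosphatedeprotonation", "PicricAcid",
    "Sulfate1", "Sulfate2", "Sulfate3",
]

# Each further set adds exactly one rule to the previous one
ENHANCED_RULES = BASIC_RULES + ["fullPhosphatedeprotonation"]
EXOTIC_RULES = ENHANCED_RULES + ["ThioPhosphate1"]

print(len(BASIC_RULES), len(ENHANCED_RULES), len(EXOTIC_RULES))
```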


### 6. enolketo
The _enolketo_ standardization is rule-based and consists of a single rule for transforming molecules from their enol to the corresponding keto form.

In [7]:
enolketo_standardization_rules = read_csv(f'{DATA}/enolketo_standardization_rules.csv')
enolketo_standardization_rules['img'] = enolketo_standardization_rules.apply(lambda row: show_molecules(row['SMIRKS representation']), axis=1)
HTML(enolketo_standardization_rules.to_html(escape=False))

Unnamed: 0,rule,SMIRKS representation,img
0,enol2Ketone,[H][#8:2]-[#6:3]=[#6:1]>>[#6:1]-[#6:3]=[O:2],
