# Rule-Enzyme Link

## Table of Contents <a class="anchor" id="toc"></a>
#### [Summary](#overview)
#### Data
* [Globals](#globals)
* [Reaction-Enzyme Links](#rel)
* [Rule-Reaction Links](#rra)

#### Processing
* [Rule-Enzyme Links](#join)
* [Create _enviPath_ Entries](#epapi)

## <a class="anchor" id="overview"></a>Summary [$\Uparrow$](#toc)
In the third and final step of the workflow, the reaction-enzyme links from the database are combined with the rule-reaction links from the [match](match-KEGG.ipynb) step to create rule-enzyme links. These are then uploaded to envipath.org using the envipath-api Python REST interface wrapper.

## <a class="anchor" id="globals"></a>Globals [$\Uparrow$](#toc)
#### The Configuration
Configuration data, such as the credentials for accessing the workflow database.

In [1]:
import yaml
with open("config.yaml", 'r') as stream:
    config = yaml.safe_load(stream)

#### Directories Used in the Workflow

In [2]:
DATASET = 'KEGG'
DATA = config['datadir'][DATASET]
TEMP = config['datadir']['temp']
BIN = config['binaries']
SRC = config['sources']

#### envipath Client
Used to upload the rule-enzyme links.

In [3]:
from envirest import EnviPathClient
from getpass import getpass
envipath = config['envipath']
client = EnviPathClient(
    envipath['host'],
    secure=envipath['secure'],
    verify=envipath['verified'],
    username=envipath['user'],
    password=envipath['password'])
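
The cells in this section read a handful of keys from `config.yaml`. As a minimal sketch, the check below lists those keys and reports any that are missing before the rest of the notebook runs. The helper `missing_keys` and the values in `example_config` are hypothetical and only serve as illustration; the key names are the ones used in the cells of this section.

```python
# Keys read from config.yaml in this section of the notebook.
REQUIRED_KEYS = [
    ("datadir", "KEGG"),
    ("datadir", "temp"),
    ("binaries",),
    ("sources",),
    ("envipath", "host"),
    ("envipath", "secure"),
    ("envipath", "verified"),
    ("envipath", "user"),
    ("envipath", "password"),
]


def missing_keys(config, required=REQUIRED_KEYS):
    """Return the dotted paths of required keys that are absent (hypothetical helper)."""
    missing = []
    for path in required:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# Placeholder values for illustration only; real values live in config.yaml.
example_config = {
    "datadir": {"KEGG": "data/KEGG", "temp": "tmp"},
    "binaries": "bin",
    "sources": "src",
    "envipath": {"host": "example.org", "secure": True, "verified": True,
                 "user": "someone"},  # 'password' deliberately left out
}
print(missing_keys(example_config))  # ['envipath.password']
```

Running such a check right after loading the configuration turns a late `KeyError` into an explicit list of what still needs to be filled in.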

## <a class="anchor" id="rel"></a>Reaction-Enzyme Links  [$\Uparrow$](#toc)
This is essential input data (see [KEGG data](KEGG%20data.ipynb)). It is read from a tab-separated text file with four columns: the reaction id, the EC number, the enzyme name, and the third-level EC number.

In [4]:
from pandas import read_csv
reaction_enzyme_links = read_csv(f'{DATA}/{DATASET}_reaction-enzymes.tsv', sep='\t', header=None, 
                                 names=['kegg_r', 'EC_number', 'EC_name', '3rd_lvl'])
reaction_enzyme_links.head()

Unnamed: 0,kegg_r,EC_number,EC_name,3rd_lvl
0,R00002,1.18.6.1,nitrogenase,1.18.6
1,R00004,3.6.1.1,inorganic diphosphatase,3.6.1
2,R00005,3.5.1.54,allophanate hydrolase,3.5.1
3,R00006,2.2.1.6,acetolactate synthase,2.2.1
4,R00008,4.1.3.17,4-hydroxy-4-methyl-2-oxoglutarate aldolase,4.1.3
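
In the rows shown above, the `3rd_lvl` value matches the first three fields of `EC_number`. The following sketch shows that relationship and the check for partially specified EC numbers (those ending in `.-`) that the workflow applies later. Both function names are hypothetical, and how the column was actually produced upstream is an assumption.

```python
def third_level(ec_number):
    """Return the first three fields of a four-field EC number."""
    fields = ec_number.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    return ".".join(fields[:3])


def is_qualified(ec_number):
    """True unless the last field is the '-' placeholder, e.g. '3.6.1.-'."""
    return not ec_number.endswith(".-")


print(third_level("1.18.6.1"))   # 1.18.6
print(is_qualified("3.6.1.1"))   # True
print(is_qualified("3.6.1.-"))   # False
```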


## <a class="anchor" id="rra"></a>Rule-Reaction Links [$\Uparrow$](#toc)

From the [_match_](match.ipynb) step of the workflow. Again a tab-separated text file, read here into four columns: `bt` (the rule id), `direction`, `reaction`, and `kegg_r` (the KEGG reaction id).

In [5]:
from pandas import read_csv
rule_reaction_links = read_csv(f'{DATA}/direct_links.tsv', sep="\t", dtype=str, header=0, names=['bt', 'direction', 'reaction', 'kegg_r'])
rule_reaction_links.head()

Unnamed: 0,bt,direction,reaction,kegg_r
0,3568,f,4,R00033
1,1196,f,5,R00072
2,4141,f,6,R00085
3,4141,f,7,R00086
4,4141,f,7,R10531


## <a class="anchor" id="join"></a>Rule-Enzyme Links [$\Uparrow$](#toc)

Both tables are joined on the KEGG reaction id to associate rules with enzymes; the reactions shared by a rule and an enzyme provide the evidence for the link between the two.<br>
These associations are then further condensed to links between rules and third-level EC numbers.

In [6]:
rules = read_csv(f'{TEMP}/envipath-rules.annotated.tsv', sep="\t")
rules['bt'] = [name.split('-')[-1] for name in rules.name]
rules['btc'] = ['-'.join(name.split('-')[:-1]) for name in rules.name]
rules.head()

Unnamed: 0,id,name,smirks,substrateFilter,productFilter,container,bt,btc
0,https://envipath.org/package/32de3cf4-e3e6-416...,bt0001-3568,"[H][#8:2][C:1]([H:5])([H])[#1,#6:6]>>[H:5][#6:...",[$([H]C(C)=O)],,https://envipath.org/package/32de3cf4-e3e6-416...,3568,bt0001
1,https://envipath.org/package/32de3cf4-e3e6-416...,bt0002-3673,[H][#8:1][#6;A;!$(CCC(O)[O-]):2]([H])([#6:5])[...,[$([O;H1]C1C=CC=CC1[O;H1])],,https://envipath.org/package/32de3cf4-e3e6-416...,3673,bt0002
2,https://envipath.org/package/32de3cf4-e3e6-416...,bt0003-1196,[H][#6:1](-[#6:5])=[O:4]>>[#6:5]-[#6:1](-[#8-]...,,,https://envipath.org/package/32de3cf4-e3e6-416...,1196,bt0003
3,https://envipath.org/package/32de3cf4-e3e6-416...,bt0005-3667,[#8:7]([H])-[#6:1]([H])-1-[#6:2]=[#6:3]-[#6:4]...,,,https://envipath.org/package/32de3cf4-e3e6-416...,3667,bt0005
4,https://envipath.org/package/32de3cf4-e3e6-416...,bt0005-3776,[c:7]([H])1[c:8]([H])[c:9]([H])[c:10]2[c:11]([...,,,https://envipath.org/package/32de3cf4-e3e6-416...,3776,bt0005


In [7]:
rule_enzyme_links = reaction_enzyme_links\
                    .merge(rule_reaction_links, on='kegg_r', how='left')\
                    .merge(rules, on='bt')\
                    .loc[:,['btc', 'bt', 'EC_number', 'EC_name', '3rd_lvl', 'reaction', 'kegg_r', 'container', 'id']]\
                    .rename(columns={'kegg_r': 'evidence', 'container':'rule', 'id':'simple_rule'})
rule_enzyme_links.to_csv(f'{DATA}/rule_enzyme_links.tsv', sep="\t", index=None)

qualified_rule_enzyme_links = rule_enzyme_links[
       ~rule_enzyme_links.EC_number.str.endswith('.-')
    ].loc[:,['btc', 'bt', 'EC_number', '3rd_lvl', 'EC_name', 'evidence', 'rule']]\
    .drop_duplicates()
qualified_rule_enzyme_links.to_csv(f'{DATA}/qualified_rule_enzyme_links.tsv', sep="\t", index=None)

print(qualified_rule_enzyme_links.nunique())
read_csv(f'{DATA}/qualified_rule_enzyme_links.tsv', sep="\t").tail()

btc           102
bt            121
EC_number    1077
3rd_lvl        81
EC_name      1077
evidence     1579
rule          102
dtype: int64


Unnamed: 0,btc,bt,EC_number,3rd_lvl,EC_name,evidence,rule
1881,bt0184,4187.0,1.13.11.64,1.13.11,5-nitrosalicylate dioxygenase,R10110,https://envipath.org/package/32de3cf4-e3e6-416...
1882,bt0184,4187.0,1.13.11.86,1.13.11,"5-aminosalicylate 1,2-dioxygenase",R12147,https://envipath.org/package/32de3cf4-e3e6-416...
1883,bt0401,3575.0,1.14.13.182,1.14.13,2-heptyl-3-hydroxy-4(1H)-quinolone synthase,R10467,https://envipath.org/package/32de3cf4-e3e6-416...
1884,bt0063,2552.0,1.14.13.239,1.14.13,carnitine monooxygenase,R11875,https://envipath.org/package/32de3cf4-e3e6-416...
1885,bt0063,2552.0,1.14.13.239,1.14.13,carnitine monooxygenase,R11911,https://envipath.org/package/32de3cf4-e3e6-416...
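
To make the effect of the two merges and the `.-` filter easier to follow, here is a minimal sketch on toy data. All ids, EC numbers, and rule names below are made up for illustration and do not come from KEGG or enviPath.

```python
import pandas as pd

# Toy stand-ins for the three tables used above (made-up values).
reaction_enzyme = pd.DataFrame({
    "kegg_r": ["R_a", "R_b", "R_c"],
    "EC_number": ["9.9.9.1", "9.9.9.2", "9.9.9.-"],
})
rule_reaction = pd.DataFrame({
    "bt": ["10", "10", "20"],
    "kegg_r": ["R_a", "R_c", "R_z"],
})
rules = pd.DataFrame({
    "bt": ["10", "20"],
    "btc": ["bt0001", "bt0002"],
})

links = (
    reaction_enzyme
    .merge(rule_reaction, on="kegg_r", how="left")  # R_b matches no rule, so its bt is NaN
    .merge(rules, on="bt")                          # inner join drops the row without a rule
)

# Keep only fully specified EC numbers, as in the cell above.
qualified = links[~links.EC_number.str.endswith(".-")]

print(sorted(links.kegg_r))      # ['R_a', 'R_c']
print(sorted(qualified.kegg_r))  # ['R_a']
```

Because the second merge is an inner join, reactions without a matching rule disappear even though the first merge keeps them; the combined result is the same as two inner joins.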


## <a class="anchor" id="epapi"></a>Create _enviPath_ Entries [$\Uparrow$](#toc)
The envipath client's `add_ec_number` method is used to upload the links to _envipath.org_.<br>
One link is uploaded for each combination of rule and EC number, with the supporting reactions submitted as a list of 'evidence'.

In [8]:
LINKING = 'KEGG 2020-10-21 enviLink'

def upload_links(row):
    try:
        bt = row['btc']
        ec_number = row['EC_number']
        name = row['EC_name']
        rule_url = row['rule']
        evidence = [f'<a href="https://www.genome.jp/dbget-bin/www_bget?rn:{e}">{e}</a>' for e in rule_enzyme_links[
            (rule_enzyme_links.btc == bt) &
            (rule_enzyme_links.EC_number == ec_number) &
            (rule_enzyme_links.EC_name == name)
        ].evidence]
        eclink = client.add_ec_number(rule_url, ec_number=ec_number, name=name.replace("'", "\\\'"), 
                                      linking_method=LINKING, evidence=list(evidence))
        return eclink.get('id')
    except Exception as e:
        # Report the failure and mark the row as not uploaded
        from sys import stderr
        stderr.write("{} - {}\n".format(e.__class__.__name__, e))
        from numpy import nan  # the NaN alias was removed in NumPy 2.0
        return nan
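
The function above filters `rule_enzyme_links` once per uploaded row to collect the evidence. As an alternative sketch, the evidence lists can be built in a single pass with `groupby` and then looked up by key. The toy table and the helper `evidence_link` are hypothetical; only the KEGG URL pattern is taken from the cell above.

```python
import pandas as pd


def evidence_link(reaction_id):
    """Format a reaction id as an HTML link to its KEGG entry (hypothetical helper)."""
    return f'<a href="https://www.genome.jp/dbget-bin/www_bget?rn:{reaction_id}">{reaction_id}</a>'


# Made-up rows with the same columns that upload_links filters on.
toy_links = pd.DataFrame({
    "btc": ["bt0001", "bt0001", "bt0002"],
    "EC_number": ["9.9.9.1", "9.9.9.1", "9.9.9.2"],
    "EC_name": ["toy enzyme one", "toy enzyme one", "toy enzyme two"],
    "evidence": ["R_a", "R_b", "R_c"],
})

# One list of reaction ids per (rule, EC number, EC name) combination.
grouped = (
    toy_links
    .groupby(["btc", "EC_number", "EC_name"])["evidence"]
    .apply(list)
    .to_dict()
)
evidence_by_link = {
    key: [evidence_link(r) for r in ids] for key, ids in grouped.items()
}

print(len(evidence_by_link[("bt0001", "9.9.9.1", "toy enzyme one")]))  # 2
```

For a table of this size the difference is unlikely to matter much; the grouped form mainly makes the "one link, many reactions" structure explicit.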

In [9]:
enviLink = qualified_rule_enzyme_links\
    .loc[:,['btc', 'EC_number', 'EC_name', 'rule']]\
    .drop_duplicates().copy()
enviLink['elink'] = None

uploadrows = enviLink[enviLink.elink.isnull()].index
enviLink.loc[uploadrows,'elink'] = enviLink.loc[uploadrows,:].apply(upload_links, axis=1)
enviLink.tail()

Unnamed: 0,btc,EC_number,EC_name,rule,elink
2234,bt0350,3.5.4.44,ectoine hydrolase,https://envipath.org/package/32de3cf4-e3e6-416...,https://envipath.org/package/32de3cf4-e3e6-416...
2235,bt0184,1.13.11.64,5-nitrosalicylate dioxygenase,https://envipath.org/package/32de3cf4-e3e6-416...,https://envipath.org/package/32de3cf4-e3e6-416...
2236,bt0184,1.13.11.86,"5-aminosalicylate 1,2-dioxygenase",https://envipath.org/package/32de3cf4-e3e6-416...,https://envipath.org/package/32de3cf4-e3e6-416...
2238,bt0401,1.14.13.182,2-heptyl-3-hydroxy-4(1H)-quinolone synthase,https://envipath.org/package/32de3cf4-e3e6-416...,https://envipath.org/package/32de3cf4-e3e6-416...
2239,bt0063,1.14.13.239,carnitine monooxygenase,https://envipath.org/package/32de3cf4-e3e6-416...,https://envipath.org/package/32de3cf4-e3e6-416...
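
Selecting only rows whose `elink` is still null means the cell above can be re-run to retry failed uploads without repeating successful ones. The sketch below illustrates this with a stand-in uploader; `fake_upload`, `upload_pending`, and the toy rows are hypothetical and do not call the enviPath API.

```python
import pandas as pd

failing = {"9.9.9.2"}  # EC numbers whose upload fails on the first pass


def fake_upload(row):
    """Stand-in for upload_links: returns an id, or None when the upload fails."""
    if row["EC_number"] in failing:
        return None
    return f"link-for-{row['btc']}-{row['EC_number']}"


def upload_pending(frame):
    """Upload rows that have no link yet; return how many rows were attempted."""
    pending = frame[frame.elink.isnull()].index
    if len(pending):
        frame.loc[pending, "elink"] = frame.loc[pending, :].apply(fake_upload, axis=1)
    return len(pending)


toy = pd.DataFrame({"btc": ["bt0001", "bt0002"],
                    "EC_number": ["9.9.9.1", "9.9.9.2"]})
toy["elink"] = None

first = upload_pending(toy)   # attempts both rows, one fails
failing.clear()               # pretend the cause of the failure is fixed
second = upload_pending(toy)  # attempts only the row that failed before

print(first, second)  # 2 1
```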


With this, the _enviLink_ workflow for the KEGG dataset is complete.