# Rule-Enzyme Link

## Table of Contents <a class="anchor" id="toc"></a>
#### [Summary](#overview)
#### Data
* [Globals](#globals)
* [Reaction-Enzyme Links](#rel)
* [Rule-Reaction Links](#rra)

#### Processing
* [Rule-Enzyme Links](#join)
* [Create _enviPath_ Entries](#epapi)

## <a class="anchor" id="overview"></a>Summary [$\Uparrow$](#toc)
In the third and final step of the workflow, the reaction-enzyme links from the database are combined with the reaction-rule links from the [match](match-KEGG.ipynb) step to create rule-enzyme links. These are then uploaded to envipath.org using the envipath-api Python REST interface wrapper.

## <a class="anchor" id="globals"></a>Globals [$\Uparrow$](#toc)
#### The Configuration
Configuration data, such as the credentials for accessing the workflow database.

In [1]:
import yaml
with open("config.yaml", 'r') as stream:
    config = yaml.safe_load(stream)
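
The cells in this notebook read a handful of keys from `config.yaml`. As a rough sketch, the check below lists those keys and reports any that are missing. The helper `missing_keys` and all values in `example` are hypothetical and only illustrate the expected shape; the real file may contain more entries.

```python
# Keys this notebook reads from the loaded configuration (collected from its cells).
REQUIRED = {
    "datadir": ["EAWAG-BBD", "temp"],
    "envipath": ["host", "secure", "verified", "user", "password"],
    "binaries": [],
    "sources": [],
}

def missing_keys(config):
    """Return dotted names of required keys that are absent from the config dict."""
    missing = []
    for section, keys in REQUIRED.items():
        if section not in config:
            missing.append(section)
            continue
        for key in keys:
            if key not in config[section]:
                missing.append(f"{section}.{key}")
    return missing

# Hypothetical values, for illustration only.
example = {
    "datadir": {"EAWAG-BBD": "data/EAWAG-BBD", "temp": "data/temp"},
    "binaries": "bin",
    "sources": "src",
    "envipath": {"host": "envipath.org", "secure": True, "verified": True,
                 "user": "someone", "password": "secret"},
}
print(missing_keys(example))  # → []
```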

#### Directories Used in the Workflow

In [2]:
DATASET = 'EAWAG-BBD'
DATA = config['datadir'][DATASET]
TEMP = config['datadir']['temp']
BIN = config['binaries']
SRC = config['sources']

#### envipath Client
Used to upload the rule-enzyme links.

In [3]:
from envirest import EnviPathClient
from getpass import getpass
envipath = config['envipath']
client = EnviPathClient(
    envipath['host'],
    secure=envipath['secure'],
    verify=envipath['verified'],
    username=envipath['user'],
    password=envipath['password'])

## <a class="anchor" id="rel"></a>Reaction-Enzyme Links [$\Uparrow$](#toc)
This is essential input data (see [EAWAG-BBD data](EAWAG-BBD%20data.ipynb)). It is read from a tab-separated text file containing three columns: the reaction id, the enzyme name, and the EC number. A fourth column, `3rd_lvl`, is derived by dropping the last field of the EC number.

In [4]:
from pandas import read_csv
reaction_enzyme_links = read_csv(f'{DATA}/{DATASET}_reaction-enzymes.tsv', sep='\t', header=None, 
                                 names=['reaction_uuid', 'EC_name', 'EC_number'])
reaction_enzyme_links['3rd_lvl'] = ['.'.join(ecn.split('.')[:-1]) for ecn in reaction_enzyme_links['EC_number']]
reaction_enzyme_links.head()

Unnamed: 0,reaction_uuid,EC_name,EC_number,3rd_lvl
0,6e2372bc-b165-4c19-b01c-6b64dff4d40a,haloalkane dehalogenase,3.8.1.5,3.8.1
1,210b1e51-21cf-4c53-8990-d02aeaf05bdb,methanol dehydrogenase,1.1.2.7,1.1.2
2,9137e054-d735-4a38-81b9-5f4ed07324f6,aldehyde dehydrogenase (NAD+),1.2.1.3,1.2.1
3,f741ec3a-dfe1-42ea-beb1-0ed21215ea4f,gallate decarboxylase,4.1.1.59,4.1.1
4,436d76b6-7757-4909-91ce-87bd60bb5b4e,pyrogallol hydroxyltransferase,1.97.1.2,1.97.1
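
The `3rd_lvl` column is obtained by cutting off the last dot-separated field of the EC number. A minimal standalone version of that step, with the helper name `third_level` chosen here for illustration:

```python
def third_level(ec_number):
    """Drop the last dot-separated field of an EC number string."""
    return ".".join(ec_number.split(".")[:-1])

print(third_level("3.8.1.5"))    # → 3.8.1
print(third_level("1.14.12.-"))  # → 1.14.12
```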


## <a class="anchor" id="rra"></a>Rule-Reaction Links [$\Uparrow$](#toc)

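These links come from the [_match_](match.ipynb) step of the workflow. They are again read from a tab-separated text file, which holds the rule id (`bt`) and the reaction id (`envipath_url`, renamed to `reaction_uuid` here) alongside the `direction` and `reaction` columns.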

In [5]:
from pandas import read_csv
rule_reaction_links = read_csv(f'{DATA}/direct_links.tsv', sep="\t", dtype=str).rename(columns={'envipath_url':'reaction_uuid'})
rule_reaction_links.head()

Unnamed: 0,bt,direction,reaction,reaction_uuid
0,3673.0,f,1,00549813-a13d-442f-a963-1b146cfb2df5
1,4291.0,f,2,0066b7c0-544b-47fd-9b93-8a8b275adead
2,3670.0,f,5,0104a8a5-b682-4664-9495-52feb2af184a
3,4312.0,f,5,0104a8a5-b682-4664-9495-52feb2af184a
4,2690.1,f,5,0104a8a5-b682-4664-9495-52feb2af184a


## <a class="anchor" id="join"></a>Rule-Enzyme Links [$\Uparrow$](#toc)

Both tables are simply joined to establish associations between rules and enzymes, based on reaction sets that provide evidence for the link between the two.<br>
These associations are then further condensed to links between rules and third level EC numbers.
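
To make the join semantics explicit, here is a small sketch in plain Python on toy data. It is not the notebook's pandas code; all identifiers below are made up. Each reaction that has both an enzyme and a matched rule yields one rule-enzyme link, and the links are then condensed to third-level EC numbers.

```python
from collections import defaultdict

# Toy rows; identifiers are invented for illustration.
reaction_enzyme = [("r1", "1.1.2.7"), ("r2", "3.8.1.5"), ("r3", "3.8.1.5")]
rule_reaction = [("bt0001", "r1"), ("bt0002", "r2"), ("bt0002", "r3"), ("bt0003", "r4")]

# Index the rules by the reaction they matched.
rules_by_reaction = defaultdict(list)
for rule, reaction in rule_reaction:
    rules_by_reaction[reaction].append(rule)

# Join on the reaction id: the reaction is the evidence for the link.
links = [
    (rule, ec, reaction)
    for reaction, ec in reaction_enzyme
    for rule in rules_by_reaction.get(reaction, [])
]

# Condense to unique (rule, third-level EC) pairs.
third_level_links = sorted({(rule, ".".join(ec.split(".")[:-1])) for rule, ec, _ in links})

print(links)
print(third_level_links)  # → [('bt0001', '1.1.2'), ('bt0002', '3.8.1')]
```

Note that `bt0003` drops out because its reaction `r4` has no enzyme annotation in the toy data.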

In [6]:
rules = read_csv(f'{TEMP}/envipath-rules.annotated.tsv', sep="\t")
rules['bt'] = [name.split('-')[-1] for name in rules.name]
rules['btc'] = ['-'.join(name.split('-')[:-1]) for name in rules.name]
rules.head()

Unnamed: 0,id,name,smirks,substrateFilter,productFilter,container,bt,btc
0,https://envipath.org/package/32de3cf4-e3e6-416...,bt0001-3568,"[H][#8:2][C:1]([H:5])([H])[#1,#6:6]>>[H:5][#6:...",[$([H]C(C)=O)],,https://envipath.org/package/32de3cf4-e3e6-416...,3568,bt0001
1,https://envipath.org/package/32de3cf4-e3e6-416...,bt0002-3673,[H][#8:1][#6;A;!$(CCC(O)[O-]):2]([H])([#6:5])[...,[$([O;H1]C1C=CC=CC1[O;H1])],,https://envipath.org/package/32de3cf4-e3e6-416...,3673,bt0002
2,https://envipath.org/package/32de3cf4-e3e6-416...,bt0003-1196,[H][#6:1](-[#6:5])=[O:4]>>[#6:5]-[#6:1](-[#8-]...,,,https://envipath.org/package/32de3cf4-e3e6-416...,1196,bt0003
3,https://envipath.org/package/32de3cf4-e3e6-416...,bt0005-3667,[#8:7]([H])-[#6:1]([H])-1-[#6:2]=[#6:3]-[#6:4]...,,,https://envipath.org/package/32de3cf4-e3e6-416...,3667,bt0005
4,https://envipath.org/package/32de3cf4-e3e6-416...,bt0005-3776,[c:7]([H])1[c:8]([H])[c:9]([H])[c:10]2[c:11]([...,,,https://envipath.org/package/32de3cf4-e3e6-416...,3776,bt0005


In [7]:
rule_enzyme_links = reaction_enzyme_links\
                    .merge(rule_reaction_links, on='reaction_uuid', how='left')\
                    .merge(rules, on='bt')\
                    .loc[:,['btc', 'bt', 'EC_number', 'EC_name', '3rd_lvl', 'reaction', 'reaction_uuid', 'container', 'id']]\
                    .rename(columns={'reaction_uuid': 'evidence', 'container':'rule', 'id':'simple_rule'})
rule_enzyme_links.to_csv(f'{DATA}/rule_enzyme_links.tsv', sep="\t", index=None)

qualified_rule_enzyme_links = rule_enzyme_links[
       ~rule_enzyme_links.EC_number.str.endswith('.-.-')
    ].loc[:,['btc', 'bt', 'EC_number', '3rd_lvl', 'EC_name', 'evidence', 'rule']]\
    .drop_duplicates()
qualified_rule_enzyme_links.to_csv(f'{DATA}/qualified_rule_enzyme_links.tsv', sep="\t", index=None)

print(qualified_rule_enzyme_links.nunique())
read_csv(f'{DATA}/qualified_rule_enzyme_links.tsv', sep="\t").tail()

btc          163
bt           195
EC_number    278
3rd_lvl       82
EC_name      541
evidence     736
rule         163
dtype: int64


Unnamed: 0,btc,bt,EC_number,3rd_lvl,EC_name,evidence,rule
833,bt0318,3664.0,3.5.1.-,3.5.1,N-butylcarbamate hydrolase,ac381546-596e-4d9b-bd01-11e389fa5276,https://envipath.org/package/32de3cf4-e3e6-416...
834,bt0444,4310.0,4.3.3.-,4.3.3,iminodisuccinate lyase,f86f958c-10f4-4c0e-81f1-dbc2246f212b,https://envipath.org/package/32de3cf4-e3e6-416...
835,bt0444,4310.0,4.3.3.-,4.3.3,iminodisuccinate lyase,4091c5d9-cf81-4ab9-a583-a3c2905fc1b6,https://envipath.org/package/32de3cf4-e3e6-416...
836,bt0374,3801.0,1.14.12.-,1.14.12,"N-acetylaminonitrofen 1,2-dioxygenase",c30bf5bb-21ad-4704-980c-15d6248efec3,https://envipath.org/package/32de3cf4-e3e6-416...
837,bt0374,3801.0,1.14.12.-,1.14.12,"nitrofen 1,2-dioxygenase",04116817-1eb6-4cab-863a-bbec78202def,https://envipath.org/package/32de3cf4-e3e6-416...
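
The filter in the cell above removes only EC numbers that end in `.-.-`, that is, numbers left undefined from the third level on. Numbers that lack only the last field, such as `3.5.1.-`, remain in the table, as the output shows. A small sketch of the same test on plain strings, with the helper name `is_qualified` chosen here for illustration:

```python
def is_qualified(ec_number):
    """True unless the EC number is undefined at the third level (ends in '.-.-')."""
    return not ec_number.endswith(".-.-")

for ec in ["3.8.1.5", "3.5.1.-", "1.14.-.-", "3.-.-.-"]:
    print(ec, is_qualified(ec))
```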


## <a class="anchor" id="epapi"></a>Create _enviPath_ Entries [$\Uparrow$](#toc)
The envipath client's `add_ec_number` method is used to upload the links to _envipath.org_.<br>
One link is uploaded per combination of rule (`btc`), EC number, and enzyme name, with the supporting reactions submitted as a list of 'evidence'.
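
The evidence list for each combination can be pictured as a grouping step. The sketch below shows it in plain Python on toy rows. The helper `group_evidence`, the package URL, and the reaction ids are hypothetical, and it assumes that reaction URLs have the form `<package URL>/reaction/<reaction id>`.

```python
from collections import defaultdict

def group_evidence(rows, package_url):
    """Collect reaction URLs per (rule, EC number, enzyme name) combination."""
    grouped = defaultdict(list)
    for row in rows:
        key = (row["btc"], row["EC_number"], row["EC_name"])
        url = f"{package_url}/reaction/{row['evidence']}"
        if url not in grouped[key]:  # keep each reaction only once
            grouped[key].append(url)
    return dict(grouped)

# Toy rows; ids and URL are invented for illustration.
rows = [
    {"btc": "bt0444", "EC_number": "4.3.3.-", "EC_name": "iminodisuccinate lyase", "evidence": "r1"},
    {"btc": "bt0444", "EC_number": "4.3.3.-", "EC_name": "iminodisuccinate lyase", "evidence": "r2"},
    {"btc": "bt0318", "EC_number": "3.5.1.-", "EC_name": "N-butylcarbamate hydrolase", "evidence": "r3"},
]
for key, urls in group_evidence(rows, "https://example.org/package/demo").items():
    print(key, urls)
```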

In [8]:
from sys import stderr

EAWAGBBD = client.findpackage('EAWAG-BBD')
LINKING = 'EAWAG-BBD 2020-10-21 enviLink'

def upload_links(row):
    """Upload one rule-enzyme link; return its id, or NaN if the upload fails."""
    try:
        bt = row['btc']
        ec_number = row['EC_number']
        name = row['EC_name']
        rule_url = row['rule']
        # Collect all reactions supporting this rule / EC number / enzyme name combination.
        evidence = [f'{EAWAGBBD}/reaction/{e}' for e in rule_enzyme_links[
            (rule_enzyme_links.btc == bt) &
            (rule_enzyme_links.EC_number == ec_number) &
            (rule_enzyme_links.EC_name == name)
        ].evidence]
        # Single quotes in the enzyme name are escaped before submission.
        eclink = client.add_ec_number(rule_url, ec_number=ec_number, name=name.replace("'", "\\\'"),
                                      linking_method=LINKING, evidence=evidence)
        return eclink.get('id')
    except Exception as e:
        # Report the failure and leave the row without a link id.
        stderr.write("{} - {}\n".format(e.__class__.__name__, e))
        # float('nan') avoids numpy.NaN, which was removed in NumPy 2.0.
        return float('nan')

In [9]:
# One row per rule / EC number / enzyme name combination.
enviLink = qualified_rule_enzyme_links\
    .loc[:,['btc', 'EC_number', 'EC_name', 'rule']]\
    .drop_duplicates().copy()
enviLink['elink'] = None

# Upload only the rows that have no link id yet.
uploadrows = enviLink[enviLink.elink.isnull()].index
enviLink.loc[uploadrows,'elink'] = enviLink.loc[uploadrows,:].apply(upload_links, axis=1)
enviLink.tail()

Unnamed: 0,btc,EC_number,EC_name,rule,elink
942,bt0440,3.9.1.-,methamidophos phosphoamide hydrolase,https://envipath.org/package/32de3cf4-e3e6-416...,https://envipath.org/package/32de3cf4-e3e6-416...
943,bt0416,1.14.12.12,"naphthalene 1,2-dioxygenase",https://envipath.org/package/32de3cf4-e3e6-416...,https://envipath.org/package/32de3cf4-e3e6-416...
944,bt0349,4.1.1.-,"naphthalene-1,8-dicarboxylate decarboxylase",https://envipath.org/package/32de3cf4-e3e6-416...,https://envipath.org/package/32de3cf4-e3e6-416...
945,bt0318,3.5.1.-,N-butylcarbamate hydrolase,https://envipath.org/package/32de3cf4-e3e6-416...,https://envipath.org/package/32de3cf4-e3e6-416...
946,bt0444,4.3.3.-,iminodisuccinate lyase,https://envipath.org/package/32de3cf4-e3e6-416...,https://envipath.org/package/32de3cf4-e3e6-416...


With this, the workflow of _enviLink_ for the EAWAG-BBD dataset is finished.