# EAWAG-BBD Data

## Table of Contents <a class="anchor" id="toc"></a>
#### Data
* [Globals](#globals)
* [Compounds](#compounds)
* [Reactions](#reactions)
* [Enzymes](#enzymes)

## <a class="anchor" id="globals"></a>Globals [$\Uparrow$](#toc)

#### The Configuration
The configuration file holds the address of the _enviPath_ instance, whether it is reached over a secure protocol (HTTPS) and whether its certificate is verified. It also lists the data directories used in the workflow.

In [1]:
import yaml
with open("config.yaml", 'r') as stream:
    config = yaml.safe_load(stream)
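
The cells in this notebook read `config['datadir'][DATASET]` and the keys `host`, `secure` and `verified` under `config['envipath']`. As a minimal sketch, assuming `config.yaml` parses to a nested mapping with exactly those keys, the check below reports missing entries before any request is made. The helper name `missing_config_keys` and all sample values are hypothetical and not part of the original workflow.

```python
# Hypothetical helper: list the configuration keys that the notebook reads
# but that are absent from the loaded mapping.
REQUIRED_ENVIPATH_KEYS = ('host', 'secure', 'verified')

def missing_config_keys(config, dataset):
    """Return dotted names of required keys missing from `config`."""
    missing = []
    if dataset not in config.get('datadir', {}):
        missing.append(f'datadir.{dataset}')
    envipath = config.get('envipath', {})
    for key in REQUIRED_ENVIPATH_KEYS:
        if key not in envipath:
            missing.append(f'envipath.{key}')
    return missing

# Sample values are placeholders, not the project's real settings.
sample = {
    'datadir': {'EAWAG-BBD': 'data/EAWAG-BBD'},
    'envipath': {'host': 'envipath.org', 'secure': True},
}
print(missing_config_keys(sample, 'EAWAG-BBD'))
```

An empty list means every key used in this notebook is present.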

#### Directories Used in the Workflow

In [2]:
DATASET = 'EAWAG-BBD'
DATA = config['datadir'][DATASET]
from os import makedirs
makedirs(DATA, exist_ok=True)

#### enviPath Client
All data of the EAWAG-BBD package were read directly from _envipath.org_ through its REST interface. The _enviLink_ workflow uses the Python package _envipath-api_, a wrapper that simplifies calls to the REST interface. Its main class, _EnviPathClient_, establishes a session with _envipath.org_ to read and write data.

In [3]:
from envirest import EnviPathClient
from getpass import getpass
envipath = config['envipath']
client = EnviPathClient(
    envipath['host'],
    secure=envipath['secure'],
    verify=envipath['verified'])

#### Package URL

In [4]:
EAWAGBBD = client.findpackage(DATASET)
EAWAGBBD

'https://envipath.org/package/32de3cf4-e3e6-4168-956e-32fa5ddb0ce1'

## <a class="anchor" id="compounds"></a>Compounds [$\Uparrow$](#toc)
In _envipath.org_ compounds can be represented by multiple molecular structures. In general, these differ only slightly (e.g., different degrees of protonation, tautomerization etc.). In constructing _enviLink_, all those structures were considered, but remained associated with the respective compound.<br>
In the cells below, the REST interface is used to obtain a table with three columns: compound id, structure id and SMILES string. This table is one of the input files for the first step of the _enviLink_ workflow, i.e., [_in silico_ reaction](in%20silico%20reaction.ipynb). 

#### Compound URLs
A GET request to the package URL with `/compound` appended returns a list of compound objects. These objects are collected into a `pandas.DataFrame`.

In [5]:
from pandas import DataFrame
compounds = DataFrame(client.get(f'{EAWAGBBD}/compound')['compound'])
compounds.head()

| | id | identifier | name | reviewStatus |
|---|---|---|---|---|
| 0 | https://envipath.org/package/32de3cf4-e3e6-416... | compound | (+)-(3S,4R)-cis-3,4-Dihydroxy-3,4-dihydrofluorene | reviewed |
| 1 | https://envipath.org/package/32de3cf4-e3e6-416... | compound | (+)-(4R)-Limonene | reviewed |
| 2 | https://envipath.org/package/32de3cf4-e3e6-416... | compound | (+)-Camphor | reviewed |
| 3 | https://envipath.org/package/32de3cf4-e3e6-416... | compound | (-)-(1R,2S,5R)-Menthol | reviewed |
| 4 | https://envipath.org/package/32de3cf4-e3e6-416... | compound | (-)-(2S,5R)-Menthone | reviewed |


#### Structure SMILES and URLs
For each compound, a GET request to _envipath.org_ was made to collect the URL and SMILES of all associated _structure_ sub-entities.<br>
URLs of _structures_ have a uniform pattern:<br>
`https://envipath.org/package/[package UUID]/compound/[compound UUID]/structure/[structure UUID]`<br>
Both `compound UUID` and `structure UUID` are globally unique and were used as identifiers in _enviLink_.<br>
In the table below, these identifiers are shown in the columns _cid_ and _sid_, respectively.
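
The URL pattern above can be taken apart with the standard library alone. The following is a sketch rather than the notebook's own code (which uses `str.split`); the function name `split_structure_url` is hypothetical, and the example URL is assembled from the package URL and the first _cid_/_sid_ pair shown in this document.

```python
from urllib.parse import urlparse

def split_structure_url(url):
    """Return (package UUID, compound UUID, structure UUID) from a structure URL."""
    parts = urlparse(url).path.strip('/').split('/')
    # Expected path: package/<uuid>/compound/<uuid>/structure/<uuid>
    if len(parts) != 6 or parts[0::2] != ['package', 'compound', 'structure']:
        raise ValueError(f'not a structure URL: {url}')
    return tuple(parts[1::2])

url = ('https://envipath.org/package/32de3cf4-e3e6-4168-956e-32fa5ddb0ce1'
       '/compound/3a797060-38fa-4661-8603-48773b3aff58'
       '/structure/8d63dbac-227d-4898-86fe-b19bf6683e84')
package_id, cid, sid = split_structure_url(url)
print(cid, sid)
```

Validating the path segments makes a compound URL passed by mistake fail loudly instead of producing a wrong identifier.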

In [6]:
from pandas import concat
frames = []
for url in compounds.id:
    try:
        cid = url.split('/compound/')[1]
        cjson = client.get(url)
        cstrs = DataFrame(cjson.get('structures', []))
        cstrs['cid'] = cid
        # A compound without structures has no 'id' column and raises KeyError here
        frames.append(cstrs.loc[:,['id','cid','name','smiles']])
    except (AttributeError, KeyError):
        print("-", url)
        print(client.get(url))

# DataFrame.append was removed in pandas 2.0; concatenate the collected frames once
structures = concat(frames)
structures['sid'] = structures.apply(lambda row: row.id.split('/structure/')[1], axis=1)
structures.head()

| | id | cid | name | smiles | sid |
|---|---|---|---|---|---|
| 0 | https://envipath.org/package/32de3cf4-e3e6-416... | 3a797060-38fa-4661-8603-48773b3aff58 | (+)-(3S,4R)-cis-3,4-Dihydroxy-3,4-dihydrofluorene | `C1=CC2=C(C=C1)C3=C(C=C[C@@H]([C@@H]3O)O)C2` | 8d63dbac-227d-4898-86fe-b19bf6683e84 |
| 0 | https://envipath.org/package/32de3cf4-e3e6-416... | 60de31b0-a3c3-4739-9f88-7c5c78a14a76 | (+)-(4R)-Limonene | `C=C(C)[C@H]1CC=C(C)CC1` | 8723c215-b1e7-4719-9aa0-d4c1c85a0269 |
| 0 | https://envipath.org/package/32de3cf4-e3e6-416... | e4fe0464-864c-4cb3-9587-5a82d6dc67fa | (+)-Camphor | `CC1(C)C2CCC1(C)C(=O)C2` | 5f574ab2-4990-4548-81e7-16bb217447ac |
| 0 | https://envipath.org/package/32de3cf4-e3e6-416... | 04466823-ecf7-4dbd-9a81-ea0bf8e40dcc | (-)-(1R,2S,5R)-Menthol | `CC(C)[C@@H]1CC[C@@H](C)CC1O` | 794de362-0e07-4a9d-a447-55dc59318b1f |
| 0 | https://envipath.org/package/32de3cf4-e3e6-416... | ff1575d1-ebba-4d8b-b5db-f4bae1049779 | (-)-(2S,5R)-Menthone | `CC(C)[C@@H]1CC[C@@H](C)CC1=O` | 7640df0d-0730-4354-ad81-dfc6d668a81e |


In [7]:
RAWCOMPOUNDS = f'{DATA}/{DATASET}_compounds.tsv'
structures.loc[:,['sid', 'cid', 'smiles']].to_csv(RAWCOMPOUNDS, sep='\t', header=None, index=None)

## <a class="anchor" id="reactions"></a>Reactions [$\Uparrow$](#toc)
In the same way as the compounds, reactions are read from _envipath.org_ together with their substrates and products.
Then they are written into a table with the columns _substrates_, _reaction id_ and _products_, 
which is the input format for the _in silico_ reaction step of the workflow, with the columns _substrates_ and _products_ being '.'-separated SMILES strings.
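
To make the target format concrete, the grouping step can be sketched without pandas. This is an illustration under stated assumptions, not the notebook's implementation: the function name `reaction_table` and the reaction ids `r1` and `r3` are hypothetical, and the SMILES strings are copied from tables in this document only as sample input.

```python
from collections import defaultdict

def reaction_table(roles):
    """Yield (substrates, reaction id, products) rows.

    `roles` is an iterable of (reaction id, is_product, SMILES) tuples.
    Multiple SMILES on the same side of a reaction are joined with '.'.
    """
    substrates, products = defaultdict(list), defaultdict(list)
    for reaction_id, is_product, smiles in roles:
        side = products if is_product else substrates
        side[reaction_id].append(smiles)
    # One row per reaction that has at least one substrate
    for reaction_id in sorted(substrates):
        yield ('.'.join(substrates[reaction_id]),
               reaction_id,
               '.'.join(products.get(reaction_id, [])))

# Hypothetical reaction ids; SMILES are sample values only.
roles = [
    ('r1', False, 'C(CCl)Cl'),
    ('r1', True, 'C(CO)Cl'),
    ('r3', False, 'CCCC[Sn+2]CC(CC)O'),
    ('r3', True, 'CCC(=O)C'),
    ('r3', True, 'CCCC[Sn+3]'),
]
for row in reaction_table(roles):
    print('\t'.join(row))
```

The second row shows two products joined into a single '.'-separated SMILES string.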

In [8]:
reactions = DataFrame(client.get(f'{EAWAGBBD}/reaction')['reaction'])
reactions.head()

| | id | identifier | name | reviewStatus |
|---|---|---|---|---|
| 0 | https://envipath.org/package/32de3cf4-e3e6-416... | reaction | Eawag BBD reaction r0001 | reviewed |
| 1 | https://envipath.org/package/32de3cf4-e3e6-416... | reaction | Eawag BBD reaction r0002 | reviewed |
| 2 | https://envipath.org/package/32de3cf4-e3e6-416... | reaction | Eawag BBD reaction r0003 | reviewed |
| 3 | https://envipath.org/package/32de3cf4-e3e6-416... | reaction | Eawag BBD reaction r0005 | reviewed |
| 4 | https://envipath.org/package/32de3cf4-e3e6-416... | reaction | Eawag BBD reaction r0006 | reviewed |


In [9]:
# DataFrame.append was removed in pandas 2.0; collect plain rows and build the frame once
rows = []
for ridx in reactions.index.values:
    rid = reactions.loc[ridx,'id']
    reaction = client.get(rid)
    reactions.loc[ridx,'smirks'] = reaction['smirks']
    # One row per participant; isproduct is 0.0 for educts and 1.0 for products
    for educt in reaction['educts']:
        rows.append({'compound':educt['id'], 'isproduct':0.0, 'reaction':rid})
    for product in reaction['products']:
        rows.append({'compound':product['id'], 'isproduct':1.0, 'reaction':rid})
rrole = DataFrame(rows, columns=['compound', 'isproduct', 'reaction'])
rrole.head()

| | compound | isproduct | reaction |
|---|---|---|---|
| 0 | https://envipath.org/package/32de3cf4-e3e6-416... | 0.0 | https://envipath.org/package/32de3cf4-e3e6-416... |
| 1 | https://envipath.org/package/32de3cf4-e3e6-416... | 1.0 | https://envipath.org/package/32de3cf4-e3e6-416... |
| 2 | https://envipath.org/package/32de3cf4-e3e6-416... | 0.0 | https://envipath.org/package/32de3cf4-e3e6-416... |
| 3 | https://envipath.org/package/32de3cf4-e3e6-416... | 1.0 | https://envipath.org/package/32de3cf4-e3e6-416... |
| 4 | https://envipath.org/package/32de3cf4-e3e6-416... | 0.0 | https://envipath.org/package/32de3cf4-e3e6-416... |


In [10]:
rrole_with_smiles = rrole\
    .merge(structures, left_on='compound', right_on='id', how='left')\
    .loc[:,['reaction', 'isproduct', 'compound', 'smiles']]
rrole_with_smiles.head()

| | reaction | isproduct | compound | smiles |
|---|---|---|---|---|
| 0 | https://envipath.org/package/32de3cf4-e3e6-416... | 0.0 | https://envipath.org/package/32de3cf4-e3e6-416... | `C(CCl)Cl` |
| 1 | https://envipath.org/package/32de3cf4-e3e6-416... | 1.0 | https://envipath.org/package/32de3cf4-e3e6-416... | `C(CO)Cl` |
| 2 | https://envipath.org/package/32de3cf4-e3e6-416... | 0.0 | https://envipath.org/package/32de3cf4-e3e6-416... | `C(CO)Cl` |
| 3 | https://envipath.org/package/32de3cf4-e3e6-416... | 1.0 | https://envipath.org/package/32de3cf4-e3e6-416... | `C(C=O)Cl` |
| 4 | https://envipath.org/package/32de3cf4-e3e6-416... | 0.0 | https://envipath.org/package/32de3cf4-e3e6-416... | `C(C=O)Cl` |


In [11]:
# Join the SMILES of all substrates of a reaction into one '.'-separated string
substrates = rrole_with_smiles[rrole_with_smiles.isproduct == 0]\
    .groupby(['reaction'])\
    .agg({'smiles': lambda smiles: '.'.join(smiles)})
substrates.columns = ['substrates']

# Same for the products
products = rrole_with_smiles[rrole_with_smiles.isproduct == 1]\
    .groupby(['reaction'])\
    .agg({'smiles': lambda smiles: '.'.join(smiles)})
products.columns = ['products']

# Combine both sides and reduce the reaction URL (the index) to its UUID
rs = substrates.join(products).loc[:,['substrates','products']]
rs['reaction'] = rs.apply(lambda row: row.name.split('/reaction/')[1], axis=1)
rs = rs.reindex(['substrates','reaction','products'], axis=1)
rs.reset_index(inplace=True, drop=True)
print(rs.shape)
rs.head()

(1479, 3)


| | substrates | reaction | products |
|---|---|---|---|
| 0 | `C1=C(C=CC(=C1)O)C(C2=CC=C(C=C2)O)O` | 00549813-a13d-442f-a963-1b146cfb2df5 | `C1=C(C=CC(=C1)O)C(=O)C2=CC=C(C=C2)O` |
| 1 | `C1=CC=C(C=C1)OC2=CC(=CC=C2)C(C#N)O` | 0066b7c0-544b-47fd-9b93-8a8b275adead | `C1=CC=C(C=C1)OC2=CC(=CC=C2)C=O` |
| 2 | `CCCC[Sn+2]CC(CC)O` | 00a0fe94-d694-4913-866f-1c9253fcc67a | `CCC(=O)C.CCCC[Sn+3]` |
| 3 | `CC1=CC=C(C=C1)C(=O)[O-]` | 00b9590f-c0c9-4506-ab7e-52d73364d996 | `CC1=C[C@@H]([C@](C=C1)(C(=O)[O-])O)O` |
| 4 | `C1=C[C@H]([C@H](C(=C1)CCC(=O)[O-])O)O` | 0104a8a5-b682-4664-9495-52feb2af184a | `C1=CC(=C(C(=C1)CCC(=O)[O-])O)O` |


In [12]:
RAWREACTIONS = f'{DATA}/{DATASET}_reactions.tsv'
rs.to_csv(RAWREACTIONS, sep='\t', header=None, index=None)

## <a class="anchor" id="enzymes"></a>Enzymes [$\Uparrow$](#toc)
For each reaction, the associated EC numbers are read from _envipath.org_ and written to a text file in which each line contains a reaction id, the enzyme name and the EC number, separated by tabs.

In [13]:
def collect_enzymes(reaction_list):
    for url in reaction_list:
        for e in client.get(url).get('ecNumbers', []):
            yield [
                url.split('/reaction/')[1],
                e.get('ecName'),
                e.get('ecNumber'),
                [p.get('name') for p in e.get('pathways',[])]
            ]
reaction_enzymes = DataFrame(collect_enzymes(reactions.id.drop_duplicates()),
                            columns=['reaction', 'enzyme name', 'ecNo', 'pathways'])
print(reaction_enzymes.shape)
reaction_enzymes.head()

(1301, 4)


| | reaction | enzyme name | ecNo | pathways |
|---|---|---|---|---|
| 0 | 6e2372bc-b165-4c19-b01c-6b64dff4d40a | haloalkane dehalogenase | 3.8.1.5 | [1,2-Dichloroethane] |
| 1 | 210b1e51-21cf-4c53-8990-d02aeaf05bdb | methanol dehydrogenase | 1.1.2.7 | [1,2-Dichloroethane] |
| 2 | 9137e054-d735-4a38-81b9-5f4ed07324f6 | aldehyde dehydrogenase (NAD+) | 1.2.1.3 | [1,2-Dichloroethane] |
| 3 | f741ec3a-dfe1-42ea-beb1-0ed21215ea4f | gallate decarboxylase | 4.1.1.59 | [Gallate (anaerobic)] |
| 4 | 436d76b6-7757-4909-91ce-87bd60bb5b4e | pyrogallol hydroxyltransferase | 1.97.1.2 | [Gallate (anaerobic)] |


In [14]:
from pandas import read_csv
REACTIONENZYMES = f'{DATA}/{DATASET}_reaction-enzymes.tsv'
reaction_enzymes\
    .loc[:,['reaction', 'enzyme name', 'ecNo']]\
    .to_csv(REACTIONENZYMES, sep="\t", index=None, header=None)
read_csv(REACTIONENZYMES, sep="\t", header=None).head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 6e2372bc-b165-4c19-b01c-6b64dff4d40a | haloalkane dehalogenase | 3.8.1.5 |
| 1 | 210b1e51-21cf-4c53-8990-d02aeaf05bdb | methanol dehydrogenase | 1.1.2.7 |
| 2 | 9137e054-d735-4a38-81b9-5f4ed07324f6 | aldehyde dehydrogenase (NAD+) | 1.2.1.3 |
| 3 | f741ec3a-dfe1-42ea-beb1-0ed21215ea4f | gallate decarboxylase | 4.1.1.59 |
| 4 | 436d76b6-7757-4909-91ce-87bd60bb5b4e | pyrogallol hydroxyltransferase | 1.97.1.2 |
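
A downstream step might want the EC numbers grouped per reaction. The sketch below reads the headerless three-column layout shown above with the standard `csv` module; the function name `ec_numbers_by_reaction` is hypothetical, and an in-memory sample built from the first two rows stands in for the actual file.

```python
import csv
import io
from collections import defaultdict

def ec_numbers_by_reaction(handle):
    """Map reaction id -> list of EC numbers from a headerless three-column TSV."""
    mapping = defaultdict(list)
    for reaction_id, _enzyme_name, ec_number in csv.reader(handle, delimiter='\t'):
        mapping[reaction_id].append(ec_number)
    return dict(mapping)

# In-memory stand-in for the TSV file, using two rows from the table above
sample = io.StringIO(
    '6e2372bc-b165-4c19-b01c-6b64dff4d40a\thaloalkane dehalogenase\t3.8.1.5\n'
    '210b1e51-21cf-4c53-8990-d02aeaf05bdb\tmethanol dehydrogenase\t1.1.2.7\n'
)
ec_by_reaction = ec_numbers_by_reaction(sample)
print(ec_by_reaction)
```

With the real file, the same function could be called on `open(REACTIONENZYMES, newline='')`.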
