# Overview

## Prerequisites
To build _enviLink_, we used Python scripts centred around a local, temporary relational database that stores intermediate results. The chemoinformatic core of the calculation was implemented in Java, using [CDK](https://cdk.github.io/), [Ambit2](http://ambit.sourceforge.net/index.html) and envipath.org's own Java library.

## Data Sources

### a) [EAWAG-BBD Data](EAWAG-BBD%20data.ipynb)
The primary data source of _enviLink_ is the EAWAG-BBD package of https://envipath.org. It contains pathways, reactions, compounds and biodegradation rules that originate from http://eawag-bbd.ethz.ch, the online _Eawag Biocatalysis/Biodegradation Database_. Eawag-BBD is a manually curated resource of literature data on microbial contaminant biotransformation pathways, and its data records are verifiable.<br>
The data used for building _enviLink_ was downloaded directly from envipath.org through its REST interface on June 11, 2020.

### b) [KEGG Data](KEGG%20data.ipynb)
The secondary data source of _enviLink_ is the well-known _Kyoto Encyclopedia of Genes and Genomes_ online database [KEGG](https://www.kegg.jp), one of the most comprehensive data resources for metabolic pathways.<br>
The data used for building _enviLink_ was downloaded directly from [rest.kegg.jp](http://rest.kegg.jp), the REST interface of KEGG, on June 11, 2020.

## Workflow

### 1) [In Silico Reaction Generation](in%20silico%20reaction.ipynb)

#### Standardization
To account for different SMILES strings potentially representing the same compound (e.g., different protonation states or tautomers), each compound in the database was subjected to a set of standardizers. Where applicable, these standardizers generated additional SMILES strings representing possible states of the molecule. All SMILES strings newly generated by a standardizer remained connected to the original compound, because their provenance was tracked.
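
The provenance tracking described above can be illustrated with a minimal Python sketch. It is not the original implementation, which relied on Java toolkits: the two standardizers here are hypothetical, purely string-based toys that toggle the protonation of a terminal carboxylic acid group, and the names `STANDARDIZERS` and `standardize` are assumptions made for this example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy standardizers working on SMILES text only.
# A real standardizer would operate on parsed molecules (e.g. via CDK/Ambit2).
def deprotonate_acid(smiles: str) -> List[str]:
    """Return the carboxylate form if the SMILES ends in C(=O)O."""
    if smiles.endswith("C(=O)O"):
        return [smiles[:-1] + "[O-]"]
    return []

def protonate_acid(smiles: str) -> List[str]:
    """Return the neutral acid form if the SMILES ends in C(=O)[O-]."""
    if smiles.endswith("C(=O)[O-]"):
        return [smiles[:-4] + "O"]
    return []

STANDARDIZERS: Dict[str, Callable[[str], List[str]]] = {
    "deprotonate_acid": deprotonate_acid,
    "protonate_acid": protonate_acid,
}

def standardize(smiles: str) -> List[Tuple[str, str, str]]:
    """Apply all standardizers repeatedly until no new SMILES appear.

    Returns provenance records (derived_smiles, parent_smiles, standardizer),
    so every generated string stays linked to the compound it came from.
    """
    provenance: List[Tuple[str, str, str]] = []
    seen = {smiles}
    queue = [smiles]
    while queue:
        current = queue.pop()
        for name, standardizer in STANDARDIZERS.items():
            for derived in standardizer(current):
                if derived not in seen:
                    seen.add(derived)
                    queue.append(derived)
                    provenance.append((derived, current, name))
    return provenance

records = standardize("CC(=O)O")  # acetic acid
```

Storing the parent SMILES and the standardizer name with every derived string is what later allows compounds to be grouped by their standardization links.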

#### Rule Application
Each reaction substrate, original or standardized, was then combined with every biotransformation rule (btrule) of the Eawag-BBD database. Whenever a btrule was triggered by a given substrate SMILES string, the resulting reaction (i.e., substrate-product pair) was inserted into the database together with a link to the respective rule that triggered it. If the product SMILES string was unknown to the database prior to the triggering of this reaction, it was again standardized in the way described above and inserted into the database.
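
As a rough illustration of this loop, the following Python sketch combines every substrate with every rule and records each triggered reaction together with the rule that produced it. The two rules are hypothetical string-based stand-ins for real btrules, the rule IDs are invented, and the optional `standardize` hook (defaulting to "no additional forms") is an assumption about how newly seen products could be handed to the standardization step.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical toy rules on SMILES text; real btrules are reaction patterns
# applied with a cheminformatics toolkit.
def alcohol_to_aldehyde(smiles: str) -> List[str]:
    """Triggered by a terminal primary alcohol written as ...CO."""
    return [smiles[:-1] + "=O"] if smiles.endswith("CO") else []

def aldehyde_to_acid(smiles: str) -> List[str]:
    """Triggered by a terminal aldehyde written as ...C=O."""
    return [smiles[:-2] + "(=O)O"] if smiles.endswith("C=O") else []

Rule = Callable[[str], List[str]]
Reaction = Tuple[str, str, str]  # (substrate, product, rule_id)

def apply_rules(
    substrates: Iterable[str],
    rules: Dict[str, Rule],
    standardize: Callable[[str], Iterable[str]] = lambda smiles: [],
) -> Tuple[List[Reaction], Set[str]]:
    """Combine each substrate with each rule and collect triggered reactions."""
    substrates = list(substrates)
    compounds: Set[str] = set(substrates)
    reactions: List[Reaction] = []
    for substrate in substrates:
        for rule_id, rule in rules.items():
            for product in rule(substrate):
                # Keep the link between the reaction and the triggering rule.
                reactions.append((substrate, product, rule_id))
                if product not in compounds:
                    # Previously unknown product: store it and its standardized forms.
                    compounds.add(product)
                    compounds.update(standardize(product))
    return reactions, compounds

toy_rules = {"toy_rule_a": alcohol_to_aldehyde, "toy_rule_b": aldehyde_to_acid}
reactions, compounds = apply_rules(["CCO", "CC=O"], toy_rules)
```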

### 2) [Matching of in Silico and Database Reactions](match.ipynb)

The match finder algorithm takes a database reaction as input and delivers one or several rules that _essentially_ reproduce the reaction "in silico". While the "in silico" reactions predicted by these rules might differ from the database reaction in a number of ways (e.g., stoichiometry, standardization), they are only matched with a database reaction if they include the same main substrates and products as the database reaction. The matching algorithm is described in more detail below.

#### Projection of Substrates and Products into Equivalence Classes
To efficiently compare predicted reactions with database reactions despite the previously applied standardization rules, two or more compounds were considered equivalent if they were directly or indirectly linked by any _standardization_ step taken above. The set of all compounds equivalent to one another is termed an _equivalence class_. Each reaction was then mapped to a formal reaction that has the substrates' equivalence classes as substrates and the products' equivalence classes as products. Hence, "in silico" reactions and database reactions were compared based on the IDs of the substrate and product equivalence classes rather than on actual chemical structure.
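
Grouping compounds that are directly or indirectly linked is a connected-components problem. One common way to solve it is a union-find (disjoint-set) structure, sketched below in Python. This is an illustrative choice rather than a description of the original code, and the names `UnionFind` and `project_reaction` are assumptions.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Tuple

class UnionFind:
    """Disjoint-set structure; each root serves as an equivalence class ID."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, item: Hashable) -> Hashable:
        # Unknown items start as their own class.
        self.parent.setdefault(item, item)
        while self.parent[item] != item:
            # Path halving keeps later lookups short.
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a: Hashable, b: Hashable) -> None:
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def project_reaction(
    substrates: Iterable[str], products: Iterable[str], classes: UnionFind
) -> Tuple[FrozenSet[Hashable], FrozenSet[Hashable]]:
    """Replace every compound by the ID of its equivalence class."""
    return (
        frozenset(classes.find(s) for s in substrates),
        frozenset(classes.find(p) for p in products),
    )

# Each standardization link (original, derived) merges two classes.
classes = UnionFind()
for original, derived in [("CC(=O)O", "CC(=O)[O-]"), ("A", "B"), ("B", "C")]:
    classes.union(original, derived)

database_reaction = project_reaction(["CC=O"], ["CC(=O)O"], classes)
in_silico_reaction = project_reaction(["CC=O"], ["CC(=O)[O-]"], classes)
```

In this toy case the two reactions differ only in the protonation state of the product, so their projections are identical and they can be compared by class ID alone.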

#### Removal of Non-Essential Compounds
Because Eawag-BBD btrules neither produce fully stoichiometrically balanced reactions nor consider co-factors, the match finder algorithm includes an additional step to account for this, which matters particularly when comparing predicted reactions with KEGG database reactions. In this step, any compounds considered non-essential were removed from the database reactions. Specifically, salts, water, protons, electrons and small organic compounds were removed. Further, co-factors were removed if they were present as complete pairs in the reaction equation (e.g., pairs like NADH/NAD+, ATP/ADP).
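
A minimal Python sketch of such a filter is shown below. The contents of `NON_ESSENTIAL` and `COFACTOR_PAIRS` are placeholders for illustration only; the actual lists of salts, small organic compounds and co-factor pairs used for _enviLink_ are not reproduced here, and compounds are identified by plain names rather than database IDs.

```python
from typing import Iterable, Set, Tuple

# Placeholder lists (assumptions for this example, not the lists used for enviLink).
NON_ESSENTIAL = {"H2O", "H+", "e-"}
COFACTOR_PAIRS = [("NADH", "NAD+"), ("ATP", "ADP")]

def strip_non_essential(
    substrates: Iterable[str], products: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Remove non-essential compounds and complete co-factor pairs."""
    subs = set(substrates) - NON_ESSENTIAL
    prods = set(products) - NON_ESSENTIAL
    for first, second in COFACTOR_PAIRS:
        # A pair is only removed when both partners appear on opposite
        # sides of the reaction, in either direction.
        if first in subs and second in prods:
            subs.discard(first)
            prods.discard(second)
        elif second in subs and first in prods:
            subs.discard(second)
            prods.discard(first)
    return subs, prods

main_substrates, main_products = strip_non_essential(
    ["ethanol", "NAD+"], ["acetaldehyde", "NADH", "H+"]
)
```

An incomplete pair, such as ATP on the substrate side without ADP among the products, is left untouched by this sketch.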

#### Building of Transformation Graph and Match Finding
Next, a transformation graph was built for each database reaction by linking each of its substrates to the products predicted for it in the _rule application_ step. The outer ends of this transformation graph were then compared to the equivalence classes of the database products. Whenever all of these product equivalence classes were found in the graph, a _match_ was recorded, containing the original reaction and the rule(s) associated with the _in silico_ reaction(s) leading to the product(s).
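
The following Python sketch illustrates the idea under simplifying assumptions: compounds are already replaced by equivalence class IDs, the graph is explored breadth-first over possibly several steps, and a product counts as found when it is reachable at all (the sketch does not restrict the comparison to the outer ends of the graph). The function name `find_match` and the data layout are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Hashable, Iterable, List, Optional, Set, Tuple

Edge = Tuple[Hashable, Hashable, str]  # (substrate_class, product_class, rule_id)

def find_match(
    db_substrates: Iterable[Hashable],
    db_products: Iterable[Hashable],
    predicted: Iterable[Edge],
) -> Optional[Set[str]]:
    """Return the rules leading to all database products, or None if no match."""
    edges: Dict[Hashable, List[Tuple[Hashable, str]]] = {}
    for substrate, product, rule_id in predicted:
        edges.setdefault(substrate, []).append((product, rule_id))

    # rules_to[node] holds the rules on the first path found to that node.
    start = list(db_substrates)
    rules_to: Dict[Hashable, FrozenSet[str]] = {s: frozenset() for s in start}
    queue = deque(start)
    while queue:
        node = queue.popleft()
        for nxt, rule_id in edges.get(node, []):
            if nxt not in rules_to:
                rules_to[nxt] = rules_to[node] | {rule_id}
                queue.append(nxt)

    targets = list(db_products)
    if all(product in rules_to for product in targets):
        # Match: collect the rules along the paths to every product.
        return set().union(*(rules_to[product] for product in targets))
    return None

predicted_edges = [(1, 2, "toy_rule_a"), (2, 3, "toy_rule_b"), (1, 4, "toy_rule_c")]
match = find_match([1], [3], predicted_edges)
no_match = find_match([1], [5], predicted_edges)
```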

### 3) [Generation of Rule-Enzyme Links](rule-enzyme%20link.ipynb)

In this last step, two association tables are joined: the table linking enzymes to reactions, which was extracted from Eawag-BBD and/or KEGG as already known input data, and the table linking reactions to rules, which was generated by the procedure described here. The result is an association table between 4th- or 3rd-level enzyme classes and rules. This final list of individual pieces of evidence is called the _enviLink_ database and has been implemented in _envipath.org_. There, each btrule is shown with a list of 3rd-level enzyme classes, and the individual pieces of evidence supporting each 3rd-level association are listed in an underlying drop-down menu (see, for example, rule [bt0001](https://envipath.org/package/32de3cf4-e3e6-4168-956e-32fa5ddb0ce1/parallel-rule/507b2719-da61-4793-87fc-2d4ae9c20ce9)).
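
The join can be sketched with Python's built-in `sqlite3` module and an in-memory database. The table and column names, the rule and reaction IDs, and the assumption that enzyme identifiers are dot-separated numbers whose first three fields give the 3rd-level class are all illustrative and not taken from the actual _enviLink_ schema.

```python
import sqlite3
from typing import Dict, Iterable, List, Tuple

def third_level(enzyme_id: str) -> str:
    """Truncate a dot-separated enzyme number to its first three fields."""
    return ".".join(enzyme_id.split(".")[:3])

def build_rule_enzyme_links(
    enzyme_reaction: Iterable[Tuple[str, str]],
    reaction_rule: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, List[Tuple[str, str]]]]:
    """Join both tables; return rule -> 3rd-level class -> list of evidence."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE enzyme_reaction (enzyme_id TEXT, reaction_id TEXT);"
        "CREATE TABLE reaction_rule (reaction_id TEXT, rule_id TEXT);"
    )
    conn.executemany("INSERT INTO enzyme_reaction VALUES (?, ?)", enzyme_reaction)
    conn.executemany("INSERT INTO reaction_rule VALUES (?, ?)", reaction_rule)
    rows = conn.execute(
        "SELECT DISTINCT rr.rule_id, er.enzyme_id, er.reaction_id "
        "FROM enzyme_reaction AS er "
        "JOIN reaction_rule AS rr ON er.reaction_id = rr.reaction_id "
        "ORDER BY rr.rule_id, er.enzyme_id, er.reaction_id"
    ).fetchall()
    conn.close()

    links: Dict[str, Dict[str, List[Tuple[str, str]]]] = {}
    for rule_id, enzyme_id, reaction_id in rows:
        # Each (enzyme, reaction) pair is one piece of evidence for the link.
        evidence = links.setdefault(rule_id, {}).setdefault(third_level(enzyme_id), [])
        evidence.append((enzyme_id, reaction_id))
    return links

links = build_rule_enzyme_links(
    enzyme_reaction=[("1.1.1.1", "R1"), ("1.1.1.2", "R2"), ("3.1.1.1", "R3")],
    reaction_rule=[("R1", "toy_rule_a"), ("R2", "toy_rule_a"),
                   ("R3", "toy_rule_b"), ("R4", "toy_rule_b")],
)
```

Grouping by rule and 3rd-level class mirrors the presentation described above, where each rule lists its 3rd-level classes and the supporting evidence sits underneath.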

## Results
_enviLink_ data can be downloaded from the [KEGG](../data/kegg) and [EAWAG-BBD](../data/bbd) data directories or retrieved via the REST interface of _envipath.org_.<br>
A small Python script that fetches the complete _enviLink_ data from [envipath.org](https://envipath.org) is available in the [Download _enviLink_](download%20enviLink.ipynb) notebook.<br>
Some statistics and graphs can be seen in [_enviLink_ results](enviLink%20results.ipynb).