# Obtaining and Processing Network Reconstructions from KBase

KBase (http://kbase.us) is a powerful resource for obtaining genome-scale network reconstructions for microbial genomes. These reconstructions are distributed as SBML files, which must be processed prior to reverse ecology analysis. This notebook describes how to obtain reconstructions from KBase, and how to process them.

## Obtaining and Preparing SBML Files

Briefly, genomes (as fasta files) for your organisms of interest can be pushed from your computer to KBase. Once there, a KBase Narrative (iPython notebook) can be used to build reconstructions for your genomes. To do so, follow the instructions in `Perl/README.md`.

## Obtaining and Preparing SBML Files: Update January 2016

To validate predictions made by this reverse ecology pipeline, I propose to map metatranscriptomic reads from OMD-TOIL to KBase-annotated genomes. Retrieving the genome annotations proved to be quite a pain!

* The KBase Narrative mentioned above ([click here](https://narrative.kbase.us/narrative/ws.9599.obj.2)) no longer contains annotated versions of the genomes.

* Furthermore, the code cells in that narrative generate errors during the annotation, model-building, and downloading steps.

I have attempted to communicate with the KBase team to resolve some of these issues, but have received no response.

Thus, I created new KBase Narratives for manual annotation, model-building, and downloading. For each notebook (linked below), I copied the genomes (contigs) from the [old narrative](https://narrative.kbase.us/narrative/ws.9599.obj.2) and manually ran the "Annotate Contigs" and "Build Metabolic Model" KBase apps. I then downloaded the annotated genomes (Genbank format) and models (SBML and tsv formats).

* SAGs: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12259.obj.1), workspace ID: joshamilton:1452727482251

* Mendota MAGs: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12268.obj.1), workspace ID: joshamilton:1452793144933

* Trout Bog MAGs: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12270.obj.1), workspace ID: joshamilton:1452798604003

* MAGs from other research groups: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12271.obj.1), workspace ID: joshamilton:1452801292037

Once the genomes were downloaded, I converted the Genbank-formatted genomes to fasta nucleotide (ffn), fasta amino acid (ffa), and gff formats. The scripts to do so are in the `refGenomes/scripts` folder.

## Processing SBML Files

Reconstructions from KBase require further processing before they are suitable for use in reverse ecology. The function below does a number of things:

1. Reformat gene locus tags
2. Remove biomass, exchange, transport, spontaneous, DNA/RNA biosynthesis reactions and their corresponding genes
3. Import metabolite formulas
4. Check mass- and charge-balancing of reactions in the reconstruction
5. Remove trailing 0s from reaction and metabolite names
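
To make step 4 concrete, here is a minimal sketch of a mass- and charge-balance check. It is not the code in `sbmlFunctions`. The helper names, the simple formula parser, and the example reaction (hexokinase, written with illustrative metabolite names rather than KBase compound IDs) are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C6H12O6'."""
    counts = defaultdict(int)
    for element, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        counts[element] += int(number) if number else 1
    return dict(counts)


def imbalance(stoichiometry, formulas, charges):
    """Return the net element and charge change of one reaction.

    stoichiometry maps metabolite -> coefficient (negative for reactants).
    An empty dict means the reaction is mass- and charge-balanced.
    """
    net = defaultdict(float)
    for met, coeff in stoichiometry.items():
        for element, count in parse_formula(formulas[met]).items():
            net[element] += coeff * count
        net['charge'] += coeff * charges[met]
    return {key: value for key, value in net.items() if abs(value) > 1e-9}


# Illustrative data (assumed, not taken from a KBase model).
formulas = {'glucose': 'C6H12O6', 'atp': 'C10H12N5O13P3',
            'g6p': 'C6H11O9P', 'adp': 'C10H12N5O10P2', 'h': 'H'}
charges = {'glucose': 0, 'atp': -4, 'g6p': -2, 'adp': -3, 'h': 1}

# glucose + ATP -> G6P + ADP + H+
hexokinase = {'glucose': -1, 'atp': -1, 'g6p': 1, 'adp': 1, 'h': 1}
balanced = imbalance(hexokinase, formulas, charges)

# The same reaction with the proton omitted is no longer balanced.
no_proton = {met: c for met, c in hexokinase.items() if met != 'h'}
unbalanced = imbalance(no_proton, formulas, charges)
```

Reporting the residual per element, rather than a simple pass/fail, makes it easier to see which metabolite formula or coefficient is at fault.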

The post-processing has a major shortcoming. When KBase detects that one or more subunits of a complex are present, it creates a "full" GPR by adding 'Unknown' genes for the other subunits. COBRApy currently lacks functions to remove these placeholder genes. As such, these models should not be used to perform any simulations which rely on GPRs.
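
Although the placeholder genes are not removed, the affected reactions can at least be listed. The sketch below assumes that GPRs are available as plain strings with `and`/`or` operators and parentheses, and that the placeholder is spelled exactly `Unknown`. The function name and example GPRs are hypothetical.

```python
import re


def has_unknown_gene(gpr, placeholder='Unknown'):
    """Return True if the GPR string contains the placeholder gene."""
    # Split on whitespace and parentheses so that only whole tokens match.
    tokens = re.findall(r'[^\s()]+', gpr)
    return placeholder in tokens


# Hypothetical reaction IDs and gene names, for illustration only.
gprs = {
    'rxn_a': '(geneA and Unknown)',
    'rxn_b': '(geneB or geneC)',
    'rxn_c': 'UnknownLikeGene',
}
flagged = sorted(rxn for rxn, gpr in gprs.items() if has_unknown_gene(gpr))
```

Matching on whole tokens avoids flagging a real locus tag that merely contains the word "Unknown".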

As output, the code writes the processed SBML files to the 'processedDataDir' folder and a summary of the model sizes to the 'summaryStatsDir' folder.

The first chunk of code imports the Python packages necessary for this analysis.

In [6]:
# Add the folder containing the custom modules to the Python path
import sys
sys.path.append('../Python')
import matplotlib

# Import Python modules 
# These custom-written modules should have been included with the package
# distribution. 
import sbmlFunctions as sf
import metadataFunctions as mf

# Define local folder structure for data input and processing.
processedDataDir = 'ProcessedModelFiles'
rawModelDir = 'RawModelFiles'
summaryStatsDir = 'DataSummaries'

Then we call a function which processes each SBML file and preps it for analysis.

In [2]:
sf.processSBMLforRE(rawModelDir, processedDataDir, summaryStatsDir)

Processing model BIN_10, 1 of 3
The remaining extracellular metabolites are:
cpd00067_e0
All reactions are balanced
Processing model MEint3864, 2 of 3
The remaining extracellular metabolites are:
cpd00067_e0
All reactions are balanced
Processing model MEint885, 3 of 3
The remaining extracellular metabolites are:
cpd00067_e0
All reactions are balanced


## Pruning Currency Metabolites

The next step is to prune currency metabolites from the metabolic network so that the network's directed graph better reflects physiological metabolic transformations. We follow the approach outlined by Ma and Zeng [2, 3] and later adopted by Borenstein et al. [1] in their "reverse ecology" paper.

__Provide some illustrative examples of why this is necessary.__
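
One way to see the problem is a toy network in which every substrate of a reaction is linked to every product. The sketch below is an illustration under stated assumptions, not part of the pipeline: it uses only two reactions (hexokinase and pyruvate kinase), plain metabolite names, and hypothetical helper functions. With ATP and ADP left in the graph, glucose appears to reach pyruvate in two steps, even though no carbon flows along that route.

```python
from collections import deque


def build_graph(reactions, currency=()):
    """Link each substrate to each product, skipping currency metabolites."""
    graph = {}
    for substrates, products in reactions:
        for s in substrates:
            if s in currency:
                continue
            for p in products:
                if p not in currency:
                    graph.setdefault(s, set()).add(p)
    return graph


def path_length(graph, source, target):
    """Breadth-first search; returns the number of steps, or None."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


reactions = [
    (('glucose', 'ATP'), ('G6P', 'ADP')),    # hexokinase
    (('PEP', 'ADP'), ('pyruvate', 'ATP')),   # pyruvate kinase
]

with_currency = path_length(build_graph(reactions), 'glucose', 'pyruvate')
without_currency = path_length(
    build_graph(reactions, currency={'ATP', 'ADP'}), 'glucose', 'pyruvate')
```

Here `with_currency` is 2 (glucose, ADP, pyruvate), while `without_currency` is `None` because the intermediate glycolytic reactions are not in this toy network. In a full reconstruction the pruned graph would connect the two only through the actual glycolytic intermediates.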

#### References
1. Borenstein, E., Kupiec, M., Feldman, M. W., & Ruppin, E. (2008). Large-scale reconstruction and phylogenetic analysis of metabolic environments. Proceedings of the National Academy of Sciences, 105(38), 14482–14487. http://doi.org/10.1073/pnas.0806162105
2. Ma, H., & Zeng, A. P. (2003). Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics, 19(2), 270–277. http://doi.org/10.1093/bioinformatics/19.2.270
3. Ma, H. W., & Zeng, A.-P. (2003). The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics, 19(11), 1423–1430. http://doi.org/10.1093/bioinformatics/btg177

Reaction pruning relies on checking each reaction against a set of criteria, and updating the stoichiometry as appropriate. Some criteria can easily be automated, such as removal of the following currency metabolites:

| Metabolite | KBase Compound ID | 
|---|---|
| CO2 | cpd00011 |
| H+ | cpd00067 |
| H2 | cpd11640 |
| H2CO3 | cpd00242 |
| H2O | cpd00001 |
| H2O2 | cpd00025 |
| H2S | cpd00239 |
| NH3 | cpd00013 |
| Nitrate | cpd00209 |
| Nitric oxide | cpd00418 |
| Nitrite | cpd00075 |
| O2 | cpd00007 |
| Phosphate | cpd00009 |
| Pyrophosphate | cpd00012 |
| Sulfate | cpd00048 |
| Sulfite | cpd00081 |

The script below loops over the set of ProcessedModelFiles and removes these metabolites from their associated reactions.

In [10]:
reload(sf)
# Define local folder structure for data input and processing.
modelDir = 'ProcessedModelFiles'
removeFile = '../externalData/currencyRemove.txt'

sf.pruningPhaseOne(modelDir, removeFile)

Processing model 1 of 3
Processing model 2 of 3
Processing model 3 of 3
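
Conceptually, this first pruning phase amounts to dropping the listed compounds from each reaction's stoichiometry. The sketch below is not the implementation of `pruningPhaseOne`. It assumes reactions are held as dictionaries of metabolite ID to coefficient and that IDs carry a compartment suffix after an underscore (as in `cpd00067_e0`). The `_c0` suffix and the `cpdA`/`cpdB` IDs are placeholders.

```python
# A subset of the table above: CO2, H+, H2O, phosphate.
CURRENCY = {'cpd00011', 'cpd00067', 'cpd00001', 'cpd00009'}


def base_id(metabolite_id):
    """Strip the compartment suffix: 'cpd00067_e0' -> 'cpd00067'."""
    return metabolite_id.split('_')[0]


def prune_reaction(stoichiometry, currency):
    """Return a copy of the reaction without currency metabolites."""
    return {met: coeff for met, coeff in stoichiometry.items()
            if base_id(met) not in currency}


# Hypothetical hydrolysis: cpdA + H2O -> cpdB + phosphate + H+
reaction = {'cpdA_c0': -1, 'cpd00001_c0': -1,
            'cpdB_c0': 1, 'cpd00009_c0': 1, 'cpd00067_c0': 1}
pruned = prune_reaction(reaction, CURRENCY)
```

A reaction made up entirely of currency metabolites would be left empty by this step, so a real implementation needs to decide whether to drop such reactions from the model.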




Other pruning steps cannot be done programmatically. Consider the criteria below:

| Metabolite | Criteria | KBase Compound ID | 
|---|---|---|
| CoA | Except reactions in CoA synthesis pathway |
| NTP/NDP | As carrier for phosphate group transfer |
| NTP/NMP | As carrier for phosphate group transfer |
| NAD(P)/NAD(P)H | As carrier for hydrogen transfer |
| FAD/FADH | As carrier for hydrogen transfer |
| Ferrocytochrome/Ferricytochrome | As carrier for hydrogen transfer |
| Reduced ferredoxin/Oxidized ferredoxin | As carrier for hydrogen transfer |
| Glutathione/Oxidized glutathione | As carrier for hydrogen transfer |
| Dihydrobiopterin/Tetrahydrobiopterin | As carrier for hydrogen transfer |
| Glutamate/Oxoglutarate | As carrier for amino group transfer |
| Glutamine/Glutamate | As carrier for amino group transfer |
| Pyruvate/Alanine | As carrier for amino group transfer |
| 5,10-Methenyl-THF, 5,10-Methylene-THF, THF | As carrier for one carbon unit transfer |
| 10-Formyl-THF, 5-Formyl-THF, THF | As carrier for one carbon unit transfer |
| 5-Formimino-THF, 5-Methyl-THF, THF | As carrier for one carbon unit transfer |
| S-Adenosyl-L-methionine, S-Adenosyl-L-homocysteine | As carrier for one carbon unit transfer |
| Adenosine 3',5'-bisphosphate/3'-Phosphoadenylyl sulfate | As carrier for sulfate group transfer |

__Note__: This list may not be exhaustive, as the KBase reaction database may contain other cofactors and carriers not identified by Ma and Zeng in the KEGG database.

Here, each reaction must be individually inspected to see if it meets the criteria. The script below identifies reactions containing the currency metabolites above, allowing the user to assess each reaction.
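
Separately from the pipeline's own script, the screening idea can be sketched as follows: flag any reaction in which both members of a carrier pair appear on opposite sides, and leave the decision to the user. The function name, the data layout, and the use of metabolite names instead of KBase compound IDs are assumptions.

```python
def flag_carrier_reactions(reactions, carrier_pairs):
    """Return (reaction ID, pair) tuples that need manual inspection.

    A reaction is flagged for a pair when one member is consumed and the
    other is produced, i.e. their coefficients have opposite signs.
    """
    flagged = []
    for rxn_id, stoich in sorted(reactions.items()):
        for a, b in carrier_pairs:
            if a in stoich and b in stoich and stoich[a] * stoich[b] < 0:
                flagged.append((rxn_id, (a, b)))
    return flagged


# Three pairs taken from the table above.
carrier_pairs = [('ATP', 'ADP'),
                 ('glutamate', 'oxoglutarate'),
                 ('pyruvate', 'alanine')]

# Hypothetical reaction IDs with illustrative stoichiometries.
reactions = {
    'rxn_1': {'glutamate': -1, 'pyruvate': -1,
              'oxoglutarate': 1, 'alanine': 1},
    'rxn_2': {'glucose': -1, 'ATP': -1, 'G6P': 1, 'ADP': 1},
    'rxn_3': {'G6P': -1, 'F6P': 1},
}
to_inspect = flag_carrier_reactions(reactions, carrier_pairs)
```

`rxn_1` is flagged for two pairs at once. In a transaminase like this, one pair is the amino-group carrier and the other is the transformation of interest, which is why the final call cannot be automated.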