# Obtaining and Processing Network Reconstructions from KBase

KBase (http://kbase.us) is a powerful resource for obtaining genome-scale network reconstructions for microbial genomes. These reconstructions are distributed as SBML files, which must be processed prior to reverse ecology analysis. This notebook describes how to obtain reconstructions from KBase, and how to process them.

## Obtaining and Preparing SBML Files

Briefly, genomes (as fasta files) for your organisms of interest can be pushed from your computer to KBase. Once there, a KBase Narrative (iPython notebook) can be used to build reconstructions for your genomes. To do so, follow the instructions in `Perl/README.md`.

## Obtaining and Preparing SBML Files: Update January 2016

To validate predictions made by this reverse ecology pipeline, I propose to map metatranscriptomic reads from OMD-TOIL to KBase-annotated genomes. Retrieving the genome annotations proved to be quite a pain!

* The KBase Narrative mentioned above ([click here](https://narrative.kbase.us/narrative/ws.9599.obj.2)) no longer contains annotated versions of the genomes.

* Furthermore, the code cells in that narrative generate errors during the annotation, model-building, and downloading steps.

I have attempted to communicate with the KBase team to resolve some of these issues, but have received no response.

Thus, I created new KBase Narratives for manual annotation, model-building, and downloading. For each notebook (linked below), I copied the genomes (contigs) from the [old narrative](https://narrative.kbase.us/narrative/ws.9599.obj.2) and manually ran the "Annotate Contigs" and "Build Metabolic Model" KBase apps. I then downloaded the annotated genomes (Genbank format) and models (SBML and tsv formats).

* SAGs: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12259.obj.1), workspace ID: joshamilton:1452727482251

* Mendota MAGs: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12268.obj.1), workspace ID: joshamilton:1452793144933

* Trout Bog MAGs: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12270.obj.1), workspace ID: joshamilton:1452798604003

* MAGs from other research groups: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12271.obj.1), workspace ID: joshamilton:1452801292037

Once the genomes were downloaded, I further converted the Genbank-formatted genomes to fasta nucleotide (ffn), fasta amino acid (ffa), and gff format. The scripts to do so are in the refGenomes/scripts folder.

## Processing SBML Files

Reconstructions from KBase require further processing before they are suitable for use in reverse ecology. The function below does a number of things:
1. Reformat gene locus tags
2. Remove biomass, exchange, transport, spontaneous, DNA/RNA biosynthesis reactions and their corresponding genes
3. Import metabolite formulas
4. Check mass- and charge-balancing of reactions in the reconstruction
5. Remove trailing 0s from reaction and metabolite names
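
As a rough illustration of step 4, a mass-balance check can be sketched by parsing each metabolite formula into element counts and summing them over the reaction, weighted by stoichiometric coefficient. This is a minimal sketch and not the implementation in `sbmlFunctions`: the function names, the metabolite IDs, and the neutral formulas are all assumptions made for the example, and charge balancing (which would follow the same pattern using metabolite charges) is omitted.

```python
import re
from collections import defaultdict

def parse_formula(formula):
    """Return element counts for a simple formula such as 'C6H12O6'."""
    counts = defaultdict(int)
    for element, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        counts[element] += int(number) if number else 1
    return dict(counts)

def mass_imbalance(stoichiometry, formulas):
    """Net element counts over a reaction; a balanced reaction returns {}.

    stoichiometry maps metabolite ID -> coefficient (negative = consumed).
    formulas maps metabolite ID -> chemical formula string.
    """
    totals = defaultdict(float)
    for metabolite, coefficient in stoichiometry.items():
        for element, count in parse_formula(formulas[metabolite]).items():
            totals[element] += coefficient * count
    return {el: total for el, total in totals.items() if abs(total) > 1e-9}

# Hypothetical toy data: hexokinase, written with neutral formulas.
formulas = {
    'glc': 'C6H12O6', 'atp': 'C10H16N5O13P3',
    'g6p': 'C6H13O9P', 'adp': 'C10H15N5O10P2',
}
hexokinase = {'glc': -1, 'atp': -1, 'g6p': 1, 'adp': 1}
print(mass_imbalance(hexokinase, formulas))  # {} because the reaction balances

broken = {'glc': -1, 'g6p': 1}               # ATP and ADP left out
print(mass_imbalance(broken, formulas))      # reports excess H, O, and P
```

A non-empty result names the elements that do not balance, which is usually enough to spot a missing metabolite or an incorrect formula.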

The post-processing has a major shortcoming. When KBase detects that one or more subunits of a complex are present, it creates a "full" GPR (gene-protein-reaction rule) by adding 'Unknown' genes for the other subunits. CobraPy currently lacks functions to remove these genes. As such, these models should not be used to perform any simulations which rely on GPRs.
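
Although the placeholder genes are not removed, the affected reactions can at least be flagged. The sketch below assumes each GPR is available as a plain string and that the placeholder appears as the literal token `Unknown`. The reaction IDs, gene names, and the helper `has_unknown_gene` are hypothetical.

```python
import re

def has_unknown_gene(gpr):
    """True if the GPR string contains a placeholder gene named 'Unknown'."""
    return re.search(r'\bUnknown\b', gpr) is not None

# Hypothetical GPR strings keyed by reaction ID.
gprs = {
    'rxn_a': '(geneA and Unknown)',
    'rxn_b': '(geneB or geneC)',
    'rxn_c': '',
}
flagged = sorted(rxn for rxn, gpr in gprs.items() if has_unknown_gene(gpr))
print(flagged)  # ['rxn_a']
```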

As output, the code returns processed SBML files in the 'processedDataDir' folder. It also returns a summary of the model sizes in the 'summaryStatsDir' folder.

The first chunk of code imports the Python packages necessary for this analysis.

In [4]:
# Import special features for iPython
import sys
sys.path.append('../Python')
import matplotlib

# Import Python modules 
# These custom-written modules should have been included with the package
# distribution. 
import sbmlFunctions as sf
import metadataFunctions as mf

# Define local folder structure for data input and processing.
processedDataDir = 'ProcessedModelFiles'
rawModelDir = 'RawModelFiles'
summaryStatsDir = 'DataSummaries'

Then we call a function which processes each SBML file and preps it for analysis.

In [None]:
sf.processSBMLforRE(rawModelDir, processedDataDir, summaryStatsDir)

## Pruning Currency Metabolites

The next step is to prune currency metabolites from the metabolic network, so that the network's directed graph better reflects physiological metabolic transformations. We adopt and expand an approach outlined by Ma and Zeng (2003), and later adopted by Borenstein et al. (2008) in their "reverse ecology" paper.

__Provide some illustrative examples of why this is necessary.__
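
One possible illustration, which is not taken from the original analysis: if every substrate is connected to every product, then hexokinase (glucose + ATP → G6P + ADP) and pyruvate kinase (PEP + ADP → pyruvate + ATP) together imply a two-step route from glucose to pyruvate via ADP. No such metabolic conversion exists. The sketch below builds that toy graph and shows that the shortcut disappears once ATP and ADP are pruned. The function names and the two-reaction network are assumptions made for the example.

```python
from collections import deque

def build_graph(reactions):
    """Directed graph with an edge from every substrate to every product."""
    graph = {}
    for substrates, products in reactions:
        for s in substrates:
            graph.setdefault(s, set()).update(products)
        for p in products:
            graph.setdefault(p, set())
    return graph

def shortest_path_length(graph, source, target):
    """Breadth-first search; returns the edge count, or None if unreachable."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

def prune(reactions, currency):
    """Drop currency metabolites from both sides of every reaction."""
    return [([s for s in subs if s not in currency],
             [p for p in prods if p not in currency])
            for subs, prods in reactions]

reactions = [
    (['glucose', 'ATP'], ['G6P', 'ADP']),    # hexokinase
    (['PEP', 'ADP'], ['pyruvate', 'ATP']),   # pyruvate kinase
]
raw = build_graph(reactions)
pruned = build_graph(prune(reactions, {'ATP', 'ADP'}))

print(shortest_path_length(raw, 'glucose', 'pyruvate'))     # 2, via ADP
print(shortest_path_length(pruned, 'glucose', 'pyruvate'))  # None
```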

#### References
1. Borenstein, E., Kupiec, M., Feldman, M. W., & Ruppin, E. (2008). Large-scale reconstruction and phylogenetic analysis of metabolic environments. Proceedings of the National Academy of Sciences, 105(38), 14482–14487. http://doi.org/10.1073/pnas.0806162105
2. Ma, H., & Zeng, A.-P. (2003). Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics, 19(2), 270–277. http://doi.org/10.1093/bioinformatics/19.2.270
3. Ma, H., & Zeng, A.-P. (2003). The connectivity structure, giant strong component and centrality of metabolic networks. Bioinformatics, 19(11), 1423–1430. http://doi.org/10.1093/bioinformatics/btg177

Briefly, "currency metabolites" are defined as metabolites which serve to transfer functional groups (such as phosphate in the case of ATP), as well as the functional groups themselves. To identify such metabolites in KBase, we scanned the ModelSEED [reaction database](https://github.com/ModelSEED/ModelSEEDDatabase/blob/master/Biochemistry/reactions.master.tsv) for metabolites listed in Ma et al, with the addition of cytochromes and quinones (H+ transfer) and acetyl-CoA/CoA (acetate transfer). We counted the occurrences of each metabolite, and only kept metabolite pairs where both members were present in a non-trivial number of reactions. The full set of such metabolites is defined in `externalData/currency*.txt`. Additional details can be found in `2016-05-11 Currency Metabolites.xlsx`.
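
The counting step can be sketched as follows. This is a toy version under stated assumptions: the reactions are small in-memory sets rather than rows parsed from the ModelSEED file, and `min_count` is a hypothetical stand-in for the "non-trivial number of reactions" threshold, which is not specified here.

```python
from collections import Counter

def count_occurrences(reactions):
    """Count the number of reactions in which each metabolite appears."""
    counts = Counter()
    for metabolites in reactions:
        counts.update(set(metabolites))
    return counts

def frequent_pairs(candidate_pairs, counts, min_count):
    """Keep pairs whose members both appear in at least min_count reactions."""
    return [pair for pair in candidate_pairs
            if all(counts[m] >= min_count for m in pair)]

# Hypothetical toy reactions, each given as its set of participants.
reactions = [
    {'glucose', 'ATP', 'G6P', 'ADP'},
    {'PEP', 'ADP', 'pyruvate', 'ATP'},
    {'F6P', 'ATP', 'FBP', 'ADP'},
    {'malate', 'NAD', 'oxaloacetate', 'NADH'},
]
candidate_pairs = [('ATP', 'ADP'), ('NAD', 'NADH')]

counts = count_occurrences(reactions)
print(frequent_pairs(candidate_pairs, counts, min_count=2))  # [('ATP', 'ADP')]
```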

We then wrote a function to update the stoichiometry of all reactions containing these metabolites.
1. First, all pairs of currency metabolites are removed. The set of such pairs is listed in `externalData/currencyRemovePairs.txt`.
2. Some currency pairs involved in amino acid metabolism are subject to additional scrutiny, and removed only if a free amino group does not participate in the reaction. This ensures that reactions which synthesize these compounds are retained. The set of such pairs is listed in `externalData/currencyAminoPairs.txt`.
3. Finally, all metabolites which represent free forms of functional groups are removed (H+, NH4+, CO2, O2, H2O, etc). The set of such metabolites is listed in `externalData/currencyRemoveSingletons.txt`.
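
The three steps above can be sketched on a single reaction, represented as a dictionary from metabolite to stoichiometric coefficient. This is a simplified sketch, not the code in `sf.pruneCurrencyMetabs`. It assumes that a pair is removed only when its members sit on opposite sides of the reaction, and it uses small in-memory lists in place of the `currency*.txt` files, whose format is not described here. All names below are hypothetical.

```python
def prune_reaction(stoich, pairs, amino_pairs, singletons, free_amino):
    """Return a copy of stoich with currency metabolites removed.

    stoich maps metabolite -> coefficient (negative = substrate).
    """
    pruned = dict(stoich)

    def opposite_sides(a, b):
        return a in pruned and b in pruned and pruned[a] * pruned[b] < 0

    # Step 1: remove ordinary currency pairs.
    for a, b in pairs:
        if opposite_sides(a, b):
            del pruned[a], pruned[b]

    # Step 2: remove amino pairs unless a free amino group takes part.
    has_free_amino = any(m in stoich for m in free_amino)
    for a, b in amino_pairs:
        if opposite_sides(a, b) and not has_free_amino:
            del pruned[a], pruned[b]

    # Step 3: remove free forms of functional groups.
    for m in singletons:
        pruned.pop(m, None)
    return pruned

pairs = [('ATP', 'ADP'), ('NADPH', 'NADP')]
amino_pairs = [('glutamate', '2-oxoglutarate')]
singletons = ['H+', 'NH4+', 'H2O']
free_amino = ['NH4+']

# Transaminase: glutamate only donates its amino group, so the pair is removed.
transaminase = {'glutamate': -1, 'pyruvate': -1,
                '2-oxoglutarate': 1, 'alanine': 1}
print(prune_reaction(transaminase, pairs, amino_pairs, singletons, free_amino))

# Glutamate dehydrogenase: free NH4+ participates, so glutamate is retained.
gdh = {'2-oxoglutarate': -1, 'NH4+': -1, 'NADPH': -1, 'H+': -1,
       'glutamate': 1, 'NADP': 1, 'H2O': 1}
print(prune_reaction(gdh, pairs, amino_pairs, singletons, free_amino))
```

The order matters: the free amino group is checked against the original reaction before step 3 strips NH4+, so the glutamate-synthesizing reaction keeps its 2-oxoglutarate → glutamate edge.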

The cell below loops over the models in the `ProcessedModelFiles` folder and removes these metabolites from their associated reactions.

In [7]:
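# Reload sbmlFunctions to pick up edits made since it was first imported.
# (reload is a builtin in Python 2; in Python 3, use importlib.reload.)
reload(sf)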
# Define local folder structure for data input and processing.
modelDir = 'ProcessedModelFiles'
singletonFile = '../externalData/currencyRemoveSingletons.txt'
pairFile = '../externalData/currencyRemovePairs.txt'
aminoFile = '../externalData/currencyAminoPairs.txt'
summaryStatsDir = 'DataSummaries'

sf.pruneCurrencyMetabs(modelDir, summaryStatsDir, singletonFile, pairFile, aminoFile)

Processing model 1 of 63
Processing model 2 of 63
Processing model 3 of 63
Processing model 4 of 63
Processing model 5 of 63
Processing model 6 of 63
Processing model 7 of 63
Processing model 8 of 63
Processing model 9 of 63
Processing model 10 of 63
Processing model 11 of 63
Processing model 12 of 63
Processing model 13 of 63
Processing model 14 of 63
Processing model 15 of 63
Processing model 16 of 63
Processing model 17 of 63
Processing model 18 of 63
Processing model 19 of 63
Processing model 20 of 63
Processing model 21 of 63
Processing model 22 of 63
Processing model 23 of 63
Processing model 24 of 63
Processing model 25 of 63
Processing model 26 of 63
Processing model 27 of 63
Processing model 28 of 63
Processing model 29 of 63
Processing model 30 of 63
Processing model 31 of 63
Processing model 32 of 63
Processing model 33 of 63
Processing model 34 of 63
Processing model 35 of 63
Processing model 36 of 63
Processing model 37 of 63
Processing model 38 of 63
Processing model 39 o