# Obtaining and Processing Network Reconstructions from KBase

KBase (http://kbase.us) is a powerful resource for obtaining genome-scale network reconstructions for microbial genomes. These reconstructions are distributed as SBML files, which must be processed prior to reverse ecology analysis. This notebook describes how to obtain reconstructions from KBase, and how to process them.

## Obtaining and Preparing SBML Files

Briefly, genomes (as fasta files) for your organisms of interest can be pushed from your computer to KBase. Once there, a KBase Narrative (iPython notebook) can be used to build reconstructions for your genomes. To do so, follow the instructions in `Perl/README.md`.

## Obtaining and Preparing SBML Files: Update January 2016

To validate predictions made by this reverse ecology pipeline, I propose to map metatranscriptomic reads from OMD-TOIL to KBase-annotated genomes. Retrieving the genome annotations proved to be quite a pain!

* The KBase Narrative mentioned above ([click here](https://narrative.kbase.us/narrative/ws.9599.obj.2)) no longer contains annotated versions of the genomes.

* Furthermore, the code cells in that narrative generate errors during the annotation, model-building, and downloading steps.

I have attempted to communicate with the KBase team to resolve some of these issues, but have received no response.

Thus, I created new KBase Narratives for manual annotation, model-building, and downloading. For each notebook (linked below), I copied the genomes (contigs) from the [old narrative](https://narrative.kbase.us/narrative/ws.9599.obj.2) and manually ran the "Annotate Contigs" and "Build Metabolic Model" KBase apps. I then downloaded the annotated genomes (Genbank format) and models (SBML and tsv formats).

* SAGs: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12259.obj.1), workspace ID: joshamilton:1452727482251

* Mendota MAGs: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12268.obj.1), workspace ID: joshamilton:1452793144933

* Trout Bog MAGs: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12270.obj.1), workspace ID: joshamilton:1452798604003

* MAGs from other research groups: [KBase Narrative](https://narrative.kbase.us/narrative/ws.12271.obj.1), workspace ID: joshamilton:1452801292037

Once the genomes were downloaded, I converted the GenBank-formatted genomes to fasta nucleotide (ffn), fasta amino acid (ffa), and gff formats. The scripts to do so are in the `refGenomes/scripts` folder.
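The conversion scripts themselves live in `refGenomes/scripts` and are not reproduced here. As a rough illustration of the target formats only, the sketch below writes one fasta record and one GFF3 line for a single coding sequence. The function names, the example contig and locus tag, and the `KBase` source label are all hypothetical, and the sketch assumes the feature has already been parsed out of the GenBank file.

```python
def fasta_record(seq_id, sequence, width=60):
    """Format one fasta record, wrapping the sequence at `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + seq_id + "\n" + "\n".join(lines) + "\n"


def gff3_line(contig, start, end, strand, locus_tag, source="KBase"):
    """Format one CDS as a tab-separated GFF3 line.

    GFF3 coordinates are 1-based and inclusive. The nine columns are:
    seqid, source, type, start, end, score, strand, phase, attributes.
    """
    fields = [contig, source, "CDS", str(start), str(end),
              ".", strand, "0", "ID=" + locus_tag]
    return "\t".join(fields)


# Hypothetical feature, for illustration only
nucleotides = "ATGAAATAA"   # would go in the .ffn file
protein = "MK"              # would go in the amino acid fasta file

print(fasta_record("example_cds_1", nucleotides), end="")
print(fasta_record("example_cds_1", protein), end="")
print(gff3_line("example_contig_1", 1, 9, "+", "example_cds_1"))
```

Using the same identifier in all three files makes it possible to match a mapped read back to its gene and protein.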

## Processing SBML Files

Reconstructions from KBase require further processing before they are suitable for use in reverse ecology. The function below performs several steps:

1. Reformat gene locus tags
2. Remove biomass, exchange, spontaneous, and DNA/RNA biosynthesis reactions, along with their corresponding genes
3. Import metabolite formulas
4. Check mass- and charge-balancing of reactions in the reconstruction
5. Remove trailing 0s from reaction and metabolite names
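To make step 4 concrete, here is a minimal sketch of a mass- and charge-balance check. It is not the code in `sbmlFunctions`; it assumes simple formulas without parentheses (such as `C6H12O6`), and the metabolite identifiers and the `imbalance` helper are hypothetical.

```python
import re


def parse_formula(formula):
    """Return a dict of element counts for a formula such as 'C6H12O6'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def imbalance(stoichiometry, formulas, charges):
    """Return the net element and charge totals that are non-zero.

    `stoichiometry` maps metabolite id -> coefficient (negative for
    reactants, positive for products). An empty dict means balanced.
    """
    net = {}
    for met, coeff in stoichiometry.items():
        for element, n in parse_formula(formulas[met]).items():
            net[element] = net.get(element, 0) + coeff * n
        net["charge"] = net.get("charge", 0) + coeff * charges[met]
    return {key: value for key, value in net.items() if value != 0}


# Hexokinase: glucose + ATP -> glucose-6-phosphate + ADP + H+
formulas = {"glc": "C6H12O6", "atp": "C10H12N5O13P3",
            "g6p": "C6H11O9P", "adp": "C10H12N5O10P2", "h": "H"}
charges = {"glc": 0, "atp": -4, "g6p": -2, "adp": -3, "h": 1}
hexokinase = {"glc": -1, "atp": -1, "g6p": 1, "adp": 1, "h": 1}

print(imbalance(hexokinase, formulas, charges))  # {} when balanced
```

Leaving the proton out of the reaction would report one missing hydrogen and one missing unit of charge, which is the kind of error this step is meant to catch.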

The post-processing has a major shortcoming. When KBase detects that one or more subunits of a complex are present, it creates a "full" GPR (gene-protein-reaction rule) by adding 'Unknown' genes for the other subunits. COBRApy currently lacks functions to remove these placeholder genes. As such, these models should not be used to perform any simulations that rely on GPRs.
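Although the placeholder genes are not removed, it is straightforward to find out which reactions they affect. The sketch below is one possible way to do so, assuming each GPR is available as a plain string of gene names joined by `and`/`or` with optional parentheses. The helper names and the example GPRs are hypothetical.

```python
import re


def gpr_genes(gpr):
    """Return the set of gene identifiers named in a GPR string."""
    tokens = re.findall(r"[^\s()]+", gpr)
    return {t for t in tokens if t.lower() not in ("and", "or")}


def has_placeholder(gpr, placeholder="Unknown"):
    """True if the GPR refers to the placeholder gene."""
    return placeholder in gpr_genes(gpr)


# Hypothetical GPR strings keyed by reaction id
gprs = {
    "rxn_a": "(geneA and Unknown) or geneB",
    "rxn_b": "geneC or geneD",
}

affected = sorted(rxn for rxn, gpr in gprs.items() if has_placeholder(gpr))
print(affected)
```

A list like `affected` shows how much of a model depends on incomplete complexes before deciding whether a GPR-based analysis is worth attempting.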

As output, the code writes processed SBML files to the 'processedDataDir' folder. It also writes a summary of the model sizes to the 'summaryStatsDir' folder.

The first chunk of code imports the Python packages necessary for this analysis.

In [1]:
```python
# Make the custom modules in ../Python importable and enable inline plots
import sys
sys.path.append('../Python')
import matplotlib
%matplotlib inline

# Import Python modules
# These custom-written modules should have been included with the package
# distribution.
import sbmlFunctions as sf
import metadataFunctions as mf

# Define local folder structure for data input and processing.
processedDataDir = 'ProcessedModelFiles'
rawModelDir = 'RawModelFiles'
summaryStatsDir = 'DataSummaries'
```
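Before processing, it can help to confirm that the input folder actually contains models. The sketch below is an optional check, not part of the package: the `find_sbml_files` helper is hypothetical, and it assumes the SBML files carry an `.xml` extension somewhere beneath the raw model folder.

```python
from pathlib import Path


def find_sbml_files(model_dir, pattern="*.xml"):
    """Return a sorted list of files matching `pattern` beneath `model_dir`.

    Returns an empty list if the folder does not exist.
    """
    root = Path(model_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob(pattern))


# 'RawModelFiles' is the value assigned to rawModelDir in this notebook
sbml_files = find_sbml_files('RawModelFiles')
print('Found', len(sbml_files), 'SBML files')
```

A count of zero usually means the notebook was started from the wrong working directory.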

Then we call a function which processes each SBML file and preps it for analysis.

In [2]:
```python
sf.processSBMLforRE(rawModelDir, processedDataDir, summaryStatsDir)
```

Processing model AAA023D18, 1 of 72
All reactions are balanced
Processing model AAA023J06, 2 of 72
All reactions are balanced
Processing model AAA024D14, 3 of 72
All reactions are balanced
Processing model AAA027D23, 4 of 72
All reactions are balanced
Processing model AAA027E14, 5 of 72
All reactions are balanced
Processing model AAA027F04, 6 of 72
All reactions are balanced
Processing model AAA027J17, 7 of 72
All reactions are balanced
Processing model AAA027L06, 8 of 72
All reactions are balanced
Processing model AAA027L17, 9 of 72
All reactions are balanced
Processing model AAA027M14, 10 of 72
All reactions are balanced
Processing model AAA028A23, 11 of 72
All reactions are balanced
Processing model AAA028C09, 12 of 72
All reactions are balanced
Processing model AAA028E20, 13 of 72
All reactions are balanced
Processing model AAA028G02, 14 of 72
All reactions are balanced
Processing model AAA028I14, 15 of 72
All reactions are balanced
Processing model AAA028K15, 16 of 72
All reaction