# 'To Do' List

## Additional Metabolic Models

I would like to include all sequenced FW actinobacterial genomes in my RE analysis. This requires the following steps:

6. Rerun the reverseEcology and mergingGenomes IPython notebooks using the updated ExternalData files

**At the end of this analysis, I will have the final set of genomes, models, etc., which will be the subject of analysis.**

## Reverse Ecology Analysis

### New Calculations
1. The dendrograms aren't the best method of visualization. Instead, construct NMDS plots to see if genomes separate by tribe, clade, or lineage.
2. The tables on my GRC poster were made by hand. I want to write Python code to generate these tables to avoid the need for manual analysis.
  1. Write Python code to identify compounds which are consistently or differentially utilized within a taxonomic level. The output should be similar to the heatmaps from 'Resource Utilization' from my GRC poster.
  2. Write Python code to identify unique seed compounds for each tribe. The output should be similar to the table from 'Metabolic Competition' from my GRC poster.
  3. For 1 and 2 above, it would be useful to have a mapping of compounds to compound classes.
3. I want to compute new RE metrics, as described below
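
As a starting point for the table-generating code in item 2, here is a minimal sketch using plain Python sets. It assumes seed sets are already available as a dictionary mapping a genome or tribe name to a set of compound names; the function names and the toy data are hypothetical and not part of the existing notebooks.

```python
def classify_utilization(seed_sets_by_genome):
    """Split compounds into those in every seed set and those in only some.

    Returns (consistent, differential) as two sets.
    """
    all_sets = [set(seeds) for seeds in seed_sets_by_genome.values()]
    if not all_sets:
        return set(), set()
    union = set().union(*all_sets)
    consistent = set.intersection(*all_sets)
    return consistent, union - consistent


def unique_seed_compounds(seed_sets_by_tribe):
    """Return, for each tribe, the seed compounds found in no other tribe."""
    unique = {}
    for tribe, seeds in seed_sets_by_tribe.items():
        others = set()
        for other_tribe, other_seeds in seed_sets_by_tribe.items():
            if other_tribe != tribe:
                others |= set(other_seeds)
        unique[tribe] = set(seeds) - others
    return unique


# Toy data with made-up tribe and compound names
example = {
    "tribeA": {"glucose", "acetate", "putrescine"},
    "tribeB": {"glucose", "xylose"},
    "tribeC": {"acetate", "glucose"},
}
consistent, differential = classify_utilization(example)
unique = unique_seed_compounds(example)
```

The resulting sets could then be written out as a table or passed to a heatmap routine, grouped by compound class once that mapping exists.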


### Additional Reverse Ecology Metrics
I want to expand my code to compute the following additional metrics:
1. Environmental Scope Index, the fraction of environments (seed sets) on which an organism can grow
2. Cohabitation Score, the number of species which are also viable in each environment where the species under study is viable. This more precisely places a bound on potential for cooperation, as it considers the entire seed set and not just shared seed compounds.
3. Effective Metabolic Overlap, which represents the ability of one organism to tolerate competition by another.

**Note: for metrics which incorporate the 'scope' of a metabolite, I want to use the 'reachability' for reasons of internal consistency.**
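
The first two metrics can be sketched directly from their definitions above. This sketch assumes viability has already been computed elsewhere (for example, by network expansion on each seed set) and is stored as a dictionary mapping each organism to the set of environments on which it can grow. The function names, the data layout, and the toy data are assumptions for illustration.

```python
def environmental_scope_index(viable_envs, organism, all_envs):
    """Fraction of environments (seed sets) on which the organism can grow."""
    return len(viable_envs[organism] & all_envs) / len(all_envs)


def cohabitation_counts(viable_envs, organism):
    """For each environment where the organism is viable, count the other
    organisms that are also viable there."""
    counts = {}
    for env in viable_envs[organism]:
        counts[env] = sum(
            1
            for other, envs in viable_envs.items()
            if other != organism and env in envs
        )
    return counts


# Toy data with made-up organism and environment names
viable = {
    "orgA": {"env1", "env2"},
    "orgB": {"env2", "env3"},
    "orgC": {"env2"},
}
environments = {"env1", "env2", "env3", "env4"}

esi_a = environmental_scope_index(viable, "orgA", environments)
cohab_a = cohabitation_counts(viable, "orgA")
```

Whether the per-environment counts should be reported individually or summarized (for example, as a mean) is left open here.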

I would also like to develop novel RE metrics which extend beyond pairs to more complicated interactions. **However, this is a low priority, as it seems even the pairwise metrics I have computed thus far are very informative.** Some possible ideas:

1. The simplest approach would be to compute all observed pairwise interactions and average
2. For a more complex approach, consider a three-organism interaction: the weight of a given metabolite in RE metrics could depend on the number of pairwise interactions each metabolite participates in. This might require "sampling" from all possible communities of size two, three, etc.
3. Finally, a re-conceptualization of metrics from the Zelezniak paper to fit within an RE framework could be interesting.
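
The simplest idea (item 1) can be sketched as follows. The sketch assumes pairwise scores are stored in a dictionary keyed by ordered pairs, since a pairwise metric need not be symmetric; pairs with no recorded score are skipped. The function name and toy scores are hypothetical.

```python
from itertools import permutations


def mean_pairwise_score(community, pairwise_scores):
    """Average a pairwise metric over all ordered pairs in a community.

    Only pairs present in pairwise_scores are used. Returns None if no
    pair in the community has a score.
    """
    values = [
        pairwise_scores[pair]
        for pair in permutations(community, 2)
        if pair in pairwise_scores
    ]
    if not values:
        return None
    return sum(values) / len(values)


# Toy scores with made-up organism names
scores = {
    ("A", "B"): 0.2, ("B", "A"): 0.4,
    ("A", "C"): 0.6, ("C", "A"): 0.0,
    ("B", "C"): 0.5, ("C", "B"): 0.1,
}
community_mean = mean_pairwise_score(["A", "B", "C"], scores)
```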

### Analysis Using Reverse Ecology
I would like to perform the following analyses using the above metrics:
1. Expand the 'genomeMerging' notebook to allow analysis at different levels of taxonomic resolution. 
2. More systematically evaluate seed sets at different taxonomic resolutions (genome, tribe, etc.). This will give a picture of niche differentiation at different taxonomic levels.
3. Identification of differences between seed sets for genomes within the same tribe (or tribes within the same clade, etc.) could identify exchanged metabolites.
4. Develop a classifier to predict the trophic level of FW microbes
  1. Ricardo Cavicchioli has developed such a classifier for marine microbes based on their genome sequence (Lauro et al PNAS 2009)
  2. Lifestyles could be based on the framework of Livermore et al Environ Microbiol 2014
5. Explore the relationship between phylogeny and traits
  1. To what extent are differences explained by broad-scale phylogeny vs. finer niche differentiation?
  2. This idea has been explored by Tony Ives - Helmus and Ives, American Naturalist 2007

### Validating Reverse Ecology
I am struggling with how best to validate the results of our analyses. At a broad level, I have the following ideas:

1. What are the values of the various metrics for known organisms?
2. Examples - FW lifestyles from Livermore et al Environ Microbiol 2014
3. Evaluate RE papers for large-scale calculations
4. PCA (or other metric) for clustering of genomes based on nutrient profiles

At conferences this summer, I was frequently asked about experimentally validating predictions, such as nutrient requirements or competition. Is experimental validation of individual predictions really necessary?

## Working with Metadata

### Validating ANI and Phylogeny
My ANI- and COV-based approach for classifying GFMs based on SAGs should be validated. Here are two possible approaches:

1. Compute pairwise ANI across all actinobacterial families (or another suitable taxonomic level) to generate a phylum-specific ANI cutoff for that level
2. Benchmark the approach against other tribes for which we have a lot of SAGs, such as LD12. The same ANI and COV cutoffs selected for the Actinos should also work for LD12.

The Luna genomes are problematic with our current approach, with pairs of genomes having coverage as low as 3 to 5%. **I don't yet know what I want to do about this.** One possibility is to use the new tool developed by JGI instead of Sarah's scripts.
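
To make the benchmarking concrete, here is a minimal sketch of a cutoff-based classifier. It is not a transcription of the existing scripts: it assumes each GFM has a list of `(sag_id, ani, cov)` hits against SAGs of known tribe, and it assigns the tribe of the highest-ANI hit that passes both cutoffs. The function name, identifiers, and cutoff values are placeholders, not the cutoffs used in this project.

```python
def classify_gfm(hits, sag_tribe, ani_cutoff, cov_cutoff):
    """Assign a GFM to the tribe of its best SAG hit, or None if no hit passes.

    hits: list of (sag_id, ani, cov) tuples, with ani and cov in percent.
    sag_tribe: dict mapping sag_id to tribe name.
    """
    passing = [h for h in hits if h[1] >= ani_cutoff and h[2] >= cov_cutoff]
    if not passing:
        return None
    best = max(passing, key=lambda h: h[1])  # highest ANI among passing hits
    return sag_tribe[best[0]]


# Toy data with made-up identifiers and values
hits = [("SAG_1", 97.2, 45.0), ("SAG_2", 98.5, 4.0), ("SAG_3", 88.0, 60.0)]
sag_tribe = {"SAG_1": "tribeX", "SAG_2": "tribeY", "SAG_3": "tribeZ"}

strict = classify_gfm(hits, sag_tribe, ani_cutoff=95.0, cov_cutoff=20.0)
lenient = classify_gfm(hits, sag_tribe, ani_cutoff=95.0, cov_cutoff=3.0)
```

In the toy data, lowering the coverage cutoff changes the assignment, which is the kind of sensitivity the low-coverage Luna pairs would expose in a benchmark.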

## Reproducible Research

As part of my commitment to reproducible research, I am interested in automating the following tasks:

1. Building phylogenetic tree
  1. Convert Perl scripts to Python
  2. Scripts to extract marker genes, run RAxML
  3. Write Python master function to carry out all of the above
  4. Integrate into iPython notebook

2. Computing genome statistics
  1. Python scripts to extract genome statistics from IMG metadata file
  2. Scripts to compute completeness, since this isn’t in the metadata
  3. Write Python master function to carry out all of the above
  4. Integrate into iPython notebook

## Optional Analyses

Should time permit, I would like to further delve into the following issues:

1. Genome phylogenetics
  1. Expand the phylogenetic tree w/ reference genomes from "non-FW" and "marine" lineages
  2. For SAGs, compare the phylogeny of 16S vs. marker genes
  3. SAG AAA028-N15 is possibly mis-classified. It has a weird coverage pattern and needs to be decontaminated

2. Genome Completeness Estimates
  1. Consider bootstrapping for probability estimates of conserved genes
  2. Construct a completeness estimate based on the probability of seeing the observed number of conserved genes (e.g., Podar et al Biol Direct 2012)

3. Reverse Ecology
  1. More robust pan-genome comparison using KBase orthologs
  2. Implement a routine to perform incompleteness simulations and evaluate robustness of results
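
For the probability-based completeness estimate under "Genome Completeness Estimates" above, one simple way to frame the calculation is a binomial model. This is an illustrative assumption, not necessarily the approach of Podar et al: if a genome is a fraction `c` complete and each of `n_markers` conserved genes is recovered independently with probability `c`, the probability of observing exactly `n_found` of them is binomial. The function names and numbers are hypothetical.

```python
from math import comb


def prob_observed(n_markers, n_found, completeness):
    """Binomial probability of finding exactly n_found of n_markers genes."""
    return (
        comb(n_markers, n_found)
        * completeness ** n_found
        * (1.0 - completeness) ** (n_markers - n_found)
    )


def completeness_profile(n_markers, n_found, steps=100):
    """Evaluate the probability over a grid of candidate completeness values."""
    grid = [i / steps for i in range(steps + 1)]
    return [(c, prob_observed(n_markers, n_found, c)) for c in grid]


# Toy example: 80 of 100 conserved genes observed
profile = completeness_profile(n_markers=100, n_found=80)
best_c, best_p = max(profile, key=lambda pair: pair[1])
```

The grid value with the highest probability matches the naive estimate `n_found / n_markers`; the added value of the profile is that it shows how quickly the probability falls off on either side.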