# 'To Do' List

## Reverse Ecology Analysis

### Immediate

1. Remove poorly sampled tribes/clades from the analysis. A handful of tribes/clades are represented by only a small number of genomes; these genomes should be removed.
2. The tables on my GRC poster were made by hand. I want to write Python code to generate these tables to avoid the need for manual analysis.
  1. Write Python code to identify compounds which are consistently or differentially utilized within a taxonomic level. The output should be similar to the heatmaps from 'Resource Utilization' from my GRC poster.
  2. Write Python code to identify unique seed compounds for each tribe. The output should be similar to the table from 'Metabolic Competition' from my GRC poster.
  3. For the two tasks above, it would be useful to have a mapping of compounds to compound classes.
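
A minimal sketch of how the two table-generating tasks might look, assuming seed sets have already been computed and are available as a dictionary mapping each genome to a set of compound IDs. The function names, the `genome_to_tribe` mapping, and the toy genome, tribe, and compound labels are all hypothetical placeholders, not part of the existing pipeline.

```python
from collections import defaultdict


def group_by_tribe(seed_sets, genome_to_tribe):
    """Collect the seed sets of each tribe's genomes into a list."""
    grouped = defaultdict(list)
    for genome, seeds in seed_sets.items():
        grouped[genome_to_tribe[genome]].append(set(seeds))
    return dict(grouped)


def utilization_table(grouped):
    """For each tribe, the fraction of its genomes whose seed set contains each compound.

    A value of 1.0 means the compound is consistently present within the tribe;
    intermediate values indicate differential presence.
    """
    compounds = sorted(set().union(*(s for sets in grouped.values() for s in sets)))
    return {
        tribe: {c: sum(c in s for s in sets) / len(sets) for c in compounds}
        for tribe, sets in grouped.items()
    }


def unique_seed_compounds(grouped):
    """Compounds that appear in the seed sets of exactly one tribe."""
    tribe_union = {tribe: set().union(*sets) for tribe, sets in grouped.items()}
    unique = {}
    for tribe, seeds in tribe_union.items():
        others = set().union(*(s for t, s in tribe_union.items() if t != tribe))
        unique[tribe] = seeds - others
    return unique


# Toy data (hypothetical genome, tribe, and compound labels)
seed_sets = {"g1": {"cpdA", "cpdB"}, "g2": {"cpdA"}, "g3": {"cpdA", "cpdC"}}
genome_to_tribe = {"g1": "tribe1", "g2": "tribe1", "g3": "tribe2"}

grouped = group_by_tribe(seed_sets, genome_to_tribe)
table = utilization_table(grouped)
unique = unique_seed_compounds(grouped)
```

The nested dictionary returned by `utilization_table` could be passed to a plotting library to draw a heatmap, and a compound-to-class mapping could be applied to the compound keys before tabulating.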

### Analysis Using Reverse Ecology
I would like to perform the following analyses using the above metrics:
1. Expand the 'genomeMerging' notebook to allow analysis at different levels of taxonomic resolution.
2. More systematically evaluate seed sets at different taxonomic resolutions (genome, tribe, etc). This will give a picture of niche differentiation at different taxonomic levels.
3. Identifying differences between seed sets for genomes within the same tribe (or tribes within the same clade, etc) could identify exchanged metabolites.
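
The within-group comparison could be sketched as follows, assuming the same kind of genome-to-seed-set dictionary and a mapping from each genome to its group at whatever taxonomic level is being examined. The names `within_group_differences` and `genome_to_group` are hypothetical. A compound that is a seed for only one member of a pair is only a candidate for exchange; confirming it would also require checking that the partner can produce the compound.

```python
from itertools import combinations


def within_group_differences(seed_sets, genome_to_group):
    """For each pair of genomes in the same group, list the seeds found in only one of them.

    Returns {(a, b): (seeds only in a, seeds only in b)}.
    """
    differences = {}
    for a, b in combinations(sorted(seed_sets), 2):
        if genome_to_group[a] != genome_to_group[b]:
            continue  # only compare genomes that share a group
        differences[(a, b)] = (seed_sets[a] - seed_sets[b], seed_sets[b] - seed_sets[a])
    return differences


# Toy data (hypothetical labels)
seeds = {"g1": {"cpdA", "cpdB"}, "g2": {"cpdA", "cpdD"}, "g3": {"cpdC"}}
groups = {"g1": "tribe1", "g2": "tribe1", "g3": "tribe2"}
diffs = within_group_differences(seeds, groups)
```

Passing a genome-to-clade or genome-to-lineage mapping instead of a genome-to-tribe mapping repeats the comparison at a coarser resolution.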

### Validating Reverse Ecology
After discussing with the lab, I propose the following to validate the reverse ecology predictions:

  1. Pair w/ Sarahi: run genomes from her enrichment culture (Garcia et al Mol Ecol 2015) through my pipeline. Does reverse ecology predict the presence of the cooperative interactions proposed by Sarahi?
  2. Map metatranscriptome reads from OMD-TOIL and/or Mary Ann Moran to our genomes to confirm substrate utilization predictions.
  3. Develop correlation networks from the 16S tag data collected concurrently with our metagenomic samples. Positive and negative correlations should co-occur w/ cooperative and competitive interactions.
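
For the correlation-network idea, a rough sketch is shown here, assuming the 16S tag data have been reduced to one abundance vector per taxon across the same ordered set of samples. The taxon labels, the `threshold` value, and the function names are assumptions for illustration. Plain Pearson correlation is used only as a placeholder: correlations on relative abundances are affected by compositional effects, so a method designed for compositional data would likely be needed in practice.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def correlation_edges(abundance, threshold=0.8):
    """Return (taxon_a, taxon_b, sign, r) for every pair with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if abs(r) >= threshold:
            edges.append((a, b, "positive" if r > 0 else "negative", r))
    return edges


# Toy abundance table (hypothetical taxa, four samples each)
abundance = {
    "taxonA": [1, 2, 3, 4],
    "taxonB": [2, 4, 6, 8],
    "taxonC": [4, 3, 2, 1],
    "taxonD": [1, 3, 2, 3],
}
edges = correlation_edges(abundance)
```

The signed edges could then be compared against the predicted cooperative and competitive interactions for the same pairs of taxa.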

## Reproducible Research

As part of my commitment to reproducible research, I am interested in automating the following tasks:

1. Building phylogenetic tree
  1. Convert Perl scripts to Python
  2. Scripts to extract marker genes, run RAxML
  3. Write Python master function to carry out all of the above
  4. Integrate into IPython notebook

2. Computing genome statistics
  1. Python scripts to extract genome statistics from IMG metadata file
  2. Scripts to compute completeness, since this isn’t in the metadata
  3. Write Python master function to carry out all of the above
  4. Integrate into IPython notebook

## Optional Analyses

Should time permit, I would like to further delve into the following issues:

1. Clustering and visualization.
    1. Predictive metabolites. The dendrograms aren't the best method of visualization. Instead, construct NMDS plots to see if genomes separate by tribe, clade, or lineage.
    2. Dimension reduction techniques aren't working very well, probably because the number of dimensions (metabolites) is too high.
    3. Consider a machine learning approach to develop a classifier to predict lineage. Selected features would be indicative metabolites. This is probably not a priority.
    4. Come up with a reduction scheme to reduce the number of metabolites in the feature vector.
    
2. Genome phylogenetics
  1. Expand the phylogenetic tree w/ reference genomes from "non-FW" and "marine" lineages
  2. For SAGs, compare the phylogeny of 16S vs. marker genes

3. Reverse Ecology
  1. More robust pan-genome comparison using KBase orthologs
  2. Implement a routine to perform incompleteness simulations and evaluate robustness of results
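
One possible shape for the incompleteness simulation is sketched below: randomly discard a fraction of a genome's genes, rerun the analysis, and measure how similar the result is to the full-genome result. The `analysis` argument is a hypothetical callable that takes a set of genes and returns a set (for example, a seed set); the Jaccard index is one choice of similarity measure among several, and all names here are placeholders.

```python
import random


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def incompleteness_simulation(genes, analysis, fraction_removed, n_trials=100, seed=0):
    """Mean Jaccard similarity between full and randomly subsampled results.

    `analysis` is a hypothetical callable mapping a set of genes to a result set.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    genes = sorted(genes)
    full_result = analysis(set(genes))
    n_keep = round(len(genes) * (1 - fraction_removed))
    scores = []
    for _ in range(n_trials):
        kept = set(rng.sample(genes, n_keep))
        scores.append(jaccard(full_result, analysis(kept)))
    return sum(scores) / len(scores)


def identity_analysis(genes):
    """Stand-in analysis for demonstration: returns the gene set unchanged."""
    return set(genes)


toy_genes = {f"gene{i}" for i in range(10)}
complete = incompleteness_simulation(toy_genes, identity_analysis, fraction_removed=0.0)
half = incompleteness_simulation(toy_genes, identity_analysis, fraction_removed=0.5)
```

Sweeping `fraction_removed` over a range of values would give a curve showing how quickly the results degrade as genomes become less complete.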