# Gene Expression Among acI Clades: Mapping Metatranscriptomes to Composite Genomes

## Overview

We believe that reverse ecology metrics are sensitive to genome completeness. We also observed that, for a single tribe, at least four single-cell amplified genomes (SAGs) or metagenome-assembled genomes (MAGs) were required before the pan-genome was "complete", as judged by single-copy marker genes. Because we only have four genomes for two tribes, we cannot confidently perform reverse ecology analysis at the tribe level. At the clade level, we have 12 genomes from clade acI-A, 21 from clade acI-B, and 3 from clade acI-C. This should allow us to make confident predictions about the differences between the acI-A and acI-B clades.

We decided to map metatranscriptome samples to the "pan-genome" of each clade. To construct the pan-genome, we used our reference genome collection to define actinobacterial COGs (clusters of orthologous groups), and defined the pan-genome of a clade as the union of all COGs present in at least one genome. To obtain the pan-genome sequence, we aligned all sequences belonging to each COG and obtained a consensus sequence for each COG. We then mapped a number of metatranscriptomes to each consensus genome and looked for differentially expressed genes.
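The pan-genome definition above (the union of all COGs present in at least one genome of a clade) can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this repo: the function name `clade_pan_genome`, the `cog_presence` mapping, and the toy genome names are all hypothetical.

```python
def clade_pan_genome(cog_presence, clade_genomes):
    """Return the sorted COGs present in at least one genome of the clade.

    cog_presence  -- dict mapping genome name -> set of COG identifiers (assumed layout)
    clade_genomes -- iterable of genome names belonging to the clade
    """
    pan = set()
    for genome in clade_genomes:
        # Union in every COG observed in this genome; unknown genomes add nothing.
        pan |= cog_presence.get(genome, set())
    return sorted(pan)


# Toy data with hypothetical genome names.
cog_presence = {
    "genomeA": {"group00000", "group00001"},
    "genomeB": {"group00001", "group00002"},
    "genomeC": {"group00003"},
}
pan = clade_pan_genome(cog_presence, ["genomeA", "genomeB"])
print(pan)  # ['group00000', 'group00001', 'group00002']
```

Because the union only requires a COG to appear once, a single incomplete genome cannot remove a COG from the pan-genome, which is why adding genomes moves the pan-genome toward "completeness".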

## Creation of Composite Genomes

For computational details, please see my [OrthoMCL repo](https://github.com/joshamilton/OrthoMCL), which will be merged into this one at a later date.

### Identification of Actinobacterial COGs
We used OrthoMCL to identify COGs in a set of 72 freshwater actinobacterial genomes. OrthoMCL is an algorithm for grouping proteins into orthologous gene families based on sequence similarity. It takes as input a set of protein sequences and returns a list of COGs and the proteins that belong to each COG. The OrthoMCL pipeline consists of the following steps:

1. Convert KBase-annotated genomes (in GenBank format) to fasta amino acid (`faa`) format.
2. Format `faa` files to be compatible with OrthoMCL (script `01faaParser.py`).
3. Run all-vs-all BLAST on the concatenated set of protein sequences (script `02parallelBlast`).
4. Initialize the MySQL server to store OrthoMCL output and run OrthoMCL (scripts `setupMySql.sh` and `runOrthoMCL.sh`).
5. Rearrange the OrthoMCL output into a user-friendly format (script `05parseCOGs`). This script returns three tables, structured as follows:

#### cogTable
A table listing the locus tags associated with each (genome, COG) pair.

|   | AAA023D18 | AAA023J06 | AAA024D14 |
|---|---|---|---|
| group00000 | AAA023D18.genome.CDS.1002; AAA023D18.genome.CDS.925; AAA023D18.genome.CDS.939 | AAA023J06.genome.CDS.1227; AAA023J06.genome.CDS.862 |  |
| group00001 | AAA023D18.genome.CDS.800 | AAA023J06.genome.CDS.798 | AAA024D14.genome.CDS.945; AAA024D14.genome.CDS.1601 |

For example, in genome AAA023D18, the following genes belong to group00000: AAA023D18.genome.CDS.1002, AAA023D18.genome.CDS.925, and AAA023D18.genome.CDS.939.
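As an illustration of how such a table might be assembled, the sketch below groups locus tags by (COG, genome) pair. It assumes the OrthoMCL groups output has lines of the form `group00000: genome|locus_tag genome|locus_tag ...`; that layout, and the function names `build_cog_table` and `cell`, are assumptions made for this example rather than a description of `05parseCOGs`.

```python
from collections import defaultdict


def build_cog_table(group_lines):
    """Map COG -> genome -> list of locus tags.

    Each input line is assumed to look like:
    'group00000: GENOME|LOCUS_TAG GENOME|LOCUS_TAG ...'
    """
    table = defaultdict(lambda: defaultdict(list))
    for line in group_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        cog, members = line.split(":", 1)
        for member in members.split():
            genome, locus_tag = member.split("|", 1)
            table[cog.strip()][genome].append(locus_tag)
    return table


def cell(table, cog, genome):
    """Render one table cell: locus tags joined by '; ', or '' if none."""
    return "; ".join(table.get(cog, {}).get(genome, []))


lines = [
    "group00000: AAA023D18|AAA023D18.genome.CDS.1002 "
    "AAA023D18|AAA023D18.genome.CDS.925 "
    "AAA023D18|AAA023D18.genome.CDS.939 "
    "AAA023J06|AAA023J06.genome.CDS.1227 "
    "AAA023J06|AAA023J06.genome.CDS.862",
    "group00001: AAA023D18|AAA023D18.genome.CDS.800 "
    "AAA023J06|AAA023J06.genome.CDS.798",
]
cog_table = build_cog_table(lines)
print(cell(cog_table, "group00000", "AAA023D18"))
```

A genome with no member in a COG yields an empty cell, as for AAA024D14 and group00000 in the table above.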

#### annotTable
A table listing the annotations associated with each (genome, COG) pair.

|   | AAA023D18 | AAA023J06 | AAA024D14 |
|---|---|---|---|
| group00000 | Short-chain dehydrogenase/reductase in hypothetical Actinobacterial gene cluster; hypothetical protein; 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100) | 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100); 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100) |  |
| group00001 | DNA gyrase subunit A (EC 5.99.1.3) | DNA gyrase subunit A (EC 5.99.1.3) | DNA gyrase subunit A (EC 5.99.1.3); Topoisomerase IV subunit A (EC 5.99.1.-) |

#### annotSummary
This table provides a list of all annotations associated with the genes in a COG. It can be further manually parsed to reveal the distribution of annotations associated with a COG. For example, COG00000 contains 94 genes across 72 genomes, as follows:

| Annotation | Counts |
|------------|--------|
| 3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100) | 63 |
| Short-chain dehydrogenase/reductase in hypothetical Actinobacterial gene cluster | 11 |
| None Provided | 9 |
| hypothetical protein | 3 |
| COG1028: Dehydrogenases with different specificities (related to short-chain alcohol dehydrogenases) | 2 |
| 2,3-butanediol dehydrogenase, S-alcohol forming, (S)-acetoin-specific (EC 1.1.1.76) | 1 |
| D-beta-hydroxybutyrate dehydrogenase (EC 1.1.1.30) | 1 |
| Oxidoreductase, short chain dehydrogenase/reductase family | 1 |
| Short-chain dehydrogenase/reductase SDR | 1 |
| Acetoacetyl-CoA reductase (EC 1.1.1.36) | 1 |
| short chain dehydrogenase | 1 |

This COG appears to be a 3-oxoacyl-[acyl-carrier protein] reductase.
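The manual parsing described above amounts to tallying annotation strings and reading off the most frequent one. A minimal sketch using only the standard library follows; the function name `summarize_annotations` and the toy input list are hypothetical, and the counts here are illustrative rather than the real data.

```python
from collections import Counter


def summarize_annotations(annotations):
    """Return (annotation, count) pairs sorted from most to least frequent."""
    return Counter(annotations).most_common()


# Toy input: one annotation string per gene in a COG (not the real 94 genes).
annotations = [
    "3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100)",
    "hypothetical protein",
    "3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100)",
    "3-oxoacyl-[acyl-carrier protein] reductase (EC 1.1.1.100)",
]
summary = summarize_annotations(annotations)
for annotation, count in summary:
    print(f"| {annotation} | {count} |")
```

The first row of the summary is the plurality annotation, which is the basis for calls such as the one above. Annotations that differ only in wording are counted separately, so the tally still needs a manual look.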

### Sequence Clustering and Generation of Composite Genome

Once protein sequences were assigned to COGs, we developed composite genomes for each lineage, clade, and tribe represented by one or more reference genomes. Briefly, the process works as follows:

1. For each tribe (clade, lineage), look up the genomes associated with that tribe. For example, the tribe acI-A1 has two genomes, `AAA027M14` and `AAA278O22`.

2. For each COG, retrieve the coding sequences from the appropriate genomes. For example, COG00000 contains two sequences from tribe acI-A1:

        >AAA027M14.genome.CDS.1561
        ATGAAAGATAACTCGAATAAAGGCATTCTCATCTTCGGAGGAGCACGTGGTATCGGAGGC...

        >AAA278O22.genome.CDS.1772
        ATGAGTAAGCGTTTAGAGGGAAGAGTCGCAGTAATTACCGGTGCAGGTAGTGGAATCGGT...

3. Align the sequences using MUSCLE. This step is skipped in the event a COG contains only a single sequence across a tribe. For the above two sequences, the alignment begins:

        >AAA027M14.genome.CDS.1561
        ATGAAAGATAACTC---GAATAAAGGCATTCTCATCTTCGGAGGAGCACGTGGTATCGGA...

        >AAA278O22.genome.CDS.1772
        ATGAGTAAGCGTTTAGAGGGAAGAGTCGCAGTAATTACCGGTGCAGGTAGTGGAATCGGT...

4. Obtain the consensus sequence for the alignment from Step 3 using the `cons` program from the EMBOSS toolkit. For each position in the alignment, `cons` selects the most common nucleotide, or indicates an ambiguous nucleotide if no nucleotide has a majority. In the example above, the consensus sequence begins:

        >group00000
        ATGAAAGATAACTCAGAGAATAAAGGCATTCTCATCTTCGGAGGAGCACGTGGTATCGGA...

    In the same composite genome, the consensus sequence for COG00001 contains ambiguous positions:
    
        >group00001
        NNNNNTNNANATAANAACGNACCNGAAGANNNNNNNNNNNNNNNNNNNNNNNTNGCANNG...

The above steps are all documented in the script `06getConsensusSequence.py`, the output of which is a set of `tribe-group.cons` files, giving the consensus sequence for each (tribe, COG) pairing. In the event a COG contains only a single sequence across a tribe, the `tribe-group.cons` file contains that sequence.
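To make the consensus step concrete, here is a deliberately simplified majority-vote consensus over an alignment. It is a stand-in for illustration only and not a reimplementation of EMBOSS `cons`, which applies its own scoring and thresholds and can therefore give different results on the same alignment. The function name `majority_consensus` and the strict-majority rule are assumptions of this sketch.

```python
from collections import Counter


def majority_consensus(aligned, ambiguous="N"):
    """Simplified consensus of equal-length aligned sequences.

    For each column, emit the nucleotide held by a strict majority of the
    sequences; emit `ambiguous` otherwise. Columns made up only of gaps
    are dropped.
    """
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column if base != "-")
        if not counts:
            continue  # all-gap column contributes nothing
        base, n = counts.most_common(1)[0]
        # Strict majority is measured against all sequences, gaps included.
        consensus.append(base if 2 * n > len(column) else ambiguous)
    return "".join(consensus)


print(majority_consensus(["ATGA", "ATGC", "ATGA"]))  # ATGA
print(majority_consensus(["AC", "AG"]))              # AN
print(majority_consensus(["ACGT"]))                  # ACGT
```

With a single input sequence the function returns that sequence unchanged, mirroring the single-sequence case described above.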

Additional processing steps create fasta nucleotide files (`ffn`) and `gff` files for each composite genome (e.g., `acI-A.ffn` and `acI-A1.gff`). The former are used as reference genomes for metatranscriptome mapping, while the latter are used by `htseq-count` to count the reads that map to each gene.
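One possible layout for these files is sketched below, under the assumption that each COG consensus sequence is written as its own fasta record and annotated as a single `CDS` feature spanning the whole record. The actual layout used in this repo may differ; the function name `composite_records`, the `source` label, and the `ID=` attribute are choices made for this example.

```python
def composite_records(consensus_by_cog, source="composite"):
    """Build fasta (ffn) and GFF text for a composite genome.

    consensus_by_cog -- dict mapping COG identifier -> consensus nucleotide sequence
    Each COG becomes one fasta record and one CDS feature covering it.
    """
    fasta_lines = []
    gff_lines = ["##gff-version 3"]
    for cog in sorted(consensus_by_cog):
        seq = consensus_by_cog[cog]
        fasta_lines.append(f">{cog}")
        fasta_lines.append(seq)
        # GFF columns: seqid, source, type, start, end, score, strand, phase, attributes
        gff_lines.append(
            "\t".join([cog, source, "CDS", "1", str(len(seq)), ".", "+", "0", f"ID={cog}"])
        )
    return "\n".join(fasta_lines) + "\n", "\n".join(gff_lines) + "\n"


ffn_text, gff_text = composite_records({
    "group00000": "ATGAAAGATAACTCAGAGAATAAAGGC",
    "group00001": "ATGACCGAA",
})
print(ffn_text)
print(gff_text)
```

With a layout like this, `htseq-count` would need to be told which feature type and attribute to use, for example through its `-t CDS` and `-i ID` options.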

__Note__: These genomes contain protein-encoding sequences only.

### References
1. Li, L., Stoeckert, C. J., & Roos, D. S. (2003). OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research, 13(9), 2178–89. http://doi.org/10.1101/gr.1224503
2. Edgar, R. C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research, 32(5), 1792–1797. http://doi.org/10.1093/nar/gkh340
3. Edgar, R. C. (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5, 113. http://doi.org/10.1186/1471-2105-5-113
4. Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics, 16(6), 276–277. http://doi.org/10.1016/S0168-9525(00)02024-2

## Mapping to Consensus Genomes

For computational details, please see the `actinoRE` branch of the [OMD-TOIL repo](https://github.com/joshamilton/OMD-TOILv2/tree/actinoRE), which will be merged into this one at a later date.

## Differential Expression Analysis

Finally, I plan to use edgeR or DESeq to identify COGs that show differential expression across the acI clades. I hypothesize that such genes will be associated with the metabolic machinery (such as transporters) required to take up and metabolize seed compounds unique to individual clades. DESeq and edgeR rely on biological replicates to test for statistically significant differences in expression. In this study, we used three sets of "replicates", as follows:

* Lake Mendota - Three samples collected during OMD-TOIL. These are not true biological replicates as they were collected at different time points, but we are treating them as such. Additional documentation can be found in the [OMD-TOIL GitHub repo](https://github.com/McMahonLab/OMD-TOILv2).

* Amazon River - Twelve samples taken from six stations along the Amazon River (two biological replicates at each station). For ease of analysis, I am currently analyzing these as twelve biological replicates.

* Lake Lanier - Four biological replicates.
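Both edgeR and DESeq take a matrix of raw read counts with one row per gene (here, per COG) and one column per sample. The sketch below shows one way per-sample `htseq-count` output could be merged into such a matrix. It assumes the tab-separated `feature<TAB>count` output format, in which summary rows such as `__no_feature` start with a double underscore; the sample names and function names are hypothetical.

```python
def parse_htseq_counts(text):
    """Parse htseq-count output text into a dict of feature -> read count."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        feature, value = line.split("\t")
        if feature.startswith("__"):
            continue  # skip summary rows such as __no_feature and __ambiguous
        counts[feature] = int(value)
    return counts


def count_matrix(sample_texts):
    """Merge per-sample counts into (sample names, {COG: [counts per sample]})."""
    per_sample = {name: parse_htseq_counts(text) for name, text in sample_texts.items()}
    samples = sorted(per_sample)
    cogs = sorted(set().union(*per_sample.values()))
    # A COG missing from a sample's output is recorded as zero reads.
    rows = {cog: [per_sample[s].get(cog, 0) for s in samples] for cog in cogs}
    return samples, rows


# Hypothetical sample names and toy counts.
sample_texts = {
    "mendota_1": "group00000\t12\ngroup00001\t0\n__no_feature\t40\n",
    "mendota_2": "group00000\t7\ngroup00002\t3\n__ambiguous\t2\n",
}
samples, rows = count_matrix(sample_texts)
print(samples)
for cog, values in rows.items():
    print(cog, values)
```

Keeping the counts as raw integers matters here, since both packages perform their own normalization.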

I am currently analyzing these data and do not yet have results to report here.

### References
1. Anders, S., & Huber, W. (2010). Differential expression analysis for sequence count data. Genome Biology, 11, R106. http://doi.org/10.1186/gb-2010-11-10-r106
2. McCarthy, D. J., Chen, Y., & Smyth, G. K. (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288–4297. http://doi.org/10.1093/nar/gks042
3. Satinsky, B. M., Fortunato, C. S., Doherty, M., Smith, C. B., Sharma, S., Ward, N. D., … Crump, B. C. (2015). Metagenomic and metatranscriptomic inventories of the lower Amazon River, May 2011. Microbiome, 3, 39. http://doi.org/10.1186/s40168-015-0099-0
4. Tsementzi, D., Poretsky, R. S., Rodriguez-R, L. M., Luo, C., & Konstantinidis, K. T. (2014). Evaluation of metatranscriptomic protocols and application to the study of freshwater microbial communities. Environmental Microbiology Reports, 6(6), 640–655. http://doi.org/10.1111/1758-2229.12180