
Output files available for download:

Transcriptome assemblies (fasta): DOI

Annotations (gff): DOI

Table with one annotation name per transcript ID (best hit, sorted by e-value, with e-value < 1e-05) (.csv): DOI

Peptide translations (fasta): DOI

Expression quantification (salmon output): DOI

All files combined: DOI

Pipeline scripts: DOI


Johnson, Lisa K., Alexander, Harriet, & Brown, C. Titus. (2018). MMETSP re-assemblies [Data set]. Zenodo.

MMETSP pipeline

This repository contains the pipeline code used to generate re-assemblies of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP). Originally:

This pipeline was constructed to automate the Eel Pond khmer protocols over a large-scale RNAseq data set. The data set used is from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP), which contains 678 cultured samples of 306 pelagic and endosymbiotic marine eukaryotic species representing more than 40 phyla (Keeling et al. 2014).

The input file is SraRunInfo.csv, a metadata spreadsheet downloaded from NCBI-SRA that contains the URL and sample ID information for each record. The scripts were designed for the high performance computing cluster at Michigan State University (iCER) and are launched in parallel through the portable batch system (PBS) scheduler. They use the SraRunInfo.csv metadata spreadsheet to download and extract the data, run QC, trim, apply digital normalization (diginorm), and then assemble with Trinity. If you are interested in using these scripts, please be aware that modifications specific to your system will be required.
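As a rough illustration of how the metadata spreadsheet can drive the download step, the sketch below reads a RunInfo table and maps each run accession to its URL and sample information. It is not the repository's own code. The column names `Run`, `download_path`, `SampleName` and `ScientificName` are assumed from the usual NCBI RunInfo export, and the function name `read_run_info` and the example rows are hypothetical.

```python
import csv
import io


def read_run_info(handle):
    """Map each run accession to its download URL and sample metadata.

    Assumes the usual NCBI RunInfo column names (Run, download_path,
    SampleName, ScientificName); adjust them if your export differs.
    """
    runs = {}
    for row in csv.DictReader(handle):
        run = (row.get("Run") or "").strip()
        # Skip empty accessions and repeated header rows, which can
        # appear when several exports are concatenated into one file.
        if not run or run == "Run":
            continue
        runs[run] = {
            "url": (row.get("download_path") or "").strip(),
            "sample": (row.get("SampleName") or "").strip(),
            "organism": (row.get("ScientificName") or "").strip(),
        }
    return runs


# Made-up rows standing in for a real SraRunInfo.csv.
example = io.StringIO(
    "Run,download_path,SampleName,ScientificName\n"
    "SRR0000001,https://example.org/SRR0000001,MMETSP0000,Example species\n"
    "SRR0000002,https://example.org/SRR0000002,MMETSP0001,Another species\n"
)
runs = read_run_info(example)
for run, info in sorted(runs.items()):
    print(run, info["sample"], info["url"])
```

For a real file, pass `open("SraRunInfo.csv", newline="")` instead of the in-memory example.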

The main pipeline scripts in this repository:

  • download data from NCBI and organize into individual directories for each sample/accession ID
  • trim reads for quality, interleave reads
  • run normalize-by-median and filter-abund from khmer, rename, combine orphans
  • run the Trinity de novo transcriptome assembly software
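The per-sample organization described above can be sketched as follows. This is only an illustration under stated assumptions: the stage directory names in `STAGES`, the organism/accession nesting, and the helper `make_sample_dirs` are hypothetical and are not taken from the repository's scripts.

```python
from pathlib import Path
import tempfile

# Hypothetical stage directory names, in the order the stages run.
STAGES = ("download", "trim", "diginorm", "assembly")


def make_sample_dirs(base_dir, organism, run):
    """Create one directory per pipeline stage for a single accession."""
    sample_dir = Path(base_dir) / organism.replace(" ", "_") / run
    paths = {}
    for stage in STAGES:
        stage_dir = sample_dir / stage
        # exist_ok lets a restarted job reuse directories it already made.
        stage_dir.mkdir(parents=True, exist_ok=True)
        paths[stage] = stage_dir
    return paths


with tempfile.TemporaryDirectory() as tmp:
    paths = make_sample_dirs(tmp, "Example species", "SRR0000001")
    layout = [p.relative_to(tmp).as_posix() for p in paths.values()]
    print(layout)
```

Keeping each accession in its own directory tree means that a failed stage for one sample can be rerun without touching the others, which suits jobs submitted independently to a scheduler.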

Annotation and expression counts (run separately):

Additional scripts (run separately):


  1. Clone this repo
git clone
  2. Edit with absolute path names specific to your system. The file SraRunInfo.csv was obtained from NCBI for NCBI BioProject accession PRJNA231566. This code could be used with SraRunInfo.csv input from any collection of SRA records from NCBI or ENA.

  3. Run the main Python function



Keeling et al. 2014:

Supporting information with methods description:

Preliminary assembly protocol run by NCGR:

MMETSP website:

iMicrobe project with data and combined assembly downloads:

Blog posts:







