Code to generate re-assemblies of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP)
__pycache__
.gitignore
README.md
SraRunInfo.csv
assembly_trinity_2.2.0.py
assembly_trinity_20140413p1.py
clusterfunc.py
dibMMETSP_configuration.py
diginorm_mmetsp.py
getdata.py
main.py
mmetsp_pipeline1.png
trim_qc.py

dib-MMETSP

Output files available for download:

Transcriptome assemblies (fasta): DOI

Annotations (gff): DOI

Table with one annotation name per transcript ID, chosen as the best hit after sorting by e-value (e-value < 1e-05) (.csv): DOI

Peptide translations (fasta): DOI

Expression quantification (salmon output): DOI

All files combined: DOI

Pipeline scripts: DOI

Citation:

Johnson, Lisa K., Alexander, Harriet, & Brown, C. Titus. (2018). MMETSP re-assemblies [Data set]. Zenodo. https://doi.org/10.5281/zenodo.740440

MMETSP pipeline

This repository contains the pipeline code used to generate re-assemblies of the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP). Originally: https://github.com/ljcohen/MMETSP

This pipeline was constructed to automate the Eel Pond khmer protocols over a large-scale RNA-seq data set. The data set comes from the Marine Microbial Eukaryote Transcriptome Sequencing Project (MMETSP), which contains 678 cultured samples of 306 pelagic and endosymbiotic marine eukaryotic species representing more than 40 phyla (Keeling et al. 2014).

The input file is SraRunInfo.csv, a metadata spreadsheet downloaded from NCBI-SRA that contains the URL and sample ID information. The scripts were designed for the high performance computing cluster at Michigan State University (iCER) and are launched in parallel through the Portable Batch System (PBS) scheduler. They use the SraRunInfo.csv metadata spreadsheet to download and extract data, run QC, trim, apply digital normalization (diginorm), and then assemble with Trinity. If you are interested in using these scripts, please be aware that they will need modifications specific to your system.
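To make the role of the metadata spreadsheet concrete, here is a minimal sketch of reading it and grouping runs by organism. It is not the code in getdata.py. The function name `read_run_info` is hypothetical, and the column names `Run`, `download_path`, `SampleName` and `ScientificName` are assumed to match the header of your copy of SraRunInfo.csv; check them before relying on this.

```python
import csv
import io
from collections import defaultdict


def read_run_info(handle):
    """Group run accessions and download URLs by organism name.

    Assumes the columns Run, download_path, SampleName and ScientificName
    are present (an assumption about the SraRunInfo.csv header).
    """
    samples = defaultdict(list)
    for row in csv.DictReader(handle):
        # Skip rows that carry no run accession.
        run = (row.get("Run") or "").strip()
        if not run:
            continue
        samples[row["ScientificName"]].append(
            {"run": run, "sample": row["SampleName"], "url": row["download_path"]}
        )
    return dict(samples)


# Tiny in-memory stand-in for SraRunInfo.csv; all values are made up.
example = io.StringIO(
    "Run,download_path,SampleName,ScientificName\n"
    "SRR0000001,https://example.org/SRR0000001,MMETSP0000,Genus species\n"
    "SRR0000002,https://example.org/SRR0000002,MMETSP0001,Genus species\n"
)
runs_by_species = read_run_info(example)
```

With a real file, the same function could be called as `read_run_info(open("SraRunInfo.csv", newline=""))`, and each entry would then drive one download and one per-sample directory.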

The main pipeline scripts in this repository:

  • getdata.py, download data from NCBI and organize into individual directories for each sample/accession ID
  • trim_qc.py, trim reads for quality, interleave reads
  • diginorm_mmetsp.py, normalize-by-median and filter-abund from khmer, rename, combine orphans
  • assembly.py (present in this repository as assembly_trinity_2.2.0.py and assembly_trinity_20140413p1.py), runs Trinity de novo transcriptome assembly software
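Because the stages listed above run in order for every sample and are submitted through PBS, one way to picture the per-sample work is as a generated job script. The sketch below is illustrative only: `pbs_script`, its default resource values, and the `echo` placeholder commands are assumptions, not the contents of clusterfunc.py or of the stage scripts themselves.

```python
def pbs_script(job_name, commands, walltime="24:00:00", ppn=8, mem="32gb"):
    """Return the text of a PBS job script that runs `commands` in order.

    The resource defaults are placeholders; adjust them for your cluster.
    """
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -l walltime={walltime}",
        f"#PBS -l nodes=1:ppn={ppn}",
        f"#PBS -l mem={mem}",
        "cd $PBS_O_WORKDIR",
        "set -e  # stop at the first failing stage",
    ]
    lines.extend(commands)
    return "\n".join(lines) + "\n"


# Stage order taken from the list above; the commands are placeholders.
stages = ["getdata", "trim_qc", "diginorm", "assembly"]
accession = "SRR0000001"  # made-up accession for illustration
job = pbs_script(accession, [f'echo "stage {s} for {accession}"' for s in stages])
```

Writing `job` to a file and submitting it with `qsub` would give one independent job per accession, which is what allows the samples to be processed in parallel.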

Annotation and expression counts (run separately):

Additional scripts (run separately):

Usage:

  1. Clone this repo:

git clone https://github.com/dib-lab/dib-MMETSP.git

  2. Edit dibMMETSP_configuration.py with absolute path names specific to your system. The file SraRunInfo.csv was obtained from NCBI for NCBI BioProject accession PRJNA231566. This code could be used with SraRunInfo.csv input from any collection of SRA records from NCBI or ENA.

  3. Run the main Python script:

python main.py
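The configuration step asks for absolute path names, and a relative path is an easy mistake to make when jobs run from a different working directory on the cluster. The following is a small sketch of a check you could run before launching main.py. The setting names `basedir` and `run_info` and the example paths are hypothetical; they are not the variable names used in dibMMETSP_configuration.py.

```python
import posixpath


def check_config_paths(paths):
    """Return the names of settings whose values are not absolute paths.

    Uses POSIX path rules, assuming a Linux-style cluster file system.
    """
    return sorted(
        name for name, value in paths.items() if not posixpath.isabs(value)
    )


# Hypothetical settings; replace with the values from your own configuration.
settings = {
    "basedir": "/mnt/scratch/mmetsp/",
    "run_info": "SraRunInfo.csv",  # relative on purpose, to show a failure
}
problems = check_config_paths(settings)
```

Here `problems` lists `run_info`, which signals that the spreadsheet path should be rewritten as an absolute path before the pipeline is started.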

References

Keeling et al. 2014: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001889

Supporting information with methods description: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001889#s6

Preliminary assembly protocol run by NCGR: https://github.com/ncgr/rbpa

MMETSP website: http://marinemicroeukaryotes.org/

iMicrobe project with data and combined assembly downloads: ftp://ftp.imicrobe.us/projects/104/

Blog posts:

  • https://monsterbashseq.wordpress.com/2016/09/13/mmetsp-re-assemblies/
  • http://ivory.idyll.org/blog/2016-mmetsp-a-first-look.html