Analysis of metatranscriptomic RNAseq data for pathogen discovery

This repo contains code for an RNAseq analysis pipeline originally conceived and written by Keir Balla at the University of Utah for the Langlois-Eide lab dirty mouse virome collaboration, with modifications by me.

Output

The pipeline maps RNAseq reads to the mouse genome, de novo assembles the unmapped reads with Trinity, and assigns taxonomic lineages to the assembled contigs with BLASTn. The two main outputs are CSV files that report the raw read counts for the taxonomic (1) families and (2) species present in each mouse.

Dependencies/versions used

STAR RNAseq read mapper, version 2.7.1a, details here
Trinity RNAseq de novo read assembler, version 2.8.5, details here
samtools, version 1.12, details here
salmon RNAseq transcript quantification, version 1.4.0, details here
BLAST+ sequence database search, version 2.11.0, details here
snakemake, version 6.5.1 (latest version used), details here
dib-lab/2018-ncbi-lineages scripts on GitHub
faSplit script for splitting FASTA files into subsets for a parallelized BLAST search, executable found here

Loading dependencies with conda

Most of the dependencies can be installed into a conda environment. STAR, Trinity, Samtools, and snakemake are listed in the environment.yaml file in the base directory, and the environment can be created with: conda env create --file environment.yaml

Loading remaining dependencies

BLAST+ is already installed on MSI, so the script that uses BLAST (blastarray.srun) loads it via module load. Non-MSI users will need to install BLAST+ or load it another way.

The faSplit executable should be downloaded from the link above and added to your PATH variable, for example:

cd ~/bin # or wherever you keep your local scripts
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faSplit
chmod +x faSplit # make the downloaded binary executable
echo $PATH
# If ~/bin is not included in your PATH, run:
export PATH=$PATH:~/bin
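
Once the conda environment is active and the remaining tools are loaded, it can help to confirm that everything is reachable before submitting jobs. The following is a minimal sketch, not part of this repo: the function name missing_tools and the array PIPELINE_TOOLS are hypothetical, and the executable names are assumed to be the usual ones for each dependency (for example, blastn for BLAST+).

```shell
#!/usr/bin/env bash
# Sketch only: report which executables are not on PATH.
# Prints one name per missing tool and returns 1 if any are missing.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Assumed executable names for the dependencies listed in this README.
PIPELINE_TOOLS=(STAR Trinity samtools salmon blastn snakemake faSplit)

# Example call (run it after activating the environment and loading modules):
#   missing_tools "${PIPELINE_TOOLS[@]}" || echo "Some dependencies are missing" >&2
```

Any names printed by the example call are tools that still need to be installed, loaded with module load, or added to PATH.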

Lineage assignment scripts

The dib-lab scripts for assigning lineages to Trinity contigs were cloned from GitHub. They are included in this repo along with their usage licenses, so there is no need to download them yourself unless that code is updated in the future.

Data

Raw data are not included in this repo. They consist of 2x150 bp paired-end, stranded NovaSeq reads.

Running the pipeline

The slurm/bash scripts should be run in the following order:

star_index.srun <- Downloads the mouse reference genome from Ensembl and indexes it for mapping with STAR.
map_reads.srun <- Maps raw RNAseq reads to the indexed mouse genome with STAR and outputs the unmapped reads (i.e. non-mouse reads, including any pathogen-derived reads).
assemble_reads.srun <- Extracts the unmapped reads from the STAR output, concatenates them, and de novo assembles them into transcripts using Trinity.
blastarray.srun <- Moves the Trinity output files from scratch to a local directory, splits them into ~100 subsets, and launches a parallel BLAST search on multiple nodes to find the 10 best matches in GenBank for each transcript.
salmon_lineage_assign.srun <- Re-maps the original fastq reads back to the Trinity contigs, quantifies them with salmon, and assigns taxonomic lineages to the Trinity contigs. Produces files that list transcript abundances at the species and family levels for each sample within the final/<project-id> directory.
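
The run order above could also be enforced with Slurm job dependencies rather than submitting each script by hand. The sketch below is an illustration, not something shipped in this repo: the function submit_chain and the SBATCH override variable are hypothetical. It assumes each script does all of its work inside the job that sbatch creates for it. If blastarray.srun launches further jobs of its own, the afterok dependency will not wait for those, and the last step should still be submitted manually once the BLAST search has finished.

```shell
#!/usr/bin/env bash
# Sketch only: submit scripts as a chain in which each job starts
# only after the previous one has finished successfully.
set -euo pipefail

# Command used to submit jobs; can be overridden for a dry run.
SBATCH="${SBATCH:-sbatch}"

submit_chain() {
    local prev="" jobid script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # First job has no dependency; --parsable prints just the job id
            jobid=$("$SBATCH" --parsable "$script")
        else
            # Later jobs wait for the previous job to exit with status 0
            jobid=$("$SBATCH" --parsable --dependency=afterok:"$prev" "$script")
        fi
        echo "submitted $script as job $jobid" >&2
        # --parsable may print "jobid;cluster"; keep only the job id
        prev="${jobid%%;*}"
    done
    # Print the id of the last job in the chain
    echo "$prev"
}

# Example call, using the script names from the list above:
#   submit_chain star_index.srun map_reads.srun assemble_reads.srun \
#       blastarray.srun salmon_lineage_assign.srun
```

With afterok, a failure in any step leaves the later jobs pending with an unsatisfied dependency instead of running them on incomplete input, so they need to be cancelled or resubmitted after the failure is fixed.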
