wbps-expression

Transcriptomic data for WormBase ParaSite

Demo

Studies page

A list of all studies for a species, with the results available for download

FTP

Introduction

This is the pipeline for providing WormBase ParaSite with RNASeq data. It encompasses a curation platform, data retrieval and analysis, and a UI oriented around static pages and files.

Curation

On a fresh run for a species of interest, the pipeline retrieves:

run metadata from RNASeq-er, who retrieve it from ENA
study and publication metadata from ENA and GEO
publication details from PubMed
FTP locations of result files from RNASeq-er

This is then used to update the studies folder, preserved together with source code in this repository. The pipeline makes some guesses on what to accept and reject - with a few exceptions, we only allow studies of at least six runs - and runs consistency checks on the annotation. A curator then amends the files and re-runs iteratively, until the checks pass and they are satisfied with the results.
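As a rough illustration of the accept/reject guess described above, here is a minimal sketch in Python. It is not the pipeline's own code (which is Perl); the function name, the study ids, and the shape of the inputs are assumptions made for the example. Only the six-run threshold and the existence of hand-picked exceptions come from the text.

```python
# Threshold stated in this README: studies need at least six runs.
MIN_RUNS = 6


def triage_studies(run_counts, always_include=frozenset()):
    """Split study ids into (accepted, rejected) lists.

    run_counts: dict mapping a study id to its number of runs (hypothetical input shape).
    always_include: study ids accepted regardless of size, standing in for the
    hand-curated exceptions mentioned in the text.
    """
    accepted, rejected = [], []
    for study_id, n_runs in sorted(run_counts.items()):
        if n_runs >= MIN_RUNS or study_id in always_include:
            accepted.append(study_id)
        else:
            rejected.append(study_id)
    return accepted, rejected


# Made-up study ids, for illustration only.
accepted, rejected = triage_studies(
    {"STUDY_A": 6, "STUDY_B": 5, "STUDY_C": 2},
    always_include={"STUDY_C"},
)
```

A curator would still review both lists; the sketch only covers the first automatic guess, not the consistency checks on the annotation.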

Data retrieval and analysis

Source files

We leave alignment, quantification, etc. to RNASeq-er. Their data, assembled by study but without any extra interpretation, is available for every study as "counts of aligned reads per run" and "TPMs per run".

TPMs by condition

Where there are enough replicates, we provide the median TPM per condition. Few studies have both technical and biological replicates, but where they do, we take the median of the technical replicates for each biological replicate, and then the median across biological replicates.
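The two-level median can be sketched as follows for a single gene. This is an illustrative Python version, not the pipeline's implementation; the input shape (tuples of condition, biological replicate label, TPM) and the function name are assumptions, and the sketch does not check whether there are "enough" replicates.

```python
from collections import defaultdict
from statistics import median


def median_tpm_by_condition(runs):
    """Return {condition: median TPM} for one gene.

    runs: iterable of (condition, biological_replicate, tpm) tuples.
    Runs sharing both a condition and a biological replicate label are
    treated as technical replicates of each other.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for condition, bio_rep, tpm in runs:
        grouped[condition][bio_rep].append(tpm)
    return {
        # Inner median: technical replicates. Outer median: biological replicates.
        condition: median(median(tpms) for tpms in by_rep.values())
        for condition, by_rep in grouped.items()
    }


# Made-up values: "adult" replicate b1 has two technical replicates.
example_runs = [
    ("adult", "b1", 10.0),
    ("adult", "b1", 14.0),
    ("adult", "b2", 20.0),
    ("adult", "b3", 30.0),
    ("larva", "b1", 1.0),
    ("larva", "b2", 3.0),
]
result = median_tpm_by_condition(example_runs)
```

With no technical replicates, the inner median is just the single value, so the same function covers the common case too.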

Differential expression analysis

For studies where it makes sense to form contrasts from appropriate pairs of conditions, we run differential expression analysis. The pipeline picks contrasts automatically, through a number of heuristics:

compare everything to a reference condition, if there is a clear reference
in the general case, try to make contrasts from all pairs of conditions
except if conditions differ by multiple types of characteristics, then pick only the pairs that differ by one characteristic
except a parasite-specific curation of life stages as two characteristics (developmental stage and sex) should still work
except drug treatment assays curated as treatment+concentration or treatment+timepoint should still work

It works slightly better than it sounds.
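The first three heuristics can be sketched like this. It is a simplified Python illustration under stated assumptions: conditions are represented as dictionaries of characteristic type to value, the reference condition is passed in explicitly rather than detected, and the life-stage and drug-treatment exceptions are left out. None of the names below come from the pipeline.

```python
from itertools import combinations


def differing_keys(a, b):
    """Characteristic types whose values differ between two conditions."""
    return {key for key in a.keys() | b.keys() if a.get(key) != b.get(key)}


def pick_contrasts(conditions, reference=None):
    """Return a list of (condition, condition) pairs to compare.

    conditions: dict mapping a condition name to {characteristic type: value}.
    reference: name of a clear reference condition, if the caller knows one.
    """
    names = sorted(conditions)
    if reference is not None:
        # Compare everything to the reference.
        return [(reference, name) for name in names if name != reference]
    pairs = list(combinations(names, 2))
    varying = set()
    for a, b in pairs:
        varying |= differing_keys(conditions[a], conditions[b])
    if len(varying) <= 1:
        # Only one type of characteristic varies: all pairs are meaningful.
        return pairs
    # Several types vary: keep pairs that differ by exactly one characteristic.
    return [
        (a, b)
        for a, b in pairs
        if len(differing_keys(conditions[a], conditions[b])) == 1
    ]


# Made-up design with two varying characteristics.
design = {
    "ctrl_24h": {"treatment": "none", "timepoint": "24h"},
    "drug_24h": {"treatment": "drug", "timepoint": "24h"},
    "drug_48h": {"treatment": "drug", "timepoint": "48h"},
}
contrasts = pick_contrasts(design)
```

Here the pair of "ctrl_24h" and "drug_48h" is dropped because it differs in both treatment and timepoint, so a change in expression could not be attributed to either.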

The analysis uses DESeq2 in a very standard way, with fold changes and p-values extracted and filtered past a significance threshold.

Outputs

The analysis results are returned as files within a per-study directory structure, together with a listing of metadata for programmatic use elsewhere, and a static HTML page presenting the content.

This HTML page is intended as a primary point of reference for people interested in the data. It lists the studies, with metadata for each, and links to the analysis results.

WormBase ParaSite deploys the content by syncing the data to a location on the file system that the web servers are configured to read from.

Gene page

The repository also contains a module that searches text files with grep and formats the results as HTML pages, which forms the basis of our gene page. Because the files are small, we had no need for a database to store the data.
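The idea of "grep the files, then format the hits as HTML" can be sketched in a few lines. This Python version is only an illustration of the approach: the function names, the tab-separated line layout, and the example values are assumptions, and the real module is part of the Perl code in this repository.

```python
import html


def search_lines(lines, gene_id):
    """Return the tab-separated fields of every line containing gene_id.

    This is a plain substring match, like a fixed-string grep, so a short id
    can also match longer ids that contain it.
    """
    return [line.rstrip("\n").split("\t") for line in lines if gene_id in line]


def render_html_table(rows):
    """Format rows of fields as a minimal HTML table, escaping each cell."""
    body = "\n".join(
        "<tr>" + "".join(f"<td>{html.escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>\n{body}\n</table>"


# Made-up lines standing in for a results file.
example_lines = ["gene1\tstudyA\t2.5\n", "gene2\tstudyA\t0.1\n"]
rows = search_lines(example_lines, "gene1")
page = render_html_table(rows)
```

Scanning flat files on every request stays cheap only while the files are small, which is the trade-off the paragraph above describes.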

Tracks

The pipeline also supports WormBase ParaSite track hubs and JBrowse displays. The code for those lives elsewhere, and reads one of this pipeline's outputs, $species_id.studies.json.

Curation tools

Tracking progress

The curation files go together with the source code, and git is really good at tracking what happened, when, and why. git status will show you what new files appeared after a run. Very convenient!

What to edit

Primarily edit TSV files in study folders, to fix the per-run metadata: $study_id.design.tsv and $study_id.skipped_runs.tsv.

Don't edit YAMLs or sources, because the changes will be lost.

There are also a few places with essentially corner-case curation, scattered around the source code:

characteristics retrieved from RNASeq-er get standardised through a set of regex-based heuristics centred on parasite-specific concepts such as life stages
PubMed IDs not in ENA or GEO are in StudyMetadata.pm
Studies with fewer than six runs that are nevertheless worth including are in IncomingStudies.pm

Editing tsv files

Command line environment

See the scripts folder for utilities that pick out specific columns (good with paste) or transpose the rows and columns (good with grep or sed). After changing a file, rerun the pipeline to get it rewritten in a more "standard" format.
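To show what transposing a design file buys you, here is a small Python sketch. It is a hypothetical equivalent written for this example, not one of the utilities in the scripts folder, and the column names in the sample are made up.

```python
def transpose_tsv(text):
    """Swap rows and columns of tab-separated text.

    Short rows are padded with empty fields so every row has the same width.
    """
    rows = [line.split("\t") for line in text.splitlines()]
    width = max((len(row) for row in rows), default=0)
    padded = [row + [""] * (width - len(row)) for row in rows]
    return "\n".join("\t".join(column) for column in zip(*padded))


# One run per line becomes one characteristic per line.
sample = "run\tstage\nR1\tadult\nR2\tlarva"
transposed = transpose_tsv(sample)
```

After transposing, each characteristic sits on its own line, so line-oriented tools such as grep or sed can select or rewrite a whole column; transposing again restores the original layout.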

Online - GitHub's editing tools

You can use GitHub's editing tools. Then make a commit and open a PR - Travis will run the build with the checks. This is great for very small changes, but the editor doesn't help you much: it's just a text box. You can paste the content somewhere more convenient. Google Spreadsheets didn't like tabs when I tried, but Excel worked well.

Development

Install

Not automated, but should present no difficulties. This software can run either on a personal machine or in a cluster computing environment.

Instructions

Install R, and DESeq2. Clone the repository and install all the Perl modules. Write a wrapper script similar to bin/run-ebi.pl, hooking up libraries, inputs, and outputs.

Perl modules

You could do this:

cpanm -v --installdeps --notest .

R

Unfortunately DESeq2 pulls down a lot of dependencies: interfacing C++ code, plotting, etc. Install Bioconductor, and then install DESeq2 using Bioconductor.

if (!requireNamespace("BiocManager"))
    install.packages("BiocManager")
BiocManager::install(c("DESeq2"))

