This is a simple based implementation of a Ribosome profiling pipeline
- snakemake 3.7.1 (3.8.0 is bad, later maybe ok)
- R > 3.3.1 (In older version wig generation does not work)
- GenomicAlignments
- GenomicFeatures
- RiboProfiling
- rtracklayer
- tophat
- bowtie
- FastX
- samtools
- cutadapt
The pipeline is used as
snakemake [options] [target]
The possible targets are:
- bams Generate BAM files with prefiltering of the ribosomal fragments
- bamsnf without prefiltering
- randomizedbams Generates BAMS where reads with double mapping are randomly selected. Be careful, it is a bit of a hack.
- csv Generate WIGS, bigWigs, CSV with counts per gene (bams are, of course, also generated)
Run snakemake --lt
to see all possible target rules, or look into Snakefile
for details.
The suggested way to run is something like
snakemake -np csv
To check what is it going to do, and then start as
snakemake -p csv
run.sh
contains an example setup to run on a LSF based cluster.
The samples in FASTQ format (preferrably compressed with gzip, bzip,
or xz) should be in input
directory.
data
should contain yast.fa
with the reference genome FASTA,
yeast.gff
with the gene annotations, and
Yeast-Noncoding-extended.fa
with the ribosomal RNA contaminating
sequences to filter at the first stage. See next section on the
naming for these files.
The configuration is in the file config.json
which looks like
{
"INDEX": "data/yeast",
"INDEX_NC": "data/Yeast-Noncoding-extended",
"GENOME_GFF": "data/yeast.gff",
"ADAPTER": "CTGTAGGCACCATCAATAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
"MANUAL_SHIFTS_FILE": "shifts.csv"
}
The entries mean:
- INDEX basename of the FASTA file with the reference sequence.
In the example above the reference should be named
data/yeast.fa
- GENOME_GFF name of the genome annotation in gff format
- ADAPTER Adapter to strip of the end of the sequence
- MANUAL_SHIFTS_FILE Name of the csv with the manual selection of the reads and position of the P-sites. See P-site WIGs
The WIG (or, currently, BedGraph) files are being generated by
selecting only the reads with length that is close to the correct one,
and then selecting the fixed offset into the read, which is individual
for each read length. By default, the algorithm selects the length
corresponding to the maximum of the length histogram +- one. The
histogram with color-marked selected lengths is saved in
wig/{sample}_Length.pdf
. Then, for each of the lenght individually
the metagene distribution of the starts of the reads relative to the
start of the nearest OFR is plotted in wig/{sample}_Asiteshifts.pdf
,
that is P-site codon. Then, the corrsponding offset into the read is
taken, which leads to the resulting bedgraph files.
However, if this fails, it is possible to create the
MANUAL_SHIFTS_FILE with the sete of the read lengths and shifts for
each sample. It has three columns: file name, length list (space
separated), and shift list. the pipeline creates the CSV files with
the values it found automatically in wigs/{sample}.shifts.csv
, which
can be used as the template to create the MANUAL_SHIFTS_FILE.
/ This one (rApp) seems to not be present!
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTATGCATGCATGCATGCATGCATGCATGCaCTGTAGGCACCATCAATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
^^^^Sequence^^^^^^^^^^^^^^^^-CTGTAGGCACCATCAATAGATCGGAAGAGC TGACCA
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT GATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
TruSeq Universal Adapter TruSeq Indexed Adapter Index
AATGATACGGCGACCACCGA
Portion bind to flow cell
CGTATGCCGTCTTCTGCTTG
Binds to flow cell for paired end reads
GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -> Index read proceeds
Index read primer
ACACTCTTTCCCTACACGACGCTCTTCCGATCT -> Read proceeds
Read 1 primer
-
5'rAppCTGTAGGCACCATCAAT/3ddC/3'
-
AATGATACGGCGACCACCGAGATCTACAC
CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCAGACGTGTGCTCTTCCG
: IDX4