Skip to content

fedxa/RPF_pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Ribosome footprinting pipeline

This is a simple Snakemake based implementation of a Ribosome profiling pipeline

Installation

Requirements

  • snakemake 3.7.1 (3.8.0 is bad, later maybe ok)
  • R > 3.3.1 (In older version wig generation does not work)
    • GenomicAlignments
    • GenomicFeatures
    • RiboProfiling
    • rtracklayer
  • tophat
  • bowtie
  • FastX
  • samtools
  • cutadapt

Run

The pipeline is used as

snakemake [options] [target]

The possible targets are:

  • bams Generate BAM files with prefiltering of the ribosomal fragments
  • bamsnf without prefiltering
  • randomizedbams Generates BAMS where reads with double mapping are randomly selected. Be careful, it is a bit of a hack.
  • csv Generate WIGS, bigWigs, CSV with counts per gene (bams are, of course, also generated)

Run snakemake --lt to see all possible target rules, or look into Snakefile for details.

The suggested way to run is something like

snakemake -np csv

To check what is it going to do, and then start as

snakemake -p csv

run.sh contains an example setup to run on a LSF based cluster.

Input files

The samples in FASTQ format (preferrably compressed with gzip, bzip, or xz) should be in input directory.

data should contain yast.fa with the reference genome FASTA, yeast.gff with the gene annotations, and Yeast-Noncoding-extended.fa with the ribosomal RNA contaminating sequences to filter at the first stage. See next section on the naming for these files.

Configuration

The configuration is in the file config.json which looks like

{
"INDEX": "data/yeast",
"INDEX_NC": "data/Yeast-Noncoding-extended",
"GENOME_GFF": "data/yeast.gff",
"ADAPTER": "CTGTAGGCACCATCAATAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
"MANUAL_SHIFTS_FILE": "shifts.csv"
}

The entries mean:

  • INDEX basename of the FASTA file with the reference sequence. In the example above the reference should be named data/yeast.fa
  • GENOME_GFF name of the genome annotation in gff format
  • ADAPTER Adapter to strip of the end of the sequence
  • MANUAL_SHIFTS_FILE Name of the csv with the manual selection of the reads and position of the P-sites. See P-site WIGs

P-site WIG generation

The WIG (or, currently, BedGraph) files are being generated by selecting only the reads with length that is close to the correct one, and then selecting the fixed offset into the read, which is individual for each read length. By default, the algorithm selects the length corresponding to the maximum of the length histogram +- one. The histogram with color-marked selected lengths is saved in wig/{sample}_Length.pdf. Then, for each of the lenght individually the metagene distribution of the starts of the reads relative to the start of the nearest OFR is plotted in wig/{sample}_Asiteshifts.pdf, that is P-site codon. Then, the corrsponding offset into the read is taken, which leads to the resulting bedgraph files.

However, if this fails, it is possible to create the MANUAL_SHIFTS_FILE with the sete of the read lengths and shifts for each sample. It has three columns: file name, length list (space separated), and shift list. the pipeline creates the CSV files with the values it found automatically in wigs/{sample}.shifts.csv, which can be used as the template to create the MANUAL_SHIFTS_FILE.

Notes on linkers/adapters in Ribosome Footprinting

Detailed sequence breakout for Ribosome footprint library

                                                                                      / This one (rApp) seems to not be present!
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCTATGCATGCATGCATGCATGCATGCATGCaCTGTAGGCACCATCAATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
                                                          ^^^^Sequence^^^^^^^^^^^^^^^^-CTGTAGGCACCATCAATAGATCGGAAGAGC                     TGACCA
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT                                               GATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
TruSeq Universal Adapter                                                                                 TruSeq Indexed Adapter           Index
AATGATACGGCGACCACCGA
Portion bind to flow cell
                                                                                                                                                    CGTATGCCGTCTTCTGCTTG
                                                                                                                                                    Binds to flow cell for paired end reads
                                                                                                         GATCGGAAGAGCACACGTCTGAACTCCAGTCAC -> Index read proceeds
                                                                                                         Index read primer
                         ACACTCTTTCCCTACACGACGCTCTTCCGATCT -> Read proceeds
                         Read 1 primer

Linkers (some references)

  1. Adapter

    5'rAppCTGTAGGCACCATCAAT/3ddC/3'

  2. cloning linkers

    AATGATACGGCGACCACCGAGATCTACAC
    CAAGCAGAAGACGGCATACGAGATTGGTCAGTGACTGGAGTTCAGACGTGTGCTCTTCCG: IDX4

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Packages

No packages published