
Junckey

Junckey is a collection of scripts for calculating PSI (Percent Spliced In) values of junction clusters. The pipeline is adapted to work with STAR (https://github.com/alexdobin/STAR).

1. Format STAR output

This pipeline uses the SJ.out.tab files generated by STAR. The script below must be given the path to the STAR samples, with each run in a separate folder, as well as a GTF annotation of the transcriptome:

format_STAR_output.sh <path_to_STAR_samples> <gtf_annotation>
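STAR writes SJ.out.tab as a fixed, tab-separated table of nine columns with no header. As a rough sketch of what the formatting step consumes, the unique read count per junction (column 7) can be extracted like this (`read_sj` is a hypothetical helper, not part of the repository):

```python
import csv

# STAR SJ.out.tab columns (tab-separated, no header):
# 1 chromosome, 2 intron start (1-based), 3 intron end,
# 4 strand (0 undefined, 1 +, 2 -), 5 intron motif,
# 6 annotated (0/1), 7 unique reads, 8 multi-mapping reads, 9 max overhang
def read_sj(path):
    """Return {(chrom, start, end, strand): unique_read_count}."""
    strand_map = {"0": ".", "1": "+", "2": "-"}
    counts = {}
    with open(path) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            chrom, start, end = row[0], int(row[1]), int(row[2])
            strand = strand_map[row[3]]
            counts[(chrom, start, end, strand)] = int(row[6])
    return counts
```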

This script generates two files, with the samples in the columns and the junctions in the rows:

  • readCounts.tab: the unique read counts computed by STAR. For each junction we also report its overlap with genes and the junction type:
    • 1: Fully annotated junction
    • 2: Junction overlapping with known exons, but new connection
    • 3: Alternative donor site
    • 4: Alternative acceptor site
    • 5: Novel junction, neither donor nor acceptor site is annotated
  • rpkm.tab: RPKM values normalized from the read counts. The following steps use only the readCounts file.

If many samples need to be processed and a cluster system is available, there is an adapted version for parallelizing the jobs. The pipeline is split into two parts:

  • part1: run a job per sample
qsub -b y "format_STAR_output_pipeline_cluster_part1.sh <path_to_STAR_samples> <gtf_annotation>"
  • part2: once all the jobs created by part1 have finished, run part2 in order to gather all the data
qsub -b y "format_STAR_output_pipeline_cluster_part2.sh <path_to_STAR_samples>"

2. Clustering

We compute the PSI of each junction according to its relative inclusion among nearby junctions. To this end, the junctions are grouped into clusters with LeafCutter (https://github.com/davidaknowles/leafcutter).

First, we need to split the readCounts file into .junc files (one per sample). The next script generates these files in the provided path, together with the corresponding index file (index_juncfiles.txt):

python Split_in_juncfiles.py <path_to_STAR_samples>/readCounts.tab
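A minimal sketch of what such a split does, assuming a readCounts.tab with junction IDs like chr1_100_200_+ in the first column and one sample per remaining column (the actual ID format used by Split_in_juncfiles.py may differ). Each output line follows the BED-like layout LeafCutter accepts for .junc files (chrom, start, end, name, count, strand):

```python
import csv
import os

def split_in_juncfiles(readcounts_path, out_dir):
    """Write one .junc file per sample column, plus an index file
    listing their paths.

    Assumes junction row IDs look like 'chr1_100_200_+' (hypothetical;
    adjust the split to match your readCounts.tab, e.g. for chromosome
    names that themselves contain underscores)."""
    with open(readcounts_path) as fh:
        reader = csv.reader(fh, delimiter="\t")
        samples = next(reader)[1:]  # header: junction ID, then sample names
        out = [open(os.path.join(out_dir, s + ".junc"), "w") for s in samples]
        for row in reader:
            chrom, start, end, strand = row[0].split("_")
            for fh_out, count in zip(out, row[1:]):
                fh_out.write(f"{chrom}\t{start}\t{end}\t.\t{count}\t{strand}\n")
        for f in out:
            f.close()
    index = os.path.join(out_dir, "index_juncfiles.txt")
    with open(index, "w") as fh:
        for s in samples:
            fh.write(os.path.join(out_dir, s + ".junc") + "\n")
    return index
```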

Now we are ready to run LeafCutter. An example invocation is shown below; the LeafCutter GitHub page documents further options for tuning the run. The previously generated index_juncfiles.txt file must be provided:

python leafcutter-master/clustering/leafcutter_cluster.py -p 0.01 -j <path_to_STAR_samples>/index_juncfiles.txt -o <output_path_LeafCutter>

3. PSI Calculation

The next script calculates the PSI of each junction relative to its cluster. It returns a single file with all the PSI values, removing clusters that contain NA values:

python Get_PSI.py <output_path_LeafCutter> <path_to_STAR_samples>/readCounts.tab
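The PSI of a junction in a sample is its read count divided by the total read count of its cluster in that sample. A minimal sketch of that calculation (a hypothetical helper, not the repository's Get_PSI.py), where clusters whose total is zero in some sample, i.e. the NA case mentioned above, are dropped:

```python
from collections import defaultdict

def cluster_psi(counts, clusters):
    """PSI of each junction relative to its cluster, per sample.

    counts:   {junction_id: [read counts per sample]}
    clusters: {junction_id: cluster_id}
    Returns {junction_id: [PSI per sample]}, dropping clusters in which
    any sample has zero total reads (PSI would be NA there).
    """
    # Sum read counts per cluster, per sample.
    totals = defaultdict(lambda: None)
    for j, c in counts.items():
        cl = clusters[j]
        if totals[cl] is None:
            totals[cl] = [0] * len(c)
        totals[cl] = [t + x for t, x in zip(totals[cl], c)]
    # Keep only clusters with a nonzero total in every sample.
    kept = {cl for cl, t in totals.items() if all(x > 0 for x in t)}
    return {
        j: [x / t for x, t in zip(c, totals[clusters[j]])]
        for j, c in counts.items()
        if clusters[j] in kept
    }
```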