IRTools

IRTools is a computational toolset for detection and quantitative analysis of intron retention from RNA-Seq libraries.

Installation

To use IRTools, you will need python 3.6.9 or potentially above.

Note: This has only been tested in python 3.6.9.

PIP

To install directly from PyPI:

pip install IRTools

If this fails, please install all dependencies first:

pip install numpy
pip install scipy
pip install pandas
pip install networkx
pip install bx-python
pip install HTSeq==0.13.5
pip install pysam==0.19.1

Note: This has only been tested with numpy 1.19.5, scipy 1.5.4, pandas 1.1.5, networkx 2.5.1, and bx-python 0.8.12, but will likely work with any later version.

From source

To install from source:

1. Install the dependencies listed above.

2. Run:

python setup.py install

Usage

IRTools [-h] [-v] {annotation,quant,diff} ...

There are three major functions available in IRTools serving as sub-commands.

Command	Function
annotation	Generate annotation GTF file for intron retention analysis.
quant	Quantify intron retention in both gene and intron levels.
diff	Detection of differential intron retention from two samples with replicates in both gene and intron levels.

annotation

`Arguments`

-g/--GTF-file GTFFILE

Input annotation GTF File. GTF file for a specific species can be downloaded from iGenome.

-o/--annotation-file ANNOFILE

Output annotation GTF file.

--outdir OUTDIR

If specified, all output files will be written to that directory. DEFAULT: the current working directory.

`Outputs`

ANNOFILE is the output GTF file that contains information for intron retention analysis, including the genomic coordinates of constitutive intronic regions (CIRs), constitutive exonic regions (CERs), constitutive exon-intron junctions (CJs), etc.

Sample lines are as follows.

chr1	IR_annotation	constitutive_intronic_region	3411983  3660632     .  	-	  .    downstream_constitutive_junction_number "002"; constitutive_intronic_region_number "001"; upstream_constitutive_junction_number "001"; gene_id "Xkr4"
chr1	IR_annotation	constitutive_junction	        3660632	 3660632	 . 	    -	  .	   constitutive_junction_type "5'_splice_junction"; constitutive_junction_number "001"; downstream "constitutive_intronic_region_number 001"; gene_id "Xkr4"; upstream "constitutive_exonic_region_number 001"

quant

`Arguments`

-q/--quant-type {IRI,IRC}

IR quantification types: intron retention index (IRI), intron retention coefficient (IRC). DEFAULT: "IRI".

-i/--alt-file SAMPLELIB

Input RNA-Seq alignment file. The input file must be a BAM file.

-p/--read-type {paired,single}

"paired" is for paired-end data and "single" is for single-end data. DEFAULT: "single".

Note: See Tutorial for more information.

-s/--library-type {fr-unstranded,fr-firststrand,fr-secondstrand}

Library type. DEFAULT: "fr-unstranded" (unstranded). Use "fr-firststrand" or "fr-secondstrand" for strand-specific data.

Note: See Tutorial for more information.

-u/--map-file MAPFILE (optional)

Mappability score bigWig file (depends on species, sequence length of RNA-Seq library, etc.). Or specify a species (i.e. hg19 or mm9) for which a default annotation file (default for 50 bps of single end RNA- Seq library) can be downloaded and used. If specified, mappability will be taken into account.

Note: To take into account mappability, download RSeQC 4.0.0, and install: python setup.py install

-e/--species {hg19,mm9} (exclusive with -g)

Specify a species for which integrated IR annotation GTF file can be used.

Note: -e and -g are mutually exclusive and one is required.

-g/--annotation-file ANNOFILE (exclusive with -e)

IR annotation GTF file user-built by "IRTools annotation" command. -e and -g are mutually exclusive and one is required.

-n/--name NAME

Sample name, which will be used to generate output file names. REQUIRED.

--outdir OUTDIR

If specified, all output files will be written to that directory. DEFAULT: the current working directory.

-m/--min_overlap MINOVERLAP (specified when -q IRC)

Set when IR quantifiation type is "IRC". Minimum length of overlap between the reads and each of the exons or introns involved in splicing. DEFAULT: 8.

`Outputs`

-q IRI

1. NAME.quant.IRI.genes.txt is the quantification of intron retention index for all genes from RNA-Seq library SAMPLELIB.

The file format is as follows.

gene_id	gene_iv	gene_CIR_length	gene_CER_length	gene_CIR_read_count	gene_CER_read_count	gene_CIR_RPKM	gene_CER_RPKM	gene_IRI
A1BG	chr19:58858171-58864865	4071	1766	80.13888888888889	93.86111111111111	0.7394798014780853	1.9965449549590883	0.37037974008115343
A1CF	chr10:52559168-52645435	72712	9221	24.0	59.0	0.012399074027101793	0.2403577285944172	0.051585917788498296

2. NAME.quant.IRI.introns.txt is the quantification of intron retention index for all CIRs from RNA-Seq library SAMPLELIB.

The file format is as follows.

CIR_id	CIR_iv	CIR_length	adjacent_CER_length	CIR_read_count	adjacent_CER_read_count	CIR_RPKM	adjacent_CER_RPKM	intron_IRI
A1BG:004	chr19:58863053-58863648	474	570	7.138888888888889	17.02777777777778	0.5657653978142747	1.122192132703333	0.5041609019761704
A1BG:005	chr19:58862017-58862756	739	579	18.13888888888889	25.22222222222222	0.9220412349334726	1.636397795045001	0.5634578815281992

-q IRC

1. NAME.quant.IRC.genes.txt is the quantification of intron retention coefficient for all genes from RNA-Seq library SAMPLELIB.

The file format is as follows.

gene_id	gene_iv	gene_retained_reads	gene_spliced_reads	gene_IRC
AAAS	chr12:53701239-53715412	2.5	64	0.0375939849624
AAGAB	chr15:67493012-67547536	0.5	29	0.0169491525424

2. NAME.quant.IRC.introns.txt is the quantification of intron retention coefficient for all CIRs from RNA-Seq library SAMPLELIB.

The file format is as follows.

CIR_id	CIR_iv	CIR_5'retained_reads	CIR_3'retained_reads	CIR_spliced_reads	intron_IRC
AAAS:012	chr12:53702133-53702218	1	0	4	0.111111111111
AAGAB:007	chr15:67496486-67500899	0	1	5	0.0909090909091

3. NAME.quant.IRC.junctions.txt is the quantification of intron retention coefficient for all CJs from RNA-Seq library SAMPLELIB.

The file format is as follows.

CJ_id	CJ_iv	CJ_type	CJ_retained_reads	CJ_spliced_reads	junction_IRC
A1BG:012	chr19:58859005-58859006	3'_splice_junction	3	0	1.0
AAAS:001	chr12:53715125-53715126	5'_splice_junction	0	12	0.0

diff

`Arguments`

Arguments that are same as IRTools quant:

-q/--quant-type {IRI,IRC}

-n/--name NAME

--outdir OUTDIR

Additional arguments:

--indir INDIR

The directory containing the IR quantification result files to be analyzed. Default: the current working directory

-s1/--s1-files S1FILES

A comma-separated list of names for each replicate in sample 1. IR quantification result files referenced by each name must be followed by the ".quant.{IRI,IRC}.{introns,genes,junctions}.txt" extension.

-s2/--s2-files S1FILES

A comma-separated list of names for each replicate in sample 2. IR quantification result files referenced by each name must be followed by the ".quant.{IRI,IRC}.{introns,genes,junctions}.txt" extension.

-t/--analysis-type {P,U}

Type of analysis performed. "P" is for paired replicates analysis and "U" is for unpaired replicates analysis. DEFAULT: "U".

`Outputs`

-q IRI

1. NAME.diff.IRI.genes.txt is the detection of differential intron retention index from two RNA-Seq samples S1FILES and S2FILES for all genes.

The file format is as follows.

gene_id	PValue	FDR	gene_IRI_S1	gene_IRI_S2	ene_IRI_difference
Aaas	0.189649931299	0.558970463129	0.008,0.01,0.007	0.012,0.008,0.019	0.005
9530051G07Rik	0.00474936472952	0.0199711015431	0.0,0.0,0.0	0.183,0.366,0.0 0.183

2. NAME.diff.IRI.introns.txt is the detection of differential intron retention index from two RNA-Seq samples S1FILES and S2FILES for all CIRs.

The file format is as follows.

CIR_id	PValue	FDR	intron_IRI_S1	intron_IRI_S2	intron_IRI_difference
Aacs:014	1.0	1.0	0.0,0.004,0.0	0.003,0.005,0.03	0.011
Abca7:017	1.97072661512e-54	1.13601945194e-52	0.108,0.394,0.281	0.0,0.033,0.0	-0.25

-q IRC

1. NAME.diff.IRC.genes.txt is the detection of differential intron retention coefficient from two RNA-Seq samples S1FILES and S2FILES for all genes.

The file format is as follows.

gene_id	PValue	FDR	gene_IRC_S1	gene_IRC_S2	gene_IRC_difference

2. NAME.diff.IRC.introns.txt is the detection of differential intron retention coefficient from two RNA-Seq samples S1FILES and S2FILES for all CIRs.

The file format is as follows.

CIR_id	PValue	FDR	intron_IRC_S1	intron_IRC_S2	intron_IRC_difference

3. NAME.diff.IRC.junctions.txt is the detection of differential intron retention coefficient from two RNA-Seq samples S1FILES and S2FILES for all CJs.

The file format is as follows.

CJ_id	PValue	FDR	junction_IRC_S1	junction_IRC_S2	junction_IRC_difference

Tutorial

The following tutorial will demonstrate how to use IRTools. We will be working with mm9 data. Due to constraints with file sizes, we do not provide the data for users to run the following commands on their own.

Create an annotation file

The first step in the IRTools pipeline is creating an annotation file to be used for further intron retention analysis. The following is an example of using IRTools annotation to create the annotation file for the mm9 genome.

IRTools annotation -g gtf/mm9_2015.gtf -o gtf/mm9_annotation.gtf

Pre-prepared annotation files for hg19 and mm9 can be found in the data directory. These annotation files were generated with GTF files downloaded from iGenome.

Determine data parameters

Before proceeding with intron retention analysis, you must determine whether the data is pair-end or single-end and the data's strand specificity. If this information is not known, you can run infer_experiment.py to determine these parameters. The following is an example of running infer_experiment.py.

Note: To run infer_experiment.py, download RSeQC 4.0.0, and install: python setup.py install

infer_experiment.py -r ../IRTools/utility/RSeQC-4.0.0/gene_model/mm9_UCSC_knownGene.bed -i bam/B_0h_R1.bam

When the run is finished, we get the following output.

This is PairEnd Data
Fraction of reads failed to determine: 0.0001
Fraction of reads explained by "1++,1--,2+-,2-+": 0.5003
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4996

It is important to keep a record of this output as the information will be needed when running IRTools quant.

The first line of the output gives whether the data is pair-end or single-end. In this case, the data is pair-end, meaning that the BAM file must be sorted by name to run IRTools quant properly. If the data is single-end it does not matter whether or not the BAM file is sorted by name.

The second line of the output gives the fraction of total mapped reads where the strand specificity could not be determined. This number should be relatively low otherwise it could indicate that the data's quality is not great.

The third and fourth lines give the fraction of mapped reads that can be attributed to a certain strand. In this example, half can be explained by "1++,1--,2+-,2-+" and half can be explained by "1+-,1-+,2++,2--". This indicates that the data is unstranded. If the large majority of reads were to be explained by "1++,1--,2+-,2-+", this would indicate that the data is strand specific to the second strand. If the large majority of reads were to be explained by "1+-,1-+,2++,2--", this would indicate that the data is strand specific to the first strand. The interpretation of the output is that same for single-end data except that rather than the mapped reads being explained by "1++,1--,2+-,2-+" and "1+-,1-+,2++,2--" they are explained by "++,--" and "+-,-+", respectively.

The steps above should be repeated with all replicates.

Quantify intron retention

IRTools quant allows users to detect and quantify intron retention events in RNA-Seq data. The intron retention events can either be quantified as an intron retention index (IRI) or intron retention coefficient (IRC). The IRI of a CIR is defined as the ratio of its read density to the read density of its adjacent CERs and the IRI of a gene is defined as the ratio of the overall read density of CIRs in that gene to the overall read density of CERs. The IRC of a CIR is defined as the fraction of junction reads that are CJs (average of 5' CJ reads and 3' CJ reads) and the IRC of a gene is defined as the fraction of junction reads altogether in this gene that are CJs. Additionally, IRC also quantifies CJs for which the IRC is defined as the fraction of junction reads that are CJs. The following is an example of running IRTools quant in IRI mode with the parameters obtained above. Make sure to double check that the BAM file is sorted by name if the data is pair-end.

IRTools quant -q IRI -i bam/B_0h_R1.bam -p paired -s fr-unstranded -e mm9 -f BAM -n B_0h_R1 --outdir quant

Shown below are the first few lines of the output CIR IRI file and output gene IRI file, respectively.

CIR_id  CIR_iv  CIR_length      adjacent_CER_length     CIR_read_count  adjacent_CER_read_count CIR_RPKM        adjacent_CER_RPKM       intron_IRI
0610005C13Rik:001       chr7:52830147-52830496  349     220     0.0     0.9519439748852935      0.0     0.1353347974887704      0.0
0610005C13Rik:002       chr7:52829892-52829977  85      280     0.0     2.155574337921597       0.0     0.24078295617790033     0.0
0610005C13Rik:003       chr7:52826562-52829782  3093    110     0.0     1.2036303630363037      0.0     0.3422324752934783      0.0
0610005C13Rik:004       chr7:52823749-52826355  2467    585     0.0     0.0     0.0     0.0     NA

gene_id gene_iv gene_CIR_length gene_CER_length gene_CIR_read_count     gene_CER_read_count     gene_CIR_RPKM   gene_CER_RPKM   gene_IRI
0610005C13Rik   chr7:52823164-52830546  5994    915     0.0     2.155574337921597       0.0     0.0736822160981553      0.0
0610007P14Rik   chr12:87156404-87165495 7690    1185    143.0433093553559       748.9566906446437       0.581784274398963       19.76783572161043       0.02943085336160243
0610009B22Rik   chr11:51498886-51502136 2453    795     33.161477559187205      101.8385224408128       0.42282153013403        4.006505375049167       0.10553374838011824
0610009L18Rik   chr11:120209991-120212504       1894    619     31.046335986936107      23.953664013063893      0.5126856329027132      1.2103251933843449      0.4235932918731762

Analyze differential intron retention

The final step of the IRTools pipeline is analyzing differential intron retention across samples using IRTools diff. The following is an example of running IRTools diff on data with three replicates each.

IRTools diff -q IRI --indir quant -s1 B_0h_R1,B_0h_R2,B_0h_R3 -s2 B_24h_R1,B_24h_R2,B_24h_R3 -n B_rest_active

Shown below are the first few lines of the output differential CIR IRI file and differential gene IRI file, respectively.

CIR_id  PValue  FDR     intron_IRI_S1   intron_IRI_S2   intron_IRI_difference
0610007P14Rik:001       0.01748570124638199     0.04794954044905743     0.007060475733529612,0.014128453952795113,0.014477353807235999  0.0015103179228320555,0.002840643279225302,0.0026121421370695623        -0.009567726718144601
0610007P14Rik:002       0.06787966337015987     0.12263988572369974     0.009760132590207683,0.010902219030707407,0.019954850865529797  0.005107327432614331,0.006744590566349023,0.004010156747295317  -0.008251709246728738
0610007P14Rik:003       0.005272071342986928    0.02435463738145186     0.04637347332930471,0.06178728248456752,0.07360228639288818     0.01638970338560156,0.017775747116409467,0.016974940064366832   -0.043540883880127515
0610007P14Rik:004       0.18363340347714693     0.2615404726932045      0.06460308673989847,0.06047582787680872,0.1100761804966993      0.0379698930660664,0.06959717944498957,0.033493469152358374     -0.031364851149997375

gene_id PValue  FDR     gene_IRI_S1     gene_IRI_S2     gene_IRI_difference
0610005C13Rik   0.3739009663000589      0.43369087370573584     0.0,0.0,0.0     0.03589700330832069,0.0,0.0     0.011965667769440231
0610007P14Rik   0.015597116208720256    0.029785444823973002    0.02943085336160243,0.04145926912655988,0.05402988190915748     0.011682591550220334,0.01425293641802157,0.01240164576089424    -0.028860943556061218
0610009B22Rik   0.004408064327194515    0.013804288139961773    0.10553374838011824,0.1268034455814951,0.15060043224605094      0.04221736494775407,0.050984225384648134,0.05459385479080759    -0.07838072702815149
0610009L18Rik   0.09138325586455809     0.12748981426961586     0.4235932918731762,0.37842494303340185,0.4278781983026458       0.598773864510687,1.282172938644743,1.9609292502639917  0.8706598734033992

An helpful step to check differential intron retention in individual CIRs/genes/junctions is to visualize them using IGV or another genome viewer of your choice. The following is what the Gar1 gene looks like when visualized on IGV with this data.

Name		Name	Last commit message	Last commit date
Latest commit History 113 Commits
IRTools		IRTools
bin		bin
.gitignore		.gitignore
IGV_example.PNG		IGV_example.PNG
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

IRTools

Table of Contents

Installation

PIP

From source

Usage

annotation

`Arguments`

`Outputs`

quant

`Arguments`

`Outputs`

diff

`Arguments`

`Outputs`

Tutorial

Create an annotation file

Determine data parameters

Quantify intron retention

Analyze differential intron retention

About

Releases 6

Packages

Languages

License

WeiqunPengLab/IRTools

Folders and files

Latest commit

History

Repository files navigation

IRTools

Table of Contents

Installation

PIP

From source

Usage

annotation

Arguments

Outputs

quant

Arguments

Outputs

diff

Arguments

Outputs

Tutorial

Create an annotation file

Determine data parameters

Quantify intron retention

Analyze differential intron retention

About

Resources

License

Stars

Watchers

Forks

Releases 6

Packages 0

Languages

`Arguments`

`Outputs`

`Arguments`

`Outputs`

`Arguments`

`Outputs`

Packages