Bo Han edited this page Jan 19, 2016 · 31 revisions

piPipes installation and genome preparation

This document explains how to obtain piPipes from Github and how to install genome files.

To obtain and update piPipes

To obtain piPipes using commandline git

To clone the directory from Github, you will need to have git installed on your system. If not, please download git here.

# The genome sequence and annotations will be stored under the piPipes directory
# so allow extra ~8.5 G for dm3 (fly), ~90 G for mm9 (mouse), ~131 G for hg19 (human)
git clone
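Because the genome files end up inside the piPipes directory, it can help to confirm free space before cloning. The following is a minimal sketch, not part of piPipes; the helper name `has_free_gb` is made up for illustration and it assumes a POSIX `df`.

```shell
# has_free_gb DIR GB: succeed if the filesystem holding DIR has at least GB gigabytes free.
# (hypothetical helper, not shipped with piPipes)
has_free_gb() {
    dir=$1
    need_gb=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Example: dm3 needs roughly 8.5 G, so ask for 9.
if has_free_gb . 9; then
    echo "enough space for dm3"
else
    echo "less than 9 G free here"
fi
```

For mm9 or hg19, pass 90 or 131 instead of 9.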

To update piPipes

# If you have git, enter the piPipes directory and then type:
git pull

# occasionally, you might get an error message like:
git pull
Updating 42bf792..fe137aa
error: Untracked working tree file 'common/dm3/rRNA.fa' would be overwritten by merge.  Aborting
# this issue originated from the explicit inclusion of rRNA.fa file in piPipes, which conflicts
# with the same file extracted from iGenome when you install the genome
# to solve "Untracked working tree file":
rm -f common/dm3/rRNA.fa && git pull

# you might also get an error like:
warning: Cannot merge binary files: common/dm3/structured_loci.bed.gz 
(HEAD vs. fe137aab4c81c6b0ff3f66cef68e3b7e396aba15)

Auto-merging common/dm3/structured_loci.bed.gz
CONFLICT (content): Merge conflict in common/dm3/structured_loci.bed.gz
Automatic merge failed; fix conflicts and then commit the result.

# this issue originates from a forced update of a file that is listed in .gitignore
# to solve "Merge conflict":
git checkout -- common/dm3/structured_loci.bed.gz 
git pull

To re-install (start from scratch)

# If you have git, enter the piPipes directory and then type:
git reset --hard origin/master
# to re-install a genome in a clean background, enter the common/ directory and do:
rm -rf bosTau7
git checkout -- bosTau7
piPipes install -g bosTau7

To obtain piPipes from release page

Alternatively, you can obtain piPipes from its release page. Note that you will not be able to make upgrades without git.

To set up piPipes

Make symbolic links to the piPipes scripts so that you can run piPipes without typing the absolute path:

# Enter the piPipes directory
ln -s $PWD/piPipes $HOME/bin/piPipes
ln -s $PWD/piPipes_debug $HOME/bin/piPipes_debug
# If successfully done:
$ which piPipes
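The links above only help if `$HOME/bin` exists and is listed in `$PATH`. A small sketch of how one might check this; the function name `on_path` is an illustrative assumption, not a piPipes command.

```shell
# on_path DIR: succeed if DIR is one of the colon-separated entries of $PATH.
# (hypothetical helper, not shipped with piPipes)
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

on_path "$HOME/bin" || echo "add to your shell startup file: export PATH=\"\$HOME/bin:\$PATH\""
```

If `$HOME/bin` does not exist yet, create it with `mkdir -p "$HOME/bin"` before running the `ln -s` commands.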

Other software

piPipes ships with most third-party tools pre-compiled in its bin directory; they are found automatically when you run piPipes. To avoid mixing them with your own versions, we do not recommend adding /piPipes/bin to your $PATH. However, some tools are hard to ship, so you will need to install them yourself if you have not already done so.

# 1. R
# Please follow instructions on to install R
#! if successfully installed:
$ which Rscript
#! Note that R 3.1.0 has a different behavior for read.table(),
# which has been fixed in 3.1.1.
# Also try to keep only one version of R in your system or $PATH
# Many of the "bugs" reported by our users were caused by multiple versions of R!

# FYI: in the installation pipeline, piPipes will try to install the following packages.
# It would be nice if they are manually installed and confirmed.
## from CRAN
## from Bioconductor

# 2. HTSeq-count
# Please follow instructions on
# to install HTSeq-count
# or if you have pip set up
pip install HTSeq
#! if successfully installed:
$ which htseq-count
# HTSeq-count is used in RNA-seq pipeline; if you are not planning to use RNA-seq pipeline, you
# might not need it

# 3. MACS2
# Please follow instructions on
# to install MACS2
# or if you have pip
pip install macs2 # please run "macs2 callpeak -h" to see if the option --outdir is included...; if not, install it from github

#! if successfully installed:
$ which macs2
# MACS2 is used in ChIP-seq pipeline; if you are not planning to use ChIP-seq pipeline, you
# might not need it

# 4. Perl Module Statistics::Descriptive; install it through
cpan Statistics::Descriptive
#! if successfully installed:
$ perl -MStatistics::Descriptive -e "print \"Installed.\\n\";"
# Bio::Seq
# please follow the instructions here
# These two modules are only used in the genome-seq pipeline; if you are not planning to use the genome-seq
# pipeline, you might not need them

# 5. GNU awk
# GNU awk is heavily used in piPipes. But some versions of awk lack the GNU extensions,
# for example, the variable ARGIND; to test yours:
$ echo 1 | awk '{print ARGIND}'
# if it prints an empty line, your awk doesn't define the ARGIND variable, which will cause
# issues when you run piPipes
# the easiest way to install gawk is to use linuxbrew
# Please follow their instructions to install linuxbrew, then install gawk with the following:
$ brew install gawk
# then make a symbolic link in the bin directory of piPipes so that piPipes uses gawk as "awk"
$ ln -s $HOME/.linuxbrew/bin/gawk /path/to/piPipes/bin/awk
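The individual `which` checks above can be gathered into one pass. This is only a sketch under the assumption that the tools are expected on `$PATH` under the names used in this section; `missing_tools` and `awk_has_argind` are hypothetical helper names.

```shell
# missing_tools NAME...: print each NAME that cannot be found on $PATH.
# (hypothetical helper, not shipped with piPipes)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# awk_has_argind [AWK]: succeed if the given awk (default: awk) defines ARGIND.
# GNU awk prints 0 when reading standard input; other awks print an empty line.
awk_has_argind() {
    [ -n "$(echo 1 | "${1:-awk}" '{ print ARGIND }' 2>/dev/null)" ]
}

missing=$(missing_tools Rscript htseq-count macs2 perl)
[ -z "$missing" ] || echo "not found on PATH:" $missing
awk_has_argind || echo "awk lacks ARGIND; install gawk"
```

Tools reported as missing only matter for the pipelines that use them, as noted in the comments above.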

To install genome

piPipes provides a uniform interface for different organisms/genomes. Due to Github's limit on the size of a single file, genome sequences and annotations are downloaded separately. The user will need to perform an installation to download the files and prepare them for other pipelines to use.

To install a specific genome in one step:

piPipes install -g dm3    # fly genome dm3
piPipes install -g dm6    # fly genome new release, BDGP6
piPipes install -g mm9    # mouse genome mm9
piPipes install -g hg19   # human genome hg19

Many computing clusters only have internet access on the 'head node', which should only be used to submit jobs but not to run jobs. To separate downloading and preparation steps:

# under the "head" node: with internet access but no computing power
piPipes install -g dm3 -D
# finish the work under a computing node
piPipes install -g dm3
# Some steps take advantage of multiple CPUs, so providing more than one CPU using `-c`
# accelerates the installation process.
piPipes install -g dm3 -c 8


  • piPipes uses wget --continue so downloading will resume if the installation is disrupted. piPipes also only runs steps that haven't succeeded.

  • During the installation, the user will be prompted to define the length of siRNAs and piRNAs for the genome to be installed. Our lab uses 20–22 nt for fly/mouse siRNA, 23–29 nt for fly piRNA and 23–35 nt for mouse piRNA. This information is stored in the common/dm3/variables file, and users can change the values manually later.

  • The installation of R packages is NOT multi-threading safe, so please install each genome separately.
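Since genomes should be installed one at a time, a simple loop is enough when several are needed. The sketch below uses only the `install -g` and `-c` options shown above; the `PIPIPES` variable is a hypothetical override introduced here so the loop can be dry-run with `echo`, and is not something piPipes itself reads.

```shell
# install_genomes CPUS GENOME...: install genomes sequentially, stopping at the first failure.
# PIPIPES is a hypothetical override of the executable name, used only for dry runs.
install_genomes() {
    cpus=$1
    shift
    for genome in "$@"; do
        "${PIPIPES:-piPipes}" install -g "$genome" -c "$cpus" || return 1
    done
}

# Dry run: print the commands instead of running them.
PIPIPES=echo install_genomes 8 dm3 mm9
```

Dropping the `PIPIPES=echo` prefix runs the real installations, one after another.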

Genome Assembly Supported

Currently, Drosophila melanogaster and Mus musculus piRNAs are the best studied, and piPipes is optimized for those two species (assembly versions dm3 and mm9 from UCSC). For other organisms, some functions in the pipelines may not be performed because the piRNA cluster annotation is relatively immature. Please contact us if you would like to contribute to the annotations of organisms that are poorly supported by piPipes.

File organization

All the files for a specific genome are stored under /path/to/piPipes/common/. For example, fly files are stored under /path/to/piPipes/common/dm3. Most of them are in gzipped BED format.


piPipes downloads the annotation from iGenome, which lacks chrU and X-TAS. piPipes therefore downloads chrU.fa from UCSC, and X-TAS.fa is included in the GitHub repository.

For piRNA cluster annotation, piPipes uses the one from Brennecke, et al., Cell, 2007.

For transposons, piPipes uses two different annotations: the transposon sequences are from flyBase and the repBase sequences are from repBase. The transposon annotation has been used in the Zamore Lab since Li, et al., Cell, 2009. The repBase annotation separates the Long Terminal Repeats (LTRs) of a retrotransposon from the internal part, so that LTR-derived sequences do not become multi-mappers simply because a transposon sequence contains two LTRs.

BDGP6 (Berkeley Drosophila Genome Project Release 6)

piPipes has incorporated release 6, the new assembly of the fruit fly genome.

# To install the new release, type:
piPipes install -g dm6

Since it was only just released (July 2014), neither iGenome nor UCSC has incorporated it yet. We took most of the annotation files from flyBase. Several notes:

1. piRNA cluster

Using the converter tool provided by flyBase, we tried to convert the piRNA cluster coordinates to the new assembly. However, 46 clusters could not be located in the new assembly, mostly due to "maps to more than one scaffold".

For now we keep only the 96 clusters that were successfully mapped. We are planning to use new data with higher depth, and possibly new algorithms, to annotate new clusters.

For more information, please read file common/dm6/Brennecke.piRNAcluster.bed6.converted.failed

2. Repeat Masker

We ran RepeatMasker with the following parameters to identify transposon sites in BDGP6.

Note that by providing -species drosophila, we used the transposon sequences from repBase instead of the sequences from flyBase.

# Using flyBase transposon sequences
RepeatMasker \
	-pa 24 \
	-s \
	-low \
	-lib dmel-all-transposon-r6.01.fasta \
	-gff dmel-all-chromosome-r6.01.fasta \
	1> flyBase.stdout \
	2> flyBase.stderr
# Using repBase
RepeatMasker \
	-pa 24 \
	-s \
	-low \
	-species drosophila \
	-gff dmel-all-chromosome-r6.01.fasta \
	1> repBase.stdout \
	2> repBase.stderr

3. GTF file

The gtf file obtained from flyBase cannot be correctly processed by gtfToGenePred from kent tools, due to the presence of "trans-splicing" of mdg4.

invalid gffGroup detected on line: 3R	FlyBase	CDS	21375060	21375912	3.000000	-	0	gene_id "FBgn0002781"; transcript_id "FBtr0084081";
GFF/GTF group FBtr0084081 on 3R+, this line is on 3R-, all group members must be on same seq and strand
# the rest trans-splicing ones include


We thus removed all the mdg4 annotations.

grep -v mdg4
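To see which transcripts would trigger the gtfToGenePred error before filtering, one can list every transcript_id that appears on more than one strand. This is a rough sketch, not the procedure we used; the function name `mixed_strand_transcripts` is hypothetical, and it assumes a tab-separated GTF with the strand in column 7 and a `transcript_id "..."` attribute in column 9.

```shell
# mixed_strand_transcripts [GTF...]: print transcript_ids whose records appear on more than one strand.
# (hypothetical helper; assumes tab-separated GTF, strand in column 7, attributes in column 9)
mixed_strand_transcripts() {
    awk -F '\t' '
        $0 !~ /^#/ && match($9, /transcript_id "[^"]+"/) {
            # skip the 15 characters of: transcript_id "  and drop the closing quote
            id = substr($9, RSTART + 15, RLENGTH - 16)
            if (!(id in strand)) strand[id] = $7
            else if (strand[id] != $7) mixed[id] = 1
        }
        END { for (id in mixed) print id }
    ' "$@" | sort
}
```

After the mdg4 lines have been removed with `grep -v mdg4`, running this function on the filtered file should print nothing if mdg4 was the only source of mixed-strand groups.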


piPipes downloads the annotation from iGenome.

piPipes uses the piRNA cluster annotation from Li, et al., Mol Cell, 2013 and transposon annotation from repBase.


piPipes downloads the annotation from iGenome.

piPipes uses the piRNA cluster annotation from Rosenkranz, et al., BMC Bioinformatics, 2013 and transposon annotation from repBase.

other genomes with iGenome support

In order for piPipes to perform its full function on other genomes, the following steps should be completed:

1. Annotate piRNA clusters and provide the annotation in BED format. Please also provide the sequences in a file named ${GENOME}.piRNAcluster.fa.

Run proTRAC or piClust to produce piRNA cluster annotation.

	Rosenkranz D and Zischler H. 2012. proTRAC--a software for probabilistic piRNA cluster detection, visualization and analysis. BMC Bioinformatics 13: 5.
	Jung I, Park JC and Kim S. 2014. piClust: A density based piRNA clustering algorithm. Comput Biol Chem.

2. Get gene structure annotations from the UCSC table browser or through the MySQL interface. We have already included those files for many organisms in the common folder. If the folder already exists, there is no need to do this step.

other genomes without iGenome support

We provide an option, -C, to install genomes that are not currently supported by iGenome:

-C  Custom genome installation. The user will need to create a folder 
    $PIPELINE_DIRECTORY/common/MY_GENOME and provide the following files:
	$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.fa --> genome sequence in fasta format
	$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.transposon.fa --> transposon sequence in fasta format
	$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.piRNAcluster.bed --> piRNA cluster in bed format
	$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.genes.gtf --> genes annotation in gtf format
	$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.hairpin.fa --> miRNA hairpin sequence in fasta format
	$PIPELINE_DIRECTORY/common/MY_GENOME/MY_GENOME.mature.fa --> miRNA sequence in fasta format
  *Note that if you obtain hairpin and mature sequences from miRBase, you can extract the sequences 
corresponding to your genome using $PIPELINE_DIRECTORY/bin/
	$PIPELINE_DIRECTORY/bin/ hairpin.fa dme > \
	$PIPELINE_DIRECTORY/bin/ mature.fa  dme > \
  Then run: 
  piPipes install -g MY_GENOME -C

It is best to name the genome using only lowercase a-z and underscores. Avoid all-upper-case names such as "GENOME".

Then please create the genomic_features file according to the instructions at the end of this document.
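Before running `piPipes install -g MY_GENOME -C`, it may be worth confirming that all six input files listed under -C are in place. A minimal sketch follows; the function name `missing_genome_files` is hypothetical, and the file suffixes are taken from the list above.

```shell
# missing_genome_files DIR NAME: print every required input file that is absent from DIR.
# (hypothetical helper, not shipped with piPipes)
missing_genome_files() {
    dir=$1
    name=$2
    for suffix in fa transposon.fa piRNAcluster.bed genes.gtf hairpin.fa mature.fa; do
        [ -f "$dir/$name.$suffix" ] || echo "$dir/$name.$suffix"
    done
}

# Example (MY_GENOME is a placeholder):
# missing_genome_files "$PIPELINE_DIRECTORY/common/MY_GENOME" MY_GENOME
```

An empty result means every expected file was found; any printed path still needs to be provided.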

Currently the following genomes have been done for this step


3. Edit the genomic_features file under the genome folder. See the next section.

4. The genome sequence should be provided in a file named $GENOME.fa. piPipes builds a bowtie index of the genome sequence for the small RNA pipeline, a STAR index for the RNA-seq and degradome pipelines, and a Bowtie2 index for the Genome-seq pipeline.

5. The rRNA sequence should be provided in a file named rRNA.fa. piPipes builds a bowtie index of the rRNA for small RNA and a bowtie2 index for normal RNA.

6. The transposon consensus sequences should be provided in a file named ${GENOME}.repBase.fa. piPipes builds a bowtie index of the repBase/transposon/piRNA cluster sequences for small RNA.

Basic piPipes directory structure

|-- piPipes/ # top directory
|   |-- piPipes # main bash script to run
|   |-- piPipes_debug # main bash script to run, debug mode
|   |-- bin/ # binary executables
|       |-- # smallRNA seq pipeline, single sample mode
|       |-- # smallRNA seq pipeline, dual sample mode
|       |-- # RNA-seq pipeline, single sample mode
|       |-- # RNA-seq pipeline, dual sample mode
|       |-- # Degradome-seq pipeline
|       |-- # ChIP-seq pipeline, single sample mode
|       |-- # ChIP-seq pipeline, dual sample mode
|       |-- # Genomic Seq pipeline
|       |-- ... # binaries like bowtie, STAR, cufflinks ...
|   |-- src/ # source codes
|       |-- bed2_to_bedGraph.cpp # piPipes source codes
|       |-- third_party/ # source codes of other tools; use this if the precompiled ones don't work
|       |-- ...
|   |-- common/ # where annotations and sequences are stored
|       |-- mm9/
|       |-- dm3/
|           |-- dm3.fa # genome sequence
|           |-- genomic_features # very important configuration file, see below
|           |-- Brennecke.piRNAcluster.bed6.gz # one of the annotation files, in bed format
|           |-- BowtieIndex/
|           |-- ...
|       |-- dm6/
|       |-- hg19/
|       |-- genome_supported.txt # stores the names of genomes that have been installed
|       |-- RepBase19.02.fasta.tar.gz # transposon consensus sequences from repBase
|       |-- # eXpress only takes the first token of Fasta name...

common folder

piPipes downloads annotations from iGenome (UCSC version), which usually include the genomic sequence (fasta), rRNA (fasta) and transcriptome (gtf) used by piPipes. piPipes includes the repBase sequences (fasta) in the GitHub repository for dm3 and mm9. For other genomes, please retrieve the repBase.fa and name it ${GENOME}.repBase.fa in the common/${GENOME} directory. For example, run:

# Enter the directory unarchived from RepBase19.02.fasta.tar.gz
$ cat humrep.ref humsub.ref > ../hg19/hg19.repBase.fa
# for hg19 genome, please then run 
bash hg19/hg19.repBase.fa > hg19/hg19.repBase.fa.1 && \
mv hg19/hg19.repBase.fa.1 hg19/hg19.repBase.fa
# it replaces spaces in the fasta headers with underscores
# this step is essential since eXpress only uses the first token as name
# and some transposon sequences share same name
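The effect of that header clean-up can be illustrated with a one-line sed filter. This is a stand-in written for this page, not the script shipped in piPipes; the function name `underscore_fasta_headers` is hypothetical, and treating tabs like spaces is an added assumption.

```shell
# underscore_fasta_headers [FASTA...]: replace every space or tab in header lines with "_".
# Sequence lines (those not starting with ">") are passed through unchanged.
# (hypothetical stand-in, not the piPipes script)
underscore_fasta_headers() {
    sed '/^>/ s/[[:blank:]]/_/g' "$@"
}

# Example (paths as in the commands above):
# underscore_fasta_headers hg19/hg19.repBase.fa > hg19/hg19.repBase.fa.1 && \
# mv hg19/hg19.repBase.fa.1 hg19/hg19.repBase.fa
```

With the whole header joined into a single token, tools that keep only the first token still see a distinct name for each sequence.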

genomic features

piPipes includes a set of genomic features (bed) in the genomic_features file under the directory of each genome. Please also include them in the common/${GENOME} directory and add them to the TARGETS array in common/${GENOME}/genomic_features. Use the following example to set this up:

# variables for small RNA pipeline intersecting
	# tRNA, rRNA, nonCoding RNA (flyBase) from UCSC table browser
	# piRNA cluster defined in Brennecke, et al,. Cell, 2007; no strand information
	# 42AB
	# 20A
	# flam
	# repeatMasker obtained from UCSC
	# repeat masker identified region that fall into piRNA cluster
	# repeat masker identified region that fall outside piRNA cluster
	# transposon region used in Li, et al., Cell, 2009. More conserved than repeat masker
	# transposon region in cluster
	# transposon region out cluster
	# transposons that failed to pass threshold in Li, et al., Cell, 2009.
	# More conserved than repeat masker
	# group 1 transposon in Li, et al., Cell, 2009, mainly germline
	# group 2 transposon in Li, et al., Cell, 2009
	# group 3 transposon in Li, et al., Cell, 2009, mainly somatic
	# flyBase gene
	# flyBase exons
	# flyBase introns
	# flyBase introns that subtract repeatMasker
	# flyBase 5' UTR
	# flyBase CDS
	# flyBase 3' UTR
	# cis-NATs
	# structural loci
	# linc RNA identified in 'Identification and properties of 1,119 candidate lincRNA loci in the
	# Drosophila melanogaster genome. Genome Biol Evol. 2012;4(4):427-42.'
	# unannotated region, basically all the genome segments between annotations defined above

# TARGETS is used in small RNA-seq and degradome-seq pipeline
	declare -a TARGETS=( \
	"piRNA_Cluster" \
	"piRNA_Cluster_42AB" \
	"piRNA_Cluster_20A" \
	"piRNA_Cluster_flam" \
	"repeatMasker" \
	"repeatMasker_IN_Cluster" \
	"repeatMasker_OUT_Cluster" \
	"Trn" \
	"Trn_IN_Cluster" \
	"Trn_OUT_Cluster" \
	"Trn_GROUP1" \
	"Trn_GROUP2" \
	"Trn_GROUP3" \
	"Trn_GROUP0" \
	"flyBase_Gene" \
	"flyBase_Exon" \
	"flyBase_Intron" \
	"flyBase_Intron_xRM" \
	"flyBase_5UTR" \
	"flyBase_CDS" \
	"flyBase_3UTR" \
	"cisNATs" \
	"structural_loci" \
	"lincRNA" \
	"unannotated" )

# TARGETS_SHORT is used for "cis-Ping-Pong" analysis between degradome/small RNA.
# Since this step uses multi-threading itself, we are not able to run each feature simultaneously
# thus a few less important ones have been removed
	declare -a TARGETS_SHORT=( \
	"piRNA_Cluster" \
	"piRNA_Cluster_42AB" \
	"piRNA_Cluster_20A" \
	"piRNA_Cluster_flam" \
	"repeatMasker" \
	"Trn" \
	"Trn_GROUP1" \
	"Trn_GROUP2" \
	"Trn_GROUP3" \
	"Trn_GROUP0" \
	"flyBase_Gene" \
	"flyBase_Exon" \
	"flyBase_Intron_xRM" \
	"flyBase_5UTR" \
	"flyBase_3UTR" \
	"lincRNA" )

# The following variables are for the pie chart, which gives reads information for genomic
# features that are mostly exclusive to each other. Different from the genomic feature count
# using TARGETS, reads mappable to genomic features in TARGETS_EXCLUSIVE will be partitioned.
# For example, if a read overlaps with a region annotated as both piRNA_Cluster and Repeats, 
# piRNA_Cluster and Repeats will each get half of the reads.
# Please see small RNA-seq pipeline document for more information.
	declare -a TARGETS_EXCLUSIVE=( \
	"piRNA_Cluster" \
	"CDS" \
	"FivePrimeUTR" \
	"ThreePrimeUTR" \
	"Intron" \
	"Repeats" \
	"tRNA_NonCoding" )
# variables for small RNA direct mapping
	declare -a DIRECT_MAPPING=( "transposon" "repBase" "piRNAcluster" )

# gtf files for rnaseq/deg/cage htseq-count
	declare -a HTSEQ_TARGETS=( "Genes_transposon_Cluster" "Genes_repBase_Cluster" )
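The equal partitioning described for TARGETS_EXCLUSIVE can be illustrated with a few lines of awk. This is only an arithmetic sketch, not the code piPipes runs; the three-column input (read name, read count, comma-separated list of overlapping features) and the function name `partition_reads` are invented for the example.

```shell
# partition_reads [FILE...]: split each read count equally among the features it overlaps.
# Input columns (hypothetical format): read name, count, comma-separated feature list.
partition_reads() {
    awk '
        {
            n = split($3, features, ",")
            for (i = 1; i <= n; i++) total[features[i]] += $2 / n
        }
        END { for (f in total) printf "%s\t%g\n", f, total[f] }
    ' "$@" | LC_ALL=C sort
}

# A read seen 10 times overlapping both piRNA_Cluster and Repeats gives 5 to each;
# a read seen 4 times overlapping only CDS gives all 4 to CDS.
printf 'read1 10 piRNA_Cluster,Repeats\nread2 4 CDS\n' | partition_reads
```

With plain TARGETS counting, by contrast, the first read would count in full toward every feature it overlaps.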

For example:

	# put the bed files under the common/xxx folder
	# MASK is used to mask regions
	# some regions of interest
	# put them in an array in this way
	declare -a TARGETS=( \
	"piRNACluster" \
	"myGene" \
	"regionOfInterest" )

# The following variables are for the pie chart, which gives reads information for genomic
# features that are mostly exclusive to each other. Different from the genomic feature count
# using TARGETS, reads mappable to genomic features in TARGETS_EXCLUSIVE will be partitioned.
# For example, if a read overlaps with a region annotated as both piRNA_Cluster and Repeats, 
# piRNA_Cluster and Repeats will each get half of the reads.
# Please see small RNA-seq pipeline document for more information.
	declare -a TARGETS_EXCLUSIVE=( \
	"piRNACluster" \
	"myGene" \
	"regionOfInterest" )