Skip to content

Commit

Permalink
code untar in directory
Browse files Browse the repository at this point in the history
to keep track of code changes
  • Loading branch information
warrenlr committed Nov 15, 2016
1 parent c1510f9 commit 673c912
Show file tree
Hide file tree
Showing 12 changed files with 2,842 additions and 0 deletions.
1,046 changes: 1,046 additions & 0 deletions RAILS_v1.1/RAILS

Large diffs are not rendered by default.

530 changes: 530 additions & 0 deletions RAILS_v1.1/cobbler.pl

Large diffs are not rendered by default.

188 changes: 188 additions & 0 deletions RAILS_v1.1/readme.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,188 @@
## RAILS v1.1 and Cobbler v0.2 Rene L. Warren, 2014-2016
email: rwarren at bcgsc.ca

### Name

RAILS: Radial Assembly Improvement by Long Sequence Scaffolding
Cobbler: Gap-filling with long sequences


### Description

RAILS and Cobbler are genomics application for scaffolding and automated finishing of genome assemblies with long DNA sequences.
They can be used to scaffold & finish high-quality draft genome assemblies with any long, preferably high-quality, sequences such as scaftigs/contigs from another genome draft.

They both rely on accurate, long DNA sequences to patch gaps in existing genome assembly drafts.

Cobbler is a utility to automatically patch gaps (ambiguous regions in a draft assembly, represented by N's)
It does so by first aligning the long sequences to the assembly, tallying the alignments and replacing N's with the sequences from these long DNA sequences.

RAILS is an all-in-one scaffolder and gap-filler. Its process is similar to that of Cobbler. It scaffolds your genome draft with the help of long DNA sequences (contig sequences are ordered/oriented using alignment information). The newly created gaps are automatically filled with the DNA sequence of the provided long DNA sequence.

You can test the software by executing "runme.sh" in the test folder. A simulated SARS genome assembly is provided to test the software.

### Implementation and requirements

RAILS and Cobbler are implemented in PERL and run on any OS where PERL is installed.


### Community guidelines:

I encourage the community to contribute to the development of this software, by providing suggestions for improving the code and/or directly contributing to the open source code for these tools. Users and developers may report software issues, bug fix requests, comments, etc, at <https://github.com/warrenlr/RAILS>


### Install

Download the tar ball, gunzip and extract the files on your system using:

gunzip rails_v1-1.tar.gz
tar -xvf rails_v1-1.tar

Alternatively, individual tools are available within the github repository


### Dependencies

Make sure you have installed bwa (Version: 0.7.15-r1140) and that is is in your path.


### Test data

Go to ./test
(cd test)

1. SARS:
execute runme.sh
(./runme.sh)

2. Human:
execute runmeHuman.sh (will take a while to run)
(./runmeHuman.sh)


### Usage

./runRAILS.sh
Usage: runRAILS.sh <FASTA assembly .fa> <FASTA long sequences .fa> <anchoring sequence length eg. 250> <min sequence identity 0.95>

this pipeline will:
1. reformat the assembly file $1
2. rename the long sequence file $2
3. Build a database index with bwa
4. Align the reformatted long sequences to your re-formatted baseline assembly
5. Run Cobbler to gap-fill regions of ambiguity
6. Reformat Cobbler's .fa file
7. Build a database index of it with bwa
8. Align the reformatted long sequences to your re-formatted cobbler assembly
9. Run RAILS to generate a newly scaffolded assembly draft

Usage: ./cobbler.pl [v0.2]
-f Assembled Sequences to further scaffold (Multi-Fasta format, required)
-q Long Sequences queried (Multi-Fasta format, required)
-s SAM file
-d Anchoring bases on contig edges (ie. minimum required alignment size on contigs, default -d 1000, optional)
-i Minimum sequence identity, default -i 0.9, optional
-t LIST of names/header, long sequences to avoid using for merging/gap-filling scaffolds (optional)
-b Base name for your output files (optional)
-v Runs in verbose mode (-v 1 = yes, default = no, optional)

Usage: ./RAILS [v1.1]
-f Assembled Sequences to further scaffold (Multi-Fasta format, required)
-q Long Sequences queried (Multi-Fasta format, required)
-s SAM file
-d Anchoring bases on contig edges (ie. minimum required alignment size on contigs, default -d 1000, optional)
-i Minimum sequence identity, default -i 0.9, optional
-t LIST of names/header, long sequences to avoid using for merging/gap-filling scaffolds (optional)
-b Base name for your output files (optional)
-v Runs in verbose mode (-v 1 = yes, default = no, optional)


### How it works

The pipeline is detailed in the provided script runRAILS.sh

Cobbler's process:

The assembly draft sequence supplied to Cobbler is first broken up at the ambiguous regions of the assembly (Ns) to create scaftigs.
In the runRAILS.sh, these scaftigs are renamed, tracking their scaffold of origin (renumbered incrementally) and their position within it (also numbered incrementally).
A bwa index is created and the long sequence file, also re-numbered, is aligned to the scaftigs.
Cobbler is supplied with the alignment file (-s sam file) and the long reads files (-q option), specifying the minimum length of anchoring bases (-d) aligning at the edge of scaftigs and the minimum sequence identity of the alignment (-i). When 1 or more long sequences align unambiguously to the 3'end of a scaftig and the 5'end of its neighbour, the gap is patched with the sequence of that long sequence. If no long sequences are suitable, or the -d and -i conditions are not met, the original Ns are placed back between those scaftigs.

RAILS process:

In RAILS, the process is similar as for Cobbler, except that the draft assembly is not broken up at Ns, since the goal is to merge distinct sequences into larger ones. Long sequences are aligned to the draft assembly sequences, orienting and ordering sequences and simulateneously filling the gaps between them, using DNA bases from the long sequences.

Scaffolding in RAILS is done using the LINKS scaffolder code (Warren et al. 2015), the unpublished scaffolding engine in the widely-used SSAKE assembler (Warren et al. 2007), and foundation of the SSPACE-LongRead scaffolder (Boetzer and Pirovano, 2014).

Output: For both Cobbler and RAILS, a summary of the gaps closed and their lengths is provided (.tsv) as a text file.
A fasta file (.fa) of the finished and/or scaffolded draft is generated for both along with a log file reporting basic success statistics.


Boetzer M, Pirovano W. 2014. SSPACE-LongRead: scaffolding bacterial draft genomes using long read sequence information. BMC Bioinformatics.15:211. DOI: 10.1186/1471-2105-15-211

Warren RL, Yang C, Vandervalk BP, Behsaz B, Lagman A, Jones SJ, Birol I. 2015. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience 4:35. DOI: 10.1186/s13742-015-0076-3

Warren RL, Sutton GG, Jones SJM, Holt RA. 2007. Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 23(4):500-501. DOI: 10.1093/bioinformatics/btl629


### Runs on the human genome

On a human draft assembly, cobbler patched over 65% of the gaps using 1, 2.5, 5, 15 kb long DNA sequences simulated from the human genome reference. The Pearson correlation between the predicted gap sizes and the size of patched gaps is R=0.8150


**Table 1.** Patching gaps with Cobbler using simulated 1, 2.5, 5, 15kbp simulated long sequences from human genome reference GRCh38.

Metric | Value
---- | ----
Total gaps | 148,091
Number of gaps patched | 95,523
Proportion of gaps patched | 65.1%
Average length (bp) | 343.39
Length st.dev +/- | 931.12
Total bases added | 32,801,755
Largest gap resolved (bp) | 13,662
Shortest gap resolved (bp) | 1

RAILS was used to further contiguate the human baseline assembly draft and automatically close gaps within in:

**Table 2.** RAILS scaffolding and gap-filling summary on a human assembly baseline, using simulated 1, 2.5, 5, 15kbp simulated long sequences from human genome reference GRCh38.

Metric | Value
---- | ----
Number of merges induced | 6,029
Average closed gap length (bp) | 1,136.71
Closed gap length st.dev +/- | 2,511.69
Total bases added | 6,853,222
Largest gap resolved (bp) | 14,471
Shortest gap resolved (bp) | 1

6,029 merges resulted from RAILS scaffolding of the baseline human assembly draft (1,695 >= 500bp)
The scaffold N50 length increased from 5.6 to 7.3 Mbp, a 30% increase in N50 length.

**Table 3.** Assembly statistics on human genome scaffolding and finishing post cobbler and RAILS.

n:500 | n:N50 | n:NG50 | NG50 | N50 | E-size | max | sum | name
------ | ----- | ----- | --------- | --------- | --------- | ------- | ------- | ---------
65,905 | 145 | 164 | 5,144,025 | 5,597,244 | 7,101,538 | 26.41e6 | 2.794e9 | baseline
65,905 | 145 | 161 | 5,312,196 | 5,658,133 | 7,175,808 | 26.66e6 | 2.827e9 | cobbler
64,210 | 113 | 125 | 6,935,685 | 7,266,542 | 9,007,414 | 32.14e6 | 2.836e9 | RAILS


### License Preamble

RAILS and Cobbler Copyright (c) 2014-2016 British Columbia Cancer Agency Branch. All rights reserved.

RAILS and Cobbler are released under the GNU General Public License v3

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 3.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.

30 changes: 30 additions & 0 deletions RAILS_v1.1/runRAILS.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
#!/bin/bash
#RLW 2016
if [ $# -ne 4 ]; then
echo "Usage: $(basename $0) <FASTA assembly .fa> <FASTA long sequences .fa> <anchoring sequence length eg. 250> <min sequence identity 0.95>"
exit 1
fi
###Change line below to point to path of bwa executables
export PATH=/gsc/btl/linuxbrew/bin:$PATH
echo Resolving ambiguous bases -Ns- in $1 assembly using long sequences $2
echo reformatting file $1
cat $1 | perl -ne 'if(/^\>/){$scafnum++;}else{my $len=length($_);my @scaftigs=split(/N+/i,$_);my $scaftignum=0;foreach my $scaftig(@scaftigs){ my $len=length($scaftig);$scaftignum++; print ">wga$scafnum";print "."; print "$scaftignum,$len\n$scaftig\n";}}' > $1-formatted.fa
echo reformatting file $2
cat $2 | perl -ne 'if(/^\>/){$ct++;}else{my $len=length($_);print ">seq$ct,$len\n$_";}' > $2-formatted.fa
echo Building sequence database index out of your $1-formatted.fa assembly contigs..
bwa index $1-formatted.fa
echo Aligning long sequences $2-formatted.fa to your contigs..
bwa mem -a -t4 $1-formatted.fa $2-formatted.fa > $2_vs_$1_gapfilling.sam
echo Scaffolding $1-formatted.fa using $2-formatted.fa and filling gaps with sequences in $2-formatted.fa
./cobbler.pl -f $1 -s $2_vs_$1_gapfilling.sam -d $3 -i $4 -b $2_vs_$1_$3_gapsFill -q $2-formatted.fa
echo Process terminated.
echo RAILS scaffolding $1.gapsFill.fa sequences using long seqs $2 -- anchoring sequence threshold $3 bp
echo reformatting file $1.gapsFill.fa
cat $2_vs_$1_$3_gapsFill.fa | perl -ne 'if(/^\>/){$ct++;}else{my $len=length($_);print ">wga$ct,$len\n$_";}' > $2_vs_$1_$3_gapsFill-formatted.fa
echo Building sequence database index out of your $2_vs_$1_$3_gapsFill-formatted.fa assembly contigs..
bwa index $2_vs_$1_$3_gapsFill-formatted.fa
echo Aligning long sequences $2-formatted.fa to your contigs..
bwa mem -a -t4 $2_vs_$1_$3_gapsFill-formatted.fa $2-formatted.fa > $2_vs_$1_scaffolding.sam
echo Scaffolding $2_vs_$1_$3_gapsFill-formatted.fa using $2-formatted.fa and filling new gaps with sequences in $2-formatted.fa
./RAILS -f $2_vs_$1_$3_gapsFill-formatted.fa -s $2_vs_$1_scaffolding.sam -d $3 -i $4 -b $2_vs_$1_$3_rails -q $2-formatted.fa
echo RAILS process terminated.
6 changes: 6 additions & 0 deletions RAILS_v1.1/test/SARSassembly.fa

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions RAILS_v1.1/test/SARSgenome.fa

Large diffs are not rendered by default.

Loading

0 comments on commit 673c912

Please sign in to comment.