A fully automated pipeline for TAL effector RVD sequence determination from raw PacBio data
Dependencies:
- AMOS 3.1.0
- BLAST command line tools 2.2.28+
- GNU Parallel 20130722+
- MUMmer 3.23
- Python 2.6/2.7
- SMRTAnalysis 2.3
To run the pipeline, edit config.ini and then run:
python pbx.py config.ini
config.ini has only 3 required parameters:
- smrtanalysis_path is the full path to SMRTAnalysis.
- mummer_path is the full path to MUMmer.
- raw_reads_path is the full path to a folder containing raw PacBio reads in .bas.h5 and .bax.h5 format.
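Before starting a long run, it can help to confirm that the three required parameters are set and point at existing locations. The following is a minimal sketch, not part of the pipeline: `check_required` is a hypothetical helper, and it assumes config.ini uses section headers that Python's standard INI parser can read. The section names are not known here, so the sketch scans every section.

```python
import os

try:
    from configparser import RawConfigParser  # Python 3
except ImportError:
    from ConfigParser import RawConfigParser  # Python 2

# The three required parameter names, as listed above.
REQUIRED = ("smrtanalysis_path", "mummer_path", "raw_reads_path")


def check_required(config_file):
    """Return a list of problems with the required parameters (hypothetical helper)."""
    parser = RawConfigParser()
    parser.read(config_file)

    # Assumption: the section layout is unknown, so gather options from all sections.
    values = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            values[key] = value

    problems = []
    for name in REQUIRED:
        path = values.get(name, "").strip()
        if not path:
            problems.append("%s is not set" % name)
        elif not os.path.exists(path):
            problems.append("%s does not exist: %s" % (name, path))
    return problems
```

Calling `check_required("config.ini")` returns an empty list when all three paths are set and exist, and otherwise one message per problem.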
Other parameters of possible interest:
- results_path is the directory the results will be stored in.
- tale_seqs_file_whitelisting is the sequence file in tale_seqs/whitelisting that raw reads will be aligned to in order to identify TALE-containing reads. As Xanthomonas TALEs are all highly similar in sequence, the default set from Xoc will identify nearly all reads, even in data sets from other Xanthomonas species.
- tale_seqs_file_export is the file in tale_seqs/exporter containing the known terminal sequences that will be used. Automated TALE sequence extraction from assembled reads relies on identifying conserved TALE N-terminal and C-terminal coding regions.
- tale_seqs_file_export_boundaries is the file in tale_seqs/exporter containing the known boundaries that will be used. RVD sequence determination breaks apart repeat regions and identifies RVDs based on conserved boundary residues.
After the pipeline has finished running, the determined RVD sequences will be at results_path/resequencing/unique_tale_seqs.txt and results_path/combine_resequenced_tals/unique_tale_seqs.txt.
These files should be interpreted as discussed in Booher et al.
If the number of identified TALEs seems low, your library insert size may have been too small to produce a useful number of long reads at the 16 kbp threshold. Run the pipeline again using a lower value for min_seed_read_length, such as 12000 or 10000.
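One way to prepare such a rerun without touching the original configuration is to write a copy of config.ini with the lower value. This is a sketch under assumptions: `write_config_with_seed_length` is a hypothetical helper, it assumes min_seed_read_length is already present in some section of the file, and rewriting the file through Python's INI parser drops any comments it contained.

```python
try:
    from configparser import RawConfigParser  # Python 3
except ImportError:
    from ConfigParser import RawConfigParser  # Python 2


def write_config_with_seed_length(src, dst, new_length):
    """Copy the config at src to dst with min_seed_read_length changed (hypothetical helper)."""
    parser = RawConfigParser()
    parser.read(src)

    # Assumption: the section holding the option is unknown, so check each one.
    updated = False
    for section in parser.sections():
        if parser.has_option(section, "min_seed_read_length"):
            parser.set(section, "min_seed_read_length", str(new_length))
            updated = True
    if not updated:
        raise KeyError("min_seed_read_length not found in %s" % src)

    with open(dst, "w") as handle:
        parser.write(handle)
    return dst
```

For example, `write_config_with_seed_length("config.ini", "config_12k.ini", 12000)` writes the copy, which can then be passed to the pipeline with `python pbx.py config_12k.ini`.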