The BaitFisher-package is a software package for designing hybrid enrichment probes. For more information see:
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
CAligner
CFile
Documentation
Example
fast-dynamic-bitset
scoring-matrices
tclap
.DS_Store
CBaitRecord.h
CBlastParser.h
CDistance_matrix.h
CDnaString2.h
CHistogram.h
CRequiredTaxon.h
CSeqNameList.h
CSequence_Mol2_1.h
CSequences2.h
CSplit2.h
CTaxonNamesDictionary.h
Csequence_cluster_and_center_sequence.cpp
Csequence_cluster_and_center_sequence.h
Ctriple.h
DEBUG_STUFF.h
GFF-class.h
GFF-collection.h
LICENSE.GPLv3.txt
LICENSE.md
LICENSE_AND_COPYRIGHT_as_found_in_each_source_file
README.md
VERSION_HISTORY.md
bait-filter.cpp
bait-fisher-helper.cpp
bait-fisher-helper.h
bait-fisher.cpp
basic-DNA-RNA-AA-routines.h
easystring.h
fast-realloc-vector.h
faststring2.h
global-types-and-parameters.cpp
global-types-and-parameters.h
makefile
makefile_win
mydir-unix.cpp
mydir-unix.h
primefactors.h
print_container.h
range_functions.h
simpledna_N0.mat
statistic_functions.h
typedefs.h

README.md

BaitFisher-package: Design DNA hybrid enrichment baits

Table of contents:

About the BaitFisher package

The BaitFisher package consists of two programs: BaitFisher and BaitFilter.

BaitFisher was been designed to construct hybrid enrichment baits from multiple sequence alignments (MSAs) or annotated features in MSAs. The main goal of BaitFisher is to avoid redundancy in the construction of baits by designing fewer baits in conserved regions of the MSAs and designing more baits in variable regions. This makes use of the fact that hybrid enrichment baits can differ to some extends from the target region, which they should capture in the enrichment procedure. By specifying the allowed distance between baits and the sequences in the MSAs the user can control the allowed bait-to-target distance and the degree of reduction in the number of baits that are designed. See the BaitFisher paper for details.

BaitFilter was designed (i) to determine whether baits bind unspecifically to a reference genome, (ii) to filter baits that only have partial length matches to a reference genome, (iii) to determine the optimal bait region in a MSA and to convert baits to a format that can be uploaded at a bait constructing company. The optimal bait region can be the most conserved region in the MSA or the region with the highest number of sequences without gaps or ambiguous nucleotides.

Since performance was one of the major design goals, the BaitFisher and BaitFilter programs are both written in C++.

Software development:
Christoph Mayer, Forschungsmuseum Alexander Koenig, 53113 Bonn, Germany

Main contributors: The main contributors to the development of the BaitFisher software are:
Christoph Mayer (*), Oliver Niehuis (**,*), Manuela Sann (**,*)
(*): Forschungsmuseum Alexander Koenig, 53113 Bonn, Germany
(**): Institut für Biologie I (Zoologie), Albert-Ludwigs-Universität Freiburg, Germany

When using BaitFisher please cite:
Mayer, C., Sann, M., Donath, A., Meixner, M., Podsiadlowski, L., Peters, R.S., Petersen, M., Meusemann, K., Liere, K., Wägele, J.-W., Misof, B., Bleidorn, C., Ohl, M., Niehuis, O., 2016. BaitFisher: A Software Package for Multispecies Target DNA Enrichment Probe Design. Mol. Biol. Evol. doi:10.1093/molbev/msw056

Documentation

A full manual for the BaitFisher package as PDF can be found in the "Documentation" folder. Furthermore, example analyses can be found in the Example folder.

Online help for the BaitFisher and BaitFilter programs can be obtained by called he programs as follows:

PATH/BaitFisher -h
PATH/BaitFileter -h

Compiling and installing BaitFisher and BaitFilter

System requirements:

BaitFisher can be compiled on all platforms, which provide a C++ compiler that includes the following system header files: unistd.h, sys/stat.h, sys/types.h, dirent.h. These header files are standard header files on all unix systems such as Linux and Mac Os X. I also managed to compile BaitFisher and BaitFilter on Windows using the Mingw compiler (www.mingw.org).

Compiling BaitFisher and BaitFilter:

Please, unpack the BaitFisher archive you downloaded.

On the command line enter the BaitFisher-package-master directory. Now type make on the command line. This compiles the BaitFisher as well as the BaitFilter program. The executables of these to programs care called: BaitFisher-v1.2.7 and BaitFilter-v1.0.5. They can be copied on your computer to any place you like. To use these executables they have to be addressed with its full path, or they have to be placed somewhere where in your system path.

Quickstart

More content will me added soon. For details please see the manual in the "Documentation" folder.

Frequently asked questions

  1. BaitFisher does not design baits for all input files or all excised features. What is the reason for this?

    BaitFisher does not design baits for input files or excised features, which cannot host a tiling design of the required length.

  2. How can I determine the input files or excised features for which no baits have been designed?

    There are different reasons and therefore different messages why no baits have been designed for a given input file or excised feature. A full list of the messages written to the screen (i.e. to standard output) and the underlying problem are listed here:

Message:
NOTICE: Skipping alignment since it is too short. Alignment length: the-number
Reason:
In this case the alignment is shorter than the requested tiling design.
Solution:
If the size of the requested tiling design is decreased, it might be possible to design baits for this alignment. Be sure that decreasing the size of the tiling design makes sense.

Message 1:
ALERT: Skipping feature-name feature since its overlap with the transcript is too short. Length of overlap: a-number
Message 2:
ALERT: Skipping feature-name feature since the alignment is too short after removing gap only positions. Length of overlap: a-number
Reason:
The feature, the transcript or only the overlap of the two are too short to host a complete tiling design.
Solution:
If the size of the requested tiling design is decreased, it might be possible to design baits for this feature.

Message:
ALERT: Alignment contains #number features but no single valid bait region. (Details: see above.)
Reason:
This message hints at the avbove messages.

Message:
ALERT: No valid bait region found for alignment:
Reason:
There are two possibilities: If too many gaps appear in effect in every sequence, BaitFisher might not find a single sequence without gaps in length of the requested tiling design. If required taxa are specified, no region might exist in which the required taxa are present in the required length without gaps.
Solution: If the size of the requested tiling design is decreased, it might be possible to design baits for this alignment. If required taxa are removed, it might be possible to design baits for this alignment of it.

Message:
ALERT: No valid bait regions found for feature: name-of-feature
Reason:
There are two possibilities: If too many gaps appear in effect in every sequence of an excised feature, BaitFisher might not find a single sequence without gaps in length of the requested tiling design. If required taxa are specified, no region might exist in which the required taxa are present in the required length without gaps.
Solution: If the size of the requested tiling design is decreased, it might be possible to design baits for this feature. If required taxa are removed, it might be possible to design baits for this features.

Message:
WARNING: For gene " gene-name " no entries where found in the gff file.
While this is possible, in most cases the reason is either a wrong attribute name for the
gene-name-field in the gff-file or the gene name specified in the sequence
name of the transcript alignment is incompatible to the gene names in the gff file.
It is recommended to check whether gene name " gene-name " occurs in the
gff file under the attribute name " the-gff-record-attribute-name-for-geneID ".
Note that the gene name as specified in the sequence name and the gene name in
the gff file have to match exactly. The attribute name specified in the parameter
file must be an exact (case insensitive) match to the attribute name in the gff file.

Reason:
This message can only occur if features are excised from genes/alignments. In the configuration file the user specifies which attribute contains the gene name. A gff file contains lines of features. Each line has nine tab delimited columns. The last column contains the so called attributes. E.g. "Parent=FBtr0081133". If BaitFisher found a gene name in the sequence name in the alignment file, it will search for all features that belong to this gene. The features are identified by the gene name saved in an attribute. If for a given gene name no entry is found in the gff file, this is in most cases due to a nameing problem, or incompatibility of the gene names and the gff file. If no features with the given gene name are found, no baits will be designed for this gene. Solution:
Search for the gene-name given in the error message und verify that it can be found in the gff file. Make sure it can be found under the the-gff-record-attribute-name-for-geneID specified in the configuration file. In most cases this name is "parent".

Message:
WARNING: For gene " gene-name " no entries where found in the gff file that matched the specified feature name. Since entries for this gene name have been found, but with different feature names, the problem could be the feature name or the fact that features exist for this gene. Reason:
Here BaitFisher found entries with gene-name for the given attribute name, but they could not be found for the feaute that was specified.
Solution:
Have a closer look at the lines containing the features you want to excise. If you want to excise CDS features look at lines in the gff file containing this feature and check that the gene name is found under the correct attribute name. Note that different features often have different attribute names for the name of the gene. E.g. the gene or mRNA feature usually does not mention the gene name under the attribute name parent, but under the attribute name ID since these features have no parent features. In contrast, the exon, CDS or intron features are usually assigned to a parent gene and the attribute containing the gene name is often the parent attribute. Check that the given gene-name is found for at least one feature of the given type (e.g. CDS) under the specified attribute. If the gene name does not occur at all double check that the gff file and the genome file are compatible and that the gene name has been determined correctly.

  1. How can I determine the input files or excised features for which baits have been designed?

    The bait-loci.txt file list all bait regions for which baits have been designed. The first column in this file is the alignment file name. The second column is the gene name if applicalble, or again the alignment file name if no features have been excised.

    On Unix system such as Linux or Mac, the unique filenames, gene names, or features can be easily determined on the command line: In order to determine the alignment files for which baits have been designed you can use:
    tail -n +2 loci_baits.txt | cut -f 1 | sort -u

    In order to determine the genes for which baits have been designed you can use:
    tail -n +2 loci_baits.txt | cut -f 2 | sort -u

    In order to determine the feautes for which baits have been designed you can :
    tail -n +2 loci_baits.txt | cut -f 2,3 | sort -u

 

In case of questions not answered here, please contact the author Christoph Mayer.

Last updated: September, 2017 by Christoph Mayer