Automatic annotation of the bHLH gene family in plants

→ bHLH_annotator is also available on the BioInfToolServer

The bHLH_annotator automatically identifies and functionally annotates the bHLH transcription factor family in novel plant sequence data sets. Coding sequences or peptide sequences derived from a de novo genome or transcriptome assembly can be analyzed with this pipeline.

Candidates are annotated with a phylogenetic approach based on a bait collection of bHLHs and outgroup sequences (non-bHLHs with high sequence similarity to bHLHs).

For the identification of initial bHLH candidates (step 1), two search options are available:

BLAST option (default): Candidates are identified based on sequence similarity to the bait collection. This option is recommended if bHLHs that have lost the domain should also be identified.
HMMER option: Candidates that harbour the HMM motif of the bait collection are identified. This includes highly specialised candidates that are not represented in the bait collection.

The initial candidates are filtered based on their phylogenetic relationship to the bHLH and outgroup baits (step 2). Functional annotation is assigned to the candidates by identifying orthologous reference sequences (step 4). By default, annotated A. thaliana bHLHs serve as references. In addition, bHLH-specific characteristics are analyzed: presence of the bHLH domain (step 5), DNA-binding properties (step 5), and subfamily-specific motifs (step 6). A phylogenetic tree is constructed with A. thaliana bHLHs to allow a detailed investigation against the background of a well-studied species (step 7).

For large datasets such as de novo transcriptome assemblies, the collapse option is recommended (steps 8 and 9); it collapses paralogous groups by defining a representative candidate for each group. The parallel option is also recommended to reduce the pipeline runtime and memory consumption during classification.

The data files used in each step can be customised by the user to suit their own research purpose (described below). A more detailed description of the pipeline and the bait collection can be found in the publication listed under Reference.

Setup

Installation in a conda environment

The easiest way to install the pipeline is to create a conda environment from the dedicated environment.yml file. This automatically installs all dependencies.

git clone https://github.com/bpucker/bHLH_annotator
cd bHLH_annotator
conda env create -f environment.yml
conda activate bHLH_annotator

Manual installation of the dependencies

The following dependencies are necessary for the execution of the pipeline:

Python3: sudo apt install python3.11 (other versions are also compatible)
- dendropy: sudo apt install python3-pip && python3 -m pip install -U dendropy
- pandas: pip install pandas
- numpy: pip install numpy
- matplotlib: pip install matplotlib
BLAST: sudo apt install ncbi-blast+
HMMER: conda install -c bioconda hmmer
MAFFT: sudo apt install mafft
Muscle5 (precompiled binaries recommended)
FastTree2: sudo apt-get install -y fasttree
RAxML-NG (precompiled binaries recommended)

The bHLH_annotator can be cloned from GitHub:

git clone https://github.com/bpucker/bHLH_annotator
cd bHLH_annotator

Usage

The pipeline is executed through the following command:

cd <PATH>/bHLH_annotator
python3 bHLH_annotator.py --subject <PATH> --out <OUTPUT> --info <DEFINITION_FILE>

The --subject argument defines the path to the input FASTA file containing coding or peptide sequences. The output directory is defined with the --out argument. In the output directory, a RESULT folder is created that contains the output files of the pipeline steps. The --info argument points to the bHLH_annotator.csv file. This file is required because it defines the input data files used by the pipeline.
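
For scripted or batch runs, the call above can be assembled programmatically. The following is a minimal sketch and not part of the pipeline: the helpers `build_command` and `run` and the example paths are hypothetical; only the flags `--subject`, `--out`, `--info`, `--search`, and `--cpu` are taken from this README.

```python
import subprocess
import sys


def build_command(subject, out, info, search="blast", cpu=4,
                  script="bHLH_annotator.py"):
    """Assemble the argument list for one pipeline run (hypothetical helper)."""
    if search not in ("blast", "hmmer"):
        raise ValueError("search must be 'blast' or 'hmmer'")
    return [
        sys.executable, script,
        "--subject", str(subject),
        "--out", str(out),
        "--info", str(info),
        "--search", search,
        "--cpu", str(cpu),
    ]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


# Placeholder paths; replace them with real files before a real run.
run(build_command("my_peptides.fasta", "results/", "bHLH_annotator.csv"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.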

Optional arguments regarding subject file

Command	Description	Default
`--name <STR>`	Prefix of output file names	--
`--cdsinput`	Changes expected input to CDS	--
`--keepnames`	Prevents splitting of sequence names at first space	--
`--collapse`	Reduces paralogs to one representative	--
`--parallel`	Parallel option for classification	--

Optional arguments for tool adjustments

Command	Description	Default
`--search <STR>`	Search option for the initial search (`blast / hmmer`)	`blast`
`--mode_aln <STR>`	Alignment tool (`muscle / mafft`)	`muscle`
`--mode_tree <STR>`	Tool for tree construction (`fasttree / raxml`)	`fasttree`
`--blastp <STR>`	Path to blastp	`blastp`
`--makeblastdb <STR>`	Path to makeblastdb	`makeblastdb`
`--hmmsearch <STR>`	Path to hmmsearch	`hmmsearch`
`--mafft <STR>`	Path to MAFFT	`mafft`
`--muscle <STR>`	Path to muscle	`muscle`
`--fasttree <STR>`	Path to FastTree	`fasttree`
`--raxml <STR>`	Path to RAxML	`raxml-ng`

Optional arguments to adjust the candidate search and classification

Command	Description	Default
`--bitcutp <INT>`	BLASTp bitscore cutoff	`60`
`--simcutp <INT>`	BLASTp similarity cutoff	`40.0`
`--poscutp <INT>`	Max number of BLASTp hits per bait	`100`
`--lencutp <INT>`	Min BLASTp alignment length	`80`
`--filterdomain`	Filter candidates not matching the HMM motif of the bait collection	--
`--minscore <FLOAT>`	Minimal score to be considered as ingroup	`0.5`
`--numneighbours <INT>`	Neighbours to consider for classification	`10`
`--neighbourdist <FLOAT>`	X*average nearest neighbour distance is used as minimal distance cutoff to be considered as a neighbour	`5`
`--minneighbours <INT>`	Minimal number of bait neighbours to be considered as ingroup	`0`
`--paralogdist <FLOAT>`	X*average nearest neighbour distance is used as cutoff to identify paralogs	`10.0`
`--numprocesscandidates <INT>`	Number of candidates processed at the same time in the parallel option	`200`
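
The BLASTp cutoffs in the table can be read as a filter over BLAST hits. The sketch below illustrates that reading only; it is not the pipeline's implementation. It assumes that hits are already parsed into dictionaries, that the cutoffs are inclusive lower bounds, and that hits are ranked by bit score when the per-bait limit applies. The function name and the field names are hypothetical.

```python
from collections import defaultdict


def filter_hits(hits, bitcut=60, simcut=40.0, lencut=80, poscut=100):
    """Keep hits passing all cutoffs, at most `poscut` best hits per bait.

    Each hit is a dict with the (assumed) keys:
    'bait', 'candidate', 'bitscore', 'similarity', 'length'.
    """
    per_bait = defaultdict(list)
    for hit in hits:
        if (hit["bitscore"] >= bitcut
                and hit["similarity"] >= simcut
                and hit["length"] >= lencut):
            per_bait[hit["bait"]].append(hit)

    candidates = set()
    for bait_hits in per_bait.values():
        # Rank by bit score (assumption) and keep the top `poscut` hits.
        bait_hits.sort(key=lambda h: h["bitscore"], reverse=True)
        for hit in bait_hits[:poscut]:
            candidates.add(hit["candidate"])
    return candidates


hits = [
    {"bait": "b1", "candidate": "c1", "bitscore": 120, "similarity": 55.0, "length": 150},
    {"bait": "b1", "candidate": "c2", "bitscore": 45, "similarity": 60.0, "length": 200},
    {"bait": "b2", "candidate": "c3", "bitscore": 90, "similarity": 35.0, "length": 100},
]
print(sorted(filter_hits(hits)))  # ['c1']
```

In this toy input, c2 fails the bit score cutoff and c3 fails the similarity cutoff, so only c1 remains.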

Optional arguments regarding performance

Command	Description	Default
`--cpu <INT>`	Number of threads	`4`
`--cpumax <INT>`	Maximal number of threads for classification (step 2)	value of `--cpu`
`--cpub <INT>`	Number of threads for BLASTp search (step 1)	value of `--cpu`
`--cpur <INT>`	Number of threads for alignment/tree construction	value of `--cpu`
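
The fallback of the step-specific thread options to the value of `--cpu` can be expressed with argparse as shown below. This is an illustrative sketch, not the pipeline's actual argument parser; the function name `parse_threads` is hypothetical.

```python
import argparse


def parse_threads(argv):
    """Parse thread options; step-specific values fall back to --cpu."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cpu", type=int, default=4)
    parser.add_argument("--cpumax", type=int, default=None)
    parser.add_argument("--cpub", type=int, default=None)
    parser.add_argument("--cpur", type=int, default=None)
    args = parser.parse_args(argv)
    for name in ("cpumax", "cpub", "cpur"):
        # An unset option inherits the general thread count.
        if getattr(args, name) is None:
            setattr(args, name, args.cpu)
    return args


args = parse_threads(["--cpu", "8", "--cpub", "2"])
print(args.cpu, args.cpumax, args.cpub, args.cpur)  # 8 8 2 8
```

Using `None` as the parser default makes it possible to tell "not given" apart from an explicit value that happens to equal the `--cpu` setting.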

Adjustment of input data files

The data input files required as resources for the steps of the pipeline are defined in the bHLH_annotator.csv file. The default files are stored in the data folder. The following files are required as resources:

Argument	File	Description
`--baits <PATH>`	Baits	Contains the bHLH and outgroup sequences of the bait collection (mandatory)
`--baitsinfo <PATH>`	BaitsInfo	Info file defining each bait as bHLH or outgroup sequence (mandatory)
`--optimisedbaits <PATH>`	OptimisedBaits	Optimised bait collection containing only phylogenetically distinct baits that are used for tree construction (recommended)
`--reference <PATH>`	Reference	References with alternative name, functional annotation and subfamily (step 4 and 9)
`--hmm <PATH>`	HMM	HMM motif representing the bHLH domain (step 1 for HMMER search, step 5, and `--filterdomain` option)
`--motifs <PATH>`	Motifs	HMM motifs of subfamily specific motifs (step 5)
`--ath <PATH>`	Ath	A. thaliana sequences used for tree construction (step 7 and 9)

The defined files must meet the following requirements:

Baits and OptimisedBaits: All baits, including optimised baits, must be categorized in the BaitsInfo file as ingroup or outgroup sequences.
BaitsInfo: All sequences categorized in the BaitsInfo file must be included in the Baits file.
Reference: The sequences of the specified references must be included in the Ath sequence file or the bait collection.
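
The first two requirements can be checked before a run. The following is a minimal sketch, assuming that sequence IDs are the FASTA header text up to the first whitespace and that the IDs from the BaitsInfo file have already been read into a set (the layout of that file is not specified here, so no parser is shown). All function names and sequence IDs are hypothetical.

```python
def fasta_ids(lines):
    """Collect sequence IDs (header text up to the first whitespace)."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def check_baits(bait_ids, optimised_ids, info_ids):
    """Return a list of problems; an empty list means the files agree."""
    problems = []
    for seq_id in sorted((bait_ids | optimised_ids) - info_ids):
        problems.append(f"{seq_id}: bait is not categorized in BaitsInfo")
    for seq_id in sorted(info_ids - bait_ids):
        problems.append(f"{seq_id}: listed in BaitsInfo but missing from Baits")
    return problems


# Toy data with made-up IDs; real input would come from the files on disk.
baits = fasta_ids([">baitA some description", "MNIG", ">outB", "MKKL"])
optimised = fasta_ids([">baitA", "MNIG"])
info = {"baitA", "outC"}
for problem in check_baits(baits, optimised, info):
    print(problem)
```

To check real files, an open file handle can be passed to `fasta_ids`, since it only iterates over lines.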

If the optional files are not defined in the bHLH_annotator.csv file or via argument, the dependent pipeline steps are skipped. If a file is defined both in the csv file and via argument, the argument takes priority. The referenced files must be placed in the 'data' folder or specified with complete paths.
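
The precedence described above (argument over csv entry, file names looked up in the 'data' folder unless a complete path is given) could be resolved as follows. This is a hedged sketch rather than the pipeline's own logic: `resolve_resource` and its parameters are hypothetical, the file names are placeholders, and treating every non-absolute path as relative to the data folder is an assumption.

```python
from pathlib import Path


def resolve_resource(cli_value, csv_value, data_dir="data"):
    """Prefer the argument over the csv entry; return None if neither is set."""
    value = cli_value if cli_value else csv_value
    if not value:
        return None  # undefined optional resource: dependent steps are skipped
    path = Path(value)
    if path.is_absolute():
        return path  # complete paths are used as given
    return Path(data_dir) / path  # other names are looked up in the data folder


# Placeholder file names for illustration only.
print(resolve_resource(None, "baits.fasta"))
print(resolve_resource("custom_baits.fasta", "baits.fasta"))
print(resolve_resource(None, ""))
```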

Requirements

Python, dendropy, pandas, numpy, matplotlib, BLAST+, HMMER, MAFFT or MUSCLE5, FastTree2 or RAxML

Reference

Thoben C. and Pucker B. (2023). Automatic annotation of the bHLH gene family in plants. BMC Genomics 24, 780. doi: 10.1186/s12864-023-09877-2.
