Skip to content
forked from groupschoof/AHRD

Moved project from Asis Hallab's Repository to this work group repository.

License

Notifications You must be signed in to change notification settings

asishallab/AHRD-1

 
 

Repository files navigation

Automated Assignment of Human Readable Descriptions (AHRD)

Protein function has often been transferred from characterized proteins to
novel proteins based on sequence similarity, e.g. using the best BLAST hit. To
assign human readable descriptions to predicted proteins we developed a new
program called Automatic assignment of human readable descriptions (AHRD). We
aim to select descriptions that are concise and informative, precise in regard
to function and use standard nomenclature. AHRD scores BLAST hits taken from
searches against different databases on the basis of the trust put into these
databases and the local alignment quality. The BLAST hit descriptions are
tokenized into informative words and a lexical analysis scores these tokens
according to their frequency and the quality of the BLAST hits they occur in.
Shared tokens with Gene-Ontology Annotations increase the description-scoring
in order to use standard nomenclature where possible. Finally the best scoring
description is assigned.

1.1 Requirements:

AHRD is a Java-Program which requires Java 1.5 or higher and ant.

1.2 Installation:

1.2.1 Get AHRD

Copy (clone) AHRD to your computer using git:

git clone git://github.com/groupschoof/AHRD.git

1.2.2 Build the executable jar:

Running

ant dist
will create the executable JAR-File: ./dist/ahrd.jar

2 Usage:

All AHRD-Inputs are passed to AHRD in a single YML-File.
See ./ahrd_example_input.yml for details.
(About YAML-Format see Wikipedia)

Basically AHRD needs a FASTA-File of amino acid sequences and different files
containing the results from the respective BLAST searches, in our example we
searched three databases: Uniprot/trEMBL, Uniprot/Swissprot and TAIR10. Note,
that AHRD is generic and can make use of any number of different Blast
databases that do not necessarily have to be the above ones. If e.g. annotating
genes from a fungal genome searching yeast databases might be more
recommendable than using TAIR (Arabidopsis thaliana).

All parameters can be set manually, or the default ones can be used as given in
the example input file ./test/resources/ahrd_input_test_run.yml (see section
Parameters).

AHRD is recommended to be run on batches of 1,000 to 2,000 proteins. If you
want to annotate a whole genome use the included Batcher to split your
input-data into Batches of appropriate size (see section Batcher).

2.2 AHRD Example run:

java -Xmx2g -jar ./dist/ahrd.jar ahrd_example_input.yml 

or just execute

ant test.run 

2.2.1 AHRD run using BLASTX results:

In order to run AHRD on BLASTX results instead of BLASTP results you have to
modify the following parameters in the ahrd_example_input.yml:

token_score_bit_score_weight: 0.5
token_score_database_score_weight: 0.3
token_score_overlap_score_weight: 0.2

Since the algorithm is based on protein sequences and the BLASTX searches are
based on nucleotide sequence there will be a problem calculating the overlap
score of the blast result. To overcome this problem the
token_score_overlap_score_weight has to be set to 0.0. Therefore the other two
scores have to be raised. These three parameters have to sum up to 1. The
resulting parameter configuration could look like this:

token_score_bit_score_weight: 0.6
token_score_database_score_weight: 0.4
token_score_overlap_score_weight: 0.0

2.3 Recommended BLAST-Search:

For your query proteins you should start independent BLAST searches e.g.
in the three different databases mentioned above:

blastall -p blastp -i proteins.fasta -o swissprot_blastout.pairwise -d swissprot.fasta -e 0.0001 -v 200 -b 200 -m 0

2.4 Batcher:

Start the Batcher with:

mkdir test/resources/batch_ymls 
java -cp ./dist/ahrd.jar ahrd.controller.Batcher ./test/resources/batcher_input_test.yml

You will have to edit ./test/resources/batcher_input_test.yml according to your
needs.

2.5 Output:

AHRD supports two different formats. The default one is a tab-delimited table.
The other is FASTA-Format.

2.5.1 Tab-Delimited Table:

AHRD writes out a CSV table with the following columns:

  1. Protein-Accesion — The Query Protein’s Accession
  2. Blast-Hit-Accession — The Accession of the Protein the assigned description was taken from.
  3. AHRD-Quality-Code — explained below
  4. Human-Readable-Description — The assigned HRD
  5. Interpro-ID (Description) — If AHRD was started with InterProScan-Results, they are appended here.
  6. Gene-Ontology-ID (Name) — If AHRD was started with Gene-Ontology-Annotations, they are appended here.

AHRD’s quality-code consists of a four character string, where each character
is either ‘*’ if the respective criteria is met or ‘-’
otherwise. Their meaning is explained in the following table:

Position Criteria
1 Bit score of the blast result is >50 and e-value is <e-10
2 Overlap of the blast result is >60%
3 Top token score of assigned HRD is >0.5
4 Gene ontology terms found in description line

2.5.2 Fasta-Format:

To set AHRD to write its output in FASTA-Format set the following switch in the input.yml:

output_fasta: true

AHRD will write a valid FASTA-File of your query-proteins where the Header will
be composed of the same parts as above, but here separated by whitespaces.

3 Algorithm:

Based on e-values the 200 best scoring blast results are chosen from each
database-search (e.g. Swissprot, TAIR, trEMBL). For all 600 resulting candidate
description lines a score is calculated using a lexical approach. First each
description line is passed through two regular expression filters. The first
filter discards any matching description line in order to ignore descriptions
like e.g. ‘Whole genome shotgun sequence’, while the second filter tailors the
description lines deleting matching parts, in order to discard e.g. the
trailing Species-Descriptions ’OS=Arabidopsis thaliana […]". In the second
step the scoring each description line is split into single tokens, which are
passed through a blacklist filter, ignoring all matching tokens in terms of
score. Tokens are sequences of characters with a collective meaning. For each
token a score is calculated from three single scores with different weights,
the bit score, the database score and the overlap score. The bit score is
provided within the blast result. The database score is a fixed score for each
blast database, based on the description quality of the database. The overlap
score reflects the overlap of the query and subject sequence. In the second
step the sum of all token scores from a description line is divided by a
correction factor that avoids the scoring system from being biased towards
longer or shorter description lines. In the third step the predicted gene
ontology terms, if available, are used to evaluate the description lines and to
get a better ranking. Therefore only gene ontology terms with a probability
greater than 0.4 are used. From this ranking now the best scoring description
line can be chosen. In the last step a domain name provided by InterProScan
results, if available, is extracted and appended to the best scoring
description line for each uncharacterized protein.

In the end for each uncharacterized protein a description line is selected that
comes from a high-scoring BLAST match, that contains words occurring frequently
in the descriptions of highest scoring BLAST matches and that does not contain
meaningless “fill words”. If available an assigned Interpro domain is appended
to the description line and each line will contain an evaluation section that
reflects the significance of the assigned human readable description.

3.1 Pseudo-Code

  1. Choose 600 best scoring blast results
  2. Filter description lines of above blast-results using regular expressions:
    1. Reject those matched by any regex given in e.g. ./test/resources/blacklist_descline.txt,
    2. Delete those parts of each description line, matching any regex in e.g. ./test/resources/filter_descline_sprot.txt.
  3. Divide each description line into tokens (characters of collective meaning)
    1. In terms of score ignore any tokens matching regexs given e.g. in ./test/resources/blacklist_token.txt.
  4. Token score (calculated from: bitscore, database weight, overlap score)
  5. Lexical score (calculated from: Token score, High score factor, Pattern factor, Correction factor)
  6. Description score (calculated from: Lexical score, GO score, Blast score)
  7. Choose best scoring description line
  8. Append InterProScan description to chosen description line if available

3.2 Used Formulae and Parameters:

3.3 Parameters

Above formulae use the following parameters as given in ./ahrd_example_input.yml

3.3.1 The weights in formula Token-Score are:
token_score_bit_score_weight: 0.5
token_score_database_score_weight: 0.3
token_score_overlap_score_weight: 0.2 
and Blast-Database specific:
weight: 100 
3.3.2 The weight in formula Lexical-Score is:
description_score_relative_description_frequency_weight: 0.6 
3.3.3 The weight in formula Description-Score also is Blast-database specific:
description_score_bit_score_weight: 0.2 

4. Testing:

If you want to run the complete JUnit Test-Suite execute:

ant

5. License

See attached file LICENSE.txt for details.

6. Authors

Kathrin Klee, Asis Hallab and Heiko Schoof

Group “Plant Computational Biology”
Prof. Dr. Heiko Schoof

Max Planck Institute for Plant Breeding Research
Carl-von-Linné-Weg 10
50829 Köln (Cologne)
Germany

INRES Crop Bioinformatics
University of Bonn
Katzenburgweg 2
53115 Bonn
Germany

About

Moved project from Asis Hallab's Repository to this work group repository.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Languages

  • Java 100.0%