Explanations to naming conventions
tkosciol committed Jul 25, 2017
1 parent 81102ae commit 8344adf
Showing 1 changed file with 52 additions and 0 deletions.
52 changes: 52 additions & 0 deletions README.md
@@ -22,3 +22,55 @@ Some of the tools and databases we're using were developed externally and cannot
Tools requiring manual installation are listed and linked below:
* [HH-suite 3.0](https://github.com/soedinglab/hh-suite)
* [metaPSICOV](http://bioinfadmin.cs.ucl.ac.uk/downloads/MetaPSICOV/)

## Naming conventions

### Filenames
All filenames are in the form: `GenomeID`\_`GeneID`\_`ResiduesFrom`-`ResiduesTo`. The residue range is omitted when the file refers to the whole gene.
For example, `CP003179.1_3319` means gene `3319` from genome `CP003179.1` (Sulfobacillus acidophilus DSM 10332), and `CP003179.1_3319_1-60` means amino acids 1 to 60 from that gene.
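
A minimal sketch of how such a name could be split back into its parts. The function `parse_name` and the `FragmentName` tuple are hypothetical names used for illustration only; they are not part of `microprot`. The sketch assumes that the gene ID contains no underscore and that residue numbers are plain integers.

```python
import re
from typing import NamedTuple, Optional


class FragmentName(NamedTuple):
    genome_id: str
    gene_id: str
    start: Optional[int]  # None when the name has no residue range
    end: Optional[int]


def parse_name(name: str) -> FragmentName:
    """Split 'GenomeID_GeneID[_From-To]' into its components."""
    start = end = None
    # An optional trailing '_From-To' residue range.
    range_match = re.search(r"_(\d+)-(\d+)$", name)
    if range_match:
        start, end = int(range_match.group(1)), int(range_match.group(2))
        name = name[:range_match.start()]
    # Split on the last underscore so genome IDs may contain underscores.
    genome_id, sep, gene_id = name.rpartition("_")
    if not sep or not genome_id or not gene_id:
        raise ValueError(f"not a GenomeID_GeneID name: {name!r}")
    return FragmentName(genome_id, gene_id, start, end)


print(parse_name("CP003179.1_3319"))
print(parse_name("CP003179.1_3319_1-60"))
```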

### Extensions

* a3m
An alignment format produced by HH-suite programs. It is similar to FASTA, but the sequence rows carry additional information that is useful for constructing HMMs (represented by lowercase letters, [a-z]). A detailed description can be found in the [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).

* out
HH-suite output files reporting a list of hits for an input sequence, along with Probability, P-value, E-value and other parameters, as well as a set of pair-wise sequence alignments.

* match
Internal `microprot` files listing the sub-sequences of the input sequence whose hits in the `.out` file meet the criteria defined in `config.yml` (any of `E-value`, `P-value`, `Prob` or `minimum sequence length`). Multiple hits are possible. The file is written in FASTA format.

* non_match
All sub-sequences longer than the `minimum sequence length` that do not meet the criteria for `match`.
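
To make the `a3m` description above more concrete: as we understand the HH-suite user guide, the lowercase letters in an `a3m` row are insertions relative to the first (query) sequence, so dropping them leaves rows of equal length. The sketch below does just that. The function `a3m_to_match_columns` is a hypothetical helper for illustration, not an HH-suite or `microprot` function.

```python
def a3m_to_match_columns(a3m_text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs with lowercase insert states removed."""
    records = []
    header, chunks = None, []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Keep uppercase residues and gap characters, drop lowercase.
            chunks.append("".join(c for c in line if not c.islower()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">query\nACDEF\n>hit\nAC-EabF\n"
for name, seq in a3m_to_match_columns(example):
    print(name, seq)
```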

### Example
Gene `example_1` (`example_1.fasta`), 100 residues long, is run through HHsearch, which returns two outputs: `example_1.out` and `example_1.a3m`. The sequence split parameters are:
```
min_prob: 90.0
min_fragment_length: 10
```
and the hit list from `example_1.out` is:
```
No Hit                            Prob E-value P-value Score  SS Cols Query HMM Template HMM
 1 1ABC_A Uncharacterized protein 91.5   0.001   0.001  24.3 0.0   20 10-30     211-231 (260)
 2 1BCD_A Uncharacterized protein 90.3   0.001   0.001  26.4 0.0   55 33-88     28-83 (149)
 3 1CDE_A Uncharacterized protein 85.3     0.2   0.001  26.4 0.0   55 43-98     28-83 (149)
```
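
A hit line like the ones above can be read by counting fields from the right, because the hit description may itself contain spaces. This is only a sketch for the simplified, whitespace-separated layout of this example; real HHsearch `.out` files may differ in spacing and would need more careful handling. `Hit` and `parse_hit_line` are hypothetical names, not part of `microprot`.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    number: int
    name: str
    prob: float
    e_value: float
    query_start: int
    query_end: int


def parse_hit_line(line: str) -> Hit:
    """Parse one row of the example hit list."""
    tokens = line.split()
    # Counting from the right: template length, template range, query range,
    # Cols, SS, Score, P-value, E-value, Prob. Everything between the hit
    # number and Prob is the hit name.
    query_start, query_end = (int(x) for x in tokens[-3].split("-"))
    return Hit(
        number=int(tokens[0]),
        name=" ".join(tokens[1:-9]),
        prob=float(tokens[-9]),
        e_value=float(tokens[-8]),
        query_start=query_start,
        query_end=query_end,
    )


line = "1 1ABC_A Uncharacterized protein 91.5 0.001 0.001 24.3 0.0 20 10-30 211-231 (260)"
print(parse_hit_line(line))
```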

According to our criteria, hits 1 and 2 are matches (probability >= 90.0 and fragment length, taken from the `Query HMM` column, >= 10). Hit 3 is rejected because its probability (85.3) is below `min_prob`.
So the `example_1.match` file will contain the sequences:
```
>example_1_10-30
---------EXAMPLEEXAMPLEEXAMPL-----------------------------------------
------------------------------
>example_1_33-88
---------------------------------EXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEX
AMPLEEXAMPLEEXAMPL------------
```
and `example_1.non_match` will contain the sequence:
```
>example_1_89-100
----------------------------------------------------------------------
------------------EXAMPLEEXAMP
```
Sub-sequences `example_1_1-9` and `example_1_31-32` will be dropped from subsequent analyses, as they are shorter than `min_fragment_length`.
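
The splitting logic of this example can be sketched as follows. This is an illustration of the rules described here, not the actual `microprot` implementation; `split_sequence` is a hypothetical name. It assumes 1-based inclusive coordinates, that accepted hits do not overlap, and that the same `>=` length test applies to both matches and non-matches.

```python
def split_sequence(seq_len, hits, min_prob, min_fragment_length):
    """Split a sequence of length seq_len into match and non_match ranges.

    hits is an iterable of (query_start, query_end, prob) tuples.
    """
    # Accepted hits: probable enough and long enough.
    matches = sorted(
        (start, end)
        for start, end, prob in hits
        if prob >= min_prob and end - start + 1 >= min_fragment_length
    )
    non_matches = []
    cursor = 1  # first residue not yet assigned to a match
    for start, end in matches:
        # Gap before this match spans cursor..start-1.
        if start - cursor >= min_fragment_length:
            non_matches.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    # Tail after the last match spans cursor..seq_len.
    if seq_len - cursor + 1 >= min_fragment_length:
        non_matches.append((cursor, seq_len))
    return matches, non_matches


hits = [(10, 30, 91.5), (33, 88, 90.3), (43, 98, 85.3)]
matches, non_matches = split_sequence(100, hits, min_prob=90.0, min_fragment_length=10)
print(matches)      # ranges written to example_1.match
print(non_matches)  # ranges written to example_1.non_match
```

With the values from this example, the sketch keeps ranges 10-30 and 33-88 as matches and 89-100 as the only non-match; the remaining gaps are too short and are discarded.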
