Explanations to naming conventions

biocore · Jul 25, 2017 · 8344adf
1 parent 81102ae
commit 8344adf
Showing 1 changed file with 52 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -22,3 +22,55 @@ Some of the tools and databases we're using were developed externally and cannot
 Tools requiring manual installation are listed and linked below:
 * [HH-suite 3.0](https://github.com/soedinglab/hh-suite)
 * [metaPSICOV](http://bioinfadmin.cs.ucl.ac.uk/downloads/MetaPSICOV/)
+
+## Naming conventions
+
+### Filenames
+All filenames have the form `GenomeID`\_`GeneID`\_`ResiduesFrom`-`ResiduesTo`; the residue range is omitted when the file covers the whole gene.  
+For example, `CP003179.1_3319` refers to gene `3319` from genome `CP003179.1` (Sulfobacillus acidophilus DSM 10332), and `CP003179.1_3319_1-60` refers to amino acids 1 to 60 of that gene.
+
+### Extensions
+
+* a3m  
+    An alignment format produced by HH-suite programs. It is similar to FASTA, but its sequence rows carry additional information used to build HMMs (represented by lowercase letters, `[a-z]`). A detailed description can be found in the [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).
+
+* out  
+    HH-suite output files reporting a list of hits for an input sequence, along with Probability, P-value, E-value and other parameters, as well as a set of pair-wise sequence alignments.
+
+* match  
+    Internal `microprot` files listing the sub-sequences of the input sequence whose hits in the `.out` file meet the criteria defined in `config.yml` (any of `E-value`, `P-value`, `Prob` or `minimum sequence length`). Multiple hits are possible. The file is written in FASTA format.
+
+* non_match  
+    All sub-sequences longer than the `minimum sequence length` that do not meet the criteria for `match`.
+
+### Example
+Gene `example_1` (`example_1.fasta`), 100 residues long, is searched with HHsearch, which produces two output files: `example_1.out` and `example_1.a3m`. The sequence split parameters are:
+```
+min_prob: 90.0
+min_fragment_length: 10
+```
+and the hit list from `example_1.out` is:
+```
+No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
+ 1 1ABC_A Uncharacterized protein  91.5   0.001   0.001   24.3   0.0   20   10-30    211-231 (260)
+ 2 1BCD_A Uncharacterized protein  90.3   0.001   0.001   26.4   0.0   55   33-88    28-83  (149)
+ 3 1CDE_A Uncharacterized protein  85.3     0.2   0.001   26.4   0.0   55   43-98    28-83  (149)
+```
+
+According to our criteria, hits 1 and 2 are matches (probability >= 90.0 and fragment length, taken from the `Query HMM` column, >= 10). Hit 3 is rejected because its probability (85.3) is below `min_prob`.  
+So the `example_1.match` file will contain the sequences:
+```
+>example_1_10-30
+---------EXAMPLEEXAMPLEEXAMPL-----------------------------------------
+------------------------------
+>example_1_33-88
+---------------------------------EXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEX
+AMPLEEXAMPLEEXAMPL------------
+```
+and `example_1.non_match` will contain the sequence:
+```
+>example_1_89-100
+----------------------------------------------------------------------
+------------------EXAMPLEEXAMP
+```
+Sub-sequences `example_1_1-9` and `example_1_31-32` will be dropped from subsequent analyses, as they are shorter than the `minimum fragment length` (10 residues).
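
The splitting rule from the example above can be sketched in Python. This is an illustration only, not `microprot`'s actual implementation: the function name `split_sequence`, its arguments, and the `(prob, start, end)` hit tuples are hypothetical, and overlapping matches are assumed to be reported as they are rather than merged.

```python
def split_sequence(seq_length, hits, min_prob, min_fragment_length):
    """Split a query into match, non_match and dropped residue ranges.

    All ranges are 1-based and inclusive, as in the `Query HMM` column.
    `hits` is a list of hypothetical (prob, start, end) tuples.
    """
    # A hit is a match if it is probable enough and long enough.
    matches = sorted(
        (start, end)
        for prob, start, end in hits
        if prob >= min_prob and end - start + 1 >= min_fragment_length
    )

    non_matches, dropped = [], []
    cursor = 1  # first residue not yet covered by a match
    # A sentinel range just past the sequence end closes the final gap.
    for start, end in matches + [(seq_length + 1, seq_length + 1)]:
        if start > cursor:
            gap = (cursor, start - 1)
            gap_length = gap[1] - gap[0] + 1
            # Gaps that are long enough become non_match; the rest are dropped.
            if gap_length >= min_fragment_length:
                non_matches.append(gap)
            else:
                dropped.append(gap)
        cursor = max(cursor, end + 1)
    return matches, non_matches, dropped


# Hit list from example_1.out: (Prob, Query HMM start, Query HMM end)
hits = [(91.5, 10, 30), (90.3, 33, 88), (85.3, 43, 98)]
matches, non_matches, dropped = split_sequence(
    seq_length=100, hits=hits, min_prob=90.0, min_fragment_length=10
)
print(matches)      # [(10, 30), (33, 88)]
print(non_matches)  # [(89, 100)]
```

The third returned list, `dropped`, holds the uncovered ranges that are shorter than `min_fragment_length` and are therefore excluded from subsequent analyses.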