biocore · qiyunzhu · Aug 14, 2017 · Jul 25, 2017 · Aug 10, 2017 · Aug 14, 2017
diff --git a/README.md b/README.md
@@ -2,10 +2,10 @@
 [![Build Status](https://travis-ci.org/biocore/microprot.svg?branch=master)](https://travis-ci.org/biocore/microprot)
 
 # microprot
-microProt is coded in Python 3.x
+*microprot* is coded in Python 3.x
 
 ## Introduction
-microProt clusters and annotates microbial metagenome sequences for the ultimate goal of predicting the 3-dimensional structure and function of these proteins.
+*microprot* clusters and annotates microbial metagenome sequences for the ultimate goal of predicting the 3-dimensional structure and function of these proteins.
 
 ## Install
 
@@ -22,3 +22,54 @@ Some of the tools and databases we're using were developed externally and cannot
 Tools requiring manual installation are listed and linked below:
 * [HH-suite 3.0](https://github.com/soedinglab/hh-suite)
 * [metaPSICOV](http://bioinfadmin.cs.ucl.ac.uk/downloads/MetaPSICOV/)
+
+## Naming conventions
+
+### Filenames
+All filenames are in the form: `GenomeID`\_`GeneID`\_`ResiduesFrom`-`ResiduesTo` and contain amino acid sequences.  
+For example, `CP003179.1_3319` means gene `3319` from genome `CP003179.1` (Sulfobacillus acidophilus DSM 10332), or `CP003179.1_3319_1-60` means amino acids 1 to 60 from that gene.
+
+### File extensions
+
+* a3m  
+    An alignment file produced by HH-suite programs. It's a format similar to FASTA, but in sequence rows it contains additional information useful for the construction of HMMs (represented by [a-z]). A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).
+
+* out  
+    HH-suite output files reporting a list of hits for an input sequence, along with Probability, P-value, E-value and other parameters (hit list); as well as a set of pair-wise sequence alignments. A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 5).
+
+* match  
+    Internal *microprot* files showing which sub-sequence of the input sequence matched defined by `config.yml` criteria for any of `E-value`, `P-value`, `Prob` or `minimum sequence length` in the `.out` file. Multiple hits are possible. The file is reported in a FASTA format.
+
+* non_match  
+    All sub-sequences longer than the `minimum sequence length` that do not meet the criteria for `.match`. Internal *microprot* file.
+
+### Example
+Gene `CP00000.0_1` (`CP00000.0_1.fasta`) with 100 residues is run against HHsearch and it returns 2 outputs: `CP00000.0_1.out` and `CP00000.0_1.a3m`. Sequence split parameters are:
+```
+min_prob: 90.0
+min_fragment_length: 10
+```
+and the hit list portion of `CP00000.0_1.out` is:
+```
+[7 lines of input parameters summary]
+
+No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
+ 1 1ABC_A Uncharacterized protein  91.5   0.001   0.001   24.3   0.0   20   10-30    211-231 (260)
+ 2 1BCD_A Uncharacterized protein  90.3   0.001   0.001   26.4   0.0   55   33-88    28-83  (149)
+ 3 1CDE_A Uncharacterized protein  85.3     0.2   0.001   26.4   0.0   55   43-98    28-83  (149)
+```
+
+According to our criteria, hits 1 and 2 are matches (probability >= 90.0 and fragment length (from `Query_HMM`) >= 10).  
+So `CP00000.0_1.match` file will contain sequences:
+```
+>CP00000.0_1_10-30
+EXAMPLEEXAMPLEEXAMPL
+>CP00000.0_1_33-88
+EXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPL
+```
+and `CP00000.0_1.non_match` will contain sequence:
+```
+>CP00000.0_1_89-100
+EXAMPLEEXAMP
+```
+Sub-sequences `CP00000.0_1_1-9` and `CP00000.0_1_31-33` will be dropped from subsequent analyses, as they did not match `minimum fragment length` criteria.