addressing Stefan's comments
tkosciol committed Aug 14, 2017
1 parent 9155619 commit 9ed5653
Showing 1 changed file with 14 additions and 14 deletions.
28 changes: 14 additions & 14 deletions README.md
@@ -2,10 +2,10 @@
[![Build Status](https://travis-ci.org/biocore/microprot.svg?branch=master)](https://travis-ci.org/biocore/microprot)

# microprot
microProt is coded in Python 3.x
microprot is coded in Python 3.x

## Introduction
microProt clusters and annotates microbial metagenome sequences for the ultimate goal of predicting the 3-dimensional structure and function of these proteins.
microprot clusters and annotates microbial metagenome sequences for the ultimate goal of predicting the 3-dimensional structure and function of these proteins.

## Install

@@ -29,27 +29,27 @@ Tools requiring manual installation are listed and linked below:
All filenames are in the form: `GenomeID`\_`GeneID`\_`ResiduesFrom`-`ResiduesTo`.
For example, `CP003179.1_3319` means gene `3319` from genome `CP003179.1` (Sulfobacillus acidophilus DSM 10332), or `CP003179.1_3319_1-60` means amino acids 1 to 60 from that gene.
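
The naming convention above can be unpacked programmatically. The following is a minimal sketch, not part of `microprot`: the helper `parse_name` and the `SequenceName` tuple are hypothetical names, and the code assumes that the gene ID never contains an underscore (the genome ID may) and that the residue range, when present, is the final `From-To` component.

```python
import re
from typing import NamedTuple, Optional


class SequenceName(NamedTuple):
    """Components of a GenomeID_GeneID[_From-To] name (hypothetical helper type)."""
    genome_id: str
    gene_id: str
    start: Optional[int]
    end: Optional[int]


def parse_name(name: str) -> SequenceName:
    """Split a name such as 'CP003179.1_3319_1-60' into its parts.

    Assumption: the gene ID contains no underscore, so the last underscore
    of the base name separates the genome ID from the gene ID.
    """
    # Peel off the optional trailing residue range first.
    range_match = re.fullmatch(r"(.+)_(\d+)-(\d+)", name)
    if range_match:
        base = range_match.group(1)
        start, end = int(range_match.group(2)), int(range_match.group(3))
    else:
        base, start, end = name, None, None

    genome_id, sep, gene_id = base.rpartition("_")
    if not sep or not genome_id or not gene_id:
        raise ValueError(f"not a GenomeID_GeneID name: {name!r}")
    return SequenceName(genome_id, gene_id, start, end)


print(parse_name("CP003179.1_3319"))
print(parse_name("CP003179.1_3319_1-60"))
```

Stripping the range before splitting on the last underscore keeps genome IDs that themselves contain underscores intact.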

### Extensions
### File extensions

* a3m
An alignment format produced by HH-suite programs. It's a format similar to FASTA, but in sequence rows it contains additional information useful for the construction of HMMs (represented by [a-z]). A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).
An alignment file produced by HH-suite programs. It's a format similar to FASTA, but in sequence rows it contains additional information useful for the construction of HMMs (represented by [a-z]). A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).

* out
HH-suite output files reporting a list of hits for an input sequence, along with Probability, P-value, E-value and other parameters, as well as a set of pair-wise sequence alignments.
HH-suite output files reporting a list of hits for an input sequence, along with Probability, P-value, E-value and other parameters (hit list); as well as a set of pair-wise sequence alignments. A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 5).

* match
Internal `microprot` files listing the sub-sequences of the input sequence that meet the criteria defined in `config.yml` for any of `E-value`, `P-value`, `Prob` or `minimum sequence length` in the `.out` file. Multiple hits are possible. The file is written in FASTA format.

* non_match
All sub-sequences longer than the `minimum sequence length` that do not meet the criteria for `match`.
All sub-sequences longer than the `minimum sequence length` that do not meet the criteria for `.match`. Internal `microprot` file.
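
As an illustration of the `a3m` description above: the lowercase `[a-z]` characters mark residues inserted relative to the query, so removing them (together with any `.` padding) leaves rows of equal length. The sketch below is not part of `microprot` or HH-suite; `read_a3m_match_columns` is a hypothetical helper, and it assumes headers start with `>` and that lines starting with `#` are comments.

```python
from typing import List, Tuple


def read_a3m_match_columns(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs with insert-state characters removed.

    Lowercase letters and '.' are dropped; uppercase residues and '-'
    (deletions) are kept, so every returned row has the same length.
    """
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            # Keep only match-state columns of this sequence line.
            chunks.append("".join(c for c in line if not (c.islower() or c == ".")))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>query
ACDE
>hit_with_insertion
AcC-E
"""
for name, row in read_a3m_match_columns(example):
    print(name, row)
```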

### Example
Gene `example_1` (`example_1.fasta`) with 100 residues is run against HHsearch and it returns 2 outputs: `example_1.out` and `example_1.a3m`. Sequence split parameters are:
Gene `CP00000.0_1` (`CP00000.0_1.fasta`) with 100 residues is run against HHsearch and it returns 2 outputs: `CP00000.0_1.out` and `CP00000.0_1.a3m`. Sequence split parameters are:
```
min_prob: 90.0
min_fragment_length: 10
```
and the hit list from `example_1.out` is:
and the hit list portion of `CP00000.0_1.out` is:
```
No Hit Prob E-value P-value Score SS Cols Query HMM Template HMM
1 1ABC_A Uncharacterized protein 91.5 0.001 0.001 24.3 0.0 20 10-30 211-231 (260)
@@ -58,19 +58,19 @@ No Hit Prob E-value P-value Score SS Cols Query
```

According to our criteria, hits 1 and 2 are matches (probability >= 90.0 and fragment length (from `Query_HMM`) >= 10).
So `example_1.match` file will contain sequences:
So `CP00000.0_1.match` file will contain sequences:
```
>example_1_10-30
>CP00000.0_1_10-30
---------EXAMPLEEXAMPLEEXAMPL-----------------------------------------
------------------------------
>example_1_33-88
>CP00000.0_1_33-88
---------------------------------EXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEX
AMPLEEXAMPLEEXAMPL------------
```
and `example_1.non_match` will contain sequence:
and `CP00000.0_1.non_match` will contain sequence:
```
>example_1_89-100
>CP00000.0_1_89-100
----------------------------------------------------------------------
------------------EXAMPLEEXAMP
```
Sub-sequences `example_1_1-9` and `example_1_31-33` will be dropped from subsequent analyses, as they did not match `minimum fragment length` criteria.
Sub-sequences `CP00000.0_1_1-9` and `CP00000.0_1_31-32` will be dropped from subsequent analyses, as they are shorter than the `minimum fragment length`.
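
The splitting rule from this example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the `microprot` implementation: `split_sequence` is a hypothetical function, fragment length is taken as `end - start + 1`, the threshold is applied as `>=` for both matches and non-matches, and hits are assumed to be given as `(probability, query_start, query_end)`. Because the row for hit 2 is not shown in the hit list above, its range is taken from the `.match` header and its probability is a placeholder.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # 1-based, inclusive residue range


def split_sequence(
    seq_len: int,
    hits: List[Tuple[float, int, int]],
    min_prob: float,
    min_fragment_length: int,
) -> Tuple[List[Range], List[Range], List[Range]]:
    """Return (matches, non_matches, dropped) residue ranges."""
    # Hits that pass both the probability and the length threshold.
    matches = sorted(
        (start, end)
        for prob, start, end in hits
        if prob >= min_prob and end - start + 1 >= min_fragment_length
    )

    non_matches: List[Range] = []
    dropped: List[Range] = []
    cursor = 1  # first residue not yet assigned to any fragment
    # A sentinel range past the end flushes the trailing gap.
    for start, end in matches + [(seq_len + 1, seq_len)]:
        if start > cursor:
            gap = (cursor, start - 1)
            if gap[1] - gap[0] + 1 >= min_fragment_length:
                non_matches.append(gap)
            else:
                dropped.append(gap)
        cursor = max(cursor, end + 1)
    return matches, non_matches, dropped


hits = [
    (91.5, 10, 30),  # hit 1, as listed in the hit list
    (95.0, 33, 88),  # hit 2: range from the .match header, probability is a placeholder
]
matches, non_matches, dropped = split_sequence(100, hits, min_prob=90.0, min_fragment_length=10)
print(matches)      # [(10, 30), (33, 88)]
print(non_matches)  # [(89, 100)]
print(dropped)      # [(1, 9), (31, 32)]
```

Walking a cursor through the sorted matches means each gap is classified exactly once, and overlapping hits do not produce negative-length gaps.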
