Explanations to naming conventions #64

Merged
merged 5 commits into from
Aug 14, 2017
55 changes: 53 additions & 2 deletions README.md
@@ -2,10 +2,10 @@
[![Build Status](https://travis-ci.org/biocore/microprot.svg?branch=master)](https://travis-ci.org/biocore/microprot)

# microprot
microProt is coded in Python 3.x
*microprot* is coded in Python 3.x

## Introduction
microProt clusters and annotates microbial metagenome sequences for the ultimate goal of predicting the 3-dimensional structure and function of these proteins.
*microprot* clusters and annotates microbial metagenome sequences for the ultimate goal of predicting the 3-dimensional structure and function of these proteins.

## Install

@@ -22,3 +22,54 @@ Some of the tools and databases we're using were developed externally and cannot
Tools requiring manual installation are listed and linked below:
* [HH-suite 3.0](https://github.com/soedinglab/hh-suite)
* [metaPSICOV](http://bioinfadmin.cs.ucl.ac.uk/downloads/MetaPSICOV/)

## Naming conventions

### Filenames
All filenames are in the form: `GenomeID`\_`GeneID`\_`ResiduesFrom`-`ResiduesTo` and contain amino acid sequences.
For example, `CP003179.1_3319` means gene `3319` from genome `CP003179.1` (*Sulfobacillus acidophilus* DSM 10332), and `CP003179.1_3319_1-60` means amino acids 1 to 60 of that gene.
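
As an illustration of the naming scheme, here is a minimal sketch of splitting such a name into its parts. The function `parse_name` and the `SequenceName` type are hypothetical and not part of *microprot*; the sketch assumes the name is passed without a file extension and that the `GeneID` itself contains no underscore.

```python
import re
from typing import NamedTuple, Optional


class SequenceName(NamedTuple):
    """Parts of a GenomeID_GeneID[_From-To] name (hypothetical helper type)."""
    genome_id: str
    gene_id: str
    start: Optional[int]  # first residue, 1-based; None if no range is given
    end: Optional[int]    # last residue; None if no range is given


# Assumption: GeneID has no underscore; GenomeID may contain one.
_NAME_RE = re.compile(
    r'^(?P<genome>.+?)_(?P<gene>[^_]+)(?:_(?P<start>\d+)-(?P<end>\d+))?$'
)


def parse_name(name: str) -> SequenceName:
    """Split a name such as 'CP003179.1_3319_1-60' into its components."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError('not a GenomeID_GeneID[_From-To] name: %r' % name)
    start, end = match.group('start'), match.group('end')
    return SequenceName(
        match.group('genome'),
        match.group('gene'),
        int(start) if start is not None else None,
        int(end) if end is not None else None,
    )


print(parse_name('CP003179.1_3319'))
print(parse_name('CP003179.1_3319_1-60'))
```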

### File extensions

* a3m
An alignment file produced by HH-suite programs. It's a format similar to FASTA, but in sequence rows it contains additional information useful for the construction of HMMs (represented by [a-z]). A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).

* out
HH-suite output files reporting a list of hits for an input sequence, along with Probability, P-value, E-value and other parameters (hit list); as well as a set of pair-wise sequence alignments. A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 5).

* match
Internal *microprot* files listing the sub-sequences of the input sequence that meet the criteria defined in `config.yml` for any of `E-value`, `P-value`, `Prob` or `minimum sequence length` in the `.out` file. Multiple hits are possible. The file is written in FASTA format.

* non_match
All sub-sequences longer than the `minimum sequence length` that do not meet the criteria for `.match`. Internal *microprot* file.
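
To make the `[a-z]` remark about `.a3m` files concrete: in a3m, lowercase letters mark residues inserted relative to the first (query) sequence, so removing them leaves rows of equal length. The sketch below assumes a simple a3m file containing only `>` header lines and sequence lines; `a3m_to_aligned` is a hypothetical helper, not an HH-suite or *microprot* function.

```python
from typing import List, Tuple


def a3m_to_aligned(a3m_text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs with lowercase insert states removed."""
    records = []
    header, chunks = None, []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        else:
            # Keep match states (uppercase) and deletions ('-') only.
            chunks.append(''.join(c for c in line if not c.islower()))
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records


example = ">query\nMKV-LA\n>hit1\nMKaaV-LA\n"
for name, seq in a3m_to_aligned(example):
    print(name, seq)
```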

### Example
Gene `CP00000.0_1` (`CP00000.0_1.fasta`), 100 residues long, is searched with HHsearch, which returns two output files: `CP00000.0_1.out` and `CP00000.0_1.a3m`. The sequence split parameters are:
```
min_prob: 90.0
min_fragment_length: 10
```
and the hit list portion of `CP00000.0_1.out` is:
```
[7 lines of input parameters summary]

 No Hit                             Prob E-value P-value  Score    SS Cols Query HMM  Template HMM
  1 1ABC_A Uncharacterized protein  91.5   0.001   0.001   24.3   0.0   20 10-30      211-231 (260)
  2 1BCD_A Uncharacterized protein  90.3   0.001   0.001   26.4   0.0   55 33-88      28-83 (149)
  3 1CDE_A Uncharacterized protein  85.3     0.2   0.001   26.4   0.0   55 43-98      28-83 (149)
```

According to our criteria, hits 1 and 2 are matches (probability >= 90.0 and fragment length, taken from the `Query HMM` column, >= 10).
So the `CP00000.0_1.match` file will contain the sequences:
```
>CP00000.0_1_10-30
EXAMPLEEXAMPLEEXAMPL
>CP00000.0_1_33-88
EXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPLEEXAMPL
```
and `CP00000.0_1.non_match` will contain the sequence:
```
>CP00000.0_1_89-100
EXAMPLEEXAMP
```
Sub-sequences `CP00000.0_1_1-9` and `CP00000.0_1_31-32` will be dropped from subsequent analyses, as they do not meet the `minimum fragment length` criteria.
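
The splitting described in this example can be sketched as follows. This is an illustration under stated assumptions, not *microprot*'s actual implementation: `split_sequence` is a hypothetical function, hits are given as `(probability, query_start, query_end)` tuples already read from the hit list, residue ranges are treated as 1-based and inclusive, and accepted hits are assumed not to overlap.

```python
from typing import List, Tuple

Hit = Tuple[float, int, int]   # (probability, query start, query end)
Range = Tuple[int, int]        # (first residue, last residue), inclusive


def split_sequence(seq_len: int, hits: List[Hit], min_prob: float,
                   min_fragment_length: int
                   ) -> Tuple[List[Range], List[Range], List[Range]]:
    """Return (matches, non_matches, dropped) residue ranges."""
    # Hits that pass both the probability and the length threshold.
    matches = sorted(
        (start, end) for prob, start, end in hits
        if prob >= min_prob and end - start + 1 >= min_fragment_length
    )
    non_matches, dropped = [], []
    pos = 1  # first residue not yet assigned to any range
    # A sentinel range past the end flushes the trailing gap.
    for start, end in matches + [(seq_len + 1, seq_len + 1)]:
        if start > pos:
            gap = (pos, start - 1)
            if gap[1] - gap[0] + 1 >= min_fragment_length:
                non_matches.append(gap)
            else:
                dropped.append(gap)  # too short to analyse further
        pos = max(pos, end + 1)
    return matches, non_matches, dropped


# The three hits from the example hit list above.
hits = [(91.5, 10, 30), (90.3, 33, 88), (85.3, 43, 98)]
matches, non_matches, dropped = split_sequence(100, hits, 90.0, 10)
print('match:', matches)
print('non_match:', non_matches)
print('dropped:', dropped)
```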

Trivial: maybe "criterion" instead of "criteria"?