Explanations to naming conventions #64

tkosciol · 2017-08-10T22:29:59Z

Added an explanation of the naming conventions we use in the project.

coveralls · 2017-08-10T22:34:06Z

Coverage remained the same at 88.704% when pulling 8344adf on naming_conventions into dd01070 on master.

sjanssen2 · 2017-08-14T19:03:16Z

README.md

+### Filenames
+All filenames are in the form: `GenomeID`\_`GeneID`\_`ResiduesFrom`-`ResiduesTo`.  
+For example, `CP003179.1_3319` means gene `3319` from genome `CP003179.1` (Sulfobacillus acidophilus DSM 10332), or `CP003179.1_3319_1-60` means amino acids 1 to 60 from that gene.
+


do we use end - start to indicate a reverse complement ?

there's no reverse complement for proteins; it's too late for them!

It might be worth giving the reader a hint that these files contain protein sequences, not the DNA sequence of the gene as a subset of the genome.
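
To make the convention from the diff above concrete, here is a minimal sketch of how a name of the form `GenomeID_GeneID_ResiduesFrom-ResiduesTo` could be split into its parts. `parse_name` and `ParsedName` are hypothetical names used only for illustration and are not part of microprot; the sketch also assumes that the residue range is optional and that `ResiduesFrom` never exceeds `ResiduesTo`, since the names describe protein sequences.

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    """Parts of a name such as CP003179.1_3319_1-60 (hypothetical type)."""
    genome_id: str
    gene_id: str
    residues_from: Optional[int]
    residues_to: Optional[int]


# Optional trailing "_<from>-<to>" residue range.
_RANGE = re.compile(r"_(\d+)-(\d+)$")


def parse_name(name: str) -> ParsedName:
    """Split GenomeID_GeneID[_ResiduesFrom-ResiduesTo] into its parts."""
    start = end = None
    match = _RANGE.search(name)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        if start < 1 or start > end:
            # Assumption: residues are 1-based and never reversed.
            raise ValueError(f"invalid residue range in {name!r}")
        name = name[:match.start()]
    # Split on the last underscore so genome IDs may contain underscores.
    genome_id, sep, gene_id = name.rpartition("_")
    if not sep or not genome_id or not gene_id:
        raise ValueError(f"expected GenomeID_GeneID in {name!r}")
    return ParsedName(genome_id, gene_id, start, end)


whole_gene = parse_name("CP003179.1_3319")
fragment = parse_name("CP003179.1_3319_1-60")
```

Splitting on the last underscore is a design choice of this sketch, so that an accession that itself contains an underscore still parses.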

sjanssen2 · 2017-08-14T19:03:43Z

README.md

+All filenames are in the form: `GenomeID`\_`GeneID`\_`ResiduesFrom`-`ResiduesTo`.  
+For example, `CP003179.1_3319` means gene `3319` from genome `CP003179.1` (Sulfobacillus acidophilus DSM 10332), or `CP003179.1_3319_1-60` means amino acids 1 to 60 from that gene.
+
+### Extensions


"File-Extensions" ?

sjanssen2 · 2017-08-14T19:04:03Z

README.md

+### Extensions
+
+* a3m  
+    An alignment format produced by HH-suite programs. It's a format similar to FASTA, but in sequence rows it contains additional information useful for the construction of HMMs (represented by [a-z]). A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).


format -> file ?
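
For readers unfamiliar with a3m, a small sketch may help. It assumes, following the HH-suite user guide, that lowercase letters in a sequence row mark insertions relative to the query, so dropping them leaves the aligned columns. `read_a3m` and `strip_insertions` are hypothetical helper names, not HH-suite or microprot functions, and the two records are made up.

```python
from typing import List, Tuple


def read_a3m(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA-like a3m text into (header, sequence_row) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and an optional leading '#' line
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def strip_insertions(row: str) -> str:
    """Drop lowercase (insertion) characters, keeping aligned columns."""
    return "".join(ch for ch in row if not ch.islower())


# Made-up two-record example: the hit has a two-residue insertion ("aa")
# and one deletion ("-") relative to the query.
example = ">query\nMKVLT\n>hit\nMKaaV-T\n"
records = read_a3m(example)
aligned = [strip_insertions(row) for _, row in records]
```

After stripping insertions, every row has the same length as the query row, which is what makes column-wise processing possible.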

sjanssen2 · 2017-08-14T19:05:38Z

README.md

+* a3m  
+    An alignment format produced by HH-suite programs. It's a format similar to FASTA, but in sequence rows it contains additional information useful for the construction of HMMs (represented by [a-z]). A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).
+
+* out  


as far as I remember, there also was some documentation on that?! You might want to point to it.
Wasn't that the format with the one column separator which messed my parsing up? Did they ever respond on that issue?

You're right, there is: documentation section 5. In fairness though, we inferred the exact formatting definitions from the code. :)
About the issue (soedinglab/hh-suite#57): it's still open, and I don't think anyone has looked into it.

sjanssen2 · 2017-08-14T19:06:12Z

README.md

+    HH-suite output files reporting a list of hits for an input sequence, along with Probability, P-value, E-value and other parameters, as well as a set of pair-wise sequence alignments.
+
+* match  
+    Internal `microprot` files showing which sub-sequence of the input sequence matched defined by `config.yml` criteria for any of `E-value`, `P-value`, `Prob` or `minimum sequence length` in the `.out` file. Multiple hits are possible. The file is reported in a FASTA format.


consistency: microProt or microprot? italic or not?

.out --> maybe rename "a3m" and other captions to ".a3m" to make it more obvious that we are speaking about this strange Windows concept of file extensions.

good point!

there are still two versions: microprot vs. microprot

sjanssen2 · 2017-08-14T19:09:24Z

README.md

+    All sub-sequences longer than the `minimum sequence length` that do not meet the criteria for `match`.
+
+### Example
+Gene `example_1` (`example_1.fasta`) with 100 residues is run against HHsearch and it returns 2 outputs: `example_1.out` and `example_1.a3m`. Sequence split parameters are:


can we rename to better resemble what is stated under "Filenames", e.g. CP00000.0_1

sjanssen2 · 2017-08-14T19:10:42Z

README.md

+min_prob: 90.0
+min_fragment_length: 10
+```
+and the hit list from `example_1.out` is:


Did we define what we mean by "hit list"? I think we should refer to it as "HH-suite output".

oh I see, the hit list is only the first part of this file, because the file also holds all the more detailed alignments :-/

Nice try, but I think it might still not be precise enough to make it obvious to the reader. What about "..., which are the first couple of lines in the file, before alignments are reported"?
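
To illustrate the distinction discussed in this thread, here is a rough sketch that pulls only the hit list out of an `.out` file. It assumes that the summary table starts after a header line beginning with `No Hit` and ends at the first blank line, after which the alignments follow. That layout, the function name `extract_hit_list`, and the sample text are all assumptions for illustration, not something taken from microprot's parser.

```python
from typing import List


def extract_hit_list(out_text: str) -> List[str]:
    """Return the raw hit rows that precede the alignment section."""
    hits = []
    in_table = False
    for line in out_text.splitlines():
        if not in_table:
            # Assumed header of the summary table.
            if line.lstrip().startswith("No Hit"):
                in_table = True
            continue
        if not line.strip():
            break  # first blank line: alignments start after this
        hits.append(line)
    return hits


# Made-up sample in the assumed layout; hit names and values are invented.
sample = "\n".join([
    "Query         CP00000.0_1",
    "Match_columns 100",
    "",
    " No Hit                  Prob E-value P-value  Score",
    "  1 hypothetical_hit_A   99.0 1.0E-20 1.0E-25  100.0",
    "  2 hypothetical_hit_B   95.0 1.0E-10 1.0E-15   50.0",
    "",
    "No 1",
    ">hypothetical_hit_A",
])
hits = extract_hit_list(sample)
```

The rows are returned unsplit on purpose: given the column-separator problem mentioned earlier in this review, splitting on whitespace is not guaranteed to recover the columns reliably.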

sjanssen2 · 2017-08-14T19:13:19Z

README.md

+So `example_1.match` file will contain sequences:
+```
+>example_1_10-30
+---------EXAMPLEEXAMPLEEXAMPL-----------------------------------------


does that mean the sequences contain gaps?

No, it simply means that the sequence starts at residue 10, hence 9 "gaps" before.

so dashes are part of the file content?

tkosciol

Thanks for your in-depth and insightful review. I will address your comments and make updates.

coveralls · 2017-08-14T19:36:16Z

Coverage remained the same at 88.704% when pulling 9ed5653 on naming_conventions into dd01070 on master.

tkosciol

Some questions, but looks good, thanks!

tkosciol · 2017-08-14T19:57:38Z

README.md

+>CP00000.0_1_89-100
+----------------------------------------------------------------------
+------------------EXAMPLEEXAMP
+```


@sjanssen2 you're right, we're not reporting gaps explicitly. Do you think I should remove them altogether, or just note that gaps are included here for educational purposes?

remove them, because we don't have the complete input sequence anyway, so the reader cannot mentally align the sub-sequences to it

coveralls · 2017-08-14T20:09:45Z

Coverage remained the same at 88.704% when pulling a910874 on naming_conventions into dd01070 on master.

tkosciol · 2017-08-14T20:16:46Z

@qiyunzhu would you be so kind and review and merge (if okay)? thanks!

qiyunzhu

Looks good! Since it's just documentation, I don't see any major problem.

qiyunzhu · 2017-08-14T20:28:26Z

README.md

+>CP00000.0_1_89-100
+EXAMPLEEXAMP
+```
+Sub-sequences `CP00000.0_1_1-9` and `CP00000.0_1_31-33` will be dropped from subsequent analyses, as they did not match `minimum fragment length` criteria.


Trivial: maybe "criterion" (singular) instead of "criteria"?
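
The splitting rule in the README example can be sketched as follows. This is a sketch under stated assumptions, not microprot's implementation: the function name `split_unmatched` is hypothetical, the matched range 34-88 is invented so that the unmatched remainder reproduces the sub-sequences named in the README (only 10-30 appears in the diff), and a fragment is kept when its length is at least `min_fragment_length`.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive residue range


def split_unmatched(
    seq_len: int, matches: List[Interval], min_fragment_length: int
) -> Tuple[List[Interval], List[Interval]]:
    """Return (kept, dropped) unmatched ranges of a sequence."""
    kept, dropped = [], []

    def classify(start: int, end: int) -> None:
        length = end - start + 1
        # Assumption: fragments of exactly the minimum length are kept.
        (kept if length >= min_fragment_length else dropped).append((start, end))

    pos = 1  # first residue not yet covered by a match
    for start, end in sorted(matches):
        if start > pos:
            classify(pos, start - 1)
        pos = max(pos, end + 1)
    if pos <= seq_len:
        classify(pos, seq_len)
    return kept, dropped


# 100 residues, min_fragment_length: 10 as in the README example;
# the second match (34-88) is hypothetical.
kept, dropped = split_unmatched(100, [(10, 30), (34, 88)], 10)
names = [f"CP00000.0_1_{start}-{end}" for start, end in dropped]
```

Under these assumptions the dropped ranges are 1-9 and 31-33, matching the sentence in the diff, while 89-100 (12 residues) is long enough to be kept.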

tkosciol added 2 commits July 24, 2017 18:56

Explanations to naming conventions

8344adf

Merge branch 'master' of github.com:biocore/microprot into naming_con…

9155619

…ventions

tkosciol requested review from qiyunzhu and sjanssen2 August 14, 2017 18:44

sjanssen2 reviewed Aug 14, 2017

tkosciol commented Aug 14, 2017

addressing Stefan's comments

9ed5653

tkosciol commented Aug 14, 2017

tkosciol added 2 commits August 14, 2017 13:03

addressing more of Stefan's comments

6cec2b2

final (hopefully) address of Stefan's comments

a910874

sjanssen2 approved these changes Aug 14, 2017

qiyunzhu approved these changes Aug 14, 2017

qiyunzhu reviewed Aug 14, 2017

qiyunzhu merged commit 03a95a8 into master Aug 14, 2017
