Explanations to naming conventions #64

Merged
qiyunzhu merged 5 commits into master from naming_conventions on Aug 14, 2017

Conversation

tkosciol
Member

Added an explanation of the naming conventions we use in the project.

@coveralls

Coverage Status

Coverage remained the same at 88.704% when pulling 8344adf on naming_conventions into dd01070 on master.

README.md Outdated
### Filenames
All filenames are in the form: `GenomeID`\_`GeneID`\_`ResiduesFrom`-`ResiduesTo`.
For example, `CP003179.1_3319` means gene `3319` from genome `CP003179.1` (Sulfobacillus acidophilus DSM 10332), or `CP003179.1_3319_1-60` means amino acids 1 to 60 from that gene.

Contributor

Do we use `end-start` (end before start) to indicate a reverse complement?

Member Author

there's no reverse complement for proteins; it's too late for them!

Contributor

It might be worth giving the reader a hint that these files contain protein sequences, not the DNA sequence of the gene as a subset of the genome.

Member Author

noted!
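
For readers following the thread, the filename pattern under discussion (`GenomeID_GeneID_ResiduesFrom-ResiduesTo`, with the residue range optional) can be illustrated with a small parser. This is only a sketch and not code from microprot: the names `SeqName` and `parse_name` are hypothetical, and it assumes that the gene ID contains no underscore and that residue ranges are 1-based with start not after end (consistent with the reply above that there is no reverse complement for proteins).

```python
import re
from typing import NamedTuple, Optional


class SeqName(NamedTuple):
    """Parts of a name such as CP003179.1_3319_1-60 (hypothetical helper type)."""
    genome_id: str
    gene_id: str
    start: Optional[int]  # first residue, 1-based, or None if no range is given
    end: Optional[int]    # last residue, inclusive, or None if no range is given


def parse_name(name: str) -> SeqName:
    """Split GenomeID_GeneID[_From-To] into its parts."""
    start = end = None
    head, sep, tail = name.rpartition('_')
    range_match = re.fullmatch(r'(\d+)-(\d+)', tail)
    if sep and range_match:
        # The last field is a residue range; strip it off before splitting IDs.
        start, end = int(range_match.group(1)), int(range_match.group(2))
        if start < 1 or start > end:
            raise ValueError(f'invalid residue range in {name!r}')
        name = head
    # Assumption: the gene ID is the last underscore-separated field.
    genome_id, sep, gene_id = name.rpartition('_')
    if not sep or not genome_id or not gene_id:
        raise ValueError(f'cannot parse {name!r}')
    return SeqName(genome_id, gene_id, start, end)


print(parse_name('CP003179.1_3319'))
print(parse_name('CP003179.1_3319_1-60'))
```

Splitting from the right keeps the sketch tolerant of genome IDs that themselves contain underscores.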

README.md Outdated
All filenames are in the form: `GenomeID`\_`GeneID`\_`ResiduesFrom`-`ResiduesTo`.
For example, `CP003179.1_3319` means gene `3319` from genome `CP003179.1` (Sulfobacillus acidophilus DSM 10332), or `CP003179.1_3319_1-60` means amino acids 1 to 60 from that gene.

### Extensions
Contributor

"File-Extensions" ?

README.md Outdated
### Extensions

* a3m
An alignment format produced by HH-suite programs. It's a format similar to FASTA, but in sequence rows it contains additional information useful for the construction of HMMs (represented by [a-z]). A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).
Contributor

format -> file ?
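
As background for the `a3m` description in this hunk: in the HH-suite user guide, the lowercase letters in a sequence row are insert states, that is, residues not aligned to a match column. A minimal sketch of what that means in practice follows. It is not taken from microprot or HH-suite, and the function name `a3m_match_columns` is hypothetical.

```python
def a3m_match_columns(row: str) -> str:
    """Drop lowercase insert-state residues from one a3m sequence row.

    What remains (uppercase residues and '-' gaps) is the part of the row
    that is aligned to the match columns.
    """
    return ''.join(c for c in row if not c.islower())


# Toy rows, invented for illustration only.
query = 'MKVQLT'
hit = 'MKV-aLT'  # 'a' is an insertion relative to the query
print(a3m_match_columns(query))
print(a3m_match_columns(hit))
```

After stripping insertions, all rows of an a3m alignment should have the same length, which is what makes the format resemble an aligned FASTA file.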

README.md Outdated
* a3m
An alignment format produced by HH-suite programs. It's a format similar to FASTA, but in sequence rows it contains additional information useful for the construction of HMMs (represented by [a-z]). A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).

* out
Contributor

As far as I remember, there was also some documentation on that. You might want to point to it.
Wasn't that the format with the one-column separator which messed up my parsing? Did they ever respond on that issue?

Member Author

You're right, there is! Documentation section 5. In fairness though, we inferred the exact formatting definitions from the code. :)
About the issue (soedinglab/hh-suite#57): it's still open and I don't think anyone bothered to look into it.

README.md Outdated
HH-suite output files reporting a list of hits for an input sequence, along with Probability, P-value, E-value and other parameters, as well as a set of pair-wise sequence alignments.

* match
Internal `microprot` files showing which sub-sequence of the input sequence matched defined by `config.yml` criteria for any of `E-value`, `P-value`, `Prob` or `minimum sequence length` in the `.out` file. Multiple hits are possible. The file is reported in a FASTA format.
Contributor

consistency: microProt or microprot? italic or not?

Contributor

.out --> maybe rename "a3m" and other captions to ".a3m" to make it more obvious that we are speaking about this strange Windows concept of file extensions.

Member Author

good point!

Contributor

there are still two versions: microprot vs. microprot

README.md Outdated
All sub-sequences longer than the `minimum sequence length` that do not meet the criteria for `match`.

### Example
Gene `example_1` (`example_1.fasta`) with 100 residues is run against HHsearch and it returns 2 outputs: `example_1.out` and `example_1.a3m`. Sequence split parameters are:
Contributor

can we rename to better resemble what is stated under "Filenames", e.g. CP00000.0_1

README.md Outdated
min_prob: 90.0
min_fragment_length: 10
```
and the hit list from `example_1.out` is:
Contributor

Did we define what we mean by "hit list"? I think we should refer to it as "HH-suite output".

Contributor

oh I see, the hit list is only the first part of this file, because the file also holds all the more detailed alignments :-/

Contributor

Nice try, but I think it might still not be precise enough to make it obvious to the reader. What about "..., which are the first couple of lines in the file, before alignments are reported"?

README.md Outdated
So `example_1.match` file will contain sequences:
```
>example_1_10-30
---------EXAMPLEEXAMPLEEXAMPL-----------------------------------------
Contributor

does that mean the sequences contain gaps?

Member Author

No, it simply means that the sequence starts at residue 10, hence 9 "gaps" before.

Contributor

so dashes are part of the file content?

Member Author

@tkosciol tkosciol left a comment

Thanks for your in-depth and insightful review. I will address your comments and make updates.

@coveralls

Coverage Status

Coverage remained the same at 88.704% when pulling 9ed5653 on naming_conventions into dd01070 on master.

Member Author

@tkosciol tkosciol left a comment

some questions. but looks good, thanks!

README.md Outdated
>CP00000.0_1_89-100
----------------------------------------------------------------------
------------------EXAMPLEEXAMP
```
Member Author

@sjanssen2 you're right, we're not reporting gaps explicitly. Do you think I should remove them altogether, or just note that gaps are included here for educational purposes?

Contributor

remove them, because we don't have the complete input sequence anyway, so the reader cannot mentally align the sub-sequences to it
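
The outcome of this thread is that each record carries only the sub-sequence itself, with the residue range encoded in the header instead of in leading dashes. A minimal sketch of that idea follows. It is illustrative only and not microprot's implementation: the function name `subsequence_record` and the toy sequence are hypothetical, and it assumes 1-based, inclusive residue numbering.

```python
def subsequence_record(name: str, sequence: str, start: int, end: int) -> str:
    """Return a FASTA record for residues start..end (1-based, inclusive).

    No gap padding is written; the position is recoverable from the header.
    """
    if not 1 <= start <= end <= len(sequence):
        raise ValueError(f'range {start}-{end} is outside 1-{len(sequence)}')
    return f'>{name}_{start}-{end}\n{sequence[start - 1:end]}\n'


# Toy 10-residue sequence, invented for illustration only.
print(subsequence_record('CP00000.0_1', 'ABCDEFGHIJ', 3, 5), end='')
```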

@coveralls

Coverage Status

Coverage remained the same at 88.704% when pulling a910874 on naming_conventions into dd01070 on master.

@tkosciol
Member Author

@qiyunzhu would you be so kind as to review and merge (if okay)? Thanks!

Contributor

@qiyunzhu qiyunzhu left a comment

Looks good! Since it's only documentation, I don't see any major problem.

>CP00000.0_1_89-100
EXAMPLEEXAMP
```
Sub-sequences `CP00000.0_1_1-9` and `CP00000.0_1_31-33` will be dropped from subsequent analyses, as they did not match `minimum fragment length` criteria.
Contributor

Trivial: maybe "criterion" instead of "criteria"?
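
The arithmetic behind the README example (a 100-residue gene where `1-9` and `31-33` are dropped and `89-100` is kept) can be sketched as follows. This is a hedged illustration, not microprot's code. The function name `split_non_matches` is hypothetical; the hit range `34-88` is an assumption chosen to reproduce the fragments named in the example, since only the `10-30` hit is visible in this thread; and treating a fragment whose length equals `min_fragment_length` as kept is also an assumption.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # 1-based, inclusive residue range


def split_non_matches(length: int, hits: List[Range],
                      min_fragment_length: int) -> Tuple[List[Range], List[Range]]:
    """Return (kept, dropped) ranges of the sequence not covered by any hit."""
    kept: List[Range] = []
    dropped: List[Range] = []

    def classify(start: int, end: int) -> None:
        # Assumption: fragments of exactly min_fragment_length are kept.
        target = kept if end - start + 1 >= min_fragment_length else dropped
        target.append((start, end))

    pos = 1  # first residue not yet covered by a hit
    for start, end in sorted(hits):
        if start > pos:
            classify(pos, start - 1)
        pos = max(pos, end + 1)
    if pos <= length:
        classify(pos, length)
    return kept, dropped


# 10-30 appears in the README example; 34-88 is assumed for illustration.
kept, dropped = split_non_matches(100, [(10, 30), (34, 88)], min_fragment_length=10)
print(kept)     # [(89, 100)]
print(dropped)  # [(1, 9), (31, 33)]
```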

@qiyunzhu qiyunzhu merged commit 03a95a8 into master Aug 14, 2017

4 participants