Explanations to naming conventions #64
README.md
Outdated
### Filenames
All filenames are in the form: `GenomeID`\_`GeneID`\_`ResiduesFrom`-`ResiduesTo`.
For example, `CP003179.1_3319` means gene `3319` from genome `CP003179.1` (Sulfobacillus acidophilus DSM 10332), or `CP003179.1_3319_1-60` means amino acids 1 to 60 from that gene.
do we use end - start to indicate a reverse complement ?
there's no reverse complement for proteins; it's too late for them!
It might be worth giving the reader a hint that these files contain protein sequences, not DNA sequences of the gene as a subset of the genome.
noted!
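For illustration, here is a minimal sketch of how names following the `GenomeID_GeneID_ResiduesFrom-ResiduesTo` convention quoted above could be parsed. It is not taken from the project: `parse_name` and `ParsedName` are hypothetical names, and the sketch assumes that the file extension has already been stripped, that `GeneID` never contains an underscore, and that residue ranges are always written start-to-end (consistent with the remark that proteins have no reverse complement).

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    """Hypothetical container for the parts of a name (not a microprot type)."""
    genome_id: str
    gene_id: str
    start: Optional[int]  # 1-based first residue, None if no range is given
    end: Optional[int]    # 1-based last residue (inclusive), None if no range


# Assumption: GeneID has no underscore; GenomeID may contain dots or underscores.
_NAME_RE = re.compile(
    r"(?P<genome>.+?)_(?P<gene>[^_]+?)(?:_(?P<start>\d+)-(?P<end>\d+))?"
)


def parse_name(name: str) -> ParsedName:
    """Split a name (without extension) into genome, gene and residue range."""
    m = _NAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"name does not follow the convention: {name!r}")
    if m.group("start") is None:
        return ParsedName(m.group("genome"), m.group("gene"), None, None)
    start, end = int(m.group("start")), int(m.group("end"))
    if start > end:
        # Protein fragments are never reversed, so end < start is an error here.
        raise ValueError(f"residue range is reversed: {name!r}")
    return ParsedName(m.group("genome"), m.group("gene"), start, end)
```

With these assumptions, `parse_name("CP003179.1_3319_1-60")` yields genome `CP003179.1`, gene `3319` and the range 1 to 60, while `parse_name("CP003179.1_3319")` yields the same genome and gene with no range.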
README.md
Outdated
### Extensions
"File-Extensions" ?
README.md
Outdated
### Extensions

* a3m
An alignment format produced by HH-suite programs. It's a format similar to FASTA, but in sequence rows it contains additional information useful for the construction of HMMs (represented by [a-z]). A detailed description can be found in [HH-suite user guide](https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf) (section 6.1).
format -> file ?
README.md
Outdated
* out
as far as I remember, there also was some documentation on that?! You might want to point to it.
Wasn't that the format with the one column separator which messed my parsing up? Did they ever respond on that issue?
you're right - there is! Documentation section 5. In fairness though, we inferred the exact formatting definitions from code.:)
about the issue (soedinglab/hh-suite#57), it's still open and I don't think anyone bothered to look into it.
README.md
Outdated
HH-suite output files reporting a list of hits for an input sequence, along with Probability, P-value, E-value and other parameters, as well as a set of pair-wise sequence alignments.

* match
Internal `microprot` files showing which sub-sequence of the input sequence matched defined by `config.yml` criteria for any of `E-value`, `P-value`, `Prob` or `minimum sequence length` in the `.out` file. Multiple hits are possible. The file is reported in a FASTA format.
consistency: microProt or microprot? italic or not?
.out --> maybe rename "a3m" and other captions to ".a3m" to make it more obvious that we are speaking about this strange Windows concept of file extensions.
good point!
there are still two versions: microprot vs. microprot
README.md
Outdated
All sub-sequences longer than the `minimum sequence length` that do not meet the criteria for `match`.

### Example
Gene `example_1` (`example_1.fasta`) with 100 residues is run against HHsearch and it returns 2 outputs: `example_1.out` and `example_1.a3m`. Sequence split parameters are:
can we rename to better resemble what is stated under "Filenames", e.g. CP00000.0_1
README.md
Outdated
min_prob: 90.0
min_fragment_length: 10
```
and the hit list from `example_1.out` is:
Did we define what we mean by "hit list"? I think we should refer to it as "HH-suite output".
oh I see, the hit list is only the first part of this file, because the file also holds all the more detailed alignments :-/
Nice try, but I think it might still not be precise enough to make it obvious to the reader. What about "..., which are the first couple of lines in the file, before alignments are reported"?
README.md
Outdated
So `example_1.match` file will contain sequences:
```
>example_1_10-30
---------EXAMPLEEXAMPLEEXAMPL-----------------------------------------
does that mean the sequences contain gaps?
No, it simply means that the sequence starts at residue 10, hence 9 "gaps" before.
so dashes are part of the file content?
Thanks for your in-depth and insightful review. I will address your comments and make updates.
some questions. but looks good, thanks!
README.md
Outdated
>CP00000.0_1_89-100
----------------------------------------------------------------------
------------------EXAMPLEEXAMP
```
@sjanssen2 you're right, we're not reporting gaps explicitly. Do you think I should remove them altogether, or just note that gaps are included here for educational purposes?
remove them, because we don't have the complete input sequence anyway, so the reader cannot mentally align the sub-sequences to it
@qiyunzhu would you be so kind and review and merge (if okay)? thanks!
Looks good! Since it's but documentation, I don't see any major problem.
>CP00000.0_1_89-100
EXAMPLEEXAMP
```
Sub-sequences `CP00000.0_1_1-9` and `CP00000.0_1_31-33` will be dropped from subsequent analyses, as they did not match `minimum fragment length` criteria. |
Trivial: maybe "criterion" instead of "criteria"?
added explanation to naming conventions used by us in the project.
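
The fragment bookkeeping discussed in this review (matched ranges are kept without gap padding, and leftovers shorter than `min_fragment_length` are dropped) could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, the three matched ranges in the usage note are invented for the demo, ranges are treated as 1-based and inclusive, and a leftover is assumed to be kept when its length is at least `min_fragment_length`.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (first residue, last residue), 1-based and inclusive


def leftover_fragments(
    seq_len: int, matches: List[Range], min_fragment_length: int
) -> Tuple[List[Range], List[Range]]:
    """Return (kept, dropped) ranges that are not covered by any match.

    Assumption: a leftover is kept when its length >= min_fragment_length.
    """
    kept: List[Range] = []
    dropped: List[Range] = []

    def classify(first: int, last: int) -> None:
        length = last - first + 1
        (kept if length >= min_fragment_length else dropped).append((first, last))

    pos = 1  # next residue not yet covered by a match
    for start, end in sorted(matches):
        if start > pos:
            classify(pos, start - 1)
        pos = max(pos, end + 1)
    if pos <= seq_len:
        classify(pos, seq_len)
    return kept, dropped


def fragment_name(base: str, frag: Range) -> str:
    """Build a name such as CP00000.0_1_89-100 from a base name and a range."""
    return f"{base}_{frag[0]}-{frag[1]}"


def slice_residues(sequence: str, frag: Range) -> str:
    """Extract the residues of a range, without any gap padding."""
    return sequence[frag[0] - 1 : frag[1]]
```

For a 100-residue gene with hypothetical matches at 10-30, 34-60 and 89-100 and `min_fragment_length = 10`, `leftover_fragments` keeps 61-88 and drops 1-9 and 31-33. `slice_residues` returns exactly the residues of a range (12 residues for 89-100), which mirrors the decision above to report sub-sequences without leading dashes.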