New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove newlines in qualifier values when parsing SnapGene files #3913
base: master
Are you sure you want to change the base?
Remove newlines in qualifier values when parsing SnapGene files #3913
Conversation
Qualifier values in SnapGene files may sometimes contain newline characters, represented either as XML entities ('&x0a;', automatically decoded for us into '\n' by the XML parser) or as HTML tags ('<br/>'). We replace those newlines by space characters, and emit a transformation warning if we have to do so. This implies refactoring the XML helper methods into a single XML helper class, so that we have a way to keep track of how many values have been transformed and only emit a single warning at the end of the parsing.
Add a new SnapGene file containing a qualifier value with a line break and check that the expected warning for data transformation is emitted.
Add docstrings for all public methods of the new XmlHelper class.
@@ -51,7 +54,7 @@ def _iterate(handle): | |||
yield (packet_type, length, data) | |||
|
|||
|
|||
def _parse_dna_packet(length, data, record): | |||
def _parse_dna_packet(length, data, record, _): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the unused fourth argument to match the API of the other helper functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
To be frank I’m not entirely happy with this solution. A cleaner design IMHO would have been for all those functions to be methods of the SnapGeneIterator
class, so that all the state could be held within the class without having to pass it around between functions. But I felt this was too much of a refactoring for a simple bugfix.
handler(length, data, record, helper) | ||
|
||
if helper.transformed: | ||
warnings.warn("Some text values have been transformed", BiopythonWarning) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Warnings are dangerous; they get emitted only once in a Python session.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that warnings should not be used for this kind of situation, or are you suggesting that the warning message should be made “unique” (possibly by adding specific informations such as the name of the file being parsed and so on) so that the warning gets emitted for different files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They should not be used. I don't see a reasonable way to make the warning message unique each time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All right, but how can we let the users know that some transformation occurred during the parsing then?
Flatly erroring out just because we had to replace one character inside a qualifier value seems way overkill to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have multiple parser warnings in the GenBank parser and elsewhere in Biopython, but @mdehoon and I disagree on the use of warnings. This came up on another issue recently too.
I am of the view that https://docs.python.org/3/library/warnings.html provides the tools needed for a user to decide how to deal with warnings as they wish (including silencing them all, showing them all, or upgrading them to errors), and we just need to word the warning appropriately to balance their usefulness by default.
e.g. If a warning might be triggered dozens of times on a file, use the same warning string so that by default it is only seen once.
In this case if we want to have lots of warnings shown by default, you could issue a warning on each transformation using problematic value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accepting new lines in the SnapGene files (and raising a ValueError in the GenBank writer) would be reasonable. My impression from the issue report is this is rare and likely a glitch, so I would be happy with the warning (as in this PR), or forcing the user to correct the annotation before writing it.
And to Michiel, if you wanted to silence or otherwise specially treat warnings from SeqIO.parse(...)
you could use a context manager https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings (as we do in our unit tests for verifying a warning is raised). We could add this to our documentation, ideally with a good motivational use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And to Michiel, if you wanted to silence or otherwise specially treat warnings from SeqIO.parse(...) you could use a context manager https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings (as we do in our unit tests for verifying a warning is raised). We could add this to our documentation, ideally with a good motivational use case?
It's too complicated. Only advanced users will know about this. Not to mention that warnings are easy to miss if a script prints lots of output to the screen.
Raising a ValueError
, on the other hand, can simply tell the user to use strict=False
if they are sure they want to ignore the issue found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gouttegd Btw from your example it looks like the line break in the SnapGene file was just included for formatting and does not convey any information. Then you could just let the SnapGene parser always replace the line break with a space, without a warning, and mention somewhere in the documentation that the SnapGene parser does this.
textwrap.shorten(yourtext, width=sys.maxsize)
lets you do this in a clean way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My impression from the issue report is this is rare and likely a glitch
I have never seen it myself in the SnapGene files I’ve got, but in the case reported in #3885 the newline is represented in the file as a XML entity &x10;
. I doubt this can happen as a result of a glitch.
forcing the user to correct the annotation before writing it
I am very much against that.
Because the SnapGene format is binary, practically speaking SnapGene files can only be edited using the SnapGene software. The reason I wrote the SnapGene parser in the first place was so that I could deal with SnapGene files (that I had received from collaborators who didn’t bother to use a more standard format) without having to install SnapGene, that I personally do not use.
Then you could just let the SnapGene parser always replace the line break with a space, without a warning, and mention somewhere in the documentation that the SnapGene parser does this.
I’m fine with advertising the line-break removal in the documentation, but I don’t see why the warning should be removed. It doesn’t hurt and it gives the user (who may not have read the documentation) one more chance to be aware of the removal. What is the problem with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's think about this a bit more carefully. Is our basic assumption that newlines in SnapGene files are
(a) informative;
(b) not informative (i.e. they were just added for formatting, but do not carry information);
(c) sometimes informative and sometimes not informative?
I think @peterjc meant to correct the annotation by modifying the |
Ah, my bad. I would have no objection to that of course.
It all depends on the intention of whoever wrote the qualifier’s value. To be clear, those newlines are not automatically inserted by SnapGene itself: they are there because, when editing the value of a qualifier in SnapGene’s feature editor, the user deliberately choose to insert a line break. SnapGene merely translates that line break into either a HTML tag ( Now, I certainly hope that nobody in their right mind would use the mere presence of line breaks as a way to convey something meaningful. But I wouldn’t be overly surprised that some people do. |
Reading this comment in #3885 :
and considering the offending line with the newline, shown in #3885 :
suggest that we can assume that these newlines are meaningless. The Fasta writer also replaces newlines with spaces, without raising an exception or issuing a warning: >>> from io import StringIO
>>> from Bio import SeqIO
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> sequence = Seq("ACGT")
>>> record = SeqRecord(sequence, id='DNA', description="some piece\nof DNA")
>>> stream = StringIO()
>>> SeqIO.write([record], stream, 'fasta')
>>> stream.seek(0)
>>> print(stream.read()) gives this output:
|
Well, the one I have does not. It translates them to HTML tags (
I am not comfortable with assuming that from one example. I can easily imagine situations where a newline can be meaningful. For example, I work with a database where, in some fields, a certain keyword must be interpreted only if it appears at the beginning of a line (i.e. after a newline character), not if it appears anywhere else. Is that a good idea? No, and I hate it. But I have to work with it. I sure hope people don’t do things like that with qualifier values in SnapGene. And if some people do, I am all in favour of telling them “well, it’s your mess, Biopython has no obligation to deal with it”. So I have no qualms in removing the offending newlines so that the rest of the Biopython code does not have to worry about them. But I still think some kind of warning is warranted. |
I completely forgot about this PR for a while… Looks like the options are:
|
I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the
CONTRIBUTING.rst
file, have runpre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst
andCONTRIB.rst
as part of this pull request, am listedalready, or do not wish to be listed. (This acknowledgement is optional.)
Closes #3885
This PR updates the SnapGene parser to handle the case of SnapGene files containing qualifier values with newline characters.
Inserting newlines in qualifier values with SnapGene should theoretically not cause any problem, as SnapGene should encode them as HTML tags (
<br/>
). However, it seems that it sometimes happen (probably depending on the exact version of SnapGene) that the newline is encoded as a XML entity instead (

). That entity is then automatically transformed into a “real” newline character (\n
) by the XML parser used internally by the SnapGene parser. The presence of such a newline character in a qualifier value can then cause downstream problem, notably if the record is written as a GenBank file.Therefore this PR makes sure any newline in a qualifier value (regardless of whether it is encoded as a HTML tag or a XML entity) is replaced by a space. A warning is emitted to let the user know that a data transformation occurred during the parsing.