Remove newlines in qualifier values when parsing SnapGene files #3913

gouttegd · 2022-04-28T12:49:37Z

I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

This PR updates the SnapGene parser to handle the case of SnapGene files containing qualifier values with newline characters.

Inserting newlines in qualifier values with SnapGene should theoretically not cause any problem, as SnapGene should encode them as HTML tags (<br/>). However, it seems that it sometimes happen (probably depending on the exact version of SnapGene) that the newline is encoded as a XML entity instead (
). That entity is then automatically transformed into a “real” newline character (\n) by the XML parser used internally by the SnapGene parser. The presence of such a newline character in a qualifier value can then cause downstream problem, notably if the record is written as a GenBank file.

Therefore this PR makes sure any newline in a qualifier value (regardless of whether it is encoded as a HTML tag or a XML entity) is replaced by a space. A warning is emitted to let the user know that a data transformation occurred during the parsing.

Qualifier values in SnapGene files may sometimes contain newline characters, represented either as XML entities ('&x0a;', automatically decoded for us into '\n' by the XML parser) or as HTML tags ('<br/>'). We replace those newlines by space characters, and emit a transformation warning if we have to do so. This implies refactoring the XML helper methods into a single XML helper class, so that we have a way to keep track of how many values have been transformed and only emit a single warning at the end of the parsing.

Add a new SnapGene file containing a qualifier value with a line break and check that the expected warning for data transformation is emitted.

Add docstrings for all public methods of the new XmlHelper class.

peterjc · 2022-04-28T15:19:33Z

Bio/SeqIO/SnapGeneIO.py

@@ -51,7 +54,7 @@ def _iterate(handle):
        yield (packet_type, length, data)


-def _parse_dna_packet(length, data, record):
+def _parse_dna_packet(length, data, record, _):


Is the unused fourth argument to match the API of the other helper functions?

Yes.

To be frank I’m not entirely happy with this solution. A cleaner design IMHO would have been for all those functions to be methods of the SnapGeneIterator class, so that all the state could be held within the class without having to pass it around between functions. But I felt this was too much of a refactoring for a simple bugfix.

mdehoon · 2022-04-29T00:21:11Z

Bio/SeqIO/SnapGeneIO.py

+                handler(length, data, record, helper)
+
+        if helper.transformed:
+            warnings.warn("Some text values have been transformed", BiopythonWarning)


Warnings are dangerous; they get emitted only once in a Python session.

Are you saying that warnings should not be used for this kind of situation, or are you suggesting that the warning message should be made “unique” (possibly by adding specific informations such as the name of the file being parsed and so on) so that the warning gets emitted for different files?

They should not be used. I don't see a reasonable way to make the warning message unique each time.

All right, but how can we let the users know that some transformation occurred during the parsing then?

Flatly erroring out just because we had to replace one character inside a qualifier value seems way overkill to me.

We already have multiple parser warnings in the GenBank parser and elsewhere in Biopython, but @mdehoon and I disagree on the use of warnings. This came up on another issue recently too.

I am of the view that https://docs.python.org/3/library/warnings.html provides the tools needed for a user to decide how to deal with warnings as they wish (including silencing them all, showing them all, or upgrading them to errors), and we just need to word the warning appropriately to balance their usefulness by default.

e.g. If a warning might be triggered dozens of times on a file, use the same warning string so that by default it is only seen once.

In this case if we want to have lots of warnings shown by default, you could issue a warning on each transformation using problematic value.

Accepting new lines in the SnapGene files (and raising a ValueError in the GenBank writer) would be reasonable. My impression from the issue report is this is rare and likely a glitch, so I would be happy with the warning (as in this PR), or forcing the user to correct the annotation before writing it.

And to Michiel, if you wanted to silence or otherwise specially treat warnings from SeqIO.parse(...) you could use a context manager https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings (as we do in our unit tests for verifying a warning is raised). We could add this to our documentation, ideally with a good motivational use case?

And to Michiel, if you wanted to silence or otherwise specially treat warnings from SeqIO.parse(...) you could use a context manager https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings (as we do in our unit tests for verifying a warning is raised). We could add this to our documentation, ideally with a good motivational use case?

It's too complicated. Only advanced users will know about this. Not to mention that warnings are easy to miss if a script prints lots of output to the screen.
Raising a ValueError, on the other hand, can simply tell the user to use strict=False if they are sure they want to ignore the issue found.

@gouttegd Btw from your example it looks like the line break in the SnapGene file was just included for formatting and does not convey any information. Then you could just let the SnapGene parser always replace the line break with a space, without a warning, and mention somewhere in the documentation that the SnapGene parser does this.
textwrap.shorten(yourtext, width=sys.maxsize) lets you do this in a clean way.

My impression from the issue report is this is rare and likely a glitch

I have never seen it myself in the SnapGene files I’ve got, but in the case reported in #3885 the newline is represented in the file as a XML entity &x10;. I doubt this can happen as a result of a glitch.

forcing the user to correct the annotation before writing it

I am very much against that.

Because the SnapGene format is binary, practically speaking SnapGene files can only be edited using the SnapGene software. The reason I wrote the SnapGene parser in the first place was so that I could deal with SnapGene files (that I had received from collaborators who didn’t bother to use a more standard format) without having to install SnapGene, that I personally do not use.

Then you could just let the SnapGene parser always replace the line break with a space, without a warning, and mention somewhere in the documentation that the SnapGene parser does this.

I’m fine with advertising the line-break removal in the documentation, but I don’t see why the warning should be removed. It doesn’t hurt and it gives the user (who may not have read the documentation) one more chance to be aware of the removal. What is the problem with that?

Let's think about this a bit more carefully. Is our basic assumption that newlines in SnapGene files are
(a) informative;
(b) not informative (i.e. they were just added for formatting, but do not carry information);
(c) sometimes informative and sometimes not informative?

mdehoon · 2022-05-01T12:23:47Z

@peterjc:

forcing the user to correct the annotation before writing it

@gouttegd :

I am very much against that.
Because the SnapGene format is binary, practically speaking SnapGene files can only be edited using the SnapGene software.

I think @peterjc meant to correct the annotation by modifying the SeqRecord in Python before writing it using SeqIO.write.

gouttegd · 2022-05-01T12:43:14Z

I think @peterjc meant to correct the annotation by modifying the SeqRecord in Python before writing it using SeqIO.write.

Ah, my bad. I would have no objection to that of course.

Let's think about this a bit more carefully. Is our basic assumption that newlines in SnapGene files are
(a) informative;
(b) not informative (i.e. they were just added for formatting, but do not carry information);
(c) sometimes informative and sometimes not informative?

It all depends on the intention of whoever wrote the qualifier’s value. To be clear, those newlines are not automatically inserted by SnapGene itself: they are there because, when editing the value of a qualifier in SnapGene’s feature editor, the user deliberately choose to insert a line break. SnapGene merely translates that line break into either a HTML tag (<br/>) or a XML entity (
). SnapGene itself does not assign any signification to those line breaks.

Now, I certainly hope that nobody in their right mind would use the mere presence of line breaks as a way to convey something meaningful. But I wouldn’t be overly surprised that some people do.

mdehoon · 2022-05-01T13:52:31Z

Reading this comment in #3885 :

The version of SnapGene that I have cleans these characters.

and considering the offending line with the newline, shown in #3885 :

/application_notes=Strong promoter; drives
transcription

suggest that we can assume that these newlines are meaningless.
If so, the parser and the writer can just replace them with spaces (though I would suggest to use textwrap to do this in a clean way).

The Fasta writer also replaces newlines with spaces, without raising an exception or issuing a warning:

>>> from io import StringIO
>>> from Bio import SeqIO
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord

>>> sequence = Seq("ACGT")
>>> record = SeqRecord(sequence, id='DNA', description="some piece\nof DNA")
>>> stream = StringIO()
>>> SeqIO.write([record], stream, 'fasta')
>>> stream.seek(0)
>>> print(stream.read())

gives this output:

>DNA some piece of DNA
ACGT

gouttegd · 2022-05-01T18:39:40Z

The version of SnapGene that I have cleans these characters.

Well, the one I have does not. It translates them to HTML tags (<br/>) that do not break the GenBank writer down the line, but they are still there (if you open a SnapGene file containing such tags with SnapGene, when reading the file it translates the tags back to newlines).

suggest that we can assume that these newlines are meaningless.

I am not comfortable with assuming that from one example. I can easily imagine situations where a newline can be meaningful.

For example, I work with a database where, in some fields, a certain keyword must be interpreted only if it appears at the beginning of a line (i.e. after a newline character), not if it appears anywhere else. Is that a good idea? No, and I hate it. But I have to work with it.

I sure hope people don’t do things like that with qualifier values in SnapGene. And if some people do, I am all in favour of telling them “well, it’s your mess, Biopython has no obligation to deal with it”. So I have no qualms in removing the offending newlines so that the rest of the Biopython code does not have to worry about them. But I still think some kind of warning is warranted.

gouttegd · 2023-01-29T19:15:50Z

I completely forgot about this PR for a while…

Looks like the options are:

Leaving the SnapGene parser untouched, on the rationale that there’s nothing wrong in having newlines in qualifier values and that if the GenBank writer has a problem with such newlines, then it’s the GenBank writer that needs fixing (and a fix is waiting in Handle line breaks in GenBank writer #3911). In that case this PR may be closed.
Have the SnapGene parser remove newlines (along with other HTML tags) in qualifier values and emit a warning, independently on what the GenBank writer may be doing downstream – basically, what is proposed in this PR.
Have the behaviour of the parser configurable with an hypothetical strict flag and/or use another system than a warning to inform users of what’s happening. In this case this PR would probably need so much changes that it would be better to drop it and start anew.

gouttegd added 2 commits April 28, 2022 12:55

Add a new SnapGene test.

945ef93

Add a new SnapGene file containing a qualifier value with a line break and check that the expected warning for data transformation is emitted.

gouttegd requested a review from peterjc as a code owner April 28, 2022 12:49

Add missing docstrings.

7adf957

Add docstrings for all public methods of the new XmlHelper class.

peterjc reviewed Apr 28, 2022

View reviewed changes

mdehoon reviewed Apr 29, 2022

View reviewed changes

erik-whiting mentioned this pull request Jul 3, 2022

Handle line breaks in GenBank writer #3911

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove newlines in qualifier values when parsing SnapGene files #3913

Remove newlines in qualifier values when parsing SnapGene files #3913

gouttegd commented Apr 28, 2022

peterjc Apr 28, 2022

gouttegd Apr 28, 2022

mdehoon Apr 29, 2022

gouttegd Apr 29, 2022

mdehoon Apr 29, 2022

gouttegd Apr 29, 2022

peterjc Apr 29, 2022

peterjc Apr 30, 2022

mdehoon Apr 30, 2022

mdehoon Apr 30, 2022

gouttegd May 1, 2022

mdehoon May 1, 2022

mdehoon commented May 1, 2022

gouttegd commented May 1, 2022

mdehoon commented May 1, 2022

gouttegd commented May 1, 2022

gouttegd commented Jan 29, 2023

Remove newlines in qualifier values when parsing SnapGene files #3913

Are you sure you want to change the base?

Remove newlines in qualifier values when parsing SnapGene files #3913

Conversation

gouttegd commented Apr 28, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mdehoon commented May 1, 2022

gouttegd commented May 1, 2022

mdehoon commented May 1, 2022

gouttegd commented May 1, 2022

gouttegd commented Jan 29, 2023