Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove newlines in qualifier values when parsing SnapGene files #3913

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

gouttegd
Copy link
Contributor

  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run pre-commit
    locally, and understand that continuous integration checks will be used to
    confirm the Biopython unit tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

Closes #3885

This PR updates the SnapGene parser to handle the case of SnapGene files containing qualifier values with newline characters.

Inserting newlines in qualifier values with SnapGene should theoretically not cause any problem, as SnapGene should encode them as HTML tags (<br/>). However, it seems that it sometimes happen (probably depending on the exact version of SnapGene) that the newline is encoded as a XML entity instead (&#x0a;). That entity is then automatically transformed into a “real” newline character (\n) by the XML parser used internally by the SnapGene parser. The presence of such a newline character in a qualifier value can then cause downstream problem, notably if the record is written as a GenBank file.

Therefore this PR makes sure any newline in a qualifier value (regardless of whether it is encoded as a HTML tag or a XML entity) is replaced by a space. A warning is emitted to let the user know that a data transformation occurred during the parsing.

Qualifier values in SnapGene files may sometimes contain newline
characters, represented either as XML entities ('&x0a;', automatically
decoded for us into '\n' by the XML parser) or as HTML tags ('<br/>').

We replace those newlines by space characters, and emit a transformation
warning if we have to do so.

This implies refactoring the XML helper methods into a single XML helper
class, so that we have a way to keep track of how many values have been
transformed and only emit a single warning at the end of the parsing.
Add a new SnapGene file containing a qualifier value with a line break
and check that the expected warning for data transformation is emitted.
@gouttegd gouttegd requested a review from peterjc as a code owner April 28, 2022 12:49
Add docstrings for all public methods of the new XmlHelper class.
@@ -51,7 +54,7 @@ def _iterate(handle):
yield (packet_type, length, data)


def _parse_dna_packet(length, data, record):
def _parse_dna_packet(length, data, record, _):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the unused fourth argument to match the API of the other helper functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

To be frank I’m not entirely happy with this solution. A cleaner design IMHO would have been for all those functions to be methods of the SnapGeneIterator class, so that all the state could be held within the class without having to pass it around between functions. But I felt this was too much of a refactoring for a simple bugfix.

handler(length, data, record, helper)

if helper.transformed:
warnings.warn("Some text values have been transformed", BiopythonWarning)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warnings are dangerous; they get emitted only once in a Python session.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you saying that warnings should not be used for this kind of situation, or are you suggesting that the warning message should be made “unique” (possibly by adding specific informations such as the name of the file being parsed and so on) so that the warning gets emitted for different files?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should not be used. I don't see a reasonable way to make the warning message unique each time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All right, but how can we let the users know that some transformation occurred during the parsing then?

Flatly erroring out just because we had to replace one character inside a qualifier value seems way overkill to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have multiple parser warnings in the GenBank parser and elsewhere in Biopython, but @mdehoon and I disagree on the use of warnings. This came up on another issue recently too.

I am of the view that https://docs.python.org/3/library/warnings.html provides the tools needed for a user to decide how to deal with warnings as they wish (including silencing them all, showing them all, or upgrading them to errors), and we just need to word the warning appropriately to balance their usefulness by default.

e.g. If a warning might be triggered dozens of times on a file, use the same warning string so that by default it is only seen once.

In this case if we want to have lots of warnings shown by default, you could issue a warning on each transformation using problematic value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accepting new lines in the SnapGene files (and raising a ValueError in the GenBank writer) would be reasonable. My impression from the issue report is this is rare and likely a glitch, so I would be happy with the warning (as in this PR), or forcing the user to correct the annotation before writing it.

And to Michiel, if you wanted to silence or otherwise specially treat warnings from SeqIO.parse(...) you could use a context manager https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings (as we do in our unit tests for verifying a warning is raised). We could add this to our documentation, ideally with a good motivational use case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to Michiel, if you wanted to silence or otherwise specially treat warnings from SeqIO.parse(...) you could use a context manager https://docs.python.org/3/library/warnings.html#temporarily-suppressing-warnings (as we do in our unit tests for verifying a warning is raised). We could add this to our documentation, ideally with a good motivational use case?

It's too complicated. Only advanced users will know about this. Not to mention that warnings are easy to miss if a script prints lots of output to the screen.
Raising a ValueError, on the other hand, can simply tell the user to use strict=False if they are sure they want to ignore the issue found.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gouttegd Btw from your example it looks like the line break in the SnapGene file was just included for formatting and does not convey any information. Then you could just let the SnapGene parser always replace the line break with a space, without a warning, and mention somewhere in the documentation that the SnapGene parser does this.
textwrap.shorten(yourtext, width=sys.maxsize) lets you do this in a clean way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My impression from the issue report is this is rare and likely a glitch

I have never seen it myself in the SnapGene files I’ve got, but in the case reported in #3885 the newline is represented in the file as a XML entity &x10;. I doubt this can happen as a result of a glitch.

forcing the user to correct the annotation before writing it

I am very much against that.

Because the SnapGene format is binary, practically speaking SnapGene files can only be edited using the SnapGene software. The reason I wrote the SnapGene parser in the first place was so that I could deal with SnapGene files (that I had received from collaborators who didn’t bother to use a more standard format) without having to install SnapGene, that I personally do not use.

Then you could just let the SnapGene parser always replace the line break with a space, without a warning, and mention somewhere in the documentation that the SnapGene parser does this.

I’m fine with advertising the line-break removal in the documentation, but I don’t see why the warning should be removed. It doesn’t hurt and it gives the user (who may not have read the documentation) one more chance to be aware of the removal. What is the problem with that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's think about this a bit more carefully. Is our basic assumption that newlines in SnapGene files are
(a) informative;
(b) not informative (i.e. they were just added for formatting, but do not carry information);
(c) sometimes informative and sometimes not informative?

@mdehoon
Copy link
Contributor

mdehoon commented May 1, 2022

@peterjc:

forcing the user to correct the annotation before writing it

@gouttegd :

I am very much against that.
Because the SnapGene format is binary, practically speaking SnapGene files can only be edited using the SnapGene software.

I think @peterjc meant to correct the annotation by modifying the SeqRecord in Python before writing it using SeqIO.write.

@gouttegd
Copy link
Contributor Author

gouttegd commented May 1, 2022

I think @peterjc meant to correct the annotation by modifying the SeqRecord in Python before writing it using SeqIO.write.

Ah, my bad. I would have no objection to that of course.

Let's think about this a bit more carefully. Is our basic assumption that newlines in SnapGene files are
(a) informative;
(b) not informative (i.e. they were just added for formatting, but do not carry information);
(c) sometimes informative and sometimes not informative?

It all depends on the intention of whoever wrote the qualifier’s value. To be clear, those newlines are not automatically inserted by SnapGene itself: they are there because, when editing the value of a qualifier in SnapGene’s feature editor, the user deliberately choose to insert a line break. SnapGene merely translates that line break into either a HTML tag (<br/>) or a XML entity (&#10;). SnapGene itself does not assign any signification to those line breaks.

Now, I certainly hope that nobody in their right mind would use the mere presence of line breaks as a way to convey something meaningful. But I wouldn’t be overly surprised that some people do.

@mdehoon
Copy link
Contributor

mdehoon commented May 1, 2022

Reading this comment in #3885 :

The version of SnapGene that I have cleans these characters.

and considering the offending line with the newline, shown in #3885 :

/application_notes=Strong promoter; drives
transcription

suggest that we can assume that these newlines are meaningless.
If so, the parser and the writer can just replace them with spaces (though I would suggest to use textwrap to do this in a clean way).

The Fasta writer also replaces newlines with spaces, without raising an exception or issuing a warning:

>>> from io import StringIO
>>> from Bio import SeqIO
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord

>>> sequence = Seq("ACGT")
>>> record = SeqRecord(sequence, id='DNA', description="some piece\nof DNA")
>>> stream = StringIO()
>>> SeqIO.write([record], stream, 'fasta')
>>> stream.seek(0)
>>> print(stream.read())

gives this output:

>DNA some piece of DNA
ACGT

@gouttegd
Copy link
Contributor Author

gouttegd commented May 1, 2022

The version of SnapGene that I have cleans these characters.

Well, the one I have does not. It translates them to HTML tags (<br/>) that do not break the GenBank writer down the line, but they are still there (if you open a SnapGene file containing such tags with SnapGene, when reading the file it translates the tags back to newlines).

suggest that we can assume that these newlines are meaningless.

I am not comfortable with assuming that from one example. I can easily imagine situations where a newline can be meaningful.

For example, I work with a database where, in some fields, a certain keyword must be interpreted only if it appears at the beginning of a line (i.e. after a newline character), not if it appears anywhere else. Is that a good idea? No, and I hate it. But I have to work with it.

I sure hope people don’t do things like that with qualifier values in SnapGene. And if some people do, I am all in favour of telling them “well, it’s your mess, Biopython has no obligation to deal with it”. So I have no qualms in removing the offending newlines so that the rest of the Biopython code does not have to worry about them. But I still think some kind of warning is warranted.

@gouttegd
Copy link
Contributor Author

I completely forgot about this PR for a while…

Looks like the options are:

  1. Leaving the SnapGene parser untouched, on the rationale that there’s nothing wrong in having newlines in qualifier values and that if the GenBank writer has a problem with such newlines, then it’s the GenBank writer that needs fixing (and a fix is waiting in Handle line breaks in GenBank writer #3911). In that case this PR may be closed.

  2. Have the SnapGene parser remove newlines (along with other HTML tags) in qualifier values and emit a warning, independently on what the GenBank writer may be doing downstream – basically, what is proposed in this PR.

  3. Have the behaviour of the parser configurable with an hypothetical strict flag and/or use another system than a warning to inform users of what’s happening. In this case this PR would probably need so much changes that it would be better to drop it and start anew.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SnapGene files with line breaks in feature qualifiers result in corrupted genbank files
3 participants