NCBIWWW response can contain unexpected XML for large queries #3342

mfajer · 2020-11-09T21:59:49Z

I am responsible for a periodic test that runs a blastp query with the same input and it has been failing about 50% of the time. In discussions with NCBI the issue appears to be the large number of hits returned, although proper debugging by NCBI requires the RID of a failed response.

This leads me to you. The current structure of NCBIWWW.qblast to NCBIXML.parse means that the RID is lost by the time the parsing failure is detected. The RID is in scope during the qblast call, but neither the StringIO wrapper of the response nor the response text itself contains the RID. I am not sure what solution to propose, though, since changing the return type of qblast is likely to be painful for compatibility and stuffing the RID into the StringIO wrapper is quite hacky. Perhaps adding a debug logging option to stdout? However it is done, I think exposing this information is essential to debugging my and any future issues with NCBI.

Finally, there is the unexpected XML itself. It is possible to add an explicit exception to NCBIXML.parse to catch this special case. The issue is always the same, with a "CREATE_VIEW" dropped into the middle of otherwise expected XML tags. I've copied an excerpt below. This isn't a great solution, but does fix the issue. I can post a pull request for that if desired, but I suspect exposing the RID and getting NCBI to actually fix their end will be the preferred approach. Thanks!

<Hit>
  <Hit_num>6172</Hit_num>
  <Hit_id>ref|WP_186191221.1|</Hit_id>
  <Hit_def>transglycosylase SLT domain-containing protein [Burkholderia gladioli]</Hit_def>
  <Hit_accession>WP_186191221</Hit_accession>
  <Hit_len>147</Hit_len>
  ...
</Hit>
CREATE_VIEW




<Hit>
  <Hit_num>6173</Hit_num>
  <Hit_id>ref|WP_186261419.1|</Hit_id>
  <Hit_def>transglycosylase SLT domain-containing protein [Burkholderia gladioli]</Hit_def>
  <Hit_accession>WP_186261419</Hit_accession>
  <Hit_len>206</Hit_len>
    ...
</Hit>

The text was updated successfully, but these errors were encountered:

peterjc · 2020-11-09T23:14:35Z

Could you try if a change like this works for diagnostics?

$ git diff
diff --git a/Bio/Blast/NCBIWWW.py b/Bio/Blast/NCBIWWW.py
index 38bb9b821..5b6e86348 100644
--- a/Bio/Blast/NCBIWWW.py
+++ b/Bio/Blast/NCBIWWW.py
@@ -268,7 +268,10 @@ def qblast(
         status = results[i + len("Status=") : j].strip()
         if status.upper() == "READY":
             break
-    return StringIO(results)
+    answer = StringIO(results)
+    answer.rtoe = rtoe
+    answer.rid = rid
+    return answer
 
 
 qblast._previous = 0

You should then be able to look at the StringIO handle of XML values, and look at the run-time added attributes of .rtoe and .rid

We do something similar with the Entrez interface to record .url as an attribute, mainly for diagnostics.

MarkusPiotrowski · 2020-11-10T07:13:21Z

FYI, I'm working on a new online BLAST module ('CommonURLAPI') which will give you access to the RID, so you can fetch the result several times, in different formats and also delete your search. However, this is not high priority.
I found that error handling for BLAST search is very difficult, because there is no uniform way how errors are reported.

peterjc · 2020-11-10T09:31:21Z

Error handling in the NCBI Entrez online API is also painful, they are not consistent but I can understand why as I believe it an abstraction built on top of multiple back end systems.

mfajer · 2020-11-12T16:05:08Z

@peterjc that patch does work, now to actually catch the failure. The problem is intermittent so this may take a while. Do you have any intention of adding that patch to the repository? I think it would be useful, at least until @MarkusPiotrowski completes the new module. Cheers.

peterjc · 2020-11-12T17:10:20Z

I'm willing to add that patch for diagnostics - no objections @MarkusPiotrowski ?

JoaoRodrigues · 2023-03-02T17:49:21Z

@peterjc @MarkusPiotrowski, any updates on this. Turns out this fell in my plate from the other end, what an interesting coincidence ;)

peterjc · 2023-03-02T18:14:23Z

I don't use the online BLAST in code, so nothing on my side. Do you want to turn my patch into a PR to aid diagnosis and/or adding a recovery mode.

JoaoRodrigues · 2023-03-02T20:24:02Z

I'll give it a shot. Trying to get a minimum reproducible example.

MarkusPiotrowski · 2023-03-03T07:39:20Z

~3 years ago I started to make a new online BLAST module CommonURLAPI based and restricted to the command set of their then recent CommonURL API. It would have introduced a class Blast where you can set the search and retain the results independently later. This would also allow you to download the result in different formats at a later time, because it exposes the RID to the user.
Now I see that I have already said this earlier... :-)
But no, I got stuck during corona and didn't finish it. I wonder, now that NCBI has deprecated their CommonURL API if it is worth to finish.

JoaoRodrigues · 2023-03-03T16:55:02Z

Hi @MarkusPiotrowski, yeah, doesn't sound worth putting any effort into it at all.. I had a look and there's a bunch of different resources but they all seem to push users to AWS/GCP, definitely not what we want.

I managed to get the XML files for a "good" run and a "bad" run (both for the same input). The bad run has a CREATE_VIEW statement somewhere in the middle of the file. I think we can handle this on our end by basically dumping the XML to a file, ignoring these weird lines, and then reparsing the file. A little bit more of IO but should be robust. I also reached out to NCBI, since this is definitely their problem.

My only question is, should we handle this upstream here in Biopython?

peterjc · 2023-03-06T14:19:38Z

Are the bad runs even valid XML?

If we did look for this CREATE_VIEW marker what do we do then, re-request the XML in the hopes it works second time round?

I'm leaning towards we do as little as possible - which could be just exposing the RTOE and RID so that the end user could try fetching the results again?

JoaoRodrigues · 2023-03-06T15:23:49Z

It's valid XML. These extra lines show up in between <Hit> tags. If I remove them, I can parse the file.

Requesting again doesn't help, seems like the XML is pregenerated because you always get the same answer.

peterjc · 2023-03-06T16:23:21Z

If it is valid XML, does our parser need a tweak to ignore these extra lines then?

JoaoRodrigues · 2023-03-06T16:38:06Z

I'm not sure if we should fix it at the parser level (NCBIXML) or directly at the NCBIWWW level (modifying the StringIO object directly). What do you think?

peterjc · 2023-03-06T16:41:59Z

Being pragmatic, whichever is easiest without any (edit: new) side effects like massive memory usage.

JoaoRodrigues · 2023-03-06T18:01:11Z

We already read the entire results in memory:

biopython/Bio/Blast/NCBIWWW.py

Line 257 in 851003c

results = handle.read().decode()

We could modify it to build the buffer in memory, although we'd have to change how we do the checking of the results. The alternative is to use a tempfile to write the filtered results and then reload into memory (replacing the original, so no large memory usage).

peterjc · 2023-03-06T18:22:11Z

In that case, since it has the XML in memory as a string anyway, making the change in NCBIWWW.py looks easiest 👍

JoaoRodrigues mentioned this issue Mar 8, 2023

Sanitize CREATE_VIEW lines from Blast XML outputs. #4256

Merged

JoaoRodrigues self-assigned this Mar 8, 2023

JoaoRodrigues closed this as completed in #4256 Mar 15, 2023

peterjc mentioned this issue Mar 16, 2023

Report RID and RTOE in NCBIWWW.qblast output object #4263

Open

peterjc mentioned this issue May 6, 2024

Blast.NCBIWWW.qblast should return the rid of the submitted query to the caller #4721

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NCBIWWW response can contain unexpected XML for large queries #3342

NCBIWWW response can contain unexpected XML for large queries #3342

mfajer commented Nov 9, 2020 •

edited

Loading

peterjc commented Nov 9, 2020

MarkusPiotrowski commented Nov 10, 2020

peterjc commented Nov 10, 2020

mfajer commented Nov 12, 2020

peterjc commented Nov 12, 2020

JoaoRodrigues commented Mar 2, 2023

peterjc commented Mar 2, 2023

JoaoRodrigues commented Mar 2, 2023

MarkusPiotrowski commented Mar 3, 2023

JoaoRodrigues commented Mar 3, 2023

peterjc commented Mar 6, 2023

JoaoRodrigues commented Mar 6, 2023 •

edited

Loading

peterjc commented Mar 6, 2023

JoaoRodrigues commented Mar 6, 2023

peterjc commented Mar 6, 2023 •

edited

Loading

JoaoRodrigues commented Mar 6, 2023

peterjc commented Mar 6, 2023

NCBIWWW response can contain unexpected XML for large queries #3342

NCBIWWW response can contain unexpected XML for large queries #3342

Comments

mfajer commented Nov 9, 2020 • edited Loading

peterjc commented Nov 9, 2020

MarkusPiotrowski commented Nov 10, 2020

peterjc commented Nov 10, 2020

mfajer commented Nov 12, 2020

peterjc commented Nov 12, 2020

JoaoRodrigues commented Mar 2, 2023

peterjc commented Mar 2, 2023

JoaoRodrigues commented Mar 2, 2023

MarkusPiotrowski commented Mar 3, 2023

JoaoRodrigues commented Mar 3, 2023

peterjc commented Mar 6, 2023

JoaoRodrigues commented Mar 6, 2023 • edited Loading

peterjc commented Mar 6, 2023

JoaoRodrigues commented Mar 6, 2023

peterjc commented Mar 6, 2023 • edited Loading

JoaoRodrigues commented Mar 6, 2023

peterjc commented Mar 6, 2023

mfajer commented Nov 9, 2020 •

edited

Loading

JoaoRodrigues commented Mar 6, 2023 •

edited

Loading

peterjc commented Mar 6, 2023 •

edited

Loading