Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NCBIWWW response can contain unexpected XML for large queries #3342

Closed
mfajer opened this issue Nov 9, 2020 · 17 comments · Fixed by #4256
Closed

NCBIWWW response can contain unexpected XML for large queries #3342

mfajer opened this issue Nov 9, 2020 · 17 comments · Fixed by #4256
Assignees

Comments

@mfajer
Copy link

mfajer commented Nov 9, 2020

I am responsible for a periodic test that runs a blastp query with the same input and it has been failing about 50% of the time. In discussions with NCBI the issue appears to be the large number of hits returned, although proper debugging by NCBI requires the RID of a failed response.

This leads me to you. The current structure of NCBIWWW.qblast to NCBIXML.parse means that the RID is lost by the time the parsing failure is detected. The RID is in scope during the qblast call, but neither the StringIO wrapper of the response nor the response text itself contains the RID. I am not sure what solution to propose, though, since changing the return type of qblast is likely to be painful for compatibility and stuffing the RID into the StringIO wrapper is quite hacky. Perhaps adding a debug logging option to stdout? However it is done, I think exposing this information is essential to debugging my and any future issues with NCBI.

Finally, there is the unexpected XML itself. It is possible to add an explicit exception to NCBIXML.parse to catch this special case. The issue is always the same, with a "CREATE_VIEW" dropped into the middle of otherwise expected XML tags. I've copied an excerpt below. This isn't a great solution, but does fix the issue. I can post a pull request for that if desired, but I suspect exposing the RID and getting NCBI to actually fix their end will be the preferred approach. Thanks!

<Hit>
  <Hit_num>6172</Hit_num>
  <Hit_id>ref|WP_186191221.1|</Hit_id>
  <Hit_def>transglycosylase SLT domain-containing protein [Burkholderia gladioli]</Hit_def>
  <Hit_accession>WP_186191221</Hit_accession>
  <Hit_len>147</Hit_len>
  ...
</Hit>
CREATE_VIEW




<Hit>
  <Hit_num>6173</Hit_num>
  <Hit_id>ref|WP_186261419.1|</Hit_id>
  <Hit_def>transglycosylase SLT domain-containing protein [Burkholderia gladioli]</Hit_def>
  <Hit_accession>WP_186261419</Hit_accession>
  <Hit_len>206</Hit_len>
    ...
</Hit>
@peterjc
Copy link
Member

peterjc commented Nov 9, 2020

Could you try if a change like this works for diagnostics?

$ git diff
diff --git a/Bio/Blast/NCBIWWW.py b/Bio/Blast/NCBIWWW.py
index 38bb9b821..5b6e86348 100644
--- a/Bio/Blast/NCBIWWW.py
+++ b/Bio/Blast/NCBIWWW.py
@@ -268,7 +268,10 @@ def qblast(
         status = results[i + len("Status=") : j].strip()
         if status.upper() == "READY":
             break
-    return StringIO(results)
+    answer = StringIO(results)
+    answer.rtoe = rtoe
+    answer.rid = rid
+    return answer
 
 
 qblast._previous = 0

You should then be able to look at the StringIO handle of XML values, and look at the run-time added attributes of .rtoe and .rid

We do something similar with the Entrez interface to record .url as an attribute, mainly for diagnostics.

@MarkusPiotrowski
Copy link
Contributor

FYI, I'm working on a new online BLAST module ('CommonURLAPI') which will give you access to the RID, so you can fetch the result several times, in different formats and also delete your search. However, this is not high priority.
I found that error handling for BLAST search is very difficult, because there is no uniform way how errors are reported.

@peterjc
Copy link
Member

peterjc commented Nov 10, 2020

Error handling in the NCBI Entrez online API is also painful, they are not consistent but I can understand why as I believe it an abstraction built on top of multiple back end systems.

@mfajer
Copy link
Author

mfajer commented Nov 12, 2020

@peterjc that patch does work, now to actually catch the failure. The problem is intermittent so this may take a while. Do you have any intention of adding that patch to the repository? I think it would be useful, at least until @MarkusPiotrowski completes the new module. Cheers.

@peterjc
Copy link
Member

peterjc commented Nov 12, 2020

I'm willing to add that patch for diagnostics - no objections @MarkusPiotrowski ?

@JoaoRodrigues
Copy link
Member

@peterjc @MarkusPiotrowski, any updates on this. Turns out this fell in my plate from the other end, what an interesting coincidence ;)

@peterjc
Copy link
Member

peterjc commented Mar 2, 2023

I don't use the online BLAST in code, so nothing on my side. Do you want to turn my patch into a PR to aid diagnosis and/or adding a recovery mode.

@JoaoRodrigues
Copy link
Member

I'll give it a shot. Trying to get a minimum reproducible example.

@MarkusPiotrowski
Copy link
Contributor

~3 years ago I started to make a new online BLAST module CommonURLAPI based and restricted to the command set of their then recent CommonURL API. It would have introduced a class Blast where you can set the search and retain the results independently later. This would also allow you to download the result in different formats at a later time, because it exposes the RID to the user.
Now I see that I have already said this earlier... :-)
But no, I got stuck during corona and didn't finish it. I wonder, now that NCBI has deprecated their CommonURL API if it is worth to finish.

@JoaoRodrigues
Copy link
Member

Hi @MarkusPiotrowski, yeah, doesn't sound worth putting any effort into it at all.. I had a look and there's a bunch of different resources but they all seem to push users to AWS/GCP, definitely not what we want.

I managed to get the XML files for a "good" run and a "bad" run (both for the same input). The bad run has a CREATE_VIEW statement somewhere in the middle of the file. I think we can handle this on our end by basically dumping the XML to a file, ignoring these weird lines, and then reparsing the file. A little bit more of IO but should be robust. I also reached out to NCBI, since this is definitely their problem.

My only question is, should we handle this upstream here in Biopython?

@peterjc
Copy link
Member

peterjc commented Mar 6, 2023

Are the bad runs even valid XML?

If we did look for this CREATE_VIEW marker what do we do then, re-request the XML in the hopes it works second time round?

I'm leaning towards we do as little as possible - which could be just exposing the RTOE and RID so that the end user could try fetching the results again?

@JoaoRodrigues
Copy link
Member

JoaoRodrigues commented Mar 6, 2023

It's valid XML. These extra lines show up in between <Hit> tags. If I remove them, I can parse the file.

Requesting again doesn't help, seems like the XML is pregenerated because you always get the same answer.

@peterjc
Copy link
Member

peterjc commented Mar 6, 2023

If it is valid XML, does our parser need a tweak to ignore these extra lines then?

@JoaoRodrigues
Copy link
Member

I'm not sure if we should fix it at the parser level (NCBIXML) or directly at the NCBIWWW level (modifying the StringIO object directly). What do you think?

@peterjc
Copy link
Member

peterjc commented Mar 6, 2023

Being pragmatic, whichever is easiest without any (edit: new) side effects like massive memory usage.

@JoaoRodrigues
Copy link
Member

We already read the entire results in memory:

results = handle.read().decode()

We could modify it to build the buffer in memory, although we'd have to change how we do the checking of the results. The alternative is to use a tempfile to write the filtered results and then reload into memory (replacing the original, so no large memory usage).

@peterjc
Copy link
Member

peterjc commented Mar 6, 2023

In that case, since it has the XML in memory as a string anyway, making the change in NCBIWWW.py looks easiest 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants