NCBIWWW response can contain unexpected XML for large queries #3342
Comments
Could you try if a change like this works for diagnostics?
You should then be able to look at the We do something similar with the Entrez interface to record
FYI, I'm working on a new online BLAST module ('CommonURLAPI') which will give you access to the RID, so you can fetch the result several times, in different formats, and also delete your search. However, this is not high priority.
Error handling in the NCBI Entrez online API is also painful. They are not consistent, but I can understand why, as I believe it is an abstraction built on top of multiple back-end systems.
@peterjc that patch does work, now to actually catch the failure. The problem is intermittent, so this may take a while. Do you have any intention of adding that patch to the repository? I think it would be useful, at least until @MarkusPiotrowski completes the new module. Cheers.
I'm willing to add that patch for diagnostics - no objections, @MarkusPiotrowski?
@peterjc @MarkusPiotrowski, any updates on this? Turns out this landed on my plate from the other end, what an interesting coincidence ;)
I don't use the online BLAST in code, so nothing on my side. Do you want to turn my patch into a PR to aid diagnosis and/or add a recovery mode?
I'll give it a shot. Trying to get a minimal reproducible example.
~3 years ago I started to make a new online BLAST module
Hi @MarkusPiotrowski, yeah, doesn't sound worth putting any effort into it at all. I had a look and there's a bunch of different resources, but they all seem to push users to AWS/GCP, definitely not what we want. I managed to get the XML files for a "good" run and a "bad" run (both for the same input). The bad run has a My only question is, should we handle this upstream here in Biopython?
Are the bad runs even valid XML? If we did look for this, I'm leaning towards doing as little as possible, which could be just exposing the RTOE and RID so that the end user could try fetching the results again?
It's valid XML. These extra lines show up in between Requesting again doesn't help; it seems like the XML is pregenerated, because you always get the same answer.
If it is valid XML, does our parser need a tweak to ignore these extra lines then?
I'm not sure if we should fix it at the parser level (NCBIXML) or directly at the NCBIWWW level (modifying the StringIO object directly). What do you think?
Being pragmatic, whichever is easiest without any (edit: new) side effects like massive memory usage.
We already read the entire results in memory: biopython/Bio/Blast/NCBIWWW.py Line 257 in 851003c
We could modify it to build the buffer in memory, although we'd have to change how we do the checking of the results. The alternative is to use a tempfile to write the filtered results and then reload into memory (replacing the original, so no large memory usage).
In that case, since it has the XML in memory as a string anyway, making the change in
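For illustration, a minimal sketch of the in-memory filtering idea discussed above. It assumes the stray text sits on lines of its own and consists only of the literal `CREATE_VIEW` (the thread does not show the exact form, so treat that as an assumption); `drop_stray_lines` is a hypothetical helper, not part of Biopython.

```python
from io import StringIO

# Assumption: the unexpected text is exactly this token on its own line.
STRAY_MARKER = "CREATE_VIEW"


def drop_stray_lines(handle, marker=STRAY_MARKER):
    """Return a new StringIO with lines consisting only of `marker` removed.

    Hypothetical helper: `handle` is any text handle, such as the StringIO
    returned by a BLAST request.
    """
    cleaned = StringIO()
    for line in handle:
        if line.strip() == marker:
            continue  # skip the stray line, keep everything else untouched
        cleaned.write(line)
    cleaned.seek(0)  # rewind so a parser can read from the start
    return cleaned


# Tiny made-up excerpt, only to show the behaviour.
sample = StringIO(
    "<Hit>\n"
    "  <Hit_num>1</Hit_num>\n"
    "CREATE_VIEW\n"
    "</Hit>\n"
)
filtered = drop_stray_lines(sample).read()
print(filtered)
```

Note that this briefly holds two copies of the results in memory (the original handle and the cleaned buffer), which is the trade-off against the tempfile route mentioned above. If the marker can appear inline within a line, a line-based filter like this would miss it.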
I am responsible for a periodic test that runs a blastp query with the same input and it has been failing about 50% of the time. In discussions with NCBI the issue appears to be the large number of hits returned, although proper debugging by NCBI requires the RID of a failed response.
This leads me to you. The current flow from NCBIWWW.qblast to NCBIXML.parse means that the RID is lost by the time the parsing failure is detected. The RID is in scope during the qblast call, but neither the StringIO wrapper of the response nor the response text itself contains the RID. I am not sure what solution to propose, though, since changing the return type of qblast is likely to be painful for compatibility and stuffing the RID into the StringIO wrapper is quite hacky. Perhaps adding a debug logging option to stdout? However it is done, I think exposing this information is essential to debugging my issue and any future issues with NCBI.
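As a rough sketch of the debug-logging idea, the snippet below pulls the RID and RTOE out of a submission response and logs them. It assumes the response embeds a `QBlastInfoBegin ... QBlastInfoEnd` block with `RID = ...` and `RTOE = ...` lines, which is my understanding of the NCBI URL API; `extract_rid_rtoe`, the logger name, and the sample RID are all made up for illustration and are not Biopython API.

```python
import logging
import re

# Hypothetical logger name; a caller would enable DEBUG output to see the RID.
logger = logging.getLogger("blast_debug")


def extract_rid_rtoe(page_text):
    """Return (rid, rtoe) found in the text, or None for whichever is missing.

    Assumes lines of the form 'RID = <id>' and 'RTOE = <seconds>'.
    """
    rid = re.search(r"^\s*RID\s*=\s*(\S+)", page_text, re.MULTILINE)
    rtoe = re.search(r"^\s*RTOE\s*=\s*(\d+)", page_text, re.MULTILINE)
    return (
        rid.group(1) if rid else None,
        int(rtoe.group(1)) if rtoe else None,
    )


# Made-up response fragment; the RID value here is not a real request ID.
sample_page = """<!--QBlastInfoBegin
    RID = ABC123XYZ
    RTOE = 30
QBlastInfoEnd
-->"""

rid, rtoe = extract_rid_rtoe(sample_page)
logger.debug("BLAST search submitted: RID=%s RTOE=%s", rid, rtoe)
```

Logging at DEBUG level keeps the return type of the request function unchanged, so existing callers would not be affected; whether that is the right place for it is a design question for the maintainers.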
Finally, there is the unexpected XML itself. It is possible to add an explicit exception to NCBIXML.parse to catch this special case. The issue is always the same, with a "CREATE_VIEW" dropped into the middle of otherwise expected XML tags. I've copied an excerpt below. This isn't a great solution, but does fix the issue. I can post a pull request for that if desired, but I suspect exposing the RID and getting NCBI to actually fix their end will be the preferred approach. Thanks!
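To make the "explicit exception" option concrete, here is a hedged sketch of a check that fails early with a message carrying the RID, rather than silently patching the XML. It is a standalone function, not a change to NCBIXML.parse; `UnexpectedBlastXML` and `check_blast_xml` are hypothetical names, and it assumes the stray `CREATE_VIEW` occupies a whole line.

```python
class UnexpectedBlastXML(ValueError):
    """Hypothetical error for BLAST XML that contains stray non-XML text."""


def check_blast_xml(xml_text, rid=None, marker="CREATE_VIEW"):
    """Raise UnexpectedBlastXML if a line consists only of `marker`.

    Returns the text unchanged when nothing suspicious is found, so the
    call can be dropped in front of a parser.
    """
    for number, line in enumerate(xml_text.splitlines(), start=1):
        if line.strip() == marker:
            raise UnexpectedBlastXML(
                f"stray {marker!r} at line {number} of BLAST XML (RID: {rid})"
            )
    return xml_text


# Made-up input and RID, only to show the error message.
try:
    check_blast_xml("<Hit>\nCREATE_VIEW\n</Hit>", rid="ABC123XYZ")
except UnexpectedBlastXML as err:
    print(err)
```

Including the RID in the message would give the user something to pass on to NCBI, which is the piece the report above says is currently lost.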