Possible crash location Bio/Blast/NCBIXML.py", line 106 in endElement or Python-2.7.5-r2/Modules/pyexpat.c:616 #222

Closed
mmokrejs opened this Issue · 10 comments

3 participants

@mmokrejs

Oh Dear,
I am experiencing some crashes. Thanks to Python being configured at install time with "configure --with-pydebug", and thanks to https://pypi.python.org/pypi/faulthandler, I have much better stacktraces in gdb and on STDERR.
It looks like the route took me back to legacy BLASTN and to an old bug in blast. NCBI answered me in the past that they won't fix legacy blastn. So we have to fix the biopython blastn parser, and now it even seems expat/biopython is crashing.

It will take me a while to get through all of the output, but unless the bug is elsewhere, the stacktrace below should be enough. Most likely this is the bug I already saw in biopython-1.59, but right now I have biopython-1.62b (pre-release beta) installed; see the line numbers below.

Fatal Python error: Segmentation fault

Current thread 0x00007f9316072700:
File "/usr/lib64/python2.7/site-packages/Bio/Blast/NCBIXML.py", line 106 in endElement
File "/mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Modules/pyexpat.c", line 618 in EndElement
File "/usr/lib64/python2.7/site-packages/Bio/Blast/NCBIXML.py", line 654 in parse
File "blah.py", line 19469 in parse_blastn_XML_and_write_csv
...
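
For anyone wanting to get a similar Python-level traceback on a segfault, here is a minimal sketch of switching faulthandler on at the top of a script. It assumes the PyPI backport linked above is installed on Python 2.7 (on Python 3.3 and later the module is part of the standard library). The log file name is a hypothetical choice, not something from this report.

```python
import faulthandler
import os
import tempfile

# Hypothetical log location; the file object must stay open for as long
# as the handler is installed.
LOG_PATH = os.path.join(tempfile.gettempdir(), "faulthandler_demo.log")
crash_log = open(LOG_PATH, "w")

# On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, write the Python
# traceback of every thread to crash_log before the process dies.
faulthandler.enable(file=crash_log, all_threads=True)
```

Writing to a dedicated file rather than STDERR is only a convenience here; it keeps the traceback separate from whatever else the job prints.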

(gdb) where
#0 0x00007f9315810acb in raise () from /lib64/libpthread.so.0
#1 0x00007f93149365f6 in faulthandler_fatal_error (signum=11) at faulthandler.c:321
#2 <signal handler called>
#3 0x00007f9315bc6e40 in visit_decref (op=, data=0x0) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Modules/gcmodule.c:360
#4 0x00007f9315abc37c in list_traverse (o=0x6998150, visit=0x7f9315bc6e02 , arg=0x0) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/listobject.c:2362
#5 0x00007f9315bc6f32 in subtract_refs (containers=0x7f9315e789c0 ) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Modules/gcmodule.c:385
#6 0x00007f9315bc7fb3 in collect (generation=2) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Modules/gcmodule.c:925
#7 0x00007f9315bc830c in collect_generations () at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Modules/gcmodule.c:1050
#8 0x00007f9315bc8fc3 in PyObject_GC_Malloc (basicsize=408) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Modules/gcmodule.c:1511
#9 0x00007f9315bc9064 in PyObject_GC_NewVar (tp=0x7f9315e4f120 , nitems=1) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Modules/gcmodule.c:1531
#10 0x00007f9315aae3e3 in PyFrame_New (tstate=0x20000a0, code=0x77d9d50, globals=
{'xml': , 'BlastParser': , '__builtins
': {'bytearray': , 'IndexError': , 'all': , 'help': <_Helper at remote 0x22091b0>, 'vars': , 'SyntaxError': , 'unicode': , 'UnicodeDecodeError': , 'memoryview': , 'isinstance': , 'copyright': <_Printer(_Printer__data='Copyright (c) 2001-2013 Python Software Foundation.\nAll Rights Reserved.\n\nCopyright (c) 2000 BeOpen.com.\nAll Rights Reserved.\n\nCopyright (c) 1995-2001 Corporation for National Research Initiatives.\nAll Rights Reserved.\n\nCopyright (c) 1991-1995 Stichting Mathematisch Centrum, Amsterdam.\nAll Rights Reserved.', _Printer__lines=None, _Printer__name='copyright', _Printer__dirs=(), _Printer__files=(...)) at remote 0x21bdc30>, 'NameError'...(truncated), locals=
{'self': , _bufsize=65516, _cont_handler=<...>, _dtd_handler=, _entity_stack=[], _err_handler=, _lex_handler_prop=None, _parsing=0, _ent_handler=, _interning=None) at remote 0x6998b28>, _mult_al=, _debug=0, _hsp=, _descr=, title=u'gnl|BL_ORD_ID|14 poly_A'...(truncated)) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/frameobject.c:682

I suspect the bug happens because of one of these:
positives=(None, None)
identities=(None, None)
strand=(None, None)

I know these funny tuples were already reported for gaps and identities if I remember right .... so there may be more? :(
https://redmine.open-bio.org/issues/3363
https://redmine.open-bio.org/issues/3354

Why does expat crash on "Hsp_bit-score" (see frames #23 and #26 from gdb below)?

Nevertheless, I think biopython should sanitize its values if the XML entry is crap. If you find out why expat crashes, then that's only good. ;-)
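
To make the "sanitize its values" idea concrete, here is a rough sketch of a post-processing helper. It is not Biopython code: `sanitize_hsp_counts` and `FakeHSP` are hypothetical names, and the only assumption taken from Biopython is that an HSP object carries `query`, `sbjct`, `identities` and `gaps` attributes, with `(None, None)` as the unset value.

```python
def sanitize_hsp_counts(hsp):
    """Fill in identities/gaps from the aligned strings when they are unset.

    `hsp` is any object with `query`, `sbjct`, `identities` and `gaps`
    attributes (hypothetical stand-in for a parsed HSP record).
    """
    unset = (None, None)
    columns = list(zip(hsp.query, hsp.sbjct))
    if hsp.identities is None or hsp.identities == unset:
        # An identity is a column with the same residue on both rows;
        # gap characters never count.
        hsp.identities = sum(
            1 for q, s in columns if q != "-" and q.upper() == s.upper())
    if hsp.gaps is None or hsp.gaps == unset:
        # A gap is a column with "-" in either row.
        hsp.gaps = sum(1 for q, s in columns if q == "-" or s == "-")
    return hsp


class FakeHSP(object):
    """Hypothetical stand-in so the sketch runs without Biopython."""

    def __init__(self, query, sbjct):
        self.query = query
        self.sbjct = sbjct
        self.identities = (None, None)
        self.gaps = (None, None)


fixed = sanitize_hsp_counts(FakeHSP("ACGT-A", "ACCTTA"))
```

Values that the parser did set are left untouched, so the helper would only matter for records where the corresponding XML element was missing.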

#13 0x00007f9315baca72 in run_mod (mod=0x29b97b8, filename=0x7f9315c0fdd5 "",
globals={'xml': , 'BlastParser': , 'builtins': {'bytearray': , 'IndexError': , 'all': , 'help': <Helper at remote 0x22091b0>, 'vars': , 'SyntaxError': , 'unicode': , 'UnicodeDecodeError': , 'memoryview': , 'isinstance': , 'copyright': <Printer(_Printer__data='Copyright (c) 2001-2013 Python Software Foundation.\nAll Rights Reserved.\n\nCopyright (c) 2000 BeOpen.com.\nAll Rights Reserved.\n\nCopyright (c) 1995-2001 Corporation for National Research Initiatives.\nAll Rights Reserved.\n\nCopyright (c) 1991-1995 Stichting Mathematisch Centrum, Amsterdam.\nAll Rights Reserved.', _Printer__lines=None, _Printer__name='copyright', _Printer__dirs=(), _Printer__files=(...)) at remote 0x21bdc30>, 'NameError'...(truncated),
locals={'self': , _bufsize=65516, _cont_handler=<...>, _dtd_handler=, _entity_stack=[], _err_handler=, _lex_handler_prop=None, _parsing=0, _ent_handler=, _interning=None) at remote 0x6998b28>, _mult_al=, _debug=0, _hsp=, _descr=, title=u'gnl|BL_ORD_ID|14 poly_A'...(truncated), flags=0x7fff7c311d50, arena=0x4137e40)
at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Python/pythonrun.c:1365
#14 0x00007f9315bac923 in PyRun_StringFlags (str=0x77d0fc4 "self._end_Hsp_bit_score()", start=258,
globals={'xml': , 'BlastParser': , '__builtins
': {'bytearray': , 'IndexError': , 'all': , 'help': <Helper at remote 0x22091b0>, 'vars': , 'SyntaxError': , 'unicode': , 'UnicodeDecodeError': , 'memoryview': , 'isinstance': , 'copyright': <Printer(_Printer__data='Copyright (c) 2001-2013 Python Software Foundation.\nAll Rights Reserved.\n\nCopyright (c) 2000 BeOpen.com.\nAll Rights Reserved.\n\nCopyright (c) 1995-2001 Corporation for National Research Initiatives.\nAll Rights Reserved.\n\nCopyright (c) 1991-1995 Stichting Mathematisch Centrum, Amsterdam.\nAll Rights Reserved.', _Printer__lines=None, _Printer__name='copyright', _Printer__dirs=(), _Printer__files=(...)) at remote 0x21bdc30>, 'NameError'...(truncated),
locals={'self': , _bufsize=65516, _cont_handler=<...>, _dtd_handler=, _entity_stack=[], _err_handler=, _lex_handler_prop=None, _parsing=0, _ent_handler=, _interning=None) at remote 0x6998b28>, _mult_al=, _debug=0, _hsp=, _descr=, title=u'gnl|BL_ORD_ID|14 poly_A'...(truncated), flags=0x7fff7c311d50)
at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Python/pythonrun.c:1328
#15 0x00007f9315b658b5 in builtin_eval (self=0x0, args=(u'self._end_Hsp_bit_score()',)) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Python/bltinmodule.c:695
#16 0x00007f9315ad5006 in PyCFunction_Call (func=, arg=(u'self._end_Hsp_bit_score()',), kw=0x0) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/methodobject.c:81
#17 0x00007f9315b7b1d4 in call_function (pp_stack=0x7fff7c311f90, oparg=1) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Python/ceval.c:4021
#18 0x00007f9315b75cd8 in PyEval_EvalFrameEx (
f=Frame 0x3edea80, for file /usr/lib64/python2.7/site-packages/Bio/Blast/NCBIXML.py, line 106, in endElement (self=, _bufsize=65516, _cont_handler=<...>, _dtd_handler=, _entity_stack=[], _err_handler=, _lex_handler_prop=None, _parsing=0, _ent_handler=, _interning=None) at remote 0x6998b28>, _mult_al=, _debug=0, _hsp= at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Python/ceval.c:2666
#19 0x00007f9315b7870e in PyEval_EvalCodeEx (co=0x50a4bf0,
globals={'xml': , 'BlastParser': , '__builtins
': {'bytearray': , 'IndexError': , 'all': , 'help': <_Helper at remote 0x22091b0>, 'vars': , 'SyntaxError': , 'unicode': , 'UnicodeDecodeError': , 'memoryview': , 'isinstance': , 'copyright': <_Printer(_Printer__data='Copyright (c) 2001-2013 Python Software Foundation.\nAll Rights Reserved.\n\nCopyright (c) 2000 BeOpen.com.\nAll Rights Reserved.\n\nCopyright (c) 1995-2001 Corporation for National Research Initiatives.\nAll Rights Reserved.\n\nCopyright (c) 1991-1995 Stichting Mathematisch Centrum, Amsterdam.\nAll Rights Reserved.', _Printer__lines=None, _Printer__name='copyright', _Printer__dirs=(), _Printer__files=(...)) at remote 0x21bdc30>, 'NameError'...(truncated), locals=0x0, args=0x4172718, argcount=2, kws=0x0, kwcount=0, defs=0x0, defcount=0,
closure=0x0) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Python/ceval.c:3253
#20 0x00007f9315ab0f2e in function_call (func=,
arg=(, _bufsize=65516, _cont_handler=<...>, _dtd_handler=, _entity_stack=[], _err_handler=, _lex_handler_prop=None, _parsing=0, _ent_handler=, _interning=None) at remote 0x6998b28>, _mult_al=, _debug=0, _hsp=, _descr=, title=u'gnl|BL_ORD_ID|14 poly_A', access...(truncated), kw=0x0)
at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/funcobject.c:526
#21 0x00007f9315a6f840 in PyObject_Call (func=,
arg=(, _bufsize=65516, _cont_handler=<...>, _dtd_handler=, _entity_stack=[], _err_handler=, _lex_handler_prop=None, _parsing=0, _ent_handler=, _interning=None) at remote 0x6998b28>, _mult_al=, _debug=0, _hsp=, _descr=, title=u'gnl|BL_ORD_ID|14 poly_A', access...(truncated), kw=0x0)
at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/abstract.c:2529
#22 0x00007f9315a8ba6f in instancemethod_call (func=,
arg=(, _bufsize=65516, _cont_handler=<...>, _dtd_handler=, _entity_stack=[], _err_handler=, _lex_handler_prop=None, _parsing=0, _ent_handler=, _interning=None) at remote 0x6998b28>, _mult_al=, _debug=0, _hsp=, _descr=, title=u'gnl|BL_ORD_ID|14 poly_A', access...(truncated), kw=0x0)
at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/classobject.c:2602
#23 0x00007f9315a6f840 in PyObject_Call (func=, arg=(u'Hsp_bit-score',), kw=0x0) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/abstract.c:2529
#24 0x00007f9315b7a900 in PyEval_CallObjectWithKeywords (func=, arg=(u'Hsp_bit-score',), kw=0x0) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Python/ceval.c:3890
#25 0x00007f930c412705 in call_with_frame (c=0x67d5bf0, func=, args=(u'Hsp_bit-score',), self=0x50a2ba0) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Modules/pyexpat.c:355
#26 0x00007f930c4135bd in my_EndElementHandler (userData=0x50a2ba0, name=0x5367d60 "Hsp_bit-score") at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Modules/pyexpat.c:616
#27 0x00007f930c1ef2d2 in doContent () from /usr/lib64/libexpat.so.1
#28 0x00007f930c1f01b4 in contentProcessor () from /usr/lib64/libexpat.so.1
#29 0x00007f930c1eae2a in XML_ParseBuffer () from /usr/lib64/libexpat.so.1
#30 0x00007f930c416199 in xmlparse_Parse (self=0x50a2ba0,
args=('n>\n AAAAAAAAAACAAAAAAAAAANAAAAAAAAACAA\n AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\n |||||||||| |||||||||| ||||||||| ||\n \n \n 728\n 49.9773\n 54\n 2.68758e-09\n 97\n 130\n 728\n 761\n 1\n 1\n 31\n 31\n 34\n AAAAAAAAAACAAAAAAAAAANAAAAAAAAACAA\n AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA #31 0x00007f9315ad5006 in PyCFunction_Call (func=,
arg=('n>\n AAAAAAAAAACAAAAAAAAAANAAAAAAAAACAA\n AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\n |||||||||| |||||||||| ||||||||| ||\n \n \n 728\n 49.9773\n 54\n 2.68758e-09\n 97\n 130\n 728\n 761\n 1\n 1\n 31\n 31\n 34\n AAAAAAAAAACAAAAAAAAAANAAAAAAAAACAA\n AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Objects/methodobject.c:81
#32 0x00007f9315b7b1d4 in call_function (pp_stack=0x7fff7c312a70, oparg=2) at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Python/ceval.c:4021
#33 0x00007f9315b75cd8 in PyEval_EvalFrameEx (
f=Frame 0x58dab40, for file /usr/lib64/python2.7/site-packages/Bio/Blast/NCBIXML.py, line 654, in parse (handle=, debug=0, expat=, BLOCK=1024, MARGIN=10, XML_START='<?xml', text='n>\n AAAAAAAAAACAAAAAAAAAANAAAAAAAAACAA\n AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA\n |||||||||| |||||||||| ||||||||| ||\n \n \n 728\n 49.9773\n 54\n 2.68758e-09\n 97\n 130\n 728\n 761\n 1\n 1\n 31\n ...(truncated), throwflag=0)
at /mnt/1TB/var/tmp/portage/dev-lang/python-2.7.5-r2/work/Python-2.7.5/Python/ceval.c:2666

It seems it crashed because there were TWO broken XML entries in the XML stream, and on the third (non-bogus) one it crashed, but deep inside it, at Hsp number 728:

<Iteration>
  <Iteration_iter-num>5195</Iteration_iter-num>
  <Iteration_query-ID>lcl|5195_0</Iteration_query-ID>
  <Iteration_query-def>EYI1BW404I60E4 length=245 xy=3653_1102 region=4 run=R_2007_11_06_15_29_46_</Iteration_query-def>
  <Iteration_query-len>253</Iteration_query-len>
  <Iteration_stat>
    <Statistics>
      <Statistics_db-num>30</Statistics_db-num>
      <Statistics_db-len>20176</Statistics_db-len>
      <Statistics_hsp-len>0</Statistics_hsp-len>
      <Statistics_eff-space>0</Statistics_eff-space>
      <Statistics_kappa>0.41</Statistics_kappa>
      <Statistics_lambda>0.625</Statistics_lambda>
      <Statistics_entropy>0.78</Statistics_entropy>
    </Statistics>
  </Iteration_stat>
  <Iteration_message>No hits found</Iteration_message>
</Iteration>
<Iteration>
  <Iteration_iter-num>5196</Iteration_iter-num>
  <Iteration_query-ID>lcl|5196_0</Iteration_query-ID>
  <Iteration_query-def>EYI1BW404I5AGB length=255 xy=3633_2713 region=4 run=R_2007_11_06_15_29_46_</Iteration_query-def>
  <Iteration_query-len>259</Iteration_query-len>
  <Iteration_stat>
    <Statistics>
      <Statistics_db-num>30</Statistics_db-num>
      <Statistics_db-len>20176</Statistics_db-len>
      <Statistics_hsp-len>0</Statistics_hsp-len>
      <Statistics_eff-space>0</Statistics_eff-space>
      <Statistics_kappa>0.41</Statistics_kappa>
      <Statistics_lambda>0.625</Statistics_lambda>
      <Statistics_entropy>0.78</Statistics_entropy>
    </Statistics>
  </Iteration_stat>
  <Iteration_message>No hits found</Iteration_message>
</Iteration>
<Iteration>
  <Iteration_iter-num>5197</Iteration_iter-num>
  <Iteration_query-ID>lcl|5197_0</Iteration_query-ID>
  <Iteration_query-def>EYI1BW404IB6HP length=88 xy=3302_0331 region=4 run=R_2007_11_06_15_29_46_</Iteration_query-def>
  <Iteration_query-len>166</Iteration_query-len>
  <Iteration_hits>
    <Hit>
      <Hit_num>1</Hit_num>
      <Hit_id>gnl|BL_ORD_ID|14</Hit_id>
      <Hit_def>poly_A</Hit_def>
      <Hit_accession>14</Hit_accession>
      <Hit_len>960</Hit_len>
      <Hit_hsps>
        <Hsp>
          <Hsp_num>1</Hsp_num>
          <Hsp_bit-score>49.9773</Hsp_bit-score>
          <Hsp_score>54</Hsp_score>
          <Hsp_evalue>2.68758e-09</Hsp_evalue>
          <Hsp_query-from>97</Hsp_query-from>
          <Hsp_query-to>130</Hsp_query-to>
          <Hsp_hit-from>1</Hsp_hit-from>
          <Hsp_hit-to>34</Hsp_hit-to>
          <Hsp_query-frame>1</Hsp_query-frame>
          <Hsp_hit-frame>1</Hsp_hit-frame>
          <Hsp_identity>31</Hsp_identity>
          <Hsp_positive>31</Hsp_positive>
          <Hsp_align-len>34</Hsp_align-len>
          <Hsp_qseq>AAAAAAAAAACAAAAAAAAAANAAAAAAAAACAA</Hsp_qseq>
          <Hsp_hseq>AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA</Hsp_hseq>
          <Hsp_midline>|||||||||| |||||||||| ||||||||| ||</Hsp_midline>
        </Hsp>

         plenty of matches, and finally

        <Hsp>
          <Hsp_num>728</Hsp_num>
          <Hsp_bit-score>49.9773</Hsp_bit-score>
          <Hsp_score>54</Hsp_score>
          <Hsp_evalue>2.68758e-09</Hsp_evalue>
          <Hsp_query-from>97</Hsp_query-from>
          <Hsp_query-to>130</Hsp_query-to>
          <Hsp_hit-from>728</Hsp_hit-from>
          <Hsp_hit-to>761</Hsp_hit-to>
          <Hsp_query-frame>1</Hsp_query-frame>
          <Hsp_hit-frame>1</Hsp_hit-frame>
          <Hsp_identity>31</Hsp_identity>
          <Hsp_positive>31</Hsp_positive>
          <Hsp_align-len>34</Hsp_align-len>
          <Hsp_qseq>AAAAAAAAAACAAAAAAAAAANAAAAAAAAACAA</Hsp_qseq>
          <Hsp_hseq>AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA</Hsp_hseq>
          <Hsp_midline>|||||||||| |||||||||| ||||||||| ||</Hsp_midline>
        </Hsp>

Do you want the "bt full" output from gdb instead? ;-)))))) This is likely the longest bug report I ever wrote, and at 4 A.M.

@peterjc
Owner

Can you upload the complete problem BLAST XML file? e.g. Using a GitHub gist.

Does the problem go away if you switch from legacy BLAST to BLAST+ instead? You can also get much richer tabular output directly from the BLAST command line as well, e.g. http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html
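
As a rough illustration of how little code the tabular route needs, here is a sketch that reads BLAST+ tabular output with the standard library only. It assumes the default twelve columns of `-outfmt 6` (no custom column list); `TabularHit` and `parse_tabular` are hypothetical names, and the sample line holds made-up values rather than output from a real run.

```python
import csv
import io
from collections import namedtuple

# Default column order of BLAST+ tabular output (-outfmt 6) when no
# custom column list is requested.
COLUMNS = ("qseqid sseqid pident length mismatch gapopen "
           "qstart qend sstart send evalue bitscore").split()
TabularHit = namedtuple("TabularHit", COLUMNS)


def parse_tabular(handle):
    """Yield one TabularHit per data line of a tabular BLAST report."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and "#" comment headers
        hit = TabularHit(*row)
        # Convert the numeric columns from text to numbers.
        yield hit._replace(
            pident=float(hit.pident), length=int(hit.length),
            mismatch=int(hit.mismatch), gapopen=int(hit.gapopen),
            qstart=int(hit.qstart), qend=int(hit.qend),
            sstart=int(hit.sstart), send=int(hit.send),
            evalue=float(hit.evalue), bitscore=float(hit.bitscore))


# Illustrative line only; the values are invented for the example.
example = io.StringIO(
    u"query1\tsubject1\t91.18\t34\t3\t0\t97\t130\t1\t34\t2.7e-09\t50.0\n")
hits = list(parse_tabular(example))
```

With a custom column list the `COLUMNS` string would have to be changed to match, since the file itself does not name its columns in this format.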

@mmokrejs

I tried to repeat the crash on the existing input XML file (which is huge). So far I have failed to reproduce the crash while just parsing the file and continuing. However, running my full computation, which creates the XML file and then parses it, choked on me again today in the same spot. I wrapped the test in valgrind, but valgrind gave me no output because the child crashed. It probably would not have helped me anyway.

I have two other bug reports in python/matplotlib which show that I am probably facing something overwriting my data in memory. Of course, it could even be a libc issue.

Regarding XML -> tabular output ... As far as I remember, legacy blast cannot produce customized output, and I lacked some columns when I looked into that. Blast+ does not work well for me; I would have to re-test the whole suite again, and that is simply not feasible in a reasonable time. I would surely have to check every single difference and decide whether I accept the new (different) output or not. I stepped back from blast+ to blastn for some tasks in the past, and I just don't believe it would work now. I need reproducible results so that my tests do not "fail" because an alignment changed slightly.

But I agree, I should try blast+ just to see if the memory corruption goes away. Could be caused by blastall of course. Yeah, quite likely, and quite likely NCBI won't fix the legacy blast sources. They already told me that on other topics (you know, you were CC'ed).

@bow
Collaborator

It seems it crashed because there were TWO broken XML entries in the XML stream, and on the third (non-bogus) one it crashed, but deep inside it, at Hsp number 728:

By broken, do you mean <Iteration> elements without <Iteration_hits> or <Iteration_stat>?

Do you see the same error if you use Bio.SearchIO instead of Bio.Blast.NCBIXML? The former uses a completely different XML parser (which is also tested on the same XML files as Bio.Blast.NCBIXML, and more), so it may help.

@mmokrejs

I wouldn't like to mix two things together here. One is the XML stream being interleaved with those statistics items; the other is the crash itself.

Yes, for some reason blastn includes these entries inside the XML stream (prematurely), not only at its very end. They are not evenly scattered along the stream. I think the old biopython parser should not return empty/broken hsp objects in these cases. More at
https://redmine.open-bio.org/issues/3363#note-5
and
https://redmine.open-bio.org/issues/3354

I am sure I will test your new XML parser at some point, Bow, but I think the API is different, so I will have to think about the transition process a bit more. It is not a one-line change. But when I re-run just the XML parsing and downstream part of my test it finishes. So, the crash happens on this XML stream only when the test executed blastall which wrote the .xml (2100567118 bytes) file and Bio.Blast.NCBIXML.parse() is used to parse it in the same running python instance. I have files even 300GB in size which can be parsed successfully using the old XML parser as well.

@bow
Collaborator

I am sure I will test your new XML parser at some point, Bow, but I think the API is different, so I will have to think about the transition process a bit more. It is not a one-line change.

Indeed it's not ~ I was just curious how elementtree (the underlying SearchIO parser) would respond to the XML file. The (None, None) tuple issue that is suspected to be the problem is also obviated in SearchIO (the parser does not use these tuples at all), so that's another check on a probable cause.
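
For a quick check of how ElementTree copes with a very large file, independent of either Biopython parser, something along these lines could be used. It is only a sketch with the standard library: `iter_hsps` is a hypothetical helper, and the sample is a trimmed-down fragment modelled on the XML posted above, wrapped in a single root element.

```python
import io
import xml.etree.ElementTree as ET


def iter_hsps(handle):
    """Yield (query_def, hit_def, bit_score) for each <Hsp>, streaming."""
    query_def = hit_def = None
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "Iteration_query-def":
            query_def = elem.text
        elif elem.tag == "Hit_def":
            hit_def = elem.text
        elif elem.tag == "Hsp":
            yield query_def, hit_def, float(elem.findtext("Hsp_bit-score"))
            elem.clear()  # drop the children of the finished <Hsp>
        elif elem.tag == "Iteration":
            elem.clear()  # drop the children of the finished <Iteration>


SAMPLE = b"""<BlastOutput_iterations>
<Iteration>
  <Iteration_query-def>EYI1BW404IB6HP length=88</Iteration_query-def>
  <Iteration_hits>
    <Hit>
      <Hit_def>poly_A</Hit_def>
      <Hit_hsps>
        <Hsp>
          <Hsp_num>1</Hsp_num>
          <Hsp_bit-score>49.9773</Hsp_bit-score>
        </Hsp>
      </Hit_hsps>
    </Hit>
  </Iteration_hits>
</Iteration>
</BlastOutput_iterations>"""

records = list(iter_hsps(io.BytesIO(SAMPLE)))
```

Clearing each finished element is what keeps memory use roughly flat on multi-gigabyte files; without it the whole tree would accumulate as parsing proceeds.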

But yes, I realize that it's a completely different API that takes some adaptation to :).

So, the crash happens on this XML stream only when the test executed blastall which wrote the .xml (2100567118 bytes) file and Bio.Blast.NCBIXML.parse() is used to parse it in the same running python instance. I have files even 300GB in size which can be parsed successfully using the old XML parser as well.

Hmm..like you said earlier then, this sounds like a race condition bug...

@mmokrejs

The (None, None) tuple issue that is suspected to be the problem is also obviated in SearchIO (the parser does not use these tuples at all), so that's another check on a probable cause.

I don't understand what you mean here. You mean that both will leak this through? Where does the tuple come from at first? I have not checked yet, but I thought I was using celementtree. Ah, yes, the stacktrace really does not show celementtree. I would like to test it first. Then, I would like to use something other than the expat library for parsing. Is some other library with same API usable for biopython (one line change on import line)?

I would rather move the discussion on these XML entries to the https://redmine.open-bio.org/issues/3354 . I posted there example XML files and think biopython should sanitize the output in cases when the blast match was valid but just gaps or identities is empty (and calculate the value on its own) whereas for these interleaving statistics entries it should either skip them or return a different object which would not fit in hsp object structure. What the issue is with the bit score that the gdb stacktrace is showing, I really don't know. I know that hsp.strand can also be (None, None), but that again could be corrected based on the start/stop positions of the alignment. I just never know which one is query and which is sbjct. :(
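
The strand correction mentioned above could look roughly like this. It is a sketch under two assumptions: that a start greater than the end means the minus strand, and that the result is ordered (query, sbjct), which is how the strand tuple is described in Bio.Blast.Record. `infer_strand` is a hypothetical helper, not an existing Biopython function.

```python
def infer_strand(query_start, query_end, sbjct_start, sbjct_end):
    """Return a (query, sbjct) strand tuple derived from the coordinates."""

    def orientation(start, end):
        # Assumption: coordinates running backwards indicate the minus strand.
        return "Plus" if start <= end else "Minus"

    return (orientation(query_start, query_end),
            orientation(sbjct_start, sbjct_end))


# Coordinates of HSP 728 from the XML above: both run forwards.
strand = infer_strand(97, 130, 728, 761)
```

Whether the coordinates are actually reported in that orientation for every BLAST program would need checking against real output before relying on it.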

@bow
Collaborator

I don't understand what you mean here. You mean that both will leak this through? Where does the tuple come from at first?

The tuple is the default value defined in Bio.Blast.Record (from L149 onwards). As pointed out in Eric's bug report, there are some cases where it is left off as a tuple while it should have been an integer.

Is some other library with same API usable for biopython (one line change on import line)?

Hmm..not any that I'm aware of. As you've seen as well, the NCBIXML parser uses an expat parser, which means the start and end element events were defined in NCBIXML.py instead of the library itself.
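
To illustrate what "the events were defined in NCBIXML.py" means in practice, here is a stripped-down sketch of the same pattern with the standard library SAX interface. It is not the Biopython source: `HspHandler` is a hypothetical class, and it looks the handler method up with `getattr` where the stack trace above shows the real parser evaluating a string such as "self._end_Hsp_bit_score()".

```python
import xml.sax


class HspHandler(xml.sax.ContentHandler):
    """Dispatch each closing tag to a method named _end_<tag>, if present."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self._value = ""
        self.bit_scores = []

    def startElement(self, name, attrs):
        self._value = ""  # start collecting text for the new element

    def characters(self, content):
        self._value += content

    def endElement(self, name):
        # "Hsp_bit-score" maps to the method name "_end_Hsp_bit_score".
        method = getattr(self, "_end_" + name.replace("-", "_"), None)
        if method is not None:
            method()

    def _end_Hsp_bit_score(self):
        self.bit_scores.append(float(self._value))


handler = HspHandler()
xml.sax.parseString(
    b"<Hsp><Hsp_num>1</Hsp_num><Hsp_bit-score>49.9773</Hsp_bit-score></Hsp>",
    handler)
```

Because all of the per-element logic lives in the handler class, swapping the underlying XML library would mean rewriting the handler, not changing one import.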

I would rather move the discussion on these XML entries to the https://redmine.open-bio.org/issues/3354 . I posted there example XML files and think biopython should sanitize the output in cases when the blast match was valid but just gaps or identities is empty (and calculate the value on its own) whereas for these interleaving statistics entries it should either skip them or return a different object which would not fit in hsp object structure.

Hmm..looking at the file there and comparing it to the BLAST XML DTD, it seems that the <Iteration_query-ID> element is an optional one, so there will be cases where this element is missing from the XML file. Thus, the file seems to be a DTD-compliant BLAST XML file (as weird as it may seem).

The same applies to the <Iteration_hits> and <Iteration_stat> elements; they are both optional elements according to the DTD so they may or may not be present in a BLAST XML result. <Hsp_identity>, <Hsp_positive>, <Hsp_gaps> are also optional elements.
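
In code, treating these elements as optional comes down to not assuming the child exists. A minimal sketch with the standard library follows; `optional_int` is a hypothetical helper, not part of either Biopython parser.

```python
import xml.etree.ElementTree as ET


def optional_int(parent, tag):
    """Return the integer content of child `tag`, or None if it is absent."""
    text = parent.findtext(tag)
    if text is None or not text.strip():
        return None
    return int(text)


# An <Hsp> that carries identity and positive counts but no <Hsp_gaps>.
hsp = ET.fromstring(
    "<Hsp><Hsp_identity>31</Hsp_identity>"
    "<Hsp_positive>31</Hsp_positive></Hsp>")
identity = optional_int(hsp, "Hsp_identity")
gaps = optional_int(hsp, "Hsp_gaps")
```

Returning a plain `None` for a missing element avoids the ambiguity of a leftover `(None, None)` default where an integer is expected.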

This still doesn't explain the <Hsp_bit-score> error in the stack trace, though. (or the whole error you're seeing for that matter). I'm getting more convinced that it is indeed a memory / race condition issue..

Finally, I should point out that at the moment we are planning to move to GitHub entirely from Redmine (discussion here), so Redmine issues are planned to be migrated here.

@bow
Collaborator

@mmokrejs, are you still seeing the error after porting the code :)?

@mmokrejs

No, the crash was due either to a RAM module which got baked a bit or to a faulty CPU and a worn cooler fan. Since all three were replaced (stepwise), things have been running fine. Too bad I lived with the partly broken CPU for 1.5 years. So even with the old NCBIXML I no longer have the issues. Closing, sorry for the noise.

I hope we solve the XML file format issues elsewhere.

@mmokrejs mmokrejs closed this