This repository has been archived by the owner on Mar 8, 2020. It is now read-only.

Parsing failed due to Exception: Could not determine Python version #27

Closed
bzz opened this issue Jun 21, 2017 · 6 comments

@bzz
Contributor

bzz commented Jun 21, 2017

Using the latest https://gist.github.com/bzz/c0c3dbcab5fecbe48e22167e2ad78595, UAST parsing fails on what seems to be https://github.com/damoeb/kalipo/blob/master/kalipo-ir/harvester/spiders/heise_spider.py

Server log

time="2017-06-21T13:44:13Z" level=debug msg="sending ParseUAST request: Filename:"kalipo-ir/harvester/spiders/heise_spider.py" Language:"python" Content:"import scrapy\nfrom scrapy.contrib.spiders import CrawlSpider, Rule\nfrom scrapy.contrib.linkextractors import LinkExtractor\nfrom scrapy.selector import Selector\n\nfrom harvester.items import Comment\nimport time\nimport calendar\nimport re\n\nclass HeiseSpider(CrawlSpider):\n    name = \"heise\"\n    allowed_domains = [\"www.heise.de\"]\n    start_urls = [\n            \"http://www.heise.de/forum/Telepolis/Kommentare/Ohne-Vorratsdatenspeicherung-sterben-vermisste-Kinder-und-Suizidale/forum-242979/\"\n    ]\n\n    rules = (\n        #Rule(LinkExtractor(allow=('/tp/foren/[^/]+/forum-[0-9]+/list'))),\n\tRule(LinkExtractor(allow=('/posting-[0-9]+/show')), callback='parse_item')\n    )\n\n    def clean_str(self, val):\n\treturn val.replace(u'\\xa0', u' ').strip()\n\n    def to_str(self, arr):\n\treturn self.clean_str(''.join(arr))\n\n    def parse_date(self, val):\n\tgrps = re.search('[0-9]+\\. ([A-Za-z]+) [0-9]{4} [0-9]{2}:[0-9]{2}', val)\n\n\tmnth = grps.group(1)\n\n   \tmonths = ['Januar', 'Februar', 'M\\u00e4rz', 'April', 'Mai', 'Juni', 'Juli', 'August', 'September', 'Oktober', 'November', 'Dezember']\n\tfor index, item in enumerate(months):\n\t   if item.lower() == mnth.lower():\n\t      val = val.replace(mnth, str(index))\n\t      break\n\n\treturn calendar.timegm(time.strptime(val, \"%d. 
%m %Y %H:%M\"))\n\n    def parse_item(self, response):\n        sel = Selector(response)\n\n\n\tisRoot = len(response.xpath(\"//ul[@class='forum_navi'][2]/li\")) == 6\n\n\tif !isRoot:\n\t   # find parent\n\t   parent = response.xpath(\"//span[@class='active_post']/../../../parent::ul[@class='nextlevel_line']/preceding-sibling::div[@class='hover_line']\")\n\t   # get link\n\t   link = parent.xpath(\".//div[@class='thread_title']/a\")\n\t   # extract parent id from href\n\n\n\titem = Comment()\n\titem['text'] = self.to_str(sel.xpath(\"//h3[@class='posting_subject']/text()\").extract()) + self.to_str(sel.xpath(\"//p[@class='posting_text']/text()\").extract())\n\titem['url'] = response.url\n\titem['parent'] = 'unknown'\n\titem['level'] = 0\n\titem['thread'] = re.search('forum-([0-9]+)', response.url).group(1)\n\titem['author'] = self.to_str(sel.xpath(\"//div[@class='user_info']/i//text()\").extract())\n\titem['date'] = self.parse_date(self.to_str(response.xpath(\"//div[@class='posting_date']/text()\").extract()))\n        return item\n\n" "
time="2017-06-21T13:35:14Z" level=error msg="driver bblfsh/python-driver:latest (01BK5BZ6N1S7MZBCSFPADDBFSW) stderr: ERROR:root:Filepath: , Errors: ['Traceback (most recent call last):\n  File "/usr/lib/python3.6/site-packages/python_driver/requestprocessor.py", line 151, in process_request\n    raise Exception(\'Could not determine Python version\')\nException: Could not determine Python version\n']"

Client logs

Read kalipo-ir/harvester/spiders/heise_spider.py, 2247 bytes	Parsing file:'kalipo-ir/harvester/spiders/heise_spider.py'

Panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x1186147]

goroutine 1 [running]:
github.com/bblfsh/sdk/uast.(*Node).ProtoSize(0x0, 0xc4201f9f50)
	/go/src/github.com/bblfsh/sdk/uast/generated.pb.go:510 +0x37
github.com/bblfsh/sdk/uast.(*Node).Marshal(0x0, 0x1c18070, 0xc4200102c0, 0xc4201f9f20, 0x0, 0x0)
	/go/src/github.com/bblfsh/sdk/uast/generated.pb.go:352 +0x2f
main.main.func1(0xc4201c4e60, 0xc4201c4e60, 0x0)
	/go/src/github.com/src-d/analysis-pipeline/juanjo/pyFromGit2ast2pb.go:69 +0x4d5
@juanjux juanjux added the bug label Jun 21, 2017
@juanjux
Contributor

juanjux commented Jun 21, 2017

I'll take a look at it, thanks for the good report.

@juanjux
Contributor

juanjux commented Jun 22, 2017

I can reproduce it. Actually, I think there are two bugs here. The first is that the Python driver says "I can't tell which version this is" and raises an exception instead of defaulting to Python 3 in that case (since it's compatible with both). But even this error, which the native driver correctly returns through the protocol (tested), shouldn't cause a nil pointer dereference in the SDK, so that is the second bug.

I'll add both here as a list and keep it updated:

  • Python driver fails instead of returning Python 3 when the file's language version can't be determined. In progress.
  • Nil pointer dereference in the SDK protocol buffer serializer when the error above is returned.
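
To illustrate the first item, here is a minimal sketch of the "default instead of raise" behavior. It is not the driver's actual code: `detect_version`, the `probes` argument and `DEFAULT_VERSION` are hypothetical names, and each probe stands in for whatever check the driver uses to decide whether a given Python version can parse the source.

```python
from typing import Callable, Iterable, Tuple

DEFAULT_VERSION = 3  # assumption: fall back to Python 3, as proposed above


def detect_version(
    source: str,
    probes: Iterable[Tuple[int, Callable[[str], bool]]],
    default: int = DEFAULT_VERSION,
) -> int:
    """Return the first version whose probe accepts `source`, else `default`."""
    for version, can_parse in probes:
        if can_parse(source):
            return version
    # No probe accepted the source: default instead of raising.
    return default
```

With probes that all reject the input, the call returns the default rather than raising "Could not determine Python version".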

@bzz
Contributor Author

bzz commented Jun 26, 2017

Thank you for the assessment!
I believe we have been affected by bug 2 in many other cases, and fixing it in the SDK would be 🎉

@juanjux
Contributor

juanjux commented Jun 26, 2017

Yes, this bug is my first priority after I help the machine learning team confirm bblfsh/bblfshd/issues/34, since it could be related to bblfsh/bblfshd/issues/36.

@juanjux
Contributor

juanjux commented Jun 26, 2017

Some of the failing files can't be parsed because they mix tabs and spaces inconsistently, which makes me wonder whether they even run. Python's AST module checks this before it even starts parsing (it raises a TabError in Python 3 and a syntax error in Python 2), so no AST is produced, which by the spec is a fatal error (and one is returned, though the message could be better).
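
The Python 3 side of this can be reproduced with a small sketch. The snippet below is illustrative only: `MIXED` is a made-up source that indents the method with four spaces and its body with a tab, the same pattern as in the reported file, and `try_parse` is a hypothetical helper, not part of the driver.

```python
import ast

# "def" is indented with four spaces, its body with a single tab.
MIXED = "class A:\n    def f(self):\n\treturn 1\n"


def try_parse(source):
    """Return (tree, None) on success, or (None, message) when no AST can be built."""
    try:
        return ast.parse(source), None
    except SyntaxError as exc:  # TabError and IndentationError are subclasses
        return None, f"{type(exc).__name__}: {exc.msg}"


tree, error = try_parse(MIXED)
print(tree, error)
```

On Python 3 the call yields no tree and an error whose type is `TabError`, which is the situation the driver has to report as fatal.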

The solution would be to fix the files before trying to parse them, using any of the available tools. I'm not sure this should be done on the Python driver side, though. If that solution works for you I'll close the report (if not, please tell me and I'll reopen it).
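
One possible pre-processing step, sketched under assumptions: expand tabs in the leading whitespace of each line to 8-column tab stops, which is how Python 2 interpreted them. `expand_leading_tabs` is a hypothetical helper, not an existing tool, and it works line by line, so it would also rewrite leading tabs inside multi-line string literals.

```python
import ast


def expand_leading_tabs(source, tabsize=8):
    """Replace tabs in each line's leading whitespace with spaces."""
    fixed = []
    for line in source.splitlines(keepends=True):
        body = line.lstrip(" \t")
        indent = line[: len(line) - len(body)]
        # expandtabs pads each tab to the next multiple of `tabsize`.
        fixed.append(indent.expandtabs(tabsize) + body)
    return "".join(fixed)


mixed = "class A:\n    def f(self):\n\treturn 1\n"
cleaned = expand_leading_tabs(mixed)
ast.parse(cleaned)  # parses once the indentation is spaces only
```

This only helps when the file is valid under the 8-column interpretation; a file with other syntax errors will still fail to parse.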

The nil pointer actually occurs because on fatal errors the response status is "fatal" (as it is with the code you provided), the errors field holds a list of errors, and the UAST is nil, so that's not actually a bug.
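
In other words, the caller has to check the status before touching the UAST. A minimal sketch of that check follows; `ParseResponse` and `uast_or_raise` are hypothetical stand-ins that only mirror the response shape described above (status, errors, UAST), not the SDK's real types, and the `"ok"` status value is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ParseResponse:  # hypothetical stand-in for the driver's response
    status: str
    errors: List[str] = field(default_factory=list)
    uast: Optional[Any] = None


def uast_or_raise(resp):
    """Return the UAST, or raise with the driver's errors if there is none."""
    if resp.status == "fatal" or resp.uast is None:
        raise ValueError("parse failed: " + "; ".join(resp.errors))
    return resp.uast
```

A client written this way reports the driver's error list instead of trying to serialize a missing UAST.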

@juanjux juanjux closed this as completed Jun 26, 2017
@juanjux
Contributor

juanjux commented Jun 26, 2017

I've created a separate issue for the syntax error not being propagated to the response:

#28
