Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixed lxml errors when reading Tomcat error messages. #92

Closed
wants to merge 1 commit into from

Conversation

atkinson
Copy link

@atkinson atkinson commented Aug 5, 2013

When parsing error messages pysolr assumes that Tomcat will send a certain flavour of invalid response.

Sometime in Tomcat 6 (or maybe Solr4) the assertion that this code was based on became untrue, and so the error handling code in pysolr began creating it's own error. This may only be true when using lxml (it's required in my project so I haven't tested without).

This fix prevents pysolr from obscuring the tomcat error message with it's own, if it fails to find the tag it's looking for.

This is what I was getting before making this fix:

File "/home/webapps/.virtualenvs/myapp/local/lib/python2.7/site-packages/pysolr.py", line 404, in _scrape_response
p_nodes = body_node.cssselect('p')
AttributeError: 'NoneType' object has no attribute 'cssselect'

…g tomcat sends certain invalid responses is dangerous
@acdha
Copy link
Collaborator

acdha commented Aug 5, 2013

I think this could be a bit cleaner: rather than trapping all AttributeErrors, we should probably just have an if not body_node bailout so we don't potentially mask other errors further down.

The if reason is None or p_nodes is None: check also seems dangerous: unlike reason, p_nodes is not certain to be defined. From a quick read of that code, I'm not sure why we need the second check - it seems like the existing reason check is sufficient.

@atkinson
Copy link
Author

atkinson commented Aug 5, 2013

That would work too.

@MaximusV
Copy link

Had the exact same issue with Solr 4.5 on Tomcat6 using lxml 3.2.3. Took acdha's point on board, just put an 'if body_node:' around the whole p_nodes section, line 431 in the diff. I'll comment in the diff where I made the change. Works fine for me.
Thanks!

@@ -426,25 +426,28 @@ def _scrape_response(self, headers, response):
dom_tree = None

if server_type == 'tomcat':
# Tomcat doesn't produce a valid XML response
soup = lxml.html.fromstring(response)
body_node = soup.find('body')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just put 'if body_node:' here and indented the p_nodes declaration and processing loop.

@excieve
Copy link

excieve commented Apr 10, 2014

Is there any fix planned for this? Experiencing it with pysolr 3.2.0, lxml 3.3.4, tomcat 6.0.35 and solr 4.7.1.

@marcelchastain
Copy link

@acdha are we waiting on someone to make a pull request with cleaner code for this? @MaximusV has a pretty straightforward fix, would that qualify?


if reason is None:
if reason is None or p_nodes is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p_nodes won't be defined here if something causes that big try block to bail out before line 433. Is it even necessary to check here, however, given that such a failure would leave reason as None?

@acdha
Copy link
Collaborator

acdha commented May 2, 2014

@marcelchastain I would like to see a cleaner patch – unless I'm missing something this one introduces a check on the p_nodes variable which isn't always defined. We need to clean that up before merging. If you wanted to go for honors a test using a canned Tomcat error message which triggers this codepath would be much appreciated.

@marcelchastain
Copy link

@acdha thanks for the quick reply. I'll try to get something going later today

@frankamp
Copy link

Looking at that method, and the issues surrounding lxml, it seems like a small bit of manual parsing that has simpler rules for finding an error message is a better idea and fixes all of the complaints around lxml. I've proposed an alternative #133 Its not perfect, but for us it kills 3 deps, improves build time by 5-10 minutes and works great.

@acdha
Copy link
Collaborator

acdha commented Oct 24, 2014

I tweaked @frankamp's patch in #133 a bit and am liking the reduced dependencies:

https://github.com/acdha/pysolr/tree/simple-error-extraction

Does anyone have some actual Tomcat error messages which we could pull into the test suite? I'm thinking that a simple regex or two to hit the most common cases and falling back to the raw HTML is better than spending time staying in sync with Tomcat, particularly since we're already passing the full response as extra logging data:

https://github.com/toastdriven/pysolr/blob/6e62fad989192d206c21e9acee28d5b1e1a8a0db/pysolr.py#L319-L321

@pembo13
Copy link

pembo13 commented Jan 21, 2015

What's the status of this, I'm still getting errors when trying to build an index.

@acdha
Copy link
Collaborator

acdha commented Jan 21, 2015

@pembo13 If you can, please test the branch I referenced above – and send us some of the Tomcat error messages you're receiving so we can add them to the test suite.

@domenkozar
Copy link

This can be closed now, code no longer exists.

@acdha acdha closed this May 28, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants