fixed lxml errors when reading Tomcat error messages. #92
…g tomcat sends certain invalid responses is dangerous
I think this could be a bit cleaner: rather than trapping all AttributeErrors, we should probably just have an The
That would work too.
Had the exact same issue with Solr 4.5 on Tomcat6 using lxml 3.2.3. Took acdha's point on board, just put an 'if body_node:' around the whole p_nodes section, line 431 in the diff. I'll comment in the diff where I made the change. Works fine for me.
@@ -426,25 +426,28 @@ def _scrape_response(self, headers, response):
         dom_tree = None

         if server_type == 'tomcat':
             # Tomcat doesn't produce a valid XML response
             soup = lxml.html.fromstring(response)
             body_node = soup.find('body')
Just put 'if body_node:' here and indented the p_nodes declaration and processing loop.
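For anyone following along, here is a minimal sketch of the guard described in this comment. `extract_tomcat_reason` is a hypothetical helper written for illustration, not pysolr's actual code, and "return the first non-empty `<p>`" is an assumed stand-in for the real processing loop. It tests `body_node is not None` rather than `if body_node:` because `find()` returns `None` when nothing matches, while an lxml element that exists but has no child elements evaluates as false.

```python
import lxml.html


def extract_tomcat_reason(response):
    """Return the text of the first non-empty <p> under <body>, or None.

    Hypothetical helper for illustration; not pysolr's actual code.
    """
    soup = lxml.html.fromstring(response)
    body_node = soup.find('body')

    # find() returns None when there is no <body> child, so check for
    # that explicitly instead of catching AttributeError later on.
    if body_node is None:
        return None

    # Only reached when a <body> element was actually found.
    p_nodes = body_node.findall('.//p')
    for p_node in p_nodes:
        text = p_node.text_content().strip()
        if text:
            return text
    return None
```

Returning `None` here leaves it to the caller to fall back to the raw response, so the server's own message is not replaced by a parsing error.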
Is there any fix planned for this? Experiencing it with pysolr 3.2.0, lxml 3.3.4, tomcat 6.0.35 and solr 4.7.1.
-        if reason is None:
+        if reason is None or p_nodes is None:
p_nodes won't be defined here if something causes that big try block to bail out before line 433. Is it even necessary to check here, however, given that such a failure would leave reason as None?
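A rough sketch of the point being made here, under stated assumptions: `scrape_reason` and its `parse` argument are hypothetical names, with `parse` standing in for the lxml lookup and returning a list of paragraph strings. If `reason` is bound before the `try` block, a failure anywhere inside it leaves `reason` as `None`, so the single `reason is None` check is enough and `p_nodes` never needs to be inspected afterwards.

```python
def scrape_reason(response, parse):
    """Illustrative only; `parse` is a hypothetical stand-in for the
    lxml-based lookup and may raise on unexpected markup."""
    # Bind reason before the try block so the check below can never
    # hit an unbound local, whichever line inside the block raises.
    reason = None
    try:
        p_nodes = parse(response)
        for text in p_nodes:
            if text.startswith('message'):
                reason = text[len('message'):].strip()
                break
    except AttributeError:
        # e.g. calling .find() on None; reason is still None here.
        pass

    if reason is None:
        # Fall back to the raw response rather than masking the
        # server's error with one of our own.
        reason = response
    return reason
```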
@marcelchastain I would like to see a cleaner patch – unless I'm missing something this one introduces a check on the p_nodes variable which isn't always defined. We need to clean that up before merging. If you wanted to go for honors a test using a canned Tomcat error message which triggers this codepath would be much appreciated.
@acdha thanks for the quick reply. I'll try to get something going later today
Looking at that method, and the issues surrounding lxml, it seems like a small bit of manual parsing that has simpler rules for finding an error message is a better idea and fixes all of the complaints around lxml. I've proposed an alternative in #133. It's not perfect, but for us it kills 3 deps, improves build time by 5-10 minutes and works great.
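To make the idea of manual parsing concrete, here is a minimal sketch that uses only the standard library. It is not the code from #133; the function name, both patterns, and the sample page are assumptions written by hand for illustration (the sample imitates the general shape of a Tomcat error page and was not captured from a real server). Anything the patterns do not recognise is returned unchanged.

```python
import re

# Assumed common case: the message sits in a "<b>message</b> <u>...</u>"
# pair. Both patterns are guesses for illustration, not pysolr's rules.
MESSAGE_RE = re.compile(r'<b>message</b>\s*<u>(.*?)</u>',
                        re.IGNORECASE | re.DOTALL)
HEADING_RE = re.compile(r'<h1>(.*?)</h1>', re.IGNORECASE | re.DOTALL)


def extract_error_reason(html):
    """Return the first recognisable error text, else the raw HTML."""
    for pattern in (MESSAGE_RE, HEADING_RE):
        match = pattern.search(html)
        if match and match.group(1).strip():
            return match.group(1).strip()
    # Nothing matched: hand back the response untouched so the
    # original server output is never hidden.
    return html


# Hand-written sample, not a real captured response.
SAMPLE = (
    '<html><head><title>Error report</title></head>'
    '<body><h1>HTTP Status 400 - undefined field foo</h1>'
    '<p><b>type</b> Status report</p>'
    '<p><b>message</b> <u>undefined field foo</u></p></body></html>'
)
reason = extract_error_reason(SAMPLE)
```

The trade-off is that a regex only covers the page layouts someone has actually seen, which is why the fallback returns the whole response instead of raising.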
I tweaked @frankamp's patch in #133 a bit and am liking the reduced dependencies: https://github.com/acdha/pysolr/tree/simple-error-extraction Does anyone have some actual Tomcat error messages which we could pull into the test suite? I'm thinking that a simple regex or two to hit the most common cases and falling back to the raw HTML is better than spending time staying in sync with Tomcat, particularly since we're already passing the full response as extra logging data:
What's the status of this? I'm still getting errors when trying to build an index.
@pembo13 If you can, please test the branch I referenced above – and send us some of the Tomcat error messages you're receiving so we can add them to the test suite.
This can be closed now; the code no longer exists.
When parsing error messages, pysolr assumes that Tomcat will send a certain flavour of invalid response.
Sometime in Tomcat 6 (or maybe Solr 4) the assumption that this code was based on became untrue, and so the error handling code in pysolr began raising its own error. This may only be true when using lxml (it's required in my project so I haven't tested without).
This fix prevents pysolr from obscuring the Tomcat error message with its own if it fails to find the tag it's looking for.
This is what I was getting before making this fix: