Check that the status gets modified for pages that return 304 #279
Comments
Maybe it would make sense to share the code that handles this across the *FetcherBolt implementations, perhaps in a superclass?
HttpProtocol (https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/storm/crawler/protocol/httpclient/HttpProtocol.java#L131) uses the values of the keys cachedLastModified and cachedEtag if they are present in the metadata. At the moment these key/value pairs are not stored automatically. See #99 and #109.
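For illustration, here is a minimal sketch of how those two metadata keys could be turned into conditional request headers. It is not the actual HttpProtocol code: the class ConditionalHeaders and its method are hypothetical, the metadata is simplified to a Map<String, String>, and the mapping to the standard If-Modified-Since and If-None-Match headers is an assumption about what the protocol implementation does with the values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of storm-crawler.
class ConditionalHeaders {

    // Builds the conditional request headers from the cached validators.
    // Assumes metadata is a plain String-to-String map (a simplification).
    static Map<String, String> fromMetadata(Map<String, String> metadata) {
        Map<String, String> headers = new LinkedHashMap<>();

        // A cached Last-Modified value becomes If-Modified-Since.
        String lastModified = metadata.get("cachedLastModified");
        if (lastModified != null && !lastModified.isEmpty()) {
            headers.put("If-Modified-Since", lastModified);
        }

        // A cached ETag value becomes If-None-Match.
        String etag = metadata.get("cachedEtag");
        if (etag != null && !etag.isEmpty()) {
            headers.put("If-None-Match", etag);
        }

        // An empty map means the request is not conditional.
        return headers;
    }
}
```

With neither key present the returned map is empty, so the request goes out unconditionally, which matches the current situation where the key/value pairs are not stored automatically.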
Note to self: need a unit test with an embedded web server.
Added a fix for FetcherBolt and an attempt at writing a JUnit test with an embedded server in a separate branch (it does not work yet, but it's a start).
…d test using Wiremock + http protocol returns empty array when entity is null #279
SimpleFetcherBolt does nothing about it so the pages are parsed as usual.
FetcherBolt skips the parsing but probably doesn't update the status either.
Need to check if/how we handle conditional requests, as a server should return a 304 only in response to one, although in practice I have seen servers do so even for a non-conditional request.
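The behaviour being asked for could be sketched roughly as below: a 304 skips parsing but still produces a status update. Everything here is hypothetical (the class NotModifiedHandling, its Status enum and the Decision type are not the project's own), and treating a 304 as FETCHED is an assumption rather than a confirmed design decision.

```java
// Hypothetical sketch, not the FetcherBolt or SimpleFetcherBolt code.
class NotModifiedHandling {

    // Simplified stand-in for the crawler's status values (assumption).
    enum Status { FETCHED, REDIRECTION, FETCH_ERROR }

    // What a fetcher should do with a response: which status to record
    // and whether the content should go on to the parser.
    static final class Decision {
        final Status status;
        final boolean parse;

        Decision(Status status, boolean parse) {
            this.status = status;
            this.parse = parse;
        }
    }

    static Decision decide(int httpCode) {
        if (httpCode == 304) {
            // Not modified: nothing to parse, but the status must still be updated.
            return new Decision(Status.FETCHED, false);
        }
        if (httpCode >= 200 && httpCode < 300) {
            return new Decision(Status.FETCHED, true);
        }
        if (httpCode >= 300 && httpCode < 400) {
            return new Decision(Status.REDIRECTION, false);
        }
        return new Decision(Status.FETCH_ERROR, false);
    }
}
```

Keeping the decision in one place like this is one way both fetcher bolts could end up handling a 304 identically.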