Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check that the status gets modified for pages that return 304 #279

Closed
jnioche opened this issue Apr 28, 2016 · 4 comments
Closed

Check that the status gets modified for pages that return 304 #279

jnioche opened this issue Apr 28, 2016 · 4 comments
Assignees
Labels
Milestone

Comments

@jnioche
Copy link
Contributor

jnioche commented Apr 28, 2016

SimpleFetcherBolt does nothing about it so the pages are parsed as usual.

FetcherBolt skips the parsing but probably doesn't update the status either.

Need to check if/how we handle conditional requests as a server would return a 304 only in such cases, although in practice I have seen servers do so even with a non-conditional one.

@jnioche jnioche self-assigned this Apr 28, 2016
@jnioche jnioche added the bug label Apr 28, 2016
@jnioche jnioche added this to the 0.10 milestone Apr 28, 2016
@jnioche
Copy link
Contributor Author

jnioche commented Apr 28, 2016

Maybe it would make sense to share the code that handles that between the *FetcherBolt implementations? a super class?

@jnioche
Copy link
Contributor Author

jnioche commented Apr 28, 2016

[https://github.com/DigitalPebble/storm-crawler/blob/master/core/src/main/java/com/digitalpebble/storm/crawler/protocol/httpclient/HttpProtocol.java#L131] uses the values for the keys cachedLastModified and cachedEtag if present in the metadata.

At the moment these k/v are not stored automatically See #99 and #109

@jnioche
Copy link
Contributor Author

jnioche commented Apr 28, 2016

Note to self : need unit test with an embedded web server.
[http://wiremock.org/]
or classes from HTTPClient itself [https://thecarlhall.wordpress.com/2010/03/25/unit-testing-with-httpclients-localtestserver/]

@jnioche
Copy link
Contributor Author

jnioche commented Apr 28, 2016

Added fix for FetcherBolt + attempt at writing a Junit test with an embedded server in a separate branch (does not work yet but it's a start)

jnioche added a commit that referenced this issue Apr 29, 2016
…d test using Wiremock + http protocol returns empty array when entity is null #279
@jnioche jnioche closed this as completed Apr 29, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant