Deprecate HTTP fetcher support #92
Comments
+1 to your suggestion Ken

My intuition is that crawler-commons should continue supporting it. @ken - why do you suggest to deprecate it? Is it because we can expect […]
My intuition is that if we create a common library for crawling, the main […]
Although I agree that it is a high maintenance component as Igor seems to […]

On Wed, Sep 2, 2015 at 12:25 AM, Julien Nioche notifications@github.com
We want to have code that's common across all crawlers. There are several HTTP libraries that crawlers could use, of which HttpClient is just one, so for that reason alone I wasn't confident it made sense to add it in the first place. Then how you configure the fetcher depends a lot on your use case - e.g. do you need to handle cookie caching? And finally there's the issue of keeping it in sync with HttpClient.

As for what should be in crawler-commons, here's the page from the original meeting (Nov 4th, 2009 at ApacheCon) where we first discussed this project: http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

What's not yet in crawler-commons that I think would be useful is: […]
I can think of other things like a DB of link shortening domains, similar to what we do for TLD support.
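A registry like that could start as a simple host lookup, in the spirit of crawler-commons' existing TLD support. A minimal sketch; the class name and the domain entries are hypothetical, and a real version would load a maintained data file rather than a hard-coded list:

```java
import java.util.Set;

// Hypothetical sketch of the suggested link-shortener registry.
// The entries below are illustrative only; a real registry would be
// loaded from a curated data file, as the TLD support does.
public class UrlShortenerFinder {

    private static final Set<String> SHORTENERS =
            Set.of("bit.ly", "t.co", "goo.gl", "tinyurl.com");

    /** Returns true if the host is a known link-shortening domain. */
    public static boolean isShortened(String host) {
        return SHORTENERS.contains(host.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(isShortened("bit.ly"));      // true
        System.out.println(isShortened("example.com")); // false
    }
}
```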
Thank you for the link and explanations. Now I am still not for deprecating it but not against either, I trust your […]

On Wed, Sep 2, 2015 at 11:48 PM, Ken Krugler notifications@github.com
@kkrugler do you want to submit a PR for this?
@jnioche yes, I'll do that. As a side note, I just had to update the Bixo version of SimpleHttpFetcher to exclude some ciphers that were causing SSL socket connection failures when fetching pages from Wikipedia. This is specific to Java 7/8 and how ciphers are ordered on these platforms, but if we were going to add support for conditionally configuring that in crawler-commons, it would add another level of complexity to the API that I really don't want to implement or maintain :)
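For readers hitting the same failures, the general shape of the fix is to filter the JVM's enabled cipher suites before they are handed to the socket factory. A JDK-only sketch under stated assumptions: the excluded suite patterns below are illustrative, since the actual Bixo exclusion list isn't shown in this thread:

```java
import java.util.Arrays;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

// Minimal sketch of cipher-suite filtering. The banned substrings are
// examples only, not Bixo's real exclusion list.
public class CipherFilter {

    /** Returns the supported suites minus any whose name contains a banned substring. */
    public static String[] excludeCiphers(String[] supported, String... banned) {
        return Arrays.stream(supported)
                .filter(c -> Arrays.stream(banned).noneMatch(c::contains))
                .toArray(String[]::new);
    }

    public static void main(String[] args) throws Exception {
        SSLParameters params = SSLContext.getDefault().getSupportedSSLParameters();
        // Drop, e.g., ephemeral Diffie-Hellman suites some servers reject.
        String[] filtered = excludeCiphers(params.getCipherSuites(), "_DHE_");
        params.setCipherSuites(filtered);
        System.out.println(filtered.length + " cipher suites retained");
    }
}
```

The filtered array would then be passed to whatever socket factory the fetcher builds (in Apache HttpClient 4.x that is the `supportedCipherSuites` argument of `SSLConnectionSocketFactory`).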
Is this a final decision? The HTTP fetcher is the component that actually made me start using this library. Implementing a good HTTP fetcher is not easy, and certainly SimpleHttpFetcher is useful for some crawlers, even though it's too specific for the major crawlers. I had even started to do some work on updating to the latest HttpClient library, but I could not finish it yet.
I would support a new project with pieces from here (and maybe Bixo), which is specifically around the fetching of content. I just don't have any extra time to spend on it currently. |
I am also +1 on it. I will help as much as I can, I am working on a web crawler, and have […]

On Thu, Sep 17, 2015 at 7:29 PM, Ken Krugler notifications@github.com
Committed deprecation in #97
In researching a bug in Bixo, I realized that the SimpleHttpFetcher needs to be serializable so that we can easily use it with Hadoop jobs. But that's an odd dependency, and in researching how we use the fetching code in Bixo, it felt like SimpleHttpFetcher (which effectively wraps Apache's HttpClient code) is too specific to put into crawler-commons. Plus there's a lot of deferred maintenance work in that code, to keep it in sync with HttpClient (e.g. many methods it uses are now deprecated).
So I feel like we should deprecate it, along with the RobotUtils createFetcher() and getRobotRules() methods.
Let me know if you feel strongly that we should continue supporting the fetcher code, otherwise I'll deprecate it soon, and file an issue to remove it after a few releases.
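On the serializability point above: the usual pattern for Hadoop jobs is to serialize only the fetcher's configuration and mark the underlying client transient, rebuilding it lazily on each worker. A hedged sketch of that pattern, using the JDK's java.net.http.HttpClient as a stand-in for Apache HttpClient (the class and field names are illustrative, not SimpleHttpFetcher's actual design):

```java
import java.io.Serializable;
import java.net.http.HttpClient;

// Sketch of the Serializable pattern Hadoop pushes fetchers toward:
// plain configuration fields are serialized, while the non-serializable
// HTTP client is transient and recreated after deserialization.
public class SerializableFetcher implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String userAgent;       // plain config, serializes fine
    private transient HttpClient client;  // rebuilt lazily on each worker

    public SerializableFetcher(String userAgent) {
        this.userAgent = userAgent;
    }

    /** Recreates the client if this instance was just deserialized. */
    private synchronized HttpClient client() {
        if (client == null) {
            client = HttpClient.newHttpClient();
        }
        return client;
    }

    public String getUserAgent() {
        return userAgent;
    }
}
```

This is also why the dependency is "odd": the pattern forces every wrapped library object to be reconstructible from serializable configuration alone, which couples the fetcher's API to HttpClient's construction details.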