
Deprecate HTTP fetcher support #92

Closed
kkrugler opened this issue Sep 1, 2015 · 10 comments
@kkrugler
Contributor

kkrugler commented Sep 1, 2015

In researching a bug in Bixo, I realized that SimpleHttpFetcher needs to be serializable so that we can easily use it with Hadoop jobs. But that's an odd dependency, and after looking at how we use the fetching code in Bixo, I feel that SimpleHttpFetcher (which effectively wraps Apache's HttpClient code) is too specific to belong in crawler-commons. There's also a lot of deferred maintenance needed to keep that code in sync with HttpClient (e.g. many of the methods it uses are now deprecated).
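A minimal sketch of the pattern this serialization requirement forces (illustrative names only, not the real SimpleHttpFetcher API): the underlying HTTP client object is not serializable, so it has to be marked transient and rebuilt lazily after the fetcher is deserialized on a Hadoop task node.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustrative sketch, not the real SimpleHttpFetcher: plain configuration
// serializes fine, but the HTTP client itself must be transient and
// reconstructed on first use after deserialization.
public class SerializableFetcher implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String userAgent;        // config survives serialization
    private transient Object httpClient;   // stand-in for a real HttpClient instance

    public SerializableFetcher(String userAgent) {
        this.userAgent = userAgent;
    }

    // Deserialization leaves the transient field null, so rebuild lazily.
    public synchronized Object client() {
        if (httpClient == null) {
            httpClient = new Object(); // real code would reconstruct HttpClient here
        }
        return httpClient;
    }

    public String getUserAgent() {
        return userAgent;
    }

    // Serialize and deserialize, as shipping the fetcher to a Hadoop task would.
    public static SerializableFetcher roundTrip(SerializableFetcher f) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(f);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (SerializableFetcher) in.readObject();
        }
    }
}
```

The transient-plus-lazy-rebuild idiom works, but it is exactly the kind of glue code that ties a shared library to one serialization framework and one HTTP client.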

So I feel like we should deprecate this, along with the RobotUtils createFetcher() and getRobotRules() methods.

Let me know if you feel strongly that we should continue supporting the fetcher code, otherwise I'll deprecate it soon, and file an issue to remove it after a few releases.
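The change being proposed would look roughly like this (method name taken from the issue; the signature is a simplified placeholder, not the real crawler-commons API):

```java
// Simplified placeholder, not the real crawler-commons signature: marking the
// fetcher-related RobotUtils entry points @Deprecated gives callers a
// compile-time warning for a few releases before the code is removed.
public class RobotUtils {
    /** @deprecated HTTP fetcher support is slated for removal in a future release. */
    @Deprecated
    public static Object createFetcher(String userAgent) {
        return new Object(); // stands in for returning a SimpleHttpFetcher
    }
}
```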

@kkrugler kkrugler self-assigned this Sep 1, 2015
@jnioche
Contributor

jnioche commented Sep 1, 2015

+1 to your suggestion Ken

@Chaiavi
Member

Chaiavi commented Sep 2, 2015

My intuition is that crawler-commons should continue supporting it.

@ken - why do you suggest deprecating it? Is it because we can expect the user to use Apache's HttpClient?

My intuition is that if we create a common library for crawling, the main component in it should be the actual page fetcher. If we ditch that, then what do we leave in our project? Only the robots handling and the Sitemap parser?

Although I agree that it is a high-maintenance component, as Igor seems to not prioritize backwards compatibility...


@kkrugler
Contributor Author

kkrugler commented Sep 2, 2015

We want to have code that's common across all crawlers. There are several HTTP libraries that crawlers could use, of which HttpClient is just one. So for that reason alone I wasn't confident it made sense to add it in the first place. And then how you configure the fetcher depends a lot on your use case - e.g. do you need to handle cookie caching? And finally there's the issue of keeping it in sync with HttpClient.
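As a small illustration of the configuration point (using the JDK's built-in java.net.http client from Java 11+ rather than Apache HttpClient, purely to keep the sketch self-contained): whether a crawler caches cookies is a per-crawler decision, which is why one shared fetcher configuration is hard to agree on.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.http.HttpClient;

// Two fetch configurations a shared library would have to support: a broad
// crawler that treats every fetch as stateless, and a session-aware crawler
// that caches cookies between requests.
public class FetcherConfigs {
    public static HttpClient statelessClient() {
        return HttpClient.newHttpClient(); // no cookie handler at all
    }

    public static HttpClient cookieCachingClient() {
        CookieManager cookies = new CookieManager();
        cookies.setCookiePolicy(CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        return HttpClient.newBuilder().cookieHandler(cookies).build();
    }
}
```

Multiply this by redirect policies, proxy settings, timeouts, and TLS options, and the configuration surface of a one-size-fits-all fetcher grows quickly.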

As for what should be in crawler-commons, here's the page from the original meeting (Nov 4th, 2009 at ApacheCon) where we first discussed this project: http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

What's not yet in crawler-commons that I think would be useful is:

  • URL normalization
  • URL filtering
  • Page similarity (for detecting real page changes, and/or page deduplication)
  • Testing code (e.g. extract bixo code that synthesizes web pages from a real-world link graph)

I can think of other things like a DB of link shortening domains, similar to what we do for TLD support.
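To make the first wishlist item concrete, here is a minimal sketch of what a URL normalizer might cover (lowercasing scheme and host, dropping default ports and fragments, collapsing dot segments); a real implementation would handle many more cases, such as percent-encoding and query-parameter ordering.

```java
import java.net.URI;
import java.util.Locale;

// Minimal URL normalization sketch: lowercase scheme/host, drop default
// ports, collapse "." path segments, strip fragments. Illustrative only;
// production normalizers cover far more edge cases.
public class UrlNormalizer {
    public static String normalize(String url) throws Exception {
        URI u = new URI(url).normalize(); // collapses "/a/./b"-style paths
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        int port = u.getPort();
        boolean defaultPort = (port == -1)
                || ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);
        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (!defaultPort) sb.append(':').append(port);
        String path = (u.getRawPath() == null || u.getRawPath().isEmpty()) ? "/" : u.getRawPath();
        sb.append(path);
        if (u.getRawQuery() != null) sb.append('?').append(u.getRawQuery());
        return sb.toString(); // fragment intentionally dropped
    }
}
```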

@Chaiavi
Member

Chaiavi commented Sep 3, 2015

Thank you for the link and explanations

Now I am still not for deprecating it, but not against it either; I trust your judgement.


@jnioche
Contributor

jnioche commented Sep 10, 2015

@kkrugler do you want to submit a PR for this?

@kkrugler
Contributor Author

@jnioche yes, I'll do that.

As a side note, I just had to update the Bixo version of SimpleHttpFetcher to exclude some ciphers that were causing SSL socket connection failures when fetching pages from Wikipedia. This is specific to Java 7/8 and how ciphers are ordered on these platforms, but if we were going to add support for conditionally configuring that in crawler-commons, it would add another level of complexity to the API that I really don't want to implement or maintain :)
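The workaround Ken describes can be sketched with the JDK's own TLS APIs; the filtering logic is the point here, not the exact suite names (a real exclusion list would name the specific suites that fail against Wikipedia, which this sketch does not know).

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;
import java.util.Arrays;

// Sketch of excluding problem cipher suites before opening SSL sockets,
// the kind of Java 7/8 workaround described above. The exclusion list is
// illustrative, not the exact set Bixo bans.
public class CipherExclusion {
    // Returns the enabled suites with the banned ones filtered out.
    public static String[] withoutSuites(String[] enabled, String... banned) {
        return Arrays.stream(enabled)
                .filter(suite -> Arrays.stream(banned).noneMatch(suite::equals))
                .toArray(String[]::new);
    }

    // SSLParameters with the banned suites removed from the platform defaults.
    public static SSLParameters restricted(String... banned) throws Exception {
        SSLParameters params = SSLContext.getDefault().getDefaultSSLParameters();
        params.setCipherSuites(withoutSuites(params.getCipherSuites(), banned));
        return params;
    }
}
```

Exposing this conditionally (only on affected Java versions, only for affected suites) is the extra API complexity the comment above is wary of.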

@kkrugler kkrugler changed the title Remove HTTP fetcher support Deprecate HTTP fetcher support Sep 14, 2015
@aecio
Contributor

aecio commented Sep 15, 2015

Is this a final decision? The HTTP fetcher is the component that actually made me start using this library. Implementing a good HTTP fetcher is not easy, and SimpleHttpFetcher is certainly useful for some crawlers, even though it's too specific for the major crawlers. I had even started some work on updating it to the latest HttpClient library, but I haven't been able to finish it yet.

@kkrugler
Contributor Author

I would support a new project with pieces from here (and maybe Bixo), which is specifically around the fetching of content. I just don't have any extra time to spend on it currently.

@Chaiavi
Member

Chaiavi commented Sep 18, 2015

I am also +1 on it.

I will help as much as I can. I am working on a web crawler and have several fetchers implemented there, so I can use your fetcher and check it against other fetchers.


@jnioche
Contributor

jnioche commented Dec 2, 2015

Committed deprecation in #97
