
Deprecate HTTP fetcher support #92

Closed
kkrugler opened this issue Sep 1, 2015 · 10 comments
@kkrugler
Contributor

kkrugler commented Sep 1, 2015

In researching a bug in Bixo, I realized that SimpleHttpFetcher needs to be serializable so that we can easily use it with Hadoop jobs. But that's an odd dependency, and after looking at how we use the fetching code in Bixo, I feel that SimpleHttpFetcher (which effectively wraps Apache's HttpClient code) is too specific to belong in crawler-commons. There's also a lot of deferred maintenance needed to keep that code in sync with HttpClient (e.g. many of the methods it uses are now deprecated).
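A minimal sketch of the pattern this serialization requirement forces (illustrative names only, not the real SimpleHttpFetcher API): the underlying HTTP client object is not serializable, so it has to be marked transient and rebuilt lazily after the fetcher is deserialized on a Hadoop task node.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustrative sketch, not the real SimpleHttpFetcher: plain configuration
// serializes fine, but the HTTP client itself must be transient and
// reconstructed on first use after deserialization.
public class SerializableFetcher implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String userAgent;        // config survives serialization
    private transient Object httpClient;   // stand-in for a real HttpClient instance

    public SerializableFetcher(String userAgent) {
        this.userAgent = userAgent;
    }

    // Deserialization leaves the transient field null, so rebuild lazily.
    public synchronized Object client() {
        if (httpClient == null) {
            httpClient = new Object(); // real code would reconstruct HttpClient here
        }
        return httpClient;
    }

    public String getUserAgent() {
        return userAgent;
    }

    // Serialize and deserialize, as shipping the fetcher to a Hadoop task would.
    public static SerializableFetcher roundTrip(SerializableFetcher f) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(f);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
            return (SerializableFetcher) in.readObject();
        }
    }
}
```

The transient-plus-lazy-rebuild idiom works, but it is exactly the kind of glue code that ties a shared library to one serialization framework and one HTTP client.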

So I feel like we should deprecate this, along with the RobotUtils createFetcher() and getRobotRules() methods.

Let me know if you feel strongly that we should continue supporting the fetcher code, otherwise I'll deprecate it soon, and file an issue to remove it after a few releases.
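The change being proposed would look roughly like this (method name taken from the issue; the signature is a simplified placeholder, not the real crawler-commons API):

```java
// Simplified placeholder, not the real crawler-commons signature: marking the
// fetcher-related RobotUtils entry points @Deprecated gives callers a
// compile-time warning for a few releases before the code is removed.
public class RobotUtils {
    /** @deprecated HTTP fetcher support is slated for removal in a future release. */
    @Deprecated
    public static Object createFetcher(String userAgent) {
        return new Object(); // stands in for returning a SimpleHttpFetcher
    }
}
```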

@kkrugler kkrugler self-assigned this Sep 1, 2015
@jnioche
Contributor

jnioche commented Sep 1, 2015

+1 to your suggestion Ken

@Chaiavi
Member

Chaiavi commented Sep 2, 2015

My intuition is that crawler-commons should continue supporting it.

@ken - why do you suggest deprecating it? Is it because we can expect the user to use Apache's HttpClient?

My intuition is that if we create a common library for crawling, the main component in it should be the actual page fetcher. If we ditch that, then what do we leave in our project? Only the robots handling and the Sitemap parser?

Although I agree that it is a high-maintenance component, as Igor seems to not prioritize backwards compatibility...


@kkrugler
Contributor Author

kkrugler commented Sep 2, 2015

We want to have code that's common across all crawlers. There are several HTTP libraries that crawlers could use, of which HttpClient is just one. So for that reason alone I wasn't confident it made sense to add it in the first place. And then how you configure the fetcher depends a lot on your use case - e.g. do you need to handle cookie caching? And finally there's the issue of keeping it in sync with HttpClient.
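As a small illustration of the configuration point (using the JDK's built-in java.net.http client from Java 11+ rather than Apache HttpClient, purely to keep the sketch self-contained): whether a crawler caches cookies is a per-crawler decision, which is why one shared fetcher configuration is hard to agree on.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.http.HttpClient;

// Two fetch configurations a shared library would have to support: a broad
// crawler that treats every fetch as stateless, and a session-aware crawler
// that caches cookies between requests.
public class FetcherConfigs {
    public static HttpClient statelessClient() {
        return HttpClient.newHttpClient(); // no cookie handler at all
    }

    public static HttpClient cookieCachingClient() {
        CookieManager cookies = new CookieManager();
        cookies.setCookiePolicy(CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        return HttpClient.newBuilder().cookieHandler(cookies).build();
    }
}
```

Multiply this by redirect policies, proxy settings, timeouts, and TLS options, and the configuration surface of a one-size-fits-all fetcher grows quickly.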

As for what should be in crawler-commons, here's the page from the original meeting (Nov 4th, 2009 at ApacheCon) where we first discussed this project: http://wiki.apache.org/nutch/ApacheConUs2009MeetUp

What's not yet in crawler-commons that I think would be useful is:

  • URL normalization
  • URL filtering
  • Page similarity (for detecting real page changes, and/or page deduplication)
  • Testing code (e.g. extract bixo code that synthesizes web pages from a real-world link graph)

I can think of other things like a DB of link shortening domains, similar to what we do for TLD support.
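To make the first wishlist item concrete, here is a minimal sketch of what a URL normalizer might cover (lowercasing scheme and host, dropping default ports and fragments, collapsing dot segments); a real implementation would handle many more cases, such as percent-encoding and query-parameter ordering.

```java
import java.net.URI;
import java.util.Locale;

// Minimal URL normalization sketch: lowercase scheme/host, drop default
// ports, collapse "." path segments, strip fragments. Illustrative only;
// production normalizers cover far more edge cases.
public class UrlNormalizer {
    public static String normalize(String url) throws Exception {
        URI u = new URI(url).normalize(); // collapses "/a/./b"-style paths
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        int port = u.getPort();
        boolean defaultPort = (port == -1)
                || ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);
        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (!defaultPort) sb.append(':').append(port);
        String path = (u.getRawPath() == null || u.getRawPath().isEmpty()) ? "/" : u.getRawPath();
        sb.append(path);
        if (u.getRawQuery() != null) sb.append('?').append(u.getRawQuery());
        return sb.toString(); // fragment intentionally dropped
    }
}
```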

@Chaiavi
Member

Chaiavi commented Sep 3, 2015

Thank you for the link and explanations

Now I am still not for deprecating it, but not against it either; I trust your judgement.


@jnioche
Contributor

jnioche commented Sep 10, 2015

@kkrugler do you want to submit a PR for this?

@kkrugler
Contributor Author

@jnioche yes, I'll do that.

As a side note, I just had to update the Bixo version of SimpleHttpFetcher to exclude some ciphers that were causing SSL socket connection failures when fetching pages from Wikipedia. This is specific to Java 7/8 and how ciphers are ordered on these platforms, but if we were going to add support for conditionally configuring that in crawler-commons, it would add another level of complexity to the API that I really don't want to implement or maintain :)
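The workaround Ken describes can be sketched with the JDK's own TLS APIs; the filtering logic is the point here, not the exact suite names (a real exclusion list would name the specific suites that fail against Wikipedia, which this sketch does not know).

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;
import java.util.Arrays;

// Sketch of excluding problem cipher suites before opening SSL sockets,
// the kind of Java 7/8 workaround described above. The exclusion list is
// illustrative, not the exact set Bixo bans.
public class CipherExclusion {
    // Returns the enabled suites with the banned ones filtered out.
    public static String[] withoutSuites(String[] enabled, String... banned) {
        return Arrays.stream(enabled)
                .filter(suite -> Arrays.stream(banned).noneMatch(suite::equals))
                .toArray(String[]::new);
    }

    // SSLParameters with the banned suites removed from the platform defaults.
    public static SSLParameters restricted(String... banned) throws Exception {
        SSLParameters params = SSLContext.getDefault().getDefaultSSLParameters();
        params.setCipherSuites(withoutSuites(params.getCipherSuites(), banned));
        return params;
    }
}
```

Exposing this conditionally (only on affected Java versions, only for affected suites) is the extra API complexity the comment above is wary of.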

@kkrugler kkrugler changed the title Remove HTTP fetcher support Deprecate HTTP fetcher support Sep 14, 2015
@aecio
Contributor

aecio commented Sep 15, 2015

Is this a final decision? The HTTP fetcher is the component that actually made me start using this library. Implementing a good HTTP fetcher is not easy, and SimpleHttpFetcher is certainly useful for some crawlers, even though it's too specific for the major crawlers. I had even started some work on updating it to the latest HttpClient library, but I haven't been able to finish it yet.

@kkrugler
Contributor Author

I would support a new project with pieces from here (and maybe Bixo), which is specifically around the fetching of content. I just don't have any extra time to spend on it currently.

@Chaiavi
Member

Chaiavi commented Sep 18, 2015

I am also +1 on it.

I will help as much as I can. I am working on a web crawler and have several fetchers implemented there, so I can use your fetcher and check it against other fetchers.


@jnioche
Contributor

jnioche commented Dec 2, 2015

Committed deprecation in #97
