Replies: 5 comments 6 replies
-
The original implementation of crawler4j is, sadly, unmaintained (for a while now). I forked it last year to ensure that some of our academic projects can benefit from an updated version of it (and get rid of some legacy things). I just started playing around with URLFrontier in a separate PoC branch, which looks quite interesting to me (and reduces a lot of the bloated logic required by the other sleepycat / hsqldb frontier implementations there).
-
Hi @jnioche, excellent topic thread.
I have been thinking about porting Nutch to run on Tez, but it would be a significant undertaking and the last thing I want to do is split a community. Maybe some consumers don't like Tez... I don't know. Regarding the use of url-frontier: at this stage I think it is inevitable that an evolution of the Nutch codebase will come. The question is what the architecture will be. The Nutch tooling is rich, but configuration is a PITA for newcomers, metrics do not exist outside of the Hadoop ecosystem (which I am working on evolving), and there are known limitations when the crawldb gets huge. The question we need to ask is how the Nutch project (users & devs) would benefit from embracing a url-frontier architecture and acquisition model. Right now, I wouldn't even know how to go about adapting Nutch to run on url-frontier. Is there any API documentation? I can begin researching this. Thanks @jnioche for tagging me. I have been watching this project with interest and I think you are doing great things...
-
I've been holding off on any more work on FlinkCrawler until Stateful Functions seemed solid enough to use, as that's a much better option (versus pure streaming Flink) for a continuous crawler, assuming there's a scalable crawl DB. So maybe later this year???
-
FWIW I put together a rough POC that runs a Scrapy crawl integrated with url-frontier, here: https://github.com/anjackson/scrapy-url-frontier#readme - it's not production ready but might help get things going.
-
I got a message from the author of Yacy informing me that a project he is working on will be using URLFrontier.
-
This is a living document, feel free to add to it. The idea is not necessarily to convince the authors of these projects to ditch the storage layers they painfully worked on and adopt URLFrontier (although this would be an exciting prospect), but to explore how these projects could interoperate with it.
Contributing adaptors for URLFrontier to these projects could, for instance, be a good project for students.
Collaborating with projects implemented in a language other than Java would be a good illustration of the benefits of using gRPC.
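To make the gRPC point concrete: because the frontier is exposed as a gRPC service, any language with a protobuf toolchain can generate a client for it. The fragment below is a deliberately simplified, illustrative sketch of what such a service definition looks like — it is not the actual urlfrontier.proto (message and field names here are made up); see the url-frontier repository for the real API.

```proto
// Illustrative sketch only -- NOT the real urlfrontier.proto.
syntax = "proto3";

service Frontier {
  // A crawler asks the frontier for the next batch of URLs to fetch.
  rpc GetURLs (GetParams) returns (stream URLInfo) {}
  // A crawler streams discovered URLs back into the frontier.
  rpc PutURLs (stream URLInfo) returns (stream Ack) {}
}

message GetParams {
  uint32 max_urls_per_queue = 1;
  uint32 max_queues = 2;
}

message URLInfo {
  string url = 1;
  string key = 2; // queue key, typically the hostname
}

message Ack {
  string url = 1;
}
```

From a definition like this, `protoc` generates idiomatic client code for Python, Go, Rust, etc., which is exactly why non-Java crawlers can talk to a Java frontier without any bespoke glue.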
Java
StormCrawler
I think this project is all-in on URLFrontier :)
crawler4j
Not maintained; see the fork by @rzo1 mentioned below
DigitalPebble blog
Heritrix
see comment below
FlinkCrawler
ask @kkrugler ;-)
BUbiNG
authors not interested in getting involved + project not actively maintained
Apache Nutch
not necessarily an obvious match, Nutch is 100% batch and very tied to data structures on HDFS
@sebastian-nagel @lewismc do you agree?
webmagic
Opened an issue on code4craft/webmagic#1098
Rust
spider-rs
need to check how the URLs are persisted
Go
gocolly
sent the authors a message + opened an issue gocolly/colly#680
crawlab
Generic framework for distributed web crawling; opened an issue crawlab-team/crawlab#1062
Python
Scrapy
https://docs.zyte.com/scrapy-cloud/frontier.html looks related
see comment below
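For any of the projects above, the adaptor work boils down to two operations: pushing discovered URLs into the frontier and pulling the next batch to fetch, with per-host queues so politeness can be enforced. Here is a purely illustrative, in-memory stand-in for that contract in plain Python — all names are made up for this sketch; a real adaptor would issue the equivalent gRPC calls against a URLFrontier service instead.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse

class InMemoryFrontier:
    """Toy stand-in for a frontier service: per-host FIFO queues with
    round-robin delivery. Hypothetical API, for illustration only."""

    def __init__(self):
        self.queues = OrderedDict()  # host -> deque of URLs to fetch
        self.seen = set()            # naive dedup of already-known URLs

    def put_urls(self, urls):
        """Report discovered URLs; duplicates are silently dropped."""
        for url in urls:
            if url in self.seen:
                continue
            self.seen.add(url)
            host = urlparse(url).netloc
            self.queues.setdefault(host, deque()).append(url)

    def get_urls(self, max_queues=10, max_per_queue=1):
        """Return up to max_per_queue URLs from each of up to max_queues
        queues, rotating queues so no single host dominates a batch."""
        batch = []
        for host in list(self.queues)[:max_queues]:
            q = self.queues[host]
            for _ in range(min(max_per_queue, len(q))):
                batch.append(q.popleft())
            if not q:
                del self.queues[host]
            else:
                self.queues.move_to_end(host)  # send host to the back
        return batch
```

Usage, showing the per-host fairness:

```python
f = InMemoryFrontier()
f.put_urls(["https://a.example/1", "https://a.example/2", "https://b.example/1"])
f.get_urls()  # -> ['https://a.example/1', 'https://b.example/1']
```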