Replies: 5 comments 6 replies
-
The original implementation of crawler4j is, sadly, unmaintained (for a while now). I forked it last year to ensure that some of our academic projects can benefit from an updated version of it (and get rid of some legacy things). I just started playing around with URLFrontier in a separate PoC branch, which looks quite interesting to me (and reduces a lot of the bloated logic required by the other sleepycat / hsqldb frontier implementations there).
-
Hi @jnioche, excellent topic thread.
I have been thinking about porting Nutch to run on Tez, but it would be a significant undertaking and the last thing I want to do is split a community. Maybe some consumers don't like Tez... I don't know. Regarding the use of url-frontier: at this stage I think it is inevitable that an evolution of the Nutch codebase will come. The question is what the architecture will be. The Nutch tooling is rich, but configuration is a PITA for newcomers, metrics do not exist outside of the Hadoop ecosystem (which I am working on evolving), and there are known limitations when the crawldb gets huge. The question we need to ask is how the Nutch project (users & devs) would benefit from embracing a url-frontier architecture and acquisition model. Right now, I wouldn't even know how to go about adapting Nutch to run on url-frontier. Is there any API documentation? I can begin researching this. Thanks @jnioche for tagging me. I have been watching this project with interest and I think you are doing great things...
-
I've been holding off on any more work on FlinkCrawler until Stateful Functions seemed solid enough to use, as that's a much better option (versus pure streaming Flink) for a continuous crawler, assuming there's a scalable crawl DB. So maybe later this year???
-
FWIW I put together a rough POC that runs a Scrapy crawl integrated with url-frontier, here: https://github.com/anjackson/scrapy-url-frontier#readme - it's not production ready but might help get things going.
-
I got a message from the author of Yacy informing me that a project he is working on will be using URLFrontier.
-
This is a living document, feel free to add to it. The idea is not necessarily to convince the authors of these projects to ditch the storage layers they painfully worked on and adopt URLFrontier (although this would be an exciting prospect), but to explore how these projects could interoperate with it.
Contributing adaptors for URLFrontier to these projects could, for instance, be a good project for students.
Collaborating with projects implemented in a language other than Java would be a good illustration of the benefits of using gRPC.
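To make the gRPC point concrete: because the frontier is exposed as a gRPC service, any language with a protobuf toolchain can generate a client for it. The fragment below is a deliberately simplified, illustrative sketch of what such a service definition looks like — it is not the actual urlfrontier.proto (message and field names here are made up); see the url-frontier repository for the real API.

```proto
// Illustrative sketch only -- NOT the real urlfrontier.proto.
syntax = "proto3";

service Frontier {
  // A crawler asks the frontier for the next batch of URLs to fetch.
  rpc GetURLs (GetParams) returns (stream URLInfo) {}
  // A crawler streams discovered URLs back into the frontier.
  rpc PutURLs (stream URLInfo) returns (stream Ack) {}
}

message GetParams {
  uint32 max_urls_per_queue = 1;
  uint32 max_queues = 2;
}

message URLInfo {
  string url = 1;
  string key = 2; // queue key, typically the hostname
}

message Ack {
  string url = 1;
}
```

From a definition like this, `protoc` generates idiomatic client code for Python, Go, Rust, etc., which is exactly why non-Java crawlers can talk to a Java frontier without any bespoke glue.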
Java
StormCrawler
I think this project is all-in on URLFrontier :)
crawler4j
Not maintained; see the fork by @rzo1 mentioned below
DigitalPebble blog
Heritrix
see comment below
FlinkCrawler
ask @kkrugler ;-)
BUbiNG
authors not interested in getting involved + project not actively maintained
Apache Nutch
not necessarily an obvious match, Nutch is 100% batch and very tied to data structures on HDFS
@sebastian-nagel @lewismc do you agree?
webmagic
Opened an issue on code4craft/webmagic#1098
Rust
spider-rs
need to check how the URLs are persisted
Go
gocolly
sent the authors a message + opened an issue gocolly/colly#680
crawlab
Generic framework for distributed web crawling; opened an issue crawlab-team/crawlab#1062
Python
Scrapy
https://docs.zyte.com/scrapy-cloud/frontier.html looks related
see comment below
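For any of the projects above, the adaptor work boils down to two operations: pushing discovered URLs into the frontier and pulling the next batch to fetch, with per-host queues so politeness can be enforced. Here is a purely illustrative, in-memory stand-in for that contract in plain Python — all names are made up for this sketch; a real adaptor would issue the equivalent gRPC calls against a URLFrontier service instead.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse

class InMemoryFrontier:
    """Toy stand-in for a frontier service: per-host FIFO queues with
    round-robin delivery. Hypothetical API, for illustration only."""

    def __init__(self):
        self.queues = OrderedDict()  # host -> deque of URLs to fetch
        self.seen = set()            # naive dedup of already-known URLs

    def put_urls(self, urls):
        """Report discovered URLs; duplicates are silently dropped."""
        for url in urls:
            if url in self.seen:
                continue
            self.seen.add(url)
            host = urlparse(url).netloc
            self.queues.setdefault(host, deque()).append(url)

    def get_urls(self, max_queues=10, max_per_queue=1):
        """Return up to max_per_queue URLs from each of up to max_queues
        queues, rotating queues so no single host dominates a batch."""
        batch = []
        for host in list(self.queues)[:max_queues]:
            q = self.queues[host]
            for _ in range(min(max_per_queue, len(q))):
                batch.append(q.popleft())
            if not q:
                del self.queues[host]
            else:
                self.queues.move_to_end(host)  # send host to the back
        return batch
```

Usage, showing the per-host fairness:

```python
f = InMemoryFrontier()
f.put_urls(["https://a.example/1", "https://a.example/2", "https://b.example/1"])
f.get_urls()  # -> ['https://a.example/1', 'https://b.example/1']
```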