
Roadmap URLFrontier 2


This work is funded by the NGI0 Discovery fund, see description on the NLNet website.

The objective is to turn what is currently a working piece of software into an enterprise-grade solution. The improvements will mainly concern the service implementation, along the following dimensions:

  • Monitoring and Reporting (M&R): improve the usability of the system by adding configurable logging and metrics reporting
  • Discovery and Clustering (D&C): further improve the performance of the service for very large volumes of data by adding efficient parallelisation across multiple nodes
  • Robustness and Resilience (R&R): improve the service's robustness with more graceful failure modes and more efficient restarts

Monitoring and reporting

When using the URLFrontier service in production, it is critical to be able to monitor its performance and understand any underlying issues. In its initial version, the service implementation provides only a minimal level of logging, which is not configurable. For this project, we will aim to:

● Make logging levels configurable and possibly expose this via the URLFrontier API

● Investigate ways of integrating with Loki

● Expose internal metrics to facilitate monitoring in production. This could be done either by adding a new method to the API that returns metrics in a bespoke format, or by adding a custom listener that exposes the metrics in the format used by Prometheus (see the sketch below this list)

● Design and implement a Grafana dashboard to display the metrics exposed

Both the logging and the metrics will integrate with Grafana and Loki, but we will aim to make the technical solution as generic as possible by keeping other mainstream solutions in mind.
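
To make the Prometheus option more concrete, here is a minimal sketch of an endpoint serving the Prometheus text exposition format. The port, the counter names and the way the counters would be updated are all hypothetical, not part of the current API:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

public class MetricsEndpoint {
    // Hypothetical counters the frontier could maintain internally.
    static final AtomicLong urlsPut = new AtomicLong();
    static final AtomicLong urlsServed = new AtomicLong();

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(9100), 0);
        server.createContext("/metrics", exchange -> {
            // Prometheus text format: one "name value" line per metric.
            String body = "# TYPE urlfrontier_urls_put_total counter\n"
                    + "urlfrontier_urls_put_total " + urlsPut.get() + "\n"
                    + "# TYPE urlfrontier_urls_served_total counter\n"
                    + "urlfrontier_urls_served_total " + urlsServed.get() + "\n";
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "text/plain; version=0.0.4");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(bytes);
            }
        });
        server.start();
    }
}
```

A Prometheus server could then scrape http://host:9100/metrics on a schedule, with Grafana reading its dashboards from Prometheus.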

Discovery and clustering

Each instance of a URLFrontier service currently works in isolation. Thanks to its efficient design and implementation, a single instance of URLFrontier can handle large volumes of data even with relatively little RAM and average storage hardware, but beyond a certain volume the performance drops and running more than one instance becomes necessary. We will investigate ways of adding a layer of syndication between Frontier instances so that a node can discover other nodes, route content based on a key, report the nodes present in the cluster, or aggregate stats over the entire cluster. To achieve this, we will review the existing literature and state of the art, and look at the solutions chosen by established projects such as OpenSearch, Apache Solr or Redis, before implementing one in URLFrontier. The API will be modified to give users finer control over the operations, e.g. getting statistics for the single Frontier instance a client is connected to or for an entire cluster of instances.

The changes introduced by this milestone will allow users to set up a cluster of URLFrontier instances and distribute the storage and computation over them. The functionality will be identical to working with a single instance but with better performance and scalability.
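
One common building block for routing content based on a key is a consistent-hash ring, variants of which are used by several of the projects mentioned above. The sketch below is purely illustrative: the node identifiers and the choice of routing key (e.g. the hostname of a URL) are assumptions, not a committed design.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRing {
    // Maps positions on the ring to node identifiers.
    private final SortedMap<Long, String> ring = new TreeMap<>();

    // Each node gets several virtual positions so keys spread evenly.
    public void addNode(String node, int virtualNodes) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    // A key is owned by the first node at or after its hash on the ring.
    public String nodeFor(String key) {
        long h = hash(key);
        SortedMap<Long, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xff);
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("frontier-1:7071", 100);
        ring.addNode("frontier-2:7071", 100);
        // Routing by hostname keeps all URLs of a site on the same node.
        System.out.println(ring.nodeFor("example.com"));
    }
}
```

A nice property of this scheme is that adding or removing a node only remaps a small fraction of the keys, which matters when nodes join or leave a running cluster.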

Robustness and resilience

As part of this task, we will improve the crash recovery mechanisms in the service: if an instance dies abruptly, does the data get corrupted, or can the service simply be restarted? In its current version, the Frontier implementation does not survive a kill signal.

We also want to reduce the time it takes to restart a node. We currently re-scan the whole collection at startup in a single thread, which takes a long time. Could we store the list of keys in RocksDB and use it when reloading, so that the scan can be done in parallel? Ideally, a Frontier should reload its data in as little time as possible (see the sketch below).

Finally, in order to make the Frontier services usable with no downtime when one or more nodes fail, we could have them provide content replication, so that if a node fails its content is served by a different node of the cluster. This relies on discovery and clustering (D&C, see above) and is a challenging task. In this project we will investigate whether this is doable and implement it only if feasible within the time left in the project.
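
To make the parallel-reload idea more concrete, here is a rough sketch assuming the queue keys are kept under a dedicated "queues/" prefix in RocksDB. The prefix, the database path and the rebuild step are placeholders, not the current implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksIterator;

public class ParallelReload {
    public static void main(String[] args) throws Exception {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/frontier")) {

            // 1. Read the compact index of queue keys: a small prefix scan
            //    instead of a full scan over every URL entry.
            String prefix = "queues/";
            List<String> queues = new ArrayList<>();
            try (RocksIterator it = db.newIterator()) {
                for (it.seek(prefix.getBytes(StandardCharsets.UTF_8));
                     it.isValid();
                     it.next()) {
                    String key = new String(it.key(), StandardCharsets.UTF_8);
                    if (!key.startsWith(prefix)) break;
                    queues.add(key.substring(prefix.length()));
                }
            }

            // 2. Rebuild the in-memory state of each queue in parallel
            //    rather than in a single startup thread.
            queues.parallelStream().forEach(queue -> {
                // read this queue's URLs and restore counters, scheduling info, etc.
            });
        }
    }
}
```

The key point is that step 1 touches only a small index, so the expensive per-queue reloading in step 2 can be spread over all available cores.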

Multitenancy

This is an optional task which could be done if NGI Zero has unused budget to spend. It would add support for multi-tenancy in URLFrontier by introducing a crawlID concept, so that logical crawls are handled separately, e.g. a generic crawl vs. more specific ones.
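
As a rough illustration of what this could look like internally, the sketch below namespaces queue keys by crawlID so that logical crawls stored in the same instance never collide. The key layout is entirely hypothetical:

```java
public class CrawlKey {
    // Hypothetical composite key: "<crawlID>_<queueKey>".
    static String compose(String crawlID, String queueKey) {
        return crawlID + "_" + queueKey;
    }

    public static void main(String[] args) {
        // Two logical crawls keep separate queues for the same host.
        System.out.println(compose("news-crawl", "example.com")); // news-crawl_example.com
        System.out.println(compose("default", "example.com"));    // default_example.com
    }
}
```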