ORCA is a Crawler Analysis benchmark for Data Web Crawlers which runs on the HOBBIT platform. Currently, the following types of data nodes are available:
- RDF data served in various formats over HTTP (dump and dereferencing variants)
- RDFa data embedded in HTML, based on the RDFa Test Suite
- SPARQL endpoints, based on Virtuoso
- CKAN instances

This project is licensed under the GNU Affero General Public License v3.0. For the full license text, see LICENSE.

- Permanent URL: http://w3id.org/dice-research/orca/code
- GitHub: https://github.com/dice-group/orca

Parameter | Description | Ontology resources |
---|---|---|
Number of nodes | The number of nodes in the synthetic graph. | orca:numberOfNodes |
Average node degree | The average degree of the nodes in the generated graph. | orca:averageNodeGraphDegree |
RDF dataset size | Average number of triples of the generated RDF graphs. | orca:averageTriplesPerNode |
Average resource degree | The average degree of the resources in the RDF graphs. | orca:averageRdfGraphDegree |
Node type amounts | For each node type, the user can define the proportion of nodes that should have this type. | orca:httpDumpNodeWeight orca:dereferencingHttpNodeWeight orca:sparqlNodeWeight orca:ckanNodeWeight |
Dump file serialisations | For each available dump file serialisation, a boolean flag can be set. | orca:useNtDumps orca:useN3Dumps orca:useRdfXmlDumps orca:useTurtleDumps |
Dump file compression ratio | Proportion of dump files that are compressed. | orca:httpDumpNodeCompressedRatio |
Average ratio of disallowed resources | Proportion of resources that are generated within a node and marked as disallowed for crawling. | orca:averageDisallowedRatio |
Average crawl delay | The crawl delay of the node's robots.txt file. | orca:averageCrawlDelay |
Seed | A seed value for initialising random number generators is used to ensure the repeatability of experiments. | orca:seed |
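
To illustrate how the node type weights and the seed could interact, here is a minimal Python sketch. It is not ORCA's implementation: the function `assign_node_types`, the weight keys, and the choice of a weighted random draw are assumptions made for this example. ORCA may instead derive exact node counts from the weights.

```python
import random


def assign_node_types(number_of_nodes, weights, seed):
    """Draw one node type per node, proportionally to the given weights.

    Hypothetical helper: the weighted random draw is an assumption,
    not ORCA's documented behaviour.
    """
    # A dedicated, seeded generator makes the draw repeatable.
    rng = random.Random(seed)
    types = list(weights)
    return rng.choices(types, weights=[weights[t] for t in types], k=number_of_nodes)


# Illustrative weights, loosely named after the orca:*NodeWeight parameters.
weights = {
    "httpDump": 1.0,
    "dereferencingHttp": 1.0,
    "sparql": 0.5,
    "ckan": 0.5,
}

first = assign_node_types(100, weights, seed=42)
second = assign_node_types(100, weights, seed=42)
print(first == second)  # the same seed gives the same assignment
```

Setting a weight to 0 in this sketch removes the node type from the generated graph, while the remaining weights only matter relative to each other.
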

KPI | Description | Ontology resources |
---|---|---|
Recall | Number of true positives divided by the number of checked triples. | orca:microRecall orca:macroRecall |
Runtime | The time it takes from starting the crawling process to termination. | orca:runtime |
Requested disallowed resources | The number of forbidden resources crawled by the crawler, divided by the number of all resources forbidden by the robots.txt file. | orca:ratioOfRequestedDisallowedResources |
Crawl delay fulfilment | The average measured delay between the requests received by a single node divided by the delay defined in the robots.txt file. If the measure is below 1.0 the crawler does not strictly follow the delay instruction. | orca:minAverageCrawlDelayFulfillment orca:maxAverageCrawlDelayFulfillment orca:macroAverageCrawlDelayFulfillment |
Consumed hardware resources | The CPU, disk, and memory (RAM) consumption of the benchmarked crawler. | orca:totalCpuUsage orca:averageDiskUsage orca:averageMemoryUsage |
Triples over time | The number of triples in the sink over time. | orca:tripleCountOverTime |
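
The ratio-based KPIs above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the benchmark's evaluation code: all function names and input shapes are hypothetical, and the micro/macro split follows the usual convention (micro pools the counts of all nodes, macro averages the per-node values), which the table does not spell out.

```python
from statistics import mean


def micro_recall(per_node):
    """per_node: list of (true_positives, checked_triples), one tuple per node."""
    total_tp = sum(tp for tp, _ in per_node)
    total_checked = sum(checked for _, checked in per_node)
    return total_tp / total_checked


def macro_recall(per_node):
    """Average of the per-node recall values (assumed macro definition)."""
    return mean(tp / checked for tp, checked in per_node)


def requested_disallowed_ratio(requested, disallowed):
    """Share of the disallowed resources that the crawler requested anyway."""
    return len(requested & disallowed) / len(disallowed)


def crawl_delay_fulfilment(request_times, crawl_delay):
    """Average gap between consecutive requests to one node, divided by
    the delay from robots.txt. Values below 1.0 mean the crawler was too fast."""
    gaps = [later - earlier for earlier, later in zip(request_times, request_times[1:])]
    return mean(gaps) / crawl_delay


# Made-up example values for two nodes.
per_node = [(90, 100), (10, 50)]
micro = micro_recall(per_node)   # 100 / 150
macro = macro_recall(per_node)   # (0.9 + 0.2) / 2

ratio = requested_disallowed_ratio({"/a", "/b", "/c"}, {"/b", "/d"})  # 1 of 2 requested

# Requests at seconds 0, 4, 10 and 12 against a crawl delay of 5 seconds.
fulfilment = crawl_delay_fulfilment([0.0, 4.0, 10.0, 12.0], crawl_delay=5.0)
```

In this example the average gap is 4 seconds against a required 5 seconds, so the fulfilment value falls below 1.0 and the crawler would count as not strictly following the delay. The min, max, and macro-average variants listed in the table would then aggregate such per-node values across all nodes.
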

This project is maintained by the Data Science Group at Paderborn University in its role as a member of special group 7 of task force 6 of the BDVA.

ORCA has been accepted by the IEEE International Conference on Semantic Computing (ICSC). The paper should be cited as follows:
@InProceedings{roeder2021orca,
author = {Michael Röder and Geraldo de Souza Jr. and Denis Kuchelev and Abdelmoneim Amer Desouki and Axel-Cyrille Ngonga Ngomo},
booktitle = {Proceedings of the 15th IEEE International Conference on Semantic Computing (ICSC)},
title = {ORCA – a Benchmark for Data Web Crawlers},
year = {2021},
pages = {62-69},
publisher = {IEEE Computer Society},
keywords = {dice raki daikiri opal limbo sys:relevantFor:limbo sys:relevantFor:opal group_aksw roeder ngonga kuchelev gsjunior},
url = {https://papers.dice-research.org/2021/ICSC2021_ORCA/ORCA_public.pdf},
}