Merge pull request #2 from brechtvdv/eswc
Rework for eswc
brechtvdv committed Mar 20, 2019
2 parents fa37318 + 429a80e commit b032b0c
Showing 7 changed files with 38 additions and 17 deletions.
16 changes: 12 additions & 4 deletions content/conclusion.md
@@ -1,9 +1,17 @@
## Conclusion
{:#conclusion}

Data owners can publish their Linked Open Data very cost-efficiently on their website with JSON-LD snippets. After an initial cost of adding this feature to their website, they can have an always up-to-date dataset with negligible maintenance costs. The cultural heritage website hetarchief.be showcases an officially maintained paged collection of Linked Data Fragments about newspapers. By extending Comunica, in-depth data analysis and federated querying over this dataset become possible. To improve querying speed, Linked Data services ([SPARQL endpoint](http://semanticweb.org/wiki/SPARQL_endpoint.html), [HDT](cite:cites Fernndez2013BinaryRR) file, TPF interface...) with a higher maintenance cost can be created on top of JSON-LD snippets. Such interfaces would suffer from scalability problems: Optical Character Recognition (OCR) texts have poor compression rates and thus require gigabytes of disk space. With our solution, these OCR texts are published in a separate document, keeping the maintenance cost low while automated harvesting remains possible. By using our demonstrator, non-technical users are able to extract a data dump from an enriched website.
Data owners can publish their Linked Open Data very cost-efficiently on their website with JSON-LD snippets. After an initial cost of adding this feature to their website, they can have an always up-to-date dataset with negligible maintenance costs. However, machine clients that query and harvest websites can introduce unforeseen spikes of activity. Data owners will need to extend their monitoring capabilities beyond human interaction (e.g. Google Analytics) and apply an HTTP caching strategy for stale resources.
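
For example, a response header along the following lines (the values are illustrative, not a recommendation) lets intermediate caches absorb harvesting spikes by serving cached newspaper pages while revalidating them in the background:

```
Cache-Control: public, max-age=86400, stale-while-revalidate=3600
```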

To gain traction with an international audience, e.g. the science stories platform ([http://sciencestories.io](http://sciencestories.io)), a reconciliation service could be created with knowledge bases (cf. Wikidata).
Next to embedding data, hypermedia controls, or search engine optimization features, the [International Image Interoperability Framework](https://iiif.io/api/image/2.1/) (IIIF) Image API for sharing images could also be described within a JSON-LD snippet to raise the discoverability of this service. IIIF API information already uses JSON-LD to describe features such as tiling and licensing, which makes it an excellent snippet addition that helps an organization become more visible on the Web.
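
A sketch of what embedding such a description could look like, based on the IIIF Image API 2.1 image information (`info.json`) format; the identifier, dimensions, and license are illustrative:

```json
{
  "@context": "http://iiif.io/api/image/2/context.json",
  "@id": "https://example.org/iiif/newspaper-scan",
  "protocol": "http://iiif.io/api/image",
  "width": 6000,
  "height": 4000,
  "tiles": [{ "width": 512, "scaleFactors": [1, 2, 4, 8] }],
  "profile": ["http://iiif.io/api/image/2/level2.json"],
  "license": "https://creativecommons.org/publicdomain/zero/1.0/"
}
```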
Linked Data services ([HDT](cite:cites Fernndez2013BinaryRR) file, TPF interface...) with a higher maintenance cost can be created on top of JSON-LD snippets, but these would suffer from scalability problems: Optical Character Recognition (OCR) texts have poor compression rates and thus require gigabytes of disk space. With our solution, these OCR texts are published in a separate document, keeping the maintenance cost low while automated harvesting remains possible.

In future work, extending Comunica for harvesting Hydra collections would help organizations improve their collection management. These collections could be defined on the main page of their website, improving Open Data discoverability. Work on supporting multiple views acting as indexes for collections would also benefit querying performance for sorting or filtering operations on e.g. geospatial or temporal data.
In future work, extending Comunica for harvesting Hydra collections would help organizations improve their collection management. These collections could be defined on the main page of their website, improving Open Data discoverability.

<!--By using our demonstrator, non-technical users are able to extract a data dump from an enriched website.-->

<!-- The cultural heritage website hetarchief.be showcases an officially maintained paged collection of Linked Data Fragments about newspapers. By extending Comunica, in-depth data analysis and federated querying over this dataset is possible. To improve querying speed, Linked Data services ([SPARQL endpoint](http://semanticweb.org/wiki/SPARQL_endpoint.html), [HDT](cite:cites Fernndez2013BinaryRR) file, TPF interface...) with a higher maintenance cost can be created on top of JSON-LD snippets. Such interfaces would suffer from scalability problems: Optical Character Recognition (OCR) texts have poor compression rates and thus require gigabytes of disk space. With our solution, these OCR texts are published in a separate document, keeping the maintenance cost low while automated harvesting remains possible. By using our demonstrator, non-technical users are able to extract a data dump from an enriched website. -->

<!-- To gain traction with an international audience, e.g. the science stories platform ([http://sciencestories.io](http://sciencestories.io)), a reconciliation service could be created with knowledge bases (cf. Wikidata).
Next to embedding data, hypermedia controls, or search engine optimization features, the [International Image Interoperability Framework](https://iiif.io/api/image/2.1/) (IIIF) Image API for sharing images could also be described within a JSON-LD snippet to raise the discoverability of this service. IIIF API information already uses JSON-LD to describe features such as tiling and licensing, which makes it an excellent snippet addition that helps an organization become more visible on the Web. -->

<!--In future work, extending Comunica for harvesting Hydra collections would help organizations improve their collection management. These collections could be defined on the main page of their website, improving Open Data discoverability. Work on supporting multiple views acting as indexes for collections would also benefit querying performance for sorting or filtering operations on e.g. geospatial or temporal data.-->
2 changes: 1 addition & 1 deletion content/demonstrator.md
@@ -6,7 +6,7 @@ The application is written with the front-end playground Codepen [https://codepe

<figure id="codepen">
<center>
<img src="img/codepen.PNG">
<img id="codepen-img" src="img/codepen.PNG">
</center>
<figcaption markdown="block">
A spreadsheet is generated by entering a URL of a newspaper from hetarchief.be.
10 changes: 5 additions & 5 deletions content/implementation.md
@@ -5,8 +5,8 @@

Every newspaper webpage is annotated with JSON-LD snippets containing domain-specific metadata and hypermedia controls. The former is described using acknowledged vocabularies such as [Dublin Core Terms](http://dublincore.org/documents/dcmi-terms/) (DCTerms), [Friend of a Friend](http://xmlns.com/foaf/spec/) (FOAF), [Schema.org](https://schema.org/), etc. The latter is described using the [Hydra](https://www.hydra-cg.com/spec/latest/core) vocabulary for hypermedia-driven Web APIs. Although hetarchief.be contains several human-readable hypermedia controls (free text search bar, search facets, pagination for every [newspaper](https://hetarchief.be/nl/media/brief-van-den-soldaat-aan-zijne-verdrukte-medeburgers/I2STYUAOpmFKmbFRXNmV0PTp)), only Hydra's partial collection view controls are implemented: hydra:next points to the next newspaper and hydra:previous to the previous one. An estimate of the number of triples on a page is also added using hydra:totalItems and void:triples, which helps user agents build more efficient query plans.

<figure id="partial-collection-controls" class="listing">
````/code/hydra-partial-collection-view.txt````
<figure id="partial-collection-controls" class="figure">
````/code/hydra-partial-collection-view.txt````
<figcaption markdown="block">
Every newspaper describes its next and previous newspaper using Hydra partial collection view controls. This wires Linked Data Fragments together into a dataset.
</figcaption>
@@ -32,13 +32,13 @@ That is why we added an actor (`ActorRdfParseHtmlScript`) for parsing such HTML
This intermediate parser searches for data snippets and forwards these to their respective RDF parser.
In the case of a JSON-LD snippet, the body of a script tag `<script type="application/ld+json">` will be parsed by the JSON-LD parse actor.
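
A minimal sketch of such an embedded snippet carrying the Hydra controls described above; the URLs and values are illustrative, not the actual hetarchief.be data:

```html
<script type="application/ld+json">
{
  "@context": {
    "dct": "http://purl.org/dc/terms/",
    "hydra": "http://www.w3.org/ns/hydra/core#",
    "void": "http://rdfs.org/ns/void#"
  },
  "@id": "https://hetarchief.be/nl/media/example-newspaper",
  "dct:title": "Example newspaper, 1 January 1918",
  "hydra:previous": { "@id": "https://hetarchief.be/nl/media/example-newspaper-previous" },
  "hydra:next": { "@id": "https://hetarchief.be/nl/media/example-newspaper-next" },
  "hydra:totalItems": 120,
  "void:triples": 120
}
</script>
```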

By adding these two actors to Comunica, we can now query over a paged collection that is declaratively described with data snippets. As federated querying comes out of the box with Comunica, this cultural heritage collection can now be queried together with other knowledge bases (cf. Wikidata). For example, [](#federated-querying-comunica) crawls over 17 newspaper pages. The first result appears after reading the first page. All results are available after 1.5 minutes. This is caused by the lack of indexes: all pages need to be examined before a complete answer can be given.
By adding these two actors to Comunica, we can now query over a paged collection that is declaratively described with data snippets. As federated querying comes out of the box with Comunica, this cultural heritage collection can now be queried together with other knowledge bases (cf. Wikidata). For example, retrieving basic information such as title, publication date, etc. from 17 newspaper pages takes 1.5 minutes until all results are retrieved. This is caused by the lack of indexes: all pages need to be examined before a complete answer can be given.
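
A sketch of this kind of query, assuming the newspapers are described with DCTerms (the exact predicates used by hetarchief.be may differ):

```sparql
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?newspaper ?title ?date WHERE {
  ?newspaper dct:title ?title ;
             dct:issued ?date .
}
```

Starting from the URL of a single newspaper page, the engine discovers the remaining pages by following the hydra:next and hydra:previous controls.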

<figure id="federated-querying-comunica" class="listing">
<!-- <figure id="federated-querying-comunica" class="listing">
````/code/federated-querying-comunica.txt````
<figcaption markdown="block">
SPARQL-query over a paged collection of hetarchief.be and the TPF interface of Wikidata using the JavaScript-based command line interface of Comunica.
</figcaption>
</figure>
</figure> -->

In the next section, we demonstrate how SPARQL querying can be applied to extract a spreadsheet.
4 changes: 2 additions & 2 deletions content/index.md.erb
@@ -17,7 +17,7 @@ title: "Using an existing website as a queryable low-cost LOD publishing interfa

<section class="context" markdown="block">
## In reply to
- [The Web Conference 2019 Call for Demonstrations](https://www2019.thewebconf.org/call-for-demonstrations){:rel="as:inReplyTo"}
- [ESWC 2019 Call for Posters and Demos](https://2019.eswc-conferences.org/call-for-posters-and-demos/){:rel="as:inReplyTo"}
</section>

</header>
@@ -36,9 +36,9 @@ title: "Using an existing website as a queryable low-cost LOD publishing interfa
<main>
<!-- Add sections by specifying their file name, excluding the '.md' suffix. -->
<%= section 'introduction' %>
<%= section 'sota' %>
<%= section 'implementation' %>
<%= section 'demonstrator' %>
<%= section 'sota' %>
<%= section 'conclusion' %>
</main>

6 changes: 4 additions & 2 deletions content/introduction.md
@@ -1,3 +1,5 @@
<div class="printonly">This is a print-version of a paper first written for the Web. The Web-version is available at <a href="https://github.com/brechtvdv/Article-Using-an-existing-website-as-a-queryable-low-cost-LOD-publishing-interface">https://github.com/brechtvdv/Article-Using-an-existing-website-as-a-queryable-low-cost-LOD-publishing-interface</a>.</div>

## Introduction
{:#introduction}

@@ -18,9 +20,9 @@ Website maintainers are currently using [JSON-LD](https://json-ld.org/spec/lates

[Comunica](cite:cites taelman_iswc_2018) is a Linked Data user agent that can run federated queries over several heterogeneous Web APIs, such as data dumps, SPARQL endpoints, Linked Data documents, and Triple Pattern Fragments. This engine has been developed to make it easy to plug in specific types of functionality as separate modules. Such modules can be added or removed depending on the configuration. As such, by looking for affordances in Web APIs, more intelligent user agents can be created.
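
As a rough sketch of how such a query could be run programmatically (the API shown is Comunica's JavaScript interface as we understand it; the source URL and query are illustrative):

```javascript
const { newEngine } = require('@comunica/actor-init-sparql');

const engine = newEngine();

engine.query(
  'SELECT ?title WHERE { ?s <http://purl.org/dc/terms/title> ?title }',
  { sources: [{ type: 'hypermedia', value: 'https://hetarchief.be/nl/media/example-newspaper' }] },
).then((result) => {
  // SELECT results arrive as a stream of variable bindings.
  result.bindingsStream.on('data', (binding) =>
    console.log(binding.get('?title').value));
});
```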

First, we describe how hetarchief.be is enriched with JSON-LD snippets. Next, we explain how we allow Comunica to query over this and other sources by adding two building blocks.
First, we give a short background on the Comunica tool and the Hydra partial collection views.
We then describe how hetarchief.be is enriched with JSON-LD snippets. Next, we explain how we allow Comunica to query over this and other sources by adding two building blocks.
After this, we demonstrate how a custom data dump can be created by an end-user that wants to further analyze this data, for instance in spreadsheet software.
The online version of this paper embeds this demo and can be tested live.
We then give a short background of the Comunica tool and the Hydra partial collection views.
Finally, we conclude the demonstrator with a discussion and perspective on future work.

4 changes: 2 additions & 2 deletions content/sota.md
@@ -5,14 +5,14 @@

Every piece of functionality in Comunica can be implemented as a separate building block based on the _actor_ programming model, where each actor can respond to a specific action. Actors that share the same functionality, but with different implementations, can be grouped with a communication channel called a _bus_. Interaction between actors is possible through a _mediator_ that wraps around a bus to get an action's result from a single actor. This result depends on the configuration of the mediator, e.g. a race mediator will return the response of the actor that is able to reply earliest.
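
A minimal JavaScript sketch of this pattern (illustrative only; Comunica's actual actor, bus, and mediator interfaces differ):

```javascript
// An actor responds to one specific kind of action; this toy actor
// resolves an action to an uppercased string.
class UppercaseActor {
  run(action) {
    return Promise.resolve(action.input.toUpperCase());
  }
}

// A bus groups actors that share the same functionality.
class Bus {
  constructor() {
    this.actors = [];
  }
  subscribe(actor) {
    this.actors.push(actor);
  }
  publish(action) {
    return this.actors.map((actor) => actor.run(action));
  }
}

// A race mediator returns the reply of whichever subscribed actor answers earliest.
class RaceMediator {
  constructor(bus) {
    this.bus = bus;
  }
  mediate(action) {
    return Promise.race(this.bus.publish(action));
  }
}

const bus = new Bus();
bus.subscribe(new UppercaseActor());
new RaceMediator(bus).mediate({ input: 'page' }).then(console.log); // "PAGE"
```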

<figure id="actor">
<!-- <figure id="actor">
<center>
<img src="img/actor-mediator-bus.svg">
</center>
<figcaption markdown="block">
Actor 0 initiates an action to a mediator. The mediator communicates through a bus with all actors 1, 2 and 3 that are able to solve the action and gives back the most favorable result according to its configuration.
</figcaption>
</figure>
</figure> -->

### Hydra partial collection views

13 changes: 12 additions & 1 deletion content/styles/print.scss
@@ -1,4 +1,4 @@
@import "acm.scss";
@import "lncs.scss";
@import "shared.scss";

header {
@@ -9,4 +9,15 @@ header {

figure.listing pre {
white-space: initial;
}

#wiperstimes, #partial-collection-controls {
display: none;
}

#codepen-img {
width: 50%;
display: block;
margin-left: auto;
margin-right: auto;
}
