Full Fuseki configuration syncing #208
-
We are building a data mesh, and a data product in this mesh can have multiple input and output ports. One of the port types is linked data in a triple store (Jena Fuseki). All of the data mesh components are deployed in a Kubernetes cluster. Currently we've got a Jena Fuseki docker image that we're deploying in this cluster as a single pod. Because we want to support High Availability for all the data mesh components, we have been looking into how to provide this for Jena Fuseki and found this project. After some testing we found that we can definitely use this project to sync datasets between multiple Jena Fuseki instances, but we did run into one annoying issue.

As part of the data mesh, a data product developer can create a new data product. If the developer chooses a linked data triple store output port for the data product, the system creates a new dataset in Jena Fuseki. When we set up our cluster with multiple Jena Fuseki instances and the RDF Delta Patch Log Server, we didn't see this dataset creation get synced to the second Jena Fuseki instance. Looking in the issue tracker, we found a ticket and a discussion making clear that this is something more people might expect to be in the library, but that currently isn't. So now we are left with the issue that we don't seem to be able to provide Jena Fuseki in HA mode. This means this part of the system won't be resilient, and that to increase performance we can only give the single Jena Fuseki instance more memory and CPU instead of provisioning additional instances.

So we are wondering: is this kind of Fuseki configuration syncing on the roadmap? If it isn't and won't be in the near future, how easy or involved would it be to add ourselves? (And if it's not too complicated, could we get some pointers?)

We did find a way to get something working, though: when we create a new dataset for a data product, we first use the RDF Delta API to create a new patch log, and after creating this log we multiplex the create-dataset call to all known Jena Fuseki instances (sketched below). If we do this (we extended our custom Jena Fuseki docker image to include three RDF Delta libraries), we can see that our Jena Fuseki instances connect correctly to the Patch Log Server, and that if we create/update/delete triples in the new dataset on one Jena Fuseki instance, these changes get correctly replicated to the other. So this seems to mostly work, but we were wondering if there are things we might be missing or forgetting that could be an issue?
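For reference, a minimal Java sketch of the workaround described above. It assumes the `DeltaLinkHTTP` client from the rdf-delta `delta-client` module and the standard Fuseki administration endpoint (`POST /$/datasets`); the server URLs, dataset name, and base URI are placeholders, and method details should be checked against the rdf-delta version in use:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

import org.seaborne.delta.Id;
import org.seaborne.delta.client.DeltaLinkHTTP;
import org.seaborne.delta.link.DeltaLink;

public class CreateReplicatedDataset {
    public static void main(String[] args) throws Exception {
        String patchLogServer = "http://delta-server:1066/";      // placeholder URL
        List<String> fusekiInstances = List.of(                   // placeholder URLs
                "http://fuseki-0:3030", "http://fuseki-1:3030");
        String dsName = "my-data-product";                        // placeholder name

        // Step 1: create a new patch log on the RDF Delta patch log server.
        DeltaLink dLink = DeltaLinkHTTP.connect(patchLogServer);
        Id logId = dLink.newDataSource(dsName, "http://example/" + dsName);
        System.out.println("Created patch log: " + logId);

        // Step 2: multiplex the dataset creation to every known Fuseki
        // instance via the Fuseki administration protocol.
        HttpClient http = HttpClient.newHttpClient();
        for (String fuseki : fusekiInstances) {
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create(fuseki + "/$/datasets?dbName=" + dsName + "&dbType=tdb2"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(fuseki + " -> HTTP " + resp.statusCode());
        }
    }
}
```

Note that, as the reply below confirms, the new dataset only attaches to its patch log once each server picks up a configuration linking the two, e.g. after a rolling restart.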
-
To the last part: you haven't missed anything. There isn't a way to add a new dataset and set up a patch log without taking the steps you outline, including rebooting the servers to pick up the new configuration. The reboot can be done as a rolling reboot (one replica server at a time).
-
We (Telicent) are building a data platform around data mesh ideas. It is based around Kafka to ingest, transform and transport the data. The data becomes RDF, or SPARQL operations, or RDF Patch, on the main data topic, and systems pick the data up from there. Systems include Fuseki (with an Apache Kafka connector) as well as a search index API based on Elasticsearch; these are the smart caches. RDF Delta technology may be part of that.

The vision for RDF Delta v2 (not a roadmap, because it is resourcing dependent - no timescales):

- Switch from ZooKeeper to Raft for consensus. Raft is available as a library nowadays, so it should be possible to have the Fuseki servers themselves run the consensus.
- Use Kafka for the distribution of patches; then no patch log server is needed. This (including Kafka archiving) gives the patches persistence and replay, and the server is HA. (A minimal consumer sketch follows this comment.)
- Something has to change for patch storage - one other possibility is an Azure-native patch store. It seems MinIO is not perfect as an S3 emulator, and the fact that it is AGPL3 isn't great for the enterprise either.

Together, these make the RDF Delta deployment smaller and easier to administer.

For Fuseki:

- Enable a "reload configuration" option so the server can switch to a new configuration while gracefully finishing all operations on the old configuration. Change of configuration is "within limits" (to be determined): adding a new dataset or removing a dataset would be within limits; changing the TCP port would not. Then there can be a Kafka topic for the configuration and a graceful switch-over. This is delivered via Fuseki modules, so additional features are selected by putting jars in a directory. (A module skeleton follows this comment.)
- A UI for a data workbench - query and update data - not the server administration UI.
- Server configuration is by configuration file. API calls or command line options for a database don't work well when there is a need to have a record of the configuration while running the server. It would be nice to have a configuration editor.
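To illustrate the "Kafka for patch distribution" idea, a minimal sketch, assuming the standard Kafka consumer API and the rdf-delta patch classes (`org.seaborne.patch`). The broker address, topic name, and group id are placeholders, and a real implementation would track offsets so a restarted replica replays the log from its last applied patch; an in-memory dataset stands in for the real TDB2 database:

```java
import java.io.ByteArrayInputStream;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.DatasetGraphFactory;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.seaborne.patch.RDFPatch;
import org.seaborne.patch.RDFPatchOps;

public class PatchTopicApplier {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");        // placeholder broker
        props.put("group.id", "fuseki-replica-0");           // one group per replica: each replica sees every patch
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("auto.offset.reset", "earliest");          // replay the whole log on first start

        DatasetGraph dsg = DatasetGraphFactory.createTxnMem();

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("rdf-patches"));      // placeholder topic name
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, byte[]> rec : records) {
                    // Parse each message body as an RDF Patch and apply it;
                    // the patch carries its own TX/TC transaction boundaries.
                    RDFPatch patch = RDFPatchOps.read(new ByteArrayInputStream(rec.value()));
                    RDFPatchOps.applyChange(dsg, patch);
                }
            }
        }
    }
}
```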
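And on the "jars in a directory" delivery mechanism, a skeletal Fuseki module as a sketch. It assumes the `FusekiModule` interface from recent Jena (`org.apache.jena.fuseki.main.sys`), discovered via `ServiceLoader` when the jar is on the server's classpath; the interface and discovery details vary across Jena versions, so check the version you build against. The `ConfigSyncModule` name is hypothetical, and the module here only logs dataset names where a real one could wire datasets to patch logs or a configuration topic:

```java
import java.util.Set;

import org.apache.jena.fuseki.main.FusekiServer;
import org.apache.jena.fuseki.main.sys.FusekiModule;
import org.apache.jena.rdf.model.Model;

/**
 * Skeleton Fuseki module (hypothetical). Packaged in a jar with a
 * META-INF/services entry for the module interface, it is picked up
 * automatically when the jar is on the server's classpath.
 */
public class ConfigSyncModule implements FusekiModule {

    @Override
    public String name() { return "config-sync"; }

    // Called while the server is being built, before it starts: the place
    // where a module can inspect or alter the configuration.
    @Override
    public void prepare(FusekiServer.Builder builder, Set<String> datasetNames, Model configModel) {
        datasetNames.forEach(name -> System.out.println("Configuring dataset: " + name));
    }

    @Override
    public void serverAfterStarting(FusekiServer server) {
        System.out.println("Fuseki running on port " + server.getPort());
    }
}
```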
-
@afs thank you for the detailed answer. It might not be what we wanted to hear, but it does confirm things and points us toward avenues we can look at in the future.