Full Fuseki configuration syncing #208
-
We are building a data mesh, and a data product in this mesh can have multiple input and output ports. One of the port types is linked data in a triple store (Jena Fuseki). All of the data mesh components are deployed in a Kubernetes cluster. Currently we've got a Jena Fuseki docker image that we're deploying in this cluster as a single pod. Because we want to support High Availability for all the data mesh components, we have been looking into how to provide this for Jena Fuseki and found this project. After some testing we found that we can definitely use this project to sync datasets between multiple Jena Fuseki instances, but we did run into one annoying issue.

As part of the data mesh, a data product developer can create a new data product. If the developer chooses a linked data triple store output port for the data product, the system creates a new dataset in Jena Fuseki. When we set up our cluster with multiple Jena Fuseki instances and the RDF Delta Patch Log Server, we didn't see this dataset creation get synced to the second Jena Fuseki instance. Looking in the issue tracker, we found a ticket and a discussion making clear that this is something more people might expect to be in the library, but that currently isn't. So now we are left with the issue that we don't seem to be able to provide Jena Fuseki in HA mode. This means this part of the system won't be resilient, and that to increase performance we can only give the single Jena Fuseki instance more memory and CPU instead of provisioning additional instances.

So we are wondering: is this kind of Fuseki configuration syncing on the roadmap? If it isn't and won't be in the near future, how easy or involved would it be to add ourselves? (And if it's not too complicated, could we get some pointers?)

We did find a way to get something working, though: when we create a new dataset for a data product, we first use the RDF Delta API to create a new patch log, and after creating this log we multiplex the create-dataset call to all known Jena Fuseki instances (sketched below). If we do this (we extended our custom Jena Fuseki docker image to include three RDF Delta libraries), we can see that our Jena Fuseki instances connect correctly to the Patch Log Server, and that if we create/update/delete triples in the new dataset on one Jena Fuseki instance, these changes get correctly replicated to the other. So this seems to mostly work, but we were wondering if there are things we might be missing or forgetting that could be an issue?
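For reference, a minimal Java sketch of the workaround described above. It assumes the `DeltaLinkHTTP` client from the rdf-delta `delta-client` module and the standard Fuseki administration endpoint (`POST /$/datasets`); the server URLs, dataset name, and base URI are placeholders, and method details should be checked against the rdf-delta version in use:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

import org.seaborne.delta.Id;
import org.seaborne.delta.client.DeltaLinkHTTP;
import org.seaborne.delta.link.DeltaLink;

public class CreateReplicatedDataset {
    public static void main(String[] args) throws Exception {
        String patchLogServer = "http://delta-server:1066/";      // placeholder URL
        List<String> fusekiInstances = List.of(                   // placeholder URLs
                "http://fuseki-0:3030", "http://fuseki-1:3030");
        String dsName = "my-data-product";                        // placeholder name

        // Step 1: create a new patch log on the RDF Delta patch log server.
        DeltaLink dLink = DeltaLinkHTTP.connect(patchLogServer);
        Id logId = dLink.newDataSource(dsName, "http://example/" + dsName);
        System.out.println("Created patch log: " + logId);

        // Step 2: multiplex the dataset creation to every known Fuseki
        // instance via the Fuseki administration protocol.
        HttpClient http = HttpClient.newHttpClient();
        for (String fuseki : fusekiInstances) {
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create(fuseki + "/$/datasets?dbName=" + dsName + "&dbType=tdb2"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
            System.out.println(fuseki + " -> HTTP " + resp.statusCode());
        }
    }
}
```

Note that, as the reply below confirms, the new dataset only attaches to its patch log once each server picks up a configuration linking the two, e.g. after a rolling restart.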
-
To the last part: you haven't missed anything. There isn't a way to add a new dataset and set up a patch log without taking the steps you outline, including rebooting the servers to pick up the new configuration. The reboot can be done as a rolling reboot (one replica server at a time).
-
We (Telicent) are building a data platform around data mesh ideas. It is based around Kafka to ingest, transform and transport the data. The data becomes RDF, or SPARQL operations, or RDF Patch, on the main data topic, and systems pick the data up from there. Systems include Fuseki (with an Apache Kafka connector) as well as a search index API based on Elasticsearch; these are the smart caches. RDF Delta technology may be part of that.

The vision for RDF Delta v2 (not a roadmap, because it is resourcing dependent - no timescales):

- Switch from ZooKeeper to Raft for consensus. Raft is available as a library nowadays, so it should be possible to have the Fuseki servers themselves run the consensus.
- Use Kafka for the distribution of patches; then no patch log server is needed. This (including Kafka archiving) gives the patches persistence and replay, and the server is HA. (A minimal consumer sketch follows this comment.)
- Something has to change for patch storage - one other possibility is an Azure-native patch store. It seems MinIO is not perfect as an S3 emulator, and the fact that it is AGPL3 isn't great for the enterprise either.

Together, these make the RDF Delta deployment smaller and easier to administer.

For Fuseki:

- Enable a "reload configuration" option so the server can switch to a new configuration while gracefully finishing all operations on the old configuration. Change of configuration is "within limits" (to be determined): adding a new dataset or removing a dataset would be within limits; changing the TCP port would not. Then there can be a Kafka topic for the configuration and a graceful switch-over. This is delivered via Fuseki modules, so additional features are selected by putting jars in a directory. (A module skeleton follows this comment.)
- A UI for a data workbench - query and update data - not the server administration UI.
- Server configuration is by configuration file. API calls or command line options for a database don't work well when there is a need to have a record of the configuration while running the server. It would be nice to have a configuration editor.
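To illustrate the "Kafka for patch distribution" idea, a minimal sketch, assuming the standard Kafka consumer API and the rdf-delta patch classes (`org.seaborne.patch`). The broker address, topic name, and group id are placeholders, and a real implementation would track offsets so a restarted replica replays the log from its last applied patch; an in-memory dataset stands in for the real TDB2 database:

```java
import java.io.ByteArrayInputStream;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.DatasetGraphFactory;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.seaborne.patch.RDFPatch;
import org.seaborne.patch.RDFPatchOps;

public class PatchTopicApplier {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");        // placeholder broker
        props.put("group.id", "fuseki-replica-0");           // one group per replica: each replica sees every patch
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("auto.offset.reset", "earliest");          // replay the whole log on first start

        DatasetGraph dsg = DatasetGraphFactory.createTxnMem();

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("rdf-patches"));      // placeholder topic name
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, byte[]> rec : records) {
                    // Parse each message body as an RDF Patch and apply it;
                    // the patch carries its own TX/TC transaction boundaries.
                    RDFPatch patch = RDFPatchOps.read(new ByteArrayInputStream(rec.value()));
                    RDFPatchOps.applyChange(dsg, patch);
                }
            }
        }
    }
}
```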
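And on the "jars in a directory" delivery mechanism, a skeletal Fuseki module as a sketch. It assumes the `FusekiModule` interface from recent Jena (`org.apache.jena.fuseki.main.sys`), discovered via `ServiceLoader` when the jar is on the server's classpath; the interface and discovery details vary across Jena versions, so check the version you build against. The `ConfigSyncModule` name is hypothetical, and the module here only logs dataset names where a real one could wire datasets to patch logs or a configuration topic:

```java
import java.util.Set;

import org.apache.jena.fuseki.main.FusekiServer;
import org.apache.jena.fuseki.main.sys.FusekiModule;
import org.apache.jena.rdf.model.Model;

/**
 * Skeleton Fuseki module (hypothetical). Packaged in a jar with a
 * META-INF/services entry for the module interface, it is picked up
 * automatically when the jar is on the server's classpath.
 */
public class ConfigSyncModule implements FusekiModule {

    @Override
    public String name() { return "config-sync"; }

    // Called while the server is being built, before it starts: the place
    // where a module can inspect or alter the configuration.
    @Override
    public void prepare(FusekiServer.Builder builder, Set<String> datasetNames, Model configModel) {
        datasetNames.forEach(name -> System.out.println("Configuring dataset: " + name));
    }

    @Override
    public void serverAfterStarting(FusekiServer server) {
        System.out.println("Fuseki running on port " + server.getPort());
    }
}
```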
-
@afs thank you for the detailed answer. It might not be what we wanted to hear, but it does confirm things and points us toward avenues we can look at in the future.