I'm new to blazegraph, could you clarify? #203

Olivier4477 · 2021-06-10T13:18:39Z

Hello,

I am just discovering Blazegraph. I want to use government data for an app.

However, the data file (.rdf) is very large (3.30 GB).
For example, if I run:
curl -X POST -H "Content-Type: application/rdf+xml" --data-binary @flux.rdf "http://localhost:9999/blazegraph/namespace/kb/sparql"

Blazegraph takes about 2 hours to load the data. At the time of writing, I had run:
curl -X POST http://localhost:9999/blazegraph/namespace/kb/sparql --data-urlencode 'update=DROP ALL'

The drop also takes a very long time.

Since the data (.rdf) is updated every day, how can I update Blazegraph? Is it possible to update it without deleting everything first (DROP ALL)?

How can I speed up loading and updating the data?

Thank you, and have a good day.

thompsonbry · 2021-06-10T13:53:17Z

The easiest is to run two instances (ideally on two machines). Load into one in the background, cut over once loaded, then delete the journal on the other instance and start your next load there.
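
The rotation described above can be sketched as a small shell helper. This is only an illustration, not Blazegraph tooling: the slot names `a` and `b` and the helper names `other_slot` and `rotation_plan` are hypothetical, and the script only prints the order of the steps rather than executing anything.

```shell
#!/bin/sh
# Sketch of the alternating two-instance rotation (hypothetical names).

# Print the slot that is NOT currently serving queries.
other_slot() {
  case "$1" in
    a) echo b ;;
    b) echo a ;;
    *) echo "unknown slot: $1" >&2; return 1 ;;
  esac
}

# Print the plan for one rotation, given the slot that is serving now.
# Nothing is executed; each line is a step to carry out by hand or script.
rotation_plan() {
  active=$1
  standby=$(other_slot "$active") || return 1
  echo "1. load the new dump into instance $standby in the background"
  echo "2. cut queries over from instance $active to instance $standby"
  echo "3. delete the journal of instance $active"
  echo "4. start the next load on instance $active"
}

rotation_plan a
```

Calling `rotation_plan b` on the following day gives the same steps with the roles of the two instances swapped.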


Olivier4477 · 2021-06-10T14:00:22Z

Thank you for your reply.

But I already need a machine with at least 8 GB for Blazegraph to work, and paying for a second one is not the same budget.

Is that really the only solution?

Is it not possible, for example, with:
curl -X POST -H "Content-Type: application/rdf+xml" --data-binary @flux.rdf "http://localhost:9999/blazegraph/namespace/kb/sparql"
to specify a table name (for example, the current date), then at midnight load the update under the new table name and delete the table from the previous day?

Or is there another possibility?

I really want to use the data the government provides, but it is only available as RDF/SPARQL.

Thank you so much.

thompsonbry · 2021-06-10T14:07:57Z

Run two instances on the same machine then. There is no trivial way to identify all of the allocations in the storage layer associated with one loaded triple or quad store so that they can simply be dropped. It is possible to use lower-level APIs to drop indices, but that might not free the allocations immediately; this depends on how the RWStore is set up. On the other hand, as long as the machine can handle the two workloads (load and query), you can just use two instances. You can also use the DataLoader to load into the second one. This way the full database always responds at the same URL and port, with only a short downtime when you kill that process and restart it over the other database.
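
As a rough sketch of the DataLoader suggestion: the loader is the `com.bigdata.rdf.store.DataLoader` class shipped in the Blazegraph jar, and it takes a journal properties file followed by the files to load. The jar location, the `standby.properties` file name, and the `dataloader_cmd` helper are assumptions for illustration. The properties file is assumed to point at the standby instance's journal, which, as far as I know, should not be open in a running server while the loader runs. The helper only prints the command so it can be reviewed before it is run.

```shell
#!/bin/sh
# Sketch: build (but do not run) a DataLoader command line.
# BLAZEGRAPH_JAR and the file names below are hypothetical.
BLAZEGRAPH_JAR=${BLAZEGRAPH_JAR:-blazegraph.jar}

# Arguments: namespace, journal properties file, RDF file to load.
dataloader_cmd() {
  printf 'java -cp %s com.bigdata.rdf.store.DataLoader -namespace %s %s %s\n' \
    "$BLAZEGRAPH_JAR" "$1" "$2" "$3"
}

# Print the command for loading flux.rdf into the "kb" namespace
# of the standby journal described by standby.properties.
dataloader_cmd kb standby.properties flux.rdf
```

Once the printed command looks right, it can be run against the standby journal while the other instance keeps answering queries.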


Olivier4477 · 2021-06-10T14:16:02Z

OK, I think I understand your logic, but I will need help putting it into practice.

To explain, I use a docker-compose file like this:

version: '3.1'
services:
  blazegraph:
    image: conjecto/blazegraph:2.1.5
    restart: always
    ports:
      - 9999:9999
    environment:
      JAVA_OPTS: "-Xms2g -Xmx3g"
    volumes:
      - ./dataset:/docker-entrypoint-initdb.d
  datatourisme:
    build: docker
    ports:
      - "8080:80"
    restart: always
    depends_on:
      - blazegraph

This image is provided in the government documentation for using the data.

So for the moment I run:
docker-compose up
then I load the data like this:
curl -X POST -H "Content-Type: application/rdf+xml" --data-binary @flux.rdf "http://localhost:9999/blazegraph/namespace/kb/sparql"
(the data file must be stored in dataset/kb/data)

Then, if I want to reload, I must run:
docker-compose rm blazegraph
then:
docker system prune
then relaunch Blazegraph.

This is how I proceed now.

Before this solution, I used Apache Jena for SPARQL; it took 5 hours to load the data (on my computer with 32 GB of RAM).

thompsonbry · 2021-06-10T14:18:58Z

I'm not a Docker expert. You'll need to get someone else's advice on that.

Olivier4477 · 2021-06-10T14:20:38Z

OK, but how would you have done it?
Would you use blazegraph.jar directly?

In any case, thank you very much. I hope someone else can take over and help me.

Thank you so much!
