Version
6.0.0 Also reproduced on 5.5.0.
What happened?
We can reproduce a failure mode where repeated canceled federated SERVICE queries leave
the target dataset effectively wedged until Fuseki is restarted.
The pattern is:
- A direct query to dataset target succeeds.
- A federated query from dataset source to dataset target using SERVICE <http://127.0.0.1:3030/target/sparql> succeeds.
- We then issue a burst of heavy federated queries from source to target, with the client canceling/timing out the outer HTTP request almost immediately.
- After that, even a simple direct query to dataset target times out.
- Restarting Fuseki clears the problem.
This is reproducible for us on Jena/Fuseki 6.0.0 and also on 5.5.0.
Why this looks distinct from ordinary timeout behavior
This does not look like only the outer client timing out.
After the cancellation storm:
- a direct query to the target dataset also times out
- the store recovers only after restart
So the target dataset/server appears to be left in a bad runtime state.
Reproducing it
We reproduced this against an isolated standalone Fuseki 6.0.0 container built from the
official Apache release tarball.
Our production datasets are private, but the failure can be described with this structure:
- source dataset: source
- target dataset: target
Baseline direct probe against target:
SELECT * WHERE {
  <urn:probe-subject> ?p ?o
}
LIMIT 5
Baseline federated probe from source to target:
SELECT * WHERE {
  SERVICE <http://127.0.0.1:3030/target/sparql> {
    <urn:probe-subject> ?p ?o
  }
}
LIMIT 5
Cancellation-storm query:
SELECT * WHERE {
  SERVICE <http://127.0.0.1:3030/target/sparql> {
    ?s ?p ?o
  }
}
We then repeatedly send that last query to the source dataset and cancel the outer HTTP
request almost immediately, for example:
for i in $(seq 1 40); do
  curl -sS --max-time 0.05 -G \
    --data-urlencode 'query=SELECT * WHERE { SERVICE <http://127.0.0.1:3030/target/sparql> { ?s ?p ?o } }' \
    http://127.0.0.1:3030/source/sparql >/dev/null || true
done
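The before/after comparison can be made explicit with a small probe helper. This is a minimal sketch rather than part of our original reproduction: the function names probe_url, classify_exit and probe, and the FUSEKI_BASE and PROBE_TIMEOUT variables, are hypothetical. It relies only on curl's documented exit code 28 for an operation that timed out.

```shell
#!/usr/bin/env bash
# Hypothetical probe helpers (names and defaults are assumptions, not Fuseki tooling).

# Base URL of the Fuseki server; defaults to the loopback address used above.
FUSEKI_BASE="${FUSEKI_BASE:-http://127.0.0.1:3030}"

# Print the SPARQL endpoint URL for a dataset name, e.g. "target".
probe_url() {
  printf '%s/%s/sparql' "$FUSEKI_BASE" "$1"
}

# Map a curl exit code to a short status word.
# curl exits with 28 when --max-time is exceeded.
classify_exit() {
  case "$1" in
    0)  echo "ok" ;;
    28) echo "timeout" ;;
    *)  echo "error($1)" ;;
  esac
}

# Send one query to a dataset and print ok / timeout / error(N).
# PROBE_TIMEOUT (seconds) is an assumed knob; 10 is an arbitrary default.
probe() {
  local dataset="$1" query="$2" rc=0
  curl -sS -f --max-time "${PROBE_TIMEOUT:-10}" -G \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query=$query" \
    "$(probe_url "$dataset")" >/dev/null 2>&1 || rc=$?
  classify_exit "$rc"
}
```

Running probe target 'SELECT * WHERE { <urn:probe-subject> ?p ?o } LIMIT 5' once before the loop and once after it would be expected to print ok and then timeout when the failure occurs.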
Actual result
Before stress:
- direct query succeeds
- federated query succeeds
After the canceled federated-query burst:
- direct query to target times out
- federated query also fails/times out
- Fuseki restart is required to recover
Expected result
Canceled outer federated queries should not leave the target dataset/server wedged.
After the canceled requests, normal direct queries to the target dataset should still work.
Relevant logs
From the Jena 6.0.0 Fuseki log, after the stress starts we see many inner requests like:
GET http://127.0.0.1:3030/target/sparql?query=SELECT++%2A%0AWHERE%0A++%7B+?s++?p++?o+%7D%0A
The outer requests are being canceled by the client, but the inner SERVICE subqueries
continue to run. After enough of these, the target dataset stops responding to even direct
queries.
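To get a rough sense of how many inner SERVICE requests were started during the burst, the matching log lines can be counted. This is only a sketch: the function name count_inner_requests is hypothetical, the log file path depends on how Fuseki is run, and the match string is taken from the log line shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count log lines that mention an inner request to the
# target endpoint. The default pattern is taken from the log excerpt above.
count_inner_requests() {
  local log_file="$1"
  local pattern="${2:-/target/sparql?query=}"
  # -F: fixed-string match, -c: print the number of matching lines.
  # grep exits non-zero when the count is 0, so ignore its exit status.
  grep -cF -- "$pattern" "$log_file" || true
}
```

Comparing this count with the number of outer requests sent (40 in the loop above) gives a rough indication of how many inner queries were started despite the outer cancellations.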
Notes
- This was reproduced using loopback 127.0.0.1, so it does not appear to require Docker DNS/container-name routing.
- We specifically tested 6.0.0 because the changelog mentions query-cancellation
improvements, but we still reproduce this failure.