Fuseki 6.0.0: canceled federated SERVICE queries can wedge the target dataset until restart #3837

@plvsadi

Version

6.0.0. Also reproduced on 5.5.0.

What happened?

We can reproduce a failure mode where repeated canceled federated SERVICE queries leave
the target dataset effectively wedged until Fuseki is restarted.

The pattern is:

  1. A direct query to dataset target succeeds.
  2. A federated query from dataset source to dataset target using SERVICE <http://127.0.0.1:3030/target/sparql> succeeds.
  3. We then issue a burst of heavy federated queries from source to target, with the
    client canceling/timing out the outer HTTP request almost immediately.
  4. After that, even a simple direct query to dataset target times out.
  5. Restarting Fuseki clears the problem.

This is reproducible for us on Jena/Fuseki 6.0.0 and also on 5.5.0.

Why this looks distinct from ordinary timeout behavior

This does not look like a case of only the outer client timing out.

After the cancellation storm:

  • a direct query to the target dataset also times out
  • the store recovers only after restart

So the target dataset/server appears to be left in a bad runtime state.

Reproducing it

We reproduced this against an isolated standalone Fuseki 6.0.0 container built from the
official Apache release tarball.

Our production datasets are private, but the failure can be described with this structure:

  • source dataset: source
  • target dataset: target

Baseline direct probe against target:

  SELECT * WHERE {
    <urn:probe-subject> ?p ?o
  }
  LIMIT 5

Baseline federated probe from source to target:

  SELECT * WHERE {
    SERVICE <http://127.0.0.1:3030/target/sparql> {
      <urn:probe-subject> ?p ?o
    }
  }
  LIMIT 5
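
The federated queries in this report share one shape: a graph pattern wrapped in a SERVICE clause that points at the target endpoint. As a minimal sketch for anyone adapting the reproduction to other dataset names, a small helper can generate the one-line form of that query. The function name `federated_query` is hypothetical and not part of Fuseki or curl.

```shell
# Hypothetical helper (not part of Fuseki): build a one-line federated query.
#   $1 = SERVICE endpoint URL, $2 = graph pattern to run at that endpoint
federated_query() {
  printf 'SELECT * WHERE { SERVICE <%s> { %s } }' "$1" "$2"
}
```

For example, `federated_query http://127.0.0.1:3030/target/sparql '?s ?p ?o'` prints the query string that can be passed to `curl --data-urlencode "query=..."`.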

Cancellation-storm query:

  SELECT * WHERE {
    SERVICE <http://127.0.0.1:3030/target/sparql> {
      ?s ?p ?o
    }
  }

We then repeatedly send that last query to the source dataset and cancel the outer HTTP
request almost immediately, for example:

  for i in $(seq 1 40); do
    curl -sS --max-time 0.05 -G \
      --data-urlencode 'query=SELECT * WHERE { SERVICE <http://127.0.0.1:3030/target/sparql> { ?s ?p ?o } }' \
      http://127.0.0.1:3030/source/sparql >/dev/null || true
  done 
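
To tell whether the target is still responsive after the loop above, the direct probe can be rerun with a time limit and the curl exit status inspected. The following is a sketch under assumptions: the function names `classify_probe` and `probe_direct` and the 10-second default limit are illustrative choices, not something we rely on in production. It uses curl's documented exit code 28 (operation timed out) to distinguish a timeout from other failures.

```shell
# Hypothetical helper: map a curl exit status to a short label.
# 0 = request completed, 28 = curl's "operation timed out" exit code.
classify_probe() {
  case "$1" in
    0)  echo "ok" ;;
    28) echo "timeout" ;;
    *)  echo "error($1)" ;;
  esac
}

# Run the direct probe query against a SPARQL endpoint.
#   $1 = endpoint URL, $2 = time limit in seconds (assumed default: 10)
probe_direct() {
  local endpoint="$1" limit="${2:-10}" status=0
  curl -sS --fail --max-time "$limit" -G \
    --data-urlencode 'query=SELECT * WHERE { <urn:probe-subject> ?p ?o } LIMIT 5' \
    "$endpoint" >/dev/null 2>&1 || status=$?
  classify_probe "$status"
}
```

Running `probe_direct http://127.0.0.1:3030/target/sparql` once before the loop and once after it gives a simple before/after comparison.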

Actual result

Before stress:

  • direct query succeeds
  • federated query succeeds

After the canceled federated-query burst:

  • direct query to target times out
  • federated query also fails/times out
  • Fuseki restart is required to recover

Expected result

Canceled outer federated queries should not leave the target dataset/server wedged.
After the canceled requests, normal direct queries to the target dataset should still work.

Relevant logs

From the Jena 6.0.0 Fuseki log, after the stress starts we see many inner requests like:

  GET http://127.0.0.1:3030/target/sparql?query=SELECT++%2A%0AWHERE%0A++%7B+?s++?p++?o+%7D%0A

The outer requests are being canceled by the client, but the inner SERVICE subqueries
continue to run. After enough of these, the target dataset stops responding to even direct
queries.
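
A rough way to gauge how many inner requests reached the target is to count matching lines in the Fuseki log. This is only a sketch: the function name `count_inner_requests` is hypothetical, the log file path depends on how Fuseki is run, and the fixed-string match also counts direct queries sent to the target endpoint, so the number is an upper bound.

```shell
# Hypothetical helper: count log lines that contain a query request
# to the target dataset's SPARQL endpoint.
#   $1 = path to a Fuseki log file (location is deployment-specific)
count_inner_requests() {
  # -F matches the pattern as a fixed string; -c prints the number of
  # matching lines. grep exits non-zero on zero matches, hence "|| true".
  grep -c -F '/target/sparql?query=' "$1" || true
}
```

Comparing the count with the number of outer requests sent (40 in the loop we used) can indicate whether each canceled outer request still produced an inner SERVICE call.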

Notes

  • This was reproduced using loopback 127.0.0.1, so it does not appear to require Docker DNS
    or container-name routing.
  • We specifically tested 6.0.0 because the changelog mentions query-cancellation
    improvements, but we still reproduce this failure.
