Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARQL MAX aggregate should result in type error on empty set #1978

Closed
jeenbroekstra opened this issue Mar 7, 2020 · 9 comments
Closed

SPARQL MAX aggregate should result in type error on empty set #1978

jeenbroekstra opened this issue Mar 7, 2020 · 9 comments
Assignees
Milestone

Comments

@jeenbroekstra
Copy link
Contributor

@jeenbroekstra jeenbroekstra commented Mar 7, 2020

Reported by @manuelfiorelli on the rdf4j-user list.

Let's assume to evaluate the following queries against an empty repository.

The query below returns no results as expected:

select ?s {
   ?s ?p ?o .
}
group by ?s

However, this query returns one result, which doesn't contain any binding:

select ?s (max(?o) as ?omax) {
   ?s ?p ?o .
 }
 group by ?s

This is incorrect, the result should be empty, not 1 result with an empty binding.

The cause appears to be that evaluation of MAX (as well as MIN and SAMPLE) hides type errors and replaces them with an empty binding. This is not conform the spec: it should result in a type error (which should then be converted into a empty result by the processing of the extension).

@jeenbroekstra jeenbroekstra added this to the 3.1.2 milestone Mar 7, 2020
@jeenbroekstra jeenbroekstra self-assigned this Mar 7, 2020
@jeenbroekstra jeenbroekstra added this to To do in Next releases via automation Mar 7, 2020
@jeenbroekstra

This comment has been minimized.

Copy link
Contributor Author

@jeenbroekstra jeenbroekstra commented Mar 7, 2020

I came across this W3C compliance test, which indicates that in fact the current behavior is correct: https://www.w3.org/2009/sparql/docs/tests/summary.html#aggregates-agg-empty-group. The result is a single, empty solution.

I need to dive a bit deeper into the algebra to see what I missed.

jeenbroekstra added a commit that referenced this issue Mar 7, 2020
- original behavior was correct: GH-1978 is not a bug
- used opportunity to clean up implementation flow
- added additional unit and compliance tests
jeenbroekstra added a commit that referenced this issue Mar 7, 2020
- original behavior was correct: GH-1978 is not a bug
- used opportunity to clean up implementation flow
- added additional unit and compliance tests
@jeenbroekstra

This comment has been minimized.

Copy link
Contributor Author

@jeenbroekstra jeenbroekstra commented Mar 7, 2020

While it is true that the result of a MAX/MIN/SAMPLE aggregate on an empty solution set should be a type error, this is hidden by how groups and extensions are evaluated: the type error is converted to an empty binding, which eventually result in a query with a single, empty, solution. This is conform spec and not a bug.

@jeenbroekstra jeenbroekstra modified the milestone: 3.1.2 Mar 7, 2020
@jeenbroekstra jeenbroekstra removed the bug label Mar 7, 2020
Next releases automation moved this from To do to Done Mar 8, 2020
jeenbroekstra added a commit that referenced this issue Mar 8, 2020
- original behavior was correct: GH-1978 is not a bug
- used opportunity to clean up implementation flow
- added additional unit and compliance tests
@afs

This comment has been minimized.

Copy link

@afs afs commented Mar 8, 2020

cc: @manuelfiorelli

There is another step that plays a part in the final outcome.

When there are zero rows from the query pattern, the grouping is empty and the
aggregate function is not called. The result is zero rows. (FWIW: Jena has got this wrong
more than once over the years.) The SPARQL test can't be right (for a start, it
has the wrong variables).

With the example query, where there are no values for the grouping.

Group(exprlist, Ω) = { }

then the aggregation becomes:

 Aggregation(exprlist, func, scalarvals, { } )
   =  { (key, F(Ω)) | key → Ω in { } }

which is { } - there is no key → Ω in { }. This is where max would be
called but if there are zero results from the group step, max aggregation isn't called.

But I found a problem in the spec when it is aggregation without a group by, and the pattern having no rows.

I've tried to write out the details, having got it wrong before, in
https://afs.github.io/sparql-agg-group-empty
as I read the spec, with a possible fix for the no group by case.

The intuition is

  • group+aggregate: same number of rows as the group
  • aggregate, no group: one row.
@jeenbroekstra jeenbroekstra reopened this Mar 8, 2020
Next releases automation moved this from Done to In progress Mar 8, 2020
@jeenbroekstra

This comment has been minimized.

Copy link
Contributor Author

@jeenbroekstra jeenbroekstra commented Mar 9, 2020

Thanks for analyzing this @afs . I must admit I had a hard time puzzling out where the empty row came from looking purely at the algebra, so I relied on the test case and hand-waved it a bit as "it's the grouping and extension conversions".

The crux of the argument seems to be that, in the spec as currently defined, there is a difference between an aggregate with an explicit grouping clause, and an aggregate without. That in itself seems odd to me as conceptually, an aggregate without an explicit grouping still groups - it just groups on the entire solution. From what I gather of a quick read of your writeup, your proposed fix is actually exactly making that distinction disappear.

I need to take a bit of time to puzzle with it and see where it leads us. Do you reckon this is something we should write up for the SPARQL 1.2 group as well?

Fwiw I am not yet convinced that your proposed fix fixes it "in the right direction". It seems counter-intuitive to me that, for example, a SUM aggregate:

SELECT (SUM(?value) as ?c)
WHERE {
    ?x ex:p ?value
} GROUP BY ?x

should result in an empty result, rather than a single "0" (with or without explicit grouping). Especially given that https://www.w3.org/TR/sparql11-query/#defn_aggSum explicitly defines the "empty input" case:

Sum(S) = "0"^^xsd:integer when card[S] = 0

The intent there seems to be that (irrespective of grouping) there's always a result - the fact that this is (apparently) hidden because of how grouping and the application of aggregation on it works seems an unintended side effect to me.

@afs

This comment has been minimized.

Copy link

@afs afs commented Mar 9, 2020

Hi Jeen,

Do you reckon this is something we should write up for the SPARQL 1.2 group as
well?

Yes, and raise errata on the SPARQL 1.1 spec for points where we agree the spec
is wrong or needs clarification.

I think it is only one current test that is wrong but the coverage of cases isn't
100%. Looks like the bad test was generated from a buggy implementation and in
the rush to finish (there was a hard time limit on the WG), things got missed.

Let's work on tests to capture what we want SPARQL to do.

Users expectations comes from SQL. We don't have to follow that but it's good to
know when and why we deviate.

MySQL with an empty table.

mysql> create table DATA (A int, B int);
Query OK, 0 rows affected (2.13 sec)

mysql>  describe DATA;
+-------+------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+-----+---------+-------+
| A     | int  | YES  |     | NULL    |       |
| B     | int  | YES  |     | NULL    |       |
+-------+------+------+-----+---------+-------+
2 rows in set (0.00 sec)

mysql>  select count(*) from DATA;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

mysql>  select sum(A) from DATA;
+--------+
| sum(A) |
+--------+
|   NULL |
+--------+
1 row in set (0.00 sec)

mysql>  select sum(A) from DATA group by B;
Empty set (0.00 sec)

mysql>  select count(*) from DATA group by B;
Empty set (0.00 sec)

in other words, with an empty table, aggregate-no-group is one row, aggregate-group is zero rows.

Suggested new test:

SELECT (count(*) AS ?C) { FILTER(false) } ==> One row, ?C=0

Here's a github repo to collect tests in:

https://github.com/afs/sparql-group-agg.git

@jeenbroekstra

This comment has been minimized.

Copy link
Contributor Author

@jeenbroekstra jeenbroekstra commented Mar 10, 2020

Makes sense. And thinking about it a bit more, there's some symmetry in the solution. I guess the intuitive way to explain it is something along the lines of: "if you use explicit grouping, and there are no solutions, there are no groups, therefore the result is empty. However, if you don't use grouping, there is one fixed "default" group (which is optionally empty). Therefore it will have a non-empty result."

I'll try and write up some test cases later this week and contribute them to your repo. Thanks for setting it up!

jeenbroekstra added a commit that referenced this issue Mar 11, 2020
- modified and extended unit tests
- disabled incorrect W3C conformance test per discussion on GH-1978
jeenbroekstra added a commit that referenced this issue Mar 11, 2020
- modified and extended unit tests
- disabled incorrect W3C conformance test per discussion on GH-1978
jeenbroekstra added a commit that referenced this issue Mar 12, 2020
- refactored testsuite setup a bit to make future extension easier
- added two simple test cases
jeenbroekstra added a commit that referenced this issue Mar 12, 2020
- refactored testsuite setup a bit to make future extension easier
- added two simple test cases
@manuelfiorelli

This comment has been minimized.

Copy link
Contributor

@manuelfiorelli manuelfiorelli commented Mar 12, 2020

Hi @afs and @jeenbroekstra thank you for the detailed analysis and sorry for the late reply. The intuitive explanation by Jeen makes sense to me. If I got it right, this is precisely the main point of the report written by @afs.

When I filed this issue, I didn't think about the deep implications of using an empty graph. In fact, I discovered the problem using some test data and a graph pattern that didn't match anything. However, I tried to "simplify" the issue by avoiding the failing graph pattern.

If I got the discussion right, when I have a group by and an failing graph pattern, in the "revised" specification, we should still zero solutions (instead of a solution with unbound variables). Am I right?

jeenbroekstra added a commit that referenced this issue Mar 12, 2020
- modified and extended unit tests
- disabled incorrect W3C conformance test per discussion on GH-1978
- refactored testsuite setup a bit to make future extension easier
- added two simple test cases in manifest tests
Next releases automation moved this from In progress to Done Mar 12, 2020
jeenbroekstra added a commit that referenced this issue Mar 12, 2020
- modified and extended unit tests
- disabled incorrect W3C conformance test per discussion on GH-1978
- refactored testsuite setup a bit to make future extension easier
- added two simple test cases in manifest tests
@jeenbroekstra

This comment has been minimized.

Copy link
Contributor Author

@jeenbroekstra jeenbroekstra commented Mar 12, 2020

@manuelfiorelli you are correct. Fix will be included in 3.1.2 release.

@manuelfiorelli

This comment has been minimized.

Copy link
Contributor

@manuelfiorelli manuelfiorelli commented Mar 13, 2020

thank you for the confirmation and the patch release.

jeenbroekstra added a commit that referenced this issue Mar 16, 2020
- took the opportunity to slightly update the testsuite code, it was
  still using SeRQL and ignore tests involving SPARQL/JSON results
jeenbroekstra added a commit that referenced this issue Mar 17, 2020
- took the opportunity to slightly update the testsuite code, it was
  still using SeRQL and ignore tests involving SPARQL/JSON results
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Next releases
  
Done
3 participants
You can’t perform that action at this time.