
Strange behaviour with aggregate function #213

Open
johnnymoretti opened this issue Nov 17, 2021 · 8 comments

@johnnymoretti

Hi,
I'm using a local instance of Blazegraph 2.1.6 RC, and I've noticed that if I submit a query with an aggregate function like GROUP_CONCAT, it hangs unless I add a LIMIT clause.

The following query hangs:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr ; separator=", ") AS ?wrs)
WHERE {
  ?lemma ontolex:writtenRep ?wr
} GROUP BY ?lemma

This other one does not hang:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr ; separator=", ") AS ?wrs)
WHERE {
  ?lemma ontolex:writtenRep ?wr
} GROUP BY ?lemma
LIMIT 1000000000

Obviously my dataset is far smaller than 1000000000 rows: it contains 138245 triples.

Any suggestions?

@thompsonbry
Contributor

thompsonbry commented Nov 17, 2021 via email

@johnnymoretti
Author

Thank you for your reply; I understand the problem now.
To correct this behavior, and to confirm whether the problem is really related to materialization, is there a simple way to force the materialization?

@johnnymoretti
Author

OK, it is definitely a materialization problem. I set the materializeProjectionInQuery variable in the AST2BOpContext class to false, thereby implicitly forcing the projection to be materialized outside the plan: this way the queries work. However, the underlying materialization issue remains open. Thanks a lot for your help.

@johnnymoretti
Author

I have run into another strange thing.
The following query now works with the previously mentioned trick:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr ; separator=", ") AS ?wrs)
WHERE {
  ?lemma ontolex:writtenRep ?wr
} GROUP BY ?lemma

but this other one gets stuck again:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr ; separator=", ") AS ?wrs)
WHERE {
  ?lemma ontolex:writtenRep ?wr
} GROUP BY ?lemma
ORDER BY ?wrs

To fix this problem I have to write the query like this:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>
SELECT * {
  SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr ; separator=", ") AS ?wrs)
  WHERE {
    ?lemma ontolex:writtenRep ?wr
  } GROUP BY ?lemma
}
ORDER BY ?wrs

I think there is something wrong with the subquery nesting optimization, or with something related to ORDER BY handling.

Can you please help me?

@thompsonbry
Contributor

thompsonbry commented Nov 23, 2021 via email

@mschmidt00
Contributor

Looking at the first question again in more detail, I am attaching the two explains (for a data set with 20k triples, all having distinct subjects and objects but the same predicate). As suggested initially, there is a difference in the plan: the slow version (explain-wo-limit) contains a dedicated ChunkedMaterializationOp at the end, which clearly dominates the runtime.

explain-w-limit

explain-wo-limit

Looking at the code, I guess the problem happens here: https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata-rdf/src/java/com/bigdata/bop/rdf/join/ChunkedMaterializationOp.java#L375

We add the constructed TermIds (all with termId=0L, indicating that they are mocked) to the idToConstMap. I confirmed that all of them collide on the same hash code (which is 0 as well).
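The collision effect described above can be reproduced outside Blazegraph with a hypothetical stand-in for TermId (MockTermId below is an illustrative sketch, not the real class): every mocked id hashes to 0, so all map entries land in the same HashMap bucket and each insert or lookup must compare colliding entries via equals() (modern JDKs treeify large buckets, which softens but does not eliminate the cost for non-Comparable keys):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for Blazegraph's TermId, mirroring the behavior
// described above: mocked ids carry termId == 0L plus a cached value, and
// hashCode() is derived from the termId alone, so every mocked id hashes to 0.
final class MockTermId {
    final long termId;   // 0L marks a mocked/constructed value
    final String value;  // the cached (materialized) value

    MockTermId(long termId, String value) {
        this.termId = termId;
        this.value = value;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(termId); // always 0 for mocked ids
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof MockTermId)) return false;
        final MockTermId other = (MockTermId) o;
        return termId == other.termId && Objects.equals(value, other.value);
    }
}

public class CollisionDemo {
    public static void main(String[] args) {
        final Map<MockTermId, String> idToConstMap = new HashMap<>();
        for (int i = 0; i < 20_000; i++) {
            // Every key hashes to 0, so all 20k entries share one bucket.
            idToConstMap.put(new MockTermId(0L, "value-" + i), "const-" + i);
        }
        // De-duplication still works (equals() compares the cached value),
        // but only after paying the collision cost on every operation.
        System.out.println(idToConstMap.size()); // prints 20000
    }
}
```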

Possible ideas for fixes would be:

1.) Create a decorator for the idToConstMap (see e.g. https://stackoverflow.com/questions/41461515/hashmap-implementation-that-allows-to-override-hash-and-equals-method).
2.) Wrap the keys in dummy objects that override hashCode(), refining the method to return a hash code different from 0 for TermIds that have termId 0L and a value set (i.e., just the hash code of the value).
3.) Leave TermIds with termId 0L out of the map in the first place and handle them separately (IIUC, the main reason for using the hash map is de-duplication, which may be of limited value for constructed values anyway; though I would consider this a less preferred option).
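Fix idea (2) could be sketched roughly as follows. MockTermId is a hypothetical stand-in for Blazegraph's TermId (the real class has a different API), and TermIdKey is the dummy wrapper that refines hashCode() without touching TermId itself; equals() delegates to the wrapped id, so the map's de-duplication semantics are unchanged:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for TermId: termId == 0L means mocked, with the
// materialized value cached; its hashCode() collides on 0 for all mocked ids.
final class MockTermId {
    final long termId;
    final String value;
    MockTermId(long termId, String value) { this.termId = termId; this.value = value; }
    @Override public int hashCode() { return Long.hashCode(termId); }
    @Override public boolean equals(Object o) {
        return o instanceof MockTermId
                && termId == ((MockTermId) o).termId
                && Objects.equals(value, ((MockTermId) o).value);
    }
}

// Fix idea (2): a wrapper key that hashes mocked ids on their cached value
// instead of on the (always-zero) termId.
final class TermIdKey {
    final MockTermId id;
    TermIdKey(MockTermId id) { this.id = id; }
    @Override public int hashCode() {
        if (id.termId == 0L && id.value != null) return id.value.hashCode();
        return id.hashCode();
    }
    @Override public boolean equals(Object o) {
        return o instanceof TermIdKey && id.equals(((TermIdKey) o).id);
    }
}

public class WrapperKeyDemo {
    public static void main(String[] args) {
        final Map<TermIdKey, String> idToConstMap = new HashMap<>();
        idToConstMap.put(new TermIdKey(new MockTermId(0L, "foo")), "c1");
        idToConstMap.put(new TermIdKey(new MockTermId(0L, "bar")), "c2");
        // Distinct cached values now land in distinct buckets; duplicates
        // are still collapsed because equals() is unchanged.
        idToConstMap.put(new TermIdKey(new MockTermId(0L, "foo")), "c3");
        System.out.println(idToConstMap.size()); // prints 2
    }
}
```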

It may also be possible to change hashCode() in TermId directly, i.e., if termId==0L, always compute the hash code from the cached value. While this would be the "clean" way to fix the issue, it sounds quite risky and may cause problems with other code paths. I cannot come up with a scenario where it would be problematic, and I believe it is the right change conceptually, but it is really difficult to assess the blast radius of such an invasive change, both in terms of correctness and performance.

@mschmidt00
Contributor

Regarding your proposal: "I set the materializeProjectionInQuery variable in the AST2BOpContext class to false and thus I implicitly forced the projection to be materialized outside the plan." Yes, I think that should be okay as well, but it is more of a workaround (and may negatively impact the performance of other queries).

@mschmidt00
Contributor

Regarding your second question, here is the explain for the variant with ORDER BY (executed on some dummy data). As you can see, it contains two expensive ChunkedMaterializationOps, for the same reasons discussed above (with your materializeProjectionInQuery change there would probably be only one):

explain-order-by-variant

In light of this variant, my recommendation would be to fix the root cause (the hash collisions) rather than change materializeProjectionInQuery.
