Minus #340

floriankramer · 2020-05-19T12:01:53Z

This PR contains an implementation of MINUS.

joka921

Some nitpicks and some questions.

As far as I understand it, this hashing approach is efficient for A MINUS {B} if and only if B is small.
On the other hand, it is currently the only operation that handles undefined values correctly.
As soon as we have (OPTIONAL) joins that handle undefined values correctly, we could also add a "MINUS-JOIN" implementation that has no space overhead and uses the linear scanning approach.
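For illustration, a minimal sketch of what a hashing-based A MINUS {B} can look like. It is not the code from this PR: `Id`, `Row`, `Table`, `JoinColumns`, `RowHash` and `hashMinus` are hypothetical stand-ins, tables are plain vectors of rows, and undefined values are ignored entirely. The extra space is the hash set built over the join-column values of B, which is why the approach is attractive mainly when B is small.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <utility>
#include <vector>

using Id = std::uint64_t;  // hypothetical stand-in for the engine's Id type
using Row = std::vector<Id>;
using Table = std::vector<Row>;
// Each entry is {column in A, column in B} for one shared variable.
using JoinColumns = std::vector<std::array<std::size_t, 2>>;

// Simple hash combiner for a projected row (assumption: good enough for a sketch).
struct RowHash {
  std::size_t operator()(const Row& row) const {
    std::size_t h = 0;
    for (Id v : row) {
      h ^= std::hash<Id>{}(v) + 0x9e3779b9u + (h << 6) + (h >> 2);
    }
    return h;
  }
};

// Returns all rows of `a` that have no row in `b` with equal join-column values.
Table hashMinus(const Table& a, const Table& b, const JoinColumns& joinColumns) {
  // Without shared variables, MINUS removes nothing.
  if (joinColumns.empty()) return a;

  // Build phase: hash the join-column projection of every row of B.
  std::unordered_set<Row, RowHash> keysOfB;
  for (const Row& rowB : b) {
    Row key;
    for (const auto& jc : joinColumns) {
      assert(jc[1] < rowB.size());
      key.push_back(rowB[jc[1]]);
    }
    keysOfB.insert(std::move(key));
  }

  // Probe phase: keep the rows of A whose projection is not in the set.
  Table result;
  for (const Row& rowA : a) {
    Row key;
    for (const auto& jc : joinColumns) {
      assert(jc[0] < rowA.size());
      key.push_back(rowA[jc[0]]);
    }
    if (keysOfB.count(key) == 0) result.push_back(rowA);
  }
  return result;
}
```

The inputs do not need to be sorted for this variant, at the price of memory proportional to the number of distinct join keys in B.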

src/engine/Minus.cpp

src/engine/Minus.h

src/engine/QueryExecutionTree.h

src/engine/QueryPlanner.cpp

src/engine/Minus.cpp

hannahbast · 2020-08-05T13:08:13Z

The following simple (I thought) query takes very long on Wikidata. The line in the log where it stays for a very long time is "DEBUG: Computing minus of results of size 188,688 and 6,282,838".

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person1 ?person2 WHERE {
  ?x wdt:P26 ?y .
  MINUS { ?y wdt:P31 wd:Q5 } .
  ?x @en@rdfs:label ?person1 .
  ?y @en@rdfs:label ?person2 .
}

graue70 · 2020-08-31T12:56:41Z

This isn't available on https://qlever.cs.uni-freiburg.de yet, is it?

hannahbast · 2020-09-04T20:08:22Z

@graue70 We tried it, but the current implementation is too slow for Wikidata, so we rolled it back for the Wikidata instance. The problem is that MINUS becomes complicated (and IMHO unintuitive) when OPTIONAL triples are involved (leading to empty values). This PR deals properly with that (I think), but at the price of an impractical running time. We need an implementation that runs fast in the standard case (without empty values).

What do you need the MINUS for? Maybe there is a reasonably simple equivalent way to achieve it without MINUS.

graue70 · 2020-09-05T20:14:52Z

> What do you need the MINUS for? Maybe there is a reasonably simple equivalent way to achieve it without MINUS.

@hannahbast It's not super important, but I could decrease result sizes for my entity pre-computation queries (aliases etc.) if I could MINUS ?entity wdt:P31/wdt:P279* wd:Q17442446.

floriankramer · 2020-11-20T10:53:52Z

@joka921 I've fixed all of the problems mentioned in the review comments again (I must have lost the fixes somewhere while reverting to the older implementation).

joka921

In general:
Really great, I'm trying to merge this into the running instance to try it out.
There are only a few nitpicks on details.

joka921 · 2020-11-26T15:44:54Z

src/engine/Minus.cpp

+      return;
+    } else {
+      // This case should never happen.
+      AD_CHECK(false);


There is a compile-time version of this check.
A plain static_assert(false) doesn't work here, but we can both look for the right idiom.

Given that the constraint A_WIDTH == OUT_WIDTH holds for all invocations of the function, I removed a template parameter, which makes the check obsolete.
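For reference, one common way to get such a compile-time check is the "dependent false" idiom together with `if constexpr` (C++17). The sketch below is only an illustration under assumptions: `alwaysFalse` and `widthCategory` are hypothetical names, and the branch conditions are made up rather than taken from Minus.cpp.

```cpp
#include <cstddef>

// Hypothetical helper: always false, but dependent on a template parameter,
// so the static_assert below is only evaluated when its branch is instantiated.
template <std::size_t>
constexpr bool alwaysFalse = false;

// Hypothetical function with one branch per supported width.
template <std::size_t WIDTH>
constexpr int widthCategory() {
  if constexpr (WIDTH == 0) {
    return 0;  // e.g. width only known at runtime
  } else if constexpr (WIDTH <= 5) {
    return 1;  // e.g. statically known width
  } else {
    // An unsupported instantiation fails to compile instead of failing at runtime.
    static_assert(alwaysFalse<WIDTH>, "unsupported WIDTH");
  }
}
```

Calling `widthCategory<3>()` compiles, while `widthCategory<6>()` would be rejected by the compiler with the message above.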

joka921 · 2020-11-26T15:50:57Z

src/engine/Minus.cpp

+          size_t backIdx = result.size() - 1;
+          for (size_t col = 0; col < a.cols(); col++) {
+            result(backIdx, col) = a(ia, col);
+          }


This exact code snippet appears twice; making it a copyRow(ia) lambda would be cleaner.

joka921 · 2020-11-26T15:51:20Z

src/engine/Minus.cpp

+    result.emplace_back();
+    size_t backIdx = result.size() - 1;
+    for (size_t col = 0; col < a.cols(); col++) {
+      result(backIdx, col) = a(ia, col);


and again....
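A rough sketch of the suggested copyRow lambda, for illustration only. `SimpleTable` is a hypothetical row-major stand-in that offers just the operations used in the snippets above (`cols()`, `size()`, `emplace_back()`, `operator()`), and `copyEveryOtherRow` is a made-up driver that only exists to call the lambda from a loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Id = std::uint64_t;  // hypothetical stand-in for the engine's Id type

// Hypothetical minimal table with a fixed number of columns, stored row-major.
class SimpleTable {
 public:
  explicit SimpleTable(std::size_t numCols) : numCols_(numCols) {}
  std::size_t cols() const { return numCols_; }
  std::size_t size() const { return numCols_ == 0 ? 0 : data_.size() / numCols_; }
  void emplace_back() { data_.resize(data_.size() + numCols_); }
  Id& operator()(std::size_t row, std::size_t col) {
    return data_[row * numCols_ + col];
  }
  const Id& operator()(std::size_t row, std::size_t col) const {
    return data_[row * numCols_ + col];
  }

 private:
  std::size_t numCols_;
  std::vector<Id> data_;
};

// Copies rows 0, 2, 4, ... of `a` into a new table.
SimpleTable copyEveryOtherRow(const SimpleTable& a) {
  SimpleTable result(a.cols());
  // The duplicated loop body lives in exactly one place.
  auto copyRow = [&a, &result](std::size_t ia) {
    assert(ia < a.size());
    result.emplace_back();
    const std::size_t backIdx = result.size() - 1;
    for (std::size_t col = 0; col < a.cols(); ++col) {
      result(backIdx, col) = a(ia, col);
    }
  };
  for (std::size_t ia = 0; ia < a.size(); ia += 2) {
    copyRow(ia);
  }
  return result;
}
```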

joka921 · 2020-11-26T15:52:54Z

src/engine/Minus.cpp

+    size_t ib, const vector<array<size_t, 2>>& joinColumns) {
+  for (size_t i = 1; i < joinColumns.size(); ++i) {
+    Id va = a(ia, joinColumns[i][0]);
+    Id vb = b(ib, joinColumns[i][1]);


Please use auto; it will make life easier for me when changing or renaming types.

For which of the types do you want me to use auto?
I personally am not a huge fan of auto, as I feel like it makes the code harder to read. I'd rather use type aliases, which can have a descriptive name while still allowing for easy type changes (such as the Id type).

OK, my argument was mostly about ease of refactoring, but it is probably also good to look once at all the changes you have to make when changing such things.

Could I interest you in the uniform initialization syntax?
Id va{a(ia, joinColumns[i][0])};
Then we would have all the benefits, since we also get errors for narrowing conversions.
(I will also try to use this in the future :))

@joka921 I'm not a fan of the curly brace initialization, as it is a rather unusual syntax, but given that it has clear advantages in this case I'll switch over to it.
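To make the narrowing argument concrete, here is a small sketch. It assumes, purely for illustration, that `Id` is an alias for a 64-bit unsigned integer; `IsBraceInitializable` is a hypothetical trait, not part of QLever or the standard library. It detects at compile time whether `To{from}` is well-formed for a non-constant value, which is exactly what brace initialization enforces.

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

using Id = std::uint64_t;  // assumption for this sketch only

// Hypothetical trait: true if `To{value_of_type_From}` compiles, i.e. the
// conversion is not narrowing. Narrowing makes the expression ill-formed,
// so the partial specialization is discarded via SFINAE.
template <typename To, typename From, typename = void>
struct IsBraceInitializable : std::false_type {};

template <typename To, typename From>
struct IsBraceInitializable<To, From,
                            decltype(void(To{std::declval<From>()}))>
    : std::true_type {};

// Widening an unsigned value is fine ...
static_assert(IsBraceInitializable<Id, std::uint32_t>::value,
              "uint32_t -> Id is not narrowing");
// ... but these conversions would be rejected by `Id va{...}`,
// whereas `Id va = ...` would silently accept them.
static_assert(!IsBraceInitializable<Id, double>::value,
              "double -> Id is narrowing");
static_assert(!IsBraceInitializable<Id, std::int64_t>::value,
              "signed -> unsigned is narrowing");
static_assert(!IsBraceInitializable<std::uint32_t, Id>::value,
              "Id -> uint32_t is narrowing");
```

So if the type behind `Id` (or the return type of the table accessor) changes later, a declaration like `Id va{a(ia, col)};` turns a lossy conversion into a compile error at the point of use.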

joka921 · 2020-11-26T15:54:07Z

src/engine/Minus.h

+  enum class RowComparison {
+    EQUAL,
+    NOT_EQUAL_LEFT_SMALLER,
+    NOT_EQUAL_RIGHT_SMALLER


What is wrong with EQUAL, SMALLER, GREATER? Your choice seems somewhat verbose.

I guess I got a bit carried away with trying to be verbose. I've removed the NOT_EQUAL_ but have kept the LEFT and RIGHT to make it immediately obvious which side is smaller / greater.
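For illustration, the shortened names could be used roughly as follows. This is a sketch under assumptions, not the code from Minus.h: `Row`, `JoinColumns` and `compareOnJoinColumns` are hypothetical, and rows are plain vectors.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Id = std::uint64_t;  // hypothetical stand-in for the engine's Id type
using Row = std::vector<Id>;
// Each entry is {column in the left row, column in the right row}.
using JoinColumns = std::vector<std::array<std::size_t, 2>>;

// Names after the rename discussed above.
enum class RowComparison { EQUAL, LEFT_SMALLER, RIGHT_SMALLER };

// Lexicographic comparison of two rows restricted to the join columns.
RowComparison compareOnJoinColumns(const Row& left, const Row& right,
                                   const JoinColumns& joinColumns) {
  for (const auto& jc : joinColumns) {
    assert(jc[0] < left.size() && jc[1] < right.size());
    const Id va{left[jc[0]]};
    const Id vb{right[jc[1]]};
    if (va < vb) return RowComparison::LEFT_SMALLER;
    if (vb < va) return RowComparison::RIGHT_SMALLER;
  }
  return RowComparison::EQUAL;
}
```

With LEFT and RIGHT in the names, a call site such as `if (cmp == RowComparison::LEFT_SMALLER)` reads unambiguously without looking up the argument order.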

joka921 · 2020-11-26T15:57:46Z

src/engine/Minus.h

+  bool _multiplicitiesComputed;
+
+  std::vector<array<size_t, 2>> _matchedColumns;
+


This is something that has not been done in QLever so far, but I have come to like the following format:

Members at the beginning (no matter whether they are private or public), so one easily sees how "big" and how "pointerish" a class is.

Then public constructors, public methods, private methods.

This doesn't need to be changed here for merging, but what do you think about this?

It's interesting. It seems to be a good idea in a codebase with lots of heavy data handling. I usually go for

Inner Types

Constructors

Public Methods

Private Methods

Private Fields

This order focuses more on the ways a user can interact with the class (its public methods), while still keeping its data storage concise at the bottom. I think that picking a fixed order is definitely a good idea. For me personally, the order I'm used to works well and is the least effort. If you'd rather put the fields at the top, I'm fine with that as well. I don't think the actual position matters too much, as long as it's on one end of the file and clearly grouped.

Your order also works, since one can easily find the ending of a class and work their way up.
I just realized that we have many types where it goes

private: // private members // private methods

And then the data is scattered somewhere in the middle, where it is hard to find.

niklas88 · 2020-11-28T22:46:04Z

Haven't looked at this enough to give a review but just wanted to say I like the progress I'm seeing you both making!
So @joka921 if you feel confident enough to give approval already I'd say go ahead and merge it.

joka921 · 2020-11-29T09:25:21Z

@niklas88
I have already integrated it into one of the running instances on our machines. So we will just let it run for some time, there are
even people who have immediate use cases for this, and if no apparent problems appear, we can still merge it.

@graue70 The wikidata-full instance on galera now runs with MINUS support. Do you have some scripts with which you can test it for performance and expected results?

@Theresa93 The same question.

graue70 · 2020-12-02T11:30:07Z

> @graue70 The wikidata-full instance on galera now runs with MINUS support. Do you have some scripts with which you can test it for performance and expected results?

@joka921 Sorry, but I don't have the time to test this at the moment.

hannahbast · 2020-12-15T18:28:28Z

The following query gives an incorrect result on Freebase Easy https://qlever.cs.uni-freiburg.de/Fbeasy

SELECT ?person ?country_of_nationality WHERE {
  ?person <Award_Won> <Nobel_Prize_in_Physics> .
  ?person <Country_of_nationality> ?country_of_nationality .
  MINUS {
    ?person <Country_of_nationality> ?country_of_nationality .
    ?country_of_nationality <Contained_by> <Europe> 
  }
}

Reason: The result of the two triples outside of the MINUS is a 2-column table ordered by ?person. The result of the two triples inside of the MINUS is a 2-column table ordered by ?country_of_nationality. Currently, MINUS does not explicitly sort its inputs by the respective join columns, so it only works when the two tables happen to be sorted in the right way. For the above query, where this is not the case, it fails.
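To illustrate why the sort matters, here is a minimal sketch of a merge-style MINUS that first sorts both inputs by their join columns and then does one linear scan. It is an assumption-laden illustration, not the PR's implementation: `Row`, `Table`, `JoinColumns` and `sortedMinus` are hypothetical, tables are vectors of rows, undefined values are ignored, and in a real query plan the sort would be a separate operation rather than part of MINUS itself.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Id = std::uint64_t;  // hypothetical stand-in for the engine's Id type
using Row = std::vector<Id>;
using Table = std::vector<Row>;
// Each entry is {column in A, column in B} for one shared variable.
using JoinColumns = std::vector<std::array<std::size_t, 2>>;

// Returns the rows of `a` that have no partner in `b` on the join columns.
// Both inputs are taken by value and sorted on their join columns first.
Table sortedMinus(Table a, Table b, const JoinColumns& joinColumns) {
  // Without shared variables, MINUS removes nothing.
  if (joinColumns.empty()) return a;

  // Projects a row onto its join columns; side 0 is A, side 1 is B.
  auto keyOf = [&joinColumns](const Row& row, std::size_t side) {
    Row key;
    for (const auto& jc : joinColumns) {
      assert(jc[side] < row.size());
      key.push_back(row[jc[side]]);
    }
    return key;
  };

  // The merge below is only correct if both sides are ordered by the same key.
  std::sort(a.begin(), a.end(), [&keyOf](const Row& x, const Row& y) {
    return keyOf(x, 0) < keyOf(y, 0);
  });
  std::sort(b.begin(), b.end(), [&keyOf](const Row& x, const Row& y) {
    return keyOf(x, 1) < keyOf(y, 1);
  });

  Table result;
  std::size_t ib = 0;
  for (const Row& rowA : a) {
    const Row keyA = keyOf(rowA, 0);
    // Skip all rows of B that are smaller than the current row of A.
    while (ib < b.size() && keyOf(b[ib], 1) < keyA) ++ib;
    // Keep the row of A only if B has no row with the same key.
    if (ib == b.size() || keyOf(b[ib], 1) != keyA) result.push_back(rowA);
  }
  return result;
}
```

If the two `std::sort` calls are dropped and B arrives ordered by a different column than A, the scan can skip past matching rows of B, which is the kind of wrong result described above.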

…floriankramer-minus # Conflicts: # src/engine/CMakeLists.txt # src/engine/IdTable.h # src/engine/QueryExecutionTree.h # src/engine/QueryPlanner.cpp # src/parser/ParsedQuery.cpp # src/parser/ParsedQuery.h # src/parser/SparqlLexer.cpp # src/parser/SparqlParser.cpp # src/util/HashSet.h # test/CMakeLists.txt

Removed one warning.

joka921

The existing code looks fine; I only had a few small nitpicks, which are fast to fix.

There is, however, still the serious bug that nobody sorts the inputs to the MINUS, which is required.

joka921 · 2021-05-06T08:38:47Z

e2e/scientists_queries.yaml

+      - num_cols: 1
+      - selected: ["?a"]
+      - contains_row: ["<Barney_Pell>"]
+      - contains_row: ["<Duc_Pham>"]


Add some more MINUS tests:

Empty first part

Empty second part

More than one triple in input

More than one triple in output

src/util/HashSet.h

src/parser/SparqlParser.cpp

src/engine/QueryPlanner.cpp

src/engine/Minus.cpp

joka921 · 2021-05-06T09:56:44Z

src/engine/Minus.cpp

+    size_t backIdx = result.size() - 1;
+    for (size_t col = 0; col < a.cols(); col++) {
+      result(backIdx, col) = a(ia, col);


Don't we have an efficient result.push_back(a(ia)) in the IdTable class?

joka921 · 2021-05-06T10:01:00Z

src/engine/Minus.cpp

+  result.reserve(result.size() + (a.size() - ia));
+  while (ia < a.size()) {
+    writeResult(ia);
+    ia++;
+  }


Doesn't the IdTable support a range insert() like std::vector::insert, or something similar?
If not, it probably should.
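As a point of comparison, this is how the "copy all remaining rows" loop looks with a range insert on a plain `std::vector`. Whether IdTable offers an equivalent is exactly the open question above, so `Table` and `appendRemainingRows` here are hypothetical stand-ins and say nothing about the IdTable interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Id = std::uint64_t;  // hypothetical stand-in for the engine's Id type
using Row = std::vector<Id>;
using Table = std::vector<Row>;

// Appends rows a[ia], a[ia + 1], ... to `result` with a single range insert.
void appendRemainingRows(Table& result, const Table& a, std::size_t ia) {
  assert(ia <= a.size());
  // std::vector::insert with an iterator range grows the vector once and
  // copies the whole tail, replacing the reserve + per-row loop.
  result.insert(result.end(),
                a.begin() + static_cast<std::ptrdiff_t>(ia), a.end());
}
```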

floriankramer requested a review from joka921 May 19, 2020 12:01

joka921 reviewed Jun 12, 2020


floriankramer force-pushed the minus branch from 673fb20 to ad643a0 Compare November 18, 2020 14:49

floriankramer added 8 commits November 20, 2020 11:38

Initial (faulty) implementation of minus.

6064447

Switched to a correct hashing based implementation for minus.

66274a5

Added Minus to the QueryPlanner.

fd630ae

Added Minus to the lexer.

5a51e14

Added an e2e query.

303d321

Fixed formatting.

8f33bd7

Updated group and order by in the lexer test.

8b11c0e

Removed support for ID_NO_VALUE from Minus.

fc34aaa

floriankramer force-pushed the minus branch from 85d4a13 to fc34aaa Compare November 20, 2020 10:39

Addressed review comments fixing several bugs.

fe25ef6

joka921 approved these changes Nov 26, 2020


Some cleanup for the review.

aac286f

floriankramer added 3 commits December 3, 2020 13:40

Switched minus to curly brace initialization for the Id type.

149c1d6

Reordered the minus header.

532d17b

Merge branch 'master' into minus

9b34c48

joka921 force-pushed the master branch from b553831 to 6680993 Compare May 5, 2021 08:18

joka921 added 2 commits May 6, 2021 10:22

Clang-format

387c563

joka921 added 4 commits May 6, 2021 10:49

Added Werror AND normal build.

6c6543f

correct json version againg

4ca96a7

No fast fail on github actions.

9337b9a

Removed one warning.

correct single include json version

aa3100b

joka921 requested changes May 6, 2021


joka921 added 10 commits May 6, 2021 12:24

Input to Minus is now enforced to be sorted!

d6c7f36

clang-format.

c11f208

We now have a single-header nlohmann

8f2d24e

Fixed the runtime INFO single json.

c8a40d5

tyring g++-10 and clang-10

c94c622

g++ 10 and clang 11

b790137

Fixed Compiler warning.

d296ea1

Changes from self-review + included Timeout support

d7dacb1

Fixed compilation

46fd433

Fixed Compilation (checkTimeout requires a non-static compute method)

642a6fb

joka921 merged commit 88e44fb into ad-freiburg:master May 6, 2021
