Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minus #340

Merged
merged 29 commits into from May 6, 2021
Merged

Minus #340

merged 29 commits into from May 6, 2021

Conversation

floriankramer
Copy link
Member

@floriankramer floriankramer commented May 19, 2020

This pr contains an implementation of Minus.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nitpicks and some questions.

As far as I understand it, this hashing approach is efficient for A MINUS {B} iff B is small in size,
But on the other hand it is now the only operation, that handles undefined values correct.
As soon as we have correct (Optional) Joins wrt undefined values, we could also add a "MINUS-JOIN" implementation that uses no space overhead and also the linear scanning approach.

src/engine/Minus.cpp Outdated Show resolved Hide resolved
src/engine/Minus.cpp Outdated Show resolved Hide resolved
src/engine/Minus.cpp Outdated Show resolved Hide resolved
src/engine/Minus.cpp Show resolved Hide resolved
src/engine/Minus.cpp Outdated Show resolved Hide resolved
src/engine/Minus.h Outdated Show resolved Hide resolved
src/engine/QueryExecutionTree.h Outdated Show resolved Hide resolved
src/engine/QueryPlanner.cpp Show resolved Hide resolved
src/engine/QueryPlanner.cpp Outdated Show resolved Hide resolved
src/engine/Minus.cpp Show resolved Hide resolved
@hannahbast
Copy link
Member

hannahbast commented Aug 5, 2020

The following simple (I thought) query takes very long on Wikidata. The line in the log where it stays for a very long time is "DEBUG: Computing minus of results of size 188,688 and 6,282,838".

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person1 ?person2 WHERE {
  ?x wdt:P26 ?y .
  MINUS { ?y wdt:P31 wd:Q5 } .
  ?x @en@rdfs:label ?person1 .
  ?y @en@rdfs:label ?person2 .
}

@graue70
Copy link
Contributor

graue70 commented Aug 31, 2020

This isn't available on https://qlever.cs.uni-freiburg.de yet, is it?

@hannahbast
Copy link
Member

hannahbast commented Sep 4, 2020

@graue70 We tried it, but the current implementation is too slow for Wikidata, so we took it back again for the Wikidata instance. The problem is that MINUS becomes complicated (and IMHO unintuitive) when OPTIONAL triples are involved (leading to empty values). This PR deals properly with that (I think), but at the price of an impractical running time. We need an implementation that runs fast in the standard case (without empty values).

What do you need the MINUS for? Maybe there is a reasonably simple equivalent way to achieve it without MINUS.

@graue70
Copy link
Contributor

graue70 commented Sep 5, 2020

What do you need the MINUS for? Maybe there is a reasonably simple equivalent way to achieve it without MINUS.

@hannahbast It's not super important, but I could decrease result sizes for my entity pre-computation queries (aliases etc.) if I could MINUS ?entity wdt:P31/wdt:P279* wd:Q17442446.

@floriankramer
Copy link
Member Author

@joka921 I've fixed all of the problems mentioned in the review comments again (must have lost the fixes somewhere among reverting to the older implementation.

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In General:
Really great, I'm trying to merge this to the Running instance to try it out.
There's only few nitpicks on details.

return;
} else {
// This case should never happen.
AD_CHECK(false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a compile-time version of this check.
static_assert(false) doesn't work, but we can both search for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the constraint A_WIDTH == OUT_WIDTH is true for all invocations of the function I removed a template parameter, making the check obsolete.

size_t backIdx = result.size() - 1;
for (size_t col = 0; col < a.cols(); col++) {
result(backIdx, col) = a(ia, col);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This exact code snippet appears twice, making it a copyRow(ia) -lambda would be cleaner.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

result.emplace_back();
size_t backIdx = result.size() - 1;
for (size_t col = 0; col < a.cols(); col++) {
result(backIdx, col) = a(ia, col);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and again....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

size_t ib, const vector<array<size_t, 2>>& joinColumns) {
for (size_t i = 1; i < joinColumns.size(); ++i) {
Id va = a(ia, joinColumns[i][0]);
Id vb = b(ib, joinColumns[i][1]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use auto, it will make life for me easier when possibly changing/renaming types

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For which of the types do you want me to use auto?
I personally am not a huge fan of auto, as I feel like it makes the code harder to read. I'd rather use type aliases, which can have a descriptive name while still allowing for easy type changes (such as the Id type).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, my argument was more for easiness of refactoring, but possibly it is also good to once look at all the changes
you have to do when changing such stuff.

Could I interest you in the uniform initialization syntax
Id va{a(ia, col[i[0])}
Then we would have all the benefits, since we also get errors for narrowing conversions.
(I will also try to use this in the future :))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joka921 I'm not a fan of the curly brace initialization, as it is a rather unusual syntax, but given that it has clear advantages in this case I'll switch over to it.

enum class RowComparison {
EQUAL,
NOT_EQUAL_LEFT_SMALLER,
NOT_EQUAL_RIGHT_SMALLER
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is wrong with EQUAL, SMALLER, GREATER ? Your choice seems somewhat verbose

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I got a bit carried away with trying to be verbose. I've removed the NOT_EQUAL_ but have kept the LEFT and RIGHT to make it immediately obvious which side is smaller / greater.

bool _multiplicitiesComputed;

std::vector<array<size_t, 2>> _matchedColumns;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something, that is never done until now in QLever but I have come to like the following format:

  • Members at the beginning ( no matter whether they are private or public) So one easily sees, how "big" and how "pointerish" a class is
  • Then public constructors, public methods, private methods.

This doesn't need to be changed here for merging, but what do you think about this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's interesting. It seems to be a good Idea in a codebase with lots of heavy data handling. I usually go for

  • Inner Types
  • Constructors
  • Public Methods
  • Private Methods
  • Private Fields

Which focuses more on the ways a user can interact with the class (it's public methods). While still keeping its data storage consice at the bottom. I think that picking a fixed order is definitely a good idea. For me personally the order I'm used too works well, and is the least effort. If you'd rather put the fields at the top I'm fine with that as well, I don't think the actual position matters to much, as long as it's on one end of the file and clearly grouped.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your order also works, since one can easily find the ending of a class and work their way up.
I just realized that we have many types where it goes

private:
// private members

// private methods

And then the data is somewhere unfindably spread over the middle.

@niklas88
Copy link
Member

Haven't looked at this enough to give a review but just wanted to say I like the progress I'm seeing you both making!
So @joka921 if you feel confident enough to give approval already I'd say go ahead and merge it.

@joka921
Copy link
Member

joka921 commented Nov 29, 2020

@niklas88
I have already integrated it into one of the running instances on our machines. So we will just let it run for some time, there are
even people who have immediate use cases for this, and if no apparent problems appear, we can still merge it.

@graue70 The wikidata-full instance on galera now runs with a MINUS support. Do you have some scripts where you can test them for performance and expected results?

@Theresa93 The same question.

@graue70
Copy link
Contributor

graue70 commented Dec 2, 2020

@graue70 The wikidata-full instance on galera now runs with a MINUS support. Do you have some scripts where you can test them for performance and expected results?

@joka921 Sorry, but I don't have the time to test this at the moment.

@hannahbast
Copy link
Member

The following query gives an incorrect result on Freebase Easy https://qlever.cs.uni-freiburg.de/Fbeasy

SELECT ?person ?country_of_nationality WHERE {
  ?person <Award_Won> <Nobel_Prize_in_Physics> .
  ?person <Country_of_nationality> ?country_of_nationality .
  MINUS {
    ?person <Country_of_nationality> ?country_of_nationality .
    ?country_of_nationality <Contained_by> <Europe> 
  }
}

Reason: The result of the two triples outside of the MINUS is a 2-column table ordered by ?person. The result of the two triples inside of the MINUS is a 2-column table ordered by ?country_of_nationality. Currently MINUS does not explicitly sort by respective join columns. So it works, when the two tables happen to be sorted in the right way. But for the above query, where this is not the case, it fails.

…floriankramer-minus

# Conflicts:
#	src/engine/CMakeLists.txt
#	src/engine/IdTable.h
#	src/engine/QueryExecutionTree.h
#	src/engine/QueryPlanner.cpp
#	src/parser/ParsedQuery.cpp
#	src/parser/ParsedQuery.h
#	src/parser/SparqlLexer.cpp
#	src/parser/SparqlParser.cpp
#	src/util/HashSet.h
#	test/CMakeLists.txt
Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing code looks fine, I only had a small nitpicks which are fast to fix.

There is however still the serious bug, that nobody sorts the inputs to the MINUS, which is requested.

- num_cols: 1
- selected: ["?a"]
- contains_row: ["<Barney_Pell>"]
- contains_row: ["<Duc_Pham>"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ADD Some more MINUS tests.

  • Empty first part
  • empty second part
  • more than one triple in input
  • more than one triple in output

src/util/HashSet.h Outdated Show resolved Hide resolved
src/parser/SparqlParser.cpp Outdated Show resolved Hide resolved
src/engine/QueryPlanner.cpp Outdated Show resolved Hide resolved
src/engine/QueryPlanner.cpp Outdated Show resolved Hide resolved
src/engine/Minus.cpp Outdated Show resolved Hide resolved
src/engine/Minus.cpp Outdated Show resolved Hide resolved
src/engine/Minus.cpp Outdated Show resolved Hide resolved
Comment on lines 152 to 154
size_t backIdx = result.size() - 1;
for (size_t col = 0; col < a.cols(); col++) {
result(backIdx, col) = a(ia, col);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we have efficient result.push_back(a(ia)) in the IdTable class?

Comment on lines +207 to +211
result.reserve(result.size() + (a.size() - ia));
while (ia < a.size()) {
writeResult(ia);
ia++;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the IdTable support std::insert() or something like this?
If not, it probably should.

@joka921 joka921 merged commit 88e44fb into ad-freiburg:master May 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants