New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minus #340
Minus #340
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nitpicks and some questions.
As far as I understand it, this hashing approach is efficient for A MINUS {B}
iff B is small in size,
But on the other hand it is now the only operation, that handles undefined values correct.
As soon as we have correct (Optional) Joins wrt undefined values, we could also add a "MINUS-JOIN" implementation that uses no space overhead and also the linear scanning approach.
The following simple (I thought) query takes very long on Wikidata. The line in the log where it stays for a very long time is "DEBUG: Computing minus of results of size 188,688 and 6,282,838".
|
This isn't available on https://qlever.cs.uni-freiburg.de yet, is it? |
@graue70 We tried it, but the current implementation is too slow for Wikidata, so we took it back again for the Wikidata instance. The problem is that MINUS becomes complicated (and IMHO unintuitive) when OPTIONAL triples are involved (leading to empty values). This PR deals properly with that (I think), but at the price of an impractical running time. We need an implementation that runs fast in the standard case (without empty values). What do you need the MINUS for? Maybe there is a reasonably simple equivalent way to achieve it without MINUS. |
@hannahbast It's not super important, but I could decrease result sizes for my entity pre-computation queries (aliases etc.) if I could MINUS |
@joka921 I've fixed all of the problems mentioned in the review comments again (must have lost the fixes somewhere among reverting to the older implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In General:
Really great, I'm trying to merge this to the Running instance to try it out.
There's only few nitpicks on details.
src/engine/Minus.cpp
Outdated
return; | ||
} else { | ||
// This case should never happen. | ||
AD_CHECK(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a compile-time version of this check.
static_assert(false)
doesn't work, but we can both search for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the constraint A_WIDTH
== OUT_WIDTH
is true for all invocations of the function I removed a template parameter, making the check obsolete.
src/engine/Minus.cpp
Outdated
size_t backIdx = result.size() - 1; | ||
for (size_t col = 0; col < a.cols(); col++) { | ||
result(backIdx, col) = a(ia, col); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exact code snippet appears twice, making it a copyRow(ia)
-lambda would be cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/engine/Minus.cpp
Outdated
result.emplace_back(); | ||
size_t backIdx = result.size() - 1; | ||
for (size_t col = 0; col < a.cols(); col++) { | ||
result(backIdx, col) = a(ia, col); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and again....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
src/engine/Minus.cpp
Outdated
size_t ib, const vector<array<size_t, 2>>& joinColumns) { | ||
for (size_t i = 1; i < joinColumns.size(); ++i) { | ||
Id va = a(ia, joinColumns[i][0]); | ||
Id vb = b(ib, joinColumns[i][1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use auto
, it will make life for me easier when possibly changing/renaming types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For which of the types do you want me to use auto?
I personally am not a huge fan of auto, as I feel like it makes the code harder to read. I'd rather use type aliases, which can have a descriptive name while still allowing for easy type changes (such as the Id
type).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, my argument was more for easiness of refactoring, but possibly it is also good to once look at all the changes
you have to do when changing such stuff.
Could I interest you in the uniform initialization syntax
Id va{a(ia, col[i[0])}
Then we would have all the benefits, since we also get errors for narrowing conversions.
(I will also try to use this in the future :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joka921 I'm not a fan of the curly brace initialization, as it is a rather unusual syntax, but given that it has clear advantages in this case I'll switch over to it.
src/engine/Minus.h
Outdated
enum class RowComparison { | ||
EQUAL, | ||
NOT_EQUAL_LEFT_SMALLER, | ||
NOT_EQUAL_RIGHT_SMALLER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is wrong with EQUAL, SMALLER, GREATER
? Your choice seems somewhat verbose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I got a bit carried away with trying to be verbose. I've removed the NOT_EQUAL_
but have kept the LEFT
and RIGHT
to make it immediately obvious which side is smaller / greater.
src/engine/Minus.h
Outdated
bool _multiplicitiesComputed; | ||
|
||
std::vector<array<size_t, 2>> _matchedColumns; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something, that is never done until now in QLever but I have come to like the following format:
- Members at the beginning ( no matter whether they are private or public) So one easily sees, how "big" and how "pointerish" a class is
- Then public constructors, public methods, private methods.
This doesn't need to be changed here for merging, but what do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's interesting. It seems to be a good Idea in a codebase with lots of heavy data handling. I usually go for
- Inner Types
- Constructors
- Public Methods
- Private Methods
- Private Fields
Which focuses more on the ways a user can interact with the class (it's public methods). While still keeping its data storage consice at the bottom. I think that picking a fixed order is definitely a good idea. For me personally the order I'm used too works well, and is the least effort. If you'd rather put the fields at the top I'm fine with that as well, I don't think the actual position matters to much, as long as it's on one end of the file and clearly grouped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your order also works, since one can easily find the ending of a class and work their way up.
I just realized that we have many types where it goes
private:
// private members
// private methods
And then the data is somewhere unfindably spread over the middle.
Haven't looked at this enough to give a review but just wanted to say I like the progress I'm seeing you both making! |
@niklas88 @graue70 The wikidata-full instance on galera now runs with a MINUS support. Do you have some scripts where you can test them for performance and expected results? @Theresa93 The same question. |
The following query gives an incorrect result on Freebase Easy https://qlever.cs.uni-freiburg.de/Fbeasy
Reason: The result of the two triples outside of the MINUS is a 2-column table ordered by ?person. The result of the two triples inside of the MINUS is a 2-column table ordered by ?country_of_nationality. Currently MINUS does not explicitly sort by respective join columns. So it works, when the two tables happen to be sorted in the right way. But for the above query, where this is not the case, it fails. |
…floriankramer-minus # Conflicts: # src/engine/CMakeLists.txt # src/engine/IdTable.h # src/engine/QueryExecutionTree.h # src/engine/QueryPlanner.cpp # src/parser/ParsedQuery.cpp # src/parser/ParsedQuery.h # src/parser/SparqlLexer.cpp # src/parser/SparqlParser.cpp # src/util/HashSet.h # test/CMakeLists.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing code looks fine, I only had a small nitpicks which are fast to fix.
There is however still the serious bug, that nobody sorts the inputs to the MINUS, which is requested.
- num_cols: 1 | ||
- selected: ["?a"] | ||
- contains_row: ["<Barney_Pell>"] | ||
- contains_row: ["<Duc_Pham>"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ADD Some more MINUS tests.
- Empty first part
- empty second part
- more than one triple in input
- more than one triple in output
src/engine/Minus.cpp
Outdated
size_t backIdx = result.size() - 1; | ||
for (size_t col = 0; col < a.cols(); col++) { | ||
result(backIdx, col) = a(ia, col); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have efficient result.push_back(a(ia)) in the IdTable
class?
result.reserve(result.size() + (a.size() - ia)); | ||
while (ia < a.size()) { | ||
writeResult(ia); | ||
ia++; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the IdTable
support std::insert()
or something like this?
If not, it probably should.
This pr contains an implementation of Minus.