New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disambiguate non-existing optional value and "" #107
Disambiguate non-existing optional value and "" #107
Conversation
So
Interestingly on both the concatenation of a missing value becomes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Functionality looks fine to me, but you could start eliminating the code duplicates immediately at the places where you change stuff.
src/engine/QueryExecutionTree.h
Outdated
os << "\"" << ad_utility::escapeForJson(entity.value()) << "\"]"; | ||
} else { | ||
os << "null]"; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not needed to introduce these duplications. You often have the workflow:
if optional value -> escape it for json (maybe convert to value)
else return "null";
This can perfectly be refactored into a function, with the "convert to value" (YES, NO, CHECK) being a parameter (runtime or template for efficiency).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That function sounds oddly specific and it would also need to handle the quotes. I actually thought about changing escapeForJson()
to take an const std::optional<string>&
and rename it to something like toJson()
but decided against it at the time to make it a less intrusive change. Then again that might actually be a good solution. It will still work for normal strings and it reflects that in JSON values can be null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, but the only thing we gain by your version, is that we know that "null]" does not need to be escaped. And I think there is not so much to gain there.
escapeForJson(entity.value_or("null")); as I see it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No my toJson()
function (just pushed) also adds the quotes. So if the optional ist empty we get "null"
otherwise "\"escaped string\""
src/index/Index.cpp
Outdated
if (id < _vocab.size()) { | ||
return _vocab[id]; | ||
} else if (id == ID_NO_VALUE) { | ||
return ""; | ||
return std::nullopt; | ||
} else { | ||
id -= _vocab.size(); | ||
AD_CHECK(id < _vocab.getExternalVocab().size()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a method of the Index and not of the Vocabulary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I guess because the _vocab
us more or less private
and the Index was intended as a single class interface to the whole index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but we can still implement it it the Vocabulary and wrap it in the index
Index::getOptionalForId(id) {return _vocab.getOptionalForId;}
In general, the Index class is already full enough (functionality AND Code duplications:))
src/engine/Filter.cpp
Outdated
return true; | ||
} | ||
return ad_utility::endsWith(entity.value(), | ||
this->_rhsString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about writing a Function, that returns this language filtering lambda? This is used multiple times again in the code below. We could e.g. write
getEngine().filter(*static_cast<vector<RT>*>(subRes->_fixedSizeData), getLanguageFilterLambda(l)
where getLanguageFilterLambda
is a member function of the filter class.
Additionally, is the captured variable l
not a numeric type and thus can/should be captured by value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This was not really possible in c++11 since returning a "lambda" without the overhead of a function was a pain, but now (since c++14) we can write
auto getLambda(int toCapture) {
return [this, toCapture](const int x) {return this._memberVal + x < toCapture; };
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, let me investigate. I was hesitant to make the changes too big but this is already somewhat intrusive so it might actually be a good time to clean this up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually found a way (pushed) to get rid of multiple lambdas as well as a lot of the repetition
src/engine/Filter.cpp
Outdated
return true; | ||
} | ||
return ad_utility::endsWith(entity.value(), | ||
this->_rhsString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reuse lambda here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the changes with the optional,
I just don't like them to be in the Vocabulary class (not only because I'm heavily restructuring there at the moment). But if this code finds a suitable place I'm really happy with this.
} else { | ||
return std::nullopt; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never thought about this, but now that I see it i don't like it:
We have to know, that with the externalized vocabulary (knowledge base vocab), we must NOT call operator[] from the outside but idToOptionalString but, if we know that nothing is externalized, we can call the operator[]. I think this is a recipe for desaster. I can agree to merge this for now, since it is not the fault of your PR but I suggest the following with my PrefixCompression:
There we either have to always check for _words.size()
or we completely disable the operator[] since we only can return strings because of the compression. And for uncompressed vocabularies (text vocab) we forbid externalization, so the operator[] or "idToOptionalStringRef" always works as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another argument: this is a really unexpected interface, if operator[] returns an optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok then I'd prefer to disable operator[]
in your PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we only leave idToOptionalStringRef (if uncompressed -> unexternalized) and idToOptionalString (if compressed -> possibly externalized).
AD_CHECK(id < _externalLiterals.size()); | ||
res = _externalLiterals[id]; | ||
} | ||
return res; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know how this works, and why you did this, but having ID_NO_VALUE in the Vocabulary class is unexpected to me:
This is a special property of the optional operator as I expected it and so I think the following should rather be in the class handling the optional:
if (id == ID_NO_VALUE) {
return std::nullopt;
} else {
return _vocab.idToString(id) // or operator[] depending on what can be used.
}
I'm sorry I didn't see this the first time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why you feel that it is unexpected. The problem is that idToString(id)
is called pretty much throughout the codebase whenever we go from id to a string and every time we do this it might be a non existant optional value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this is indeed a good argument.
//! I.e. it escapes illegal characters and adss quotes around the string. | ||
//! Taking a std::optional<string> it takes normal strings (which are implicitly | ||
//! converted or std::nullopt which it converts to 'null' (without quotes) | ||
inline string toJson(const std::optional<string>& orig); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that seems a good solution to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussion my last questions, LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not remember to check the Filter.cpp. But You absolutely found the right way to get rid of this code duplications.
src/engine/Filter.cpp
Outdated
AD_CHECK(_rhsId != std::numeric_limits<size_t>::max()); | ||
Id l = _lhsInd; | ||
Id r = _rhsInd; | ||
if (r == std::numeric_limits<Id>::max()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this an Id? it is an unsigned type and the index of result table that is to be computed. An Id is an internal numeric value for a given word in the Vocabulary. I think this could be misleading.
Until now non existing values from an
OPTIONAL
were handled asID_NO_VALUE
only up to the point where they get turned into strings viaidToString()
(inIndex
andResultTable
). There they just become""
.This has several problems, for one it makes it impossible to disambiguate between a value that is the empty string and one that is missing (see #106). It turns out, this is not only true when using the API but also inside QLever e.g. during
FILTER
when matching the language tag and duringGROUP BY
when doingGROUP_CONCAT()
.Sadly the fix is quite intrusive. So I want to get some eyes on the basic concept which is using
std::optional
(how fitting) when turning Ids into strings.At the moment this PR uses
null
instead of""
in the JSON output and let's non-existing values through the languageFILTER
. ForGROUP_CONCAT
I still just convert to""
. Will have to figure out what the correct behavior is here. Also while we are there it might make sense to clean up theFILTER
code and remove a lot of code duplication.