Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disambiguate non-existing optional value and "" #107

Merged
merged 3 commits into from Aug 27, 2018
Merged

Disambiguate non-existing optional value and "" #107

merged 3 commits into from Aug 27, 2018

Conversation

niklas88
Copy link
Member

Until now non existing values from an OPTIONAL were handled as ID_NO_VALUE only up to the point where they get turned into strings via idToString() (in Index and ResultTable). There they just become "".

This has several problems, for one it makes it impossible to disambiguate between a value that is the empty string and one that is missing (see #106). It turns out, this is not only true when using the API but also inside QLever e.g. during FILTER when matching the language tag and during GROUP BY when doing GROUP_CONCAT().

Sadly the fix is quite intrusive. So I want to get some eyes on the basic concept which is using std::optional (how fitting) when turning Ids into strings.

At the moment this PR uses null instead of "" in the JSON output and let's non-existing values through the language FILTER. For GROUP_CONCAT I still just convert to "". Will have to figure out what the correct behavior is here. Also while we are there it might make sense to clean up the FILTER code and remove a lot of code duplication.

@niklas88
Copy link
Member Author

So GROUP_CONCAT at least does the same as on WikiData's query service

SELECT ?x (GROUP_CONCAT(?y) AS ?partners) WHERE {
          ?x <is-a> <Scientist> .
          OPTIONAL {?x <Spouse_(or_domestic_partner)> ?y .}
          FILTER (?x < <Ada_Lovelace>)
}
GROUP BY ?x

Interestingly on both the concatenation of a missing value becomes ""

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Functionality looks fine to me, but you could start eliminating the code duplicates immediately at the places where you change stuff.

os << "\"" << ad_utility::escapeForJson(entity.value()) << "\"]";
} else {
os << "null]";
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not needed to introduce these duplications. You often have the workflow:
if optional value -> escape it for json (maybe convert to value)
else return "null";
This can perfectly be refactored into a function, with the "convert to value" (YES, NO, CHECK) being a parameter (runtime or template for efficiency).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That function sounds oddly specific and it would also need to handle the quotes. I actually thought about changing escapeForJson() to take an const std::optional<string>& and rename it to something like toJson() but decided against it at the time to make it a less intrusive change. Then again that might actually be a good solution. It will still work for normal strings and it reflects that in JSON values can be null.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but the only thing we gain by your version, is that we know that "null]" does not need to be escaped. And I think there is not so much to gain there.

escapeForJson(entity.value_or("null")); as I see it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No my toJson() function (just pushed) also adds the quotes. So if the optional ist empty we get "null" otherwise "\"escaped string\""

if (id < _vocab.size()) {
return _vocab[id];
} else if (id == ID_NO_VALUE) {
return "";
return std::nullopt;
} else {
id -= _vocab.size();
AD_CHECK(id < _vocab.getExternalVocab().size()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a method of the Index and not of the Vocabulary?

Copy link
Member Author

@niklas88 niklas88 Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I guess because the _vocab us more or less private and the Index was intended as a single class interface to the whole index

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but we can still implement it it the Vocabulary and wrap it in the index
Index::getOptionalForId(id) {return _vocab.getOptionalForId;}

In general, the Index class is already full enough (functionality AND Code duplications:))

return true;
}
return ad_utility::endsWith(entity.value(),
this->_rhsString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about writing a Function, that returns this language filtering lambda? This is used multiple times again in the code below. We could e.g. write
getEngine().filter(*static_cast<vector<RT>*>(subRes->_fixedSizeData), getLanguageFilterLambda(l)
where getLanguageFilterLambda is a member function of the filter class.
Additionally, is the captured variable l not a numeric type and thus can/should be captured by value?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This was not really possible in c++11 since returning a "lambda" without the overhead of a function was a pain, but now (since c++14) we can write

auto getLambda(int toCapture) {
  return [this, toCapture](const int x) {return this._memberVal + x < toCapture; };
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, let me investigate. I was hesitant to make the changes too big but this is already somewhat intrusive so it might actually be a good time to clean this up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually found a way (pushed) to get rid of multiple lambdas as well as a lot of the repetition

return true;
}
return ad_utility::endsWith(entity.value(),
this->_rhsString);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reuse lambda here

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the changes with the optional,
I just don't like them to be in the Vocabulary class (not only because I'm heavily restructuring there at the moment). But if this code finds a suitable place I'm really happy with this.

} else {
return std::nullopt;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never thought about this, but now that I see it i don't like it:
We have to know, that with the externalized vocabulary (knowledge base vocab), we must NOT call operator[] from the outside but idToOptionalString but, if we know that nothing is externalized, we can call the operator[]. I think this is a recipe for desaster. I can agree to merge this for now, since it is not the fault of your PR but I suggest the following with my PrefixCompression:
There we either have to always check for _words.size() or we completely disable the operator[] since we only can return strings because of the compression. And for uncompressed vocabularies (text vocab) we forbid externalization, so the operator[] or "idToOptionalStringRef" always works as expected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another argument: this is a really unexpected interface, if operator[] returns an optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok then I'd prefer to disable operator[] in your PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, we only leave idToOptionalStringRef (if uncompressed -> unexternalized) and idToOptionalString (if compressed -> possibly externalized).

AD_CHECK(id < _externalLiterals.size());
res = _externalLiterals[id];
}
return res;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know how this works, and why you did this, but having ID_NO_VALUE in the Vocabulary class is unexpected to me:
This is a special property of the optional operator as I expected it and so I think the following should rather be in the class handling the optional:

if (id == ID_NO_VALUE) {
  return std::nullopt;
} else {
  return _vocab.idToString(id) // or operator[] depending on what can be used.
}

I'm sorry I didn't see this the first time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why you feel that it is unexpected. The problem is that idToString(id) is called pretty much throughout the codebase whenever we go from id to a string and every time we do this it might be a non existant optional value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is indeed a good argument.

//! I.e. it escapes illegal characters and adss quotes around the string.
//! Taking a std::optional<string> it takes normal strings (which are implicitly
//! converted or std::nullopt which it converts to 'null' (without quotes)
inline string toJson(const std::optional<string>& orig);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems a good solution to me!

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion my last questions, LGTM

Copy link
Member

@joka921 joka921 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not remember to check the Filter.cpp. But You absolutely found the right way to get rid of this code duplications.

AD_CHECK(_rhsId != std::numeric_limits<size_t>::max());
Id l = _lhsInd;
Id r = _rhsInd;
if (r == std::numeric_limits<Id>::max()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this an Id? it is an unsigned type and the index of result table that is to be computed. An Id is an internal numeric value for a given word in the Vocabulary. I think this could be misleading.

@niklas88 niklas88 removed the request for review from floriankramer August 27, 2018 08:09
@niklas88 niklas88 merged commit 7a25341 into ad-freiburg:master Aug 27, 2018
@niklas88 niklas88 deleted the fix_optional_disambiguity branch October 31, 2018 09:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants