New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Out-of-Vocab for VALUES is not an error #309
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ 5dbe3fc Thank you for this PR! I have a question concerning the proposed handling of OOV values. Please correct me if I am misunderstanding something.
Is the VALUES clause (currently) the only way that OOV IDs can find their way into an intermediate result?
If not, how else can it happen?
If yes, is it the right way to filter them out after the event instead of not adding them in the first place?
test/StringSortComparatorTest.cpp
Outdated
@@ -41,6 +41,22 @@ TEST(LocaleManagerTest, Punctuation) { | |||
} | |||
} | |||
|
|||
TEST(LocaleManagerTest, Normalization) { | |||
// é as single codepoints | |||
const char a[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simpler: ... = "\xc3\xa9"
test/StringSortComparatorTest.cpp
Outdated
// é as single codepoints | ||
const char a[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0 }; | ||
// é as e + accent aigu | ||
const char b[] = {'e', static_cast<char>(0xCC), static_cast<char>(0x81), 0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dito: ... = "e\xcc\x81"
src/engine/Server.cpp
Outdated
_requestProcessingTimer.cont(); | ||
qet.writeResultToStreamAsJson(os, query._selectedVariables, limit, offset, | ||
maxSend); | ||
j["res"] = json::parse(os.str()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this efficient for a huge result (e.g. millions or billions of triples)?
However, the UI only asks for the Top-X results and big results are usually exported as CSV or TSV
|
||
return os.str(); | ||
return j.dump(4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the comment above
src/engine/Values.cpp
Outdated
} | ||
numActuallyWritten++; | ||
skipRow:; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the indentation correct for this label?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
test/StringSortComparatorTest.cpp
Outdated
@@ -41,6 +41,22 @@ TEST(LocaleManagerTest, Punctuation) { | |||
} | |||
} | |||
|
|||
TEST(LocaleManagerTest, Normalization) { | |||
// é as single codepoints | |||
const char a[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simplify to: ... = "\xc3\xa9"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log of the 1-1 code review between Hannah and Johannes on 20.01.2020. Minor comments in the code, here are some meta-level comments, partly for Hannah to understand what has been done.
@ Use of identical level: Johannes convinced me that it is very helpful to have deterministic index builds. This requires the identical levels (there are numerous pairs of literals which are non-equal only on the identical level). The performance indeed suffers (Johannes says a sort of many strings is 5-10 slower with ICU than with plain ASCII sorting, but already for the non-identical level). However, Johannes plans another PR (or rather, or modified version of PR 302), which will sort using ICU sort keys. The sort keys will be generated at a time when the index builder is busy with IO.
@ Unicode normalization: Normalize all elements of all triples, using a wrapper function to ICU's normalization function (which has a rather involved interface and has a state which requires an explicit initialization). The distinction between for example an "e" combined with a "´" and a "é" gets lost (two difference encodings in Unicode), but that is exactly what we want.
@ QueryExecutionTree traversal: Needed a possibility to traverse the tree and collect the warnings. So far, the nodes of the execution tree were operations, which contained pointers to their children depending on the operation. But no operation-independent way of traversal.
@ OOV: So far, when a OOV value was specified in the VALUES clause, an exception was thrown. Now there is a warning in the JSON response (which the e2e test indeed tests) and that value is simply ignored. @jbuerklin: It would be good to have a possibility in the QLeverUI to see any warnings that were sent in the JSON. Suggestions: have a small but visible signal somewhere that there were warnings + include them as part of the information given when clicking on "Analyze".
Ok, The ICU-Level discussion goes on: @hannahbast @niklas88 Another thing that is already open for review are the changes in |
Ok so now i have
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work as always!
e2e/queryit.py
Outdated
for requested_warning in value: | ||
found = False | ||
for actual_warning in result["warnings"]: | ||
if actual_warning.startswith(requested_warning): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.startswith()
is already a bit fuzzy, why not just do .contains()
and then it is all about containing a warning
@@ -145,144 +147,22 @@ class QueryExecutionTree { | |||
size_t _sizeEstimate; | |||
|
|||
std::shared_ptr<const ResultTable> _cachedResult = nullptr; | |||
void writeJsonTable( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay, no more manual JSON building, very good!
Different encodings of the same codepoint (e.g. é and e + accent) now are internally always mapped to the same representation.
…ues' bug - The Operation classes now have a getChildren() method that returns non-owning pointers to all of their children - Each Operation holds a vector of strings to store warnings that are emitted during the result computation. Those can be recursively retrieved using the collectWarnings() function that internally uses the getChildren() function mentioned above Fixed the Values out-of-vocab bug - Out-of-vocab entries in value clauses previously triggered exceptions. - Now rows that contain unknown words are ignored and trigger a warning in the result json. - This is also tested in the end-to-end test
- Previously the JSON was created manually - Now nlohmann::json is used in every step - Typically, only small results or small parts of results are retrieved via JSON, so this step is not too performance critical
442ad7b
to
7253f8b
Compare
@hannahbast have your questions/requested changes been answered? I think this can be merged |
This fixes #290.
When an out-of-vocab entity is used in a
VALUES
clause this previously triggered an exception which is wrong. This PR now ignores the affected rows and emits warnings that are passed to the user in theJson
response.Internally this implements a general mechanism to transport warnings for all kinds of operations.