Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out-of-Vocab for VALUES is not an error #309

Merged
merged 3 commits into from Mar 11, 2020

Conversation

joka921
Copy link
Member

@joka921 joka921 commented Jan 16, 2020

This fixes #290.

When an out-of-vocab entity is used in a VALUES clause this previously triggered an exception which is wrong. This PR now ignores the affected rows and emits warnings that are passed to the user in the
Json response.

Internally this implements a general mechanism to transport warnings for all kinds of operations.

Copy link
Member

@hannahbast hannahbast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ 5dbe3fc Thank you for this PR! I have a question concerning the proposed handling of OOV values. Please correct me if I am misunderstanding something.

Is the VALUES clause (currently) the only way that OOV IDs can find their way into an intermediate result?

If not, how else can it happen?

If yes, is it the right way to filter them out after the event instead of not adding them in the first place?

@@ -41,6 +41,22 @@ TEST(LocaleManagerTest, Punctuation) {
}
}

TEST(LocaleManagerTest, Normalization) {
// é as single codepoints
const char a[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0 };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler: ... = "\xc3\xa9"

// é as single codepoints
const char a[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0 };
// é as e + accent aigu
const char b[] = {'e', static_cast<char>(0xCC), static_cast<char>(0x81), 0};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dito: ... = "e\xcc\x81"

_requestProcessingTimer.cont();
qet.writeResultToStreamAsJson(os, query._selectedVariables, limit, offset,
maxSend);
j["res"] = json::parse(os.str());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this efficient for a huge result (e.g. millions or billions of triples)?

However, the UI only asks for the Top-X results and big results are usually exported as CSV or TSV


return os.str();
return j.dump(4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment above

}
numActuallyWritten++;
skipRow:;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the indentation correct for this label?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@@ -41,6 +41,22 @@ TEST(LocaleManagerTest, Punctuation) {
}
}

TEST(LocaleManagerTest, Normalization) {
// é as single codepoints
const char a[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplify to: ... = "\xc3\xa9"

Copy link
Member

@hannahbast hannahbast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Log of the 1-1 code review between Hannah and Johannes on 20.01.2020. Minor comments in the code, here are some meta-level comments, partly for Hannah to understand what has been done.

@ Use of identical level: Johannes convinced me that it is very helpful to have deterministic index builds. This requires the identical levels (there are numerous pairs of literals which are non-equal only on the identical level). The performance indeed suffers (Johannes says a sort of many strings is 5-10 slower with ICU than with plain ASCII sorting, but already for the non-identical level). However, Johannes plans another PR (or rather, or modified version of PR 302), which will sort using ICU sort keys. The sort keys will be generated at a time when the index builder is busy with IO.

@ Unicode normalization: Normalize all elements of all triples, using a wrapper function to ICU's normalization function (which has a rather involved interface and has a state which requires an explicit initialization). The distinction between for example an "e" combined with a "´" and a "é" gets lost (two difference encodings in Unicode), but that is exactly what we want.

@ QueryExecutionTree traversal: Needed a possibility to traverse the tree and collect the warnings. So far, the nodes of the execution tree were operations, which contained pointers to their children depending on the operation. But no operation-independent way of traversal.

@ OOV: So far, when a OOV value was specified in the VALUES clause, an exception was thrown. Now there is a warning in the JSON response (which the e2e test indeed tests) and that value is simply ignored. @jbuerklin: It would be good to have a possibility in the QLeverUI to see any warnings that were sent in the JSON. Suggestions: have a small but visible signal somewhere that there were warnings + include them as part of the information given when clicking on "Analyze".

@joka921
Copy link
Member Author

joka921 commented Jan 21, 2020

Ok, The ICU-Level discussion goes on: @hannahbast @niklas88
After rethinking this again I found the following: We need the IDENTICAL Level always.
As soon as there can be words, that compare Equal but are not our vocabulary breaks apart. This is a bug that does not often occur in "typical" queries but it is a bug nonetheless.
(we strictly need to consecutive entries in the vocabulary to compare not equal. I am even thinking about enforcing it by additionally running strcmp if the Unicode collation returns equal. That way we might save the Quarternary Level (I just got the idea while writing this). But I also first want to discuss this.

Another thing that is already open for review are the changes in QueryExecutionTree.h/.cpp and Server.cpp where I refactored the Json pipeline to consistently use nlohmann::json.

@joka921 joka921 requested a review from niklas88 January 21, 2020 20:47
@hannahbast
Copy link
Member

@joka921 @niklas88 Thanks, Johannes! Can you be more specific with what you mean by "our vocabulary breaks apart", what exactly happens?

@joka921
Copy link
Member Author

joka921 commented Jan 24, 2020

Ok so now i have

  • Adressed the small wishes from @hannahbast 's review.
  • Removed all Unicode-fixing business out of this PR and refactored it to another branch since those
    are completely different issues.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work as always!

e2e/queryit.py Outdated
for requested_warning in value:
found = False
for actual_warning in result["warnings"]:
if actual_warning.startswith(requested_warning):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.startswith() is already a bit fuzzy, why not just do .contains() and then it is all about containing a warning

@@ -145,144 +147,22 @@ class QueryExecutionTree {
size_t _sizeEstimate;

std::shared_ptr<const ResultTable> _cachedResult = nullptr;
void writeJsonTable(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay, no more manual JSON building, very good!

Different encodings of the same codepoint (e.g. é and e + accent) now are internally always mapped to the same representation.
…ues' bug

- The Operation classes now have a getChildren()
  method that returns non-owning pointers to all of their children

- Each Operation holds a vector of strings to store warnings that are emitted during the result computation. Those can be recursively retrieved
  using the collectWarnings() function that internally uses the getChildren()
  function mentioned above

Fixed the Values out-of-vocab bug

- Out-of-vocab entries in value clauses previously triggered exceptions.
- Now rows that contain unknown words are ignored and trigger a warning in the result json.
- This is also tested in the end-to-end test
- Previously the JSON was created manually
- Now nlohmann::json is used in every step
- Typically, only small results or small parts of results are retrieved via JSON, so this step is not too performance critical
@niklas88
Copy link
Member

niklas88 commented Feb 20, 2020

@hannahbast have your questions/requested changes been answered? I think this can be merged

@niklas88 niklas88 merged commit feb3f39 into ad-freiburg:master Mar 11, 2020
@joka921 joka921 deleted the f.NoValuesError branch August 24, 2022 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

If VALUES result set is empty, an exception is thrown
3 participants