Out-of-Vocab for VALUES is not an error #309

joka921 · 2020-01-16T17:09:00Z

This fixes #290.

Previously, using an out-of-vocab entity in a VALUES clause triggered an exception, which is wrong. This PR now ignores the affected rows and emits warnings that are passed to the user in the JSON response.

Internally this implements a general mechanism to transport warnings for all kinds of operations.

hannahbast

@ 5dbe3fc Thank you for this PR! I have a question concerning the proposed handling of OOV values. Please correct me if I am misunderstanding something.

Is the VALUES clause (currently) the only way that OOV IDs can find their way into an intermediate result?

If not, how else can it happen?

If yes, is it the right way to filter them out after the event instead of not adding them in the first place?

hannahbast · 2020-01-20T14:08:35Z

test/StringSortComparatorTest.cpp

@@ -41,6 +41,22 @@ TEST(LocaleManagerTest, Punctuation) {
  }
 }

+TEST(LocaleManagerTest, Normalization) {
+  // é as single codepoints
+  const char a[]  = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0 };


Simpler: ... = "\xc3\xa9"

hannahbast · 2020-01-20T14:08:53Z

test/StringSortComparatorTest.cpp

+  // é as single codepoints
+  const char a[]  = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0 };
+  // é as e + accent aigu
+  const char b[] = {'e', static_cast<char>(0xCC), static_cast<char>(0x81), 0};


Ditto: ... = "e\xcc\x81"
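
To illustrate the suggested simplification, here is a minimal sketch that checks that the string-literal form yields the same bytes as the explicit char array. The names kVerbose, kLiteral and sameBytes are made up for this example and do not appear in the test file. One caveat: a hex escape consumes all hex digits that follow it, so the literal form only works unchanged when the next character is not a hex digit, which is the case here.

```cpp
#include <cassert>
#include <cstring>

// é as one precomposed code point, U+00E9 (UTF-8 bytes C3 A9).
const char kVerbose[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0};
const char kLiteral[] = "\xc3\xa9";

// é as 'e' followed by the combining acute accent, U+0301 (UTF-8 bytes CC 81).
const char kVerboseDecomposed[] = {'e', static_cast<char>(0xCC),
                                   static_cast<char>(0x81), 0};
const char kLiteralDecomposed[] = "e\xcc\x81";

// True if both null-terminated strings consist of exactly the same bytes.
bool sameBytes(const char* x, const char* y) { return std::strcmp(x, y) == 0; }
```

Both spellings of each string compare byte-equal, while the precomposed and the decomposed form differ from each other. That difference is what the Normalization test is about.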

hannahbast · 2020-01-20T14:26:39Z

src/engine/Server.cpp

+    _requestProcessingTimer.cont();
+    qet.writeResultToStreamAsJson(os, query._selectedVariables, limit, offset,
+                                  maxSend);
+    j["res"] = json::parse(os.str());


Is this efficient for a huge result (e.g. millions or billions of triples)?

However, the UI only asks for the top-X results, and big results are usually exported as CSV or TSV.

hannahbast · 2020-01-20T14:27:02Z

src/engine/Server.cpp


-  return os.str();
+  return j.dump(4);


See the comment above

hannahbast · 2020-01-20T14:31:21Z

src/engine/Values.cpp

    }
+    numActuallyWritten++;
+  skipRow:;


Is the indentation correct for this label?

hannahbast · 2020-01-20T14:35:57Z

test/StringSortComparatorTest.cpp

@@ -41,6 +41,22 @@ TEST(LocaleManagerTest, Punctuation) {
  }
 }

+TEST(LocaleManagerTest, Normalization) {
+  // é as single codepoints
+  const char a[] = {static_cast<char>(0xC3), static_cast<char>(0xA9), 0};


Simplify to: ... = "\xc3\xa9"

hannahbast

Log of the 1-1 code review between Hannah and Johannes on 20.01.2020. Minor comments are in the code; here are some meta-level comments, partly for Hannah to understand what has been done.

@ Use of identical level: Johannes convinced me that it is very helpful to have deterministic index builds. This requires the identical level (there are numerous pairs of literals which are non-equal only on the identical level). The performance indeed suffers (Johannes says a sort of many strings is 5-10 times slower with ICU than with plain ASCII sorting, but that is already the case for the non-identical levels). However, Johannes plans another PR (or rather, a modified version of PR 302), which will sort using ICU sort keys. The sort keys will be generated at a time when the index builder is busy with IO.

@ Unicode normalization: Normalize all elements of all triples, using a wrapper function around ICU's normalization function (which has a rather involved interface and has a state which requires an explicit initialization). The distinction between, for example, an "e" combined with a "´" and a "é" gets lost (two different encodings in Unicode), but that is exactly what we want.

@ QueryExecutionTree traversal: We needed a way to traverse the tree and collect the warnings. So far, the nodes of the execution tree were operations, which contained pointers to their children depending on the operation, but there was no operation-independent way of traversal.

@ OOV: So far, when an OOV value was specified in the VALUES clause, an exception was thrown. Now there is a warning in the JSON response (which the e2e test indeed tests) and that value is simply ignored. @jbuerklin: It would be good to have a possibility in the QLeverUI to see any warnings that were sent in the JSON. Suggestions: have a small but visible signal somewhere that there were warnings + include them as part of the information given when clicking on "Analyze".
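
The OOV behaviour described in this log can be sketched roughly as follows. This is not the code from src/engine/Values.cpp: the types Id, Vocabulary and ValuesResult, the function computeValues and the wording of the warning are all assumptions made for this example. It also shows one way to skip a row with a flag instead of a goto label.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the engine's ID and vocabulary types.
using Id = std::size_t;
using Vocabulary = std::unordered_map<std::string, Id>;

struct ValuesResult {
  std::vector<std::vector<Id>> rows;   // rows whose words were all known
  std::vector<std::string> warnings;   // one entry per skipped row
};

// Translate each row of words to IDs. A row that contains an unknown word is
// dropped and reported as a warning instead of raising an exception.
ValuesResult computeValues(
    const Vocabulary& vocab,
    const std::vector<std::vector<std::string>>& inputRows) {
  ValuesResult result;
  for (const auto& row : inputRows) {
    std::vector<Id> ids;
    ids.reserve(row.size());
    bool allKnown = true;
    for (const auto& word : row) {
      const auto it = vocab.find(word);
      if (it == vocab.end()) {
        result.warnings.push_back("The word " + word +
                                  " is not part of the vocabulary."
                                  " Ignoring this row.");
        allKnown = false;
        break;  // the rest of the row does not matter any more
      }
      ids.push_back(it->second);
    }
    if (allKnown) {
      result.rows.push_back(std::move(ids));
    }
  }
  return result;
}
```

The caller can then attach result.warnings to the JSON response, so the user learns why a row is missing from the result.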

joka921 · 2020-01-21T20:46:46Z

OK, the ICU level discussion goes on: @hannahbast @niklas88
After rethinking this again, I found the following: we always need the IDENTICAL level.
As soon as there can be words that compare equal but are not identical, our vocabulary breaks apart. This is a bug that does not often occur in "typical" queries, but it is a bug nonetheless.
We strictly need two consecutive entries in the vocabulary to compare not equal. I am even thinking about enforcing this by additionally running strcmp if the Unicode collation returns equal. That way we might save the Quaternary level (I just got the idea while writing this). But I also want to discuss this first.
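
The tie-break idea could look roughly like the sketch below. It is only an illustration under stated assumptions: collate() is a stand-in that compares ASCII case-insensitively and is NOT the ICU collator, and the names collate and vocabLess are invented for this example. The point is only the second step: when the collation reports "equal", a byte-wise comparison decides, so two distinct words never compare equal.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a Unicode collation (assumption: NOT ICU). Compares ASCII
// case-insensitively and returns a negative value, 0, or a positive value.
int collate(const std::string& a, const std::string& b) {
  const std::size_t n = std::min(a.size(), b.size());
  for (std::size_t i = 0; i < n; ++i) {
    const int ca = std::tolower(static_cast<unsigned char>(a[i]));
    const int cb = std::tolower(static_cast<unsigned char>(b[i]));
    if (ca != cb) return ca < cb ? -1 : 1;
  }
  if (a.size() == b.size()) return 0;
  return a.size() < b.size() ? -1 : 1;
}

// Strict ordering for vocabulary entries: use the collation first and fall
// back to a byte-wise comparison when the collation returns "equal".
bool vocabLess(const std::string& a, const std::string& b) {
  const int c = collate(a, b);
  if (c != 0) return c < 0;
  return a.compare(b) < 0;  // byte-wise tie-break, like strcmp
}
```

With such a comparator, std::sort(words.begin(), words.end(), vocabLess) is deterministic, and after removing exact duplicates no two neighbouring entries are equivalent under the ordering.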

Another thing that is already open for review is the set of changes in QueryExecutionTree.h/.cpp and Server.cpp, where I refactored the JSON pipeline to consistently use nlohmann::json.

hannahbast · 2020-01-22T07:53:20Z

@joka921 @niklas88 Thanks, Johannes! Can you be more specific about what you mean by "our vocabulary breaks apart"? What exactly happens?

joka921 · 2020-01-24T08:49:19Z

OK, so now I have:

- Addressed the small wishes from @hannahbast's review.
- Removed all Unicode-fixing business from this PR and moved it to another branch, since those are completely different issues.

niklas88

Great work as always!

niklas88 · 2020-01-25T20:10:52Z

e2e/queryit.py

+            for requested_warning in value:
+                found = False
+                for actual_warning in result["warnings"]:
+                    if actual_warning.startswith(requested_warning):


.startswith() is already a bit fuzzy; why not just do a contains check (in Python: requested_warning in actual_warning), so that it is all about containing a warning?

niklas88 · 2020-01-25T20:13:15Z

src/engine/QueryExecutionTree.h

@@ -145,144 +147,22 @@ class QueryExecutionTree {
  size_t _sizeEstimate;

  std::shared_ptr<const ResultTable> _cachedResult = nullptr;
-  void writeJsonTable(


Yay, no more manual JSON building, very good!

Different encodings of the same codepoint (e.g. é and e + accent) now are internally always mapped to the same representation.

…ues' bug

- The Operation classes now have a getChildren() method that returns non-owning pointers to all of their children.
- Each Operation holds a vector of strings to store warnings that are emitted during the result computation. Those can be recursively retrieved using the collectWarnings() function, which internally uses the getChildren() function mentioned above.

Fixed the Values out-of-vocab bug

- Out-of-vocab entries in value clauses previously triggered exceptions.
- Now rows that contain unknown words are ignored and trigger a warning in the result JSON.
- This is also tested in the end-to-end test.
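
A minimal sketch of the traversal mechanism described in this commit message, under stated assumptions: only the names getChildren() and collectWarnings() come from the text above. The class layout, addWarning(), and the concrete Scan and Join classes are invented for illustration and do not claim to match the actual QLever code.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

class Operation {
 public:
  virtual ~Operation() = default;

  // Non-owning pointers to all children, independent of the operation type.
  virtual std::vector<Operation*> getChildren() = 0;

  // Hypothetical helper: record a warning during result computation.
  void addWarning(std::string warning) {
    _warnings.push_back(std::move(warning));
  }

  // Own warnings first, then recursively those of all children.
  std::vector<std::string> collectWarnings() {
    std::vector<std::string> result = _warnings;
    for (Operation* child : getChildren()) {
      if (child == nullptr) continue;
      const auto fromChild = child->collectWarnings();
      result.insert(result.end(), fromChild.begin(), fromChild.end());
    }
    return result;
  }

 private:
  std::vector<std::string> _warnings;
};

// A leaf operation without children (hypothetical).
class Scan : public Operation {
 public:
  std::vector<Operation*> getChildren() override { return {}; }
};

// A binary operation that owns its two children (hypothetical).
class Join : public Operation {
 public:
  Join(std::unique_ptr<Operation> left, std::unique_ptr<Operation> right)
      : _left(std::move(left)), _right(std::move(right)) {}
  std::vector<Operation*> getChildren() override {
    return {_left.get(), _right.get()};
  }

 private:
  std::unique_ptr<Operation> _left;
  std::unique_ptr<Operation> _right;
};
```

Because collectWarnings() only relies on getChildren(), a new operation type only has to implement getChildren() to take part in the warning transport.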

- Previously the JSON was created manually.
- Now nlohmann::json is used in every step.
- Typically, only small results or small parts of results are retrieved via JSON, so this step is not too performance critical.

niklas88 · 2020-02-20T10:40:31Z

@hannahbast have your questions/requested changes been answered? I think this can be merged

hannahbast reviewed Jan 18, 2020

hannahbast reviewed Jan 20, 2020

hannahbast requested changes Jan 20, 2020

joka921 requested a review from niklas88 January 21, 2020 20:47

joka921 requested a review from hannahbast January 24, 2020 08:48

niklas88 approved these changes Jan 25, 2020

joka921 added 3 commits January 26, 2020 11:06

Implemented and tested Unicode Normalization

f439dff

The whole Result to JSON pipeline now uses nlohmann::json.

7253f8b

joka921 force-pushed the f.NoValuesError branch from 442ad7b to 7253f8b Compare January 26, 2020 10:22

niklas88 merged commit feb3f39 into ad-freiburg:master Mar 11, 2020

joka921 deleted the f.NoValuesError branch August 24, 2022 09:14
