Added a prefix filter and support for different column types to filters #145
Conversation
LGTM except for missing e2e prefix queries
- num_rows: 1
- num_cols: 2
- selected: ["?profession", "?awards"]
- contains_row: ["<Apothecary>", "<Victoria_Cross>"]
I think we are still missing e2e queries for prefix filtering
There is a 'regex-albert-physics-award' query that uses the prefix filter. I can add more, but I couldn't think of a significantly different test case for prefix filters.
-template <class RT>
+template <class RT, ResultTable::ResultType T>
 vector<RT>* Filter::computeFilter(vector<RT>* res, size_t l, size_t r,
                                   shared_ptr<const ResultTable> subRes) const {
You already renamed _lhsInd to _lhs above; we may want to rename the l here too, as it could easily be misunderstood as just filtering a range, with l meaning left.
done
src/engine/Filter.cpp
Outdated
  }
  return res;
}

template <class RT>
vector<RT>* Filter::computeFilter(vector<RT>* res, size_t l, size_t r,
I like this cleanup with templating over the ResultType and using ValueReader, thank you for that.
src/engine/Filter.cpp
Outdated
size_t lowerBound = getIndex().getVocab().getValueIdForGE(rhs);
std::string upperBoundStr = rhs;
// TODO(florian): This only works for ascii but could break unicode
upperBoundStr[upperBoundStr.size() - 1]++;
Since we are sorting lexicographically by bytes and ignore the encoding, the only problem is that char may be signed and thus could register as < while technically being larger, right? Also, if it were unsigned we would only have to make sure that we don't increment through the overflow at 255, right? Though I think 255 shouldn't really appear in UTF-8 text. I'm not sure how to get this working with implementation-defined char signedness, however, or even how string sorting really works with signed char.
I did some testing with GCC and the sorting of strings with Unicode characters. GCC uses signed chars by default, but the sorting works without any issues, and even when incrementing the lower byte of '¿', which makes it an invalid UTF-8 character, the sorting still worked fine. I'd agree that 255 should not be an issue, as it is not a valid UTF-8 byte. Considering this seems to work well in practice, I'd just remove the TODO.
Sounds good. I guess the string sorting does something like casting to unsigned char, because otherwise the order too would be platform defined, which sounds like a bad idea. From my experience on Linux, char is signed on x86_64 (unless passing -funsigned-char to GCC/clang). On ARM and ARM64 (e.g. on my RaspberryPi), on the other hand, char is always unsigned. Interestingly enough, this isn't just a typedef
                                  shared_ptr<const ResultTable> subRes) const {
  ResultTable::ResultType resultType = subRes->getResultType(l);
  // Catch some unsupported combinations
  if (resultType != ResultTable::ResultType::KB &&
What about GROUP_CONCAT? That's still a string, but not KB?
I had forgotten about that, it's included in the next commit.
src/engine/Filter.cpp
Outdated
// Find a matching entry in subRes' _localVocab. If _rhs is not in the
// _localVocab of subRes, r will be equal to _localVocab.size() and
// not match the index of any entry in _localVocab.
for (r = 0; r < subRes->_localVocab->size(); r++) {
This is the GROUP_CONCAT case I was wondering about, right?
If the local vocab is ordered we could optimize this as a binary search, so maybe add a TODO?
The problem with keeping the local vocabulary ordered is that every reordering of the vocabulary would also require rewriting all local_vocab columns, as the entries in those columns are just offsets into the _localVocab vector. Right now, sorting the local vocabulary would only help us with equality filters on local_vocab columns, which should be a rather rare use case, so I believe we would overall lose performance from keeping the local vocabulary sorted.
Makes sense, then let's keep it as is
@@ -11,7 +11,8 @@ ResultTable::ResultTable()
       _sortedBy(0),
       _varSizeData(),
       _fixedSizeData(nullptr),
-      _status(ResultTable::OTHER) {}
+      _status(ResultTable::OTHER),
+      _localVocab(std::make_shared<std::vector<std::string>>()) {}
What's the advantage of having this as a shared_ptr?
It prevents copying the contents of _localVocab after evaluating a subtree. As elements in the vocabulary are currently never moved or deleted, this should not create any problems.
Ok, I think this warrants a comment. Also, can't we just std::move() the _localVocab?
I think we could use std::move; the only advantage of using shared_ptr that I can think of right now is that the old ResultTable does not lose access to the vocabulary after it was copied. That would probably not be a problem right now, but cost-wise shared pointers should be very close to std::move, and they avoid problems with interpreting the old results after the vocabulary has already been moved.
src/index/Vocabulary.h
Outdated
@@ -207,7 +207,7 @@ class Vocabulary {
   Id getValueIdForLE(const string& indexWord) const {
     Id lb = lower_bound(indexWord);
-    if (at(lb) != indexWord && lb > 0) {
+    if ((lb < _words.size() && at(lb) != indexWord) && lb > 0) {
Shouldn't the lb > 0 come before accessing at(lb), so the short circuit saves us from an illegal access?
lb is unsigned, so this should not be a problem. The lb > 0 check is only there to avoid an underflow when decreasing lb later on.
I see
src/parser/SparqlParser.cpp
Outdated
@@ -512,15 +520,27 @@ void SparqlParser::addFilter(const string& str, vector<SparqlFilter>* _filters,
   // an expensive regex filter.
   bool isSimple = true;
   bool escaped = false;
+  std::vector<bool> regexControlChars(sizeof(char), false);
sizeof(char) is 1; I think you're looking for std::numeric_limits<unsigned char>::max(). Also, this still allocates. If you use std::array instead of std::vector, the allocation becomes a simple addition to the stack pointer, but it still leaves us with setting the bad-char elements on each call. Instead, we could make this a member variable of SparqlParser and only set it during construction.
Alternatively, I think the linear search isn't really all that bad when searching in a very small string, so the old variant is probably just as fast. Though I would use a std::array<char> = {'[', …}; then, since I'm not sure whether all the chars fit into the small string optimization, and again, std::array would allocate on the stack.
I switched it back to the search in a string, but made the string static and const to avoid recreating it.
Yeah, that's a nice solution, I might have turned this into a bit of a useless detour for such a small thing but thanks for your effort!
LGTM and will merge in a minute (or less)
This is a cleaner implementation of two features that were hacked together for the presentation builds.
The prefix filter handles regexes that do prefix checking efficiently, while the changes to the way filters are created allow filtering on a greater variety of ResultTypes.