Added a prefix filter and support for different column types to filters #145
Conversation
LGTM except for missing e2e prefix queries
- num_rows: 1
- num_cols: 2
- selected: ["?profession", "?awards"]
- contains_row: ["<Apothecary>", "<Victoria_Cross>"]
I think we are still missing e2e queries for prefix filtering
There is a 'regex-albert-physics-award' query that uses the prefix filter. I can add more, but I couldn't think of a significantly different test case for prefix filters.
-template <class RT>
+template <class RT, ResultTable::ResultType T>
 vector<RT>* Filter::computeFilter(vector<RT>* res, size_t l, size_t r,
                                   shared_ptr<const ResultTable> subRes) const {
You already renamed _lhsInd to _lhs above; we may want to rename the l here too, as it could easily be misunderstood as just filtering a range, with l meaning left.
done
src/engine/Filter.cpp
Outdated
  }
  return res;
}

template <class RT>
vector<RT>* Filter::computeFilter(vector<RT>* res, size_t l, size_t r,
I like this cleanup with templating over the ResultType and using ValueReader, thank you for that.
src/engine/Filter.cpp
Outdated
size_t lowerBound = getIndex().getVocab().getValueIdForGE(rhs);
std::string upperBoundStr = rhs;
// TODO(florian): This only works for ascii but could break unicode
upperBoundStr[upperBoundStr.size() - 1]++;
Since we are sorting lexicographically by bytes and ignore the encoding, the only problem is that char may be signed and thus could register as < while technically being larger, right? Also, if it were unsigned we would only have to make sure that we don't increment through the overflow at 255, right? Though I think 255 shouldn't really appear in UTF-8 text. I'm not sure how to get this working with implementation-defined char signedness, however, or even how string sorting really works with signed char.
I did some testing with GCC and the sorting of strings with Unicode characters. GCC uses signed chars by default, but the sorting works without any issues, and even when incrementing the lower byte of '¿', which makes it an invalid UTF-8 character, the sorting still worked fine. I'd agree that 255 should not be an issue, as it is not a valid UTF-8 byte. Considering this seems to work well in practice, I'd just remove the TODO.
Sounds good. I guess the string sorting does something like casting to unsigned char, because otherwise the order too would be platform defined, which sounds like a bad idea. From my experience on Linux, char is signed on x86_64 (unless passing -funsigned-char to GCC/clang). On ARM and ARM64 (e.g. on my RaspberryPi), on the other hand, char is always unsigned. Interestingly enough, this isn't just a typedef
                                  shared_ptr<const ResultTable> subRes) const {
  ResultTable::ResultType resultType = subRes->getResultType(l);
  // Catch some unsupported combinations
  if (resultType != ResultTable::ResultType::KB &&
What about GROUP_CONCAT? That's still a string, but not KB?
I had forgotten about that, it's included in the next commit.
src/engine/Filter.cpp
Outdated
// Find a matching entry in subRes' _localVocab. If _rhs is not in the
// _localVocab of subRes, r will be equal to _localVocab.size() and
// not match the index of any entry in _localVocab.
for (r = 0; r < subRes->_localVocab->size(); r++) {
This is the GROUP_CONCAT case I was wondering about, right?
If the local vocab is ordered we could optimize this as a binary search, so maybe add a TODO?
The problem with keeping the local vocabulary ordered is that every reordering of the vocabulary would also require rewriting all local_vocab columns, as the entries in those columns are just offsets into the _localVocab vector. Right now, sorting the local vocabulary would only help us with equality filters on local_vocab columns, which should be a rather rare use case, so I believe we would overall lose performance from keeping the local vocabulary sorted.
Makes sense, then let's keep it as is
@@ -11,7 +11,8 @@ ResultTable::ResultTable()
       _sortedBy(0),
       _varSizeData(),
       _fixedSizeData(nullptr),
-      _status(ResultTable::OTHER) {}
+      _status(ResultTable::OTHER),
+      _localVocab(std::make_shared<std::vector<std::string>>()) {}
What's the advantage of having this as a shared_ptr?
It prevents copying the contents of _localVocab after evaluating a subtree. As elements in the vocabulary are currently never moved or deleted, this should not create any problems.
Ok, I think this warrants a comment. Also, can't we just std::move() the _localVocab?
I think we could use std::move; the only advantage of using shared_ptr that I can think of right now is that the old ResultTable does not lose access to the vocabulary after it was copied. That would probably not be a problem right now, but cost-wise shared pointers should be very close to std::move, and they avoid problems with interpreting the old results after the vocabulary has already been moved.
src/index/Vocabulary.h
Outdated
@@ -207,7 +207,7 @@ class Vocabulary {
   Id getValueIdForLE(const string& indexWord) const {
     Id lb = lower_bound(indexWord);
-    if (at(lb) != indexWord && lb > 0) {
+    if ((lb < _words.size() && at(lb) != indexWord) && lb > 0) {
Shouldn't the lb > 0 come before accessing at(lb), so the short circuit saves us from an illegal access?
lb is unsigned, so this should not be a problem. The lb > 0 check is only there to avoid an underflow when decreasing lb later on.
I see
src/parser/SparqlParser.cpp
Outdated
@@ -512,15 +520,27 @@ void SparqlParser::addFilter(const string& str, vector<SparqlFilter>* _filters,
   // an expensive regex filter.
   bool isSimple = true;
   bool escaped = false;
+  std::vector<bool> regexControlChars(sizeof(char), false);
sizeof(char) is 1; I think you're looking for std::numeric_limits<unsigned char>::max(). Also, this still allocates. If you use std::array instead of std::vector, the allocation becomes a simple addition to the stack pointer, but it still leaves us with setting the bad-char elements on each call. Instead, we could make this a member variable of SparqlParser and only set it during construction.
Alternatively, I think the linear search isn't really all that bad when searching in a very small string, so the old variant is probably just as fast. Though I would use a std::array<char> = {'[', …}; then, since I'm not sure whether all the chars fit into the small string optimization, and again, std::array would allocate on the stack.
I switched it back to the search in a string, but made the string static and const to avoid recreating it.
Yeah, that's a nice solution, I might have turned this into a bit of a useless detour for such a small thing but thanks for your effort!
LGTM and will merge in a minute (or less)
This is a cleaner implementation of two features that were hacked together for the presentation builds.
The prefix filter handles regexes that do prefix checking efficiently, while the changes to the way filters are created allow filtering on a greater variety of ResultTypes.