Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a prefix filter and support for different column types to filters #145

Merged
merged 6 commits into from Nov 5, 2018

Conversation

floriankramer
Copy link
Member

This is a cleaner implementation of two features that were hacked together for the presentation builds.
The prefix filter handles regexes that do prefix checking efficiently while the changes to the way filters are created allow for filtering on a greater variety of ResultTypes.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for missing e2e prefix queries

- num_rows: 1
- num_cols: 2
- selected: ["?profession", "?awards"]
- contains_row: ["<Apothecary>", '<Victoria_Cross>']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are still missing e2e queries for prefix filtering

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a 'regex-albert-physics-award' query that uses the prefix filter. I can add more, but I couldn't think of a significantly different test case for prefix filters.

template <class RT>
vector<RT>* Filter::computeFilter(vector<RT>* res, size_t l, size_t r,
shared_ptr<const ResultTable> subRes) const {
template <class RT, ResultTable::ResultType T>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Youb already renamed _lhsInd to _lhs above we may want to rename the l here too as it could easily be misunderstood as just filtering a range and l meaning left.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
return res;
}

template <class RT>
vector<RT>* Filter::computeFilter(vector<RT>* res, size_t l, size_t r,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this cleanup with templating over the ResultType and using ValueReader, thank you for that.

size_t lowerBound = getIndex().getVocab().getValueIdForGE(rhs);
std::string upperBoundStr = rhs;
// TODO(florian): This only works for ascii but could break unicode
upperBoundStr[upperBoundStr.size() - 1]++;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are sorting lexicographically by bytes and ignore the encoding the only problem is that char may be signed and thus could register as < while technically being larger, right? Also if it were unsigned we would only have to make sure that we don't increment through the overflow at 255, right? Though I think 255 shouldn't really appear in UTF-8 text. I'm not sure how to get this working with implementation defined char signedness however or even how string sorting really works with signed char..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some testing with gcc and the sorting of strings with unicode characters. gcc uses signed chars by default but the sorting works without any issues, and even when incrementing the lower byte of '¿' which makes it an invalid utf-8 character the sorting still worked fine. I'd agree that 255 should not be an issue, as it is not a valid utf-8 character byte. Considering this seems to work well in practice I'd just remove the TODO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I guess the string sorting does something like casting to unsigned char because otherwise the order too would be platform defined, which sounds like a bad idea. From my experience on Linux char is signed on x86_64 (unless passing -funsigned-char to GCC/clang). On ARM and ARM64 (e.g. on my RaspberryPi) on the other hand char is always unsigned. Interestingly enough this isn't just a typedef

shared_ptr<const ResultTable> subRes) const {
ResultTable::ResultType resultType = subRes->getResultType(l);
// Catch some unsupported combinations
if (resultType != ResultTable::ResultType::KB &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about GrOUP_CONCAT? That's still a string but not KB?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had forgotten about that, it's included in the next commit.

// Find a matching entry in subRes' _localVocab. If _rhs is not in the
// _localVocab of subRes r will be equal to _localVocab.size() and
// not match the index of any entry in _localVocab.
for (r = 0; r < subRes->_localVocab->size(); r++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the GROUP_CONCAT case I was wondering about, right?

If the local vocab is ordered we could optimize this as a binary search, so maybe add a TODO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with keeping the local vocabulary ordered is that every reordering of the vocabulary would also require a rewriting of all local_vocab columns, as the entries in those columns are just offsets into the local_vocab_ vector. Right now the sorting of the local vocabulary would only help us for equality filters on local_vocab columns which should be a rather rare use case, so I believe we would overall loose performance from keeping the local vocabulary sorted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, then let's keep it as is

@@ -11,7 +11,8 @@ ResultTable::ResultTable()
_sortedBy(0),
_varSizeData(),
_fixedSizeData(nullptr),
_status(ResultTable::OTHER) {}
_status(ResultTable::OTHER),
_localVocab(std::make_shared<std::vector<std::string>>()) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the advantage of having this as a shared_ptr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It prevents copying of the contents of _localVocab after evaluating a subtree. As elements in the vocabulary are currently never moved or deleted this should not create any problems.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think this warrants a comment. Also can't we just std::move() the _localVocab?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could use std::move, the only advantage of using shared_pointers that I can think of right now is that the old ResultTable does not loose access to the vocabulary after it was copied. That would probably not be a problem right now, but cost wise shared pointers should be very close to std::move, and they avoid problems with interpreting the old results after already having moved the vocabulary.

@@ -207,7 +207,7 @@ class Vocabulary {

Id getValueIdForLE(const string& indexWord) const {
Id lb = lower_bound(indexWord);
if (at(lb) != indexWord && lb > 0) {
if ((lb < _words.size() && at(lb) != indexWord) && lb > 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the lb > 0 come before accessing in at(lb) so the short circuit saves us from an illegal access?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lb is unsigned so this should not be a problem. the greater than 0 check is only to avoid an underflow when decreasing lb later on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

@@ -512,15 +520,27 @@ void SparqlParser::addFilter(const string& str, vector<SparqlFilter>* _filters,
// an expensive regex filter.
bool isSimple = true;
bool escaped = false;

std::vector<bool> regexControlChars(sizeof(char), false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sizeof(char) is 1, I think you're looking for std::numeric_limists<unsigned char>::max(). Also this still allocates. If you use std::array instead of std::vector the allocation becomes a simple addittion on the stack pointer but it still leaves us with setting the bad char elements on each call. Instead we could make this a member variable of SparqlParser and only set it during construction.

Alternatively I think the linear search isn't really all that bad with searching in a very small string so the old variant is probably just as fast. Though I would use a std::array<char> = {'[', …}; then since I'm not sure if all chars fit into the small string op and again, std::array would allocate on the stack

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched it back to the search in a string, but made the string static and const to avoid recreating it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's a nice solution, I might have turned this into a bit of a useless detour for such a small thing but thanks for your effort!

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and will merge in a minute (or less)

@niklas88 niklas88 merged commit 53f972c into ad-freiburg:master Nov 5, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants