New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added splitting at multiple whitespace characters #36
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these changes, this sounds useful. I've added a few comments but only nitpicking.
src/parser/SparqlParser.cpp
Outdated
@@ -63,7 +63,7 @@ void SparqlParser::parsePrologue(string str, ParsedQuery& query) { | |||
|
|||
// _____________________________________________________________________________ | |||
void SparqlParser::addPrefix(const string& str, ParsedQuery& query) { | |||
auto parts = ad_utility::split(ad_utility::strip(str, ' '), ' '); | |||
auto parts = ad_utility::splitWs(ad_utility::strip(str, ' ')); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since splitWs removes the whitespace from the splits shouldn't the strip() be redudant?
src/util/StringUtils.h
Outdated
size_t start = 0; | ||
size_t pos = 0; | ||
while (pos < orig.size()) { | ||
if (isspace(orig[pos])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is std::isspace()
from cctype, so if we use this would make sense to write the std::
explicitly. Then again we may actually want to use ::isspace()
globally imported from #include <ctype.h>
because unlike std::isspace()
this doesn't depend on the locale and is thus probably a bit faster. Since we're dealing with what is basically always UTF-8 encoded but in the ASCII subset we don't really need locale support I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, both functions have an additional pitfall in taking ints that represent unsigned
char. Since on x86_64 char is unsigned they need an additional cast to unsigned char.. to deal with negative char values as they appear in UTF-8 or else the behavior is technically undefined though I guess it does the right thing in most actual implementations.
src/util/StringUtils.h
Outdated
@@ -381,6 +385,33 @@ vector<string> split(const string& orig, const char sep) { | |||
return result; | |||
} | |||
|
|||
// _____________________________________________________________________________ | |||
vector<string> splitWs(const string &orig) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a weird mix of type& name
and type &name
in this file. I strongly prefer type& name
because to me being a reference is part of the type. Google allows both giving no preference while Stroustrup prefers the former. I propose we keep it this way and I'll do a clang format over the whole file after the merge.
src/util/StringUtils.h
Outdated
size_t start = 0; | ||
size_t pos = 0; | ||
while (pos < orig.size()) { | ||
if (orig[pos] >= 0 && ::isspace(orig[pos])) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got a little obsessed with this and tried it out on my Raspberry Pi where char
is unsigned. Your code definitely works and even just ::isspace(orig[pos])
works with both -fsigned-char
and -funsigned-char
.
So I feel like doing an extra condition is a bit too much just to make it clear that we expect the chars to use the full 8-bits. I think ::isspace(static_cast<unsigned char>(orig[pos]))
might be a good compromise. It shows we were thinking about 8-bit chars and it matches the ::isspace()
documentation representable as an unsigned char.
src/util/StringUtils.h
Outdated
pos++; | ||
} | ||
// avoid adding whitespace at the back of the string | ||
if (!(orig[orig.size() - 1] >= 0 && ::isspace(orig[orig.size() - 1]))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be replaced by just if (start != orig.size())
because we would already have skipped the whitespace. Then it also more closely matches the condition if (start != pos)
in the loop which basically does the same thing.
// unicode code point 224 has a second byte (160), that equals the space | ||
// character if the first bit is ignored | ||
// (which may happen when casting char to int). | ||
string s7 = u8"Test\u00e0test"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great find 🥇
I added a splitWs function to the StringUtils and used that during the SparqlParsing to allow for multiple whitespace characters between variable names in the select part and within prefix declarations.
This allows for direct usage of the queries of e.g. the sp2b benchmark, and is also used within the examples in the W3C recommendation for sparql.