Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added splitting at multiple whitespace characters #36

Merged
merged 4 commits into from Feb 28, 2018

Conversation

floriankramer
Copy link
Member

I added a splitWs function to the StringUtils and used that during the SparqlParsing to allow for multiple whitespace characters between variable names in the select part and within prefix declarations.
This allows for direct usage of the queries of e.g. the sp2b benchmark, and is also used within the examples in the W3C recommendation for sparql.

Copy link
Member

@niklas88 niklas88 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these changes, this sounds useful. I've added a few comments but only nitpicking.

@@ -63,7 +63,7 @@ void SparqlParser::parsePrologue(string str, ParsedQuery& query) {

// _____________________________________________________________________________
void SparqlParser::addPrefix(const string& str, ParsedQuery& query) {
auto parts = ad_utility::split(ad_utility::strip(str, ' '), ' ');
auto parts = ad_utility::splitWs(ad_utility::strip(str, ' '));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since splitWs removes the whitespace from the splits shouldn't the strip() be redudant?

size_t start = 0;
size_t pos = 0;
while (pos < orig.size()) {
if (isspace(orig[pos])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is std::isspace() from cctype, so if we use this would make sense to write the std:: explicitly. Then again we may actually want to use ::isspace() globally imported from #include <ctype.h> because unlike std::isspace() this doesn't depend on the locale and is thus probably a bit faster. Since we're dealing with what is basically always UTF-8 encoded but in the ASCII subset we don't really need locale support I think.

Copy link
Member

@niklas88 niklas88 Feb 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, both functions have an additional pitfall in taking ints that represent unsigned
char. Since on x86_64 char is unsigned they need an additional cast to unsigned char.. to deal with negative char values as they appear in UTF-8 or else the behavior is technically undefined though I guess it does the right thing in most actual implementations.

@@ -381,6 +385,33 @@ vector<string> split(const string& orig, const char sep) {
return result;
}

// _____________________________________________________________________________
vector<string> splitWs(const string &orig) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a weird mix of type& name and type &name in this file. I strongly prefer type& name because to me being a reference is part of the type. Google allows both giving no preference while Stroustrup prefers the former. I propose we keep it this way and I'll do a clang format over the whole file after the merge.

size_t start = 0;
size_t pos = 0;
while (pos < orig.size()) {
if (orig[pos] >= 0 && ::isspace(orig[pos])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got a little obsessed with this and tried it out on my Raspberry Pi where char is unsigned. Your code definitely works and even just ::isspace(orig[pos]) works with both -fsigned-char and -funsigned-char.
So I feel like doing an extra condition is a bit too much just to make it clear that we expect the chars to use the full 8-bits. I think ::isspace(static_cast<unsigned char>(orig[pos])) might be a good compromise. It shows we were thinking about 8-bit chars and it matches the ::isspace() documentation representable as an unsigned char.

pos++;
}
// avoid adding whitespace at the back of the string
if (!(orig[orig.size() - 1] >= 0 && ::isspace(orig[orig.size() - 1]))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be replaced by just if (start != orig.size()) because we would already have skipped the whitespace. Then it also more closely matches the condition if (start != pos) in the loop which basically does the same thing.

// unicode code point 224 has a second byte (160), that equals the space
// character if the first bit is ignored
// (which may happen when casting char to int).
string s7 = u8"Test\u00e0test";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find 🥇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants