Added splitting at multiple whitespace characters #36

floriankramer · 2018-02-25T15:39:16Z

I added a splitWs function to StringUtils and used it in the SPARQL parser to allow multiple whitespace characters between variable names in the SELECT clause and within PREFIX declarations.
This allows queries from, e.g., the sp2b benchmark to be used directly. The examples in the W3C recommendation for SPARQL also use multiple whitespace characters in this way.

niklas88

Thanks for these changes, this sounds useful. I've added a few comments, but they are only nitpicks.

niklas88 · 2018-02-26T10:39:19Z

src/parser/SparqlParser.cpp

@@ -63,7 +63,7 @@ void SparqlParser::parsePrologue(string str, ParsedQuery& query) {

 // _____________________________________________________________________________
 void SparqlParser::addPrefix(const string& str, ParsedQuery& query) {
-  auto parts = ad_utility::split(ad_utility::strip(str, ' '), ' ');
+  auto parts = ad_utility::splitWs(ad_utility::strip(str, ' '));


Since splitWs removes the whitespace from the splits, shouldn't the strip() be redundant?
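
A minimal sketch of why the strip() adds nothing, assuming the splitter never emits empty tokens. It uses std::istringstream extraction as a stand-in for whitespace splitting; it is not the splitWs from this PR, and `splitOnWhitespace` is a hypothetical name.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a whitespace splitter; NOT the PR's splitWs.
// operator>> skips any run of whitespace before each token, so leading,
// trailing and repeated whitespace never produce empty tokens.
std::vector<std::string> splitOnWhitespace(const std::string& input) {
  std::istringstream stream(input);
  std::vector<std::string> tokens;
  std::string token;
  while (stream >> token) {
    tokens.push_back(token);
  }
  return tokens;
}
```

With a splitter that behaves like this, stripped and unstripped input give the same tokens, so the outer strip() call can go.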

niklas88 · 2018-02-26T10:52:04Z

src/util/StringUtils.h

+    size_t start = 0;
+    size_t pos = 0;
+    while (pos < orig.size()) {
+      if (isspace(orig[pos])) {


This is std::isspace() from cctype, so if we use this it would make sense to write the std:: explicitly. Then again, we may actually want to use ::isspace(), globally imported from #include <ctype.h>, because unlike std::isspace() this doesn't depend on the locale and is thus probably a bit faster. Since we're dealing with what is basically always UTF-8 encoded text in the ASCII subset, I don't think we really need locale support.

That said, both functions have an additional pitfall: they take ints that must be representable as unsigned char. Since char is signed on x86_64, they need an additional cast to unsigned char to deal with the negative char values that appear in UTF-8. Otherwise the behavior is technically undefined, though I guess it does the right thing in most actual implementations.
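
The pitfall can be sketched as follows. `isSpaceByte` is a hypothetical helper name, not something from this PR; the sketch uses std::isspace from <cctype>, and the same cast applies to ::isspace from <ctype.h>.

```cpp
#include <cctype>

// Hypothetical helper, not part of the PR. The <cctype> classification
// functions take an int whose value must be representable as unsigned char
// (or be EOF). Where plain char is signed, every byte >= 0x80, which
// includes all bytes of a multi-byte UTF-8 sequence, is negative, so the
// value is converted to unsigned char before the call.
inline bool isSpaceByte(char c) {
  return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

The cast costs nothing at runtime and makes the call well defined whether the platform's plain char is signed or unsigned.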

niklas88 · 2018-02-27T09:55:27Z

src/util/StringUtils.h

@@ -381,6 +385,33 @@ vector<string> split(const string& orig, const char sep) {
  return result;
 }

+// _____________________________________________________________________________
+vector<string> splitWs(const string &orig) {


We have a weird mix of type& name and type &name in this file. I strongly prefer type& name because to me being a reference is part of the type. Google allows both giving no preference while Stroustrup prefers the former. I propose we keep it this way and I'll do a clang format over the whole file after the merge.

niklas88 · 2018-02-27T10:12:17Z

src/util/StringUtils.h

+    size_t start = 0;
+    size_t pos = 0;
+    while (pos < orig.size()) {
+      if (orig[pos] >= 0 && ::isspace(orig[pos])) {


I got a little obsessed with this and tried it out on my Raspberry Pi, where char is unsigned. Your code definitely works, and even plain ::isspace(orig[pos]) works with both -fsigned-char and -funsigned-char.
So I feel an extra condition is a bit much just to make it clear that we expect the chars to use the full 8 bits. I think ::isspace(static_cast<unsigned char>(orig[pos])) might be a good compromise: it shows we were thinking about 8-bit chars, and it matches the ::isspace() documentation, which requires a value representable as an unsigned char.

niklas88 · 2018-02-27T10:16:06Z

src/util/StringUtils.h

+      pos++;
+    }
+    // avoid adding whitespace at the back of the string
+    if (!(orig[orig.size() - 1] >= 0 && ::isspace(orig[orig.size() - 1]))) {


This can be replaced by just if (start != orig.size()), because we would already have skipped the whitespace. It then also more closely matches the condition if (start != pos) in the loop, which does basically the same thing.
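
Put together, the loop with the simplified tail check could look roughly like this. It is a sketch reconstructed from the review comments, not the merged code; the name `splitWsSketch` is hypothetical, and only the two conditions start != pos and start != orig.size() are taken from the discussion.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Sketch only: reconstructed from the review comments, not copied from the
// merged implementation. The function name is hypothetical.
std::vector<std::string> splitWsSketch(const std::string& orig) {
  std::vector<std::string> result;
  std::size_t start = 0;  // index of the first byte of the current token
  std::size_t pos = 0;
  while (pos < orig.size()) {
    if (std::isspace(static_cast<unsigned char>(orig[pos]))) {
      // A token ends here unless it would be empty.
      if (start != pos) {
        result.emplace_back(orig.substr(start, pos - start));
      }
      // The next token can begin right after this whitespace byte.
      start = pos + 1;
    }
    pos++;
  }
  // If the input ended in whitespace, start == orig.size() and nothing is
  // left to add; otherwise the last token runs to the end of the string.
  if (start != orig.size()) {
    result.emplace_back(orig.substr(start));
  }
  return result;
}
```

Because start always moves past each whitespace byte, the tail check needs no second isspace() call, which is the point of the suggestion.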

niklas88 · 2018-02-27T10:16:55Z

test/StringUtilsTest.cpp

+  // unicode code point 224 has a second byte (160), that equals the space
+  // character if the first bit is ignored
+  // (which may happen when casting char to int).
+  string s7 = u8"Test\u00e0test";


Great find 🥇
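
The byte arithmetic behind this test case, as a small sketch. The names `kAGrave` and `secondByteWithoutTopBit` are made up for illustration and do not appear in the PR.

```cpp
#include <string>

// U+00E0 is encoded in UTF-8 as the two bytes 0xC3 0xA0.
const std::string kAGrave = "\xC3\xA0";

// Clearing the top bit of the second byte (0xA0 = 160) leaves 0x20 = 32,
// the ASCII space. A conversion that loses or ignores that bit would
// therefore make the byte look like whitespace.
inline int secondByteWithoutTopBit() {
  return static_cast<unsigned char>(kAGrave[1]) & 0x7F;
}
```

A splitter that handles bytes correctly must keep "Test\u00e0test" as one token rather than splitting it in the middle of the character.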

Added splitting at multiple whitespace characters.

6f1fee9

niklas88 suggested changes Feb 26, 2018

floriankramer added 2 commits February 26, 2018 13:18

Merge branch 'master' into parse_whitespace

3d52d0f

Added explicit handling for unicode chars, another test and removed u…

7ddcfe9

…nnecessary strip

niklas88 reviewed Feb 27, 2018

Removed unecessary checks, and changed reference style

7b8f6a0

niklas88 approved these changes Feb 28, 2018

niklas88 merged commit 9218e17 into ad-freiburg:master Feb 28, 2018

floriankramer deleted the parse_whitespace branch February 28, 2018 11:07

leonqli mentioned this pull request Jul 6, 2023

Error when building from source code: /usr/bin/ld: cannot find -lzstd: No such file or directory #1020

Closed
