Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added splitting at multiple whitespace characters #36

Merged
merged 4 commits into from
Feb 28, 2018
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
6 changes: 3 additions & 3 deletions src/parser/SparqlParser.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ void SparqlParser::parsePrologue(string str, ParsedQuery& query) {

// _____________________________________________________________________________
void SparqlParser::addPrefix(const string& str, ParsedQuery& query) {
auto parts = ad_utility::split(ad_utility::strip(str, ' '), ' ');
auto parts = ad_utility::splitWs(ad_utility::strip(str, ' '));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since splitWs removes the whitespace from the splits shouldn't the strip() be redudant?

if (parts.size() != 3) {
throw ParseException(string("Invalid PREFIX statement: ") + str);
}
Expand All @@ -73,12 +73,12 @@ void SparqlParser::addPrefix(const string& str, ParsedQuery& query) {
}
SparqlPrefix p{ad_utility::strip(parts[1], " :\t\n"), uri};
query._prefixes.emplace_back(p);
};
}

// _____________________________________________________________________________
void SparqlParser::parseSelect(const string& str, ParsedQuery& query) {
assert(ad_utility::startsWith(str, "SELECT"));
auto vars = ad_utility::split(str, ' ');
auto vars = ad_utility::splitWs(str);
size_t i = 1;
if (vars.size() > i && vars[i] == "DISTINCT") {
++i;
Expand Down
31 changes: 31 additions & 0 deletions src/util/StringUtils.h
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
#include <stdlib.h>
#include <assert.h>
#include <iostream>
#include <cctype>
#include <clocale>
#include <cstring>
#include <cwchar>
Expand Down Expand Up @@ -108,6 +109,9 @@ inline string rstrip(const string& text, const char *s);
//! Splits a string at the separator, kinda like python.
inline vector<string> split(const string& orig, const char sep);

//! Splits a string at any maximum length sequence of whitespace
inline vector<string> splitWs(const string &orig);

//! Splits a string a any character inside the seps string.
inline vector<string> splitAny(const string& orig, const char *seps);

Expand Down Expand Up @@ -381,6 +385,33 @@ vector<string> split(const string& orig, const char sep) {
return result;
}

// _____________________________________________________________________________
vector<string> splitWs(const string &orig) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a weird mix of type& name and type &name in this file. I strongly prefer type& name because to me being a reference is part of the type. Google allows both giving no preference while Stroustrup prefers the former. I propose we keep it this way and I'll do a clang format over the whole file after the merge.

vector<string> result;
if (orig.size() > 0) {
size_t start = 0;
size_t pos = 0;
while (pos < orig.size()) {
if (isspace(orig[pos])) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is std::isspace() from cctype, so if we use this would make sense to write the std:: explicitly. Then again we may actually want to use ::isspace() globally imported from #include <ctype.h> because unlike std::isspace() this doesn't depend on the locale and is thus probably a bit faster. Since we're dealing with what is basically always UTF-8 encoded but in the ASCII subset we don't really need locale support I think.

Copy link
Member

@niklas88 niklas88 Feb 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, both functions have an additional pitfall in taking ints that represent unsigned
char. Since on x86_64 char is unsigned they need an additional cast to unsigned char.. to deal with negative char values as they appear in UTF-8 or else the behavior is technically undefined though I guess it does the right thing in most actual implementations.

if (start != pos) {
result.emplace_back(orig.substr(start, pos - start));
}
// skip any whitespace
while (pos < orig.size() && isspace(orig[pos])) {
pos++;
}
start = pos;
}
pos++;
}
// avoid adding whitespace at the back of the string
if (!isspace(orig[orig.size() - 1])) {
result.emplace_back(orig.substr(start, pos - start));
}
}
return result;
}

// _____________________________________________________________________________
vector<string> splitAny(const string& orig, const char *seps) {
return splitAny(orig, string(seps));
Expand Down
39 changes: 39 additions & 0 deletions test/StringUtilsTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -211,6 +211,45 @@ TEST(StringUtilsTest, strip) {
ASSERT_EQ(u8"äö", strip(u8"xxaxaxaxaäöaaaxxx", "xa"));
ASSERT_EQ("xxaxaxaxa♥", rstrip("xxaxaxaxa♥aaaxxx", "xa"));
}


TEST(StringUtilsTest, splitWs) {
string s1 = " this\nis\t \nit ";
string s2 = "\n \t \n \t";
string s3 = "thisisit";
string s4 = "this is\nit";
string s5 = "a";
auto v1 = splitWs(s1);
ASSERT_EQ(size_t(3), v1.size());
ASSERT_EQ("this", v1[0]);
ASSERT_EQ("is", v1[1]);
ASSERT_EQ("it", v1[2]);

auto v2 = splitWs(s2);
ASSERT_EQ(size_t(0), v2.size());


auto v3 = splitWs(s3);
ASSERT_EQ(size_t(1), v3.size());
ASSERT_EQ("thisisit", v3[0]);

auto v4 = splitWs(s4);
ASSERT_EQ(size_t(3), v4.size());
ASSERT_EQ("this", v4[0]);
ASSERT_EQ("is", v4[1]);
ASSERT_EQ("it", v4[2]);

auto v5 = splitWs(s5);
ASSERT_EQ(size_t(1), v5.size());
ASSERT_EQ("a", v5[0]);

// and with unicode
string s6 = u8"Spaß \t ❤ \n漢字 ";
auto v6 = splitWs(s6);
ASSERT_EQ(u8"Spaß", v6[0]);
ASSERT_EQ(u8"❤", v6[1]);
ASSERT_EQ(u8"漢字", v6[2]);
}
} // namespace


Expand Down