common: add for_each_substr() for cheap string split #18798

Merged
tchaikov merged 4 commits into ceph:master from the wip-str-list-view branch on Nov 14, 2017


@cbodley cbodley commented Nov 7, 2017

adds a new templated for_each_substr() function for string split:

  • using boost::string_view avoids copies (and potential allocation) of the substrings
  • using a callback model avoids baking in the container type, along with the copies and allocations required for insertion
  • this interface is ideal for cases where you just want to iterate over the substrings, and don't actually need to store them in a container

example usage:

const char* input = "a b c d e f";
ceph::for_each_substr(input, " ",
  [] (boost::string_view s) {
    std::cout << "token: " << s << std::endl;
  });
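
A minimal sketch of how a function with this interface might be implemented, for readers who want to try the example above without the Ceph tree. It assumes C++17 and uses std::string_view as a stand-in for boost::string_view; the count_tokens() helper is hypothetical and only illustrates iterating over tokens without storing them. It is not the code merged in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Stand-in for ceph::for_each_substr(), written against std::string_view
// (an assumption; the PR itself uses boost::string_view).
template <typename Func>  // where Func(std::string_view) is a valid call
void for_each_substr(std::string_view s, const char* delims, Func&& f)
{
  assert(delims != nullptr);
  auto pos = s.find_first_not_of(delims);    // skip leading delimiters
  while (pos != s.npos) {
    auto end = s.find_first_of(delims, pos); // npos when this is the last token
    f(s.substr(pos, end - pos));             // substr() clamps the length to the end of s
    pos = s.find_first_not_of(delims, end);  // returns npos when end is npos
  }
}

// Hypothetical helper: counts tokens without copying or allocating anything.
inline std::size_t count_tokens(std::string_view s, const char* delims)
{
  std::size_t n = 0;
  for_each_substr(s, delims, [&n](std::string_view) { ++n; });
  return n;
}
```

Each token passed to the callback is a view into the input, so it is only valid while the input string is alive.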

by reimplementing get_str_list() and get_str_vec() in terms of for_each_substr(), benchmark results show a speedup of ~30% (where the Small variants use tokens small enough for the small string optimization)

before:

Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BenchStrList/SmallStrList        882 ns        880 ns     749334
BenchStrList/LargeStrList       2015 ns       2012 ns     345552
BenchStrVec/SmallStrVec          600 ns        599 ns    1146149
BenchStrVec/LargeStrVec         1813 ns       1810 ns     381603

after:

Benchmark                          Time           CPU Iterations    [Speedup]
------------------------------------------------------------------------------
BenchStrList/SmallStrList        650 ns        648 ns     929445    1.35
BenchStrList/LargeStrList       1498 ns       1495 ns     441588    1.35
BenchStrVec/SmallStrVec          445 ns        444 ns    1520766    1.34
BenchStrVec/LargeStrVec         1419 ns       1416 ns     462015    1.27

(this pr adds a submodule for the google benchmark library, https://github.com/google/benchmark. we don't necessarily need to merge that or the benchmark itself)
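
The reimplementation of get_str_vec() in terms of for_each_substr() mentioned above could look roughly like the sketch below. It assumes C++17 and std::string_view in place of boost::string_view, and it repeats a stand-in for_each_substr() so the block compiles on its own. The by-value return and default delimiters follow the signature shown later in this review, but the body is an illustration, not the merged code.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Stand-in for ceph::for_each_substr() (assumption: std::string_view).
template <typename Func>
void for_each_substr(std::string_view s, const char* delims, Func&& f)
{
  assert(delims != nullptr);
  auto pos = s.find_first_not_of(delims);
  while (pos != s.npos) {
    auto end = s.find_first_of(delims, pos);
    f(s.substr(pos, end - pos));
    pos = s.find_first_not_of(delims, end);
  }
}

// Sketch of get_str_vec() built on the callback interface.
inline std::vector<std::string> get_str_vec(const std::string& str,
                                            const char* delims = ";,= \t")
{
  std::vector<std::string> out;
  for_each_substr(str, delims, [&out](std::string_view token) {
    // the only copy: each token is constructed in place inside the vector
    out.emplace_back(token.data(), token.size());
  });
  return out;  // eligible for copy elision; otherwise moved, never deep-copied
}
```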

@adamemerson adamemerson left a comment

Lexically Germane Token Mastery

@@ -62,6 +62,13 @@ void get_str_list(const string& str, list<string>& str_list)
  return get_str_list(str, delims, str_list);
}

list<string> get_str_list(const string& str, const char *delims)
Contributor

Good.

@@ -104,3 +118,10 @@ void get_str_set(const string& str, set<string>& str_set)
  const char *delims = ";,= \t";
  return get_str_set(str, delims, str_set);
}

set<string> get_str_set(const string& str, const char *delims)
Contributor

At a certain point I feel like this is asking to be a template template.
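
One possible reading of the "template template" suggestion is a single function parameterized on the container template, with the per-container functions reduced to thin wrappers. The sketch below is only an illustration of that idea: split_into() is a hypothetical name, the wrappers take a std::string_view rather than the const string& used in Ceph, and std::string_view (C++17) stands in for boost::string_view.

```cpp
#include <cassert>
#include <list>
#include <set>
#include <string>
#include <string_view>
#include <vector>

// Stand-in for ceph::for_each_substr() (assumption: std::string_view).
template <typename Func>
void for_each_substr(std::string_view s, const char* delims, Func&& f)
{
  assert(delims != nullptr);
  auto pos = s.find_first_not_of(delims);
  while (pos != s.npos) {
    auto end = s.find_first_of(delims, pos);
    f(s.substr(pos, end - pos));
    pos = s.find_first_not_of(delims, end);
  }
}

// Hypothetical generic splitter: Container is a template template parameter,
// so the same body serves std::list, std::vector and std::set.
template <template <typename...> class Container>
Container<std::string> split_into(std::string_view str,
                                  const char* delims = ";,= \t")
{
  Container<std::string> out;
  for_each_substr(str, delims, [&out](std::string_view token) {
    // insert(position, value) exists for sequence containers and, as a
    // hinted insert, for std::set
    out.insert(out.end(), std::string(token));
  });
  return out;
}

// The existing entry points could then become one-line wrappers.
inline std::list<std::string> get_str_list(std::string_view s) { return split_into<std::list>(s); }
inline std::vector<std::string> get_str_vec(std::string_view s) { return split_into<std::vector>(s); }
inline std::set<std::string> get_str_set(std::string_view s) { return split_into<std::set>(s); }
```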

@@ -46,6 +49,8 @@ extern void get_str_vec(const std::string& str,
                        const char *delims,
                        std::vector<std::string>& str_vec);

std::vector<std::string> get_str_vec(const std::string& str,
                                     const char *delims = ";,= \t");
Contributor

How do you feel about delims being a flat set and/or just taking any sequence of characters?

Contributor Author

i'm guessing the vast majority of callers will pass the delims as a string literal, so i'd prefer not to copy them into something like a flat_set. but taking them as a string_view would be an easy option

Contributor Author

i decided to leave it as const char* here, because changing it to string_view would require changing the existing signatures too - and i don't think there's any real benefit to doing that

/// Split a string using the given delimiters, passing each piece as a
/// (non-null-terminated) boost::string_view to the callback.
template <typename Func> // where Func(boost::string_view) is a valid call
void for_each_substr(boost::string_view s, const char *delims, Func&& f)
Contributor

I approve.

}
}
}
for_each_substr(str, delims, [&str_list] (boost::string_view token) {
Contributor

can we pass token by reference? like:

for_each_substr(str, delims, [&str_list] (const boost::string_view& token) {
 ...
});

Contributor Author

string_view is just a pointer/len pair, so it's trivial to copy and can be passed in registers. some advice about this from the cpp core guidelines:

F.16: For “in” parameters, pass cheaply-copied types by value and others by reference to const
...
What is “cheap to copy” depends on the machine architecture, but two or three words (doubles, pointers, references) are usually best passed by value. When copying is cheap, nothing beats the simplicity and safety of copying, and for small objects (up to two or three words) it is also faster than passing by reference because it does not require an extra indirection to access from the function.
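
The "pointer/len pair" point can be checked at compile time. The sketch below uses std::string_view (C++17) as a stand-in for boost::string_view. The two static_asserts are assumptions that hold on the mainstream standard libraries rather than guarantees of the C++17 standard, and len_by_value() and len_by_ref() are hypothetical names used only to contrast the two signatures.

```cpp
#include <cstddef>
#include <string_view>
#include <type_traits>

// Assumption: true on libstdc++, libc++ and MSVC, where a string_view is
// just a pointer plus a length.
static_assert(std::is_trivially_copyable_v<std::string_view>,
              "string_view is expected to be trivially copyable");
static_assert(sizeof(std::string_view) <= 3 * sizeof(void*),
              "string_view is expected to fit in two or three words");

// By value: the callee receives its own pointer/length pair.
constexpr std::size_t len_by_value(std::string_view s) noexcept
{
  return s.size();
}

// By const reference: same result, but the callee reaches the pair through
// an extra indirection unless the call is inlined.
constexpr std::size_t len_by_ref(const std::string_view& s) noexcept
{
  return s.size();
}
```

Both forms are correct. The by-value signature is the one F.16 recommends for a type this small.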

void for_each_substr(boost::string_view s, const char *delims, Func&& f)
{
  auto pos = s.find_first_not_of(delims);
  while (pos != boost::string_view::npos) {
Contributor

nit, could just put

while (pos != s.npos) {
...
}

less typing.

.gitmodules Outdated
@@ -55,3 +55,6 @@
[submodule "src/rapidjson"]
path = src/rapidjson
url = https://github.com/ceph/rapidjson
[submodule "src/benchmark"]
Contributor

instead of adding google/benchmark as a submodule, can we add it using ExternalProject_Add()? and do the git clone on the fly? just like how we add nvml.

Contributor Author

yeah, np. i'm not convinced that it's worth pulling in a new library for this dumb little benchmark though, unless we plan to make further use of it. do you think it's worth merging those pieces?

Contributor

yeah, makes sense. on second thought, i'm not convinced either. a benchmark tool would be helpful when we have multiple comparable backends/implementations of a feature, like lz4, snappy, zlib and zstd, so that users/developers can evaluate them with different datasets/parameters and choose a backend and a set of parameters. but for_each_substr() has no alternative(s) in Ceph, so we should probably drop this benchmark tool.

the simpler interfaces rely on return value optimization to avoid
copying the result. removing the container from the argument list makes
it easier to provide a default argument for 'delims'

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
using boost::string_view avoids copies (and potential allocation) of
the substrings

using a callback model avoids baking in the container type, along with
the copies and allocations required for insertion

this interface is ideal for cases where you just want to iterate over
the substrings, and don't actually need to store them in a container

Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>

cbodley commented Nov 10, 2017

updated. i removed the benchmark stuff, but those commits are still available in https://github.com/cbodley/ceph/commits/wip-str-list-view-bench if anyone wants to reproduce

@tchaikov tchaikov merged commit c7df576 into ceph:master Nov 14, 2017
@cbodley cbodley deleted the wip-str-list-view branch November 14, 2017 14:26

cbodley commented Nov 14, 2017

thanks @tchaikov!
