New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
common: add for_each_substr() for cheap string split #18798
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lexically Germane Token Mastery
src/common/str_list.cc
Outdated
@@ -62,6 +62,13 @@ void get_str_list(const string& str, list<string>& str_list) | |||
return get_str_list(str, delims, str_list); | |||
} | |||
|
|||
list<string> get_str_list(const string& str, const char *delims) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good.
src/common/str_list.cc
Outdated
@@ -104,3 +118,10 @@ void get_str_set(const string& str, set<string>& str_set) | |||
const char *delims = ";,= \t"; | |||
return get_str_set(str, delims, str_set); | |||
} | |||
|
|||
set<string> get_str_set(const string& str, const char *delims) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a certain point I feel like this is asking to be a template template.
@@ -46,6 +49,8 @@ extern void get_str_vec(const std::string& str, | |||
const char *delims, | |||
std::vector<std::string>& str_vec); | |||
|
|||
std::vector<std::string> get_str_vec(const std::string& str, | |||
const char *delims = ";,= \t"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you feel about delims being a flat set and/or just taking any sequence of characters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm guessing the vast majority of cases will pass them a string literal, so i'd prefer not to copy them into something like a flat_set. but taking them as a string_view would be an easy option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i decided to leave it as const char*
here, because changing it to string_view
would require changing the existing signatures too - and i don't think there's any real benefit to doing that
/// Split a string using the given delimiters, passing each piece as a | ||
/// (non-null-terminated) boost::string_view to the callback. | ||
template <typename Func> // where Func(boost::string_view) is a valid call | ||
void for_each_substr(boost::string_view s, const char *delims, Func&& f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I approve.
143911a
to
97fe0e8
Compare
} | ||
} | ||
} | ||
for_each_substr(str, delims, [&str_list] (boost::string_view token) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we pass token by reference? like:
for_each_substr(str, delims, [&str_list] (const boost::string_view& token) {
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string_view is just a pointer/len pair, so it's trivial to copy and can be passed in registers. some advice about this from the cpp core guidelines:
F.16: For “in” parameters, pass cheaply-copied types by value and others by reference to const
...
What is “cheap to copy” depends on the machine architecture, but two or three words (doubles, pointers, references) are usually best passed by value. When copying is cheap, nothing beats the simplicity and safety of copying, and for small objects (up to two or three words) it is also faster than passing by reference because it does not require an extra indirection to access from the function.
src/include/str_list.h
Outdated
void for_each_substr(boost::string_view s, const char *delims, Func&& f) | ||
{ | ||
auto pos = s.find_first_not_of(delims); | ||
while (pos != boost::string_view::npos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, could just put
while (pos != s.npos) {
...
}
less type.
.gitmodules
Outdated
@@ -55,3 +55,6 @@ | |||
[submodule "src/rapidjson"] | |||
path = src/rapidjson | |||
url = https://github.com/ceph/rapidjson | |||
[submodule "src/benchmark"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of adding google/benchmark as a submodule, can we add it using ExternalProject_Add()
? and do the git clone on the fly? just like how we add nvml
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, np. i'm not convinced that it's worth pulling in a new library for this dumb little benchmark though, unless we plan to make further use of it. do you think it's worth merging those pieces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, makes sense. after a second thought, neither am i convinced now. it'd be helpful in the case that we have multiple comparable backend/implementations of a certain feature like lz4,snappy,zlib and zstd, and user/developer can use the benchmark tool to evaluate them with different datasets/parameters to choose a backend and a set of parameters. but apparently, for_each_substr()
does not have its alternative(s) in Ceph, so probably we should drop this benchmark tool.
the simpler interfaces rely on return value optimization to avoid copying the result. removing the container from the argument list makes it easier to provide a default argument for 'delims' Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
using boost::string_view avoids copies (and potential allocation) of the substrings using a callback model avoids baking in the container type, along with the copies and allocations required for insertion this interface is ideal for cases where you just want to iterate over the substrings, and don't actually need to store them in a container Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
97fe0e8
to
c4e54b7
Compare
updated. i removed the benchmark stuff, but those commits are still available in https://github.com/cbodley/ceph/commits/wip-str-list-view-bench if anyone wants to reproduce |
thanks @tchaikov! |
adds a new templated
for_each_substr()
function for string split:boost::string_view
avoids copies (and potential allocation) of the substringsexample usage:
by reimplementing
get_str_list()
andget_str_vec()
in terms offor_each_substr()
, benchmark results show a speedup of ~30% (where theSmall
variants use tokens small enough for the small string optimization)before:
after:
(this pr adds a submodule for the google benchmark library, https://github.com/google/benchmark. we don't necessarily need to merge that or the benchmark itself)