common: add for_each_substr() for cheap string split #18798

cbodley · 2017-11-07T20:03:10Z

adds a new templated for_each_substr() function for string split:

using boost::string_view avoids copies (and potential allocation) of the substrings
using a callback model avoids baking in the container type, along with the copies and allocations required for insertion
this interface is ideal for cases where you just want to iterate over the substrings, and don't actually need to store them in a container

example usage:

const char* input = "a b c d e f";
ceph::for_each_substr(input, " ",
  [] (boost::string_view s) {
    std::cout << "token: " << s << std::endl;
  });
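For readers without the diff at hand, the splitting loop can be sketched roughly as below. This is an illustration, not the merged code: it uses C++17 `std::string_view` as a stand-in for `boost::string_view`, and the helper `collect_tokens()` is hypothetical, added only to show the callback in use.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Sketch of the split loop; std::string_view stands in for boost::string_view.
template <typename Func>  // where Func(std::string_view) is a valid call
void for_each_substr(std::string_view s, const char* delims, Func&& f)
{
  auto pos = s.find_first_not_of(delims);    // skip leading delimiters
  while (pos != s.npos) {
    s.remove_prefix(pos);                    // the token now starts at s[0]
    auto end = s.find_first_of(delims);      // npos if the token runs to the end
    f(s.substr(0, end));                     // a view into the input, no copy
    pos = s.find_first_not_of(delims, end);  // start of the next token, or npos
  }
}

// Hypothetical helper: copies tokens only because this caller wants to keep them.
std::vector<std::string> collect_tokens(std::string_view input, const char* delims)
{
  std::vector<std::string> out;
  for_each_substr(input, delims, [&out](std::string_view token) {
    out.emplace_back(token);  // the only allocation happens here, by choice
  });
  return out;
}
```

A caller that only inspects each token (counting, comparing, parsing) can skip the container entirely and do its work inside the lambda.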

by reimplementing get_str_list() and get_str_vec() in terms of for_each_substr(), benchmark results show a speedup of ~30% (where the Small variants use tokens small enough for the small string optimization)

before:

Benchmark                          Time           CPU Iterations
-----------------------------------------------------------------
BenchStrList/SmallStrList        882 ns        880 ns     749334
BenchStrList/LargeStrList       2015 ns       2012 ns     345552
BenchStrVec/SmallStrVec          600 ns        599 ns    1146149
BenchStrVec/LargeStrVec         1813 ns       1810 ns     381603

after:

Benchmark                          Time           CPU Iterations    [Speedup]
------------------------------------------------------------------------------
BenchStrList/SmallStrList        650 ns        648 ns     929445    1.35
BenchStrList/LargeStrList       1498 ns       1495 ns     441588    1.35
BenchStrVec/SmallStrVec          445 ns        444 ns    1520766    1.34
BenchStrVec/LargeStrVec         1419 ns       1416 ns     462015    1.27

(this pr adds a submodule for the google benchmark library, https://github.com/google/benchmark. we don't necessarily need to merge that or the benchmark itself)

adamemerson

Lexically Germane Token Mastery

adamemerson · 2017-11-07T20:18:57Z

src/common/str_list.cc

@@ -62,6 +62,13 @@ void get_str_list(const string& str, list<string>& str_list)
  return get_str_list(str, delims, str_list);
 }

+list<string> get_str_list(const string& str, const char *delims)


adamemerson · 2017-11-07T20:19:37Z

src/common/str_list.cc

@@ -104,3 +118,10 @@ void get_str_set(const string& str, set<string>& str_set)
  const char *delims = ";,= \t";
  return get_str_set(str, delims, str_set);
 }
+
+set<string> get_str_set(const string& str, const char *delims)


At a certain point I feel like this is asking to be a template template.
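One way to read that suggestion is a single function template parameterized on the container template, sketched below. The name `get_str_container()` is hypothetical, `std::string_view` stands in for `boost::string_view`, and the `for_each_substr()` body is an assumed stand-in included only to keep the example self-contained. None of this is the merged code.

```cpp
#include <list>
#include <set>
#include <string>
#include <string_view>
#include <vector>

// Assumed stand-in for the split function discussed in this PR.
template <typename Func>
void for_each_substr(std::string_view s, const char* delims, Func&& f)
{
  auto pos = s.find_first_not_of(delims);
  while (pos != s.npos) {
    s.remove_prefix(pos);
    auto end = s.find_first_of(delims);
    f(s.substr(0, end));
    pos = s.find_first_not_of(delims, end);
  }
}

// Hypothetical: one template instead of separate list/vector/set overloads.
// Container is a template template parameter such as std::vector or std::set.
template <template <typename...> class Container>
Container<std::string> get_str_container(std::string_view str,
                                         const char* delims = ";,= \t")
{
  Container<std::string> out;
  for_each_substr(str, delims, [&out](std::string_view token) {
    // insert(position, value) is available on list and vector, and on set
    // where the position is only a hint.
    out.insert(out.end(), std::string(token));
  });
  return out;  // returned by value
}
```

With this shape, `get_str_container<std::set>("x y x")` would yield a two-element set, and the same template covers `std::list` and `std::vector`.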

adamemerson · 2017-11-07T20:20:24Z

src/include/str_list.h

@@ -46,6 +49,8 @@ extern void get_str_vec(const std::string& str,
                         const char *delims,
 			 std::vector<std::string>& str_vec);

+std::vector<std::string> get_str_vec(const std::string& str,
+                                     const char *delims = ";,= \t");


How do you feel about delims being a flat set and/or just taking any sequence of characters?

i'm guessing the vast majority of cases will pass them a string literal, so i'd prefer not to copy them into something like a flat_set. but taking them as a string_view would be an easy option

i decided to leave it as const char* here, because changing it to string_view would require changing the existing signatures too - and i don't think there's any real benefit to doing that
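For comparison, the string_view option mentioned in this thread might look like the sketch below. The names `for_each_substr_sv()` and `get_str_vec_sv()` are hypothetical and `std::string_view` stands in for `boost::string_view`. The PR itself keeps `const char*` for `delims`.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical variant: delimiters passed as a string_view, so they need not
// be null-terminated and a string literal is still passed without a copy.
template <typename Func>
void for_each_substr_sv(std::string_view s, std::string_view delims, Func&& f)
{
  auto pos = s.find_first_not_of(delims);
  while (pos != s.npos) {
    s.remove_prefix(pos);
    auto end = s.find_first_of(delims);
    f(s.substr(0, end));
    pos = s.find_first_not_of(delims, end);
  }
}

// Hypothetical by-value interface with the same default delimiter set as the
// get_str_vec() declaration in the hunk above.
std::vector<std::string> get_str_vec_sv(std::string_view str,
                                        std::string_view delims = ";,= \t")
{
  std::vector<std::string> out;
  for_each_substr_sv(str, delims, [&out](std::string_view token) {
    out.emplace_back(token);
  });
  return out;
}
```

The cost noted in the thread still applies: switching the parameter type would mean touching the existing `const char*` signatures as well.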

adamemerson · 2017-11-07T20:23:03Z

src/include/str_list.h

+/// Split a string using the given delimiters, passing each piece as a
+/// (non-null-terminated) boost::string_view to the callback.
+template <typename Func> // where Func(boost::string_view) is a valid call
+void for_each_substr(boost::string_view s, const char *delims, Func&& f)


tchaikov · 2017-11-08T03:28:00Z

src/common/str_list.cc

-      }
-    }
-  }
+  for_each_substr(str, delims, [&str_list] (boost::string_view token) {


can we pass token by reference? like:

for_each_substr(str, delims, [&str_list] (const boost::string_view& token) { ... }

string_view is just a pointer/len pair, so it's trivial to copy and can be passed in registers. some advice about this from the cpp core guidelines:

F.16: For “in” parameters, pass cheaply-copied types by value and others by reference to const
...
What is “cheap to copy” depends on the machine architecture, but two or three words (doubles, pointers, references) are usually best passed by value. When copying is cheap, nothing beats the simplicity and safety of copying, and for small objects (up to two or three words) it is also faster than passing by reference because it does not require an extra indirection to access from the function.
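The point about size can be checked with a small sketch. It uses `std::string_view` as a stand-in for `boost::string_view`, and the two function names are hypothetical. The static assertions hold on common standard library implementations but are assumptions, not guarantees for every platform.

```cpp
#include <cstddef>
#include <string_view>
#include <type_traits>

// Assumptions about the platform: a string_view is a pointer plus a length
// and can be copied bit for bit.
static_assert(std::is_trivially_copyable_v<std::string_view>,
              "expected string_view to be trivially copyable");
static_assert(sizeof(std::string_view) <= 2 * sizeof(void*),
              "expected string_view to be no larger than two words");

// By value: the callee works on its own pointer/length pair.
std::size_t token_length_by_value(std::string_view token)
{
  return token.size();
}

// By const reference: the callee reads the pair through a reference.
std::size_t token_length_by_ref(const std::string_view& token)
{
  return token.size();
}
```

Both forms compile and behave the same for callers, so the choice is about cost and simplicity rather than correctness.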

tchaikov · 2017-11-08T03:30:33Z

src/include/str_list.h

+void for_each_substr(boost::string_view s, const char *delims, Func&& f)
+{
+  auto pos = s.find_first_not_of(delims);
+  while (pos != boost::string_view::npos) {


nit, could just put

while (pos != s.npos) { ... }

less to type.

tchaikov · 2017-11-08T09:17:00Z

.gitmodules

@@ -55,3 +55,6 @@
 [submodule "src/rapidjson"]
 	path = src/rapidjson
 	url = https://github.com/ceph/rapidjson
+[submodule "src/benchmark"]


instead of adding google/benchmark as a submodule, can we add it using ExternalProject_Add()? and do the git clone on the fly? just like how we add nvml.

yeah, np. i'm not convinced that it's worth pulling in a new library for this dumb little benchmark though, unless we plan to make further use of it. do you think it's worth merging those pieces?

yeah, makes sense. after a second thought, neither am i convinced now. it'd be helpful in the case that we have multiple comparable backend/implementations of a certain feature like lz4,snappy,zlib and zstd, and user/developer can use the benchmark tool to evaluate them with different datasets/parameters to choose a backend and a set of parameters. but apparently, for_each_substr() does not have its alternative(s) in Ceph, so probably we should drop this benchmark tool.

cbodley · 2017-11-10T16:12:31Z

updated. i removed the benchmark stuff, but those commits are still available in https://github.com/cbodley/ceph/commits/wip-str-list-view-bench if anyone wants to reproduce

tchaikov · 2017-11-14T06:58:07Z

cbodley · 2017-11-14T14:26:51Z

thanks @tchaikov!

cbodley added the common label Nov 7, 2017

cbodley requested review from tchaikov and adamemerson November 7, 2017 20:11

adamemerson approved these changes Nov 7, 2017

cbodley force-pushed the wip-str-list-view branch from 143911a to 97fe0e8 Compare November 7, 2017 21:50

tchaikov reviewed Nov 8, 2017

cbodley added 4 commits November 10, 2017 11:05

common: add simplified interfaces to get_str_*

0b39e47

the simpler interfaces rely on return value optimization to avoid copying the result. removing the container from the argument list makes it easier to provide a default argument for 'delims'

Signed-off-by: Casey Bodley <cbodley@redhat.com>

test/common: extend str_list tests to include set

a942dfd

Signed-off-by: Casey Bodley <cbodley@redhat.com>

common: get_str_vec and friends use for_each_substr

c4e54b7

Signed-off-by: Casey Bodley <cbodley@redhat.com>

cbodley force-pushed the wip-str-list-view branch from 97fe0e8 to c4e54b7 Compare November 10, 2017 16:09

tchaikov approved these changes Nov 10, 2017

tchaikov added needs-qa wip-kefu-testing labels Nov 10, 2017

tchaikov merged commit c7df576 into ceph:master Nov 14, 2017

cbodley deleted the wip-str-list-view branch November 14, 2017 14:26

changchengx mentioned this pull request Jul 27, 2020

use for_each_substr to avoid redundant operation by calling get_str_set #35613

Merged

