
rgw: multisite log tracing #16492

Merged
merged 20 commits into ceph:master on Sep 11, 2017

Conversation

yehudasa
Member

A new framework that tracks, in memory, the current rgw sync process. The new system allows following a specific sync entity (vs. the current flat log dump). Each entity has an id that is a concatenation of the path within the execution tree. An entity roughly reflects a sync role (meta, meta shard, meta entry, data, data shard, bucket shard, object), and it is possible to look at the history from that entity's point of view. New admin socket commands were added:

    "sync trace active": "show active multisite sync entities information"
    "sync trace active_short": "show active multisite sync entities entries"
    "sync trace history": "sync trace history [filter_str]: show history of multisite tracing information"
    "sync trace show": "sync trace show [filter_str]: show current multisite tracing information"

All commands accept an extra parameter, a regex that can be used to search for a specific entity (matching against the history of that entity). E.g.,

$ ceph --admin-daemon=/home/yehudasa/ceph/build/run/c2/out/radosgw.8001.asok sync trace show meta.*shard.13
{
    "running": [
        {
            "status": "meta:shard[13]: took lease"
        }
    ],
    "complete": []
}
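
For illustration only, here is a minimal sketch of how path-concatenated entity ids like the "meta:shard[13]" above can be built, each node prefixing its name with its parent's id. The TraceNode type here is hypothetical, not the PR's RGWSyncTraceNode:

```cpp
#include <memory>
#include <string>

// Hypothetical illustration of building "parent:child" trace ids.
struct TraceNode {
  std::string prefix;   // full path within the execution tree, e.g. "meta:shard[13]"
  std::string status;   // latest status line for this entity

  TraceNode(const std::shared_ptr<TraceNode>& parent, const std::string& name)
    : prefix(parent ? parent->prefix + ":" + name : name) {}

  void log(const std::string& msg) {
    status = prefix + ": " + msg;   // what a "sync trace show" line would contain
  }
};

int main() {
  auto meta  = std::make_shared<TraceNode>(nullptr, "meta");
  auto shard = std::make_shared<TraceNode>(meta, "shard[13]");
  shard->log("took lease");        // -> "meta:shard[13]: took lease"
}
```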

We keep info about all currently running nodes, and also keep some history of completed nodes. We can see a view of the current nodes that are marked as active (where we have identified and are dealing with actual meta/data sync):

$ ceph --admin-daemon=/home/yehudasa/ceph/build/run/c1/out/radosgw.8000.asok sync trace active
{
    "running": [
        {
            "status": "data:sync:shard[115]:entry[buck:a6fabc5f-23bc-473c-9b05-ae8d1948dc08.4185.5]:bucket[buck:a6fabc5f-23bc-473c-9b05-ae8d1948dc08.4185.5]:inc_sync[buck:a6fabc5f-23bc-473c-9b05-ae8d1948dc08.4185.5]: listing bilog for incremental sync"
        },
        {
            "status": "data:sync:shard[115]:entry[buck:a6fabc5f-23bc-473c-9b05-ae8d1948dc08.4185.5]:bucket[buck:a6fabc5f-23bc-473c-9b05-ae8d1948dc08.4185.5]:inc_sync[buck:a6fabc5f-23bc-473c-9b05-ae8d1948dc08.4185.5]:entry[fff3]: bucket sync: sync obj: 8ef1cd9f-d0f2-485d-ba1e-c02f1f085b1c/buck[a6fabc5f-23bc-473c-9b05-ae8d1948dc08.4185.5])/fff3[0]"
        }
    ],
    "complete": []
}

There's an active_short option that can be used to just get the plain names of the entities that are currently syncing (e.g., list of /). This list is sent to the service map periodically and can be retrieved via the ceph service status command.
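
For context on the service map reporting mentioned above, here is a minimal standalone sketch using librados' service daemon calls (available since Luminous). It assumes a reachable cluster and default config; the service name, daemon name, and status keys are made up for illustration, and rgw performs this registration internally rather than through a separate client like this:

```cpp
#include <rados/librados.hpp>
#include <map>
#include <string>
#include <utility>

int main() {
  librados::Rados cluster;
  if (cluster.init2("client.admin", "ceph", 0) < 0) return 1;
  cluster.conf_read_file(nullptr);          // default ceph.conf search path
  if (cluster.connect() < 0) return 1;

  // register under a hypothetical service name; rgw does this internally
  cluster.service_daemon_register("rgw-sync-demo", "gateway1", {});

  // periodically push a small key/value status map; the mgr merges it into
  // the service map, where "ceph service status" can display it
  std::map<std::string, std::string> status = {
    {"current_sync", "buck1/obj1;buck2/obj7"},   // illustrative value only
  };
  cluster.service_daemon_update_status(std::move(status));

  cluster.shutdown();
}
```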

@mattbenjamin
Contributor

@yehudasa I spent some time trying to find something worrisome in this; I haven't found it so far. I don't -think- sending updates to the mgr at 10s intervals should be costly at either end? I did wonder whether we have a modern idiom replacing RWLock?

@mattbenjamin
Contributor

@yehudasa ok, one practical question; this change creates
+class RGWSyncTraceServiceMapThread : public RGWRadosThread {
which seems to mainly manage periodic mgr updates; would it make sense to consolidate this with other periodic mgr updates?

@liewegas added the rgw label Aug 8, 2017
@yehudasa
Member Author

@mattbenjamin while we update the status every 10 seconds, it is librados that decides when to send these updates, following its own internal config

@yehudasa
Member Author

would it make sense to consolidate this with other periodic mgr updates?

@mattbenjamin I don't recall any such periodic mgr thread that can be used. I do think that it'd be nice to have something we could hook into, but maybe that's out of scope.

@yehudasa
Member Author

@mattbenjamin rebased

};

/*
* a container to RGWSyncTraceNodeRef, responsible to keep track

Let's start the sentence with a capital letter.

* of live nodes, and when last ref is dropped, calls ->finish()
* so that node moves to the retired list in the manager
*/
class RGWSyncTraceNodeContainer {
Contributor

this wrapper adds an extra allocation to create, and an indirection to access. consider using a custom deleter with shared_ptr instead? this works for me:

-using RGWSTNCRef = std::shared_ptr<RGWSyncTraceNodeContainer>;
 
-RGWSTNCRef RGWSyncTraceManager::add_node(RGWSyncTraceNode *node)
+RGWSyncTraceNodeRef RGWSyncTraceManager::add_node(RGWSyncTraceNode *node)
 {
   RWLock::WLocker wl(lock);
   RGWSyncTraceNodeRef& ref = nodes[node->handle];
   ref.reset(node);
-  return RGWSTNCRef(new RGWSyncTraceNodeContainer(ref));
+  // return a separate shared_ptr that calls finish() on the node instead of
+  // deleting it. the lambda capture holds a reference to the original 'ref'
+  auto deleter = [ref] (RGWSyncTraceNode *node) { node->finish(); };
+  return {node, deleter};
 }
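
The suggestion above is the standard shared_ptr custom-deleter technique: the returned handle does not own the node; destroying the last handle calls finish() while the registry keeps the node alive. A generic, self-contained sketch of the same pattern, with illustrative names rather than the rgw types:

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct Node {
  std::string handle;
  void finish() { std::cout << handle << " retired\n"; }
};

std::map<std::string, std::shared_ptr<Node>> nodes;   // owning registry

// Return a handle whose destruction calls finish() instead of delete.
// The lambda captures the owning shared_ptr by value, so the Node stays
// alive in 'nodes' even after all handles are gone.
std::shared_ptr<Node> add_node(Node* node) {
  auto& ref = nodes[node->handle];
  ref.reset(node);
  auto deleter = [ref](Node* n) { n->finish(); };
  return {node, deleter};
}

int main() {
  auto h = add_node(new Node{"meta:shard[13]"});
  h.reset();   // prints "meta:shard[13] retired"; the Node itself is still
               // owned by 'nodes' and is freed when the registry is destroyed
}
```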

Contributor

the full commit that replaces uses of RGWSTNCRef with RGWSyncTraceNodeRef is available here: cbodley@af246d3

ldout(sync_env->cct, 20) << __func__ << "(): sync status for bucket "
<< bucket_shard_str{bs} << ": " << sync_status.state << dendl;
tn->log(20, SSTR("sync status for bucket "
<< bucket_shard_str{bs} << ": " << sync_status.state));
Contributor

I think we should leave the bucket_shard_str part out of these messages, now that it's already included in the log message. For example, this one looks like:

data:sync:shard[75]:entry[eslbge-29:d2560962-42d3-4295-ac0c-efbbce6b79d8.4134.9]:bucket[eslbge-29:d2560962-42d3-4295-ac0c-efbbce6b79d8.4134.9]: sync status for bucket eslbge-29:d2560962-42d3-4295-ac0c-efbbce6b79d8.4134.9: 0

@@ -1537,6 +1537,9 @@ OPTION(rgw_sync_log_trim_interval, OPT_INT) // time in seconds between attempts

OPTION(rgw_sync_data_inject_err_probability, OPT_DOUBLE) // range [0, 1]
OPTION(rgw_sync_meta_inject_err_probability, OPT_DOUBLE) // range [0, 1]
OPTION(rgw_sync_trace_history_size, OPT_INT) // max number of complete sync trace entries to keep


@yehudasa How about starting the comment with a capital letter?
// Max number of complete sync trace entries to keep

Member Author

@amitkumar50 on a scale of 0 to 100? 3
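
As an aside on the rgw_sync_trace_history_size option quoted above, a bounded history of completed entries can be kept with any container that evicts the oldest entry once a maximum size is exceeded. A hypothetical sketch (illustrative only; not the rgw implementation):

```cpp
#include <deque>
#include <iostream>
#include <string>

// Hypothetical bounded history, analogous in spirit to keeping at most
// rgw_sync_trace_history_size completed trace entries.
class BoundedHistory {
  std::deque<std::string> entries;
  size_t max_size;
public:
  explicit BoundedHistory(size_t max_size) : max_size(max_size) {}

  void add(std::string entry) {
    entries.push_back(std::move(entry));
    while (entries.size() > max_size) {
      entries.pop_front();          // drop the oldest completed entry
    }
  }

  void dump() const {
    for (const auto& e : entries) std::cout << e << '\n';
  }
};

int main() {
  BoundedHistory history(2);
  history.add("meta:shard[13]: took lease");
  history.add("meta:shard[13]: done");
  history.add("data:sync:shard[115]: finished");  // evicts the oldest entry
  history.dump();
}
```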

@yehudasa
Copy link
Member Author

@cbodley incorporated your changes and comments. @mattbenjamin replaced RWLock with ceph::shunique_lock.

mutable boost::shared_mutex lock;
using shunique_lock = ceph::shunique_lock<decltype(lock)>;

std::atomic<uint64_t> count;
Member

can we init count to 0 here
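
For readers unfamiliar with the shared/unique locking idiom that replaced RWLock here, the standard-library analogue is std::shared_mutex guarded by std::shared_lock for readers and std::unique_lock for writers; ceph::shunique_lock wraps both modes in one type. A hypothetical sketch that also shows the counter initialized to 0, as the review comment asks:

```cpp
#include <atomic>
#include <cstdint>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Standard-library analogue of the shared/exclusive locking used above.
class Registry {
  mutable std::shared_mutex lock;
  std::map<uint64_t, std::string> nodes;
  std::atomic<uint64_t> count{0};   // init to 0, per the review comment
public:
  uint64_t add(const std::string& status) {
    std::unique_lock<std::shared_mutex> wl(lock);   // exclusive: writers block everyone
    uint64_t handle = ++count;
    nodes[handle] = status;
    return handle;
  }
  std::string get(uint64_t handle) const {
    std::shared_lock<std::shared_mutex> rl(lock);   // shared: readers run concurrently
    auto it = nodes.find(handle);
    return it == nodes.end() ? std::string() : it->second;
  }
};

int main() {
  Registry r;
  auto h = r.add("meta:shard[13]: took lease");
  return r.get(h).empty() ? 1 : 0;
}
```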

struct bucket_str_noinstance {
const rgw_bucket& b;
bucket_str_noinstance(const rgw_bucket& b) : b(b) {}
};
Member

nit: newline after struct definition

@@ -420,18 +426,13 @@ class RGWMetaSyncSingleEntryCR : public RGWCoroutine {

bool error_injection;

RGWSyncTraceNodeRef tn;

public:
RGWMetaSyncSingleEntryCR(RGWMetaSyncEnv *_sync_env,
const string& _raw_key, const string& _entry_marker,
const RGWMDLogStatus& _op_status,
Member

while you're at it, can you fix the alignment here? (not a part of this changeset)

} catch (boost::bad_expression const& e) {
ldout(cct, 5) << "NOTICE: sync trace: bad expression: bad regex search term" << dendl;
} catch (...) {
ldout(cct, 5) << "NOTICE: sync trace: regex_search() through exception" << dendl;
Member

threw
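
The filter argument described in the PR summary (e.g. meta.*shard.13) is applied by a regex search over each entity's history, with the exception handling quoted above guarding against bad patterns. A minimal sketch of equivalent filtering using std::regex (the rgw code uses boost.regex, hence boost::bad_expression):

```cpp
#include <iostream>
#include <regex>
#include <string>
#include <vector>

// Return the entries whose text matches the user-supplied filter; an invalid
// pattern is reported rather than propagated, mirroring the catch blocks above.
std::vector<std::string> filter_entries(const std::vector<std::string>& entries,
                                        const std::string& filter) {
  std::vector<std::string> out;
  try {
    std::regex re(filter);
    for (const auto& e : entries) {
      if (std::regex_search(e, re)) {
        out.push_back(e);
      }
    }
  } catch (const std::regex_error& err) {
    std::cerr << "NOTICE: sync trace: bad regex search term: " << err.what() << '\n';
  }
  return out;
}

int main() {
  std::vector<std::string> history = {
    "meta:shard[13]: took lease",
    "data:sync:shard[115]: listing bilog for incremental sync",
  };
  for (const auto& e : filter_entries(history, "meta.*shard.13")) {
    std::cout << e << '\n';
  }
}
```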

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
So that we could then move nodes to the retired node list in
the manager.
Also, more tracing implementation.

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Dump trace node messages when either rgw or rgw_sync log level is
high enough (will only show one message if both are set).

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
avoid use after free

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
So that we can just see which resources are getting synced right now

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
yehudasa and others added 5 commits August 31, 2017 03:09
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Casey Bodley <cbodley@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
following code review

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
@yehudasa
Member Author

@cbodley @theanalyst rebased and took care of the issues mentioned

To prevent the lock from getting freed before other members that use it for cleanup.

Signed-off-by: Yehuda Sadeh <yehuda@redhat.com>
@yuriw merged commit cfb13b3 into ceph:master Sep 11, 2017
smithfarm pushed a commit to smithfarm/ceph that referenced this pull request Jan 5, 2018
these temporary errors get retried automatically, so no admin
intervention is required. logging them only serves to waste space in
omap and obscure the more serious sync errors

Fixes: http://tracker.ceph.com/issues/22473

Signed-off-by: Casey Bodley <cbodley@redhat.com>
(cherry picked from commit ca4510b)

Conflicts:
    src/rgw/rgw_data_sync.cc ("multisite log tracing" feature - see
        ceph#16492 - is not being backported to
        luminous)
robbat2 pushed a commit to dreamhost/ceph that referenced this pull request Jan 26, 2018
jdurgin pushed a commit to jdurgin/ceph that referenced this pull request Mar 5, 2018
robbat2 pushed a commit to dreamhost/ceph that referenced this pull request Mar 6, 2018
ivancich pushed a commit to ivancich/ceph-fork that referenced this pull request Mar 27, 2018
guihecheng pushed a commit to guihecheng/ceph that referenced this pull request Sep 12, 2018