Added Stateful processing related message handling in TMasterClient #2075

srkukarni · 2017-07-17T21:23:08Z

While running in stateful mode, TMaster sends the following messages to the stmgr:

- Periodic checkpoint messages (StartStatefulCheckpoint), so that the stmgr can forward them to its local instances to checkpoint their state.
- Restore topology messages (RestoreTopologyStateRequest), so that the stmgr can start restoring the state of its local instances to a specified checkpoint_id.
- Start processing message (StartStmgrStatefulProcessing), to start processing once all stmgrs have restored their local instances' state.

This change adds the ability for TMasterClient in stmgr to handle those messages and call the appropriate stmgr callback. Note that the stmgr callbacks themselves are not yet implemented; they will come in the next set of PRs.
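For readers skimming the PR, the handler-to-callback flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this change: the structs stand in for the real protobuf messages, and `TMasterClientSketch`, its `Handle` overloads, and the callback type are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-ins for the protobuf messages named in the description.
// The single checkpoint_id field is an assumption made for this sketch.
struct StartStatefulCheckpoint { std::string checkpoint_id; };
struct RestoreTopologyStateRequest { std::string checkpoint_id; };
struct StartStmgrStatefulProcessing { std::string checkpoint_id; };

// Hypothetical client: each handler only unpacks the message and forwards
// the checkpoint id to the callback supplied by the stream manager.
class TMasterClientSketch {
 public:
  using CheckpointIdCb = std::function<void(const std::string&)>;

  TMasterClientSketch(CheckpointIdCb on_checkpoint, CheckpointIdCb on_restore,
                      CheckpointIdCb on_start)
      : on_checkpoint_(std::move(on_checkpoint)),
        on_restore_(std::move(on_restore)),
        on_start_(std::move(on_start)) {}

  void Handle(const StartStatefulCheckpoint& m) { on_checkpoint_(m.checkpoint_id); }
  void Handle(const RestoreTopologyStateRequest& m) { on_restore_(m.checkpoint_id); }
  void Handle(const StartStmgrStatefulProcessing& m) { on_start_(m.checkpoint_id); }

 private:
  CheckpointIdCb on_checkpoint_;  // stmgr: forward checkpoint request to instances
  CheckpointIdCb on_restore_;     // stmgr: begin restoring local instances
  CheckpointIdCb on_start_;       // stmgr: tell local instances to start processing
};
```

Keeping the client as a thin dispatcher means the stmgr callbacks can be filled in by later PRs without touching the message handling.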

nlu90 · 2017-07-18T00:00:31Z

heron/stmgr/src/cpp/manager/tmaster-client.cpp

@@ -217,5 +235,48 @@ void TMasterClient::SendHeartbeatRequest() {
  return;
 }

+void TMasterClient::SavedInstanceState(const proto::system::Instance& _instance,
+                                       const std::string& _checkpoint_id) {
+  proto::ckptmgr::InstanceStateStored message;


Should we use the global mempool here and in the following places?

My feeling is that it's not necessary, since this is on the control path.

nlu90 · 2017-07-18T00:09:31Z

heron/stmgr/src/cpp/manager/tmaster-client.cpp

+  __global_protobuf_pool_release__(_message);
+}
+
+void TMasterClient::HandleStartStmgrStatefulProcessing(


Why do we need this start_stateful_processing callback? Whether a topology is stateful or not is configured in the user's topology code. Once a stmgr is started, it should already know whether it is stateful by reading the topology config, and can thus behave accordingly.

Topologies that are not doing exactly-once can start processing right away, as soon as the assignments are propagated to all stmgrs/instances. For exactly-once semantics, however, we have to wait until all of the stmgrs/instances are restored to a certain globally consistent checkpoint before proceeding. Once tmaster has made that determination, this is the message it sends to the stmgr to start the actual processing.
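The "wait until everyone has restored" step can be pictured as a simple barrier on the tmaster side. The sketch below is only an illustration of that idea under assumed names (`RestoreBarrierSketch`, `OnRestored`, `AllRestored` are hypothetical); it says nothing about how tmaster actually tracks restores.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Hypothetical tmaster-side bookkeeping for one restore round.
class RestoreBarrierSketch {
 public:
  RestoreBarrierSketch(std::set<std::string> stmgrs, std::string checkpoint_id)
      : pending_(std::move(stmgrs)), checkpoint_id_(std::move(checkpoint_id)) {}

  // Records that `stmgr` finished restoring to `checkpoint_id`.
  // Returns true exactly once: when the last pending stmgr reports in,
  // which is the point at which the start-processing message would be sent.
  bool OnRestored(const std::string& stmgr, const std::string& checkpoint_id) {
    if (checkpoint_id != checkpoint_id_) return false;  // report for another round
    if (pending_.erase(stmgr) == 0) return false;       // unknown or duplicate report
    return pending_.empty();
  }

  bool AllRestored() const { return pending_.empty(); }

 private:
  std::set<std::string> pending_;  // stmgrs that have not yet restored
  std::string checkpoint_id_;      // the globally consistent checkpoint
};
```

A non-stateful topology has no such barrier, which is why it can start as soon as the assignment is propagated.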

nlu90 · 2017-07-18T00:10:03Z

heron/proto/ckptmgr.proto

  optional string dead_stmgr = 1;
+  // Was there a dead/recovered local instance connection that was the reason
+  // for this request?
  optional int32 dead_taskid = 2;


Should we allow for multiple task deaths?

Not necessary. We issue these messages to the tmaster as soon as we notice any instance failure. The same applies to any stmgr client connection failures.
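To make the reasoning concrete: if a request is sent the moment a failure is noticed, each request only ever needs to carry the one failure that triggered it. A rough sketch, assuming a hypothetical `ResetRequestSketch` struct that mirrors the two optional fields shown in the diff and a hypothetical `FailureReporterSketch` that records what would be sent to tmaster:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of the two optional proto fields in the hunk above.
struct ResetRequestSketch {
  bool has_dead_stmgr = false;
  std::string dead_stmgr;
  bool has_dead_taskid = false;
  std::int32_t dead_taskid = 0;
};

// Hypothetical reporter: one request per observed failure, issued immediately.
class FailureReporterSketch {
 public:
  void OnInstanceConnectionLost(std::int32_t task_id) {
    ResetRequestSketch req;
    req.has_dead_taskid = true;
    req.dead_taskid = task_id;
    sent_.push_back(req);  // stands in for sending the request to tmaster
  }

  void OnStmgrConnectionLost(const std::string& stmgr_id) {
    ResetRequestSketch req;
    req.has_dead_stmgr = true;
    req.dead_stmgr = stmgr_id;
    sent_.push_back(req);  // stands in for sending the request to tmaster
  }

  const std::vector<ResetRequestSketch>& sent() const { return sent_; }

 private:
  std::vector<ResetRequestSketch> sent_;
};
```

Two instance failures in quick succession simply produce two requests, so a repeated field is not needed under this scheme.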

nlu90 · 2017-07-18T18:26:36Z

👍

billonahill · 2017-07-18T18:41:34Z

heron/stmgr/src/cpp/manager/stmgr.h

@@ -138,6 +138,20 @@ class StMgr {
  void BroadcastTmasterLocation(proto::tmaster::TMasterLocation* tmasterLocation);
  void BroadcastMetricsCacheLocation(proto::tmaster::MetricsCacheLocation* tmasterLocation);

+  // Called when TMaster sends a InitiateStatefulCheckpoint message with a checkpoint_id
+  // This will send intiate checkpoint messages to local instances to capture their state.


typo: intiate

billonahill · 2017-07-18T18:42:33Z

heron/stmgr/src/cpp/manager/stmgr.h

+  // Invoked when TMaster sends the StartStatefulProcessing request to kick
+  // start the computation. We send the StartStatefulProcessing to all our
+  // local instances so that they can start the processing.
+  void StartStatefulProcessing(sp_string _checkpoint_id);


Why is the checkpoint id necessary to start processing?

Mostly for correctness checking purposes. Instances will check that they have recovered to this checkpoint id before starting. They should die if this checkpoint id does not match theirs.
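As an illustration of that check, an instance could compare the id in the start request against the id it last restored to. The names below (`InstanceSketch`, `StartResult`) are hypothetical, and the sketch returns a status where a real instance would presumably log and exit on a mismatch.

```cpp
#include <cassert>
#include <string>

enum class StartResult { kStarted, kCheckpointMismatch };

// Hypothetical instance-side state for the correctness check.
class InstanceSketch {
 public:
  // Called once the instance has restored its state to `checkpoint_id`.
  void OnRestored(const std::string& checkpoint_id) {
    restored_checkpoint_id_ = checkpoint_id;
  }

  // Called when the start-processing request arrives. Processing begins only
  // if the requested id matches the id this instance restored to.
  StartResult OnStartStatefulProcessing(const std::string& checkpoint_id) {
    if (checkpoint_id != restored_checkpoint_id_) {
      return StartResult::kCheckpointMismatch;  // caller is expected to abort
    }
    processing_ = true;
    return StartResult::kStarted;
  }

  bool processing() const { return processing_; }

 private:
  std::string restored_checkpoint_id_;
  bool processing_ = false;
};
```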

…2075)

Added Stateful processing related message handling in TMasterClient

d7f7daa

srkukarni requested review from billonahill, kramasamy, objmagic and huijunw July 17, 2017 21:23

huijunw approved these changes Jul 17, 2017

nlu90 reviewed Jul 18, 2017

srkukarni merged commit db60686 into apache:master Jul 18, 2017

srkukarni deleted the sanjeevk/ext1_tmasterclient branch July 18, 2017 05:10

billonahill reviewed Jul 18, 2017

nicknezis pushed a commit that referenced this pull request Sep 14, 2020

Added Stateful processing related message handling in TMasterClient (#…

7e838cd

…2075)
