399 - Speculative execution #323
Conversation
I had a few questions/nits. Let's address those, but overall it looks great.
cass_future_wait(close_future);
cass_future_free(close_future);
cass_cluster_free(cluster);
cass_session_free(session);
Should the session be freed before the cluster that presumably owns it?
It doesn't matter. Session copies what it needs from cluster. In general, the driver API uses the "const *" pattern for a parameter when it's going to either copy the object or keep a const reference to it.
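To illustrate, here's a minimal sketch using the public C API (the contact point is a placeholder and error handling is omitted): because cass_session_connect() takes a const CassCluster*, the session copies the settings it needs at connect time, so the cluster can be freed before or after the session.

#include <cassandra.h>

int main() {
  CassCluster* cluster = cass_cluster_new();
  CassSession* session = cass_session_new();
  cass_cluster_set_contact_points(cluster, "127.0.0.1");

  CassFuture* connect_future = cass_session_connect(session, cluster);
  cass_future_wait(connect_future);
  cass_future_free(connect_future);

  /* Either order is safe here: the session holds no reference to the cluster. */
  cass_cluster_free(cluster);
  cass_session_free(session);
  return 0;
}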
Cool.
CassError cass_cluster_set_no_speculative_execution_policy(CassCluster* cluster) {
  cluster->config().set_speculative_execution_policy(
      new cass::NoSpeculativeExecutionPolicy());
Would it make sense to have a singleton NoSpeculativeExecutionPolicy that can be set into any cluster object?
In general, the driver avoids allocating memory during static initialization (actually, we try to avoid any sort of complex static initialization). The main reason is that it prevents us from using a custom allocator to make those allocations when the default allocator is overridden. We've had several users ask for this feature: https://datastax-oss.atlassian.net/browse/CPP-360
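A minimal sketch of the ordering problem, with hypothetical hook names rather than the driver's actual allocator API: a static singleton is allocated during static initialization, before main() has a chance to install the custom allocator.

#include <cstddef>
#include <cstdlib>

typedef void* (*MallocFn)(std::size_t);
static MallocFn g_malloc = std::malloc;  // default until overridden
void set_allocator(MallocFn fn) { g_malloc = fn; }

struct NoSpeculativeExecutionPolicy {};

// Runs before main(): this allocation happens before any application code
// can install custom hooks, so it can never be routed through them.
static NoSpeculativeExecutionPolicy* g_singleton =
    new NoSpeculativeExecutionPolicy();

void* my_malloc(std::size_t size) { return std::malloc(size); }  // stand-in

int main() {
  set_allocator(my_malloc);  // too late for g_singleton
  (void)g_singleton;
  return 0;
}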
A potentially better optimization is to keep returning the same instance of NoSpeculativeExecutionPlan (vs. the policy) since a plan is allocated for every request. I've attempted to do this (using a custom operator new and operator delete) without luck because multiple threads attempt to write the vtable simultaneously. In the end, that tiny allocation doesn't add very much overhead because it is likely cached in the allocator anyway.
Agreed, not a huge deal here. That said, you could create a NoSpeculativeExecutionPlan data member in NoSpeculativeExecutionPolicy's constructor and return that one plan object anytime a user asks for a plan from this policy object. So, you won't have one NoSpeculativeExecutionPlan object in the whole application, but rather O(# NoSpeculativeExecutionPolicy objects), which is still better than O(# active requests). In fact, most of the time you'll only have one cluster object, so # of NoSpeculativeExecutionPolicy is often 1.
That can't be done because the calling code is responsible for deleting the plan and would incorrectly delete the shared instance. We could use a ref-counted object for plans or figure out some custom allocation solution as I attempted before, but I think it's overkill.
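A toy sketch of why that breaks down, with hypothetical simplified interfaces rather than the driver's actual classes: the caller owns and deletes the plan it receives, so handing out a shared member invites a bad delete.

struct SpeculativeExecutionPlan {
  virtual ~SpeculativeExecutionPlan() {}
};

struct NoSpeculativeExecutionPlan : public SpeculativeExecutionPlan {};

struct NoSpeculativeExecutionPolicy {
  NoSpeculativeExecutionPlan plan_;  // the suggested shared member

  // Today's contract: each plan is heap-allocated and the caller deletes it.
  SpeculativeExecutionPlan* new_plan() {
    return new NoSpeculativeExecutionPlan();
  }

  // The suggestion: hand out the member instead.
  SpeculativeExecutionPlan* shared_plan() { return &plan_; }
};

int main() {
  NoSpeculativeExecutionPolicy policy;

  SpeculativeExecutionPlan* plan = policy.new_plan();
  delete plan;  // fine: the caller owns this heap allocation

  SpeculativeExecutionPlan* shared = policy.shared_plan();
  // delete shared;  // undefined behavior: deleting a non-heap member,
                     // and every request would attempt it
  (void)shared;
  return 0;
}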
Agreed; not worth the trouble if it's even remotely non-trivial.
bool IOWorker::execute(const RequestHandler::Ptr& request_handler) {
  request_handler->inc_ref(); // Queue reference
  if (!request_queue_.enqueue(request_handler.get())) {
    request_handler->dec_ref();
I notice several object types in the driver are ref-counted, which makes sense so that objects can be shared indiscriminately and only cleaned up when the last ref is lost. However, imo having the "user level code" be responsible for incrementing and decrementing the ref-count as objects are shared leaves room for developer error. Have you considered passing shared-refs by value and having the increment/decrement logic in the constructor/destructor? I understand there's a cost to that -- constructing and destroying intermediate copies -- but I think implementing it that way also makes you resilient to uncaught exceptions. For example, in the above code, if enqueue could potentially throw an exception, the request-handler ref-count will not be decreased.
Actually, a pass-by-ref when calling methods may be fine most of the time, but store the shared-ptr object by value so that the ref-count increases. So in the above case, enqueue could take a ref, but in its impl it adds a copy to the queue. Something like that.
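A toy sketch of that idea (not the driver's actual types): the smart pointer's constructor/destructor manage the ref-count, enqueue() takes a const reference, and the queue stores a copy by value, so a reference can't leak even if a call unwinds early.

#include <cstdio>
#include <deque>

struct RefCounted {
  int refs;
  RefCounted() : refs(0) {}
  void inc_ref() { ++refs; }
  void dec_ref() { if (--refs == 0) delete this; }
};

struct Ptr {
  RefCounted* p;
  explicit Ptr(RefCounted* ptr = 0) : p(ptr) { if (p) p->inc_ref(); }
  Ptr(const Ptr& o) : p(o.p) { if (p) p->inc_ref(); }
  Ptr& operator=(const Ptr& o) {
    if (o.p) o.p->inc_ref();  // increment first to survive self-assignment
    if (p) p->dec_ref();
    p = o.p;
    return *this;
  }
  ~Ptr() { if (p) p->dec_ref(); }
};

struct Queue {
  std::deque<Ptr> items;
  bool enqueue(const Ptr& handler) {
    items.push_back(handler);  // the stored copy holds its own reference
    return true;
  }
};

int main() {
  Queue queue;
  {
    Ptr handler(new RefCounted());  // refs == 1
    queue.enqueue(handler);         // refs == 2: the queue's copy
  }                                 // local copy destroyed: refs == 1
  std::printf("queued refs: %d\n", queue.items.front().p->refs);
  return 0;  // queue destroyed: refs == 0, object deleted
}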
I do agree with the error-proneness of manually inc/dec'ing ref-counted objects. I think we try to avoid this in most parts of the driver and I've attempted to make improvements in this release. If there are specific cases where you think it can be fixed, let me know. There are a few cases where this can't easily be avoided:
- Returning a reference to the external API. This is because the fact that it's a ref-counted type is lost.
- In async code where the type is erased or it's added to a data structure that doesn't manage lifetime (e.g. an intrusively linked list). There's likely some improvement that could be made here in some places in Connection.
- I've attempted to use MPMCQueue<RequestHandler::Ptr> instead of MPMCQueue<RequestHandler*>. The problem is the queue holds on to request memory until another request overwrites it or until the queue is destroyed (see the sketch below). One solution is to proactively set the item to an empty object, T(), when dequeuing it; however, I am almost certain that would cause data races. The problem with keeping the memory alive is that the size of a request is determined by the integrating application and could be quite large, and there's an expectation that when a request is done its memory will be reclaimed.

The driver does not use exceptions and limits the use of std library methods that throw exceptions.
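A minimal single-threaded sketch of the retention problem described in the last bullet (a plain array stands in for the queue's internal buffer): after a dequeue, the slot's copy of the pointer still pins the request.

#include <cstdio>
#include <memory>

struct Request { /* potentially large, sized by the integrating application */ };

int main() {
  std::shared_ptr<Request> slots[4];  // stand-in for the queue's ring buffer

  std::shared_ptr<Request> req(new Request());
  slots[0] = req;  // enqueue: the slot stores a copy of the pointer

  std::shared_ptr<Request> out = slots[0];  // dequeue: copy the value out
  // In a lock-free MPMC queue we could not safely reset slots[0] here
  // without racing a producer that has already claimed the slot.

  out.reset();
  req.reset();
  std::printf("request still alive: %s\n", slots[0] ? "yes" : "no");  // "yes"
  return 0;
}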
I think I'd like to have a longer discussion about this later; this is a broader question that isn't particularly related to speculative execution, so I'm cool with what you've got here.
while (remaining != 0 && io_worker->request_queue_.dequeue(request_handler)) {
  if (request_handler != NULL) {
while (remaining != 0 && io_worker->request_queue_.dequeue(temp)) {
  RequestHandler::Ptr request_handler(temp);
Since request_handler isn't used beyond this if, you could create it inside the if, and have the if condition be if (temp) { instead.
See above for why the queue uses RequestHandler* instead of RequestHandler::Ptr.
The goal was to get the RequestHandler into a shared ptr as soon as possible to avoid errors (in future changes). Also, the SpeculativeExecution constructor takes a shared ptr.
Gotcha.
request_handler->set_io_worker(io_worker);
request_handler->retry();
request_handler->start_request(io_worker);
SpeculativeExecution::Ptr speculative_execution(new SpeculativeExecution(request_handler,
You decreased the ref-count on request_handler above, but you don't increment it before sending it to the SpeculativeExecution constructor. And the constructor doesn't increase the ref count either. Is this a bug?
It's been added to a shared pointer, which increments and tracks the reference.
Ahh, I see that now.
case CQL_OPCODE_RESULT: {
  ResultResponse* result =
      static_cast<ResultResponse*>(response->response_body().get());
  if (result->kind() == CASS_RESULT_KIND_PREPARED) {
Is there a scenario where we're simply happy with the response and don't retry on the current host or the next? I would've thought that if the result is a PREPARED, then we're actually good, since presumably a prepare request led to this PrepareCallback.
The request has been chained and we're retrying the original request. We've attempted to run a prepared query and the server responded with UNPREPARED, so we've wrapped the original request in a prepare request and are now returning to the original request.
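A toy sketch of that control flow (simulated, with made-up function names rather than the driver's actual callback wiring): an UNPREPARED error triggers a prepare request, and its PREPARED result hands control back to the original request.

#include <cstdio>

enum ResponseKind { UNPREPARED, PREPARED, ROWS };

void execute_original(bool reprepared);

// Stand-in for PrepareCallback: reacts to the prepare request's response.
void on_prepare_response(ResponseKind kind) {
  if (kind == PREPARED) {
    std::printf("re-prepared; retrying the original request\n");
    execute_original(true);
  }
}

void execute_original(bool reprepared) {
  if (!reprepared) {
    std::printf("got UNPREPARED; chaining a prepare request\n");
    on_prepare_response(PREPARED);  // simulated server reply
  } else {
    std::printf("original request succeeded\n");
  }
}

int main() {
  execute_original(false);
  return 0;
}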
Ok.
error->required(),
error->data_present() > 0,
num_retries_));
decision = request_handler_->retry_policy()->on_read_timeout(request(),
I notice that you do an idempotent check for write-timeout below, but not for read-timeout here. Shouldn't the check be here also, since if a statement is not idempotent, we don't want to retry regardless?
Or is it the retry policy's responsibility to account for the idempotency of the request? If that's the case, then the write-timeout handling below should not check for idempotency.
I'm pretty sure that Cassandra read requests are always idempotent. That is, reads don't change data so they can be retried without unintended side effects.
I believe a read-timeout error means a socket read timeout occurred between the coordinator and the node processing the statement, so it can happen for any kind of statement.
"Read_timeout: Timeout exception during a read request."
https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v3.spec#L982
It explicitly says "write request" and "read request". I'm pretty sure it doesn't have to do we a failed acknowledgement from a replica write as that would manifest as a Write_timeout
.
Makes sense.
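Summing up the thread, a toy sketch of the asymmetry (not the driver's actual code): reads are inherently idempotent, so only the write-timeout path needs an idempotence guard before consulting the retry policy.

struct Request {
  bool idempotent;
  bool is_idempotent() const { return idempotent; }
};

enum Decision { RETURN_ERROR, RETRY };

Decision on_read_timeout(const Request&) { return RETRY; }   // policy stub
Decision on_write_timeout(const Request&) { return RETRY; }  // policy stub

Decision handle_read_timeout(const Request& request) {
  // No idempotence check: a read changes no data, so retrying is safe.
  return on_read_timeout(request);
}

Decision handle_write_timeout(const Request& request) {
  // Retrying a non-idempotent write could apply its effects twice.
  if (!request.is_idempotent()) return RETURN_ERROR;
  return on_write_timeout(request);
}

int main() { return 0; }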
request_handler->encoding_cache()));
RequestHandler* temp = NULL;
while (session->request_queue_->dequeue(temp)) {
  RequestHandler::Ptr request_handler(temp);
This could move down into the if, with the if condition changed to if (temp) {.
The goal was to get the RequestHandler into a shared ptr as soon as possible to avoid errors (in future changes). Also, see above.
👍
Ok, I think you've addressed everything I wondered about. Merge at will! :)