
feat: add master lsn and journal_executed dcheck in replica via ping #2778

Merged: 16 commits, Apr 1, 2024

Conversation

kostasrim
Contributor

@kostasrim kostasrim commented Mar 26, 2024

resolves #2773

  • add lsn number to journal ping
  • add periodic ping from master to replica in journal
  • add dcheck in replica that journal_executed == lsn
  • add version 4 in dfly version
  • add separate counter for journal_executed that has proper semantics for pinging lsn
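The versioned PING extension in the list above can be sketched roughly as follows; `Op`, `SerializePing`, and the version constants are illustrative stand-ins, not the actual Dragonfly identifiers:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: a VER4+ peer receives PING followed by the
// current LSN; older peers receive the bare PING opcode only.
enum class Op : uint8_t { PING = 10 };
constexpr int kVer3 = 3, kVer4 = 4;

inline std::vector<uint64_t> SerializePing(int peer_version, uint64_t lsn) {
  std::vector<uint64_t> out{static_cast<uint64_t>(Op::PING)};
  if (peer_version >= kVer4)  // only VER4+ understands the LSN extension
    out.push_back(lsn);
  return out;
}
```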

@kostasrim kostasrim self-assigned this Mar 26, 2024
@kostasrim
Contributor Author

@adiholden This is just a prototype so don't review -- it needs polishing and some fixing/gluing. I opened this because I have a small question:

There are two options:

  1. Ping at period P on a separate fiber.
  2. Ping at period P when we Record an entry in the journal (that means that if the master is idle there will be no pings, even if the period interval was reached).

I opted for (2) (although it's easy to switch to (1)). The reason is that if the master is idle, then the last recorded entry (or one of the last within 2 seconds) will send a PING LSN journal entry, and since the master won't progress anyway, the last lag will show how close the replica is. The downside is that we won't get continuous updates on our progress on the replica side. Also, note that for (1) we would need n = number_of_flows fibers, whereas with (2) we don't (it flows naturally with stable sync and journal recording).

Do you have any objections to (2)?
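A minimal sketch of option (2), assuming a hypothetical `PeriodicPing` helper invoked from the journal-record path (names are illustrative, not the real class):

```cpp
#include <cassert>
#include <chrono>

// The ping is only *considered* while recording a journal entry, so an
// idle master emits no pings; it fires only once the period elapsed.
class PeriodicPing {
 public:
  explicit PeriodicPing(std::chrono::seconds period) : period_(period) {}

  // Called from the record path; true => emit a PING LSN entry now.
  bool ShouldPing(std::chrono::steady_clock::time_point now) {
    if (now - last_ < period_) return false;
    last_ = now;
    return true;
  }

 private:
  std::chrono::seconds period_;
  std::chrono::steady_clock::time_point last_{};  // epoch => first call pings
};
```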

// TODO remove incrementing lsn on master side otherwise we break takeover
// journal_rec_executed_.fetch_add(1, std::memory_order_relaxed);
if (tx_data->lsn != journal_rec_executed_.load()) {
// TODO LOG
Contributor Author

@adiholden we should not DCHECK here. Lagging should not crash us on debug builds (mentioning this because it was mentioned in the issue)

Collaborator

Why not crash? If we reproduce this in tests or some run, I want to know we have a bug here so we can debug it.
The check is not that the replica is lagging behind the master, but that the count of journal changes differs between master and replica, meaning some data did not reach the replica, or we do not count the journal changes correctly on the replica.

@adiholden
Collaborator

@adiholden This is just a prototype so don't review -- it needs polishing and some fixing/gluing. I opened this because I have a small question:

There are two options:

  1. Ping at period P on a separate fiber.
  2. Ping at period P when we Record an entry in the journal (that means that if the master is idle there will be no pings, even if the period interval was reached).

I opted for (2) (although it's easy to switch to (1)). The reason is that if the master is idle, then the last recorded entry (or one of the last within 2 seconds) will send a PING LSN journal entry, and since the master won't progress anyway, the last lag will show how close the replica is. The downside is that we won't get continuous updates on our progress on the replica side. Also, note that for (1) we would need n = number_of_flows fibers, whereas with (2) we don't (it flows naturally with stable sync and journal recording).

Do you have any objections to (2)?

Option 2 sounds good

@kostasrim
Contributor Author

@adiholden replication tests should fail -- I am chasing two missing LSNs 😛 In the meantime, you can leave comments :)

@kostasrim kostasrim marked this pull request as ready for review March 27, 2024 12:31
@kostasrim kostasrim changed the title feat: add master lag check in replica feat: add master lsn and journal_executed dcheck in replica via ping Mar 27, 2024
} else if (tx_data->opcode == journal::Op::EXEC) {
if (use_multi_shard_exe_sync_) {
records = tx_data->journal_rec_count;
Contributor Author

I will revert this change.

@@ -503,7 +503,8 @@ OpStatus DflyCmd::StartStableSyncInThread(FlowInfo* flow, Context* cntx, EngineS

if (shard != nullptr) {
flow->streamer.reset(new JournalStreamer(sf_->journal(), cntx));
flow->streamer->Start(flow->conn->socket());
const bool should_ping = flow->version == DflyVersion::VER4;
Collaborator

flow->version >= DflyVersion::VER4

@@ -195,7 +195,11 @@ io::Result<journal::ParsedEntry> JournalReader::ReadEntry() {
entry.dbid = dbid_;
entry.opcode = opcode;

if (opcode == journal::Op::PING || opcode == journal::Op::FIN) {
if (opcode == journal::Op::PING) {
SET_OR_UNEXPECT(ReadUInt<uint64_t>(), entry.lsn);
Collaborator

We currently send a PING in the replica takeover flow, so you need to read entry.lsn only if the master version is VER4 or higher.

if (use_multi_shard_exe_sync_) {
InsertTxDataToShardResource(std::move(*tx_data));
} else {
ExecuteTxWithNoShardSync(std::move(*tx_data), cntx);
}
}
journal_rec_executed_.fetch_add(records);
Collaborator

I wrote in the issue that we need to compare with journal_rec_executed_, but after reviewing the code I now understand that we need a different counter, incremented inside NextTxData when reading another entry from the socket.
The reason is that under use_multi_shard_exe_sync_ we accumulate the multi-transaction data and do not execute it immediately, therefore journal_rec_executed_ may hold a different value when a ping arrives in the middle of a multi transaction.

Contributor Author

@kostasrim kostasrim Mar 27, 2024

The reason for this is that under use_multi_shard_exe_sync_ we accumulate the multi-transaction data and do not execute it, therefore you might get a different value in journal_rec_executed_

Exactly. That's why I moved the fetch_add out of execute and put it here, so this doesn't happen. I am not 100% sure this is correct, but for now I am looking at why one of the replication tests triggers the DCHECK (and it's not because of multi). I will take care of this soon :)
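The divergence being discussed can be illustrated with a toy sketch (all names hypothetical): under multi-shard sync, entries of a multi transaction are read and accumulated but not yet executed, so only a counter bumped on read can match the master's LSN at ping time:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy replica: records_read advances as entries come off the socket,
// records_executed only when the accumulated multi-tx actually runs.
struct ToyReplica {
  uint64_t records_read = 0;
  uint64_t records_executed = 0;
  std::vector<int> pending;  // accumulated multi-transaction entries

  void ReadMultiEntry(int entry) {
    ++records_read;            // matches what the master already sent
    pending.push_back(entry);  // ...but execution is deferred
  }
  void ExecutePending() {
    records_executed += pending.size();
    pending.clear();
  }
};
```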

Collaborator

But journal_rec_executed_ was incremented after executing the command on purpose. The replica sends the master this value so that we know the replica lag, and so that we do replica takeover only after the lag is 0, meaning no records remain unexecuted.

Contributor Author

Yes, I understand this, and that's why I said it's probably wrong. I ended up using a separate counter in NextTxData, but we still have some issues :)

NotifyWritten(allow_await);
});
if (with_pings) {
periodic_ping_.MaybePing(allow_await);
Collaborator

So I don't think this flow works if we have a few replicas.
This class is created per replica, but here you increase the LSN and send the journal change to only a single replica.
To make this flow work you should write this code in the journal slice, so that the ping is sent to all registered callbacks.

Contributor Author

I am aware, but what you suggest would also be problematic and a bug. We use Journal::RegisterOnChange in other places, and we do not only register callbacks associated with StableSync; for example, we also register a callback during snapshot (see snapshot.cc:65). So if we were to apply the ping in all callbacks, we would send multiple pings on execution paths that are completely irrelevant. For stable sync, we register one callback per flow, and each flow should periodically ping its LSN.

Collaborator

how was this resolved?

Contributor Author

I am fairly certain that what @adiholden suggests here is a bug. We should NOT send a ping for every registered callback, because some of the registered callbacks are completely irrelevant to StableSync. An example is the snapshot, which does call RegisterOnChange, and you most certainly wouldn't want to ping there.

Each shard on the master has a local JournalSlice. Even if we have fewer shards available on the replica, we still have the master's number of flows. For example:

master(3 shards)  [shard 1] [shard 2] [shard 3]
                   |       /            |
replica(2 shards)  [shard 1]     [shard 2]

Here we have 3 flows, each having its local LSN retrieved from the thread/shard-local JournalSlice. However, on the replica side we track LSNs via a local variable of the flow. So when shard 2 sends its LSN to replica shard 1, it does not overwrite a thread-local on the replica side but a member variable that tracks the LSNs for that flow, and therefore we should be safe.
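The per-flow bookkeeping described above can be sketched with illustrative names: each replica-side flow object keeps its own last-seen LSN as a member, so two flows landing on the same replica shard cannot clobber a shared thread-local:

```cpp
#include <cassert>
#include <cstdint>

// Each master flow streams its own LSN sequence; the replica tracks it
// in a per-flow member rather than a thread- or shard-local variable.
struct FlowLsnTracker {
  uint64_t last_master_lsn = 0;
  void OnPingLsn(uint64_t lsn) { last_master_lsn = lsn; }
};
```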

Collaborator

@kostasrim you resolved this now by sending not the LSN but a local variable you keep track of in the streamer class, which you named total_records_.
With this change we might miss something when trying to understand where our bug in replica takeover is.
We conclude replica takeover when the replica's LSN is the same as the master's.
If we missed sending some data from master to replica but it did get into the journal, we will not see it. If we sent a wrong LSN value when full sync finished, we will not see this.
Also, regarding my suggestion above: it is not a bug; the only thing is that we will sometimes write a ping to the journal when it is not needed, i.e. when snapshotting from the SAVE command.

Collaborator

@adiholden adiholden Mar 31, 2024

journal_rec_executed and tx_reader.NextTxData are on the replica side, and I don't see how they are relevant here. In order to make sure we get all the journal changes on the replica side and that we are in sync with the LSN, we must send the LSN from the master. Because the LSN is saved in the journal slice, we need to write the LSN value held by that class.
I do think that changing the journal slice to write an Op::PING entry with the LSN is the right way. You can ignore the Op::PING in snapshotting, as we do for Op::EXEC and Op::NOOP. I don't see this as mixing irrelevant flows: we want to be able to track the LSN, so we must record it. Whether we want to track it is for the callback writing the sync to decide: no in snapshot, yes in streamer.
This will also simplify the flow in the replica, because you will always increase the LSN on PING on the master side, and you will always increase journal_rec_executed on the replica side.

Contributor Author

@kostasrim kostasrim Mar 31, 2024

I made the changes, but now we trigger the DCHECK almost every time on test_replication_all, which IMO should not happen. So either there is something we are missing or there is a bug.

This will also simplify the flow in the replica, because you will always increase the LSN on PING on the master side, and you will always increase journal_rec_executed on the replica side

We still need a separate variable for this, because journal_rec_executed gets incremented in an interleaved fashion. See also your comment above:

I wrote in the issue that we need to compare with journal_rec_executed_, but after reviewing the code I now understand that we need a different counter, incremented inside NextTxData when reading another entry from the socket.
The reason is that under use_multi_shard_exe_sync_ we accumulate the multi-transaction data and do not execute it immediately, therefore journal_rec_executed_ may hold a different value when a ping arrives in the middle of a multi transaction.

Contributor Author

any ideas?

Collaborator

The changes you did are again sending pings from the streamer and not from the journal slice. This was my first comment in this thread: it will not work, because when we have several replicas you send the ping to only one and increase the lag for all.

Contributor Author

I see. I made the changes, however the DCHECK still triggers. I will investigate. Let me know if you have any ideas as well.

@@ -83,6 +83,10 @@ LSN Journal::GetLsn() const {
return journal_slice.cur_lsn();
}

LSN Journal::PostIncrLsn() {
Collaborator

what's a postincrlsn?

Contributor Author

PostIncrementLsn -> PostIncrLsn

@@ -234,6 +234,7 @@ class DflyShardReplica : public ProtocolClient {
// **executed** records, which might be received interleaved when commands
// run out-of-order on the master instance.
std::atomic_uint64_t journal_rec_executed_ = 0;
std::atomic_uint64_t lsn_ = 0;
Collaborator

why is it atomic?

Contributor Author

It shouldn't be, just like journal_rec_executed and some other atomics. I will patch this, give me a second

Collaborator

please change only lsn_ though.

void JournalStreamer::PeriodicPing::MaybePing(bool allow_await) {
const auto now = std::chrono::system_clock::now();
const auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(now - start_time_);
if (elapsed > kLimit) {
Collaborator

where is kLimit defined?

Contributor Author

It's a static member of JournalStreamer::PeriodicPing.

return AwaitIfWritten();
}
++total_records_;
LOG(INFO) << "TOTAL RECORDS " << total_records_;
Collaborator

should we remove this?

void MaybePing(bool allow_await);
void Start();

static constexpr std::chrono::seconds kLimit{2};
Collaborator

probably better to move it to the .cc file, and give it a better name

void Start();

static constexpr std::chrono::seconds kLimit{2};
friend JournalStreamer;
Collaborator

why do we need a friend here?

Contributor Author

we don't, it was a leftover

if (tx_data->lsn != 0) {
const uint64_t expect = lsn_.load();
const bool is_expected = tx_data->lsn == expect;
LOG(INFO) << "tx_data->lsn=" << tx_data->lsn << " lsn_=" << expect;
Collaborator

@romange romange Mar 28, 2024

If we have this situation, the logs will continuously output this and will fill the volume.
Please use LOG_FIRST_N(..., 1000) for that.

src/server/replica.cc (outdated, resolved)
@romange
Collaborator

romange commented Mar 28, 2024

@kostasrim should I review?

@kostasrim
Contributor Author

@romange yes I think your comments are addressed. Let me know :)

romange previously approved these changes Mar 28, 2024
@@ -65,6 +66,7 @@ struct TransactionReader {
// Stores ongoing multi transaction data.
absl::flat_hash_map<TxId, TransactionData> current_;
bool accumulate_multi_ = false;
int64_t total_ = 0;
Collaborator

is this used?

LOG_FIRST_N(INFO, 10) << "tx_data->lsn=" << tx_data->lsn << " lsn_=" << expect;
DCHECK(is_expected) << "tx_data->lsn=" << tx_data->lsn << "lsn=" << expect;
} else {
journal_rec_executed_.fetch_add(1);
Collaborator

This flow is so confusing. In replica takeover we send Op::PING, but you don't fill entry.lsn, so it is 0 and therefore you end up increasing journal_rec_executed_.

Contributor Author

Yes, it's not ideal. If it's 0, it means the PING came from REPLTAKEOVER. LSNs start at 1, so 0 is used here to denote that the PING is used without the extension; that is, PING 0 == old PING and PING >= 1 == PING LSN.

I also treat PING LSN as separate, which is why these entries don't participate in incrementing journal_rec_executed.
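The sentinel convention described here can be captured in a tiny sketch (illustrative names; per the comment above, LSNs start at 1, so 0 marks a legacy PING):

```cpp
#include <cassert>
#include <cstdint>

enum class PingKind { Legacy, WithLsn };

// lsn == 0 => old-style PING (e.g. from REPLTAKEOVER); nonzero => the
// "PING LSN" extension, which is checked against the replica counter
// rather than counted as an executed journal record.
inline PingKind ClassifyPing(uint64_t lsn) {
  return lsn == 0 ? PingKind::Legacy : PingKind::WithLsn;
}
```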

const uint64_t expect = lsn_;
const bool is_expected = tx_data->lsn == expect;
LOG_FIRST_N(INFO, 10) << "tx_data->lsn=" << tx_data->lsn << " lsn_=" << expect;
DCHECK(is_expected) << "tx_data->lsn=" << tx_data->lsn << "lsn=" << expect;
Collaborator

DCHECK_EQ(tx_data->lsn, lsn_)

if (tx_data->lsn != 0) {
const uint64_t expect = lsn_;
const bool is_expected = tx_data->lsn == expect;
LOG_FIRST_N(INFO, 10) << "tx_data->lsn=" << tx_data->lsn << " lsn_=" << expect;
Collaborator

LOG_IF_EVERY_N(WARNING, tx_data->lsn != lsn_, 1000)

src/server/replica.cc (outdated, resolved)
@@ -54,8 +54,10 @@ void TransactionData::AddEntry(journal::ParsedEntry&& entry) {
opcode = entry.opcode;

switch (entry.opcode) {
case journal::Op::PING:
case journal::Op::LSN:
Contributor Author

But we said we would extend PING. I would have done the same :/

Collaborator

I think that extending PING instead of adding another opcode would be the right way. BUT, going into the code, I understand that the serializer cannot decide whether to add the LSN data to the ping, because the result is used for all registered callbacks, and some of them may support the new format while others may not. E.g. we can have 2 replicas, one with VER4 and the other with VER3. So the serializer used in the journal slice, which first serializes the entry and then calls all the registered callbacks, cannot decide whether to add the LSN data. Therefore introducing a new opcode is the best solution.
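The fan-out constraint behind this decision can be sketched as follows (hypothetical types, not the real JournalSlice): the slice serializes an entry once and hands the same bytes to every registered callback, so it cannot tailor a PING payload per peer version, and a dedicated opcode avoids any per-peer branch:

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy fan-out: one serialization, N callbacks, identical bytes for all.
struct SliceSketch {
  std::vector<std::function<void(const std::string&)>> callbacks;

  void Emit(const std::string& serialized_once) {
    for (auto& cb : callbacks) cb(serialized_once);  // same bytes for all
  }
};
```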

Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
romange previously approved these changes Apr 1, 2024
Signed-off-by: adi_holden <adi@dragonflydb.io>
romange previously approved these changes Apr 1, 2024
Signed-off-by: adi_holden <adi@dragonflydb.io>
@adiholden adiholden merged commit b2e2ad6 into main Apr 1, 2024
10 checks passed
@adiholden adiholden deleted the master_lag_check branch April 1, 2024 14:51
szinn pushed a commit to szinn/k8s-homelab that referenced this pull request Apr 3, 2024
…nfly ( v1.15.1 → v1.16.0 ) (#3354)

[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2807
- DFLYMIGRATE ACK refactoring by
[@&#8203;BorysTheDev](https://togithub.com/BorysTheDev) in
[dragonflydb/dragonfly#2790
- feat: add master lsn and journal_executed dcheck in replica via ping
by [@&#8203;kostasrim](https://togithub.com/kostasrim) in
[dragonflydb/dragonfly#2778
- fix: correct json response for errors by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2813
- chore: bloom test - cover corner cases by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2806
- bug(server): do not write lsn opcode to journal by
[@&#8203;adiholden](https://togithub.com/adiholden) in
[dragonflydb/dragonfly#2814
- chore: Fix build by disabling the tests. by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2821
- fix(replication): replication with multi shard sync enabled lagging by
[@&#8203;adiholden](https://togithub.com/adiholden) in
[dragonflydb/dragonfly#2823
- fix: io_uring/fibers bug in DnsResolve by
[@&#8203;romange](https://togithub.com/romange) in
[dragonflydb/dragonfly#2825

##### New Contributors

- [@&#8203;manojks1999](https://togithub.com/manojks1999) made their
first contribution in
[dragonflydb/dragonfly#2659
- [@&#8203;fafg](https://togithub.com/fafg) made their first
contribution in
[dragonflydb/dragonfly#2706
- [@&#8203;enjoy-binbin](https://togithub.com/enjoy-binbin) made their
first contribution in
[dragonflydb/dragonfly#2779

##### Huge thanks to all the contributors! ❤️

🇮🇱  🇺🇦

**Full Changelog**:
dragonflydb/dragonfly@v1.15.0...v1.16.0

</details>


Co-authored-by: repo-jeeves[bot] <106431701+repo-jeeves[bot]@users.noreply.github.com>
Closes: Add master Lag check in replica side to check if replica is out of sync (#2773)