DR constantly growing disk space #11269

Closed
doublex opened this issue Mar 27, 2024 · 5 comments

doublex commented Mar 27, 2024

FDB 7.1 server with disaster recovery:
Used disk space is constantly growing.

Statistics on the DR source cluster:

Sum of key-value sizes - 339.880 GB
Disk space used        - 441.932 GB

DR target (the numbers are the same when restoring from a backup):

Sum of key-value sizes - 127.438 GB
Disk space used        - 181.399 GB

A similar problem has been reported:
https://forums.foundationdb.org/t/key-value-sizes-at-dr-source-and-destination-have-a-big-difference/3351

fdbcli --exec status (truncated):

{
  "cluster" : {
    [...]
    "layers" : {
      "_valid" : true,
      "backup" : {
        "blob_recent_io" : {
          "bytes_per_second" : 0,
          "bytes_sent" : 0,
          "requests_failed" : 0,
          "requests_successful" : 0
        },
        "instances" : {
          "f9f2d06cd5ded70cc0d60baf4e1ea6d8" : {
            "blob_stats" : {
              "recent" : {
                "bytes_per_second" : 0,
                "bytes_sent" : 0,
                "requests_failed" : 0,
                "requests_successful" : 0
              },
              "total" : {
                "bytes_sent" : 0,
                "requests_failed" : 0,
                "requests_successful" : 0
              }
            },
            "configured_workers" : 10,
            "id" : "f9f2d06cd5ded70cc0d60baf4e1ea6d8",
            "last_updated" : 1711573622.9675598,
            "main_thread_cpu_seconds" : 616332.49406500009,
            "memory_usage" : 141631488,
            "process_cpu_seconds" : 623611.24382099998,
            "resident_size" : 25047040,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573622.9675598,
        "paused" : false,
        "tags" : {
          "default" : {
            "current_container" : "file:///media/hdd4000/database/backup-2024-02-03-05-00-01.415519",
            "current_status" : "has been started",
            "mutation_log_bytes_written" : 0,
            "mutation_stream_id" : "8a41d0171e2fd8060cc8b682788c23a0",
            "range_bytes_written" : 0,
            "running_backup" : true,
            "running_backup_is_restorable" : false
          }
        },
        "total_workers" : 10
      },
      "dr_backup" : {
        "instances" : {
          "09f8ae181b62a48f843fd5be73881577" : {
            "configured_workers" : 10,
            "id" : "09f8ae181b62a48f843fd5be73881577",
            "last_updated" : 1711573633.2240255,
            "main_thread_cpu_seconds" : 332468.32462600002,
            "memory_usage" : 841007104,
            "process_cpu_seconds" : 336727.93240300001,
            "resident_size" : 724078592,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573633.2240255,
        "paused" : false,
        "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 218154876973,
            "mutation_stream_id" : "d04069c450c9ebea7158b3582ffc0be2",
            "range_bytes_written" : 115953778604,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : 0.61766700000000008
          }
        },
        "total_workers" : 10
      },
      "dr_backup_dest" : {
        "instances" : {
          "ba2549dfed10b9e11a2d8f6ee32be230" : {
            "configured_workers" : 10,
            "id" : "ba2549dfed10b9e11a2d8f6ee32be230",
            "last_updated" : 1711573727.3237493,
            "main_thread_cpu_seconds" : 8302.7225830000007,
            "memory_usage" : 198774784,
            "process_cpu_seconds" : 8827.1877419999983,
            "resident_size" : 23576576,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573727.3237493,
        "paused" : false,
        "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 220237696485,
            "mutation_stream_id" : "7df4fc98573a8b4ab8a019bd4e92f7b6",
            "range_bytes_written" : 123375102718,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : -1747510.757253
          }
        },
        "total_workers" : 10
      }
    },
    [...]
  }
}

jzhou77 commented Mar 28, 2024

This question would be better raised on https://forums.foundationdb.org/; GitHub issues are for tracking specific bugs or problems.

How much lag does the DR report when you run fdbdr status? Once the destination cluster has caught up (i.e., the lag is only a few seconds), the data sizes should be about the same. If the lag is large, e.g., the destination cluster still has a lot of data to copy, the big difference is expected.
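
For reference, a minimal sketch of checking that lag (cluster-file paths are placeholders; verify the exact flags with fdbdr --help):

  # run from a host that can reach both clusters
  fdbdr status -s /path/to/source.cluster -d /path/to/destination.cluster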

The other possibility is mutation logs buffered on the source cluster, which can be estimated from the size of the \xff\x02 keyspace.
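
A rough sketch of estimating that backlog, assuming the FoundationDB Python bindings are installed and the get_estimated_range_size_bytes API (API version 630+) is available; run it against the source cluster:

  # estimate the size of the DR mutation-log keyspace (\xff\x02 .. \xff\x03)
  # assumes: pip install foundationdb, and FDB_CLUSTER_FILE pointing at the source cluster
  python3 - <<'EOF'
  import fdb
  fdb.api_version(710)
  db = fdb.open()  # uses the default cluster file

  @fdb.transactional
  def backlog_bytes(tr):
      # reading the \xff\x02 system keyspace requires the system-keys option
      tr.options.set_read_system_keys()
      return tr.get_estimated_range_size_bytes(b'\xff\x02', b'\xff\x03').wait()

  print("estimated \\xff\\x02 size:", backlog_bytes(db), "bytes")
  EOF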

jzhou77 closed this as completed Mar 28, 2024

jzhou77 commented Mar 28, 2024

Oh, the status reports "backup_state" : "is differential", so it might be that the \xff\x02 keyspace is large, i.e., lagging a lot.

    "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 220237696485,
            "mutation_stream_id" : "7df4fc98573a8b4ab8a019bd4e92f7b6",
            "range_bytes_written" : 123375102718,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : -1747510.757253
          }

Actually, the status says the lag is large: 1747510.757253 seconds behind, about 20 days. Do you have DR agents running on the destination cluster?


doublex commented Mar 28, 2024

Sorry for the inconvenience.
Yes, there are DR agents running on the destination cluster (which again runs a DR agent).
Is this an invalid deployment?


doublex commented Mar 28, 2024

You are right. Totally my fault.
Thank you so much for your answer - and sorry for the inconvenience.


jzhou77 commented Mar 28, 2024

Yes, there are DR agents running on the destination cluster (which again runs a DR agent).

DR agents are needed on the destination cluster. So maybe you didn't have enough agents, and that caused the DR lag.
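
For completeness, a minimal sketch of running additional agents on destination-side hosts (paths are placeholders; check dr_agent --help for the exact flags):

  # start one or more DR agents on hosts near the destination cluster;
  # each agent needs both cluster files, and more agents generally mean more copy throughput
  dr_agent -s /path/to/source.cluster -d /path/to/destination.cluster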
