DR constantly growing disk space #11269

Closed
doublex opened this issue Mar 27, 2024 · 5 comments

doublex commented Mar 27, 2024

FDB 7.1 server with disaster recovery:
Used disk space is constantly growing.

Statistics on the DR source cluster:

Sum of key-value sizes - 339.880 GB
Disk space used        - 441.932 GB

DR target (the numbers are the same when restoring from a backup):

Sum of key-value sizes - 127.438 GB
Disk space used        - 181.399 GB

A similar problem has been reported:
https://forums.foundationdb.org/t/key-value-sizes-at-dr-source-and-destination-have-a-big-difference/3351

fdbcli --exec status (truncated):

{
  "cluster" : {
    [...]
    "layers" : {
      "_valid" : true,
      "backup" : {
        "blob_recent_io" : {
          "bytes_per_second" : 0,
          "bytes_sent" : 0,
          "requests_failed" : 0,
          "requests_successful" : 0
        },
        "instances" : {
          "f9f2d06cd5ded70cc0d60baf4e1ea6d8" : {
            "blob_stats" : {
              "recent" : {
                "bytes_per_second" : 0,
                "bytes_sent" : 0,
                "requests_failed" : 0,
                "requests_successful" : 0
              },
              "total" : {
                "bytes_sent" : 0,
                "requests_failed" : 0,
                "requests_successful" : 0
              }
            },
            "configured_workers" : 10,
            "id" : "f9f2d06cd5ded70cc0d60baf4e1ea6d8",
            "last_updated" : 1711573622.9675598,
            "main_thread_cpu_seconds" : 616332.49406500009,
            "memory_usage" : 141631488,
            "process_cpu_seconds" : 623611.24382099998,
            "resident_size" : 25047040,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573622.9675598,
        "paused" : false,
        "tags" : {
          "default" : {
            "current_container" : "file:///media/hdd4000/database/backup-2024-02-03-05-00-01.415519",
            "current_status" : "has been started",
            "mutation_log_bytes_written" : 0,
            "mutation_stream_id" : "8a41d0171e2fd8060cc8b682788c23a0",
            "range_bytes_written" : 0,
            "running_backup" : true,
            "running_backup_is_restorable" : false
          }
        },
        "total_workers" : 10
      },
      "dr_backup" : {
        "instances" : {
          "09f8ae181b62a48f843fd5be73881577" : {
            "configured_workers" : 10,
            "id" : "09f8ae181b62a48f843fd5be73881577",
            "last_updated" : 1711573633.2240255,
            "main_thread_cpu_seconds" : 332468.32462600002,
            "memory_usage" : 841007104,
            "process_cpu_seconds" : 336727.93240300001,
            "resident_size" : 724078592,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573633.2240255,
        "paused" : false,
        "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 218154876973,
            "mutation_stream_id" : "d04069c450c9ebea7158b3582ffc0be2",
            "range_bytes_written" : 115953778604,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : 0.61766700000000008
          }
        },
        "total_workers" : 10
      },
      "dr_backup_dest" : {
        "instances" : {
          "ba2549dfed10b9e11a2d8f6ee32be230" : {
            "configured_workers" : 10,
            "id" : "ba2549dfed10b9e11a2d8f6ee32be230",
            "last_updated" : 1711573727.3237493,
            "main_thread_cpu_seconds" : 8302.7225830000007,
            "memory_usage" : 198774784,
            "process_cpu_seconds" : 8827.1877419999983,
            "resident_size" : 23576576,
            "version" : "7.1.49"
          }
        },
        "instances_running" : 1,
        "last_updated" : 1711573727.3237493,
        "paused" : false,
        "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 220237696485,
            "mutation_stream_id" : "7df4fc98573a8b4ab8a019bd4e92f7b6",
            "range_bytes_written" : 123375102718,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : -1747510.757253
          }
        },
        "total_workers" : 10
      }
    },
    [...]
  }
}

jzhou77 commented Mar 28, 2024

This question would be better raised on https://forums.foundationdb.org/; GitHub issues are for tracking specific bugs or problems.

How much lag does the DR report when you run fdbdr status? Once the destination cluster has caught up (i.e., the lag is only a few seconds), the data sizes should be about the same. If the lag is large, e.g., the destination cluster still has a lot of data to copy, the big difference is expected.
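
For reference, a minimal sketch of checking that lag (cluster-file paths are placeholders; verify the exact flags with fdbdr --help):

  # run from a host that can reach both clusters
  fdbdr status -s /path/to/source.cluster -d /path/to/destination.cluster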

The other possibility is mutation logs buffered on the source cluster, which can be estimated from the size of the \xff\x02 keyspace.
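
A rough sketch of estimating that backlog, assuming the FoundationDB Python bindings are installed and the get_estimated_range_size_bytes API (API version 630+) is available; run it against the source cluster:

  # estimate the size of the DR mutation-log keyspace (\xff\x02 .. \xff\x03)
  # assumes: pip install foundationdb, and FDB_CLUSTER_FILE pointing at the source cluster
  python3 - <<'EOF'
  import fdb
  fdb.api_version(710)
  db = fdb.open()  # uses the default cluster file

  @fdb.transactional
  def backlog_bytes(tr):
      # reading the \xff\x02 system keyspace requires the system-keys option
      tr.options.set_read_system_keys()
      return tr.get_estimated_range_size_bytes(b'\xff\x02', b'\xff\x03').wait()

  print("estimated \\xff\\x02 size:", backlog_bytes(db), "bytes")
  EOF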

jzhou77 closed this as completed Mar 28, 2024

jzhou77 commented Mar 28, 2024

Oh, the status reports "backup_state" : "is differential", so it might be that the \xff\x02 keyspace is large, i.e., lagging a lot.

    "tags" : {
          "default" : {
            "backup_state" : "is differential",
            "mutation_log_bytes_written" : 220237696485,
            "mutation_stream_id" : "7df4fc98573a8b4ab8a019bd4e92f7b6",
            "range_bytes_written" : 123375102718,
            "running_backup" : true,
            "running_backup_is_restorable" : true,
            "seconds_behind" : -1747510.757253
          }

Actually, the status says the lag is large: 1747510.757253 seconds behind, about 20 days. Do you have DR agents running on the destination cluster?


doublex commented Mar 28, 2024

Sorry for the inconvenience.
Yes, there are DR agents running on the destination cluster (which again runs a DR agent).
Is this an invalid deployment?


doublex commented Mar 28, 2024

You are right. Totally my fault.
Thank you so much for your answer - and sorry for the inconvenience.


jzhou77 commented Mar 28, 2024

Yes, there are DR agents running on the destination cluster (which again runs a DR agent).

DR agents are needed on the destination cluster. So maybe you didn't have enough agents, and that caused the DR lag.
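
For completeness, a minimal sketch of running additional agents on destination-side hosts (paths are placeholders; check dr_agent --help for the exact flags):

  # start one or more DR agents on hosts near the destination cluster;
  # each agent needs both cluster files, and more agents generally mean more copy throughput
  dr_agent -s /path/to/source.cluster -d /path/to/destination.cluster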
