
[shell] volume.fix.replication skips fixing a volume in one DC in 100 mode #2416

Closed
kmlebedev opened this issue Nov 2, 2021 · 9 comments

@kmlebedev

version

2.75

volume.list shows two replicas of the volume in one DC:

      DataNode fast-volume-3.s3-fast-volume.service.dc1.consul:8080 hdd(volume:26/1000 active:26 free:974 remote:0)
          volume id:97 size:317296 collection:"duck-main" file_count:1 replica_placement:100 version:3 compact_revision:1 modified_at_second:1632338775
      DataNode fast-volume-2.s3-fast-volume.service.dc1.consul:8080 hdd(volume:10/1000 active:10 free:990 remote:0)
          volume id:97 size:317296 collection:"duck-main" file_count:1 replica_placement:100 version:3 compact_revision:1 modified_at_second:1630486821

volume.fix.replication produces no output, i.e. it finds no replications to fix:

> lock
> volume.fix.replication -n -collectionPattern duck-main
>
@chrislusf

The current code does not handle this case, where the volume already has the correct number of replicas.

This case should not happen during normal operations. How did it happen?
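
To make the skip concrete, here is a minimal standalone Go sketch (hypothetical types and names, not the actual code in command_volume_fix_replication.go): a check that only compares the replica count against the 2 copies expected for placement 100 reports nothing to fix for volume 97, while a placement-aware check that also requires the copies to sit in different data centers flags it.

// Hypothetical standalone sketch, not the actual SeaweedFS code.
package main

import "fmt"

type replica struct {
    volumeID   int
    dataCenter string
}

// Placement 100 means 1 extra copy on a different data center: 2 copies total.
const expectedCopies = 2

// needsFix mimics a count-only check: once the volume "already has the
// correct number of replicas", it is skipped, even if both copies are in dc1.
func needsFix(replicas []replica) bool {
    return len(replicas) < expectedCopies
}

// needsFixPlacementAware also requires the copies to sit in different DCs.
func needsFixPlacementAware(replicas []replica) bool {
    if len(replicas) < expectedCopies {
        return true
    }
    dcs := map[string]bool{}
    for _, r := range replicas {
        dcs[r.dataCenter] = true
    }
    return len(dcs) < expectedCopies
}

func main() {
    // the volume 97 situation from the report: both copies in dc1
    vol97 := []replica{{97, "dc1"}, {97, "dc1"}}
    fmt.Println(needsFix(vol97))               // false: skipped, nothing to fix
    fmt.Println(needsFixPlacementAware(vol97)) // true: misplaced replicas detected
}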

@kmlebedev commented Nov 3, 2021

This case should not happen during normal operations. How did it happen?

I think it happens when the third copy is deleted here:
https://github.com/chrislusf/seaweedfs/blob/5435027ff0906097c312bb1dde126ad03a05fc18/weed/shell/command_volume_fix_replication.go#L179

@chrislusf

Would like to see some master logs about this volume id 97.

@kmlebedev

Would like to see some master logs about this volume id 97.

It will take a while to find the most recent such case, since in the normal state the volumes do not move and there are no longer any logs for the old volumes. Nevertheless, the problem is not an isolated one.

vlist="/tmp/s3_pre_volume.list";for id in $(cat $vlist | ggrep -oP 'id:\d+'| sort|uniq);do dc_count=$(cat $vlist| ggrep -e "$id " -e DataCenter| ggrep -B 1 -e "volume id"|ggrep -oP "DataCenter [\w-]+"|uniq -c| wc -l| ggrep -oP "\d+"); if [ "$dc_count" != "2" ];then echo "$id $dc_count";fi; done
id:106 1
id:62 1
id:91 1
id:97 1

@chrislusf

maybe explain the script a bit?

For now, you can remove one of the two volumes in the same DC and the volume.fix.replication should kick in.

@kmlebedev commented Nov 3, 2021

maybe explain the script a bit?

It identifies such cases: the loop runs over all volume IDs and counts the number of unique DCs holding each volume; if the count is not equal to 2, the ID is printed.

cat $vlist| ggrep -e "id:97 " -e DataCenter | ggrep -B 1 -e "volume id"
  DataCenter predc hdd(volume:88/4000 active:88 free:3912 remote:0)
          volume id:97 size:317296 collection:"duck-main" file_count:1 replica_placement:100 version:3 compact_revision:1 modified_at_second:1632338775
          volume id:97 size:317296 collection:"duck-main" file_count:1 replica_placement:100 version:3 compact_revision:1 modified_at_second:1630486821

@kmlebedev

@chrislusf

Logs after a node went down and came back up, and volume.fix.replication was run:

Nov 30, 2021 @ 15:38:02.919  volume 423 replication 100, but over replicated +3
Nov 30, 2021 @ 15:38:03.394 volume_grpc_client_to_master.go:181] volume server fast-volume-4.dc1:8080 deletes volume 423
Nov 30, 2021 @ 15:38:03.394 store.go:459] DeleteVolume 423
Nov 30, 2021 @ 15:38:03.402 masterclient.go:142] master: fast-volume-4.dc1:8080 masterClient removes volume 423

weed shell (volume.list) afterwards:

      DataNode fast-volume-2.dc2:8080 hdd(volume:90/900 active:90 free:810 remote:0)
          volume id:423 size:12388761136 collection:"sstore" file_count:15787 delete_count:17 deleted_byte_count:494928229 replica_placement:100 version:3 modified_at_second:1638278005
--
      DataNode fast-volume-14.dc2:8080 hdd(volume:76/900 active:53 free:824 remote:0)
          volume id:423 size:12279708968 collection:"sstore" file_count:15784 delete_count:14 deleted_byte_count:385876262 replica_placement:100 version:3 modified_at_second:1638278005

@chrislusf

So when deleting over-replicated volumes, the data center was not considered. This is the problem, right?

@kmlebedev

So when deleting over-replicated volumes, the data center was not considered. This is the problem, right?

Yes, that's right.
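
As a rough illustration of the idea (a hedged sketch only, not the actual SeaweedFS fix): when trimming an over-replicated volume with placement 100, the copy to delete could be chosen from a data center that holds more than one copy, so the surviving copies stay spread across DCs. Applied to the volume 423 logs above, this would delete one of the two dc2 copies instead of the dc1 copy.

// Hypothetical sketch of the idea, not the actual SeaweedFS fix.
package main

import "fmt"

type replica struct {
    node       string
    dataCenter string
}

// pickReplicaToDelete prefers a copy from a data center that already holds
// another copy of the same volume, so every DC keeps at least one copy.
func pickReplicaToDelete(replicas []replica) int {
    perDC := map[string]int{}
    for _, r := range replicas {
        perDC[r.dataCenter]++
    }
    for i, r := range replicas {
        if perDC[r.dataCenter] > 1 {
            return i // safe: its DC still has another copy after deletion
        }
    }
    // every DC holds exactly one copy; fall back to the last replica
    return len(replicas) - 1
}

func main() {
    // the volume 423 situation from the logs: 3 copies after the node came back
    vol423 := []replica{
        {"fast-volume-4.dc1:8080", "dc1"},
        {"fast-volume-2.dc2:8080", "dc2"},
        {"fast-volume-14.dc2:8080", "dc2"},
    }
    i := pickReplicaToDelete(vol423)
    // prints a dc2 node, not fast-volume-4.dc1:8080
    fmt.Printf("delete the copy on %s\n", vol423[i].node)
}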
