
[shell] volume.fix.replication skips fixing a volume in one DC in 100 mode #2416

Closed
kmlebedev opened this issue Nov 2, 2021 · 9 comments

@kmlebedev

version

2.75

volume.list shows two replicas of the volume in one DC:

      DataNode fast-volume-3.s3-fast-volume.service.dc1.consul:8080 hdd(volume:26/1000 active:26 free:974 remote:0)
          volume id:97 size:317296 collection:"duck-main" file_count:1 replica_placement:100 version:3 compact_revision:1 modified_at_second:1632338775
      DataNode fast-volume-2.s3-fast-volume.service.dc1.consul:8080 hdd(volume:10/1000 active:10 free:990 remote:0)
          volume id:97 size:317296 collection:"duck-main" file_count:1 replica_placement:100 version:3 compact_revision:1 modified_at_second:1630486821

volume.fix.replication produces no output, i.e. it finds no replications to fix:

> lock
> volume.fix.replication -n -collectionPattern duck-main
>
@chrislusf

The current code does not handle this case, where the volume already has the correct number of replicas.

This case should not happen during normal operations. How did it happen?
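
To make the skip concrete, here is a minimal standalone Go sketch (hypothetical types and names, not the actual code in command_volume_fix_replication.go): a check that only compares the replica count against the 2 copies expected for placement 100 reports nothing to fix for volume 97, while a placement-aware check that also requires the copies to sit in different data centers flags it.

// Hypothetical standalone sketch, not the actual SeaweedFS code.
package main

import "fmt"

type replica struct {
    volumeID   int
    dataCenter string
}

// Placement 100 means 1 extra copy on a different data center: 2 copies total.
const expectedCopies = 2

// needsFix mimics a count-only check: once the volume "already has the
// correct number of replicas", it is skipped, even if both copies are in dc1.
func needsFix(replicas []replica) bool {
    return len(replicas) < expectedCopies
}

// needsFixPlacementAware also requires the copies to sit in different DCs.
func needsFixPlacementAware(replicas []replica) bool {
    if len(replicas) < expectedCopies {
        return true
    }
    dcs := map[string]bool{}
    for _, r := range replicas {
        dcs[r.dataCenter] = true
    }
    return len(dcs) < expectedCopies
}

func main() {
    // the volume 97 situation from the report: both copies in dc1
    vol97 := []replica{{97, "dc1"}, {97, "dc1"}}
    fmt.Println(needsFix(vol97))               // false: skipped, nothing to fix
    fmt.Println(needsFixPlacementAware(vol97)) // true: misplaced replicas detected
}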

@kmlebedev commented Nov 3, 2021

This case should not happen during normal operations. How did it happen?

I think it happens when the third copy is deleted here:
https://github.com/chrislusf/seaweedfs/blob/5435027ff0906097c312bb1dde126ad03a05fc18/weed/shell/command_volume_fix_replication.go#L179

@chrislusf

Would like to see some master logs about this volume id 97.

@kmlebedev

Would like to see some master logs about this volume id 97.

It will take a while to find the most recent such case, since in the normal state the volumes do not move and there are no longer any logs for the old volumes. Nevertheless, the problem is not an isolated one.

vlist="/tmp/s3_pre_volume.list";for id in $(cat $vlist | ggrep -oP 'id:\d+'| sort|uniq);do dc_count=$(cat $vlist| ggrep -e "$id " -e DataCenter| ggrep -B 1 -e "volume id"|ggrep -oP "DataCenter [\w-]+"|uniq -c| wc -l| ggrep -oP "\d+"); if [ "$dc_count" != "2" ];then echo "$id $dc_count";fi; done
id:106 1
id:62 1
id:91 1
id:97 1

@chrislusf

maybe explain the script a bit?

For now, you can remove one of the two volumes in the same DC and the volume.fix.replication should kick in.

@kmlebedev commented Nov 3, 2021

maybe explain the script a bit?

It identifies such cases: the loop runs over all volume IDs and counts the number of unique DCs holding each volume; if the count is not equal to 2, the ID is printed.

cat $vlist| ggrep -e "id:97 " -e DataCenter | ggrep -B 1 -e "volume id"
  DataCenter predc hdd(volume:88/4000 active:88 free:3912 remote:0)
          volume id:97 size:317296 collection:"duck-main" file_count:1 replica_placement:100 version:3 compact_revision:1 modified_at_second:1632338775
          volume id:97 size:317296 collection:"duck-main" file_count:1 replica_placement:100 version:3 compact_revision:1 modified_at_second:1630486821

@kmlebedev

@chrislusf

Logs after a node went down and came back up, and volume.fix.replication was run:

Nov 30, 2021 @ 15:38:02.919  volume 423 replication 100, but over replicated +3
Nov 30, 2021 @ 15:38:03.394 volume_grpc_client_to_master.go:181] volume server fast-volume-4.dc1:8080 deletes volume 423
Nov 30, 2021 @ 15:38:03.394 store.go:459] DeleteVolume 423
Nov 30, 2021 @ 15:38:03.402 masterclient.go:142] master: fast-volume-4.dc1:8080 masterClient removes volume 423

weed shell (volume.list) afterwards:

      DataNode fast-volume-2.dc2:8080 hdd(volume:90/900 active:90 free:810 remote:0)
          volume id:423 size:12388761136 collection:"sstore" file_count:15787 delete_count:17 deleted_byte_count:494928229 replica_placement:100 version:3 modified_at_second:1638278005
--
      DataNode fast-volume-14.dc2:8080 hdd(volume:76/900 active:53 free:824 remote:0)
          volume id:423 size:12279708968 collection:"sstore" file_count:15784 delete_count:14 deleted_byte_count:385876262 replica_placement:100 version:3 modified_at_second:1638278005

@chrislusf

So when deleting over-replicated volumes, the data center was not considered. This is the problem, right?

@kmlebedev

So when deleting over-replicated volumes, the data center was not considered. This is the problem, right?

Yes, that's right.
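
As a rough illustration of the idea (a hedged sketch only, not the actual SeaweedFS fix): when trimming an over-replicated volume with placement 100, the copy to delete could be chosen from a data center that holds more than one copy, so the surviving copies stay spread across DCs. Applied to the volume 423 logs above, this would delete one of the two dc2 copies instead of the dc1 copy.

// Hypothetical sketch of the idea, not the actual SeaweedFS fix.
package main

import "fmt"

type replica struct {
    node       string
    dataCenter string
}

// pickReplicaToDelete prefers a copy from a data center that already holds
// another copy of the same volume, so every DC keeps at least one copy.
func pickReplicaToDelete(replicas []replica) int {
    perDC := map[string]int{}
    for _, r := range replicas {
        perDC[r.dataCenter]++
    }
    for i, r := range replicas {
        if perDC[r.dataCenter] > 1 {
            return i // safe: its DC still has another copy after deletion
        }
    }
    // every DC holds exactly one copy; fall back to the last replica
    return len(replicas) - 1
}

func main() {
    // the volume 423 situation from the logs: 3 copies after the node came back
    vol423 := []replica{
        {"fast-volume-4.dc1:8080", "dc1"},
        {"fast-volume-2.dc2:8080", "dc2"},
        {"fast-volume-14.dc2:8080", "dc2"},
    }
    i := pickReplicaToDelete(vol423)
    // prints a dc2 node, not fast-volume-4.dc1:8080
    fmt.Printf("delete the copy on %s\n", vol423[i].node)
}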
