After interrupting replication, we get a broken volume ID #1567

kmlebedev · 2020-10-27T20:33:17Z

Describe the bug
If replication is interrupted, a broken (incomplete) copy of the volume is left behind on the destination node.

System Setup

SeaweedFS 2.06

Expected behavior
After replication, the copied volume should be verified against the source.

Additional context

Upgrade the SeaweedFS service in k8s, restarting the pods sequentially.
A cronjob then runs volume.fix.replication.
Because the pods are restarting, it starts fixing whichever volume ID is under-replicated at that moment, and the copy is interrupted:

replicating volume 108 010 from 10.1.1.2:8080 to dataNode 10.1.1.5:8080 ...
error: copying from 10.1.1.2:8080 => 10.1.1.5:8080 : rpc error: code = Unavailable desc = transport is closing

The restarts continue and reach DataNode 10.1.1.5:8080, so the broken copy of volume 108 stays on that data node.
The cronjob then runs volume.fix.replication again:

volume 108 replication 010, but over replicated +3
deleting volume 108 from 10.1.1.3:8080 ...

In the end, there are two copies of volume 108 with different sizes:

      DataNode 10.1.1.5:8080 volume:19/22 active:18 free:3 remote:0
        volume id:108 size:20591935488 collection:"logs" file_count:43959 delete_count:11191 deleted_byte_count:23748792802 read_only:true replica_placement:10 version:3 compact_revision:1 modified_at_second:1603826757
      DataNode 10.1.1.2:8080 volume:19/22 active:19 free:3 remote:0
        volume id:108 size:89633122384 collection:"logs" file_count:43961 delete_count:11189 deleted_byte_count:23748752094 replica_placement:10 version:3 compact_revision:1 modified_at_second:1603762458
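
The two listings above disagree on both size and file_count for volume 108. A minimal sketch of how such divergence could be flagged after replication is shown below. The `Replica` type and `find_divergent_volumes` function are hypothetical names used for illustration only and are not part of SeaweedFS; the numbers are copied from the listings above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """One copy of a volume as reported by a data node (hypothetical type)."""
    node: str
    volume_id: int
    size: int
    file_count: int


def find_divergent_volumes(replicas, size_tolerance=0):
    """Return {volume_id: [replicas]} for volumes whose copies disagree.

    Copies disagree when their sizes differ by more than size_tolerance
    bytes or when their file counts are not all equal.
    """
    by_volume = {}
    for replica in replicas:
        by_volume.setdefault(replica.volume_id, []).append(replica)

    divergent = {}
    for volume_id, copies in by_volume.items():
        sizes = [c.size for c in copies]
        file_counts = {c.file_count for c in copies}
        if max(sizes) - min(sizes) > size_tolerance or len(file_counts) > 1:
            divergent[volume_id] = copies
    return divergent


# Values taken from the two DataNode listings in this report.
replicas = [
    Replica("10.1.1.5:8080", 108, 20591935488, 43959),
    Replica("10.1.1.2:8080", 108, 89633122384, 43961),
]
divergent = find_divergent_volumes(replicas)
```

Here `divergent` contains volume 108 with both copies. A real check would probably need a non-zero tolerance, since replicas that are still receiving writes can legitimately differ for a short time.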


chrislusf · 2020-10-27T22:57:44Z

Added a mechanism to avoid incomplete volume files if restarted in the middle.
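
For readers unfamiliar with this class of fix, one common way to keep a partially copied file from being mistaken for a complete one is to write to a temporary name and rename it only after the copy has finished. The sketch below illustrates that general technique only; it is an assumption for illustration and not necessarily what commit 53c3aad does. The function name `copy_file_atomically` is hypothetical.

```python
import os
import shutil
import tempfile


def copy_file_atomically(src_path, dst_path, chunk_size=1024 * 1024):
    """Copy src_path to dst_path so dst_path never holds a partial copy.

    Data is written to a temporary file in the destination directory and
    moved into place with os.replace only after the copy has completed.
    """
    dst_dir = os.path.dirname(os.path.abspath(dst_path))
    fd, tmp_path = tempfile.mkstemp(dir=dst_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp, open(src_path, "rb") as src:
            shutil.copyfileobj(src, tmp, chunk_size)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, dst_path)  # atomic within one filesystem
    except BaseException:
        # On a handled failure, remove the partial temporary file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process is killed mid-copy, the cleanup code does not run, but the leftover file still carries the `.tmp` suffix rather than the final name, so a startup scan can recognize and delete it.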

chrislusf closed this as completed in 53c3aad Oct 27, 2020
