Inconsistent logical volume files #1607

howardlau1999 · 2020-11-10T11:01:08Z

Describe the bug
Version: 2.08
Deployed 1 master, 3 volume servers, and 1 filer with defaultReplication=001, plus a scheduled script that fixes replication every 20 minutes. One day a volume server went down accidentally for more than 20 minutes, and some files were uploaded via the filer during the downtime. After the volume server came back, downloads of those files through the filer intermittently fail with a 404 error.

All filer operations are using HTTP API.

System Setup
A minimal reproduction can be done with the following steps:

Start one master server with weed master -defaultReplication=001.
Start 3 volume servers with weed volume -max=1 -dir=/tmp/weed{0,1,2} -port={8080,8081,8082}.
Start 1 filer with weed filer.
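
The brace notation in the volume server step is shorthand for three separate commands, one per server; typed literally in bash it would expand into a single command carrying all six flags. As a minimal sketch, the helper below only builds the command lines and starts nothing. The function names are hypothetical, and the flags are copied from the steps above.

```python
# Build the command lines for the reproduction cluster described above.
# Flags are copied from the issue text; nothing here starts a process.

def master_command():
    return ["weed", "master", "-defaultReplication=001"]


def volume_commands(ports=(8080, 8081, 8082)):
    # One command per volume server: /tmp/weed0 pairs with 8080, and so on.
    return [
        ["weed", "volume", "-max=1", f"-dir=/tmp/weed{i}", f"-port={port}"]
        for i, port in enumerate(ports)
    ]


def filer_command():
    return ["weed", "filer"]


if __name__ == "__main__":
    for cmd in [master_command(), *volume_commands(), filer_command()]:
        print(" ".join(cmd))
```

Each printed line can be run in its own terminal, in the order shown.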

Wait several minutes for the cluster to become steady, then upload a file. Running volume.list in weed shell should show 3 volume servers, 2 of which each hold a replica of the logical volume. The file count should stay unchanged for a while.

Then kill one of those 2 volume servers with kill -9 and run volume.fix.replication in weed shell. The shell should now show 2 running volume servers holding the logical volume from before. Upload another file with a different file name.

Restart the killed volume server. The cluster is now in an inconsistent state: the replicas of the logical volume have different file counts. Restart the other two volume servers and try to get the second uploaded file through the filer. You should now see a 404 Not Found error.

Possible file distribution evolution:

First upload

Volume 1 on Server 8080: file1.txt
Volume 1 on Server 8081: file1.txt
No volume on Server 8082.

One volume server down and replication is fixed.

Volume 1 on Server 8080: file1.txt
Server 8081 down
Volume 1 on Server 8082: file1.txt

New file uploaded.

Volume 1 on Server 8080: file1.txt file2.txt
Server 8081 down
Volume 1 on Server 8082: file1.txt file2.txt

Volume server came back

Volume 1 on Server 8080: file1.txt file2.txt
Volume 1 on Server 8081: file1.txt
Volume 1 on Server 8082: file1.txt file2.txt

If the filer request for `file2.txt` goes to server 8081, the file is reported not found although there are two copies of it.
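
The evolution above can be replayed with a toy model. This is only a sketch of the replica states, not SeaweedFS code: the `simulate` function, the dictionary layout, and the status codes it returns are all assumptions made for illustration.

```python
# Toy replay of the file distribution listed above; not SeaweedFS code.

def simulate():
    # port -> set of file names held in that server's copy of Volume 1
    replicas = {8080: set(), 8081: set()}
    down = set()

    def write(name):
        # A write reaches every replica whose server is currently up.
        for port, files in replicas.items():
            if port not in down:
                files.add(name)

    write("file1.txt")                      # first upload

    down.add(8081)                          # one volume server goes down
    replicas[8082] = set(replicas[8080])    # replication fixed: copy to 8082

    write("file2.txt")                      # new file uploaded

    down.discard(8081)                      # volume server comes back, stale

    # Status a read of file2.txt would get from each server individually.
    return {
        port: 200 if "file2.txt" in files else 404
        for port, files in replicas.items()
    }


if __name__ == "__main__":
    print(simulate())
```

Under these assumptions, only the server that was down answers 404 for `file2.txt`, which matches the intermittent failures described in the report.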

Expected behavior
The uploaded file should be retrieved without errors because the file does exist on the other two volume servers.
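
One way to picture the expected behavior is a read that tries the other replicas before giving up. The sketch below is a hypothetical illustration, not how the filer or volume servers are implemented; `read_with_fallback` and the in-memory `replicas` layout are assumptions.

```python
# Hypothetical read path: return the file if any replica has it,
# and report which replicas were missing it.

def read_with_fallback(replicas, name):
    """replicas maps a server port to a dict of file name -> content."""
    data = None
    missing = []
    for port, files in replicas.items():
        if name in files:
            if data is None:
                data = files[name]
        else:
            missing.append(port)
    if data is None:
        # No replica has the file, so a 404 is the correct answer.
        raise FileNotFoundError(name)
    return data, missing


if __name__ == "__main__":
    state = {
        8080: {"file1.txt": b"one", "file2.txt": b"two"},
        8081: {"file1.txt": b"one"},
        8082: {"file1.txt": b"one", "file2.txt": b"two"},
    }
    print(read_with_fallback(state, "file2.txt"))
```

The list of missing replicas is returned so that a caller could repair the stale copy, but the repair itself is left out of this sketch.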


howardlau1999 · 2020-11-10T12:54:14Z

What's worse, if the other two volume servers are permanently down afterward, the second uploaded file would be LOST.

chrislusf · 2020-11-10T19:21:12Z

there is a plan to fix missing files on read.

kmlebedev · 2020-11-10T19:28:20Z

I think fix.replication will run afterward and delete a random replica of Volume 1 from Servers 8080-8082.
When deleting a volume, could it look at modified_at_second and size?

related to #1607
old is:
* older compaction revision
* older modified time
* smaller volume size

chrislusf · 2020-11-10T20:30:58Z

Added more fix.replication logic: pickOneReplicaToDelete now chooses by compact revision, modified time, and file size.
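
As a rough illustration of that ordering, the sketch below picks the "oldest" replica by comparing the three criteria in the order named. It is not the code from the commit: the `Replica` type, its field names (apart from `modified_at_second`, which is mentioned earlier in this thread), the strict priority between criteria, and the sample values are all assumptions.

```python
# Hypothetical sketch of choosing which surplus replica to delete.
from dataclasses import dataclass


@dataclass
class Replica:
    server: str
    compact_revision: int
    modified_at_second: int
    size: int


def pick_one_replica_to_delete(replicas):
    # Oldest first: lowest compaction revision, then earliest
    # modified time, then smallest size.
    return min(
        replicas,
        key=lambda r: (r.compact_revision, r.modified_at_second, r.size),
    )


if __name__ == "__main__":
    candidates = [
        Replica("8080", compact_revision=0, modified_at_second=2000, size=200),
        Replica("8081", compact_revision=0, modified_at_second=1000, size=100),
        Replica("8082", compact_revision=0, modified_at_second=2000, size=200),
    ]
    print(pick_one_replica_to_delete(candidates).server)
```

With the sample values, the replica on 8081 is selected because it stopped receiving writes while its server was down, so its modified time is the earliest.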

chrislusf added a commit that referenced this issue Nov 10, 2020

delete old volume replica

de3bdd0

