Inconsistent logical volume files #1607

Open
howardlau1999 opened this issue Nov 10, 2020 · 4 comments

howardlau1999 commented Nov 10, 2020

Describe the bug
Version: 2.08
I deployed 1 master, 3 volume servers, and 1 filer with defaultReplication=001, plus a scheduled script that fixes replication every 20 minutes. One day a volume server was accidentally down for more than 20 minutes, and some files were uploaded through the filer during the downtime. After the volume server came back, downloads of those files through the filer failed intermittently with a 404 error.
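
For context on the `001` setting: as far as I understand the SeaweedFS replication documentation, the three digits count the extra copies kept in other data centers, on other racks, and on other servers in the same rack. Under that reading, `001` means two copies in total, which is why only 2 of the 3 volume servers hold the volume. The helper name below is hypothetical and not part of SeaweedFS.

```python
def copies_for_replication(code: str) -> int:
    """Total number of copies implied by a three-digit replication code.

    Assumed digit meaning: other data centers, other racks, same rack.
    """
    if len(code) != 3 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected three digits, got {code!r}")
    other_dc, other_rack, same_rack = (int(c) for c in code)
    # One original copy plus the extra copies requested by each digit.
    return 1 + other_dc + other_rack + same_rack
```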

All filer operations are using HTTP API.

System Setup
A minimal reproduction can be done with the following steps:

  1. Start one master server with `weed master -defaultReplication=001`.
  2. Start 3 volume servers with `weed volume -max=1 -dir=/tmp/weed{0,1,2} -port={8080,8081,8082}` (one instance per directory and port).
  3. Start 1 filer with `weed filer`.

Wait several minutes for the cluster to become steady, then upload a file. Running `volume.list` in `weed shell` should show 3 volume servers, 2 of which each hold a replica of the logical volume. The file count should stay unchanged for a while.

Then kill one of the 2 volume servers that hold the volume with `kill -9`, and run `volume.fix.replication` in `weed shell`. The shell should now show 2 running volume servers, each holding a replica of the same logical volume as before. Now upload another file with a different file name.

Restart the killed volume server. The cluster is now in an inconsistent state: the replicas of the logical volume have different file counts. Restart the other two volume servers and try to get the second uploaded file through the filer. You should now see a 404 Not Found error.
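
Because the 404 only appears on some requests, it helps to fetch the file repeatedly and count the misses. Below is a minimal sketch of such a probe. The filer address `http://localhost:8888` and the path `/file2.txt` are assumptions for illustration, and `http_status` and `count_not_found` are hypothetical helper names.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is on the exception.
        return err.code


def count_not_found(fetch: Callable[[str], int], url: str, attempts: int = 20) -> int:
    """Call fetch(url) `attempts` times and count how many calls returned 404."""
    return sum(1 for _ in range(attempts) if fetch(url) == 404)


# Example use against a running cluster (assumed filer address and path):
#   misses = count_not_found(http_status, "http://localhost:8888/file2.txt")
#   print(f"{misses} of 20 requests returned 404")
```

A non-zero count that is smaller than the number of attempts would match the behavior described here, where only reads routed to the stale replica fail.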

Possible file distribution evolution:

  1. First upload
     Volume 1 on Server 8080: file1.txt
     Volume 1 on Server 8081: file1.txt
     No volume on Server 8082.
  2. One volume server down and replication is fixed.
     Volume 1 on Server 8080: file1.txt
     Server 8081 down
     Volume 1 on Server 8082: file1.txt
  3. New file uploaded.
     Volume 1 on Server 8080: file1.txt file2.txt
     Server 8081 down
     Volume 1 on Server 8082: file1.txt file2.txt
  4. Volume server came back
     Volume 1 on Server 8080: file1.txt file2.txt
     Volume 1 on Server 8081: file1.txt
     Volume 1 on Server 8082: file1.txt file2.txt

If the filer's request for `file2.txt` goes to server 8081, the file is reported as not found even though two copies of it exist.
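
The four stages above can be replayed with a toy in-memory model. This is only a sketch of the scenario, not SeaweedFS code: each replica is a set of file names keyed by server port, and all function names are hypothetical.

```python
from typing import Dict, List, Set

Replicas = Dict[int, Set[str]]


def replay_outage() -> Replicas:
    """Replay the four stages and return the final replica contents."""
    # 1. First upload: the volume lives on 8080 and 8081.
    replicas: Replicas = {8080: {"file1.txt"}, 8081: {"file1.txt"}}
    # 2. 8081 goes down; fixing replication copies the volume to 8082.
    stale = replicas.pop(8081)
    replicas[8082] = set(replicas[8080])
    # 3. A new file is written to every live replica.
    for files in replicas.values():
        files.add("file2.txt")
    # 4. 8081 comes back with the contents it had before the outage.
    replicas[8081] = stale
    return replicas


def servers_missing(replicas: Replicas, name: str) -> List[int]:
    """Ports of the replicas that do not hold the named file."""
    return sorted(port for port, files in replicas.items() if name not in files)


def read_from(replicas: Replicas, port: int, name: str) -> int:
    """HTTP-like status for a read served by a single replica."""
    return 200 if name in replicas[port] else 404
```

In this model, `servers_missing(replay_outage(), "file2.txt")` names only server 8081, so a read that happens to be routed there gets a 404 while the other two replicas would answer 200.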

Expected behavior
The uploaded file should be retrieved without errors because the file does exist on the other two volume servers.

@howardlau1999 (Author)

What's worse, if the other two volume servers go down permanently afterward, the second uploaded file would be LOST.

@chrislusf (Collaborator)

There is a plan to fix missing files on read.
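
One way to picture "fix missing files on read" is a read that falls back to the other replicas and copies the file to any replica that lacks it. The sketch below is an illustration under that assumption only. It is not the maintainers' design, it uses the same toy model of replicas as sets of file names, and `read_with_repair` is a hypothetical name.

```python
from typing import Dict, List, Set, Tuple


def read_with_repair(
    replicas: Dict[int, Set[str]], name: str, first_port: int
) -> Tuple[int, List[int]]:
    """Read `name`, trying first_port and then the remaining replicas.

    Returns an HTTP-like status and the ports that were repaired.
    """
    order = [first_port] + sorted(p for p in replicas if p != first_port)
    holder = next((p for p in order if name in replicas[p]), None)
    if holder is None:
        # No replica has the file: this is a genuine 404.
        return 404, []
    # Copy the file to every replica that is missing it.
    repaired = sorted(p for p, files in replicas.items() if name not in files)
    for port in repaired:
        replicas[port].add(name)
    return 200, repaired
```

A real implementation would also have to tell a file that was never written to a replica apart from one that was deliberately deleted, which this model ignores.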

kmlebedev (Contributor) commented Nov 10, 2020

I think fix.replication will run afterwards and delete Volume 1 from a random one of Servers 8080-8082.
When deleting a volume, could it look at modified_at_second and size?

chrislusf added a commit that referenced this issue Nov 10, 2020
related to #1607

old is:
* older compaction revision
* older modified time
* smaller volume size

chrislusf (Collaborator) commented Nov 10, 2020

Added more fix.replication logic to pickOneReplicaToDelete: it now compares replicas by compact revision, modified time, and file size.
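
To illustrate the selection rule, here is a rough Python model of choosing the "oldest" replica. It assumes the three criteria are compared in the order listed (compact revision first, then modified time, then size), which is my reading of the commit message rather than a copy of the Go code. The `Replica` type and the sample values are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Replica:
    server: str
    compact_revision: int
    modified_at_second: int
    size: int


def pick_one_replica_to_delete(replicas: List[Replica]) -> Replica:
    """Pick the replica with the oldest revision, then oldest mtime, then smallest size."""
    if not replicas:
        raise ValueError("no replicas to choose from")
    # Tuples compare element by element, so later fields only break ties.
    return min(
        replicas,
        key=lambda r: (r.compact_revision, r.modified_at_second, r.size),
    )
```

Applied to the scenario in this issue, the replica on server 8081 has an older modified time and a smaller size than the other two, so it would be the one chosen for deletion instead of a replica that holds `file2.txt`.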
