Inconsistent logical volume files #1607
Comments
What's worse, if the other two volume servers are permanently down afterward, the second uploaded file would be LOST.
There is a plan to fix missing files on read.
I think fix.replication will run afterwards and randomly delete volume 1 from one of the servers 8080-8082.
Related to #1607. A replica is considered old if it has:
* an older compaction revision
* an older modified time
* a smaller volume size
Added more
Describe the bug
Version: 2.08
Deployed 1 master, 3 volume servers and 1 filer with defaultReplication=001, plus a scheduled script that fixes replication every 20 minutes. One day a volume server was accidentally down for more than 20 minutes. During the downtime, some files were uploaded via the filer. Later, when the volume server came back, the files uploaded during the downtime intermittently failed to download through the filer with a 404 error.
All filer operations are using HTTP API.
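With defaultReplication=001, each logical volume has two copies on different servers in the same rack, which is why only 2 of the 3 volume servers hold the volume and why fixing replication after one goes down places a new copy on the third server. SeaweedFS reads the three digits `xyz` as extra copies in other data centers, on other racks, and on other servers in the same rack. The decoder below is a hypothetical sketch of that rule, not SeaweedFS code; `placement` and `parsePlacement` are illustrative names.

```go
package main

import (
	"fmt"
	"strconv"
)

// placement is a hypothetical decoding of the three-digit replication setting "xyz".
type placement struct {
	DiffDataCenter int // x: extra copies in other data centers
	DiffRack       int // y: extra copies on other racks in the same data center
	DiffServer     int // z: extra copies on other servers in the same rack
}

// parsePlacement decodes a string such as "001".
func parsePlacement(s string) (placement, error) {
	if len(s) != 3 {
		return placement{}, fmt.Errorf("replication %q must have exactly 3 digits", s)
	}
	var digits [3]int
	for i := 0; i < 3; i++ {
		d, err := strconv.Atoi(s[i : i+1])
		if err != nil {
			return placement{}, fmt.Errorf("replication %q: %v", s, err)
		}
		digits[i] = d
	}
	return placement{digits[0], digits[1], digits[2]}, nil
}

// CopyCount is the total number of copies: the original plus every extra copy.
func (p placement) CopyCount() int {
	return 1 + p.DiffDataCenter + p.DiffRack + p.DiffServer
}

func main() {
	p, err := parsePlacement("001")
	if err != nil {
		panic(err)
	}
	fmt.Println(p.CopyCount()) // 2
}
```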
System Setup
A minimal reproduction can be done with the following steps:
1. Start the master: `weed master -defaultReplication=001`
2. Start three volume servers, one per directory/port pair: `weed volume -max=1 -dir=/tmp/weed{0,1,2} -port={8080,8081,8082}`
3. Start the filer: `weed filer`
4. Wait several minutes for the cluster to become steady, then upload a file. Running `volume.list` in `weed shell` should show 3 volume servers, 2 of which hold one logical volume each. The file count should stay unchanged for a while.
5. Kill one of the 2 volume servers holding the volume with `kill -9`.
6. Run `volume.fix.replication` in `weed shell`. The shell should now show 2 running volume servers, each holding the logical volume.
7. Upload another file with a different file name.
8. Restart the killed volume server. The cluster is now in an inconsistent state: the replicas of the logical volume have different file counts.
9. Restart the other two volume servers and try to download the second uploaded file through the filer. A 404 Not Found error is returned.
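The steps above can be traced with a toy in-memory model to see where the inconsistency comes from. This is a hypothetical sketch, not SeaweedFS code: it assumes volume 1 initially lives on 8080 and 8081, that an upload is written to every live replica, and that `volume.fix.replication` copies the volume from a live replica to the spare server. Any read routed to the returning server then answers 404 for the second file.

```go
package main

import (
	"fmt"
	"sort"
)

// cluster maps a server port to the set of file names it stores for volume 1.
// Only servers holding a replica of volume 1 appear as keys.
type cluster struct {
	replicas map[string]map[string]bool
	down     map[string]bool
}

// upload writes the file to every replica that is currently up.
func (c *cluster) upload(name string) {
	for srv, files := range c.replicas {
		if !c.down[srv] {
			files[name] = true
		}
	}
}

// fixReplication copies the volume contents from a live replica to a new server.
func (c *cluster) fixReplication(from, to string) {
	copied := map[string]bool{}
	for f := range c.replicas[from] {
		copied[f] = true
	}
	c.replicas[to] = copied
}

// missing returns, sorted, the live replicas that would answer 404 for name.
func (c *cluster) missing(name string) []string {
	var out []string
	for srv, files := range c.replicas {
		if !c.down[srv] && !files[name] {
			out = append(out, srv)
		}
	}
	sort.Strings(out)
	return out
}

// reproduce replays the reproduction steps on the toy model.
func reproduce() *cluster {
	c := &cluster{
		// Assumption: with replication 001, volume 1 starts on 8080 and 8081.
		replicas: map[string]map[string]bool{"8080": {}, "8081": {}},
		down:     map[string]bool{},
	}
	c.upload("first.txt")
	c.down["8081"] = true            // kill -9 one replica holder
	c.fixReplication("8080", "8082") // volume.fix.replication restores two live copies
	c.upload("second.txt")           // lands on 8080 and 8082 only
	c.down["8081"] = false           // the killed server returns with its stale copy
	return c
}

func main() {
	fmt.Println(reproduce().missing("second.txt")) // [8081]
}
```

In this model the volume ends up with three replicas, and only 8081 lacks the second file, which matches the intermittent 404s: whether a download fails depends on which replica serves the read.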
Possible file distribution evolution:
Expected behavior
The second uploaded file should be retrieved without errors, because it does exist on the other two volume servers.