
inconsistent volume file size may lead to persistent write error #331

Closed
hxiaodon opened this issue Jun 27, 2016 · 1 comment
@hxiaodon (Contributor) commented Jun 27, 2016

Chris:
This is an edge case that my colleague and I reproduced in our beta environment.

Reproduce steps:

  1. Turn on replication (at least 2 copies).
  2. Launch a benchmark tool; I wrote a simple one against the filer service.
  3. Stop one volume server during the stress test (suppose its dat files include one named 1.dat). It then takes 5-10 seconds for the master node to assign a group of new volume files, and during that heartbeat window replicated writes still send requests to 1.dat on the other volume servers.
  4. Restart the volume server stopped in step 3. 1.dat can accept write requests again, but the size difference between the replicas is never reconciled.
  5. Eventually 1.dat exceeds volumeSizeLimitMB and should be marked readonly. Here the problem appears: 1.dat is readonly on one volume server but not on another, so the master keeps flipping the readonly status, and replicated writes never succeed whenever the master assigns a new fid on this volume while the topology still reports it writable (the stress test reproduces this easily).
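The flapping in step 5 can be sketched in a few lines of Go. This is a minimal illustration, not SeaweedFS code: `replica`, `readOnly`, and `volumeSizeLimit` are hypothetical names standing in for the per-server size check that each heartbeat reports to the master.

```go
package main

import "fmt"

const volumeSizeLimit = 100 // hypothetical limit, stands in for volumeSizeLimitMB

// replica models one copy of 1.dat on a volume server (illustrative only,
// not SeaweedFS's actual Volume struct).
type replica struct {
	name string
	size int64
}

// readOnly is the per-server check each heartbeat reports to the master.
func (r *replica) readOnly() bool { return r.size >= volumeSizeLimit }

func main() {
	// After step 4 the two copies of 1.dat differ in size: the server that
	// stayed up kept accepting replicated writes while the other was down.
	a := &replica{name: "serverA/1.dat", size: 105} // over the limit
	b := &replica{name: "serverB/1.dat", size: 95}  // still under it

	// Each heartbeat reports a different readonly verdict for the same
	// volume id, so the master flips the volume's writable status back
	// and forth and replicated writes keep failing.
	fmt.Println(a.name, "readonly:", a.readOnly()) // true
	fmt.Println(b.name, "readonly:", b.readOnly()) // false
}
```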

Screenshots from my local environment (two attached).

There are 2 ways to fix this issue:

  1. Have store writes (the Write method in store.go) bypass the MaxPossibleVolumeSize check, so that all volume files with the same vid on different servers eventually exceed MaxPossibleVolumeSize together.
  2. Have the master's heartbeat analysis cache each volume's maximum reported file size; if it exceeds MaxPossibleVolumeSize, mark the volume readonly directly.

What's your suggestion?

@chrislusf (Collaborator) commented
I used the second approach, remembering any volume that has been oversized.

Actually this case should not happen often, since vacuuming will usually kick in to remove those extra files. So I guess your case only happens during benchmarking.
