Chris:
This is an edge case that my colleague and I reproduced in our beta environment.
Steps to reproduce:
1. Turn on replication (at least 2 copies).
2. Launch a benchmark tool; I wrote a simple one against the filer service.
3. Stop one volume server during the stress test (suppose its dat files include one named 1.dat). It then takes the master node 5-10 seconds to assign a group of new volumes, and during that heartbeat window the replication write path keeps sending write requests to 1.dat on the other volume servers.
4. Restart the volume server stopped in step 3. 1.dat can accept write requests again, but the size difference between the replicas is never reconciled.
Eventually 1.dat exceeds volumeSizeLimitMB and should be marked readonly. But here the problem happens: 1.dat is readonly on one volume server while it is still writable on another, which makes the master server keep switching the readonly status, and the replication write will never succeed whenever the master assigns a new fid on this volume while the topology status reports it as writable. (The stress test reproduces this easily.)
Screenshot from my local environment (image omitted):
There are two ways to fix this issue:
1. Have the store write path (store.go's Write method) bypass the MaxPossibleVolumeSize check, so that the volume files for the same vid on different servers eventually all exceed MaxPossibleVolumeSize together.
2. Have the master's heartbeat handling cache each volume's maximum reported file size; once it exceeds MaxPossibleVolumeSize, mark the volume readonly directly.
What's your suggestion?
I used the second approach: the master now remembers any volume that has ever been oversized.
Actually this case should not happen often, since vacuuming usually kicks in to remove those extra files. So I guess your case only happens during your benchmarking.