Improve FileInfoBackingCache #1284
Conversation
retest this please
This change introduces a race. (see comment)
Could you give a sequence of events for how the other race occurs?
Has the locking been shown to be a problem in benchmarking?
We can get rid of it in other ways, such as:
- computeIfAbsent() // lock is at section level
- Make CachedFileInfo load and close. Presence of a CachedFileInfo in fileInfos should work as a lock. Only load after a successful put, only remove after marking dead and flushing.
@@ -92,8 +107,20 @@ private void releaseFileInfo(long ledgerId, CachedFileInfo fileInfo) {
    }

    void closeAllWithoutFlushing() throws IOException {
        for (Map.Entry<Long, CachedFileInfo> entry : fileInfos.entrySet()) {
            entry.getValue().close(false);
            try {
I don't see the point of this change. I'm pretty sure a foreach loop was used in the first place to avoid something like this.
there is no entrySet in ConcurrentLongHashMap
ya, saw that later. Better to use UncheckedIOException in any case. Maybe even change FileInfo close to throw it.
I don't think we should change the signature to throw an unchecked I/O exception. Checked exceptions are handled well along the whole path, but unchecked exceptions are not.
The logic here is just to get around the fact that this interface doesn't declare a checked exception, so that I am still able to propagate one. It doesn't really matter whether it is an unchecked execution exception or an unchecked I/O exception.
It doesn't really matter whether it is an unchecked execution exception or an unchecked I/O exception.
The difference is the instanceof check. With UncheckedIOException:
try {
    // blah
} catch (UncheckedIOException uio) {
    throw uio.getCause(); // rethrows the wrapped IOException
}
okay will change that.
done. changed to use UncheckedIOException
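For reference, the agreed-upon pattern (wrap the checked IOException inside the iteration lambda, unwrap it at the call site) can be sketched as a self-contained example. Here Info and java.util.concurrent.ConcurrentHashMap are hypothetical stand-ins for CachedFileInfo and ConcurrentLongHashMap, not the actual bookkeeper code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.ConcurrentHashMap;

public class CloseAllSketch {
    // Hypothetical stand-in for CachedFileInfo: close(flush) may throw IOException.
    static class Info {
        final long id;
        Info(long id) { this.id = id; }
        void close(boolean flush) throws IOException {
            if (id < 0) {
                throw new IOException("cannot close ledger " + id);
            }
        }
    }

    static final ConcurrentHashMap<Long, Info> fileInfos = new ConcurrentHashMap<>();

    // The forEach lambda cannot throw a checked IOException, so it is wrapped
    // in UncheckedIOException inside the lambda and unwrapped at the call site.
    static void closeAllWithoutFlushing() throws IOException {
        try {
            fileInfos.forEach((ledgerId, fi) -> {
                try {
                    fi.close(false);
                } catch (IOException ioe) {
                    throw new UncheckedIOException(ioe);
                }
            });
        } catch (UncheckedIOException uio) {
            throw uio.getCause(); // surface the original IOException
        }
    }

    public static void main(String[] args) {
        fileInfos.put(1L, new Info(1L));
        fileInfos.put(2L, new Info(2L));
        try {
            closeAllWithoutFlushing();
            System.out.println("closed all");
        } catch (IOException ioe) {
            System.out.println("failed: " + ioe.getMessage());
        }
    }
}
```

This keeps the method's external contract (a checked IOException) intact while allowing iteration through a lambda-based API.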
        }
    } finally {
        lock.readLock().unlock();
    }

    File backingFile = fileLoader.load(ledgerId, masterKey != null);
    CachedFileInfo newFi = new CachedFileInfo(ledgerId, backingFile, masterKey);
I put the load inside the lock for a reason. This change introduces a race.
| ThreadA                                | ThreadB                                |
| ====================================== | ====================================== |
| calls loadFileInfo                     |                                        |
| - checks fileInfos under read lock     |                                        |
| - loads fileinfo for file              |                                        |
|                                        | calls loadFileInfo                     |
|                                        | - checks fileInfos under read lock     |
|                                        | - loads fileinfo for file              |
|                                        | - puts into fileInfos under write lock |
|                                        |                                        |
|                                        | // does something else, maybe fencing  |
|                                        |                                        |
|                                        | calls releaseFileInfo                  |
|                                        | - flushes out fencing flag             |
|                                        | - removes from fileInfos               |
| - puts into fileInfos under write lock |                                        |
I am not sure the race condition you describe exists.
1) This method is only called by the guava cache when loading the file info, and the guava cache guarantees that pending loads for a given key are awaited, so ThreadA and ThreadB cannot both run against one guava cache. **So one has to be loading from the write fi cache while the other loads from the read fi cache.**
2) The file info loader basically does the file lookup and constructs the file info object; it doesn't touch any persistent state (e.g. create a file). So if the file exists on disk, the file info objects created by ThreadA and ThreadB are identical, and there is no race condition. The only concern is when the file doesn't exist on disk and ThreadA picks one directory while ThreadB picks another.
Now let's talk about "ThreadA picks one directory while ThreadB picks another directory".
In this case, neither ThreadA nor ThreadB can find the file on disk, which means it is a first-time creation. A read request can only attempt to load a file info after a write request has created one:
https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/HandleFactoryImpl.java#L61
So this case won't actually happen.
Hope this clarifies.
The race exists according to the contract provided by FileInfoBackingCache. I'm pretty sure it can happen with the cache loaders too, though that should be irrelevant to the discussion: whether a container is safe shouldn't depend on how it is used.
| Read Fi CacheLoader                    | Write Fi CacheLoader                   |
| ====================================== | ====================================== |
| calls loadFileInfo                     |                                        |
| - checks fileInfos under read lock     |                                        |
| - loads fileinfo for file              |                                        |
|                                        | calls loadFileInfo                     |
|                                        | - checks fileInfos under read lock     |
|                                        | - loads fileinfo for file              |
|                                        | - puts into fileInfos under write lock |
|                                        |                                        |
|                                        | // does something else, maybe fencing  |
|                                        |                                        |
|                                        | calls releaseFileInfo                  |
|                                        | - flushes out fencing flag             |
|                                        | - removes from fileInfos               |
| - puts into fileInfos under write lock |                                        |
|                                        | calls loadFileInfo                     |
|                                        | - checks fileInfos under read lock     |
|                                        | - finds fileInfo loaded by other thread|
|                                        | - fileInfo does not appear to be fenced|
|                                        |                                        |
the file info object created by thread a and thread b are identical.
At the time of creation yes. But one thread can modify the state and persist it back, and the other thread would be unaware of these state changes.
The race exists according to the contract provided by FileInfoBackingCache. I'm pretty sure it can happen with the cache loaders also, though that should be irrelevant to the discussion, as whether a container is safe should make it irrelevant as to how it is used.
this is not a general-purpose class; it is a class for the file info cache. You are taking how fileinfo and the fileinfo cache work into account, and your example is exactly the example showing this class is tied to fileinfo and the fileinfo cache.
In your new example, you say "fileInfo does not appear to be fenced". How is that possible? The fencing bit is lazily loaded via FileInfo#checkOpen. At the point the Read Fi CacheLoader puts the fileinfo into the cache, both the write and read fi cache loaders use the same file info object, which will see the exact same fence bit.
But one thread can modify the state and persist it back, and the other thread would be unaware of these state changes.
How is that possible? I only moved the construction and file lookup out of the write lock; this doesn't change the original behavior. The get and put are under the write lock, and the read and write fi caches will reference the same FileInfo when it exists in both.
The only concern could come from the file lookup being outside the lock, and I explained in my previous comment why that isn't a problem.
ah, I had it in my head that the loader actually read the file, but it should have been obvious it didn't.
This change works fine then. Why not get rid of the lock completely though?
I am using your diagram and the existing logic: the Read Fi CacheLoader will overwrite the fileinfo created by the Write Fi CacheLoader.
it is a giant write lock, so every thread blocks waiting for the file lookup (which is done by the file loader).
The Read Fi CacheLoader will overwrite the fileinfo created by Write Fi CacheLoader.
Ah yes. At minimum we should be checking again for existence inside the write lock.
it is a giant write lock. so every thread is blocking waiting for file lookup (which is done by the file loader).
But again, my question is whether this has been observed to be a problem?
Anyhow, if we want to get rid of the locking, we should get rid of the big lock completely, which shouldn't be too hard with the concurrent map.
Consider the following, where the lock is at the ledger granularity
CachedFileInfo loadFileInfo(long ledgerId, byte[] masterKey) throws IOException {
    while (true) {
        // lambda param renamed to avoid shadowing the ledgerId method parameter
        CachedFileInfo fi = fileInfos.computeIfAbsent(
                ledgerId, (id) -> new CachedFileInfo(id));
        synchronized (fi) {
            if (!fi.initialized()) {
                File backingFile = fileLoader.load(ledgerId, masterKey != null);
                fi.initialize(backingFile, masterKey);
            }
        }
        if (fi.tryRetain()) {
            return fi;
        }
    }
}
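The retry loop above works only if tryRetain() refuses a dead entry, forcing the loader back to computeIfAbsent to install a fresh one. A hypothetical sketch of refcount semantics that would make this safe follows; the field names and the DEAD sentinel are illustrative assumptions, not the actual CachedFileInfo implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RetainSketch {
    static class CachedFileInfo {
        private static final int DEAD = -1;
        private final AtomicInteger refCount = new AtomicInteger(0);

        // CAS loop: refuse to retain once the entry has been marked dead,
        // so a caller that loses the race must loop and reload.
        boolean tryRetain() {
            while (true) {
                int count = refCount.get();
                if (count == DEAD) {
                    return false; // already closed; caller must retry
                }
                if (refCount.compareAndSet(count, count + 1)) {
                    return true;
                }
            }
        }

        // Drop the count; transition atomically to DEAD when the last
        // reference goes away. Returns true when this call killed the entry.
        boolean release() {
            return refCount.updateAndGet(c -> c == 1 ? DEAD : c - 1) == DEAD;
        }
    }

    public static void main(String[] args) {
        CachedFileInfo fi = new CachedFileInfo();
        System.out.println(fi.tryRetain());  // first reference
        System.out.println(fi.tryRetain());  // second reference
        System.out.println(fi.release());    // one reference still held
        System.out.println(fi.release());    // last release marks it dead
        System.out.println(fi.tryRetain());  // dead entries cannot be retained
    }
}
```

The DEAD sentinel is what prevents the resurrect-after-close race: once an entry hits zero it can never be retained again, only replaced in the map.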
this is one of the changes I made here: checking for existence again inside the write lock.
Not from a benchmark. I hit the race condition, and then I noticed that the file lookup happens under a giant lock, which I would like to remove, because we have been working hard in SortedLedgerStorage to avoid doing any I/O under a lock. FileInfoBackingCache unfortunately added this behavior back. So I have a strong concern about this and want to move the I/O operation out. However, I don't want to change the locking behavior on getting FileInfo, for the reasons explained below.
A lot of race conditions can arise when competing to get the fileinfo, so the check-existence and put are much safer under the write lock, and they are just memory operations, which are fine to do while holding it. However, finding the directory for a ledger is an I/O operation; it is expensive and unpredictable compared to a memory operation, so it should not happen under the lock. The current approach is the most comfortable and safest one I can think of. However, if you feel strongly about this, feel free to pursue your approach.
I'm fine with the current approach, but it seems pointless to have locks around a concurrent hashmap. I've made a PR into your branch with a lockless approach.
Well, it depends on how you see things. The concurrent hashmap is a section-based implementation, so it has better parallelism and less contention, especially since most of the time the map is protected under a read lock and is only blocked when the map is about to be modified.
I have thought about a lockless approach but I didn't take it. A lockless approach moves the mutation out of the protection of a write lock, which can potentially expose it to race conditions we haven't thought of. This area has been very tricky, based on the PRs and discussions we have had around it, and it is too risky a change for me. That's also the reason I explained above, and the reason I don't feel comfortable with your lockless change. My immediate concern is only "don't do I/O under one lock". I am much more comfortable keeping the read/write lock until there is evidence showing the read/write lock is a problem.
Sure, I'm fine with the current approach. If we were using stock ConcurrentHashMap it'd be less risky, but since the hashmap is our code too, it's better to err on the safe side. I'll push the rename of FileLoader in a different PR.
retest this please
1 similar comment
retest this please
Descriptions of the changes in this PR:
There are a couple of issues noticed in FileInfoBackingCache:
There is a race condition in loadFileInfo between the get-check and the put. If concurrent loading happens, a FileInfo may be loaded into the map after the get-check, which can cause an incorrect reference count on that FileInfo.
FileLoader performs an I/O operation under a giant write lock.
assert is generally not recommended, since it is typically disabled at production runtime.
Changes
Besides that, switch to ConcurrentLongHashMap to avoid boxing and unboxing.
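A minimal sketch of the shape this PR describes (the expensive lookup done outside any lock, then existence re-checked inside the write lock so a concurrent loader's entry is never overwritten). The names here, plus the plain HashMap and ReentrantReadWriteLock, are simplified stand-ins for the actual bookkeeper classes:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DoubleCheckSketch {
    // Hypothetical stand-in for CachedFileInfo.
    static class Info {
        final long ledgerId;
        Info(long ledgerId) { this.ledgerId = ledgerId; }
    }

    static final Map<Long, Info> fileInfos = new HashMap<>();
    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    static int loads = 0; // counts simulated directory lookups

    // Stand-in for FileLoader: the expensive directory lookup.
    static Info expensiveLoad(long ledgerId) {
        loads++;
        return new Info(ledgerId);
    }

    static Info loadFileInfo(long ledgerId) {
        // Fast path: read lock only.
        lock.readLock().lock();
        try {
            Info fi = fileInfos.get(ledgerId);
            if (fi != null) {
                return fi;
            }
        } finally {
            lock.readLock().unlock();
        }

        Info newFi = expensiveLoad(ledgerId); // I/O, deliberately unlocked

        lock.writeLock().lock();
        try {
            Info existing = fileInfos.get(ledgerId); // check again under lock
            if (existing != null) {
                return existing; // lost the race; keep the winner's entry
            }
            fileInfos.put(ledgerId, newFi);
            return newFi;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        Info a = loadFileInfo(1L);
        Info b = loadFileInfo(1L);
        System.out.println(a == b);   // second call hits the cache
        System.out.println(loads);    // only one lookup performed
    }
}
```

The key property is that the check-and-put stays atomic under the write lock while the I/O does not block other readers, which is exactly the trade-off argued for in the conversation above.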
Related Issues:
#913 #832