[FLINK-28420] Support partial caching in sync and async lookup runner #20439
…ction API and add tests for sync lookup
…cLookupFunction and add tests for async lookup
wuchong
left a comment
Thanks for the contribution, @PatrickRen.
My biggest concern with the implementation is that the partial cache invades the LookupJoinRunners (4 classes). The cache implementation could instead be a simple wrapper around the LookupFunction, such as CachedLookupFunction. This would clearly separate the caching (partial or full), fetcher, and join runner logic, which is very important for long-term maintenance. I can imagine the LookupRunner becoming much more complex, with many if/else branches, once we introduce full caching.
What do you think about refactoring to a simple PartialCachedLookupFunction extends LookupFunction, which wraps the user's LookupFunction and the LookupCache? In the future, we could support full caching in the same way, just by adding a new class.
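To make the wrapper idea concrete, here is a minimal sketch of what such a class could look like. It is not the Flink API: SimpleLookupFunction, SimpleLookupCache, and MapLookupCache are hypothetical stand-ins for LookupFunction and LookupCache, and the wrapper implements an interface rather than extending an abstract class.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for LookupFunction (assumption, not the real signature). */
interface SimpleLookupFunction<K, V> {
    Collection<V> lookup(K key);
}

/** Hypothetical stand-in for LookupCache (assumption, not the real interface). */
interface SimpleLookupCache<K, V> {
    /** Returns the cached rows for the key, or null on a cache miss. */
    Collection<V> getIfPresent(K key);

    void put(K key, Collection<V> values);
}

/** Trivial unbounded cache used only to make the sketch self-contained. */
final class MapLookupCache<K, V> implements SimpleLookupCache<K, V> {
    private final Map<K, Collection<V>> entries = new HashMap<>();

    @Override
    public Collection<V> getIfPresent(K key) {
        return entries.get(key);
    }

    @Override
    public void put(K key, Collection<V> values) {
        entries.put(key, values);
    }
}

/** Wraps the user's lookup function and consults the cache first. */
final class PartialCachedLookupFunction<K, V> implements SimpleLookupFunction<K, V> {
    private final SimpleLookupFunction<K, V> delegate;
    private final SimpleLookupCache<K, V> cache;

    PartialCachedLookupFunction(
            SimpleLookupFunction<K, V> delegate, SimpleLookupCache<K, V> cache) {
        this.delegate = delegate;
        this.cache = cache;
    }

    @Override
    public Collection<V> lookup(K key) {
        Collection<V> cached = cache.getIfPresent(key);
        if (cached != null) {
            return cached; // cache hit: the user's fetcher is not called
        }
        Collection<V> fetched = delegate.lookup(key);
        // This sketch caches every result, including empty ones.
        cache.put(key, fetched);
        return fetched;
    }
}
```

With this shape, the join runner would only ever see a lookup function, so a full-caching variant could be another wrapper class rather than another branch in the runners.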
| * table for which it is serving. | ||
| */ | ||
| @Internal | ||
| public class LookupCacheManager { |
It seems this is only used internally (flink-table-runtime), why put it in a common module?
| public synchronized LookupCache registerCache(String tableIdentifier, LookupCache cache) { | ||
| checkNotNull(cache, "Could not register null cache in the manager"); | ||
| if (cachesByTableIdentifier.containsKey(tableIdentifier)) { | ||
| return cachesByTableIdentifier.get(tableIdentifier); | ||
| } else { | ||
| cachesByTableIdentifier.put(tableIdentifier, cache); | ||
| return cache; | ||
| } | ||
| } |
I have several concerns about this method:
- The same dim table can be used multiple times in a SQL query with different table options (e.g. cache TTL). I think we shouldn't reuse them if the configuration differs. That means we may need to introduce a LookupCache#getIdentifier() interface to get an identifier of a specific cache.
- tableIdentifier is not enough to identify a cache, because in session mode, different jobs may use the same table identifier to refer to different external tables. Maybe we should add JobID as part of the cache id.
- Therefore, the final registered key should be a composite of JobID, table identifier and cache identifier.
- From the method signature, the cache parameter is the one registered; however, it may not be. Maybe registerCacheIfAbsent would be better (similar to Map#putIfAbsent).
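The points above could be sketched roughly as follows. CacheKey and CacheRegistry are hypothetical names introduced for illustration, the three key components are modelled as plain strings, and the cache type is left generic; none of this is taken from the PR itself.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical composite key: job ID + table identifier + cache identifier. */
final class CacheKey {
    private final String jobId;
    private final String tableIdentifier;
    private final String cacheIdentifier;

    CacheKey(String jobId, String tableIdentifier, String cacheIdentifier) {
        this.jobId = Objects.requireNonNull(jobId);
        this.tableIdentifier = Objects.requireNonNull(tableIdentifier);
        this.cacheIdentifier = Objects.requireNonNull(cacheIdentifier);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof CacheKey)) {
            return false;
        }
        CacheKey other = (CacheKey) o;
        return jobId.equals(other.jobId)
                && tableIdentifier.equals(other.tableIdentifier)
                && cacheIdentifier.equals(other.cacheIdentifier);
    }

    @Override
    public int hashCode() {
        return Objects.hash(jobId, tableIdentifier, cacheIdentifier);
    }
}

/** Hypothetical registry; C stands in for the cache type. */
final class CacheRegistry<C> {
    private final Map<CacheKey, C> caches = new ConcurrentHashMap<>();

    /**
     * Registers the cache only if no cache exists for the key, mirroring
     * Map#putIfAbsent. Returns the cache that is actually registered, which
     * may differ from the argument.
     */
    C registerCacheIfAbsent(CacheKey key, C cache) {
        Objects.requireNonNull(key, "Cache key must not be null");
        Objects.requireNonNull(cache, "Could not register null cache in the manager");
        C existing = caches.putIfAbsent(key, cache);
        return existing != null ? existing : cache;
    }
}
```

Two usages of the same table with different cache options, or the same table name in two different jobs, would then map to different keys and so never share a cache by accident.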
| if (provider instanceof PartialCachingLookupProvider) { | ||
| LookupCache cache = ((PartialCachingLookupProvider) provider).getCache(); | ||
| if (cache == null) { | ||
| return Optional.empty(); |
Shall we allow a null cache for PartialCachingLookupProvider?
- Users should use LookupFunctionProvider if no cache is provided.
- PartialCachingLookupProvider#getCache() is not annotated @Nullable.
- I think we should also not allow null getScanRuntimeProvider and getCacheReloadTrigger for FullCachingLookupProvider.
| LookupJoinCodeGenerator.generateLeftTableKeyProjection( | ||
| tableConfig, | ||
| classLoader, | ||
| leftTableRowType, | ||
| keyRowType, | ||
| lookupKeys), |
Please use org.apache.flink.table.planner.plan.utils.KeySelectorUtil#getRowDataSelector to create the key selector/projection. It guarantees that a BinaryRowData is always generated, which gives consistent hashCode() and equals() behavior.
| leftTableRowType, | ||
| keyRowType, | ||
| lookupKeys), | ||
| new RowDataSerializer(keyRowType), |
cacheKeySerializer is not needed when we use the above key selector.
| @Nullable private final RowDataSerializer cacheKeySerializer; | ||
| @Nullable private final RowDataSerializer cacheValueSerializer; | ||
|
|
||
| private LookupCache cache; |
transient?
| if (cacheKeySerializer != null && cacheValueSerializer != null) { | ||
| RowData copiesKey = cacheKeySerializer.copy(key); | ||
| Collection<RowData> copiedValues = | ||
| value.stream() | ||
| .map(cacheValueSerializer::copy) | ||
| .collect(Collectors.toCollection(ArrayList::new)); | ||
| getCache().put(copiesKey, copiedValues); |
Why copy?
| * <p>- Input from left table (id, name): +I(1, Alice) | ||
| * | ||
| * <p>- Value return by user's fetcher (id, age, gender): +I(1, 18, female) | ||
| * | ||
| * <p>Then the entry stored in the cache would be: +I(1), +I(1, 18, female), even calculation |
The change flag can be omitted when explaining the cache strategy because the cache doesn't store any changelogs and all records should be insert-only.
+I(1), +I(1, 18, female) ==> key=(1), value=(1, 18, female)
| HEAP_BACKEND, | ||
| ENABLE_OBJECT_REUSE, | ||
| AsyncOutputMode.ALLOW_UNORDERED, | ||
| DISABLE_CACHE), |
I don't think we need to double the cases for dynamic table sources. Mixing cache cases into the existing 4 cases (for dynamic table sources) should be enough.
| } | ||
| } | ||
| if (cacheHandler != null) { | ||
| cacheHandler.getCache().close(); |
The cache should be unregistered here as well; otherwise, it leaks memory. That means LookupCacheManager may need to track a reference count.
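As a rough illustration of the reference-counting idea, a manager could count acquisitions per key and only unregister and close a cache when the last user releases it. RefCountedCaches, acquire, and release are hypothetical names, and the cache is modelled as any AutoCloseable; this is a sketch under those assumptions, not the PR's LookupCacheManager.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical manager that shares caches per key and tracks their users. */
final class RefCountedCaches<K, C extends AutoCloseable> {

    /** A shared cache together with the number of current users. */
    private static final class Entry<T> {
        final T cache;
        int refCount;

        Entry(T cache) {
            this.cache = cache;
        }
    }

    private final Map<K, Entry<C>> entries = new HashMap<>();

    /** Registers the candidate if the key is new, then increments the count. */
    synchronized C acquire(K key, C candidate) {
        Entry<C> entry = entries.computeIfAbsent(key, k -> new Entry<>(candidate));
        entry.refCount++;
        return entry.cache;
    }

    /** Decrements the count; the last release unregisters and closes the cache. */
    synchronized void release(K key) {
        Entry<C> entry = entries.get(key);
        if (entry == null) {
            return; // nothing registered under this key
        }
        entry.refCount--;
        if (entry.refCount == 0) {
            entries.remove(key);
            try {
                entry.cache.close();
            } catch (Exception e) {
                throw new IllegalStateException("Failed to close cache", e);
            }
        }
    }

    /** Number of caches currently registered. */
    synchronized int size() {
        return entries.size();
    }
}
```

In this shape, each runner would call release in its close path instead of closing the cache directly, so a cache shared by several subtasks stays open until the last one finishes.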
What is the purpose of the change
This pull request adds partial caching functionality to the sync and async lookup runners, as part of FLIP-221.
Brief change log
Verifying this change
This change modifies some existing tests, including LookupJoinHarnessTest and LookupJoinITCase.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)
Documentation