Multi-output PCollections are cached regardless of cacheDisabled #19992

@damccorm

Description

This is likely an omission: multi-output PCollections are persisted without checking cacheDisabled. All caching logic should be consolidated in EvaluationContext so that the setting is applied consistently.

    if (outputs.size() > 1) {
      StorageLevel level = StorageLevel.fromString(context.storageLevel());
      if (canAvoidRddSerialization(level)) {
        // if it is memory only reduce the overhead of moving to bytes
        all = all.persist(level);
      } else {
        // Caching can cause Serialization, we need to code to bytes
        // more details in https://issues.apache.org/jira/browse/BEAM-2669
        Map<TupleTag<?>, Coder<WindowedValue<?>>> coderMap =
            TranslationUtils.getTupleTagCoders(outputs);
        all =
            all.mapToPair(TranslationUtils.getTupleTagEncodeFunction(coderMap))
                .persist(level)
                .mapToPair(TranslationUtils.getTupleTagDecodeFunction(coderMap));
      }
    }
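
One possible shape for the fix, as a minimal sketch rather than the actual patch: gate the persist step on the cacheDisabled setting in a single place. The class `MultiOutputCachePolicy` and its methods are hypothetical names introduced for illustration and are not part of Beam. The sketch also assumes the flag is available as a plain boolean at the call site, and it stands in a generic `UnaryOperator` for the real `persist` call so that it runs without Spark on the classpath.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper (not part of Beam) that centralizes the persist decision. */
final class MultiOutputCachePolicy {
  private MultiOutputCachePolicy() {}

  /**
   * Returns true only when there is more than one output AND caching has not
   * been disabled. The snippet in this issue checks only the first condition.
   */
  static boolean shouldPersist(int outputCount, boolean cacheDisabled) {
    return outputCount > 1 && !cacheDisabled;
  }

  /**
   * Applies the given persist step only when the policy allows it.
   * In the runner, "persist" would stand in for the persist(level) branches
   * shown above; here it is a plain function so the sketch is self-contained.
   */
  static <T> T maybePersist(
      T rdd, int outputCount, boolean cacheDisabled, UnaryOperator<T> persist) {
    return shouldPersist(outputCount, cacheDisabled) ? persist.apply(rdd) : rdd;
  }
}
```

If the decision lived in EvaluationContext, as suggested above, the multi-output translation would ask the context whether to persist instead of calling `persist` directly, so single-output and multi-output paths would honor the same setting.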

Imported from Jira BEAM-9387. Original Jira may contain additional context.
Reported by: ibzib.
