Multi-output PCollections are cached regardless of cacheDisabled

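In the Spark runner, `TransformTranslator` persists the RDD of any transform with more than one output without checking `cacheDisabled`. This is likely an omission. All caching logic should be consolidated in `EvaluationContext` to ensure consistency.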

https://github.com/apache/beam/blob/d634674582a83b08bb9071ead7e8e5dae9fdc0bb/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/TransformTranslator.java#L413-L428

Imported from Jira [BEAM-9387](https://issues.apache.org/jira/browse/BEAM-9387). Original Jira may contain additional context.
Reported by: ibzib.

	if (outputs.size() > 1) {
	  StorageLevel level = StorageLevel.fromString(context.storageLevel());
	  if (canAvoidRddSerialization(level)) {
	    // if it is memory only reduce the overhead of moving to bytes
	    all = all.persist(level);
	  } else {
	    // Caching can cause Serialization, we need to code to bytes
	    // more details in https://issues.apache.org/jira/browse/BEAM-2669
	    Map<TupleTag<?>, Coder<WindowedValue<?>>> coderMap =
	        TranslationUtils.getTupleTagCoders(outputs);
	    all =
	        all.mapToPair(TranslationUtils.getTupleTagEncodeFunction(coderMap))
	            .persist(level)
	            .mapToPair(TranslationUtils.getTupleTagDecodeFunction(coderMap));
	  }
	}
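
The snippet above persists whenever there is more than one output and never consults `cacheDisabled`. One possible shape for the fix is sketched below, assuming the decision is moved into a single place that every caller goes through. `CachingPolicy`, `shouldPersistMultiOutput`, and `maybePersist` are hypothetical names used only for illustration; they are not part of Beam or of `EvaluationContext`, and the generic `dataset` parameter merely stands in for the RDD.

```java
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a consolidated caching decision; not Beam's actual API. */
final class CachingPolicy {
  private final boolean cacheDisabled;

  CachingPolicy(boolean cacheDisabled) {
    this.cacheDisabled = cacheDisabled;
  }

  /** True only when caching is enabled and the transform has more than one output. */
  boolean shouldPersistMultiOutput(int outputCount) {
    return !cacheDisabled && outputCount > 1;
  }

  /** Applies the persist step only when the policy allows it; otherwise returns the input unchanged. */
  <T> T maybePersist(T dataset, int outputCount, UnaryOperator<T> persist) {
    return shouldPersistMultiOutput(outputCount) ? persist.apply(dataset) : dataset;
  }
}
```

With a guard like this, the `outputs.size() > 1` branch would call the policy instead of persisting directly, so single-output and multi-output transforms would honor the same option.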

Multi-output PCollections are cached regardless of cacheDisabled #19992
