Make the spark runner not serialize data unless spark is spilling to disk

Currently for storage level MEMORY_ONLY, Beam does not coder-ify the data. This lets Spark keep the data in memory avoiding the serialization round trip. Unfortunately the logic is fairly coarse - as soon as you switch to MEMORY_AND_DISK, Beam coder-ifys the data even though Spark might have chosen to keep the data in memory, incurring the serialization overhead.

 

Ideally Beam would serialize the data lazily - as Spark chooses to spill to disk. This would be a change in behavior when using beam, but luckily Spark has a solution for folks that want data serialized in memory - MEMORY_AND_DISK_SER will keep the data serialized.

Imported from Jira [BEAM-5775](https://issues.apache.org/jira/browse/BEAM-5775). Original Jira may contain additional context.
Reported by: mikekap.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make the spark runner not serialize data unless spark is spilling to disk #19228

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Make the spark runner not serialize data unless spark is spilling to disk #19228

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions