[SS][WIP] Serialization using case classes/primitives/POJO based on Avro for Arbitrary State API v2.#44989

Closed
jingz-db wants to merge 15 commits into apache:master from jingz-db:avro-serialization

Contributor

@jingz-db jingz-db commented Feb 1, 2024

What changes were proposed in this pull request?

In the new operator for Arbitrary State API v2, we cannot rely on the session or encoder being available, since the various state instances are initialized on the executors. The available encoders also support only a limited set of state types. Hence, for state serialization, we propose serializing primitives, case classes, and POJOs into Avro bytes.

Why are the changes needed?

These changes provide a dedicated serializer for state-v2. Leveraging Avro can speed up serialization, and Avro comes with native support for schema evolution.
The changes are part of the work to add a new stateful streaming operator for arbitrary state management, which provides the new features listed in the SPIP JIRA: https://issues.apache.org/jira/browse/SPARK-45939

Does this PR introduce any user-facing change?

TODO: depends on whether we want to ask users for value encoders.

How was this patch tested?

Separate unit tests for primitives, case classes, and POJOs.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot removed the AVRO label Feb 1, 2024
@jingz-db jingz-db changed the title from "Avro serialization" to "[SS] Serialization using case classes/primitives/POJO based on Avro for Arbitrary State API v2." Feb 1, 2024
@jingz-db jingz-db marked this pull request as ready for review February 2, 2024 00:04
@jingz-db jingz-db changed the title from "[SS] Serialization using case classes/primitives/POJO based on Avro for Arbitrary State API v2." to "[SS][WIP] Serialization using case classes/primitives/POJO based on Avro for Arbitrary State API v2." Feb 2, 2024
@github-actions github-actions bot added the AVRO label Feb 2, 2024
* @return - instance of ValueState of type T that can be used to store state persistently
*/
- def getValueState[T](stateName: String): ValueState[T]
+ def getValueState[T](stateName: String, valEncoder: Encoder[T]): ValueState[T]
Contributor

Add this to the function comment.

}

- private[avro] object AvroFileFormat {
+ private[spark] object AvroFileFormat {
Contributor

Can it be private[sql]?

keyRow
}

def encodeValue[S] (value: S): UnsafeRow = {
Contributor

Should we remove this later?

// case class -> dataType
val valSchema: StructType = valEnc.schema
// dataType -> avroType
val avroType: Schema = SchemaConverters.toAvroType(valSchema)
Contributor

Convert from Spark SQL schema to Avro schema?
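
For reference, the two steps in the hunk above (case class to Spark SQL schema, then Spark SQL schema to Avro schema) can be sketched in isolation. This is only a minimal sketch, assuming the spark-sql and spark-avro modules are on the classpath; `Person` and `SchemaConversionSketch` are hypothetical names used for illustration and are not part of this PR.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType

// Hypothetical state value type, used only for illustration.
case class Person(name: String, age: Int)

object SchemaConversionSketch {
  // case class -> Spark SQL schema (derived from the encoder)
  val valEnc: Encoder[Person] = Encoders.product[Person]
  val valSchema: StructType = valEnc.schema

  // Spark SQL schema -> Avro schema
  val avroType: Schema = SchemaConverters.toAvroType(valSchema)
}
```

Naming the intermediate values this way makes the direction of each conversion explicit, which is what the inline comments in the hunk are trying to convey.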

new GenericDatumWriter[Any](avroType)
val avroData = avroSerializer.serialize(objRow)
writer.write(avroData, encoder)
encoder.flush()
Contributor

Do we need to call writer.close?
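
On the question above: in the Avro Java API, the `DatumWriter` interface declares only `setSchema` and `write`, so there is nothing to close on the writer itself; the buffered binary encoder is what has to be flushed before the bytes are read. The sketch below shows that write/flush pattern with plain Avro, outside of Spark. It is not the PR's code: `AvroBytesSketch`, the `Person` schema, and the helper names are hypothetical, and it assumes only the Avro library on the classpath.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

object AvroBytesSketch {
  // Hypothetical schema, built by hand instead of via SchemaConverters.
  val schema: Schema = SchemaBuilder.record("Person").fields()
    .requiredString("name")
    .requiredInt("age")
    .endRecord()

  // Serialize one record to Avro binary bytes.
  def toBytes(record: GenericRecord): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    val writer = new GenericDatumWriter[GenericRecord](schema)
    writer.write(record, encoder)
    // The binary encoder buffers, so flush before reading the bytes.
    encoder.flush()
    out.toByteArray
  }

  // Deserialize Avro binary bytes back into a record.
  def fromBytes(bytes: Array[Byte]): GenericRecord = {
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    val reader = new GenericDatumReader[GenericRecord](schema)
    reader.read(null, decoder)
  }
}
```

A round trip would build a `GenericData.Record` for `AvroBytesSketch.schema`, pass it to `toBytes`, and read it back with `fromBytes`. Note that string fields come back as Avro `Utf8` values, so compare them via `toString`.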

@github-actions github-actions bot added the BUILD label Feb 12, 2024
@github-actions github-actions bot added the CORE label Feb 12, 2024
@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 23, 2024
@github-actions github-actions bot closed this May 24, 2024