Skip to content

Conversation

@jackierwzhang
Copy link
Contributor

@jackierwzhang jackierwzhang commented Apr 28, 2025

What changes were proposed in this pull request?

Minor refactor to introduce an interface for accessing the metadata (e.g. offset / commit logs) in a streaming checkpoint.

Why are the changes needed?

To standardize the access pattern.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This is a pure refactoring, existing tests should suffice.

Was this patch authored or co-authored using generative AI tooling?

No.

Copy link
Contributor

@brkyvz brkyvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the refactor. It makes access to the checkpoint outside of StreamExecution more formalized, e.g. in the State data store

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks OK - meta question: I see you changed val to lazy val in multiple places. Is that intentional, and if it is, mind explaining?

@jackierwzhang
Copy link
Contributor Author

The code looks OK - meta question: I see you changed val to lazy val in multiple places. Is that intentional, and if it is, mind explaining?

Yes, there are mainly two reason:

  1. I don't want the initialization of the class to automatically initialize the logs since they involve FS operations, only direct access should initialize the logs.
  2. Overriding a lazy val in StreamExecution is easier to reason that a plain val, which may be null when it's overridden by the subclass and hence throw NPE. For example, this id may throw null because checkpointMetadata above it is overridden.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 pending CI. Thanks for the explanation.

@HeartSaVioR
Copy link
Contributor

https://github.com/jackierwzhang/spark/actions/runs/14720115571/job/41312160109

Test failures seem to be relevant. @jackierwzhang Would you mind taking a look?

@jackierwzhang
Copy link
Contributor Author

@HeartSaVioR I actually had to revert some lazy vals in StreamExecution back to vals because this access to id within streamMetadata would actually forces initialize the subclass's lazy val but at that time the input parameter asyncWritesExecutorService here is still null.

}
}

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackierwzhang Sorry, shall we retain the code comment as it is when you move this and below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it because they are duplicated with that in the class?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK you moved the explanation to new class. Never mind.

@HeartSaVioR
Copy link
Contributor

Thanks! Merging to master.

yhuang-db pushed a commit to yhuang-db/spark that referenced this pull request Jun 9, 2025
…adata

### What changes were proposed in this pull request?
Minor refactor to introduce an interface for accessing the metadata (e.g. offset / commit logs) in a streaming checkpoint.

### Why are the changes needed?
To standardize the access pattern.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This is a pure refactoring, existing tests should suffice.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#50746 from jackierwzhang/spark-51940-checkpoint-metadata-interface.

Authored-by: Jackie Zhang <ruowang.zhang+data@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants