[WIP][SS] Python streaming source #44416

Closed
wants to merge 2 commits into from


chaoqin-li1123
Contributor

What changes were proposed in this pull request?

Proof of concept (POC) of a Python streaming source.
A Python runner that embeds a streaming data source and handles the planInputPartitions and latestOffset calls.
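
To make the two calls concrete, here is a minimal, self-contained sketch of a source that answers them and a driver loop that uses the answers. It does not use PySpark. `StreamingSource`, `CounterSource`, `run_microbatch`, and the dict-shaped offsets are hypothetical stand-ins chosen for illustration, not the interface proposed in this PR.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Offset = Dict[str, int]


class StreamingSource(ABC):
    """Hypothetical stand-in for a Python streaming source (not the PR's interface)."""

    @abstractmethod
    def latestOffset(self) -> Offset:
        """Return the newest offset currently available."""

    @abstractmethod
    def planInputPartitions(self, start: Offset, end: Offset) -> List[Tuple[int, int]]:
        """Split the half-open range [start, end) into independently readable partitions."""


class CounterSource(StreamingSource):
    """Toy source whose records are the integers 0, 1, 2, ...; `available` grows over time."""

    def __init__(self, partition_size: int = 2) -> None:
        self.available = 0
        self.partition_size = partition_size

    def latestOffset(self) -> Offset:
        return {"position": self.available}

    def planInputPartitions(self, start: Offset, end: Offset) -> List[Tuple[int, int]]:
        lo, hi = start["position"], end["position"]
        step = self.partition_size
        # Each tuple is a half-open (begin, end) slice of the counter.
        return [(i, min(i + step, hi)) for i in range(lo, hi, step)]


def run_microbatch(source: StreamingSource, last_committed: Offset):
    """Ask for the latest offset, then plan partitions from the last committed offset to it."""
    end = source.latestOffset()
    return end, source.planInputPartitions(last_committed, end)


source = CounterSource(partition_size=2)
source.available = 5  # pretend five records have arrived
end, partitions = run_microbatch(source, {"position": 0})
```

Calling `run_microbatch` again with `end` as the committed offset, before any new records arrive, plans no partitions, which is the behavior a micro-batch driver would expect from an idle source.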

@chaoqin-li1123 chaoqin-li1123 changed the title Python streaming source POC [WIP][SS] Python streaming source Dec 19, 2023
@HyukjinKwon HyukjinKwon marked this pull request as draft December 19, 2023 23:27
        self.value = self.from_json(json_str)

class DataStreamReader(ABC):
    def latest_offset(self) -> DataStreamOffset:
Member

By design, all of the Python APIs follow the camelCase naming rule. If we could make it a single word like latestoffset, that would be an option too (to make it more Pythonic).

A rather minor but pertinent point

I see

from pyspark.sql.session import SparkSession

or equally

from pyspark.sql import SparkSession

Both are valid and import the same class, because pyspark.sql re-exports SparkSession from pyspark.sql.session. In practice, many developers prefer the second form (from pyspark.sql import SparkSession) because it is shorter and matches how the pyspark.sql package exposes its main entry points.

    def offset_to_json(self, offset: DataStreamOffset) -> str:
        ...

    def json_to_offset(self, json: str) -> DataStreamOffset:
Contributor Author

I was trying to make this method a member of DataStreamOffset, but the JSON string doesn't contain any type information and can't be loaded back into the original class (we enforce plaintext JSON because @HeartSaVioR doesn't want to store unreadable pickled bytes in the offset log). I wonder whether @HyukjinKwon, as a Python expert, has any good suggestions for the interface for deserializing JSON back to a specific DataStreamOffset.
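
One way to picture the constraint in this comment: because plaintext JSON carries no class information, whatever reads the offset log must already know which offset class to build. The sketch below shows one option, in which the reader owns both directions of the conversion. `RangeOffset` and `RangeReader` are hypothetical names for illustration, only the standard `json` and `dataclasses` modules are used, and this is not necessarily the interface the PR settled on.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RangeOffset:
    """Hypothetical offset type; the fields are illustrative only."""

    partition: int
    position: int


class RangeReader:
    """Hypothetical reader that knows which offset class it works with."""

    def offset_to_json(self, offset: RangeOffset) -> str:
        # Plain, human-readable JSON: no class name and no pickled bytes are stored.
        return json.dumps(asdict(offset), sort_keys=True)

    def json_to_offset(self, json_str: str) -> RangeOffset:
        # The reader supplies the type information that the JSON string lacks.
        return RangeOffset(**json.loads(json_str))


reader = RangeReader()
logged = reader.offset_to_json(RangeOffset(partition=0, position=42))
restored = reader.json_to_offset(logged)
```

In this sketch the parameter is named `json_str` rather than `json` so that it does not shadow the `json` module inside the method body.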

3 participants