Skip to content

Conversation

@HuangXingBo
Copy link
Contributor

What is the purpose of the change

This pull request will add the missing cache api in Python DataStream API

Brief change log

  • Add the missing cache api in Python DataStream API

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot
Copy link
Collaborator

flinkbot commented Aug 18, 2022

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

.. versionadded:: 1.16.0
"""
return DataStream(self._j_data_stream.cache())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we may need to return CachedDataStream which has some specific API, e.g. invalidate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On top of that, we also introduce SideOutputDataStream which supports the cache method. I think we may need to introduce SideOutputDataStream in PyFlink and let get_side_output return SideOutputDataStream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sxnan I think we can reuse the cache method of DataStream, no need to introduce a new class SideOutputDataStream

:return: The cached DataStream that can use in later job to reuse the cached intermediate
result.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need also add close method in StreamExecutionEnvironment.

@HuangXingBo
Copy link
Contributor Author

@dianfu @Sxnan I have resolved the comments in the latest commit.

@Sxnan
Copy link
Contributor

Sxnan commented Aug 19, 2022

@HuangXingBo Thanks for your update! No more comments from my side.

@dianfu dianfu closed this in 97519d1 Aug 22, 2022
huangxiaofeng10047 pushed a commit to huangxiaofeng10047/flink that referenced this pull request Nov 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants