[Feature Request]: Support LZMA compression in python I/O SDKs #25316

Closed
1 of 15 tasks
wrossmorrow opened this issue Feb 5, 2023 · 1 comment
Labels
done & done Issue has been reviewed after it was closed for verification, followups, etc. new feature P2 python

Comments

@wrossmorrow
Contributor

What would you like to happen?

LZMA compression is part of the Python standard library (the lzma module), but it is not one of the compression strategies supported by the beam.io.ReadFromText and beam.io.WriteToText PTransforms. openwebtext, for example, uses this compression. I think this may be a pretty simple change. For example, I hacked up a naive "shim" here for use in Dataflow with a custom container by just overwriting apache_beam/io/filesystem.py in site-packages. It's working (a) locally for both decompression and compression (though the output filenames are malformed: the part scheme follows the compression extension) and (b) in a DataflowRunner job reading a GCS dump of all the openwebtext .xz archives. (Without this I've been having a hell of a time getting any horizontal scaling while reading openwebtext.) It may be this simple, but I haven't run any Beam tests on these minor changes. I will probably do a bit more research into that myself.
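For context, here is a minimal sketch of the standard-library pieces such a shim would likely lean on. It uses only Python's lzma module and does not touch Beam; the helper names compress_chunks and decompress_in_pieces are hypothetical and are not taken from the linked shim or from apache_beam/io/filesystem.py. It only illustrates that lzma offers incremental compressor and decompressor objects, which is the kind of interface a streaming file wrapper needs.

```python
import lzma


def compress_chunks(chunks):
    """Compress an iterable of byte chunks into a single .xz stream.

    Hypothetical helper for illustration; not part of Beam.
    """
    compressor = lzma.LZMACompressor()  # defaults to the .xz container format
    out = [compressor.compress(chunk) for chunk in chunks]
    out.append(compressor.flush())  # flush() finalizes the stream
    return b"".join(out)


def decompress_in_pieces(data, piece_size=16):
    """Decompress data by feeding it to the decompressor in small pieces.

    This mimics reading a compressed file in fixed-size blocks.
    Hypothetical helper for illustration; not part of Beam.
    """
    decompressor = lzma.LZMADecompressor()  # auto-detects .xz and legacy .lzma
    out = []
    for start in range(0, len(data), piece_size):
        if decompressor.eof:
            break  # end of the compressed stream was already reached
        out.append(decompressor.decompress(data[start:start + piece_size]))
    return b"".join(out)


lines = [b"first line\n", b"second line\n", b"third line\n"]
archive = compress_chunks(lines)
restored = decompress_in_pieces(archive)
```

The round trip restores the original bytes, and the resulting archive is also readable with the one-shot lzma.decompress function. Whether Beam's file wrapper needs anything beyond this (for example, handling of concatenated streams) is not covered by this sketch.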

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
wrossmorrow added a commit to wrossmorrow/beam-pysdk-io-lzma that referenced this issue Feb 5, 2023
wrossmorrow added a commit to wrossmorrow/beam-pysdk-io-lzma that referenced this issue Feb 5, 2023
wrossmorrow added a commit to wrossmorrow/beam-pysdk-io-lzma that referenced this issue Feb 11, 2023
wrossmorrow added a commit to wrossmorrow/beam-pysdk-io-lzma that referenced this issue Feb 11, 2023
wrossmorrow added a commit to wrossmorrow/beam-pysdk-io-lzma that referenced this issue Feb 11, 2023
Abacn pushed a commit that referenced this issue Feb 13, 2023
* (#25316) Added naive first shot at enabling LZMA compression

* (#25316) Added a draft line to CHANGES.md

* (#25316) fix linter issues

* (#25316) update tests (draft)

* (#25316) import order in test file
@wrossmorrow
Contributor Author

Merged in #25317

@github-actions github-actions bot added this to the 2.46.0 Release milestone Feb 13, 2023
@damccorm damccorm added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Feb 14, 2023