vaex structured dataset and native types implementation #1230
Conversation
Thank you for opening this pull request! 🙌 These tips will help get your PR across the finish line:
@ryankarlos, thanks for creating the PR! We'll review it shortly. :)
@samhita-alla thanks - I'm new to flyte, so it's quite possible I've missed a few things.
@ryankarlos Thanks for your contribution.
Could you create a new flytekit-plugin for the Vaex dataframe? Here is an example.
@pingsutw Thanks, I have now added a plugin for vaex. However, when I try this out by running a simple workflow locally, I get an error and I'm not sure how to fix it.
I have already registered the encoding and decoding handlers, so I'm not sure why it is complaining.
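For context on that error, flytekit resolves StructuredDataset encoders/decoders from a registry keyed by the dataframe type and file format; if the type is registered under a different name than the one looked up, the handler is not found. Here is a minimal pure-Python sketch of that dispatch. All names below (`VaexParquetEncoder`, `_ENCODERS`, `register_encoder`, `find_encoder`) are illustrative stand-ins, not flytekit's actual internals:

```python
# Conceptual sketch of a StructuredDataset handler registry.
# Names are illustrative; flytekit's real implementation lives in
# flytekit.types.structured.structured_dataset.
PARQUET = "parquet"

class VaexParquetEncoder:
    """Stand-in for a handler such as a vaex-to-parquet encoding handler."""
    def encode(self, df):
        return f"wrote {df!r} as {PARQUET}"

_ENCODERS = {}  # (python_type_name, fmt) -> handler instance

def register_encoder(python_type_name, fmt, handler):
    _ENCODERS[(python_type_name, fmt)] = handler

def find_encoder(python_type_name, fmt):
    try:
        return _ENCODERS[(python_type_name, fmt)]
    except KeyError:
        # Mirrors the "no handler found" failure mode: the lookup key
        # must match the registered type name exactly.
        raise ValueError(f"no encoder registered for {python_type_name}/{fmt}")

register_encoder("vaex.dataframe.DataFrameLocal", PARQUET, VaexParquetEncoder())
encoder = find_encoder("vaex.dataframe.DataFrameLocal", PARQUET)
print(type(encoder).__name__)  # VaexParquetEncoder
```

Registering under `vaex.dataframe.DataFrameLocal` but annotating tasks with a different type would reproduce exactly this kind of lookup failure.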
Codecov Report
@@ Coverage Diff @@
## master #1230 +/- ##
==========================================
+ Coverage 68.57% 68.65% +0.07%
==========================================
Files 288 288
Lines 26224 26351 +127
Branches 2929 2489 -440
==========================================
+ Hits 17984 18092 +108
- Misses 7762 7779 +17
- Partials 478 480 +2
@ryankarlos, it seems like Vaex dataframe type is
Ah yes, thanks - I've fixed it now.
Thank you @ryankarlos. LGTM
nit: The test failed because the plugin name is inconsistent
plugins/flytekit-vaex/setup.py (outdated):

PLUGIN_NAME = "vaex"
microlib_name = f"plugins-{PLUGIN_NAME}"
Suggested change:
- microlib_name = f"plugins-{PLUGIN_NAME}"
+ microlib_name = f"flytekitplugins-{PLUGIN_NAME}"
thanks, updated now
Signed-off-by: Ryan Nazareth <ryankarlos@gmail.com>
from flytekit import kwtypes, task, workflow
from flytekit.types.structured.structured_dataset import PARQUET, StructuredDataset

full_schema = Annotated[StructuredDataset, kwtypes(x=int, y=str), PARQUET]
Just an observation: I initially assumed that if the column-types metadata was skipped it would still be ok, since there would still be two arguments to Annotated:
full_schema = Annotated[StructuredDataset, PARQUET]
but if I do that, I get the following error when running the test.
cc: @pingsutw
hmm, I fetched your commit and reran the test (updated to Annotated[StructuredDataset, PARQUET]) but didn't get the error. Let's wait and see if CI passes.
> hmm, I fetched your commit and reran the test (updated to Annotated[StructuredDataset, PARQUET]) but didn't get the error. Let's wait and see if CI passes.

thanks, the test that failed in CI is a different one (unrelated to this PR)
Signed-off-by: Ryan Nazareth <ryankarlos@gmail.com>
Signed-off-by: Ryan Nazareth <ryankarlos@gmail.com>
@samhita-alla pushed the requested changes now.
@pingsutw, +1 again, please?
path = ctx.file_access.get_random_remote_directory()
local_dir = ctx.file_access.get_random_local_directory()
local_path = os.path.join(local_dir, f"{0:05}")
df.export_parquet(local_path)
Apologies for reviewing it late! As per their docs, HDF5 is the most suitable when the data is huge: https://vaex.readthedocs.io/en/docs/example_io.html#id1. We can go with Parquet, just want to give a heads-up.
yup, we can register another handler (VaexDataFrameToHDF5EncodingHandler), so people can use Annotated to update the default format. We can add it in a separate PR:
def t1() -> Annotated[StructuredDataset, HDF5]
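Under the hood, the format is just metadata carried by Annotated, which flytekit can read back to pick the matching handler. A small stdlib-only illustration of how that metadata is extracted; the `StructuredDataset` class and `HDF5` constant below are placeholders, not flytekit's real objects:

```python
from typing import Annotated, get_args

# Placeholders for the sketch -- flytekit's real StructuredDataset and
# format constants live in flytekit.types.structured.structured_dataset.
class StructuredDataset:
    pass

HDF5 = "hdf5"

annotated = Annotated[StructuredDataset, HDF5]

# get_args splits an Annotated type into (base_type, *metadata), which is
# how a type engine can recover the requested storage format.
base, *metadata = get_args(annotated)
print(base.__name__, metadata)  # StructuredDataset ['hdf5']
```

So a signature like `def t1() -> Annotated[StructuredDataset, HDF5]` carries enough information to route encoding to an HDF5 handler instead of the default.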
Arrow may also be useful to add support for (https://vaex.readthedocs.io/en/latest/faq.html) - what are your thoughts? I'm happy to implement and register extra handlers in a separate PR if that's ok?
That's awesome, works for me! Please feel free to create issues accordingly and let me know! :)
) -> vaex.dataframe.DataFrameLocal:
    local_dir = ctx.file_access.get_random_local_directory()
    ctx.file_access.get_data(flyte_value.uri, local_dir, is_multipart=True)
    path = f"{local_dir}/00000"
I'm wondering if it's okay to consider only the first partition when the dataframe is huge. How about we consider all the files present under the parquet directory using vaex.open or vaex.open_many? I think you can use a glob pattern.
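The glob-based approach suggested above can be sketched with the stdlib alone. The zero-padded part-file naming below is an assumption modelled on the `00000` pattern used in this PR, and the directory layout simulates what a multipart StructuredDataset download might look like:

```python
import glob
import os
import tempfile

# Simulate a multipart download directory containing zero-padded part files.
local_dir = tempfile.mkdtemp()
for i in range(3):
    with open(os.path.join(local_dir, f"{i:05}"), "w") as f:
        f.write("parquet bytes would go here")

# Instead of hard-coding f"{local_dir}/00000", match every part file;
# the resulting list is what vaex.open / vaex.open_many could be fed.
parts = sorted(glob.glob(os.path.join(local_dir, "*")))
print([os.path.basename(p) for p in parts])  # ['00000', '00001', '00002']
```

Sorting the matches keeps the parts in partition order regardless of the order the filesystem returns them.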
@samhita-alla @pingsutw Thanks for spotting this. Actually, looking at this more closely, I think I may need to use df.export_many (https://vaex.io/docs/guides/io.html#Export-to-multiple-files-in-parallel) if we want to serialise chunks to multiple parts in parallel. From the docs:

With the export_many method one can export a DataFrame to multiple files of the same type in parallel. This is likely to be more performant when exporting very large DataFrames to the cloud compared to writing a single large Arrow or Parquet file, where each chunk is written in succession.

What I implemented writes chunks serially to a single parquet file (default chunk size 1048576), according to this section of the docs (https://vaex.io/docs/guides/io.html#Exporting-binary-file-formats):

When exporting to the Apache Arrow and Apache Parquet file formats, the data is written in chunks, enabling export of data that does not fit in RAM all at once. A custom chunk size can be specified via the chunk_size argument, the default value of which is 1048576.

Do we want to support one or both options, and do we want to give the user the option to override the chunk size? Accordingly, we can use vaex.open or vaex.open_many for single or multiple parts in the decoding step, as you suggested. Also, do you think it's worth adding an extra workflow test for a dataframe larger than the chunk size limit, to ascertain this behaviour for whichever options we implement?
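Sizing such a test is simple arithmetic: with vaex's documented default chunk_size of 1048576, a serial export of n rows is written in ceil(n / chunk_size) chunks. A tiny sketch (the helper `num_chunks` is illustrative, not a vaex API):

```python
import math

DEFAULT_CHUNK_SIZE = 1048576  # vaex's documented default chunk_size

def num_chunks(n_rows, chunk_size=DEFAULT_CHUNK_SIZE):
    """How many chunks a serial export would write for n_rows rows."""
    return math.ceil(n_rows / chunk_size)

# A workflow test that wants to exercise multi-chunk behaviour needs just
# over one chunk's worth of rows:
print(num_chunks(1048576))  # 1
print(num_chunks(1048577))  # 2
```

So a test dataframe of 1048577 rows would be the smallest input that crosses the default chunk boundary.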
Also, I've been trying to mimic the polars implementation - any idea why the 00000 suffix is in the path, and how is this being split into multiple parts? (The polars implementation seems to write to a single parquet file: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html)

flytekit/plugins/flytekit-polars/flytekitplugins/polars/sd_transformers.py, lines 66 to 67 in caf612d:

ctx.file_access.get_data(flyte_value.uri, local_dir, is_multipart=True)
path = f"{local_dir}/00000"
@ryankarlos, thanks for doing the research! I think 00000 isn't a partition after all. It's right there in your code where you create a local path with local_path = os.path.join(local_dir, f"{0:05}"). So it's just a file. 😅
Would you mind creating an issue to support writing large dataframes using export_many? This isn't required now, but you or someone else can implement it later.
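The point above is easy to verify: f"{0:05}" is just the integer 0 formatted with zero-padding to width 5, so the resulting path names exactly one file. The directory name below is illustrative:

```python
import os

local_dir = "/tmp/flyte-sd"  # illustrative directory name
local_path = os.path.join(local_dir, f"{0:05}")
print(local_path)  # /tmp/flyte-sd/00000 on POSIX

# The same format spec would enumerate real partitions if there were several:
print([f"{i:05}" for i in range(3)])  # ['00000', '00001', '00002']
```

So the `00000` in both the polars and vaex transformers is a fixed filename for a single part, not evidence of multipart splitting.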
Congrats on merging your first pull request! 🎉
* vaex structured dataset and native types implementation
* Create new plugin for vaex
* fix vaex type to DataFrameLocal and add reqs
* fix tests
* add flytekit-vaex to python build
* correct microlib name
* fix test
* run pip-compile again
* small fixes

Signed-off-by: Ryan Nazareth <ryankarlos@gmail.com>
Signed-off-by: Vivek Praharsha <vpraharsha@outlook.com>
Vaex has great performance on a single machine, which is sufficient for most datasets. This PR adds support for Vaex as a pandas alternative for the StructuredDataset object type.
We extend StructuredDatasetDecoder and StructuredDatasetEncoder for vaex, as in https://docs.flyte.org/projects/cookbook/en/latest/auto/core/type_system/structured_dataset.html
This PR implements automatic serialization and deserialization between consecutive tasks using Parquet, but it could be extended to Arrow, HDF5, or the other binary formats supported by vaex: https://vaex.readthedocs.io/en/latest/guides/io.html
Type
Are all requirements met?
Complete description
Added support for the Vaex DataFrame as a type
Vaex StructuredDataset Encoder and Decoder for serialisation and deserialisation
Tracking Issue
Fixes flyteorg/flyte#701
Follow-up issue
NA