
vaex structured dataset and native types implementation #1230

Merged
9 commits merged into flyteorg:master on Oct 28, 2022

Conversation

ryankarlos
Contributor

@ryankarlos ryankarlos commented Oct 9, 2022

Vaex has great performance on a single machine, which is enough for most datasets. This PR adds support for Vaex as a pandas alternative for the StructuredDataset object type.
We extend StructuredDatasetDecoder and StructuredDatasetEncoder for Vaex, as described in https://docs.flyte.org/projects/cookbook/en/latest/auto/core/type_system/structured_dataset.html

This PR implements automatic serialization and deserialization between consecutive tasks using Parquet, but it could be extended to Arrow, HDF5, or the other binary formats supported by Vaex: https://vaex.readthedocs.io/en/latest/guides/io.html
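The parquet hand-off between tasks that the encoder and decoder implement can be sketched independently of flytekit. The helper below is a hypothetical illustration, not plugin code: the `opener` parameter and the `00000.parquet` filename are assumptions, and `df` only needs to expose vaex's `export_parquet` method.

```python
import os


def roundtrip_via_parquet(df, local_dir, opener=None):
    """Hypothetical helper mirroring the encoder/decoder hand-off.

    `df` is assumed to expose vaex's export_parquet(path); `opener`
    defaults to vaex.open (imported lazily, since vaex is a plugin
    dependency).
    """
    if opener is None:
        import vaex
        opener = vaex.open
    path = os.path.join(local_dir, "00000.parquet")
    df.export_parquet(path)  # encoder side: serialise the frame to disk
    return opener(path)      # decoder side: load it back for the next task
```

Separating the `opener` this way keeps the write/read convention (one agreed-upon path) in a single place, which is essentially what the transformer engine enforces between tasks.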

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

Added support for the Vaex DataFrame as a type
Added a Vaex StructuredDataset encoder and decoder for serialisation and deserialisation

Tracking Issue

Fixes flyteorg/flyte#701

Follow-up issue

NA

@welcome

welcome bot commented Oct 9, 2022

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if yours does, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@samhita-alla
Contributor

@ryankarlos, thanks for creating the PR! We'll review it shortly. :)

@ryankarlos
Contributor Author

@samhita-alla thanks - I'm new to Flyte, so it's quite possible I've missed a few things.

Member

@pingsutw pingsutw left a comment


@ryankarlos Thanks for your contribution.
Could you create a new flytekit-plugin for the Vaex dataframe? Here is an example.

@ryankarlos
Contributor Author

ryankarlos commented Oct 15, 2022

@pingsutw Thanks, I have now added a plugin for Vaex.

However, when I try to verify this works by running a simple workflow locally, I get an error and am not sure how to fix it:

import pandas as pd
import vaex
from typing_extensions import Annotated

from flytekit import kwtypes, task, workflow
from flytekit.types.structured.structured_dataset import (
    PARQUET,
    StructuredDataset,
    StructuredDatasetTransformerEngine,
)

# Handler and renderer classes are the ones defined in this plugin
StructuredDatasetTransformerEngine.register(VaexDataFrameToParquetEncodingHandlers())
StructuredDatasetTransformerEngine.register(ParquetToVaexDataFrameDecodingHandler())
StructuredDatasetTransformerEngine.register_renderer(vaex.DataFrame, VaexDataFrameRenderer())

subset_schema = Annotated[StructuredDataset, kwtypes(col2=str), PARQUET]

@task
def generate() -> subset_schema:
    pd_df = pd.DataFrame({"col1": [1, 3, 2], "col2": list("abc")})
    vaex_df = vaex.from_pandas(pd_df)
    return StructuredDataset(dataframe=vaex_df)

@task
def consume(df: subset_schema) -> subset_schema:
    df = df.open(vaex.DataFrame).all()
    assert df["col2"][0] == "a"
    assert df["col2"][1] == "b"
    assert df["col2"][2] == "c"
    return StructuredDataset(dataframe=df)

@workflow
def wf():
    consume(df=generate())

if __name__ == "__main__":
    wf()

I have already registered the encoding and decoding handlers, so I'm not sure why it is complaining:

TypeError: Failed to convert return value for var o0 for function generate with error 
<class 'ValueError'>: Failed to find a handler for <class 'vaex.dataframe.DataFrameLocal'>, 
protocol file, fmt parquet
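One way to see why this error occurs: handler registries of this kind are typically keyed by the concrete dataframe class. The sketch below loosely mimics such a lookup (it is an illustration, not flytekit's actual implementation); a handler registered on something that is not in the class hierarchy of `DataFrameLocal` never matches, while one registered on `DataFrameLocal` itself, or a base class of it, would.

```python
def find_handler(df_type, registry):
    """Loose sketch of a structured-dataset handler lookup (illustration
    only, not flytekit's code). Handlers are keyed by the dataframe
    class; walking the MRO lets a handler registered on a base class
    still match subclasses such as vaex's DataFrameLocal."""
    for cls in df_type.__mro__:
        if cls in registry:
            return registry[cls]
    raise ValueError(f"Failed to find a handler for {df_type}")
```

Registering the encoder against the wrong key therefore produces exactly a "Failed to find a handler" error at return-value conversion time.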

@ryankarlos ryankarlos force-pushed the flyte-vaex-plugin branch 4 times, most recently from f2c3003 to a2cfbed on October 16, 2022
@codecov

codecov bot commented Oct 16, 2022

Codecov Report

Merging #1230 (9d174f4) into master (63ad4fc) will increase coverage by 0.07%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1230      +/-   ##
==========================================
+ Coverage   68.57%   68.65%   +0.07%     
==========================================
  Files         288      288              
  Lines       26224    26351     +127     
  Branches     2929     2489     -440     
==========================================
+ Hits        17984    18092     +108     
- Misses       7762     7779      +17     
- Partials      478      480       +2     
Impacted Files Coverage Δ
flytekit/deck/deck.py 34.04% <0.00%> (-4.26%) ⬇️
flytekit/clis/sdk_in_container/register.py 79.68% <0.00%> (-3.08%) ⬇️
flytekit/types/structured/structured_dataset.py 60.74% <0.00%> (-2.58%) ⬇️
flytekit/types/directory/types.py 54.16% <0.00%> (-0.84%) ⬇️
...ctured_dataset/test_structured_dataset_workflow.py 99.24% <0.00%> (-0.76%) ⬇️
flytekit/core/type_engine.py 58.89% <0.00%> (-0.50%) ⬇️
flytekit/core/local_cache.py 46.66% <0.00%> (-0.40%) ⬇️
flytekit/clis/sdk_in_container/helpers.py 92.59% <0.00%> (-0.27%) ⬇️
flytekit/clis/sdk_in_container/run.py 84.15% <0.00%> (-0.04%) ⬇️
plugins/setup.py 0.00% <0.00%> (ø)
... and 22 more


@samhita-alla
Contributor

@ryankarlos, it seems like the Vaex dataframe type is vaex.dataframe.DataFrameLocal rather than vaex.dataframe. Your transformer handles the latter, not the former. Can you re-verify what the type of a Vaex dataframe is?

@ryankarlos
Contributor Author

> @ryankarlos, it seems like the Vaex dataframe type is vaex.dataframe.DataFrameLocal rather than vaex.dataframe. Your transformer handles the latter, not the former. Can you re-verify what the type of a Vaex dataframe is?

Ah yes, thanks - I've fixed it now.

@pingsutw
Member

Thank you @ryankarlos. LGTM

Member

@pingsutw pingsutw left a comment


nit: The test failed because the plugin name is inconsistent


PLUGIN_NAME = "vaex"

microlib_name = f"plugins-{PLUGIN_NAME}"
Member

Suggested change
microlib_name = f"plugins-{PLUGIN_NAME}"
microlib_name = f"flytekitplugins-{PLUGIN_NAME}"

Contributor Author

@ryankarlos ryankarlos Oct 19, 2022

thanks, updated now

Signed-off-by: Ryan Nazareth <ryankarlos@gmail.com>
from typing_extensions import Annotated

from flytekit import kwtypes, task, workflow
from flytekit.types.structured.structured_dataset import PARQUET, StructuredDataset

full_schema = Annotated[StructuredDataset, kwtypes(x=int, y=str), PARQUET]
Contributor Author

@ryankarlos ryankarlos Oct 19, 2022

Just an observation:
I initially assumed that if the column-types metadata was skipped it would still be OK, since there are still two arguments to Annotated:

full_schema = Annotated[StructuredDataset, PARQUET]

but if I do that, I get the following error when running the test:

(screenshot of the error, 2022-10-19)

Contributor

cc: @pingsutw

Member

Hmm, I fetched your commit and reran the test (updated to Annotated[StructuredDataset, PARQUET]) but didn't get the error. Let's wait and see if CI passes.

Contributor Author

> Hmm, I fetched your commit and reran the test (updated to Annotated[StructuredDataset, PARQUET]) but didn't get the error. Let's wait and see if CI passes.

Thanks - the test that failed in CI is a different one, unrelated to this PR.

pingsutw
pingsutw previously approved these changes Oct 24, 2022
@ryankarlos
Contributor Author

@samhita-alla I've pushed the requested changes now.

@samhita-alla
Contributor

@pingsutw, +1 again, please?

path = ctx.file_access.get_random_remote_directory()
local_dir = ctx.file_access.get_random_local_directory()
local_path = os.path.join(local_dir, f"{0:05}")
df.export_parquet(local_path)
Contributor

Apologies for reviewing it late! As per their docs, HDF5 is the most suitable when the data is huge: https://vaex.readthedocs.io/en/docs/example_io.html#id1. We can go with Parquet, just want to give a heads-up.

Member

Yup, we can register another handler (VaexDataFrameToHDF5EncodingHandler) so people can use Annotated to update the default format. We can add it in a separate PR:

def t1() -> Annotated[StructuredDataset, HDF5]: ...

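A minimal sketch of what that HDF5 encoder's body might look like, assuming the handler name suggested above and vaex's `export_hdf5` method; the flytekit wiring (base class, literal construction) is omitted, and the zero-padded part name mirrors the parquet handler in this PR.

```python
import os


def encode_vaex_to_hdf5(df, local_dir):
    """Sketch of the body an HDF5 encoder might have; the surrounding
    flytekit wiring is left out, and `df` is assumed to expose vaex's
    export_hdf5(path)."""
    # Same zero-padded part naming as the parquet handler in this PR
    local_path = os.path.join(local_dir, f"{0:05}")
    df.export_hdf5(local_path)  # vaex writes the frame as HDF5
    return local_path
```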
Contributor Author

@ryankarlos ryankarlos Oct 28, 2022

Arrow may also be useful to add support for (https://vaex.readthedocs.io/en/latest/faq.html) - what are your thoughts? I'm happy to implement and register extra handlers in a separate PR if that's OK.

Contributor

That's awesome, works for me! Please feel free to create issues accordingly and let me know! :)

) -> vaex.dataframe.DataFrameLocal:
local_dir = ctx.file_access.get_random_local_directory()
ctx.file_access.get_data(flyte_value.uri, local_dir, is_multipart=True)
path = f"{local_dir}/00000"
Contributor

I'm wondering if it's okay to consider only the first partition if the dataframe is huge. How about we consider all files present under the parquet directory using vaex.open or vaex.open_many? I think you can use a glob pattern.
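A sketch of that suggestion, assuming vaex.open accepts a glob pattern (per the vaex I/O docs) and that the part files sit directly under the downloaded directory; the `opener` parameter is an illustrative assumption, not plugin code.

```python
import os


def open_all_parts(local_dir, opener=None):
    """Sketch of reading every part under the downloaded directory
    instead of hard-coding "00000". `opener` defaults to vaex.open,
    which accepts glob patterns per the vaex I/O docs."""
    if opener is None:
        import vaex  # lazy import; vaex is a plugin dependency
        opener = vaex.open
    return opener(os.path.join(local_dir, "*"))
```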

Contributor Author

@ryankarlos ryankarlos Oct 28, 2022

@samhita-alla @pingsutw Thanks for spotting this. Actually, looking at this more closely, I think I may need to use df.export_many (https://vaex.io/docs/guides/io.html#Export-to-multiple-files-in-parallel) if we want to serialise chunks to multiple parts in parallel.

From the docs (https://vaex.io/docs/guides/io.html#Export-to-multiple-files-in-parallel):

> With the export_many method one can export a DataFrame to multiple files of the same type in parallel. This is likely to be more performant when exporting very large DataFrames to the cloud compared to writing a single large Arrow or Parquet file, where each chunk is written in succession.

What I implemented writes chunks serially to a single parquet file, it seems, according to the docs (default chunk size 1048576). Quoting from https://vaex.io/docs/guides/io.html#Exporting-binary-file-formats:

> When exporting to the Apache Arrow and Apache Parquet file formats, the data is written in chunks, thus enabling export of data that does not fit in RAM all at once. A custom chunk size can be specified via the chunk_size argument, the default value of which is 1048576.

Do we want to support one or both options, and do we want to give the user the option to override the chunk size? Accordingly, we could use vaex.open or vaex.open_many for single or multiple parts, as you suggested, in the decoding step. Also, do you think it's worth adding an extra workflow test for a dataframe larger than the chunk-size limit, to ascertain this behaviour for whichever options we implement?
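A sketch of what that export_many path could look like; the `part-{i:05}.parquet` filename template is an assumption, relying only on the template-expansion convention described in the vaex I/O docs (a format field such as `{i}` becomes the chunk index).

```python
import os


def export_in_parts(df, local_dir):
    """Sketch of the parallel multi-file export discussed above.
    export_many expands a format field like {i:05} into one file per
    chunk; the exact filename template here is an assumption."""
    template = os.path.join(local_dir, "part-{i:05}.parquet")
    df.export_many(template)  # vaex writes the chunks to separate files
    return template
```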

Contributor Author

@ryankarlos ryankarlos Oct 28, 2022

Also, I've been trying to mimic the polars implementation - any idea why there's a 00000 suffix in the path, and how this is being split into multiple parts? (The polars implementation seems to write a single parquet file: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.write_parquet.html)

ctx.file_access.get_data(flyte_value.uri, local_dir, is_multipart=True)
path = f"{local_dir}/00000"

Contributor

@ryankarlos, thanks for doing the research! I think 00000 isn't a partition after all. It comes straight from your code, where you create the local path with local_path = os.path.join(local_dir, f"{0:05}"). So it's just a file. 😅

Would you mind creating an issue to support writing large dataframes using export_many? This isn't required now but you or someone else can implement it later.
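The zero-padding in question can be checked directly: Python's format spec `05` pads the integer 0 to a width of five with leading zeros.

```python
# f"{0:05}" zero-pads the integer 0 to a width of 5, so the encoder's
# os.path.join(local_dir, f"{0:05}") and the decoder's
# f"{local_dir}/00000" point at the same single file.
part_name = f"{0:05}"
print(part_name)  # 00000
```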

@samhita-alla samhita-alla merged commit b1ff43e into flyteorg:master Oct 28, 2022
@welcome

welcome bot commented Oct 28, 2022

Congrats on merging your first pull request! 🎉

VPraharsha03 pushed a commit to VPraharsha03/flytekit that referenced this pull request Oct 29, 2022
* vaex structured dataset and native types implementation
* Create new plugin for vaex
* fix vaex type to DataFrameLocal and add reqs
* fix tests
* add flytekit-vaex to python build
* correct microlib name
* fix test
* run pip-compile again
* small fixes

Signed-off-by: Ryan Nazareth <ryankarlos@gmail.com>
Signed-off-by: Vivek Praharsha <vpraharsha@outlook.com>
Development

Successfully merging this pull request may close these issues.

[Feature][Flytekit Schema type extension] Vaex Dataframe plugin
3 participants