Motivation: Why do you think this is important?
Currently, if users use a pandas.DataFrame, a pyspark DataFrame, or a pandera.DataFrameSchema, Flytekit simply extracts the data from the transport LiteralType.Schema. Consider a task whose input is a FlyteSchema: that task will be cached, because the input literal carries the file path, which is stable, and hence the function is not re-run.
Now consider a task that accepts a pandas.DataFrame directly. This task will not be cached. The dataframe is downloaded and then re-uploaded, because the underlying type transforms are not aware that the passed dataframe was not mutated. If FlyteSchema is used instead, caching works fine.
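The cache-hit vs. cache-miss behavior can be modeled with a toy stdlib sketch (this is not Flyte's actual implementation; the function and URIs are hypothetical): a cache keyed on the literal representation of the inputs hits when that representation is a stable reference (a file path, as with FlyteSchema), and misses when the value is re-uploaded to a fresh location on every call (as with a raw pandas.DataFrame).

```python
import hashlib

def cache_key(task_name: str, literal_inputs: dict) -> str:
    """Toy cache key: hash of the task name plus each input's literal representation."""
    blob = task_name + "|" + "|".join(f"{k}={v}" for k, v in sorted(literal_inputs.items()))
    return hashlib.sha256(blob.encode()).hexdigest()

# FlyteSchema-style input: passed by reference, so the literal is a stable URI.
hit_a = cache_key("train", {"df": "s3://bucket/data/00000"})
hit_b = cache_key("train", {"df": "s3://bucket/data/00000"})
assert hit_a == hit_b  # same path, same key -> cache hit

# Raw dataframe input: re-uploaded on every call, so the literal is a fresh URI.
miss_a = cache_key("train", {"df": "s3://bucket/tmp/upload-0001"})
miss_b = cache_key("train", {"df": "s3://bucket/tmp/upload-0002"})
assert miss_a != miss_b  # new upload path -> cache miss, even for identical bytes
```

This is why the miss happens even when the dataframe's contents are unchanged: the cache never sees the bytes, only the (fresh) upload location.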
Goal: What should the final outcome look like, ideally?
Either this should work as the user expects (i.e., a cache hit), or this behavior should be documented.
Describe alternatives you've considered
NA
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
kumare3 added the enhancement and untriaged labels and removed the untriaged label on Oct 7, 2021.
I'm facing a similar dataframe caching issue in the weather forecasting project. I'm using a dynamic workflow to manage the training of a model and the tasks within it rely on training data in the form of a dataframe. Should I use FlyteSchema for this instead?
Solution proposal: caching for complex data types, e.g. dataframes
For blob and schema types
expose a hash method on TypeTransformer, implementing a cache-by-value system that we would maintain. Users who define custom type transformers can implement a hash method too. This would enable default caching for common types like pandas dataframes.
in cases where a hashing function is not provided for a particular type (e.g. new_pandas_like_library.DataFrame), introduce a cache_output_fn (naming TBD) argument on the @task decorator, like so:
@task(
    cache=True,
    cache_output_fn={pd.DataFrame: lambda x: hash(x)},
    cache_version=CACHE_VERSION,
)
def func(x: int) -> pd.DataFrame:
    df = ...  # get a dataframe
    return df
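The two mechanisms above could compose roughly as follows. This is a minimal stdlib sketch, not the proposed Flytekit code: the class names, the resolve_output_hash helper, and the repr-based hashing stand-in are all hypothetical. The idea is that a per-task cache_output_fn override wins, then the transformer's own hash method, and a None result falls back to today's reference-based caching.

```python
import hashlib
from typing import Any, Callable, Dict, Optional, Type

class TypeTransformer:
    """Hypothetical base transformer. Subclasses opt into cache-by-value
    by overriding hash(); returning None keeps reference-based caching."""
    def hash(self, value: Any) -> Optional[str]:
        return None

class PandasLikeTransformer(TypeTransformer):
    """Hypothetical transformer that ships a default value hash."""
    def hash(self, value: Any) -> str:
        # Stand-in for hashing the dataframe's serialized bytes.
        return hashlib.sha256(repr(value).encode()).hexdigest()

def resolve_output_hash(
    value: Any,
    transformer: TypeTransformer,
    cache_output_fn: Optional[Dict[Type, Callable[[Any], Any]]] = None,
) -> Optional[Any]:
    """Prefer the per-task cache_output_fn override, then the transformer's
    hash method; None means no value-based hash is available."""
    if cache_output_fn is not None and type(value) in cache_output_fn:
        return cache_output_fn[type(value)](value)
    return transformer.hash(value)

# No hash anywhere -> fall back to reference-based caching.
assert resolve_output_hash(42, TypeTransformer()) is None
# Per-task override covers a type the transformer doesn't hash.
assert resolve_output_hash(42, TypeTransformer(), {int: str}) == "42"
# A transformer-provided hash works with no per-task configuration.
assert resolve_output_hash("df", PandasLikeTransformer()) is not None
```

A design note on this sketch: keeping the transformer hash as the default and the decorator argument as an escape hatch means common types (pandas) cache by value out of the box, while new or third-party types still have a path to correct caching without waiting for a transformer change.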
Update:
Consider a mechanism that uses s3 etag (or whatever the blob storage equivalent is) as the hash metadata for a particular reference-based data artifact (dataframe, files, etc). This offloads the burden of computing that hash to blob storage instead of manually computing the hash in the container.