Motivation: Why do you think this is important?
Currently, if users use a pandas.DataFrame, a pyspark DataFrame, or a pandera.DataFrameSchema, Flytekit simply extracts the data from the transport LiteralType.Schema. Consider a task whose input is a FlyteSchema: that task will be cached, because the input literal carries the file path, which is stable, and hence the function is not re-run.
Now consider a task that accepts a pandas.DataFrame directly. This task will not be cached. The dataframe is downloaded and then re-uploaded, because the underlying type transforms are not aware that the passed dataframe was not mutated. If FlyteSchema is used instead, caching works fine.
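The cache-hit vs. cache-miss behavior can be modeled with a toy stdlib sketch (this is not Flyte's actual implementation; the function and URIs are hypothetical): a cache keyed on the literal representation of the inputs hits when that representation is a stable reference (a file path, as with FlyteSchema), and misses when the value is re-uploaded to a fresh location on every call (as with a raw pandas.DataFrame).

```python
import hashlib

def cache_key(task_name: str, literal_inputs: dict) -> str:
    """Toy cache key: hash of the task name plus each input's literal representation."""
    blob = task_name + "|" + "|".join(f"{k}={v}" for k, v in sorted(literal_inputs.items()))
    return hashlib.sha256(blob.encode()).hexdigest()

# FlyteSchema-style input: passed by reference, so the literal is a stable URI.
hit_a = cache_key("train", {"df": "s3://bucket/data/00000"})
hit_b = cache_key("train", {"df": "s3://bucket/data/00000"})
assert hit_a == hit_b  # same path, same key -> cache hit

# Raw dataframe input: re-uploaded on every call, so the literal is a fresh URI.
miss_a = cache_key("train", {"df": "s3://bucket/tmp/upload-0001"})
miss_b = cache_key("train", {"df": "s3://bucket/tmp/upload-0002"})
assert miss_a != miss_b  # new upload path -> cache miss, even for identical bytes
```

This is why the miss happens even when the dataframe's contents are unchanged: the cache never sees the bytes, only the (fresh) upload location.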
Goal: What should the final outcome look like, ideally?
Either this should work as the user expects (i.e., a cache hit), or this behavior should be documented.
Describe alternatives you've considered
NA
Propose: Link/Inline OR Additional context
No response
Are you sure this issue hasn't been raised already?
kumare3 added the enhancement and untriaged labels and removed the untriaged label on Oct 7, 2021.
I'm facing a similar dataframe caching issue in the weather forecasting project. I'm using a dynamic workflow to manage the training of a model and the tasks within it rely on training data in the form of a dataframe. Should I use FlyteSchema for this instead?
Solution proposal: caching for complex data types, e.g. dataframes
For blob and schema types
expose a hash method on TypeTransformer, implementing a cache-by-value system that we would maintain. Users who define custom type transformers can implement a hash method too. This would enable default caching for common types like pandas dataframes.
in cases where a hashing function is not provided for a particular type (e.g. new_pandas_like_library.DataFrame), introduce a cache_output_fn (naming TBD) argument on the @task decorator, like so:
@task(
    cache=True,
    cache_output_fn={pd.DataFrame: lambda x: hash(x)},
    cache_version=CACHE_VERSION,
)
def func(x: int) -> pd.DataFrame:
    df = ...  # get a dataframe
    return df
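The two mechanisms above could compose roughly as follows. This is a minimal stdlib sketch, not the proposed Flytekit code: the class names, the resolve_output_hash helper, and the repr-based hashing stand-in are all hypothetical. The idea is that a per-task cache_output_fn override wins, then the transformer's own hash method, and a None result falls back to today's reference-based caching.

```python
import hashlib
from typing import Any, Callable, Dict, Optional, Type

class TypeTransformer:
    """Hypothetical base transformer. Subclasses opt into cache-by-value
    by overriding hash(); returning None keeps reference-based caching."""
    def hash(self, value: Any) -> Optional[str]:
        return None

class PandasLikeTransformer(TypeTransformer):
    """Hypothetical transformer that ships a default value hash."""
    def hash(self, value: Any) -> str:
        # Stand-in for hashing the dataframe's serialized bytes.
        return hashlib.sha256(repr(value).encode()).hexdigest()

def resolve_output_hash(
    value: Any,
    transformer: TypeTransformer,
    cache_output_fn: Optional[Dict[Type, Callable[[Any], Any]]] = None,
) -> Optional[Any]:
    """Prefer the per-task cache_output_fn override, then the transformer's
    hash method; None means no value-based hash is available."""
    if cache_output_fn is not None and type(value) in cache_output_fn:
        return cache_output_fn[type(value)](value)
    return transformer.hash(value)

# No hash anywhere -> fall back to reference-based caching.
assert resolve_output_hash(42, TypeTransformer()) is None
# Per-task override covers a type the transformer doesn't hash.
assert resolve_output_hash(42, TypeTransformer(), {int: str}) == "42"
# A transformer-provided hash works with no per-task configuration.
assert resolve_output_hash("df", PandasLikeTransformer()) is not None
```

A design note on this sketch: keeping the transformer hash as the default and the decorator argument as an escape hatch means common types (pandas) cache by value out of the box, while new or third-party types still have a path to correct caching without waiting for a transformer change.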
Update:
Consider a mechanism that uses s3 etag (or whatever the blob storage equivalent is) as the hash metadata for a particular reference-based data artifact (dataframe, files, etc). This offloads the burden of computing that hash to blob storage instead of manually computing the hash in the container.