Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] FlyteFile cannot be used with Annotated[FlyteFile, HashMethod(...)] #3424

Closed
2 tasks done
AdrianoKF opened this issue Mar 8, 2023 · 0 comments
Closed
2 tasks done
Labels
bug Something isn't working flytekit FlyteKit Python related issue

Comments

@AdrianoKF
Copy link
Contributor

Describe the bug

Original discussion on Slack: https://flyte-org.slack.com/archives/CP2HDHKE1/p1678183814829309

Using Annotated[FlyteFile, HashMethod(...)] for calculating a custom hash for a FlyteFile doesn't work out-of-the-box, failing an issubclass check in flytekit/types/file/file.py.

This makes it impossible to use custom hash functions with FlyteFile, e.g., to implement content-aware caching.

Note, that the same approach with Annotated[FlyteDirectory, HashMethod(...)] works as expected out of the box.

Example workflow code
import hashlib
from typing import Annotated

import pandas as pd
from flytekit import task, workflow
from flytekit.core.hash import HashMethod
from flytekit.types.file import FlyteFile


def calc_hash(f: FlyteFile) -> str:
    """Calculate SHA1 hash of a file"""

    h = hashlib.sha1(usedforsecurity=False)
    with open(f.path, "rb") as f:
        while chunk := f.read(4096):
            h.update(chunk)
    return str(h.hexdigest())


CachedFlyteFile = Annotated[FlyteFile, HashMethod(calc_hash)]


@task
def write_file() -> CachedFlyteFile:
    print("write_file")
    local_path = "data.parquet"

    df = pd.DataFrame(data={"a": [1, 2, 3], "b": [3, 4, 5]})
    df.to_parquet(local_path)

    return FlyteFile(local_path, remote_path=f"s3://test-repo/main/{local_path}")


@task(cache=True, cache_version="1")
def print_file(file: FlyteFile) -> None:
    file.download()
    print(pd.read_parquet(file))


@workflow
def wf() -> None:
    f = write_file()
    print_file(file=f)


if __name__ == "__main__":
    wf()
    wf()  # don't expect output from `print_file`, since it should be cached

Expected behavior

The first execution of wf should run both the write_file and print_file tasks, the second execution should only run write_file and hit the cache for print_file.

In reality, an exception is raised in flytekit/types/file/file.py, in to_literal():

TypeError: issubclass() arg 1 must be a class

Additional context to reproduce

See example workflow above.

I was able to solve the issue by inserting the following before the issubclass check in flytekit/types/file/file.py (I haven't run the flytekit test suite, so this might not be suitable for a mergeable PR just yet - should I submit a draft PR regardless?):

@@ -284,6 +288,10 @@
                 "None value cannot be converted to a file."
             )
 
+        # Handle Annotated[FlyteFile, ...] correctly by extracting the wrapped type
+        if issubclass(typing.get_origin(python_type), typing.Annotated):
+            python_type = typing.get_args(python_type)[0]
+
         if not (python_type is os.PathLike or issubclass(python_type, FlyteFile)):
             raise ValueError(
                 f"Incorrect type {python_type}, must be either a FlyteFile or os.PathLike"

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@AdrianoKF AdrianoKF added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Mar 8, 2023
@eapolinario eapolinario added flytekit FlyteKit Python related issue and removed untriaged This issues has not yet been looked at by the Maintainers labels Mar 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flytekit FlyteKit Python related issue
Projects
None yet
Development

No branches or pull requests

2 participants