Extend flytekit version hash calculation to be pluggable #2428

Open
ddl-rliu wants to merge 2 commits into master from rliu.extend-flytekit-pluggable


@ddl-rliu ddl-rliu commented May 16, 2024

Tracking issue

Closes flyteorg/flyte#5364

Why are the changes needed?

Extends https://github.com/flyteorg/flytekit/pull/2039/files, which provides the minimum API surface for configuration via FlytekitPlugin.

This PR makes pyflyte more pluggable by external libraries: a plugin can now control the additional context used to generate the version string hash.

What changes were proposed in this pull request?

This can be used with a plugin like:

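from typing import List, Union

from flytekit.configuration.plugin import FlytekitPlugin
from flytekit.core.base_task import PythonTask
from flytekit.core.python_auto_container import PythonAutoContainerTask
from flytekit.core.workflow import WorkflowBase

class MyPlugin(FlytekitPlugin):
    @staticmethod
    def get_additional_context(entity: Union[PythonAutoContainerTask, WorkflowBase]) -> List[str]:
        """Get additional context to be used for calculating the version hash."""
        if isinstance(entity, PythonTask):
            # A task contributes its own config.
            return [str(entity.task_config)]
        if isinstance(entity, WorkflowBase):
            # A workflow contributes the configs of every entity in its nodes.
            task_configs = []
            for n in entity.nodes:
                task_configs.extend(MyPlugin.get_additional_context(n.flyte_entity))
            return task_configs
        return []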

This will add the task config to the calculation of the version hash.
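
To illustrate why extra context strings change the resulting version, here is a minimal sketch. It is a simplified stand-in, not flytekit's actual `_version_from_hash`: the helper name `version_from_hash`, the image name, and the `HardwareTier(...)` strings are all hypothetical, and the real function also takes serialization settings and default inputs.

```python
import base64
import hashlib
from typing import List


def version_from_hash(md5_bytes: bytes, *additional_context: str) -> str:
    """Hypothetical, simplified version-hash helper (not flytekit's code)."""
    h = hashlib.md5(md5_bytes)
    # Every extra string is folded into the digest, so any change to it
    # produces a different version.
    for item in additional_context:
        h.update(item.encode("utf-8"))
    # URL-safe base64 without padding keeps the version string short.
    return base64.urlsafe_b64encode(h.digest()).decode("ascii").rstrip("=")


# Assumed inputs for illustration only.
code_digest = hashlib.md5(b"def my_task(): ...").digest()
image_names: List[str] = ["example/image:1.0"]

base = version_from_hash(code_digest, *image_names)
small = version_from_hash(code_digest, *image_names, "HardwareTier(size='small')")
large = version_from_hash(code_digest, *image_names, "HardwareTier(size='large')")
```

With identical code and images, `small` and `large` differ only because the stringified task config differs, which is the effect the plugin hook is meant to enable.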

How was this patch tested?

Setup process

Screenshots

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

codecov bot commented May 17, 2024

Codecov Report

Attention: Patch coverage is 0%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 72.09%. Comparing base (e6e08f9) to head (a744848).
Report is 15 commits behind head on master.

Current head a744848 differs from pull request most recent head d4cda94

Please upload reports for the commit d4cda94 to get more accurate results.

Files                       Patch %   Lines
flytekit/remote/remote.py   0.00%     2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #2428       +/-   ##
===========================================
+ Coverage   42.77%   72.09%   +29.31%     
===========================================
  Files         185      181        -4     
  Lines       18677    18397      -280     
  Branches     2665     3601      +936     
===========================================
+ Hits         7990    13264     +5274     
+ Misses      10599     4508     -6091     
- Partials       88      625      +537     


@ddl-rliu ddl-rliu changed the title from "x" to "Extend flytekit version hash calculation to be pluggable" May 22, 2024
@ddl-rliu ddl-rliu force-pushed the rliu.extend-flytekit-pluggable branch 2 times, most recently from 922e239 to 8d0645c Compare May 22, 2024 19:56
x
Signed-off-by: ddl-rliu <richard.liu@dominodatalab.com>
@ddl-rliu ddl-rliu force-pushed the rliu.extend-flytekit-pluggable branch from 8d0645c to 2951085 Compare May 22, 2024 19:56
@@ -1028,11 +1028,14 @@ def _get_image_names(entity: typing.Union[PythonAutoContainerTask, WorkflowBase]
         if isinstance(entity, WorkflowBase):
             default_inputs = entity.python_interface.default_inputs_as_kwargs

+        from flytekit.configuration.plugin import get_plugin

Probably not the ideal place to have an import.

It also seems like the additional_context (which could probably be named a bit more descriptively?) value should be passed into the register_script function, shouldn't it? The pattern of pulling in special state like this makes this code a bit harder to test / reason about I think.

Contributor Author

@ddl-rliu ddl-rliu May 22, 2024

I use a lazy import here to work around the circular import (remote imports plugin, and plugin imports remote).
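
As a rough illustration of the tactic, the sketch below writes two throwaway modules that depend on each other. The module names `remote_mod` and `plugin_mod` are hypothetical stand-ins, not flytekit's real modules. If the import inside `register()` were at the top of `remote_mod`, importing `remote_mod` first would fail, because `plugin_mod` would try to pull `register` from a module that has not finished initializing.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for the "remote" side.
REMOTE_SRC = textwrap.dedent(
    '''
    def register():
        # Deferred import: resolved when register() runs, not when
        # remote_mod itself is imported.
        from plugin_mod import get_plugin
        return get_plugin()
    '''
)

# Hypothetical stand-in for the "plugin" side, which imports back.
PLUGIN_SRC = textwrap.dedent(
    '''
    from remote_mod import register  # top-level import back into remote_mod

    def get_plugin():
        return "default-plugin"
    '''
)

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "remote_mod.py").write_text(REMOTE_SRC)
    Path(tmp, "plugin_mod.py").write_text(PLUGIN_SRC)
    sys.path.insert(0, tmp)
    try:
        remote_mod = importlib.import_module("remote_mod")
        # By now remote_mod is fully initialized, so plugin_mod can import it.
        result = remote_mod.register()
    finally:
        sys.path.remove(tmp)
```

The deferred import makes the cycle harmless at load time, but the two modules still depend on each other, which is the design concern raised in the reply that follows.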

Yeah, I've used this tactic to reduce startup time in Ruby -- but:

  • It looks like there's already a lazy_module helper in Flytekit
  • If this is a tactic for dealing with a potential circular dependency, that can be the sign of a flawed design
  • Remote now takes a dependency on this plugin helper method -- it's cleaner if you can invert the behavior so that this function doesn't need to know anything about finding special mutation functions in plugins
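
The inversion suggested above could look roughly like the following sketch. Everything here is hypothetical: `register_script_sketch`, `compute_version`, and `ContextProvider` are illustrative names and signatures, not flytekit's API. The point is only that the caller supplies the context, so the registration function never has to look up a plugin.

```python
import hashlib
from typing import Callable, Iterable, List, Optional

# Hypothetical type: a callable that returns extra strings for an entity.
ContextProvider = Callable[[object], List[str]]


def compute_version(md5_bytes: bytes, extra_context: Iterable[str] = ()) -> str:
    """Pure function: everything that affects the hash arrives as an argument."""
    h = hashlib.md5(md5_bytes)
    for item in extra_context:
        h.update(item.encode("utf-8"))
    return h.hexdigest()


def register_script_sketch(
    entity: object,
    md5_bytes: bytes,
    context_provider: Optional[ContextProvider] = None,
) -> str:
    """The caller decides where the context comes from; no plugin lookup here."""
    extra = context_provider(entity) if context_provider else []
    return compute_version(md5_bytes, extra)


# In a test, a plain lambda can stand in for the plugin.
digest = hashlib.md5(b"source").digest()
v_default = register_script_sketch("my_task", digest)
v_custom = register_script_sketch("my_task", digest, lambda entity: [f"config-for-{entity}"])
```

Because the provider is an ordinary argument, the function can be exercised without any global plugin state, and the circular import disappears along with the lookup.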

Contributor Author

@ddl-rliu ddl-rliu May 22, 2024

I could rename additional_context to version_hash_additional_context? I wouldn't rename it to task_configs because it does not necessarily contain the task configs, that only happens for a particular use case.

Moving it to a parameter of register_script seems sensible, but for the moment I'd like to keep the PR easy to review and avoid changing too many methods. I'll give it a try at a later point.

edit: will rename to version_hash_additional_context

# The md5 version that we send to S3/GCS has to match the file contents exactly,
# but we don't have to use it when registering with the Flyte backend.
# For that add the hash of the compilation settings to hash of file
         version = self._version_from_hash(
-            md5_bytes, serialization_settings, default_inputs, *_get_image_names(entity)
+            md5_bytes, serialization_settings, default_inputs, *_get_image_names(entity), *additional_context

Would it make more sense to have pluggable versioning instead? i.e. the plugin defines a custom version function that gets passed md5_bytes, serialization_settings, default_inputs, *_get_image_names(entity) if it exists? Would such a function need a dual?

Contributor Author

My feeling is that core logic, i.e. the default version hash logic, should remain where it is rather than move into the plugin methods. This mostly follows the existing pattern I noticed in FlytekitPlugin, where core logic is rarely moved into its methods; there are some exceptions, though, as I suppose with FlytekitPlugin.get_remote.
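
For comparison, the alternative raised in this thread (a plugin-supplied version function with a core default) might be sketched as below. All names are hypothetical: `BasePlugin`, `TaggedPlugin`, `custom_version`, and `resolve_version` do not exist in flytekit, and the argument list is reduced to a digest plus a sequence of strings.

```python
import hashlib
from typing import Callable, Optional, Sequence

# Hypothetical signature for a pluggable version function.
VersionFn = Callable[[bytes, Sequence[str]], str]


def default_version(md5_bytes: bytes, parts: Sequence[str]) -> str:
    """Core default: stays in core code, as argued in the reply above."""
    h = hashlib.md5(md5_bytes)
    for part in parts:
        h.update(part.encode("utf-8"))
    return h.hexdigest()


class BasePlugin:
    """Hypothetical plugin base; None means "use the core default"."""

    custom_version: Optional[VersionFn] = None


class TaggedPlugin(BasePlugin):
    @staticmethod
    def custom_version(md5_bytes: bytes, parts: Sequence[str]) -> str:
        # The plugin takes over the whole computation, here reusing the default.
        return "custom-" + default_version(md5_bytes, parts)[:8]


def resolve_version(plugin: type, md5_bytes: bytes, parts: Sequence[str]) -> str:
    """Use the plugin's function if it defines one, else the core default."""
    fn = plugin.custom_version or default_version
    return fn(md5_bytes, parts)


digest = hashlib.md5(b"source").digest()
parts = ["example/image:1.0"]
v_core = resolve_version(BasePlugin, digest, parts)
v_plugin = resolve_version(TaggedPlugin, digest, parts)
```

The trade-off, under these assumptions, is that a version-function hook gives plugins full control over the result, while the additional-context hook in this PR keeps the hashing itself in core and only lets plugins add inputs.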

@ddl-rliu ddl-rliu marked this pull request as ready for review May 22, 2024 23:42
ddl-rliu added a commit to dominodatalab/flytekit that referenced this pull request May 23, 2024
- This is necessary due to another pending PR to upstream flytekit:
  flyteorg#2428

  In case this PR is not likely to be merged, we have a plan to move away
  from this change, see the linked Doc in DOM-57472
2 participants