
Feature/node tag compute #13

Merged
marrrcin merged 13 commits into getindata:develop on Oct 18, 2022

Conversation

fdroessler
Contributor

This provides the ability to override the compute target on a per-node basis using Kedro node tags.

I saw this comment and had a rough implementation running in my plugin prototype. Happy to discuss or change the details of the implementation, but I wanted to give it an initial stab.
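
For illustration, tagging a node for a specific compute target would look roughly like this on the pipeline side (a minimal sketch; the azureml.compute.<target> tag convention is just what my prototype assumes, not a finalized API):

# rough sketch of the tag-based approach, assuming an "azureml.compute.<target>" tag convention
from kedro.pipeline import node, pipeline

def train_model(features):
    ...

training = pipeline(
    [
        node(
            func=train_model,
            inputs="features",
            outputs="model",
            name="model_training",
            # the plugin would read this tag and run the step on the named compute target
            tags=["azureml.compute.gpu-cluster"],
        )
    ]
)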

@em-pe
Member

em-pe commented Aug 29, 2022

Hi @fdroessler. Happy to see your contribution! I'm wondering if we could move the compute specs into the plugin config file (similar to the resources block in kedro-kubeflow) to keep Azure concepts separate from pipeline code.

You can ignore the sonarcloud and e2e tests, as they require credentials not available in forks; we'll run them on our side.

@marrrcin
Contributor

marrrcin commented Aug 29, 2022

Following up on @em-pe: we would like this feature to be fully separated out from the Python code, so as not to couple the execution layer with the logic layer.
The config for managing resources in the kedro-kubeflow plugin looks like this:

  resources:

    # For nodes that require more RAM you can increase the "memory"
    data_import_step:
      memory: 2Gi

    # Training nodes can utilize more than one CPU if the algorithm
    # supports it
    model_training:
      cpu: 8
      memory: 1Gi

    # GPU-capable nodes can request 1 GPU slot
    tensorflow_step:
      nvidia.com/gpu: 1

    # Default settings for the nodes
    __default__:
      cpu: 1 
      memory: 1Gi 

So in general, you have:

step_name:
    cpu: 4
    memory: 4Gi

etc.

For kedro-azureml we'll probably want something like this (excerpt from azureml.yml):

azure:
    compute_targets:
        step_name:
            cluster: <cluster name>
        # (... other steps)
        __default__:
            cluster: <cluster name> # <-- this should be populated from the config init, right now it's in the azure.cluster_name field but it should be moved here

See how the __default__ handling is implemented here https://github.com/getindata/kedro-kubeflow/blob/45bf6c5945428f954968b0edcd2810491e4c5a5f/kedro_kubeflow/config.py#L288
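
In essence the fallback just merges the step-specific entry over the __default__ entry; a minimal sketch (the function name and dict shape here are placeholders for discussion, not the plugin's actual API):

from typing import Any, Dict

def get_compute_target(compute_targets: Dict[str, Dict[str, Any]], step_name: str) -> Dict[str, Any]:
    """Return the compute settings for a step, falling back to __default__."""
    defaults = compute_targets.get("__default__", {})
    overrides = compute_targets.get(step_name, {})
    # step-specific keys win over the defaults
    return {**defaults, **overrides}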

Please let us know if you have any additional questions.

@fdroessler
Contributor Author

OK, I see the general idea here. Just a few questions off the top of my head:

  • Does that also imply that one needs to keep the node names and the resources section in sync? And would that fail early enough to avoid having to build and push an image again if there is a typo?
  • For most pipelines only a small number of nodes would need entries, but for larger pipelines couldn't that file get a bit messy?

What was appealing to me about this approach was that I can read the pipelines (and, as far as I know, also the kedro-viz graph) and know exactly which part of my pipeline is running where. But maybe that benefit is not worth the coupling. I'll give the other suggestion a shot in a separate branch.

@em-pe
Member

em-pe commented Aug 31, 2022

I think the correctness of the configuration could be handled with simple config validation, but I agree that the messiness of the resources mapping can be a problem for massive pipelines: it hurts readability, invites duplication, and we lose the visibility of resource allocation that kedro-viz brings to the table.

I have one more idea that kind of marries the two approaches: what if we add a config section that maps Kedro node tags to Azure compute resources?

You would have something like this in the config file:

azure:
    ....
    resources:
        __default__:
            cluster: <default cluster>
        chunky:
            cluster: <himem cluster>
        gpu:
            cluster: <gpu enabled cluster>

... and then you operate with the chunky and gpu tags within the pipeline. We keep the flexibility and simplicity of tags and stay platform-independent at the same time. That would work for other execution environments as well.
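
For illustration, resolving a node's cluster from its tags could then be as simple as this (a sketch only; the helper name and config shape are assumptions for the discussion):

from typing import Iterable

def resolve_cluster(resources: dict, node_tags: Iterable[str]) -> str:
    """Pick the cluster for a node from its tags, falling back to __default__."""
    for tag in node_tags:
        if tag in resources:
            return resources[tag]["cluster"]
    return resources["__default__"]["cluster"]

# e.g. a node tagged "gpu" resolves to the <gpu enabled cluster> entry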

@fdroessler @marrrcin any thoughts?

@marrrcin
Contributor

marrrcin commented Sep 1, 2022

Great idea @em-pe , sounds good to me!

@fdroessler
Contributor Author

I like it. Will change accordingly :)

@marrrcin
Contributor

How is your progress @fdroessler ? Do you need some help?

@fdroessler
Contributor Author

How is your progress @fdroessler ? Do you need some help?

Thanks for following up. I had a couple of busy weeks with weddings and the like. But I will have time this week to finish it :) I'll reach out in case I need help!

@marrrcin
Contributor

@fdroessler how is it going? 🙂

@fdroessler
Contributor Author

Slow progress, but getting there. Thanks for the patience; it was mostly busy work :) but I have not forgotten about it. Running some tests today.

@fdroessler
Contributor Author

fdroessler commented Oct 12, 2022

@marrrcin I pushed some code earlier with an initial implementation. I was wondering if this was roughly along the lines you and @em-pe were thinking? Would be good to get some comments on the general approach (not the code in detail yet) before cleaning it up and adding tests.

@@ -52,6 +80,11 @@ class KedroAzureRunnerConfig(BaseModel):
account_name: "{storage_account_name}"
# Name of the storage container
container: "{storage_container}"
resources:
__default__:
Contributor

Three things here:

  1. Let's remove azure.cluster_name in favour of resources.__default__, so this setting lives in only one place.
  2. I would rename resources to compute, since "compute" is the term used in the Azure SDK, so it will be much clearer for users.
  3. I would not generate chunky as a template entry here; let's stick to something self-explanatory like

your_node_tag:
    cluster_name: high-cpu-cluster

Also, please provide a description of the compute node in the form of a # comment, like the other nodes in this template.

for node in dummy_pipeline_compute_tag.nodes:
    if node.tags:
        for tag in node.tags:
            if "azureml.compute" in tag:
Contributor

Please remember to update the tests once you finish the changes to the kedro tag -> azure compute mapping.

@marrrcin
Contributor

@fdroessler Initially looks good, added my comments.

@fdroessler
Contributor Author

Thanks for the feedback. I adapted things according to your suggestions and added a small section in the quickstart guide.

@marrrcin
Contributor

Looks good to me!
Just resolve the conflicts and I will merge your changes 🙂

This branch has conflicts that must be resolved before continuing.
Conflicting files:
CHANGELOG.md
docs/source/03_quickstart.md

marrrcin merged commit d19e0f1 into getindata:develop on Oct 18, 2022
@marrrcin
Contributor

Merged. We will verify it in the e2e tests, and if anything goes wrong I will create an additional issue. If everything passes, we will probably create a new release next week.

Thanks so much for your contribution!
