Add option to run EcsRunLauncher without pulling network config from the current ECS task, and kwargs to customize the task config #9678

Merged
merged 1 commit into from Oct 3, 2022

Member

@gibsondan gibsondan commented Sep 13, 2022

Right now the only way to use the EcsRunLauncher involves pulling permissions and other configuration from the task that is launching the run. This creates a problem in situations where you might want system code to have different IAM roles than user code, or even launch runs from outside of ECS. In Cloud, it creates an awkward situation where the grpc server tasks are configured based on fields on the instance, but runs are configured by pulling from the current task.

This PR creates a configuration option that lets you pass in all the configuration you need when launching a run, with nothing pulled from the current container. The old behavior is kept as well.


@gibsondan gibsondan force-pushed the ecs3september branch 7 times, most recently from 3bbaa62 to 96e869a Compare September 20, 2022 02:33
@gibsondan gibsondan changed the title WIP Add ability to run EcsRunLauncher without using the current task definitino [september 2022] Add option to run EcsRunLauncher without pulling config from the current task definition Sep 20, 2022
@gibsondan gibsondan marked this pull request as ready for review September 20, 2022 02:52
@gibsondan gibsondan force-pushed the ecs3september branch 4 times, most recently from 7b1af28 to ff1ca4b Compare September 21, 2022 20:32
@gibsondan
Member Author

assert container_definition["image"] == image
assert not container_definition.get("entryPoint")
assert not container_definition.get("dependsOn")
# But other stuff is inherited from the parent task definition
Contributor

I know this is largely copy-pasted from other tests, but isn't this the exact opposite of what we're trying to support here?

Can we maybe reduce the test assertions to just the behavior that diverges from the default?

def test_reuse_task_definition(instance):
image = "image"
secrets = []
def test_reuse_task_definition(instance, ecs):
Contributor

It's a little unclear to me if this change is meant to:

  • force registration of a new task definition revision for the same family
  • use a task definition without any pre-populated defaults from the parent task

Which probably points to both some poor decisions with how I initially wrote this and also some naming ambiguities that could be cleaned up in this PR?

task_definition = {}
with suppress(ClientError):
task_definition = self.ecs.describe_task_definition(taskDefinition=family)[
if self.use_current_task_definition:
Contributor

What if we instead let people set an optional "TaskDefinition" dict in config - and we merged that dict with our existing defaults? That might help squash "impossible" states where you set use_current_task_definition to False but also don't provide everything needed to define a new task definition.

I'm also just having difficulty following what even happens differently if you set this to True - what do we still pull from the parent's task definition and not from the launcher config that couldn't be represented as a "default" value?
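
The merge suggested in this comment could look roughly like the sketch below. It is only an illustration: `DEFAULT_TASK_DEFINITION`, its values, and `merge_task_definition` are hypothetical names that do not appear in the PR, and the merge is a shallow one over top-level keys.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Hypothetical defaults; the launcher's real defaults are not shown in this thread.
DEFAULT_TASK_DEFINITION: Dict[str, Any] = {
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "256",
    "memory": "512",
}


def merge_task_definition(user_config: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of the defaults with user-supplied top-level keys layered on top."""
    merged = deepcopy(DEFAULT_TASK_DEFINITION)
    merged.update(deepcopy(user_config or {}))
    return merged


# A user overrides one default and adds one new field (the ARN is a placeholder).
merged = merge_task_definition(
    {"cpu": "1024", "taskRoleArn": "arn:aws:iam::123456789012:role/example"}
)
```

Because every required field has a default in this sketch, there is no configuration in which the user opts out of inheritance and also leaves a required field unset.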

Member Author

@gibsondan gibsondan Sep 26, 2022

I'm excited to offer something like a task_definition_kwargs field that lets you specify arbitrary task definition config, so that we don't have to re-implement the whole spec as Dagster config, but I don't think it would be sufficient here. The pieces of config that we're offering are a mixture of things on the Dagster container (log configuration / sidecars), the task definition (execution_role_arn, task_role_arn), and the task (cluster / security groups / subnets).

Generally I'm trying to follow the playbook/philosophy here (https://elementl.quip.com/9K0bAWhc6t9n/Agent-Configuration-Options), where we have a relatively simple and blessed way to set the most common pieces of config we think people are likely to want (secrets etc.) and then eventually also an escape hatch where we expose the full raw config for the power users. I think there's value in not needing to send people to the boto spec if they want to do something like 'set env vars' or 'point at an existing cluster' or 'change the IAM role that's used'. But which things are considered 'the most common pieces of config' is very subjective.

for key in expected_keys
if key in metadata.task_definition.keys()
if key in current_task_definition_dict.keys()
Member Author

@jmsanders my understanding is that the answer to your question of what else can end up in the task definition if you set use_current_task_definition=True can be found here - it's the set of things that are a) params going into register_task_definition and b) found on the output of describe_task_definition
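
A minimal sketch of the filtering described here, assuming a hand-picked subset of `register_task_definition` parameters. `EXPECTED_KEYS` and `inheritable_fields` are illustrative names rather than the PR's code, and the `described` dict is made-up sample data shaped like the `taskDefinition` entry of a `describe_task_definition` response.

```python
from typing import Any, Dict, Sequence

# Illustrative subset of the parameters accepted by register_task_definition.
EXPECTED_KEYS: Sequence[str] = (
    "family",
    "taskRoleArn",
    "executionRoleArn",
    "networkMode",
    "containerDefinitions",
    "placementConstraints",
    "requiresCompatibilities",
    "cpu",
    "memory",
)


def inheritable_fields(
    current_task_definition: Dict[str, Any],
    expected_keys: Sequence[str] = EXPECTED_KEYS,
) -> Dict[str, Any]:
    """Keep only keys that are both accepted on registration and present on the current task."""
    return {
        key: current_task_definition[key]
        for key in expected_keys
        if key in current_task_definition
    }


# Sample data only: output-only fields such as revision and status get dropped.
described = {
    "family": "dagster",
    "taskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/dagster:3",
    "revision": 3,
    "status": "ACTIVE",
    "cpu": "256",
    "placementConstraints": [],
}
inherited = inheritable_fields(described)
```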

Contributor

But then we merge a bunch of stuff onto it based on what's in your launcher config.

I guess my question is how much is even left that we actually inherit vs. override.

Member Author

There are many random fields among the arguments to register_task_definition that would no longer be copied over (placementConstraints etc.): https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs.html#ECS.Client.register_task_definition

Whether they are actually in use by anybody, I do not know

@gibsondan gibsondan dismissed jmsanders’s stale review September 27, 2022 18:58

mostly discussed offline, but did a smallish pass on some naming as well

@gibsondan gibsondan changed the title Add option to run EcsRunLauncher without pulling config from the current task definition Add option to run EcsRunLauncher without pulling config from the current ECS task Sep 28, 2022
("environment", Optional[List[Dict[str, str]]]),
("execution_role_arn", Optional[str]),
("task_role_arn", Optional[str]),
("sidecars", List[Dict[str, Any]]),
Contributor

tags should also be included here (and passed along to run_task)

might be better in a separate PR though

sidecars=sidecars,
)

def task_definition_dict(self):
Contributor

@jmsanders @gibsondan instead of managing/wrapping task_defs, could we create a base task_def in the CloudFormation that users get (and that users can also define) and lean more on overrides? It's a common pattern to define task defs in Terraform/CloudFormation, and it kind of feels like we're creeping into IaC with this.

Also, the API rate limits for the describe/set task-def APIs aren't that high by default, and not checking/setting this stuff at runtime would help with that.

Contributor

I think the original motivations for the way we did things still hold true. Namely that we can't override the image and we want to enable the simple case where a user doesn't need to set up their own infrastructure (and can instead just use docker compose to launch):

https://dagster.phacility.com/D8404
https://dagster.phacility.com/D8486

But the more I've talked with Daniel, the more I think we were overzealous in how much information we copy forward from the original task definition.

@gibsondan
Member Author

gibsondan commented Sep 29, 2022 via email

)
check.invariant(
not self.include_sidecars,
"can only set include_sidecars if use_current_task_definition is True",
Contributor

Why? Because if you're providing your own task definition, presumably you've already defined the sidecars inside it?

)
check.invariant(
self.execution_role_arn,
"Must set execution_role_arn if not pulling from current task definition",
Contributor

Why? Because if you're providing your own task definition, presumably you've already defined the execution role arn inside it?

is_required=False,
default_value=True,
description=(
"Whether to use our current task definition to initialize the task definition "
Contributor

Can we include an example of why you'd choose not to use the default behavior?

image=image,
container_name=self.container_name,
command=None,
log_configuration={
Contributor

Are we still painting ourselves into a corner here by introducing another opinionated task definition? What happens when we want to support other logDriver configs?

"executionRoleArn"
) and task_definition.get("taskRoleArn") == metadata.task_definition.get("taskRoleArn"):
task_definitions_match = True
return existing_task_definition_config == desired_task_definition_config
Member Author

note to self - verify that this handles deep equality correctly
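
For reference, Python's `==` on dicts already compares nested dicts and lists by value, but list comparison is order-sensitive. The sketch below illustrates both points with made-up data. `normalize` is a hypothetical helper, not part of the PR, and whether any ordering difference actually occurs between the registered and desired task definitions is not established in this thread.

```python
from copy import deepcopy
from typing import Any, Dict


def normalize(task_definition: Dict[str, Any]) -> Dict[str, Any]:
    """Sort each container's environment list by name so ordering does not affect equality."""
    normalized = deepcopy(task_definition)
    for container in normalized.get("containerDefinitions", []):
        container["environment"] = sorted(
            container.get("environment", []), key=lambda entry: entry["name"]
        )
    return normalized


existing = {
    "family": "run",
    "containerDefinitions": [
        {
            "name": "run",
            "environment": [
                {"name": "A", "value": "1"},
                {"name": "B", "value": "2"},
            ],
        }
    ],
}
identical = deepcopy(existing)
reordered = deepcopy(existing)
reordered["containerDefinitions"][0]["environment"].reverse()

# Nested dicts and lists are compared element by element.
same_nested = existing == identical
# The same entries in a different list order compare as unequal.
order_sensitive = existing == reordered
# Sorting the lists first makes the comparison independent of order.
after_normalizing = normalize(existing) == normalize(reordered)
```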

@gibsondan
Member Author

@jmsanders and @Ramshackle-Jamathon latest rev responds to your feedback about task definitions by changing the plan - instead of having a path where the run launcher constructs a new task def from scratch, the grpc server would specify the task definition arn to use via container_context (likely using its own task definition). That avoids a situation where we have two totally separate task defs to use for user code.

Still need to do a pass on the tests to reflect that new version, but curious for overall thoughts on that plan

Contributor

@jmsanders jmsanders left a comment

Love this new direction.

I think we need to sanitize run_task_kwargs a bit though, don't we?

description="Additional arguments to include while running the task. See "
"https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs.html#ECS.Client.run_task "
"for the available parameters. The overrides and taskDefinition arguments will always "
"be set by the tun launcher. If this field is not set, the arguments to run_task "
Contributor

Suggested change
- "be set by the tun launcher. If this field is not set, the arguments to run_task "
+ "be set by the run launcher. If this field is not set, the arguments to run_task "

Contributor

Also, let's make sure to add a test for this statement:

The overrides and taskDefinition arguments will always be set by the tun launcher.


task_definition = family

if self.run_task_kwargs != None:
Contributor

👨‍🍳 💋

Much nicer if/else logic.

launchType="FARGATE",
overrides=overrides,
**task_config.run_task_kwargs,
Contributor

Don't we need to munge task_config.run_task_kwargs to not include taskDefinition, launchType, and overrides? Otherwise, we can end up in a situation where the same kwarg is passed to the function twice.

>>> def foo(**kwargs):
...   pass
... 
>>> kwargs = {"bar": 1}
>>> foo(**kwargs)
>>> foo(bar=2, **kwargs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: foo() got multiple values for keyword argument 'bar'
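
One way to avoid the collision is to strip the launcher-owned keys from the user-supplied kwargs and build a single argument dict before calling `run_task`. This is a sketch under assumptions, not the change that was merged: `RESERVED_RUN_TASK_KEYS` and `build_run_task_args` are hypothetical names, and treating `launchType` as reserved follows this review comment rather than the config description.

```python
from typing import Any, Dict

# Keys the launcher sets itself, per the review comment above (an assumption).
RESERVED_RUN_TASK_KEYS = ("taskDefinition", "launchType", "overrides")


def build_run_task_args(
    task_definition: str,
    overrides: Dict[str, Any],
    run_task_kwargs: Dict[str, Any],
) -> Dict[str, Any]:
    """Combine user kwargs with launcher-owned values; the launcher's values always win."""
    extra = {
        key: value
        for key, value in run_task_kwargs.items()
        if key not in RESERVED_RUN_TASK_KEYS
    }
    return {
        **extra,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "overrides": overrides,
    }


# The user-supplied taskDefinition is dropped; cluster passes through untouched.
args = build_run_task_args(
    "family",
    {"containerOverrides": []},
    {"taskDefinition": "something-else", "cluster": "my-cluster"},
)
```

The resulting dict can be passed as `ecs.run_task(**args)` without any keyword appearing twice. Raising an error on a reserved key instead of dropping it silently would be an equally reasonable choice.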

@gibsondan gibsondan force-pushed the ecs3september branch 2 times, most recently from 77a64cf to 7cc3605 Compare September 29, 2022 22:14
@gibsondan gibsondan changed the title Add option to run EcsRunLauncher without pulling config from the current ECS task Add option to run EcsRunLauncher without pulling network config from the current ECS task, and kwargs to customize the task config Sep 29, 2022
@gibsondan
Member Author

Well this has been a journey! @jmsanders i think this is ready for perusal again

@gibsondan gibsondan force-pushed the ecs3september branch 2 times, most recently from 8a90370 to 3b1dec8 Compare September 29, 2022 22:32
@gibsondan gibsondan merged commit 9a2d194 into master Oct 3, 2022
@gibsondan gibsondan deleted the ecs3september branch October 3, 2022 18:18
@gibsondan
Member Author

gibsondan commented Oct 11, 2022 via email

3 participants