Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure AD authentication support (and SDK upgrade) #9

Merged
merged 6 commits into from
Oct 19, 2023

Conversation

tkent
Copy link

@tkent tkent commented Sep 19, 2023

What's here

This PR adds support for using Azure AD authentication to the azure module.

About the changes

The latest version of the azure golang SDK was needed for this and, unfortunately, this required a very large set of changes. The azure module previously used the storage SDK, which has been removed, and the new SDK is very different. The key changes included are:

  • Replacement of nearly all the Azure SDK interactions.
  • Addition of a new interface to support presigned urls using either authentication source (see RequestPreSigner)
  • Removal of the multi-part upload support (this is now handled in the newer SDK)
  • (BREAKING) Removal of the use_https and api_version configuration options (more on that below)
  • Addition of a new upload_concurrency configuration option.
  • Rename of the config key base_url to domain_suffix.
  • Addition of new general stow tests for presigned requests (ContainerPreSignRequest)
  • Addition of a new general stow test for big files (BigFileUpload)

Things to talk about

The Azure SDK changes forced a few decisions that I want to call out:

The removal of use_https

This just seems like a misconfiguration risk with no clear benefit. MSFT recommends using HTTPS instead of HTTP for storage account access (here) and I can't imagine a scene where http would be preferred.

The removal of api_version

Newer versions of the SDK do not really support changing the API version. If you check the SDK code you see this comment:

Set with caution as this package version has not been tested with arbitrary service versions.

Again, this seems like a misconfiguration risk without a clear benefit.

Presigned Requests and the shared keys disabled flag

When shared keys are disabled on a storage account, you might expect presigned keys to be blocked as well. However, "user delegated SAS tokens" will continue to work even in this case (docs here).

There is just one big caveat. The Azure AD identity used for authentication to the storage account must have the Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey privilege (docs here).

Performance of the Presigned Request implementation when using AD auth

The current presigned request implementation for AD auth will not perform well if a large volume of requests are needed. Each pre-signing requires a distinct delegate key which is an API call to Azure. This could be mitigated, though not easily, by obtaining a long-lived delegate key and renewing it as necessary. I opted not to include that, assuming large volumes of pre-signed requests are unlikely.

Lack of context in the stow interfaces

The methods of the stow interfaces generally do not include a context.Context in their argument signature. The newer azure SDK, however, requires a context.Context for nearly every operation. I chose to use context.Background every time this came up, but context.TODO would have been just as appropriate.

Testing

A big challenge for this particular change is testing. To make it a bit easier for a reviewer to kick the tires, a terraform module was added under the tests/ directory. In there is a readme for using the module to stand up and teardown a storage account.

Unfortunately, even with the fixture and existing stow tests, there is quite a bit of real-world testing this SDK swap will need.

Update

I've put together some instructions for testing this change end to end within a local flyte build using the master branch of flyteorg/flyte. See this gist.

A few things to note if following that guide:

  • For some reason, when Flyte first initializes the configured blob container in the Azure Storage Account, you will get errors (appears to be a race condition). The container is successfully created though. I haven't looked into this.
  • If you delete the storage container Flyte is using (for example, to start over) the Azure APIs will return errors trying to make a new container with the same name for a few minutes. This will clear up once the container is truly deleted, usually within 5 minutes.

@tkent tkent marked this pull request as draft September 21, 2023 03:33
@tkent tkent changed the title DRAFT: Azure AD authentication support (and SDK upgrade) Azure AD authentication support (and SDK upgrade) Sep 21, 2023
@gvashishtha
Copy link

In case it's helpful context, I added use_https and api_version as part of #7 because I didn't want to have to come back in and make another PR in case it turned out I needed those fields in the future. But I agree with your reasoning that those two options increase misconfiguration risk and shouldn't be used with the new Azure SDK.

If the removal of those two fields truly the only breaking change in this PR I hope the Flyte team will take this PR as soon as possible. Right now the only way you can authenticate into Azure with stow is with Azure connection strings. It is much better security practice to use workload identities.

Copy link

@EngHabu EngHabu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making this change! This has been bothering me for a while...

Can you detail how you tested this change? What kind of workflows have you ran against an installation with this stow version?

azure/config.go Outdated Show resolved Hide resolved
azure/config.go Outdated Show resolved Hide resolved
azure/config.go Show resolved Hide resolved
azure/container.go Outdated Show resolved Hide resolved
azure/container.go Outdated Show resolved Hide resolved
azure/container.go Show resolved Hide resolved
@tkent
Copy link
Author

tkent commented Oct 7, 2023

@gvashishtha Thanks for that context and makes sense! If something like LocalStack existed for Azure Blob Services, then those options would be key for local testing. Alas, there is no such thing.

@tkent
Copy link
Author

tkent commented Oct 7, 2023

@EngHabu

Can you detail how you tested this change?

The testing section covers most of the testing that was put into this. For the scope of this change, I really wanted to stick to the stow fork completely independent of Flyte. I considered that a separate level of testing and one I'd like help with.

What kind of workflows have you ran against an installation with this stow version?

I did happen to test a Flyte build together with this branch and threw together a few basic workflows (the gist in the testing section shows one). However, I am new to flyte so if there are specific things to test for this type of change, I don't have the context for that. Also, the workflows I wrote were in python and so I seemed to be testing out the FlyteKit python SDK more so than anything to do with this code.

@EngHabu
Copy link

EngHabu commented Oct 9, 2023

This is awesome! The one additional test I would like to see is a map task. The reason being, there is a different code path that runs to collect outputs of a map task in propeller.

Something like this would do:

@task
def my_task(i: int) -> string:
   return f"Hello: {i}"

@workflow
def my_wf(numbers: typing.List[int]) -> typing.List[str]:
    return map_task(my_task)(i=numbers)

Besides the few comments above, this looks great to me! I don't think we need to delay it further. Let's address the back compat fail early issue and optionally the nits and mark it for review!

@tkent tkent marked this pull request as ready for review October 10, 2023 03:36
@tkent
Copy link
Author

tkent commented Oct 10, 2023

@EngHabu - Great to hear, thank you!

The one additional test I would like to see is a map task....

I updated the gist about local testing to include that. Below is a console screen shot of what the example task looked like after being ran.

Screen Shot 2023-10-09 at 2 52 59 PM

One of the first tasks we had for this updated stow code within our environment was to use a map_task to process some 1000s of blobs in batches using map_task. It worked great, but unfortunately it was also our first use of map_task. If it was behaving unexpectedly, I wouldn't know.

Also, we have gone through the exercise of getting a build of flyte with this branch deployed using the flyte-binary helm chart. However, doing that required changes in the flyte monorepo, and it didn't seem appropriate to start discussing that until this was sorted.


Aside from that, I wanted to call out a few things:

  1. Double checking you saw the comment about presigned URL performance in the PR description. It certainly doesn't seem like it will be an issue in Flyte from my testing, but that is a concern of mine.
  2. When Flyte boots and the configured container does not yet exist, a race condition often shows up where flyte tries to create the storage container more than once. The first one succeeds, and subsequent ones cause flyte to exit because the SDK returns an error that a container is being created. Restarting flyte fixes this, but it's worth noting. You don't see this issue in the stow end-2-end tests.

Terence Kent and others added 5 commits October 9, 2023 21:09
… Azure AD or shared keys. This required a larger-than-expected series of changes because the Azure SDK now very different than the one stow used before. This is an early commit to get changes out there for discussion. Included here is:

- An upgrade of the Azure SDK used for the project (needed to support AD auth). This, unfortunately also brought in a whole series if indirect upgrades so testing will be needed.
- Support for authenticating via either Azure AD or Shared keys
- Removal of several azure config values (ConfigAPIVersion, ConfigUseHTTPS) and addition of ConfigUploadConcurrency.
- Removal of the multi-part upload code for Azure, this is now included in the SDK
- Addition of new presigned url tests into the general stow test/ module.
- Addition of a new terraform module to stand up an azure storage account for testing.

Signed-off-by: Terence Kent <terence.kent@mcg.com>
Signed-off-by: Terence Kent <terence.kent@mcg.com>
Accepting len(var)==0 vs var==""

Co-authored-by: Haytham Abuelfutuh <haytham@afutuh.com>
Signed-off-by: Terence Kent <terence.kent@mcg.com>
Accepting improved error message

Co-authored-by: Haytham Abuelfutuh <haytham@afutuh.com>
Signed-off-by: Terence Kent <terence.kent@mcg.com>
…ity at this point:

- Removed all inline context.Background() calls when using Azure SDK methods. Now set an explicit `ctx` variable at the top of each function.
- Added comments around the clock skew code, and moved all clock skews to 15 minutes (the azure-recommended value). Also left a comment explaining the decision.
- Made the use of removed config values an error state for validation. Added a test to ensure this happens.
- Moved some errant logging in the `checkMetadata` test method to only show up in inequality, to keep test output clean.

Signed-off-by: Terence Kent <terence.kent@mcg.com>
@EngHabu
Copy link

EngHabu commented Oct 10, 2023

Double checking you saw the comment about presigned URL performance in the PR description. It certainly doesn't seem like it will be an issue in Flyte from my testing, but that is a concern of mine.

Yeah noticed that, I believe that's alright... Depending on how you configure flyte, it can avoid issuing presigned urls to begin with..

When Flyte boots and the configured container does not yet exist, a race condition often shows up where flyte tries to create the storage container more than once. The first one succeeds, and subsequent ones cause flyte to exit because the SDK returns an error that a container is being created. Restarting flyte fixes this, but it's worth noting. You don't see this issue in the stow end-2-end tests.

Yeah that will be an issue (annoyance) I suspect. Why do you need to manually restart? why does it not recover on its own (i.e. Go into a crashloop backoff and eventually work) ?

@tkent
Copy link
Author

tkent commented Oct 10, 2023

Why do you need to manually restart? why does it not recover on its own (i.e. Go into a crashloop backoff and eventually work) ?

Oh, I should have been clear - it's only when running locally that I have to restart. When it's deployed in a cluster the pod just restarts so you might not even notice it.

@EngHabu
Copy link

EngHabu commented Oct 10, 2023

@tkent makes sense.. Is that a regression after moving to the "new" APIs? or did that behavior exist before?

… ignored failures due to creating an already-existing container! The code from the old SDK to ignore "already exists" was preserved, but not working with the new SDK.

- Updated the check to work with the new SDK.
-  Added a general test to confirm this behavior.

Note: It's debatable if this should be the default behavior for all stow implementations. However, it seems like whatever the desired behavior is, it should be consistent across all implementations.
Signed-off-by: Terence Kent <terence.kent@mcg.com>
@tkent
Copy link
Author

tkent commented Oct 10, 2023

@EngHabu Ha, you know, I never checked! Turns out it that this is a regression.

The previous implementation silently ignored "Already Exists" errors and the new one did not. I've updated the code to match the old behavior, added a test for it, and the issue no longer shows up.

There is a little awkwardness around ignoring "Already Exists" failures since it doesn't appear to be consistent between the various stow implementations, but it sure seems like the desired behavior.

@EngHabu
Copy link

EngHabu commented Oct 12, 2023

Thank you for the update, @tkent!
Is the PR ready for another round of reviews?

@tkent
Copy link
Author

tkent commented Oct 12, 2023

@EngHabu Yup! The changes should be minimal from last time.

Copy link

@EngHabu EngHabu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing all the feedback! and for your contribution

@EngHabu EngHabu merged commit 2b4d0dc into flyteorg:master Oct 19, 2023
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants