Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

V3.1.5 #89

Merged
merged 33 commits into from
Jun 13, 2022
Merged

V3.1.5 #89

merged 33 commits into from
Jun 13, 2022

Conversation

leelalagudu
Copy link
Contributor

@leelalagudu leelalagudu commented Jun 5, 2022

Issue #, if available: Fixes #88, Fixes #87, Fixes #86, Fixes #82

Description of changes:

  • Update permission set schema to support NotAction, NotResource, NotCondition keys as well as allow actions/resources elements to be specified as either as single string (or) array of strings. With these changes, the permission set object aligns completely to the schema spec supported by AWS SSO admin API for a permission set object, except for sessionDurationInMinutes. The solution still expects this to be a string instead of a number as it's necessary for ISO 86401 conversions.
  • Handle empty permission set description, session duration and relay state. When a permission set description is set to empty, the solution uses permission set name as the description. When relay state is empty, it would be set to empty by the solution. When session duration is set to empty, the solution would not set any value, however SSO API sets this value to a default of 60 minutes when unset. So, the permission set with an empty session duration set by the user is created in SSO with a value of 60 minutes
  • Enforce waiter blocks within the pagination process when the solution discovers the target list of accounts that fall under a root, ou_id, or account_tag scope , to avoid getting throttled.
  • Based on the configuration parameter SupportNestedOU , the solution would automatically discover all the accounts under the specified ou_id , and traverses the complete tree including any child OU's. Similarly, when this behaviour is enabled, in case of an account moving from one OU to another OU, the solution would also traverse the tree upwards until root to discover the complete list of parent OU's and compute the list of account assignments that are to be added/removed.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@leelalagudu
Copy link
Contributor Author

@jmejco , @tamara-h , the PR's security checks would fail as the CI build project's config file does not yet have SupportNestedOU parameter. Please update the CI repository with this config value, and re-trigger the security checks build project.

@allquixotic
Copy link
Contributor

Fails to self-mutate between current version in master and the new version because, in the OrgMain account, on updating the CF stack env-aws-sso-extensions-for-enterprise-orgEventsProcessorStack, envprocessTargetAccountSM fails because Resource handler returned message: "'arn:aws:iam::accountid:role/env-processTargetAccountSMRole' is not authorized to create managed-rule. (Service: AWSStepFunctions; Status Code: 400; Error Code: AccessDeniedException; Request ID: ... guid; ... etc

It's really a pain when the self-mutation fails :(

@allquixotic
Copy link
Contributor

allquixotic commented Jun 7, 2022

The recent fixes (current as of commit id 2958f9d) get rid of all the errors I’ve been seeing. However, when updating an existing permission set that had no inline policy and no managed policies, when I try to add AWS managed policies in the S3 permission set, it doesn’t provide any error over SNS, but it doesn’t update.

Here is my permission set data:

{
  “permissionSetName”: “foo”, “sessionDurationInMinutes”: “720”, 
  “managedPoliciesArnList”: [“arn:aws:iam::aws:policy/AdministratorAccess”, “arn:aws:iam::aws:policy/AWSSupportAccess”],
  “inlinePolicyDocument”: {},
  “relayState”: “https://us-east-1.console.aws.amazon.com/console/home?region=us-east-1#”,
  “tags”: [ { “Key”: “something”, “Value”: “someval” } ]
}

The previously existing permission set in AWS SSO has zero managed policies and no inline policy.

Also, even after the latest commit (2958f9d), I still get rate limit exceeded when pushing to S3 many (new) permission sets, permission set updates, new links data, or links data updates. I have to sleep for about 9 seconds between S3 PutObject calls to keep it from getting rate limited.

For clarification, I’m not getting rate limited by S3; it’s the Lambda functions within SSO Extensions that get rate limited by AWS SSO, far as I can tell.

@leelalagudu
Copy link
Contributor Author

Hi @allquixotic ,

I had re-factored the queueing design (now taking into consideration that Lambda-SQS trigger does not really abide by concurrency limits you set up for the lambda) such that using both a batch size of 1, along with message group ID's ranging between 0 and 9 (take the last digit of the account) and the queue being FIFO, this should theoretically stay within the 10 live calls threshold at any given time.

Rather than hyptohesising this further, we will be setting up load test environment with approximately ~50 accounts and validate /fix the behaviour.

Regarding the permission set scenario you described, using the same commit ID, I first created a PS with empty inline policy and managed policies array, and then updated it to reflect this:

{ "permissionSetName": "SysAdmin-ps", "sessionDurationInMinutes": "240", "managedPoliciesArnList": ["arn:aws:iam::aws:policy/AdministratorAccess", "arn:aws:iam::aws:policy/AWSSupportAccess"], "inlinePolicyDocument": {}, "relayState": "https://eu-west-1.console.aws.amazon.com/systems-manager/managed-instances?region=eu-west-1#", "tags": [ { "Key": "versionid", "Value": "01" }, { "Key": "team", "Value": "SysAdmins" } ] }
, and this change propagated successfully in my environment and I could see the change trickle down through SSO. Can you please check this once again? Only because I am unable to re-produce this locally

…reated prior to state machine; Tune retry and jitter back-off parameters to handle heavier loads
…nd then state machine; tune retry and jitter delay params for handling throttling scenarios in heavy load
@allquixotic
Copy link
Contributor

Regarding the permission set scenario you described, using the same commit ID, I first created a PS with empty inline policy and managed policies array, and then updated it to reflect this:

...

, and this change propagated successfully in my environment and I could see the change trickle down through SSO. Can you please check this once again? Only because I am unable to re-produce this locally

I tried it again, and it still does not attach the managed policies. It doesn't throw any error over SNS either. Is there some specific function I should look for logs?

@leelalagudu
Copy link
Contributor Author

Regarding the permission set scenario you described, using the same commit ID, I first created a PS with empty inline policy and managed policies array, and then updated it to reflect this:

...

, and this change propagated successfully in my environment and I could see the change trickle down through SSO. Can you please check this once again? Only because I am unable to re-produce this locally

I tried it again, and it still does not attach the managed policies. It doesn't throw any error over SNS either. Is there some specific function I should look for logs?

@allquixotic , when you are provisioning the changed permission set, can you see what's happening from logs in the log group /aws/lambda/env-permissionSetTopicProcessor please? This is where the delta is calculated and processed. If there are no logs populated here when you are pushing the updated PS through S3, can you check what's happening in the log group /aws/lambda/env-psCuHandler please? This is the user interface handler, and we will know if there's an issue at this layer.

Thank you,
Leela

@jjleigh
Copy link
Contributor

jjleigh commented Jun 9, 2022

We have tested the Nested OU and account assignment to orgs with over 40 accounts and that is all working. An issue that did arise was doing account assignment at the root level with 126 accounts. This fails in the step function with the error
Error - Organizations.InvalidInputException Cause - You specified an invalid value for nextToken. You must get the value from the response to a previous call to the API. (Service: Organizations, Status Code: 400, Request ID: 8979fd76-2be8-4ba7-9dc5-c1c5de413487)

Root mapping is only mapping to five accounts, no error in lambda or no error email. When the step function was checked the error above was noticed.

Mapping file - root%all%aws-aes-engg-sso-ps%aws-aes-engg-sso%GROUP%ssofile

step input for failing step below. I replaced the sensitive details

{ "action": "create", "entityType": "root", "instanceArn": "<instanceArn>", "permissionSetArn": "<instanceArn>", "principalId": "<principalId>", "principalType": "GROUP", "targetType": "AWS_ACCOUNT", "topicArn": "<topicArn>", "pageSize": 5, "waitSeconds": 2, "supportNestedOU": "true", "tagKey": "", "emptyString": "", "tagValues": "", "ou_id": "", "resourceTypeFilters": "organizations:account", "listRootAccounts": { "Accounts": [ { "Arn": "<arn>", "Email": "<email>", "Id": "<id>", "JoinedMethod": "CREATED", "JoinedTimestamp": "2021-03-01T12:11:33.059Z", "Name": "<name>", "Status": "ACTIVE" }, { "Arn": "<arn>", "Email": "<email>", "Id": "<id>", "JoinedMethod": "CREATED", "JoinedTimestamp": "2021-08-09T13:14:55.607Z", "Name": "<name>", "Status": "ACTIVE" }, { "Arn": "<arn>", "Email": "<email>", "Id": "<id>", "JoinedMethod": "CREATED", "JoinedTimestamp": "2021-08-14T19:10:37.341Z", "Name": "<name>", "Status": "ACTIVE" }, { "Arn": "<arn>", "Email": "<email>", "Id": "<id>", "JoinedMethod": "CREATED", "JoinedTimestamp": "2021-04-28T11:13:22.576Z", "Name": "<name>", "Status": "ACTIVE" }, { "Arn": "<arn>", "Email": "<email>", "Id": "<id>", "JoinedMethod": "CREATED", "JoinedTimestamp": "2021-03-01T12:08:52.478Z", "Name": "<name>", "Status": "ACTIVE" } ], "NextToken": "H4sIAAAAAAAAADVU3U6aQRB9F65709s+R1+gSbkwbWqiiRdtmqCiIr8qWP8QRSgggooKouzMPNJ5hc4c0mSymT17dvbMxxl+ZTaya+srqz8ynz5+yKxn11a+fF/5mf36efVb1rEM7AX6BPPow9pc32Fd6BR6A7uBzoNj98QNNoHtwy5g27ATWAVWhBVIKENL0AOoEy6hHlewFqwJO4XtwTZhVVgd2oBeQM+Jn3Gts8gBrBTX9S/sGvYG68GmlDejAI8FTGGJp3MiTn4mOIH2oW3oK7QDHUOH5LTjVJvQd+gCuhNhOWiLj+4ROQo9EZ6fkVnglSa3rUjskPrzMFd+DPV+vUiRdVpskHx5gIyQRpAbyC2kC0mQAaSH1IfccWs8XeITbhckTyFtbueQGeSFuIMCeWf+BNmB1CB/kCqQBiQP2YRUIbuQAqQF2YJsR6Qc5B5ShxxBLkk4J9kraFSIJ/xKk6dOO4mj1EaaIt1FC2mIdByy0zXxPnFD6sTqbUonJKVHpAnXBZlzEg6RhI2047mUWHASGlxYlD0ItckbOUJqUGSd0aLOKyYXbHAfUqTsCtcyb5WQikg15mcRfte/iT3AxuFkHfGnv4M9/nfCgm4ZwG5pGKGT3dId2mxG1zWYu+dfYSMmXrBGc5ajVKx5aIXeuw0jhbGfWaTNmhXmbtFdGqbAF7usP2aS41AMeDqhnU75LsckDOzT0WW1ARu5ht4HogJ9gL7RliNoIu6gT3EPOuPao+ZxyFO/PoyIWZhwLoa0utfPU/bS/45U6fwy3X4V3elyQJYDnqPtd5lscfar/B8o8WtskVaDbmd+/wMn6ghzcgQAAA==" } }

@leelalagudu
Copy link
Contributor Author

Hi @jjleigh ,

We're unable to reproduce the root issue that you are seeing with the step functions. We've tried the below use cases in 2 of our dev environments and are unable to reproduce the issue. @jmejco had done a stellar job of setting up an AWS org with ~ 60 accounts , so that we could stress test the solution and document processing teams for average loads. These tests were done as part of this validation

  • He triggered 300 create account assignments in parallel using root scope and did not see the exception you're seeing
  • He then triggered 300 delete acccount assignments in parallel using root scope and did not see the exception you're seeing
  • On a much smaller environment ~ 6 accounts , I used the same permission set name and group name that you're using and triggered root scope assignment with pageSize reduced to 1, and the step function looped through 6 times using nextToken parameter successfully.

As this is specific to your environment, could you please validate the following:

  • Re-run a root scope account assignment and see if you're seeing the same exception. If you are , can you take the nextToken value from the execution payload and run a CLI command with that to see if that works. For ex aws organizations list-accounts --page-size 5 --next-token <your token from execution payload> ?

@leelalagudu
Copy link
Contributor Author

Hi @allquixotic , we are in the process of doing more stress tests on the solution, but so far we've validated ~ 300 account assignment create/delete operations and ensured that the throttle tuning works with 0 failures.

If it's feasible for you , could you try the same bulk load flow you were trying earlier with the latest commit ID please? Only if that's feasible for you.

Thank you,
Leela

@jjleigh
Copy link
Contributor

jjleigh commented Jun 10, 2022

Hi @jjleigh ,

We're unable to reproduce the root issue that you are seeing with the step functions. We've tried the below use cases in 2 of our dev environments and are unable to reproduce the issue. @jmejco had done a stellar job of setting up an AWS org with ~ 60 accounts , so that we could stress test the solution and document processing teams for average loads. These tests were done as part of this validation

  • He triggered 300 create account assignments in parallel using root scope and did not see the exception you're seeing
  • He then triggered 300 delete acccount assignments in parallel using root scope and did not see the exception you're seeing
  • On a much smaller environment ~ 6 accounts , I used the same permission set name and group name that you're using and triggered root scope assignment with pageSize reduced to 1, and the step function looped through 6 times using nextToken parameter successfully.

As this is specific to your environment, could you please validate the following:

  • Re-run a root scope account assignment and see if you're seeing the same exception. If you are , can you take the nextToken value from the execution payload and run a CLI command with that to see if that works. For ex aws organizations list-accounts --page-size 5 --next-token <your token from execution payload> ?

We tried this and got a message saying: Cannot specify --no-paginate along with pagination arguements: --page-size

so we tried without the page size flag and got the same error from the step function that I shared above.

@jmejco
Copy link
Contributor

jmejco commented Jun 10, 2022

@jjleigh @leelalagudu

The correct argument to provide the NextToken in the CLI is --starting-token.

@allquixotic
Copy link
Contributor

Hi, with the latest commits, I uploaded one item to links_data per second and received no rate limit issues. This is much improved from previous commits.

However, I still can reproduce two issues that may be outside the scope of this PR:

  1. When deleting a permission set attached to an account, I get an SNS error email saying "Cannot delete permission set as there are existing links that reference the permission set". Why can't it remove those links before deleting it?

  2. When uploading a new copy of the deleted permission set (deleted from S3, not from SSO) with AWS managed policies, it doesn't add the AWS managed policies.

@allquixotic
Copy link
Contributor

There's also a race condition, where, while waiting for accounts to be deprovisioned (which can take minutes if you have dozens of accounts provisioned for the same permission set), trying to delete the permission set won't work.

So to do something like clear the links_data and permission_set data in S3 for a given permission set, you first have to delete the links_data, wait, then once it's deprovisioned, delete the permission_set file.

@leelalagudu
Copy link
Contributor Author

Hi @allquixotic ,

  • AWS SSO admin API does not allow deleting a permission set that already has account assignments/other applications provisioned. The solution enforces the same constraint , which is why you see the SNS error about permission set having existing links.
  • While it's technically feasible to delete all the related account assignments through the solution when a permission set delete operation is triggered, we prefer to de-couple these use cases allowing customers to manage permission sets and account assignments individually through the solution.
  • The sequence flow you described for deleting a permission set is correct, that's the sequence we intend customers to adapt.
  • The race condition you observe is an inherent feature of the service behaviour, i.e. the service does not allow orpahaned account assignments from a permission set perspective.
  • The 2nd issue that you described in your last comment i.e. re-uploading a modified copy of the deleted permission set file, I just validated the exact flow in 3.1.5 and can confirm the following:
  1. When deleting a permission set from S3 that has already been provisioned to accounts through the solution, the S3 file would be deleted as it's a user action, however an error notification is sent out similar to what you've received, about linked account assignment.
  2. When uploading the same permission set copy to S3, the solution verifies there's no delta and does not do any changes.
  3. When uploading a modified permission set to S3 (executed after step 1) , change being added managed policies, the solution calculates this delta , updates the permission set, and re-provisions the updated permission set to all the provisioned accounts.
  • I am unable to reproduce the issue you're observing. Can you compare the sequence above with what you've executed and let us know if you would want us to try a different flow?
  • Additionally, @jmejco and I have significantly optimised the scaling design of the solution. I believe you will find these details that are added to the PR interesting .

Thank you,
Leela

@allquixotic
Copy link
Contributor

I'm OK with most of what you wrote and agree, but I tried something completely new:

First, I deleted my permission set (after removing all the account links -- I had to do 9 of them by hand, because for some reason, deleting all the links data relative to this permset didn't delete all the account links, and none of them were established by hand)

Once confirming the permission set did not exist at all in AWS SSO, I re-uploaded the .json into permission_sets, and the account links into links_data. The link was root%all%PermSet-ps%GroupNameHere%GROUP%ssofile.

Then, confusingly, I neither received an SNS error notification about this, nor did the permission set get created.

So, something is failing at a low level with this particular permission set, so that it just won't create it, and won't tell me why it can't via the usual error email.

Here is the permission set verbatim, except that something in the permissionSetName is replaced with 16 lowercase and uppercase letters that are specific to our business case.

{
  "permissionSetName": "something",
  "sessionDurationInMinutes": "720",
  "managedPoliciesArnList": [
    "arn:aws:iam::aws:policy/AdministratorAccess",
    "arn:aws:iam::aws:policy/AWSSupportAccess"
  ],
  "inlinePolicyDocument": {},
  "relayState": "https://us-east-1.console.aws.amazon.com/console/home?region=us-east-1#",
  "tags": [
    {
      "Key": "versionid", 
      "Value": "01"
    }
  ]
}

I'll note that SSO Extensions does create and provision many other permission sets successfully, but this specific one just doesn't work for some reason. I'll check the log groups you suggested earlier when I have a chance.

@leelalagudu
Copy link
Contributor Author

Thank you for the updates @allquixotic . When you do get a chance, please have a look at those log files and see what's happening with this specific permission set. If the issue persists, please can you create a new issue so that this could be tracked separately.

For now, the PR would be approved and merged in the next few days as all the target stories for the PR are now handled. Do let us know if you still see issues that are in the scope of this PR i.e. nested OU support (both for provisioning & self sustaining flows), handle throttling for any organizations load, updated permission set schema definitions to align with standard AWS SSO admin API.

Thank you,
Leela

@allquixotic
Copy link
Contributor

allquixotic commented Jun 13, 2022

Thanks, I agree, this PR seems to resolve the issues it claims to. No need to hold the PR for other unrelated issues. Just reporting the results of my testing.

Tell you what; to make it easier to forget about the messages in this PR comments, I'll open a new issue with my findings about that bad permission set that isn't being created or updated.

@jmejco jmejco requested review from jmejco, tamara-h and vpegg June 13, 2022 06:42
@jmejco jmejco merged commit 15ddc93 into aws-samples:main Jun 13, 2022
@leelalagudu leelalagudu deleted the v3.1.5 branch July 22, 2022 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants