This repository has been archived by the owner on May 16, 2024. It is now read-only.
Sync: Fix page by page query in sync, leading to duplicated entries #667
Open
keymon wants to merge 11 commits into common-fate:release/v0.15 from keymon:confluent/hector-v0.15-add-ddb-commands
Conversation
A simple helper to automatically run paginated queries with ddb
This is useful for troubleshooting the internal DB state. It uses paginated queries.
Due to a bug in the sync process, which does not consume paginated results from DynamoDB, we are getting duplicated users in DynamoDB. This command finds all duplicated users by email, keeping only the first one (most recent CreatedAt). It reports the duplicated users and allows deleting them from DynamoDB. It supports options to limit the number of users to list, a dry-run mode, a maximum number of duplicates to delete, etc.
We must only pass unique keys to the deletion task, or DynamoDB complains about duplicates (even on a delete!).
To speed up the deletion of >600k entries
We might have thousands of users, and ddb.Query() returns a paginated response. We must read all of the pages for sync to work correctly. Otherwise, the sync process considers that the users returned by the IdP do not exist, and creates new entries on each iteration.
Detect and report duplicated users when syncing, returning the oldest. This is important if there are duplicates in the DB, to be sure we use the oldest one and provide a consistent state. Logging is useful for finding out that the bug exists.
Also, do not log all duplicated emails, as there can be many.
Comes from #665
In this PR we:
Modify the DynamoDB queries in the sync lambda to iterate
the query page by page.
Provide commands in the dev-cli tool to get users and groups.
Provide a command-line tool to delete duplicated users in DynamoDB.
Why?
Currently the sync process does not iterate the paginated responses from
DynamoDB, reading around 2700 users maximum (Active and Archived).
This leads to a bug in the sync process, where it creates
duplicated entries in the DB on each sync, with new IDs, and updates
the groups to point to these new users. In the meantime, the users
might still have access to Glide, but not be members of the groups.
In detail:
was partial, it will be considered a new user.
Observed behaviour:
on groups. But can be added individually.
How to trigger the issue:
by syncing >2700 users. Users can be later deactivated.
Workaround and remediation
In our case our DB grew to >650k users. We were forced to create
a CLI tool, included here, to clean up DynamoDB.
This tool can run workers in parallel, but it does not handle failures/retries well. It is safe to rerun, and might require multiple reruns.
How did you test it?
Tested manually. Unit tests mocking
the ddb library are still pending.
The CLI tool was tested manually.
Potential risks
If somebody was experiencing this bug, they might have hundreds of thousands of users. That means the page-by-page query will take a long time to run, and sync might exceed its allowed run time.
One workaround is to just run the tool to delete the duplicates, but you might also want to extend the lambda timeout.
Is this a patch release candidate?