Sync: Fix page by page query in sync, leading to duplicated entries #667

keymon · 2024-01-31T11:26:43Z

Comes from #665

In this PR we:

Modify the DynamodDB queries in the sync lambda to iterate
the query page by page.
Provide commands in dev-cli tool to allow get users and groups
Provide a command line tool to delete duplicated users in dynamodb

Why?

Currently sync process does not iterate the paginated responses from
dynamodb, reading around 2700 users maximum, Active and Archived.

This can lead to a bug in the sync proces, where it will create
duplicated entries in DB in each sync, with new IDs, and updating
the groups to point to these users. In the meantime, the users
might still have access to Glide, but not be members of the groups.

In detail:

Sync reads the first 2700 user entries in dynamodb, missing any other
Sync reads all IDP users
Sync tries to find if the user is already in DB. As the query
was partial, it will be considered a new user.
A new user ID is created, with a new PK/SK
Groups are updated to point to the new ID

Observed behaviour:

A random set of users would not have access to any rule based
on groups. But can be added individually.
DynamoDB table grows infinitely in each sync

How to trigger the issue:

Try to get more than >2700 users in the dynamodb, for instance
by syncing >2700 users. Users can be later deactivated.

Workaround and remediation

In our case our DB grew to >650k users. We were forced to create
a cli tool to cleanup dynamodb that is included.

This tool can run workers in parallel, but does not handle well failure/retries. It is safe to rerun and might require multiple reruns.

How did you test it?

Tested manually. Pending implement some unit testing mocking
the ddb library.

Cli tool tested manually.

Potential risks

If somebody was experiencing this bug, they might have 100s of 1000s of users. that means the page by page query will take long to run and it might cause sync to excess the time to run.

One workaround is just run the tool to delete duplicates, but you might want to extend the lambda timeout.

Is patch release candidate?

A simple helper to automatically run paginated queries with ddb

This is useful to troubleshoot the internal DB state. Uses paginated queries

Due a bug in the sync process, that does not consume paginated results from dynamodb, we are getting duplicated users in dynamodb. This command searches all duplicated users based on email, keeping only the first one (most recent CreatedAt). It reports back the duplicated users and allow to delete them in dynamodb. Supports options to limit the users to list, dryrun mode, maximum duplicates to delete, etc.

We must only pass the unique keys to the deletion task, or dynamodb complains by duplicates (even on a delete!)

To speed up the deletion of >600k entries

We might have 1000s of users, and the ddb.Query() will return a paginated response. We must read all of the pages in order to work. Otherwise, the sync process will consider that the users retuned by the IDP do not exist, and create new entries in each iteration.

detect and report duplicated users when syncing, returning the oldest. This is important if there are duplicates in DB to be sure we use the oldest and provide a consistent state. Logging is Good to find out about the existence of the bug

Also do not log all duplicated emails, can be many

keymon · 2024-02-01T10:47:35Z

We had to delete 100s of 1000s using the cli. But this in one example with one:

Fixed with:

./bin/commonfate ddb delete-duplicated-users --workers 20  --name Granted    | tee ~/tmp/delete-duplicated-users.report.$(date +%Y%m%d-%H%M%S).json > /dev/null 
Got 3041 users
Calculating duplicates...
Found 1 duplicated users
- user@company.com: 1
WARNING: Dry-run mode, skipping deletion...

./bin/commonfate ddb delete-duplicated-users --workers 20  --name Granted  --dry-run=false  | tee ~/tmp/delete-duplicated-users.report.$(date +%Y%m%d-%H%M%S).json > /dev/null 
Got 3041 users
Calculating duplicates...
Found 1 duplicated users
 - user@company.com: 1
Deleting duplicates at 2024-02-01T10:50:34Z using 20 workers...
Deleting 1 dups for email=shkumar@confluent.io at 2024-02-01T10:50:34Z...
Done deleting dups for email=shkumar@confluent.io Duration=173.408125ms...
Deletion complete at 2024-02-01T10:50:35Z. Duration 173.922ms...

keymon added 11 commits January 30, 2024 11:10

Add helper to run paginated queries with ddb

19af1e2

A simple helper to automatically run paginated queries with ddb

Extend devcli to add commands to dump users+groups

2bbbd4c

This is useful to troubleshoot the internal DB state. Uses paginated queries

delete-duplicated-users: allows filter by email

1600265

devcli: Delete duplicate users by PK/SK

8d3d73a

We must only pass the unique keys to the deletion task, or dynamodb complains by duplicates (even on a delete!)

dev-cli: get-users by email

f75ab02

Refactor delete-duplicated-users to iterate by mails

8f580e7

Implement multithreaded deletion

a12b8f0

To speed up the deletion of >600k entries

Additional logging in sync process

3272ca3

Also do not log all duplicated emails, can be many

github-actions bot deployed to ci-for-pr-667 January 31, 2024 16:24 View deployment

keymon changed the title ~~Sync: Fix page by page query in sync, leading to duplicated entries ### What changed?~~ Sync: Fix page by page query in sync, leading to duplicated entries Feb 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync: Fix page by page query in sync, leading to duplicated entries #667

Sync: Fix page by page query in sync, leading to duplicated entries #667

keymon commented Jan 31, 2024 •

edited

keymon commented Feb 1, 2024 •

edited

Sync: Fix page by page query in sync, leading to duplicated entries #667

Are you sure you want to change the base?

Sync: Fix page by page query in sync, leading to duplicated entries #667

Conversation

keymon commented Jan 31, 2024 • edited

Why?

Workaround and remediation

How did you test it?

Potential risks

Is patch release candidate?

keymon commented Feb 1, 2024 • edited

keymon commented Jan 31, 2024 •

edited

keymon commented Feb 1, 2024 •

edited