
Limit amount of tables dataset can import #1491

Open

zsaltys opened this issue Aug 20, 2024 · 1 comment

Comments

zsaltys (Contributor) commented Aug 20, 2024

One of our users imported a dataset with a few Glue tables, and then someone on their team accidentally ran a misconfigured Glue crawler that created 100,000+ Glue tables in their database.

This caused a LOT of issues:

a) RDS CPU spiked to the max, causing all sorts of problems, like the instance not being able to scale and nightly updates failing.
b) The table synchronizer for that dataset could never finish; it would run for a long time and more and more syncer task instances would pile up. Even after we deleted the tables in the Glue database it still could not cope, because it kept trying to sync and calling Lake Formation to fix permissions for the 100,000 non-existent tables, getting throttling errors.

We then tried to remove this dataset. Removing the shares was very difficult because, with the new UI, it's very hard to remove just the active S3 share item when you can't find it among 100,000 table items, so we had to resort to the CLI to remove the share items and then delete the shares.

Once we deleted the shares, we still couldn't delete the dataset. Eventually we had to manually delete the table records in RDS; even that was hard because syncer tasks were locking the records and we had to stop those first. We then had to run a custom script to clean up ES, because the reindexer does not remove invalid / dead records.

Overall, it's an absolute nightmare to recover when something like this happens.

My proposal: let's have a configurable limit on how many tables a dataset can have, defaulting to 100. That should be small enough for the syncer to finish running. We could also try to make the syncer more resilient, but IMO it's still bad to pollute the catalog with 100,000 tables. A rough sketch of the kind of check I mean is below.
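
Something along these lines (illustrative only — the config name `DATASET_MAX_GLUE_TABLES`, the default of 100, and where the check would live are all made up, not actual data.all code). The idea is that the syncer pages through the Glue tables and aborts early once the configured cap is exceeded, instead of ingesting everything:

```python
import os

import boto3

# Hypothetical config knob; the env var name and the default of 100 are illustrative.
MAX_TABLES_PER_DATASET = int(os.environ.get("DATASET_MAX_GLUE_TABLES", "100"))


def list_glue_tables_capped(database_name, limit=MAX_TABLES_PER_DATASET):
    """Page through the Glue tables of a database, aborting once the cap is exceeded."""
    glue = boto3.client("glue")
    tables = []
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        tables.extend(page["TableList"])
        if len(tables) > limit:
            # Fail fast instead of flooding RDS / the catalog / LF with 100,000+ tables.
            raise RuntimeError(
                f"Glue database '{database_name}' has more than {limit} tables; "
                "refusing to sync. Raise the limit explicitly if this is really intended."
            )
    return tables
```

The exact shape doesn't matter much; the point is to fail the sync loudly and early rather than let it grind against RDS and Lake Formation for hours.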

zsaltys (Contributor, Author) commented Aug 20, 2024

@anmolsgandhi fyi
