Commit
Generic shares_base module and specific s3_datasets_shares module - part 9 (share db repositories) (#1351)

### Feature or Bugfix
- Refactoring

### Detail
As explained in the design for #1123 and #1283, we are trying to implement generic `datasets_base` and `shares_base` modules that can be used by any type of dataset and by any type of shareable object in a generic way.

This PR includes:
- Split the ShareobjectRepository from s3_datasets_shares into:
  - `ShareobjectRepository` (shares_base) - generic db operations on share objects - no references to S3, Glue
  - `ShareStatusRepository` (shares_base) - db operations related to the sharing state machine states - a way to split the db operations into smaller chunks
  - `S3ShareobjectRepository` (s3_datasets_shares) - db operations on s3 share objects - used only in s3_datasets_shares. They might contain references to DatasetTables, S3Datasets... They are used in the clean-up activities and to count resources in an environment.
- Adapt `S3ShareobjectRepository` to S3 objects. Some queries needed filters on the type of share items retrieved, so that the code still makes sense if anyone adds a new share type in the future. To add some more meaning, some functions are renamed to clearly point out that they are s3 functions or what they do.
- Make `ShareobjectRepository` completely generic. The following queries needed extra work:
  - ShareObjectRepository.get_share_item - renamed as `get_share_item_details`
  - `list_shareable_items` - split in 2 parts, `list_shareable_items_of_type` + `paginated_list_shareable_items`. The first function is invoked recursively over the list of share processors: instead of querying the DatasetTable, DatasetStorageLocation and DatasetBucket, we query the shareable_type. The challenge is to get all fields from the db Resource object that all of them are built upon. In particular, the field `itemName` does not match the BucketName (in buckets) or the S3Prefix (in folders). For this reason I added a migration script to backfill DatasetBucket.name with DatasetBucket.S3BucketName and DatasetStorageLocation.name with DatasetStorageLocation.S3Prefix. `paginated_list_shareable_items` joins the list of subqueries, filters and paginates.
- In verify_dataset_share_objects, instead of using list_shareable_items, I replaced it with `ShareObjectRepository.get_all_share_items_in_share`, which does not need tables or storage, avoiding the whole S3 logic and unnecessary queries.
- Remove S3 references from shares_base.api.resolvers. Use DatasetBase and DatasetBaseRepository instead. Remove unused `ShareableObject`.
- I had some problems with circular dependencies, so I created the `ShareProcessorManager` in shares_base for the registration of processors. The SharingService uses the manager to get all processors.

Missing items for Part10:
- Lake Formation cross region table references in shares_base/services/share_item_service.py:add_shared_item
- remove table references in shares_base/services/share_item_service.py:remove_shared_item
- remove s3_prefix references shares_base/services/share_notification_service:notify_new_data_available_from_owners
- RENAMING! Right now the names are a bit misleading

### Relates
- #1283
- #1123
- #955

### Security
Please answer the questions below briefly where applicable, or write `N/A`. Based on [OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)?
  - Is the input sanitized?
  - What precautions are you taking before deserializing the data you consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use standard, proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
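For readers who want a concrete picture of the processor registration and the two-step listing described under Detail, here is a minimal sketch in plain Python. It is not the code from this PR: the real functions build database subqueries, while this version uses in-memory lists. The names `ShareableItem`, `ShareProcessorDefinition`, `register_processor`, `all_processors`, and every function signature are assumptions made for illustration. Only `ShareProcessorManager`, `list_shareable_items_of_type` and `paginated_list_shareable_items` are names taken from the description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ShareableItem:
    # Hypothetical common shape: fields assumed to exist for every shareable type.
    item_uri: str
    item_type: str
    item_name: str


@dataclass(frozen=True)
class ShareProcessorDefinition:
    # Hypothetical: a processor declares its type and how to list its items.
    shareable_type: str
    list_items: Callable[[], List[ShareableItem]]


class ShareProcessorManager:
    """Illustrative registry; the real class in shares_base may look different."""

    _processors: Dict[str, ShareProcessorDefinition] = {}

    @classmethod
    def register_processor(cls, processor: ShareProcessorDefinition) -> None:
        cls._processors[processor.shareable_type] = processor

    @classmethod
    def all_processors(cls) -> List[ShareProcessorDefinition]:
        return list(cls._processors.values())


def list_shareable_items_of_type(processor: ShareProcessorDefinition) -> List[ShareableItem]:
    # Stand-in for the per-type subquery: no knowledge of tables, folders or buckets.
    return processor.list_items()


def paginated_list_shareable_items(term: str = "", page: int = 1, page_size: int = 10) -> dict:
    # Combine the per-type results, then filter and paginate once over the whole set.
    items: List[ShareableItem] = []
    for processor in ShareProcessorManager.all_processors():
        items.extend(list_shareable_items_of_type(processor))
    items = [i for i in items if term.lower() in i.item_name.lower()]
    items.sort(key=lambda i: (i.item_type, i.item_name))
    start = (page - 1) * page_size
    return {"count": len(items), "page": page, "nodes": items[start:start + page_size]}


# Each module registers its own processors; the generic code never imports them.
ShareProcessorManager.register_processor(
    ShareProcessorDefinition(
        "Table",
        lambda: [
            ShareableItem("t-1", "Table", "orders"),
            ShareableItem("t-2", "Table", "visitors"),
        ],
    )
)
ShareProcessorManager.register_processor(
    ShareProcessorDefinition(
        "S3Bucket",
        lambda: [ShareableItem("b-1", "S3Bucket", "orders-raw")],
    )
)

result = paginated_list_shareable_items(term="orders", page=1, page_size=2)
print(result["count"], [node.item_name for node in result["nodes"]])
```

In this sketch the generic listing code depends only on the registry, which is one way to avoid the circular imports mentioned above: a new share type would register a processor rather than require changes to the listing functions.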