Generic shares_base module and specific s3_datasets_shares module - part 9 (share db repositories) (#1351)

### Feature or Bugfix
- Refactoring

### Detail
As explained in the design for #1123 and #1283, we are implementing generic `datasets_base` and `shares_base` modules that can be used by any type of dataset and by any type of shareable object in a generic way.

This PR includes:
- Split the `ShareObjectRepository` from s3_datasets_shares into:
  - `ShareObjectRepository` (shares_base): generic db operations on share objects, with no references to S3 or Glue.
  - `ShareStatusRepository` (shares_base): db operations related to the sharing state machine states, a way to split the db operations into smaller chunks.
  - `S3ShareObjectRepository` (s3_datasets_shares): db operations on S3 share objects, used only in s3_datasets_shares. They may contain references to DatasetTables, S3Datasets, etc., and are used in clean-up activities and to count resources in an environment.
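
A hedged sketch of this layering: the class names come from the PR, but the method names, bodies, and the dict-based "session" are purely illustrative stand-ins for the real SQLAlchemy-backed repositories.

```python
# Illustrative sketch of the repository split (not the actual data.all code).
# ShareObjectRepository / ShareStatusRepository live in shares_base and stay
# generic; S3ShareObjectRepository (s3_datasets_shares) may reference
# S3-specific models. A plain dict stands in for the db session.

class ShareObjectRepository:
    """Generic db operations on share objects - no S3/Glue references."""

    @staticmethod
    def get_share_by_uri(session, share_uri):
        # hypothetical generic lookup keyed only on the share uri
        return session.get('shares', {}).get(share_uri)


class ShareStatusRepository:
    """Db operations related to the sharing state machine states."""

    @staticmethod
    def get_share_items_by_status(session, share_uri, statuses):
        items = session.get('share_items', [])
        return [i for i in items if i['shareUri'] == share_uri and i['status'] in statuses]


class S3ShareObjectRepository(ShareObjectRepository):
    """S3-only operations: may reference DatasetTables, S3Datasets, etc."""

    @staticmethod
    def count_s3_items_in_environment(session, env_uri, s3_item_types):
        # used in clean-up activities and resource counting (sketch)
        items = session.get('share_items', [])
        return sum(
            1 for i in items
            if i['environmentUri'] == env_uri and i['itemType'] in s3_item_types
        )
```

The point of the split is that shares_base never imports the S3 subclass; only s3_datasets_shares does.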

- Adapt `S3ShareObjectRepository` to S3 objects. Some queries needed filters on the type of share items retrieved, so that the code still makes sense if anyone adds a new share type in the future. For clarity, some functions are renamed to point out that they are S3 functions or what they do.
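
For illustration only, a type filter of this kind might look like the following; the helper name and the plain-list representation are assumptions, not the real SQLAlchemy query. The type names mirror the `ShareableType` values used elsewhere in this PR (Table, S3Bucket, StorageLocation).

```python
# Hedged sketch: restrict a result set to S3 item types so that a future
# non-S3 share type is not silently swept up by S3-specific clean-up logic.
S3_SHAREABLE_TYPES = {'Table', 'S3Bucket', 'StorageLocation'}

def filter_s3_share_items(share_items):
    """Keep only share items whose itemType is an S3 shareable type."""
    return [item for item in share_items if item['itemType'] in S3_SHAREABLE_TYPES]
```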

- Make `ShareObjectRepository` completely generic. The following queries needed extra work:
  - `ShareObjectRepository.get_share_item`: renamed to `get_share_item_details`.
  - `list_shareable_items`: split into two parts, `list_shareable_items_of_type` + `paginated_list_shareable_items`. The first function is invoked recursively over the list of share processors: instead of querying DatasetTable, DatasetStorageLocation and DatasetBucket, we query the shareable_type. The challenge is to get all fields from the db Resource object that all of them are built upon; in particular, the field `itemName` does not match the BucketName (in buckets) or the S3Prefix (in folders). For this reason I added a migration script that backfills `DatasetBucket.name` with `DatasetBucket.S3BucketName` and `DatasetStorageLocation.name` with `DatasetStorageLocation.S3Prefix`. `paginated_list_shareable_items` joins the list of subqueries, then filters and paginates.
  - In `verify_dataset_share_objects`, I replaced `list_shareable_items` with `ShareObjectRepository.get_all_share_items_in_share`, which does not need tables or storage locations, avoiding the whole S3 logic and unnecessary queries.
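
As a rough illustration of the two-step listing, the sketch below uses plain Python lists in place of SQLAlchemy subqueries; `ProcessorDef` and its `fetch_items` callable are simplified stand-ins for the real `ShareProcessorDefinition`, and the helper bodies are assumptions rather than the actual implementation.

```python
# Sketch: one "subquery" per registered processor type, then a single place
# that unions, filters, and paginates the combined rows.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ProcessorDef:
    shareable_type: str                       # e.g. 'Table', 'S3Bucket'
    fetch_items: Callable[[str], List[dict]]  # stands in for a per-type subquery

def list_shareable_items_of_type(share_uri, processor):
    # tag each row with its type, mirroring the per-processor subqueries
    return [{'itemType': processor.shareable_type, **row}
            for row in processor.fetch_items(share_uri)]

def paginated_list_shareable_items(share_uri, processors, page=1, page_size=10):
    # union of all per-type results, paginated in one place
    rows = []
    for proc in processors:
        rows.extend(list_shareable_items_of_type(share_uri, proc))
    start = (page - 1) * page_size
    return rows[start:start + page_size]
```

In the real code the per-type results are SQL subqueries joined and paginated in the database, not Python lists, but the control flow is the same.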

- Remove S3 references from shares_base.api.resolvers; use `DatasetBase` and `DatasetBaseRepository` instead. Remove the unused `ShareableObject`.

- I ran into circular dependency problems, so I created the `ShareProcessorManager` in shares_base for the registration of processors. The `SharingService` uses the manager to get all processors.
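
A minimal sketch of such a registry, assuming only the shape of the registration calls visible in the diff (`register_processor(ShareProcessorDefinition(...))`); the field names and the getter are illustrative.

```python
# Sketch of a processor registry like ShareProcessorManager: modules register
# their processors at load time, and the sharing service pulls them from the
# registry instead of importing the modules directly, breaking the cycle.
from dataclasses import dataclass
from typing import Any, List

@dataclass
class ShareProcessorDefinition:
    type: Any            # e.g. ShareableType.Table
    Processor: Any       # e.g. ProcessLakeFormationShare
    shareable_type: Any  # e.g. DatasetTable
    shareable_uri: Any   # e.g. DatasetTable.tableUri

class ShareProcessorManager:
    SHARING_PROCESSORS: List[ShareProcessorDefinition] = []

    @classmethod
    def register_processor(cls, processor: ShareProcessorDefinition) -> None:
        cls.SHARING_PROCESSORS.append(processor)

    @classmethod
    def get_registered_processors(cls) -> List[ShareProcessorDefinition]:
        return cls.SHARING_PROCESSORS
```

Because the registry lives in shares_base and holds only definitions, s3_datasets_shares can register its processors without shares_base ever importing the S3 module.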

Missing items for Part 10:
- Lake Formation cross-region table references in `shares_base/services/share_item_service.py:add_shared_item`
- Remove table references in `shares_base/services/share_item_service.py:remove_shared_item`
- Remove s3_prefix references in `shares_base/services/share_notification_service.py:notify_new_data_available_from_owners`
- RENAMING! Right now the names are a bit misleading

### Relates
- #1283 
- #1123 
- #955 


### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use standard, proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx committed Jun 21, 2024
1 parent e461fa1 commit 342798c
Showing 27 changed files with 1,008 additions and 1,184 deletions.
```diff
@@ -20,6 +20,7 @@ def create_dataset_bucket(session, dataset: S3Dataset, data: dict = None) -> Dat
             KmsAlias=dataset.KmsAlias,
             imported=dataset.imported,
             importedKmsKey=dataset.importedKmsKey,
+            name=dataset.S3BucketName,
         )
         session.add(bucket)
         session.commit()
```
25 changes (16 additions, 9 deletions) in `backend/dataall/modules/s3_datasets_shares/__init__.py`:

```diff
@@ -46,9 +46,13 @@ def is_supported(modes: List[ImportMode]):
     def depends_on() -> List[Type['ModuleInterface']]:
         from dataall.modules.notifications import NotificationsModuleInterface
         from dataall.modules.s3_datasets import DatasetAsyncHandlersModuleInterface
-        from dataall.modules.shares_base import SharesBaseModuleInterface
+        from dataall.modules.shares_base import SharesBaseAsyncHandlerModuleInterface
 
-        return [DatasetAsyncHandlersModuleInterface, NotificationsModuleInterface, SharesBaseModuleInterface]
+        return [
+            DatasetAsyncHandlersModuleInterface,
+            NotificationsModuleInterface,
+            SharesBaseAsyncHandlerModuleInterface,
+        ]
 
     def __init__(self):
         log.info('S3 Sharing handlers have been imported')
@@ -83,7 +87,10 @@ def depends_on() -> List[Type['ModuleInterface']]:
         return [SharesBaseECSTaskModuleInterface, NotificationsModuleInterface]
 
     def __init__(self):
-        from dataall.modules.shares_base.services.sharing_service import SharingService, SharingProcessorDefinition
+        from dataall.modules.shares_base.services.share_processor_manager import (
+            ShareProcessorManager,
+            ShareProcessorDefinition,
+        )
         from dataall.modules.shares_base.services.shares_enums import ShareableType
         from dataall.modules.s3_datasets.db.dataset_models import DatasetTable, DatasetBucket, DatasetStorageLocation
         from dataall.modules.s3_datasets_shares.services.share_processors.glue_table_share_processor import (
@@ -96,18 +103,18 @@ def __init__(self):
             ProcessS3AccessPointShare,
         )
 
-        SharingService.register_processor(
-            SharingProcessorDefinition(
+        ShareProcessorManager.register_processor(
+            ShareProcessorDefinition(
                 ShareableType.Table, ProcessLakeFormationShare, DatasetTable, DatasetTable.tableUri
             )
         )
-        SharingService.register_processor(
-            SharingProcessorDefinition(
+        ShareProcessorManager.register_processor(
+            ShareProcessorDefinition(
                 ShareableType.S3Bucket, ProcessS3BucketShare, DatasetBucket, DatasetBucket.bucketUri
            )
        )
-        SharingService.register_processor(
-            SharingProcessorDefinition(
+        ShareProcessorManager.register_processor(
+            ShareProcessorDefinition(
                 ShareableType.StorageLocation,
                 ProcessS3AccessPointShare,
                 DatasetStorageLocation,
```