Generic shares_base module and specific s3_datasets_shares module - part 9 (share db repositories) (#1351)

### Feature or Bugfix
- Refactoring

### Detail
As explained in the design for #1123 and #1283, we are implementing generic `datasets_base` and `shares_base` modules that can be used by any type of dataset and by any type of shareable object in a generic way.

This PR includes:
- Split the `ShareObjectRepository` from s3_datasets_shares into:
  - `ShareObjectRepository` (shares_base): generic db operations on share objects, with no references to S3 or Glue.
  - `ShareStatusRepository` (shares_base): db operations related to the sharing state machine states, a way to split the db operations into smaller chunks.
  - `S3ShareObjectRepository` (s3_datasets_shares): db operations on S3 share objects, used only in s3_datasets_shares. They may contain references to DatasetTables, S3Datasets, etc., and are used in clean-up activities and to count resources in an environment.
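
A hedged sketch of this layering: the class names come from the PR, but the method names, bodies, and the dict-based "session" are purely illustrative stand-ins for the real SQLAlchemy-backed repositories.

```python
# Illustrative sketch of the repository split (not the actual data.all code).
# ShareObjectRepository / ShareStatusRepository live in shares_base and stay
# generic; S3ShareObjectRepository (s3_datasets_shares) may reference
# S3-specific models. A plain dict stands in for the db session.

class ShareObjectRepository:
    """Generic db operations on share objects - no S3/Glue references."""

    @staticmethod
    def get_share_by_uri(session, share_uri):
        # hypothetical generic lookup keyed only on the share uri
        return session.get('shares', {}).get(share_uri)


class ShareStatusRepository:
    """Db operations related to the sharing state machine states."""

    @staticmethod
    def get_share_items_by_status(session, share_uri, statuses):
        items = session.get('share_items', [])
        return [i for i in items if i['shareUri'] == share_uri and i['status'] in statuses]


class S3ShareObjectRepository(ShareObjectRepository):
    """S3-only operations: may reference DatasetTables, S3Datasets, etc."""

    @staticmethod
    def count_s3_items_in_environment(session, env_uri, s3_item_types):
        # used in clean-up activities and resource counting (sketch)
        items = session.get('share_items', [])
        return sum(
            1 for i in items
            if i['environmentUri'] == env_uri and i['itemType'] in s3_item_types
        )
```

The point of the split is that shares_base never imports the S3 subclass; only s3_datasets_shares does.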

- Adapt `S3ShareObjectRepository` to S3 objects. Some queries needed filters on the type of share items retrieved, so that the code still makes sense if anyone adds a new share type in the future. For clarity, some functions are renamed to point out that they are S3 functions or what they do.
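
For illustration only, a type filter of this kind might look like the following; the helper name and the plain-list representation are assumptions, not the real SQLAlchemy query. The type names mirror the `ShareableType` values used elsewhere in this PR (Table, S3Bucket, StorageLocation).

```python
# Hedged sketch: restrict a result set to S3 item types so that a future
# non-S3 share type is not silently swept up by S3-specific clean-up logic.
S3_SHAREABLE_TYPES = {'Table', 'S3Bucket', 'StorageLocation'}

def filter_s3_share_items(share_items):
    """Keep only share items whose itemType is an S3 shareable type."""
    return [item for item in share_items if item['itemType'] in S3_SHAREABLE_TYPES]
```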

- Make `ShareObjectRepository` completely generic. The following queries needed extra work:
  - `ShareObjectRepository.get_share_item`: renamed to `get_share_item_details`.
  - `list_shareable_items`: split into two parts, `list_shareable_items_of_type` + `paginated_list_shareable_items`. The first function is invoked recursively over the list of share processors: instead of querying DatasetTable, DatasetStorageLocation and DatasetBucket, we query the shareable_type. The challenge is to get all fields from the db Resource object that all of them are built upon; in particular, the field `itemName` does not match the BucketName (in buckets) or the S3Prefix (in folders). For this reason I added a migration script that backfills `DatasetBucket.name` with `DatasetBucket.S3BucketName` and `DatasetStorageLocation.name` with `DatasetStorageLocation.S3Prefix`. `paginated_list_shareable_items` joins the list of subqueries, then filters and paginates.
  - In `verify_dataset_share_objects`, I replaced `list_shareable_items` with `ShareObjectRepository.get_all_share_items_in_share`, which does not need tables or storage locations, avoiding the whole S3 logic and unnecessary queries.
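
As a rough illustration of the two-step listing, the sketch below uses plain Python lists in place of SQLAlchemy subqueries; `ProcessorDef` and its `fetch_items` callable are simplified stand-ins for the real `ShareProcessorDefinition`, and the helper bodies are assumptions rather than the actual implementation.

```python
# Sketch: one "subquery" per registered processor type, then a single place
# that unions, filters, and paginates the combined rows.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ProcessorDef:
    shareable_type: str                       # e.g. 'Table', 'S3Bucket'
    fetch_items: Callable[[str], List[dict]]  # stands in for a per-type subquery

def list_shareable_items_of_type(share_uri, processor):
    # tag each row with its type, mirroring the per-processor subqueries
    return [{'itemType': processor.shareable_type, **row}
            for row in processor.fetch_items(share_uri)]

def paginated_list_shareable_items(share_uri, processors, page=1, page_size=10):
    # union of all per-type results, paginated in one place
    rows = []
    for proc in processors:
        rows.extend(list_shareable_items_of_type(share_uri, proc))
    start = (page - 1) * page_size
    return rows[start:start + page_size]
```

In the real code the per-type results are SQL subqueries joined and paginated in the database, not Python lists, but the control flow is the same.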

- Remove S3 references from shares_base.api.resolvers; use `DatasetBase` and `DatasetBaseRepository` instead. Remove the unused `ShareableObject`.

- I ran into circular dependency problems, so I created the `ShareProcessorManager` in shares_base for the registration of processors. The `SharingService` uses the manager to get all processors.
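
A minimal sketch of such a registry, assuming only the shape of the registration calls visible in the diff (`register_processor(ShareProcessorDefinition(...))`); the field names and the getter are illustrative.

```python
# Sketch of a processor registry like ShareProcessorManager: modules register
# their processors at load time, and the sharing service pulls them from the
# registry instead of importing the modules directly, breaking the cycle.
from dataclasses import dataclass
from typing import Any, List

@dataclass
class ShareProcessorDefinition:
    type: Any            # e.g. ShareableType.Table
    Processor: Any       # e.g. ProcessLakeFormationShare
    shareable_type: Any  # e.g. DatasetTable
    shareable_uri: Any   # e.g. DatasetTable.tableUri

class ShareProcessorManager:
    SHARING_PROCESSORS: List[ShareProcessorDefinition] = []

    @classmethod
    def register_processor(cls, processor: ShareProcessorDefinition) -> None:
        cls.SHARING_PROCESSORS.append(processor)

    @classmethod
    def get_registered_processors(cls) -> List[ShareProcessorDefinition]:
        return cls.SHARING_PROCESSORS
```

Because the registry lives in shares_base and holds only definitions, s3_datasets_shares can register its processors without shares_base ever importing the S3 module.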

Missing items for Part 10:
- Lake Formation cross-region table references in `shares_base/services/share_item_service.py:add_shared_item`
- Remove table references in `shares_base/services/share_item_service.py:remove_shared_item`
- Remove s3_prefix references in `shares_base/services/share_notification_service.py:notify_new_data_available_from_owners`
- RENAMING! Right now the names are a bit misleading

### Relates
- #1283 
- #1123 
- #955 


### Security
Please answer the questions below briefly where applicable, or write
`N/A`. Based on
[OWASP 10](https://owasp.org/Top10/en/).

- Does this PR introduce or modify any input fields or queries - this
includes
fetching data from storage outside the application (e.g. a database, an
S3 bucket)?
  - Is the input sanitized?
- What precautions are you taking before deserializing the data you
consume?
  - Is injection prevented by parametrizing queries?
  - Have you ensured no `eval` or similar functions are used?
- Does this PR introduce any functionality or component that requires
authorization?
  - How have you ensured it respects the existing AuthN/AuthZ mechanisms?
  - Are you logging failed auth attempts?
- Are you using or adding any cryptographic features?
  - Do you use standard, proven implementations?
  - Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users?
  - Have you used the least-privilege principle? How?


By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
dlpzx committed Jun 21, 2024
1 parent e461fa1 commit 342798c
Showing 27 changed files with 1,008 additions and 1,184 deletions.
```diff
@@ -20,6 +20,7 @@ def create_dataset_bucket(session, dataset: S3Dataset, data: dict = None) -> Dat
             KmsAlias=dataset.KmsAlias,
             imported=dataset.imported,
             importedKmsKey=dataset.importedKmsKey,
+            name=dataset.S3BucketName,
         )
         session.add(bucket)
         session.commit()
```
25 changes (16 additions, 9 deletions) in `backend/dataall/modules/s3_datasets_shares/__init__.py`:

```diff
@@ -46,9 +46,13 @@ def is_supported(modes: List[ImportMode]):
     def depends_on() -> List[Type['ModuleInterface']]:
         from dataall.modules.notifications import NotificationsModuleInterface
         from dataall.modules.s3_datasets import DatasetAsyncHandlersModuleInterface
-        from dataall.modules.shares_base import SharesBaseModuleInterface
+        from dataall.modules.shares_base import SharesBaseAsyncHandlerModuleInterface
 
-        return [DatasetAsyncHandlersModuleInterface, NotificationsModuleInterface, SharesBaseModuleInterface]
+        return [
+            DatasetAsyncHandlersModuleInterface,
+            NotificationsModuleInterface,
+            SharesBaseAsyncHandlerModuleInterface,
+        ]
 
     def __init__(self):
         log.info('S3 Sharing handlers have been imported')
@@ -83,7 +87,10 @@ def depends_on() -> List[Type['ModuleInterface']]:
         return [SharesBaseECSTaskModuleInterface, NotificationsModuleInterface]
 
     def __init__(self):
-        from dataall.modules.shares_base.services.sharing_service import SharingService, SharingProcessorDefinition
+        from dataall.modules.shares_base.services.share_processor_manager import (
+            ShareProcessorManager,
+            ShareProcessorDefinition,
+        )
         from dataall.modules.shares_base.services.shares_enums import ShareableType
         from dataall.modules.s3_datasets.db.dataset_models import DatasetTable, DatasetBucket, DatasetStorageLocation
         from dataall.modules.s3_datasets_shares.services.share_processors.glue_table_share_processor import (
@@ -96,18 +103,18 @@ def __init__(self):
             ProcessS3AccessPointShare,
         )
 
-        SharingService.register_processor(
-            SharingProcessorDefinition(
+        ShareProcessorManager.register_processor(
+            ShareProcessorDefinition(
                 ShareableType.Table, ProcessLakeFormationShare, DatasetTable, DatasetTable.tableUri
             )
         )
-        SharingService.register_processor(
-            SharingProcessorDefinition(
+        ShareProcessorManager.register_processor(
+            ShareProcessorDefinition(
                 ShareableType.S3Bucket, ProcessS3BucketShare, DatasetBucket, DatasetBucket.bucketUri
            )
        )
-        SharingService.register_processor(
-            SharingProcessorDefinition(
+        ShareProcessorManager.register_processor(
+            ShareProcessorDefinition(
                 ShareableType.StorageLocation,
                 ProcessS3AccessPointShare,
                 DatasetStorageLocation,
```