
Enhancing Dataset representation to consider multiple buckets & filtered list of tables #720

Closed
blitzmohit opened this issue Aug 31, 2023 · 3 comments · Fixed by #811
Labels
effort: large · status: not-picked-yet (At the moment we have not picked this item. Anyone can pick it up) · type: enhancement (Feature enhancement) · type: newfeature (New feature request)
Milestone

Comments

@blitzmohit
Contributor

Is your feature request related to a problem? Please describe.
We have a few databases currently in production use with 30+ tables, where each of those tables points to a different S3 bucket.

Additionally, the same logical dataset could be split across multiple tables and S3 buckets, either to improve performance (there are limits to the read/write rates a particular bucket/prefix can handle) or to make the data easier to manage.

For example, a click metrics dataset with an hourly table and an aggregated daily table (i.e. 2 tables and 2 separate S3 buckets), and a display metrics dataset with just one table and bucket.

Currently, data.all's dataset representation is based on the idea of one S3 bucket and one Glue database, with all tables in that database pointing to this bucket.

Data shaped as described earlier does not lend itself well to importing and sharing in data.all: for example, which of the 30 buckets would you specify as the dataset bucket, and the complete database should not be part of the dataset, only specific tables.

Describe the solution you'd like
When importing a database, the user should be able to select or filter the list of tables being imported from a particular database, and specify the list of buckets that would be imported as part of the dataset.

  • Adding support for multiple buckets in a dataset and making them available as a sharable item as part of the dataset.
  • Adding support for filtering of tables that are imported as part of a dataset
  • Check crawler, profiling and sharing for modifications now that a single Glue database can span multiple datasets

Describe alternatives you've considered
Creating a new logical construct, a "dataset group", which contains multiple datasets, with sharing supported at the level of the group. Based on the previous example, click hourly and click daily would be two datasets in the click dataset group.


@anmolsgandhi added the type: enhancement, type: newfeature and status: not-picked-yet labels on Sep 7, 2023
@dlpzx (Contributor) commented Sep 7, 2023

Hi @blitzmohit, thanks for opening the issue. The design principle 1 dataset = 1 S3 bucket = 1 Glue database has also been a challenge for other customers. I see two alternatives:

  1. create a new construct "dataset group" => 1 dataset = 1 S3 bucket, 1 dataset group = 1 Glue database
  2. modify datasets to be "multi-S3-bucket" => 1 dataset = X S3 buckets = 1 Glue database
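Alternative 2 could be modeled relationally by breaking the bucket out of the dataset table into a child table, as mentioned below. A hedged SQLAlchemy sketch, assuming the table and column names (`dataset_s3_buckets`, `datasetUri`, etc.) are illustrative rather than data.all's actual schema:

```python
# Illustrative only: names are hypothetical, not data.all's actual RDS schema.
from sqlalchemy import Column, ForeignKey, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Dataset(Base):
    __tablename__ = "dataset"
    datasetUri = Column(String, primary_key=True)
    glueDatabaseName = Column(String, nullable=False)  # still exactly one Glue DB
    # 1 dataset -> N buckets replaces the former single-bucket column
    buckets = relationship("DatasetS3Bucket", back_populates="dataset")

class DatasetS3Bucket(Base):
    __tablename__ = "dataset_s3_buckets"
    bucketUri = Column(String, primary_key=True)
    datasetUri = Column(String, ForeignKey("dataset.datasetUri"), nullable=False)
    s3BucketName = Column(String, nullable=False)
    dataset = relationship("Dataset", back_populates="buckets")
```

With this shape, the click example from the issue body becomes one `Dataset` row with two `DatasetS3Bucket` children, while the Glue database stays one-to-one with the dataset.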

To decide on the design I agree with the points you highlighted:

  • Adding support for multiple buckets in a dataset --> I would take this task first and then the second one. It requires: a) changes to the import API; b) changes to the RDS models (maybe we need to break up the dataset table and create a new 'dataset_S3_buckets' table); c) changes to the dataset stack: we need to register all S3 locations in Lake Formation and modify the permissions for the IAM roles (dataset roles and environment team roles); d) frontend views
  • making them available as a shareable item as part of the dataset --> are we going to share all buckets at once, or do we allow bucket-by-bucket sharing?
  • Adding support for filtering of tables that are imported as part of a dataset --> this is a great feature. One customer has implemented something similar: they run the crawler and identify the schema at import time, then select only the tables that they want to import. Is this what you are referring to here?
  • Impact on "dataset" support resources linked to a bucket: crawler, profiling job ---> for the crawler it might be easier to import it as well or to disable it for multi-buckets datasets. If the dataset has a lot of S3 buckets it can get complex to design a crawler that works. The profiling job works at the table level using Glue and getting access via lake Formation, so we should be fine.
  • Impact on other functionalities: e.g. there is a direct link to the S3 bucket from the UI

We can meet in the following weeks to work on a design. Thank you for opening the issue :)

@anmolsgandhi anmolsgandhi added this to the v2.1.0 milestone Sep 8, 2023
@zsaltys (Contributor) commented Sep 12, 2023

@dlpzx @blitzmohit I think at the moment this issue is too broad and we should open separate, more specific items. I propose we close this one out.

@dlpzx (Contributor) commented Nov 8, 2023

Merged and released with v2.1.0 🚀

@dlpzx dlpzx closed this as completed Nov 8, 2023