
Enhancing Dataset representation to consider multiple buckets & filtered list of tables #720

Closed
blitzmohit opened this issue Aug 31, 2023 · 3 comments · Fixed by #811
Labels
effort: large · status: not-picked-yet (At the moment we have not picked this item. Anyone can pick it up) · type: enhancement (Feature enhancement) · type: newfeature (New feature request)
Milestone

Comments

@blitzmohit
Contributor

Is your feature request related to a problem? Please describe.
We have a few databases currently in production use with 30+ tables, where each of those tables points to a different S3 bucket.

Additionally, the same logical dataset could be split across multiple tables and S3 buckets, either to improve performance (there are limits to the read/write rates a particular bucket/prefix can handle) or to make the data easier to manage.

For example, a click metrics dataset with an hourly table and an aggregated daily table (i.e. 2 tables and 2 separate S3 buckets), and a display metrics dataset with just one table and bucket.

Currently, data.all's dataset representation is based on the idea of one S3 bucket and one Glue database, with all tables in that database pointing to this bucket.

Data shaped as described earlier does not lend itself well to importing and sharing in data.all: for example, which of the 30 buckets would you specify as the dataset bucket, and the complete database should not be part of the dataset, only specific tables.

Describe the solution you'd like
When importing a database, the user should be able to select or filter the list of tables being imported from a particular database, and specify the list of buckets that would be imported as part of the dataset.

  • Adding support for multiple buckets in a dataset and making them available as a sharable item as part of the dataset.
  • Adding support for filtering of tables that are imported as part of a dataset
  • Check crawler, profiling and sharing for modifications now that a single Glue database can span multiple datasets

Describe alternatives you've considered
Creating a new logical construct, a "dataset group", which contains multiple datasets, with sharing supported at the level of the group. Based on the previous example, click hourly and click daily would be two datasets in the click dataset group.


@anmolsgandhi added the type: enhancement, type: newfeature and status: not-picked-yet labels on Sep 7, 2023
@dlpzx (Contributor) commented Sep 7, 2023

Hi @blitzmohit, thanks for opening the issue. The design principle 1 dataset = 1 S3 bucket = 1 Glue database has also been a challenge for other customers. I see two alternatives:

  1. create a new construct "dataset group" => 1 dataset = 1 S3 bucket, 1 dataset group = 1 Glue database
  2. modify datasets to be "multi-S3-bucket" => 1 dataset = X S3 buckets = 1 Glue database
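Alternative 2 could be modeled relationally by breaking the bucket out of the dataset table into a child table, as mentioned below. A hedged SQLAlchemy sketch, assuming the table and column names (`dataset_s3_buckets`, `datasetUri`, etc.) are illustrative rather than data.all's actual schema:

```python
# Illustrative only: names are hypothetical, not data.all's actual RDS schema.
from sqlalchemy import Column, ForeignKey, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Dataset(Base):
    __tablename__ = "dataset"
    datasetUri = Column(String, primary_key=True)
    glueDatabaseName = Column(String, nullable=False)  # still exactly one Glue DB
    # 1 dataset -> N buckets replaces the former single-bucket column
    buckets = relationship("DatasetS3Bucket", back_populates="dataset")

class DatasetS3Bucket(Base):
    __tablename__ = "dataset_s3_buckets"
    bucketUri = Column(String, primary_key=True)
    datasetUri = Column(String, ForeignKey("dataset.datasetUri"), nullable=False)
    s3BucketName = Column(String, nullable=False)
    dataset = relationship("Dataset", back_populates="buckets")
```

With this shape, the click example from the issue body becomes one `Dataset` row with two `DatasetS3Bucket` children, while the Glue database stays one-to-one with the dataset.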

To decide on the design I agree with the points you highlighted:

  • Adding support for multiple buckets in a dataset --> I would take this task first and then the second one. It requires: a) changes to the import API; b) changes to the RDS models (maybe we need to break up the dataset table and create a new 'dataset_S3_buckets' table); c) changes to the dataset stack: we need to register all S3 locations in Lake Formation and modify the permissions for the IAM roles (dataset roles and environment team roles); d) frontend views
  • making them available as a shareable item as part of the dataset --> are we going to share all buckets at once, or do we allow bucket-by-bucket sharing?
  • Adding support for filtering of tables that are imported as part of a dataset --> this is a great feature. One customer has implemented something similar: they run the crawler and identify the schema at import time, then select only the tables that they want to import. Is this what you are referring to here?
  • Impact on "dataset" support resources linked to a bucket: crawler, profiling job ---> for the crawler it might be easier to import it as well or to disable it for multi-buckets datasets. If the dataset has a lot of S3 buckets it can get complex to design a crawler that works. The profiling job works at the table level using Glue and getting access via lake Formation, so we should be fine.
  • Impact on other functionalities: e.g. there is a direct link to the S3 bucket from the UI

We can meet in the following weeks to work on a design. Thank you for opening the issue :)

@anmolsgandhi anmolsgandhi added this to the v2.1.0 milestone Sep 8, 2023
@zsaltys (Contributor) commented Sep 12, 2023

@dlpzx @blitzmohit I think at the moment this issue is too broad and we should open separate, more specific items. I propose we close this one out.

@dlpzx (Contributor) commented Nov 8, 2023

Merged and released with v2.1.0 🚀

@dlpzx dlpzx closed this as completed Nov 8, 2023