Enhancing Dataset representation to consider multiple buckets & filtered list of tables #720
Labels
effort: large
status: not-picked-yet
type: enhancement
type: newfeature
Is your feature request related to a problem? Please describe.
We have a few databases currently in production use with 30+ tables, where each of those tables points to a different S3 bucket.
Additionally, the same logical dataset could be split across multiple tables and S3 buckets, either to improve performance (there are limits on the reads and writes that a particular bucket or prefix can handle) or to make the data easier to manage.
For example, a click metrics dataset could have an hourly table and an aggregated daily table (i.e. 2 tables and 2 separate S3 buckets), while a display metrics dataset has just one table and one bucket.
Currently, data.all's dataset representation is based on the idea of one S3 bucket and one Glue database, with all tables in that database pointing to this bucket.
Data laid out as described above does not lend itself well to importing and sharing in data.all. For example, which of the 30 buckets would you specify as the dataset bucket? Also, the complete database should not be part of the dataset, only specific tables.
Describe the solution you'd like
When importing a database, the user should be able to select or filter the list of tables imported from that database and specify the list of buckets to import as part of the dataset.
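To make the request more concrete, here is a rough sketch of what such an import could look like. It is not data.all code: `GlueTable`, `DatasetImport`, `table_patterns`, and the bucket names are all hypothetical, and glob-style patterns are just one possible way to filter tables.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class GlueTable:
    # Hypothetical minimal view of a Glue table: its name and the bucket it points to.
    name: str
    bucket: str


@dataclass
class DatasetImport:
    # Hypothetical import request: one Glue database, a table filter, several buckets.
    database: str
    table_patterns: list[str] = field(default_factory=lambda: ["*"])
    buckets: list[str] = field(default_factory=list)

    def select_tables(self, tables: list[GlueTable]) -> list[GlueTable]:
        """Keep only tables whose name matches at least one glob pattern."""
        return [
            t for t in tables
            if any(fnmatchcase(t.name, p) for p in self.table_patterns)
        ]

    def missing_buckets(self, tables: list[GlueTable]) -> list[str]:
        """Buckets used by the selected tables but not declared on the import."""
        used = {t.bucket for t in self.select_tables(tables)}
        return sorted(used - set(self.buckets))


# Example based on the click/display datasets described above (names are made up).
tables = [
    GlueTable("click_hourly", "click-hourly-bucket"),
    GlueTable("click_daily", "click-daily-bucket"),
    GlueTable("display_metrics", "display-bucket"),
]
click_import = DatasetImport(
    database="metrics_db",
    table_patterns=["click_*"],
    buckets=["click-hourly-bucket"],
)
selected_names = [t.name for t in click_import.select_tables(tables)]
undeclared = click_import.missing_buckets(tables)
```

A check like `missing_buckets` could be used at import time to warn the user that a selected table points to a bucket that was not included in the dataset.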
Describe alternatives you've considered
Creating a new logical construct, a "dataset group", which contains multiple datasets, with sharing supported at the level of the group. Based on the previous example, click hourly and click daily would be two datasets in the click dataset group.
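A minimal sketch of this alternative, assuming each dataset keeps the current one-database, one-bucket shape and the group only adds a sharing layer on top. `Dataset`, `DatasetGroup`, and `share_with` are hypothetical names, not existing data.all constructs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    # Hypothetical dataset in today's shape: one Glue database and one S3 bucket.
    name: str
    database: str
    bucket: str


@dataclass
class DatasetGroup:
    # Hypothetical grouping construct; sharing is requested once for the whole group.
    name: str
    datasets: list[Dataset] = field(default_factory=list)

    def share_with(self, principal: str) -> list[tuple[str, str]]:
        """Expand a group-level share into one (principal, dataset) pair per member."""
        return [(principal, d.name) for d in self.datasets]


click_group = DatasetGroup(
    name="click",
    datasets=[
        Dataset("click_hourly", "click_hourly_db", "click-hourly-bucket"),
        Dataset("click_daily", "click_daily_db", "click-daily-bucket"),
    ],
)
shares = click_group.share_with("analytics-team")
```

The trade-off compared with the preferred solution is that every table and bucket pair still becomes its own dataset, so a database with 30+ tables would need 30+ datasets inside the group.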