S3 Bucket Dataset #959

@lmarini

User Story

When creating a new dataset, the user can pick from a list of dataset types. The built-in type is the current one and stores everything locally. The new type is an S3 bucket mirror. The user provides S3 credentials as part of the dataset creation step. After that, the dataset functions like the built-in type, except that users cannot add, remove, or edit files or folders (maybe we can support this in the future?).

Implementation

An S3 bucket dataset has the following attributes:

  • Files and folders are not stored locally, but in the S3 bucket.
  • Add a webhook to Clowder 2 so that S3 events are sent back to Clowder. When objects in the bucket are modified, Clowder 2 can update its state.
  • An alternative is to always query the bucket live.
  • Files and folders need to be extended to support the concept of remote versions.
  • S3 bucket credentials are provided by the user when creating a dataset. Credentials are kept in a vault (for example HashiCorp Vault / OpenBao).
  • Adding local files and folders is disabled. Users can still add metadata to the mirrored files and run extractors on them.
  • Open question: are file versions ignored, or updated through events?
  • Download endpoints will stream data from the S3 bucket (phase 1). We could eventually have extractors download directly from S3 buckets (phase 2).
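
The webhook option above could be sketched roughly as follows. This is only an illustration, not the planned implementation: it assumes the webhook receives the standard S3 event notification JSON already unwrapped from whatever delivers it (for example an SNS subscription), and `RemoteFile` and `apply_s3_event` are hypothetical names that do not exist in Clowder 2. An in-memory dict stands in for the database.

```python
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import unquote_plus


@dataclass
class RemoteFile:
    """Hypothetical record of one object mirrored from the bucket."""
    key: str
    size: int
    etag: str
    version_id: Optional[str] = None  # only present on versioned buckets


def apply_s3_event(index: Dict[str, RemoteFile], event: dict) -> None:
    """Update an in-memory index from one S3 event notification message."""
    for record in event.get("Records", []):
        name = record["eventName"]        # e.g. "ObjectCreated:Put"
        obj = record["s3"]["object"]
        key = unquote_plus(obj["key"])    # object keys arrive URL-encoded
        if name.startswith("ObjectCreated:"):
            index[key] = RemoteFile(
                key=key,
                size=obj.get("size", 0),
                etag=obj.get("eTag", ""),
                version_id=obj.get("versionId"),
            )
        elif name.startswith("ObjectRemoved:"):
            index.pop(key, None)


# Usage with a made-up bucket and key.
index: Dict[str, RemoteFile] = {}
apply_s3_event(index, {"Records": [{
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "example-bucket"},
           "object": {"key": "raw/my+file.csv", "size": 1024, "eTag": "abc123"}},
}]})
```

Keeping `version_id` on the record is one possible way to cover the "remote versions" idea. The live-query alternative would skip this index entirely and list the bucket on each request.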

Updates to models

  • Dataset models stay the same, apart from a new type field. The type drives everything about the dataset's behavior.
  • File already includes StorageType. Folder does not.
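
A minimal sketch of how a type field might drive behavior, using plain dataclasses so it runs on its own. The names `DatasetType`, `Dataset`, and `can_modify_files` and the enum values are assumptions made for illustration; the real Clowder 2 models and the existing `StorageType` are not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum


class DatasetType(str, Enum):
    """Hypothetical dataset types; the values are placeholders."""
    BUILT_IN = "built-in"
    S3_BUCKET = "s3-bucket"


@dataclass
class Dataset:
    """Stand-in for the dataset model, reduced to the fields needed here."""
    name: str
    type: DatasetType = DatasetType.BUILT_IN  # existing datasets keep working


def can_modify_files(dataset: Dataset) -> bool:
    """Only built-in datasets allow adding, removing, or editing files."""
    return dataset.type is DatasetType.BUILT_IN
```

Defaulting the field to the built-in type means existing datasets would not need a migration to keep their current behavior, although that is a design choice rather than something the issue specifies.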

Other Thoughts

  • If at any point the credentials stop working, the system should send a clear message to the user and allow them to update credentials.
  • When running an extractor on a file, the extractor could download directly from the bucket. This will require more development.
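
For the credentials point above, one possible approach is to map the error code returned by S3 to a message the user can act on. The sketch below is an assumption about how that could look: the error codes are standard S3 error codes, but `CREDENTIAL_ERROR_CODES`, `credential_message`, and the message wording are hypothetical.

```python
from typing import Optional

# S3 error codes that usually point at the credentials rather than the request.
CREDENTIAL_ERROR_CODES = {
    "InvalidAccessKeyId": "The access key ID is not recognized by the S3 service.",
    "SignatureDoesNotMatch": "The secret access key does not match the access key ID.",
    "ExpiredToken": "The temporary credentials have expired.",
    "AccessDenied": "The credentials lack permission on this bucket.",
}


def credential_message(error_code: str) -> Optional[str]:
    """Return a user-facing message for credential errors, else None."""
    detail = CREDENTIAL_ERROR_CODES.get(error_code)
    if detail is None:
        return None  # not a credential problem; handle it elsewhere
    return f"{detail} Please update the S3 credentials for this dataset."
```

With boto3, the code would typically be read from `ClientError.response["Error"]["Code"]` before being passed to a function like this.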

Labels: enhancement (new feature or request), epic (large feature to be broken down into user stories)
