Source File: add support for directories #1874

Closed
eugene-kulak opened this issue Jan 28, 2021 · 3 comments
Labels
area/connectors Connector related issues type/enhancement New feature or request wontfix This will not be worked on

Comments

@eugene-kulak
Contributor

eugene-kulak commented Jan 28, 2021

Tell us about the problem you're trying to solve

Currently, the URL can point only to a single file, so we always get a single stream. It would be nice to list the files in a directory when the URL points to a directory object.

Describe the solution you’d like

Given the file structure:

files_folder/
   file1.csv
   file2.xls
   file3.csv
   inner_folder/
      file4.csv

For url="ssh://127.0.0.1:200/files_folder/" and format=csv, we should see the following streams:

streams:
   file1
   file3
   inner_folder_file4
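
The discovery step proposed above could be sketched roughly as follows. This is only an illustration, not existing Source File behavior: it walks a local directory as a stand-in for the remote one, `discover_streams` is a hypothetical helper, and the naming rule (extension dropped, nested folder names joined with underscores) is an assumption.

```python
from pathlib import Path


def discover_streams(root: Path, fmt: str) -> dict[str, Path]:
    """Map stream names to files under `root` whose extension matches `fmt`.

    Hypothetical helper: the stream name is the path relative to `root`,
    without its extension, with path components joined by underscores.
    """
    streams: dict[str, Path] = {}
    # rglob also descends into nested folders such as inner_folder/.
    for path in sorted(root.rglob(f"*.{fmt}")):
        if not path.is_file():
            continue
        relative = path.relative_to(root).with_suffix("")
        streams["_".join(relative.parts)] = path
    return streams
```

With the folder layout from this issue and `fmt="csv"`, the `.xls` file is skipped and the nested file is picked up under a name prefixed with its folder.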



@eugene-kulak eugene-kulak added type/enhancement New feature or request zazmic labels Jan 28, 2021
@ChristopheDuong
Contributor

ChristopheDuong commented Jan 28, 2021

  • A folder could be considered as a Database Schema
  • we could introduce a regex pattern to match multiple files that form the same stream (they would all need the same schema, though maybe not the same file format); that way a large file can be uploaded as several chunked/partitioned files
  • you can have different streams in a folder as we do with other sources

But then comes the question of incremental refreshes with such a source...
Maybe the state should keep track of filenames, modification dates, checksums, etc. of the replicated files?
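
One way such a state could look, as a minimal sketch only: each replicated file is recorded with its modification time, size, and a SHA-256 checksum, and a file is re-synced when any of these differ from the saved state. The names `fingerprint` and `changed_files` and the state layout are assumptions for illustration, not part of any connector.

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path) -> dict:
    """Collect the per-file facts the state would remember."""
    stat = path.stat()
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return {"modified_at": stat.st_mtime, "size": stat.st_size, "sha256": digest}


def changed_files(paths, state):
    """Return (files that need a re-sync, updated state).

    `state` maps a file path string to the fingerprint saved after the
    previous sync; new or modified files have no matching entry.
    """
    new_state = {}
    changed = []
    for path in paths:
        current = fingerprint(path)
        new_state[str(path)] = current
        if state.get(str(path)) != current:
            changed.append(path)
    return changed, new_state
```

On the first run the state is empty, so every file counts as changed; later runs only return files whose fingerprint moved. Reading the whole file to hash it is a simplification that would need streaming for large files.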

@eugene-kulak
Contributor Author

@ChristopheDuong

  • I like this analogy
  • To support this, we need to somehow define a rule that extracts the base stream name from a file name. Then we can apply a regex to all filenames and find all partitions of each stream.
  • Regarding different file formats, I would allow only compressed variations, i.e. csv and csv.gz.
  • Yes, we can. The complication I see is that we have too many stream-specific options at the connector level (format, reader_options).
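
The partition idea from the points above might look roughly like this. It is a sketch under assumptions: the `_partN` suffix convention, `PARTITION_PATTERN`, and `group_partitions` are all made up for illustration, and only `csv` and `csv.gz` are accepted, in line with the compressed-variations suggestion.

```python
import re
from collections import defaultdict

# Assumed convention: "<stream>_part<N>.csv" or "<stream>.csv",
# optionally gzip-compressed (".csv.gz").
PARTITION_PATTERN = re.compile(r"^(?P<stream>.+?)(?:_part\d+)?\.csv(?:\.gz)?$")


def group_partitions(filenames):
    """Group file names by the base stream name extracted via the regex."""
    streams = defaultdict(list)
    for name in sorted(filenames):
        match = PARTITION_PATTERN.match(name)
        if match:  # files in other formats are ignored
            streams[match.group("stream")].append(name)
    return dict(streams)
```

For example, `orders_part1.csv` and `orders_part2.csv.gz` would end up in one `orders` stream, `users.csv` in its own stream, and an `.xls` file would be left out.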

@sherifnada
Contributor

This is happening in the S3 connector and won't happen in the file source per se, so closing as wontfix.

@sherifnada sherifnada added the wontfix This will not be worked on label Aug 4, 2021