[C++] Implement ability to retrieve fragment filename #30772

Closed
asfimport opened this issue Jan 7, 2022 · 1 comment

asfimport commented Jan 7, 2022

A user has requested the ability to include the filename of the CSV in the dataset output - see discussion on ARROW-15260 for more context.

Relevant info from that ticket:

"From a C++ perspective we've got many of the pieces needed already. One challenge is that the datasets API is written to work with "fragments" and not "files". For example, a dataset might be an in-memory table, in which case we are working with InMemoryFragment and not FileFragment, so there is no concept of "filename".

That being said, the low level ScanBatchesAsync method actually returns a generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is a struct with the record batch as well as the source fragment for that record batch.

So if you were to execute a scan, you could inspect the fragment of each batch and, if it is a FileFragment, extract the filename.

Another challenge is that R is moving more and more towards access through an exec plan rather than using a scanner directly. For that to work, we would need to augment the scan results with the filename in C++ before sending them into the exec plan. Luckily, we already do a bit of this as well. We currently augment the scan results with the fragment index, the batch index, and whether the batch is the last batch in the fragment.

Since ExecBatch can work with constants efficiently, I don't think there will be much performance cost in always including the filename. So the work remaining is simply to add a new augmented field, __fragment_source_name, which is always attached if the underlying fragment is backed by a file. Then users can get this field if they want by including "__fragment_source_name" in the list of columns they query for."
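
To make the proposal above concrete, here is a minimal, self-contained C++17 sketch of the idea. It does not use Arrow itself: `Fragment`, `FileFragment`, `InMemoryFragment`, `TaggedBatch`, `SourceName`, and `Augment` are hypothetical stand-ins written for illustration, and the index column names are assumptions rather than Arrow's confirmed field names. Only `__fragment_source_name` comes from the ticket text.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <variant>

// Stand-in types for illustration only; these are NOT Arrow's classes.
struct Fragment {
  virtual ~Fragment() = default;  // polymorphic so dynamic_cast works
};

// A fragment with no backing file, e.g. an in-memory table.
struct InMemoryFragment : Fragment {};

// A fragment that reads from a file and therefore has a path.
struct FileFragment : Fragment {
  explicit FileFragment(std::string p) : path(std::move(p)) {}
  std::string path;
};

// A batch paired with the fragment it came from (row data omitted here).
struct TaggedBatch {
  std::shared_ptr<Fragment> fragment;
};

// Returns the path only when the fragment is file-backed.
std::optional<std::string> SourceName(const Fragment& fragment) {
  if (const auto* file = dynamic_cast<const FileFragment*>(&fragment)) {
    return file->path;
  }
  return std::nullopt;
}

// One constant per augmented column: the value is the same for every row
// of the batch, so it is stored once instead of once per row.
using Constant = std::variant<int, bool, std::string>;
using AugmentedColumns = std::map<std::string, Constant>;

// Builds the extra columns attached to a batch before it enters a plan.
// The three index column names below are illustrative assumptions.
AugmentedColumns Augment(const TaggedBatch& tagged, int fragment_index,
                         int batch_index, bool last_in_fragment) {
  assert(tagged.fragment != nullptr);
  AugmentedColumns columns;
  columns["__fragment_index"] = Constant{fragment_index};
  columns["__batch_index"] = Constant{batch_index};
  columns["__last_in_fragment"] = Constant{last_in_fragment};
  // Attach the source name only for file-backed fragments.
  if (auto name = SourceName(*tagged.fragment)) {
    columns["__fragment_source_name"] = Constant{*name};
  }
  return columns;
}
```

In this sketch, a `TaggedBatch` built from a `FileFragment` gains a `__fragment_source_name` entry, while one built from an `InMemoryFragment` gets only the three index columns. Whether the real field should be omitted or set to null for non-file fragments is a design choice the ticket leaves open; omission is used here only to keep the example short.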

Reporter: Nicola Crane / @thisisnic
Assignee: Sanjiban Sengupta / @sanjibansg

Note: This issue was originally created as ARROW-15281. Please see the migration documentation for further details.

Weston Pace / @westonpace:
Issue resolved by pull request 12560
#12560
