[C++] Teach dataset_writer to accept custom functor on base file name template #34565
Comments
Hi @westonpace, can you comment on this issue when you have time? Ty.
That seems like a reasonable ask. Would you suggest doing this with patterns in the basename template? I'm guessing the old approach Or, there are probably other ways to do this as well. |
Hi Weston, yes, and thanks for the reply. I'm open to either C printf or something from std. Is it OK to limit the scope of this issue to left padding, or do you have other suggestions?
If it is easier to limit the scope to left padding, that is fine. Using something from std is fine (and probably preferred) as well. If we do use something from std/printf then we might find it is actually harder to limit the scope to left padding. So I also don't mind if we end up supporting something more.
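For reference, the two options mentioned above (C printf versus something from std) could look roughly like this for zero left padding. This is only a sketch: `PadWithPrintf` and `PadWithStream` are hypothetical helper names, not part of Arrow, and both assume a non-negative index.

```cpp
#include <cstdio>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: left-pad a non-negative index with zeros, printf style.
std::string PadWithPrintf(int index, int width) {
  char buf[32];
  // "%0*d" reads the minimum field width from the argument list.
  // snprintf never writes past the buffer; very large widths would be truncated.
  std::snprintf(buf, sizeof(buf), "%0*d", width, index);
  return std::string(buf);
}

// Hypothetical helper: the same padding using standard stream manipulators.
std::string PadWithStream(int index, int width) {
  std::ostringstream out;
  out << std::setw(width) << std::setfill('0') << index;
  return out.str();
}
```

Both helpers leave a number untouched when it already has at least `width` digits, so the width acts as a minimum rather than a limit.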
Cool. I will ping you again once the PR is ready.
I have some new thoughts on this issue. Instead of supporting just left padding, I propose we allow users to provide a lambda function with type as
That's a good idea!
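To illustrate the idea of a user-supplied naming function, here is a minimal sketch. The callback type `BasenameFunctor`, the helper `MakeBasename`, and the factory `ZeroPadded` are all assumptions made up for this example; they are not the signature proposed in the comment above and not Arrow's API.

```cpp
#include <functional>
#include <iomanip>
#include <sstream>
#include <string>

// Assumed callback type for illustration: file counter in, name stem out.
using BasenameFunctor = std::function<std::string(int)>;

// Hypothetical helper: build a file name from the counter, falling back to
// the plain integer when no functor is supplied.
std::string MakeBasename(const BasenameFunctor& functor, int counter,
                         const std::string& extension) {
  std::string stem = functor ? functor(counter) : std::to_string(counter);
  return stem + "." + extension;
}

// Hypothetical factory: a functor that left-pads the counter with zeros.
BasenameFunctor ZeroPadded(int width) {
  return [width](int counter) {
    std::ostringstream out;
    out << std::setw(width) << std::setfill('0') << counter;
    return out.str();
  };
}
```

With this shape, left padding becomes just one possible functor, and a hash-based or otherwise custom naming scheme could be passed in the same way.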
Hi Weston, can you advise how to run the dataset_writer_test test? It keeps telling me that 0 tests were run...
Sorry, that's due to a bug that was briefly checked into the main branch. Can you rebase onto the main branch? It should be fixed now.
Rebasing fixed it. Thanks, Weston.
…functor

### Rationale for this change
The existing basename_template only uses a monotonically increasing integer for new file names. When custom file names are needed (left padding, hash code), downstream users must rename the files in a post-processing step.

### What changes are included in this PR?
A new functor is added to FileSystemDatasetWriteOptions, which allows users to provide a custom name for dataset_writer.

### Are these changes tested?
Yes. Unit tests are added for normal and ill-formed lambdas.

### Are there any user-facing changes?
Yes. It allows users to customize output file names.

* Closes: apache#34565
#34984)

### Rationale for this change
The existing basename_template only uses a monotonically increasing integer for new file names. When custom file names are needed (left padding, hash code), downstream users must rename the files in a post-processing step.

### What changes are included in this PR?
A new functor is added to FileSystemDatasetWriteOptions, which allows users to provide a custom name for dataset_writer.

### Are these changes tested?
Yes. Unit tests are added for normal and ill-formed lambdas.

### Are there any user-facing changes?
Yes. It allows users to customize output file names.

* Closes: #34565

Authored-by: Haocheng Liu <lbtinglb@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Describe the enhancement requested
Hi,
I want to file a feature request: dataset_writer currently only supports a basename_template with kIntegerToken. Say I have the files 0.parquet, 1.parquet, ..., 10.parquet, 11.parquet, and 12.parquet. These names do not come out in numeric order under an alphabetical sort, so downstream users must implement a custom sorter to compensate. In my case, I would need to touch quite a few codebases to support Hive-style partitioned Parquet in my org.
I propose adding a new option that allows left padding with zeros, so the names would be 001.parquet, ..., 010.parquet, 011.parquet, and 012.parquet. Let me know your thoughts and suggestions; I can help with the contribution.
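The sorting problem can be reproduced with a small sketch. `SortedNames` is a hypothetical helper written for this illustration only: it generates the file names for a given count, optionally zero-padded to a minimum width, and sorts them the way a plain alphabetical sorter would.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: generate "<i>.parquet" for i in [0, count), zero-padded
// to at least `width` digits, then sort the names lexicographically.
std::vector<std::string> SortedNames(int count, std::size_t width) {
  std::vector<std::string> names;
  for (int i = 0; i < count; ++i) {
    std::string digits = std::to_string(i);
    if (digits.size() < width) {
      digits.insert(0, width - digits.size(), '0');  // left pad with zeros
    }
    names.push_back(digits + ".parquet");
  }
  std::sort(names.begin(), names.end());  // plain byte-wise string ordering
  return names;
}
```

With a width of 0 and 13 files, 10.parquet sorts ahead of 2.parquet; with a width of 3, the alphabetical order matches the numeric order.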
Component(s)
C++