Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Teach dataset_writer to accept custom functor on base file name template #34565

Closed
HaochengLIU opened this issue Mar 14, 2023 · 10 comments · Fixed by #34984
Closed

[C++] Teach dataset_writer to accept custom functor on base file name template #34565

HaochengLIU opened this issue Mar 14, 2023 · 10 comments · Fixed by #34984

Comments

@HaochengLIU
Copy link
Contributor

Describe the enhancement requested

Hi,
I want to report a feature request as nowadays dataset_writer only supports basename_template with kIntegerToken. Say I have file 0.parquet, 1,parquet, ..., 10.parquet, 11.parquet and 12.parquet. It does not work with alphabetical sorter and downstream users must implement lexicographic sorter accordingly. In my case, I need to touch quite a few codebases to support hive style partition parquet in my org.

I propose to add a new option which allows left padding with zeros. so the names will be 001,parquet, ...,010.parquet, 011.parquet and 012.parquet. Let me know thoughts and suggestions as I can help with the contribution.

Component(s)

C++

@HaochengLIU
Copy link
Contributor Author

Hi @westonpace , can you comment on this issue when you have time? Ty.

@westonpace
Copy link
Member

That seems like a reasonable ask. Would you suggest doing this with patterns in the basename template? I'm guessing the old approach {i} was based on python's f strings ({i: <16}) so maybe we should pick something like printf that is available in C++ and then fully support the various format strings.

Or, there are probably other ways to do this as well.

@HaochengLIU
Copy link
Contributor Author

Hi Weston yes and thanks for the reply. I'm open to either C printf or pick something from STD. Is it ok to limit the scope to left padding in issue, or you have other suggestions?

@westonpace
Copy link
Member

If it is easier to limit the scope to left padding that is fine.

Using something from std is fine (and probably preferred) as well.

If we do use something from std/printf then we might find it is actually harder to limit the scope to left padding. So I also don't mind if we end up supporting something more.

@HaochengLIU
Copy link
Contributor Author

Cool. I will ping you again once the PR is ready.

@HaochengLIU
Copy link
Contributor Author

I have some new thoughts on this issue. Instead of supporting just left padding, I propose we allow users to provide a lambda function with type as std::function(std::string<int>) in the WriterOption. By doing so, users can do whatever they want(e.g. left padding, use hash value as file name, etc) with a generic solution in C++. If that sounds good to you I will tackle it in my spare cycle in April.

@westonpace
Copy link
Member

That's a good idea!

@HaochengLIU HaochengLIU changed the title [C++] dataset_writer does not support left padding [C++] Teach dataset_writer to accept custom functor on base file name template Apr 5, 2023
@HaochengLIU
Copy link
Contributor Author

Hi weston can you guide how to run test dataset_writer_test? it keeps telling me 0 test gets run...

build ninja && ctest -R dataset-writer-test -V

1: Test command: /Users/haochengliu/Documents/projects/Arrow/arrow/cpp/build-support/run-test.sh "/Users/haochengliu/Documents/projects/Arrow/build" "test" "/Users/haochengliu/Documents/projects/Arrow/build/debug//arrow-dataset-dataset-writer-test"
51: Working Directory: /Users/haochengliu/Documents/projects/Arrow/build/src/arrow/dataset
51: Test timeout computed to be: 10000000
51: Running arrow-dataset-dataset-writer-test, redirecting output into /Users/haochengliu/Documents/projects/Arrow/build/build/test-logs/arrow-dataset-dataset-writer-test.txt (attempt 1/1)
51: Running main() from /Users/haochengliu/Documents/projects/Arrow/build/googletest_ep-prefix/src/googletest_ep/googletest/src/gtest_main.cc
51: [==========] Running 0 tests from 0 test suites.
51: [==========] 0 tests from 0 test suites ran. (0 ms total)
51: [  PASSED  ] 0 tests.
51: ~/Documents/projects/Arrow/build/src/arrow/dataset
1/1 Test #51: arrow-dataset-dataset-writer-test ...   Passed    0.19 sec

The following tests passed:
	arrow-dataset-dataset-writer-test

@westonpace
Copy link
Member

Hi weston can you guide how to run test dataset_writer_test? it keeps telling me 0 test gets run...

Sorry, that's due to a bug that was briefly checked into the main branch. Can you rebase the main branch? It should be fixed now.

@HaochengLIU
Copy link
Contributor Author

Rebasing fixes it now. Thanks Weston

HaochengLIU added a commit to HaochengLIU/arrow that referenced this issue Apr 8, 2023
…functor

### Rationale for this change

Existing basename_template will only use a monotonically increasing int
as new filenames. when there is needs for custom filenames(left padding,
hash-code), downstream users must rename the files in a post processing
step.

### What changes are included in this PR?

A new functor is added to FileSystemDatasetWriteOptions which allows
users to provide a custom name for dataset_writer.

### Are these changes tested?

Yes. Unit tests are added for normal and ill formed lambdas.

### Are there any user-facing changes?

Yes. It allows user to customize output file names.
* Closes: apache#34565
HaochengLIU added a commit to HaochengLIU/arrow that referenced this issue Apr 8, 2023
…functor

### Rationale for this change

Existing basename_template will only use a monotonically increasing int
as new filenames. when there is needs for custom filenames(left padding,
hash-code), downstream users must rename the files in a post processing
step.

### What changes are included in this PR?

A new functor is added to FileSystemDatasetWriteOptions which allows
users to provide a custom name for dataset_writer.

### Are these changes tested?

Yes. Unit tests are added for normal and ill formed lambdas.

### Are there any user-facing changes?

Yes. It allows user to customize output file names.
* Closes: apache#34565
HaochengLIU added a commit to HaochengLIU/arrow that referenced this issue Apr 12, 2023
…functor

### Rationale for this change

Existing basename_template will only use a monotonically increasing int
as new filenames. when there is needs for custom filenames(left padding,
hash-code), downstream users must rename the files in a post processing
step.

### What changes are included in this PR?

A new functor is added to FileSystemDatasetWriteOptions which allows
users to provide a custom name for dataset_writer.

### Are these changes tested?

Yes. Unit tests are added for normal and ill formed lambdas.

### Are there any user-facing changes?

Yes. It allows user to customize output file names.
* Closes: apache#34565
westonpace pushed a commit that referenced this issue Apr 12, 2023
#34984)

### Rationale for this change

Existing basename_template will only use a monotonically increasing int as new filenames. when there is needs for custom filenames(left padding, hash-code), downstream users must rename the files in a post-processing step.

### What changes are included in this PR?

A new functor is added to FileSystemDatasetWriteOptions which allows users to provide a custom name for dataset_writer.

### Are these changes tested?

Yes. Unit tests are added for normal and ill-formed lambdas.

### Are there any user-facing changes?

Yes. It allows users to customize output file names.
* Closes: #34565

Authored-by: Haocheng Liu <lbtinglb@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
@westonpace westonpace added this to the 12.0.0 milestone Apr 12, 2023
liujiacheng777 pushed a commit to LoongArch-Python/arrow that referenced this issue May 11, 2023
…functor (apache#34984)

### Rationale for this change

Existing basename_template will only use a monotonically increasing int as new filenames. when there is needs for custom filenames(left padding, hash-code), downstream users must rename the files in a post-processing step.

### What changes are included in this PR?

A new functor is added to FileSystemDatasetWriteOptions which allows users to provide a custom name for dataset_writer.

### Are these changes tested?

Yes. Unit tests are added for normal and ill-formed lambdas.

### Are there any user-facing changes?

Yes. It allows users to customize output file names.
* Closes: apache#34565

Authored-by: Haocheng Liu <lbtinglb@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…functor (apache#34984)

### Rationale for this change

Existing basename_template will only use a monotonically increasing int as new filenames. when there is needs for custom filenames(left padding, hash-code), downstream users must rename the files in a post-processing step.

### What changes are included in this PR?

A new functor is added to FileSystemDatasetWriteOptions which allows users to provide a custom name for dataset_writer.

### Are these changes tested?

Yes. Unit tests are added for normal and ill-formed lambdas.

### Are there any user-facing changes?

Yes. It allows users to customize output file names.
* Closes: apache#34565

Authored-by: Haocheng Liu <lbtinglb@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
rtpsw pushed a commit to rtpsw/arrow that referenced this issue May 16, 2023
…functor (apache#34984)

### Rationale for this change

Existing basename_template will only use a monotonically increasing int as new filenames. when there is needs for custom filenames(left padding, hash-code), downstream users must rename the files in a post-processing step.

### What changes are included in this PR?

A new functor is added to FileSystemDatasetWriteOptions which allows users to provide a custom name for dataset_writer.

### Are these changes tested?

Yes. Unit tests are added for normal and ill-formed lambdas.

### Are there any user-facing changes?

Yes. It allows users to customize output file names.
* Closes: apache#34565

Authored-by: Haocheng Liu <lbtinglb@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants