Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-33749: [Ruby] Add Arrow::RecordBatch#each_raw_record #37137

Merged
merged 23 commits into from
Sep 5, 2023

Conversation

otegami
Copy link
Contributor

@otegami otegami commented Aug 12, 2023

Rationale for this change

This change allows for efficient iteration over large datasets, particularly those utilizing the Apache Parquet format.

What changes are included in this PR?

  • Add the following methods to make the raw_records method iterable.
    • Arrow::RecordBatch#each_raw_record
    • Arrow::Table#each_raw_record
  • Add related test

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

This PR is related to #33749

Add Arrow::Table#each_raw_record to make Arrow::Table#raw_records be iterable.
@otegami otegami requested a review from kou as a code owner August 12, 2023 08:41
@github-actions
Copy link

⚠️ GitHub issue #33749 has been automatically assigned in GitHub to PR creator.

n_columns_(n_columns) {
record_(Qnil),
n_columns_(n_columns),
is_produce_mode_(false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about creating RawRecordProducer instead of adding a new "produce" mode to RawRecordsBuilder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: 3318abc
Sounds pretty nice to me!
I tried to split RawRecordProducer into its own source file because some methods had to know the row-wise processing situation.
What do you think of it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I should have written RawRecordsProducer not RawRecordProducer because it produces multiple raw records.

So I think that we can still use raw-records.cpp instead of raw-record.cpp because it also handles multiple raw records not just one raw record.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: d3e5f3f

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Aug 12, 2023
To enhance clarity and maintainability:
- Implemented the new `RawRecordProducer` class, dedicated to processing records on a row-by-row basis.
- This specialization eliminates conditional branching within the `convert` method previously present in the RawRecordsBuilder.
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 14, 2023
@otegami otegami requested a review from kou August 14, 2023 12:25
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 15, 2023
@kou
Copy link
Member

kou commented Aug 15, 2023

Could you try writing EachRawRecordBasicArraysTest#test_boolean into ruby/red-arrow/test/each-raw-record/test-basic-arrays.rb?
It will show what should we do as the next step.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 16, 2023
@otegami otegami force-pushed the ruby-arrow-table-each-raw-record branch from ec6f5a7 to 93adaac Compare August 17, 2023 12:25
@otegami
Copy link
Contributor Author

otegami commented Aug 17, 2023

Could you try writing EachRawRecordBasicArraysTest#test_boolean into ruby/red-arrow/test/each-raw-record/test-basic-arrays.rb?
It will show what should we do as the next step.

I got it. So as the next step, I will add the tests about ArrowTable#each_raw_record and Arrow::RecordBatch#each_raw_record as the following.
fix: 93adaac

const auto& chunked_array = table.column(i).get();
column_index_ = i;

for (const auto array : chunked_array->chunks()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we miss & here to avoid needless reference count increment:

Suggested change
for (const auto array : chunked_array->chunks()) {
for (const auto& array : chunked_array->chunks()) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: 34b53e0
I changed the logic of void produce(const arrow::Table& table). So could you recheck this point again?🙏


class RawRecordsProducer : private Converter, public arrow::ArrayVisitor {
public:
explicit RawRecordsProducer(int n_columns)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove n_columns here and define n_columns in each produce()?

 void produce(const arrow::RecordBatch& record_batch) {
   auto n_columns = record_batch->num_columns();
   // ...
 }
 
 void produce(const arrow::Table& table) {
   auto n_columns = table->num_columns();
   // ...
 }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: b0731a7

include EachRawRecordBasicArraysTests

def build(schema, records)
Arrow::Table.new(schema, records)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you create multiple chunked arrays?

Suggested change
Arrow::Table.new(schema, records)
Arrow::Table.new(schema,
[
Arrow::RecordBatch.new(schema, records[0, 2]),
Arrow::RecordBatch.new(schema, records[2..-1]),
])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: 34b53e0

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Aug 18, 2023
@github-actions github-actions bot removed the awaiting changes Awaiting changes label Aug 19, 2023
@otegami
Copy link
Contributor Author

otegami commented Aug 26, 2023

Upcoming Tasks

  • test-dense-union-array
  • test-dictionary-array
  • test-list-array
  • test-map-array
  • test-multiple-columns
  • test-sparse-union-array
  • test-sparse-union-array
  • test-table

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 28, 2023
@otegami
Copy link
Contributor Author

otegami commented Sep 4, 2023

@kou
I added the test cases like what we did with raw_records. (It means I don't change the test cases which is used in Arrow::Table#raw_records and Arrow::RecordBatch#raw_records.)
On the other hand, I'm concerned that the code differences are too much.
If it's better to split the PR for each test case, I'd like to make those adjustments.

@otegami otegami requested a review from kou September 4, 2023 11:53
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

I'll merge this as-is. We can improve this by follow-up tasks. Could you open new issues for the followings?

  • Use an empty chunked array for Arrow::Table tests
  • Add support for table.each_raw_record.to_a
  • Unify tests with raw_records and each_raw_record

(If you have more improvement ideas, please open new issues for them. We can discuss further on them.)

include EachRawRecordDenseUnionArrayTests

def build(type, records)
build_record_batch(type, records).to_table
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I've just created this issue here: #37561

@kou kou merged commit 7dd8624 into apache:main Sep 5, 2023
10 checks passed
@kou kou removed the awaiting change review Awaiting change review label Sep 5, 2023
@github-actions github-actions bot added the awaiting merge Awaiting merge label Sep 5, 2023
@otegami otegami deleted the ruby-arrow-table-each-raw-record branch September 5, 2023 01:59
@otegami
Copy link
Contributor Author

otegami commented Sep 5, 2023

@kou
Thank you for reviewing. I've opened the following issues. Please let me know if there's anything missing.
Also, if I come up with an improvement idea, I will open it.

Use an empty chunked array for Arrow::Table tests

#37561

Add support for table.each_raw_record.to_a

#37562

Unify tests with raw_records and each_raw_record

#37563

@kou
Copy link
Member

kou commented Sep 6, 2023

Thanks!

@kou
Copy link
Member

kou commented Sep 6, 2023

If you have some issues that you want to work on, please add take comment on them and assign them to you.

@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 7dd8624.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.

kou pushed a commit that referenced this pull request Sep 11, 2023
### Rationale for this change

This change aligns the behavior of `each_raw_record` with standard Ruby practices by returning an enumerator when no block is provided

### What changes are included in this PR?

- Made `Arrow::Table#each_raw_record` and `Arrow::RecordBatch#each_raw_record` return Enumerator when it was called without block.
- Added related tests
- Resolved warnings related to duplicate test classes which were caused by #37137

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No

* Closes: #37562

Authored-by: otegami <a.s.takuya1026@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…#37137)

### Rationale for this change

This change allows for efficient iteration over large datasets, particularly those utilizing the Apache Parquet format.

### What changes are included in this PR?

- Add the following methods to make the raw_records method iterable.
  - Arrow::RecordBatch#each_raw_record
  - Arrow::Table#each_raw_record
- Add related test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

This PR is related to  apache#33749
* Closes: apache#33749

Lead-authored-by: otegami <a.s.takuya1026@gmail.com>
Co-authored-by: takuya kodama <a.s.takuya1026@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…ache#37600)

### Rationale for this change

This change aligns the behavior of `each_raw_record` with standard Ruby practices by returning an enumerator when no block is provided

### What changes are included in this PR?

- Made `Arrow::Table#each_raw_record` and `Arrow::RecordBatch#each_raw_record` return Enumerator when it was called without block.
- Added related tests
- Resolved warnings related to duplicate test classes which were caused by apache#37137

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No

* Closes: apache#37562

Authored-by: otegami <a.s.takuya1026@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…#37137)

### Rationale for this change

This change allows for efficient iteration over large datasets, particularly those utilizing the Apache Parquet format.

### What changes are included in this PR?

- Add the following methods to make the raw_records method iterable.
  - Arrow::RecordBatch#each_raw_record
  - Arrow::Table#each_raw_record
- Add related test

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.

This PR is related to  apache#33749
* Closes: apache#33749

Lead-authored-by: otegami <a.s.takuya1026@gmail.com>
Co-authored-by: takuya kodama <a.s.takuya1026@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…ache#37600)

### Rationale for this change

This change aligns the behavior of `each_raw_record` with standard Ruby practices by returning an enumerator when no block is provided

### What changes are included in this PR?

- Made `Arrow::Table#each_raw_record` and `Arrow::RecordBatch#each_raw_record` return Enumerator when it was called without block.
- Added related tests
- Resolved warnings related to duplicate test classes which were caused by apache#37137

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No

* Closes: apache#37562

Authored-by: otegami <a.s.takuya1026@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Ruby] Add Arrow::RecordBatch#each_raw_record
2 participants