[Ruby] Loading Large JSON Files with red-arrow Gem #39433

Closed
ashishbista opened this issue Jan 2, 2024 · 0 comments · Fixed by #39464

ashishbista commented Jan 2, 2024

I am using the red-arrow gem to process JSON files in my Ruby application. The gem works well with smaller JSON files, but loading a larger JSON file (2.3 MB) fails.

Versions:

* red-arrow: 14.0.2
* OS: macOS

The code snippet below illustrates how I am attempting to load the JSON file:

```ruby
table = Arrow::Table.load(json_file, format: :json)
```

Unfortunately, executing this code results in the following error:

```
~/gems/ruby-3.3.0/gems/gobject-introspection-4.2.0/lib/gobject-introspection/loader.rb:705:in `invoke': [json-reader][read]: Invalid: straddling object straddles two block boundaries (try to increase block size?) (Arrow::Error::Invalid)
	from ~/gems/ruby-3.3.0/gems/gobject-introspection-4.2.0/lib/gobject-introspection/loader.rb:705:in `invoke'
	from ~/gems/ruby-3.3.0/gems/gobject-introspection-4.2.0/lib/gobject-introspection/loader.rb:573:in `read'
	from ~/gems/ruby-3.3.0/gems/red-arrow-14.0.2/lib/arrow/table-loader.rb:258:in `block in load_as_json'
	from ~/gems/ruby-3.3.0/gems/red-arrow-14.0.2/lib/arrow/table-loader.rb:155:in `open_input_stream'
	from ~/gems/ruby-3.3.0/gems/red-arrow-14.0.2/lib/arrow/table-loader.rb:256:in `load_as_json'
	from ~/gems/ruby-3.3.0/gems/red-arrow-14.0.2/lib/arrow/table-loader.rb:98:in `load_by_reader'
	from ~/.rvm/gems/ruby-3.3.0/gems/red-arrow-14.0.2/lib/arrow/table-loader.rb:76:in `load_from_file'
	from ~/.rvm/gems/ruby-3.3.0/gems/red-arrow-14.0.2/lib/arrow/table-loader.rb:51:in `block in load'
	from ~/.rvm/gems/ruby-3.3.0/gems/red-arrow-14.0.2/lib/arrow/table-loader.rb:49:in `each'
	from ~/gems/ruby-3.3.0/gems/red-arrow-14.0.2/lib/arrow/table-loader.rb:49:in `load'
	from ~/gems/ruby-3.3.0/gems/red-arrow-14.0.2/lib/arrow/table-loader.rb:26:in `load'
	from ~/gems/ruby-3.3.0/gems/red-arrow-14.0.2/lib/arrow/table.rb:30:in `load'
	....
```
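For context, the message suggests that a single JSON object ran past the end of the block following the one it started in. The sketch below is a toy model of that condition, not Arrow's actual chunker: `straddles_two_boundaries?` is a hypothetical helper, and it assumes the input is cut into fixed-size blocks at multiples of `block_size` (1 MiB by default in Arrow's JSON read options).

```ruby
# Toy model (NOT Arrow's real chunker): assume the input is split into
# fixed-size blocks and report whether any record touches three or more
# blocks, i.e. crosses two block boundaries.
#
# record_sizes: byte length of each record (for example, each line)
# block_size:   block size in bytes
def straddles_two_boundaries?(record_sizes, block_size)
  offset = 0
  record_sizes.any? do |size|
    first_block = offset / block_size
    last_block = (offset + size - 1) / block_size
    offset += size
    last_block - first_block >= 2
  end
end

# A 2.3 MB file that is one single record does not fit the 1 MiB default:
puts straddles_two_boundaries?([2_300_000], 1 << 20) # prints "true"
# The same bytes as many short records are fine under this model:
puts straddles_two_boundaries?([2_300] * 1_000, 1 << 20) # prints "false"
```

Under this model, a file that holds one large object on a single line fails regardless of how small the overall file is relative to memory; only the size of the largest record relative to the block size matters.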

After researching the issue, I tried to work around it by passing a larger `block_size` to the `load` method:

```ruby
block_size = 1024 * 1024 * 1024 # Tried different values
table = Arrow::Table.load(json_file, format: :json, block_size: block_size)
```

However, judging from my reading of the source code, the `load` method does not appear to honor the `block_size` option for JSON.

Is there an existing way to adjust the block size when reading files with the red-arrow gem? If this is not currently supported, I would like to request a feature that either handles the block size dynamically or exposes an API to adjust it.
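One possible workaround on 14.0.2 is to bypass `Arrow::Table.load` and drive the JSON reader directly with an `Arrow::JSONReadOptions`. The following is a sketch under assumptions rather than a confirmed recipe: `suggested_block_size` and `load_json_with_block_size` are hypothetical helper names, the input is assumed to be newline-delimited JSON, and the `block_size` setter is assumed to be exposed on `Arrow::JSONReadOptions` through the GLib bindings.

```ruby
# Arrow's default JSON block size (1 MiB) and the largest value that fits
# the 32-bit integer the option is stored in.
DEFAULT_BLOCK_SIZE = 1 << 20
MAX_BLOCK_SIZE = (1 << 31) - 1

# Hypothetical helper: choose a block size that comfortably holds the
# longest record. `lines` is any enumerable of strings.
def suggested_block_size(lines, headroom: 2)
  longest = lines.map(&:bytesize).max || 0
  (longest * headroom).clamp(DEFAULT_BLOCK_SIZE, MAX_BLOCK_SIZE)
end

# Hypothetical helper: read a JSON file with explicit read options instead
# of going through Arrow::Table.load.
def load_json_with_block_size(path, block_size)
  require "arrow" # red-arrow gem, loaded lazily so the helper above needs no gem
  options = Arrow::JSONReadOptions.new
  options.block_size = block_size # assumed property setter
  input = Arrow::MemoryMappedInputStream.new(path)
  Arrow::JSONReader.new(input, options).read
end

# Usage (json_file is a path to newline-delimited JSON):
#   block_size = suggested_block_size(File.foreach(json_file))
#   table = load_json_with_block_size(json_file, block_size)
```

Sizing the block from the longest line avoids guessing a large constant, at the cost of one extra pass over the file.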

Let me know if you need anything else from my side.

Component(s)

Ruby

kou changed the title from "Loading Large JSON Files with red-arrow Gem" to "[Ruby] Loading Large JSON Files with red-arrow Gem" on Jan 4, 2024
kou closed this as completed in #39464 on Jan 5, 2024
kou added a commit that referenced this issue Jan 5, 2024
…9464)

### Rationale for this change

Other formats such as `format: :csv` accept custom options. `format: :json` should also accept them.

### What changes are included in this PR?

Use `Arrow::JSONReadOptions` for `Arrow::Table.load(format: :json)`.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.
* Closes: #39433

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
kou added this to the 15.0.0 milestone on Jan 5, 2024
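Given the 15.0.0 milestone, the original call from the issue should work once that release is installed. The sketch below guards on the installed version before relying on it. It rests on assumptions: `json_options_forwarded?` and `load_json_table` are hypothetical helpers, `Arrow::VERSION` is assumed to hold the red-arrow version string, and the option keys passed to `load` are assumed to map onto `Arrow::JSONReadOptions` properties such as `block_size`.

```ruby
require "rubygems" # provides Gem::Version (usually loaded already)

# First release expected to forward options for `format: :json`
# (the 15.0.0 milestone named in this thread).
FIRST_VERSION_WITH_JSON_OPTIONS = Gem::Version.new("15.0.0")

# Hypothetical helper: true if this red-arrow version should honor
# JSON read options passed to Arrow::Table.load.
def json_options_forwarded?(version)
  Gem::Version.new(version) >= FIRST_VERSION_WITH_JSON_OPTIONS
end

# Hypothetical wrapper: fail loudly instead of silently dropping block_size.
def load_json_table(path, block_size:)
  require "arrow" # red-arrow gem, loaded lazily
  unless json_options_forwarded?(Arrow::VERSION)
    raise ArgumentError,
          "red-arrow #{Arrow::VERSION} may not honor block_size for format: :json"
  end
  Arrow::Table.load(path, format: :json, block_size: block_size)
end

# Usage:
#   table = load_json_table(json_file, block_size: 8 * 1024 * 1024)
```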
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024