
[C++][JSON] kMaxParserNumRows Value Increase/Removal #28994

Closed
asfimport opened this issue Jul 12, 2021 · 10 comments · Fixed by #38582

Comments

@asfimport (Collaborator)

I'm a new PyArrow user and have been investigating occasional errors related to the Python exception "ArrowInvalid: Exceeded maximum rows" when parsing JSON Lines files using pyarrow.json.read_json(). Digging in, it looks like the original source of this exception is cpp/src/arrow/json/parser.cc, line 703, which appears to produce the error when the number of lines processed exceeds kMaxParserNumRows.

    for (; num_rows_ < kMaxParserNumRows; ++num_rows_) {
      auto ok = reader.Parse<parse_flags>(json, handler);
      switch (ok.Code()) {
        case rj::kParseErrorNone:
          // parse the next object
          continue;
        case rj::kParseErrorDocumentEmpty:
          // parsed all objects, finish
          return Status::OK();
        case rj::kParseErrorTermination:
          // handler emitted an error
          return handler.Error();
        default:
          // rj emitted an error
          return ParseError(rj::GetParseError_En(ok.Code()), " in row ", num_rows_);
      }
    }
    return Status::Invalid("Exceeded maximum rows");
  }

 

This constant appears to be set in arrow/json/parser.h on line 53, and has been set this way since that file's initial commit.

constexpr int32_t kMaxParserNumRows = 100000;

 

There does not appear to be a comment in the code, the commit, or the PR explaining this maximum number of lines.

 

I'm wondering what the reason for this maximum might be, and whether it could be removed, increased, or made overridable in the C++ library and the upstream Python bindings. It is common to need to process JSON files of arbitrary length (logs from applications, third-party vendors, etc.) where the user of the data has no control over the size of the file.
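
For illustration, here is a minimal sketch of the kind of read that can hit this limit; the file name, row count, and block size are assumptions, not details from an actual workload:

```python
# Minimal reproduction sketch; the file name, row count, and block size are
# illustrative assumptions, not taken from the original report.
import pyarrow.json as pajson

path = "tiny_rows.jsonl"
with open(path, "w") as f:
    for _ in range(150_000):
        f.write('{"x": 1}\n')  # many very small rows

# A large block_size forces all rows into a single parse block. On Arrow
# versions where kMaxParserNumRows was fixed at 100000, parsing a block with
# more rows than that failed with "ArrowInvalid: Exceeded maximum rows";
# after GH-38582 the same read succeeds.
opts = pajson.ReadOptions(block_size=64 << 20)
table = pajson.read_json(path, read_options=opts)
print(table.num_rows)  # 150000
```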

Reporter: Ryan Stalets

Note: This issue was originally created as ARROW-13318. Please see the migration documentation for further details.

@Ox0400 (Contributor) commented Sep 9, 2023

Hello, is there any update on this issue? I'm seeing the same error.

@kou (Member) commented Sep 10, 2023

@bkietz Do you remember why we have kMaxParserNumRows in our JSON parser?

@bkietz (Member) commented Sep 11, 2023

This was copied from the CSV parser, which was my model when writing the JSON parser. <csv comments> I believe originally the intent there was to ensure that a very large chunk of csv text is parsed in batches. I'm not sure the limit has ever been hit in CSV; that would be a very large chunk and we don't seem to have a test for it which goes through the reader. These days the limit seems to be used exclusively for limiting parsing to column headers when we don't need any actual data. @pitrou </csv comments>

The JSON parser doesn't have a use for this limit, I think.

kou changed the title from "kMaxParserNumRows Value Increase/Removal" to "[C++][JSON] kMaxParserNumRows Value Increase/Removal" Sep 12, 2023
@kou (Member) commented Sep 12, 2023

OK. Does someone want to work on this?

@Ox0400 (Contributor) commented Sep 12, 2023

Maybe we can keep kMaxParserNumRows and add an environment variable like MAX_PARSER_NUM_ROWS to override it, with unlimited rows for reading JSON or CSV when the value is -1 or 0.

@pitrou (Member) commented Sep 12, 2023

kMaxParserNumRows in the CSV parser was probably misnamed. It's simply the default value for the max_num_rows parameter, but it can be increased. Perhaps we should rename it kDefaultNumRows?

The JSON parser should make this configurable as well.

@Ox0400 (Contributor) commented Sep 27, 2023

Hi, is there any new information on this issue?

@Ox0400 (Contributor) commented Oct 20, 2023

Any roadmap?

@bkietz (Member) commented Oct 20, 2023

I'm not aware of anyone getting ready to work on this issue. If you wanted to open a PR, I could give you a review.

@Ox0400 (Contributor) commented Nov 4, 2023

> I'm not aware of anyone getting ready to work on this issue. If you wanted to open a PR, I could give you a review.

Hi @bkietz, I opened PR #38582 and am waiting for your review.

Ox0400 added a commit to Ox0400/arrow that referenced this issue Nov 10, 2023
bkietz pushed a commit that referenced this issue Nov 27, 2023
### Rationale for this change
Remove the limit on the maximum number of rows when parsing a JSON block. Raise a `Row count overflowed int32_t` error if the row count exceeds the maximum value of `int32_t`.

See issue: #28994

### What changes are included in this PR?
Delete the `kMaxParserNumRows` constant, with minor C++ code modifications and new Python test code.

### Are these changes tested?
Yes. A new test parses a large JSON input (100100 rows).

### Are there any user-facing changes?
No.

* Closes: #28994

Lead-authored-by: zhipeng <zhipeng.py@gmail.com>
Co-authored-by: zhipeng <5310853+Ox0400@users.noreply.github.com>
Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>
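
For reference, here is a rough sketch of the kind of regression check the PR describes; this is not the actual test code from the PR, and the file path and function name are assumptions:

```python
# Rough sketch of the regression check described above; not the actual test
# from apache/arrow. The file path, helper name, and block size are assumptions.
import pyarrow.json as pajson

def check_read_json_past_old_limit(path="large_rows.jsonl"):
    n_rows = 100_100  # just past the old kMaxParserNumRows limit of 100000
    with open(path, "w") as f:
        for i in range(n_rows):
            f.write('{"id": %d}\n' % i)

    # Force a single parse block so the old per-block limit would have applied.
    opts = pajson.ReadOptions(block_size=64 << 20)
    table = pajson.read_json(path, read_options=opts)
    assert table.num_rows == n_rows

if __name__ == "__main__":
    check_read_json_past_old_limit()
    print("ok")
```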
bkietz added this to the 15.0.0 milestone Nov 27, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024