[C++][Parquet] Skip method skips levels and not rows for repeated fields #42995

Open
asfimport opened this issue Aug 23, 2022 · 1 comment

Comments

@asfimport

The TypedColumnReader::Skip method, with signature:

virtual int64_t Skip(int64_t num_levels_to_skip) = 0;

skips levels for both repeated and non-repeated fields. For repeated fields we want to be able to skip rows; skipping levels is not that useful, because a single row can span several levels.

For example, for the following rows:

message M { repeated int32 b = 1 }

rows: {}, {[10,10]}, {[20, 20, 20]}

values = {10, 10, 20, 20, 20};
def_levels = {0, 1, 1, 1, 1, 1};
rep_levels = {0, 0, 1, 0, 1, 1};

We want Skip(2) to skip the first two rows, so that the next value that we read is 20. However, it skips the first two levels (the empty row and the first 10), so the next value that we read is the second 10.
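
To make the difference concrete, here is a minimal sketch of the two behaviors on the example above. It is not the Arrow/Parquet reader code: `Column`, `Position`, `SkipLevels`, and `SkipRows` are hypothetical names used only for illustration, and it assumes a maximum definition level of 1 (consistent with the def_levels shown) and that a new row starts wherever the repetition level is 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a fully decoded column; not the Parquet C++ API.
struct Column {
  std::vector<int32_t> values;
  std::vector<int16_t> def_levels;
  std::vector<int16_t> rep_levels;
  int16_t max_def_level;  // assumed to be 1 for the example schema
};

// Where reading would resume after a skip.
struct Position {
  std::size_t level_idx;  // index of the next (def, rep) level pair
  std::size_t value_idx;  // index of the next value
};

// Skips `n` level pairs, regardless of row boundaries
// (the behavior described in this issue).
Position SkipLevels(const Column& col, int64_t n) {
  assert(col.def_levels.size() == col.rep_levels.size());
  Position pos{0, 0};
  while (pos.level_idx < col.def_levels.size() && n > 0) {
    // Only levels at the maximum definition level carry a value.
    if (col.def_levels[pos.level_idx] == col.max_def_level) ++pos.value_idx;
    ++pos.level_idx;
    --n;
  }
  return pos;
}

// Skips `n` whole rows, where a row begins at each rep_level == 0
// (the behavior requested in this issue).
Position SkipRows(const Column& col, int64_t n) {
  assert(col.def_levels.size() == col.rep_levels.size());
  Position pos{0, 0};
  int64_t rows_started = 0;
  while (pos.level_idx < col.rep_levels.size()) {
    if (col.rep_levels[pos.level_idx] == 0) {
      if (rows_started == n) break;  // reached the first row to keep
      ++rows_started;
    }
    if (col.def_levels[pos.level_idx] == col.max_def_level) ++pos.value_idx;
    ++pos.level_idx;
  }
  return pos;
}
```

With the data above, `SkipLevels(col, 2)` stops in the middle of the second row, while `SkipRows(col, 2)` stops at the level where the third row begins.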

Reporter: fatemah / @fatemehp

Note: This issue was originally created as PARQUET-2175. Please see the migration documentation for further details.

@asfimport

Micah Kornfield / @emkornfield:
I think the current signature is Skip(num_rows_to_skip), which is why this is confusing. The docs seem accurate, although they can probably be clarified. Given that, I think a new SkipRows method makes sense, and we should rename the variable as you suggested.
