
ARROW-10514: [C++][Parquet] Make the column name the same for both output formats of parquet reader #9649

Closed
wants to merge 3 commits

Conversation

FawnD2
Contributor

FawnD2 commented Mar 6, 2021

In parquet-reader there are two ways to output the schema of a Parquet file: DebugPrint and JSONPrint. In the JSON output, the column name is the short name rather than the fully-qualified name. For example, for schema (1) below, there are two columns with `"Name": "key"`, which is very confusing.

In this PR we use the fully-qualified name for each column in JSONPrint instead of the short name, matching DebugPrint (see the sketch after the schema below).

(1):

required group field_id=0 spark_schema {
  optional group field_id=1 a (Map) {
    repeated group field_id=2 key_value {
      required binary field_id=3 key (String);
      optional group field_id=4 value (Map) {
        repeated group field_id=5 key_value {
          required int32 field_id=6 key;
          required boolean field_id=7 value;
        }
      }
    }
  }
}
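
For context, here is a small illustrative sketch of the two naming variants exposed by the Parquet C++ schema API (this is not the actual patch, and `PrintColumnNames` is just a made-up helper): `ColumnDescriptor::name()` returns the short leaf name, while `ColumnDescriptor::path()->ToDotString()` returns the fully-qualified, dot-separated path that DebugPrint prints and which this PR presumably adopts for JSONPrint as well.

```cpp
#include <iostream>
#include <memory>
#include <string>

#include <parquet/file_reader.h>
#include <parquet/metadata.h>
#include <parquet/schema.h>

// Print both naming variants for every leaf column of a Parquet file.
// descr->name() is the short leaf name (what JSONPrint reported before);
// descr->path()->ToDotString() is the fully-qualified, dot-separated path
// that DebugPrint already uses.
void PrintColumnNames(const std::string& file_path) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(file_path);
  const parquet::SchemaDescriptor* schema = reader->metadata()->schema();
  for (int i = 0; i < schema->num_columns(); ++i) {
    const parquet::ColumnDescriptor* descr = schema->Column(i);
    std::cout << "short: " << descr->name()
              << "  full: " << descr->path()->ToDotString() << std::endl;
  }
}
```

For schema (1), the two leaf `key` columns then come out as `a.key_value.key` and `a.key_value.value.key_value.key` instead of both being reported as `key`.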

@pitrou
Member

pitrou commented Mar 9, 2021

This looks reasonable to me. @emkornfield what do you think?

@maysupan

maysupan commented Mar 9, 2021

Yes

@emkornfield
Contributor

LGTM as well.

@pitrou
Member

pitrou commented Mar 10, 2021

Ok, I'll merge this PR then. Thank you @FawnD2 !

@pitrou pitrou closed this in ec7bf98 Mar 10, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…tput formats of parquet reader

Closes apache#9649 from FawnD2/patch-1

Authored-by: FawnD2 <zzosimova@ya.ru>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…tput formats of parquet reader
