
ARROW-10514: [C++][Parquet] Make the column name the same for both output formats of parquet reader #9649

Closed
wants to merge 3 commits

Conversation

FawnD2
Contributor

FawnD2 commented Mar 6, 2021

In parquet-reader there are two ways to output the schema of a Parquet file: DebugPrint and JSONPrint. In the JSON output, the column name is the short name rather than the fully-qualified name. For example, for schema (1) below, there are two columns with `"Name": "key"`, which is very confusing.

In this PR we use the fully-qualified name for each column in JSONPrint instead of the short name, matching DebugPrint (see the sketch after the schema below).

(1):

required group field_id=0 spark_schema {
  optional group field_id=1 a (Map) {
    repeated group field_id=2 key_value {
      required binary field_id=3 key (String);
      optional group field_id=4 value (Map) {
        repeated group field_id=5 key_value {
          required int32 field_id=6 key;
          required boolean field_id=7 value;
        }
      }
    }
  }
}
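
For context, here is a small illustrative sketch of the two naming variants exposed by the Parquet C++ schema API (this is not the actual patch, and `PrintColumnNames` is just a made-up helper): `ColumnDescriptor::name()` returns the short leaf name, while `ColumnDescriptor::path()->ToDotString()` returns the fully-qualified, dot-separated path that DebugPrint prints and which this PR presumably adopts for JSONPrint as well.

```cpp
#include <iostream>
#include <memory>
#include <string>

#include <parquet/file_reader.h>
#include <parquet/metadata.h>
#include <parquet/schema.h>

// Print both naming variants for every leaf column of a Parquet file.
// descr->name() is the short leaf name (what JSONPrint reported before);
// descr->path()->ToDotString() is the fully-qualified, dot-separated path
// that DebugPrint already uses.
void PrintColumnNames(const std::string& file_path) {
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::OpenFile(file_path);
  const parquet::SchemaDescriptor* schema = reader->metadata()->schema();
  for (int i = 0; i < schema->num_columns(); ++i) {
    const parquet::ColumnDescriptor* descr = schema->Column(i);
    std::cout << "short: " << descr->name()
              << "  full: " << descr->path()->ToDotString() << std::endl;
  }
}
```

For schema (1), the two leaf `key` columns then come out as `a.key_value.key` and `a.key_value.value.key_value.key` instead of both being reported as `key`.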

@pitrou
Member

pitrou commented Mar 9, 2021

This looks reasonable to me. @emkornfield what do you think?

@maysupan

maysupan commented Mar 9, 2021

Yes

@emkornfield
Contributor

LGTM as well.

@pitrou
Member

pitrou commented Mar 10, 2021

Ok, I'll merge this PR then. Thank you @FawnD2 !

@pitrou pitrou closed this in ec7bf98 Mar 10, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…tput formats of parquet reader

Closes apache#9649 from FawnD2/patch-1

Authored-by: FawnD2 <zzosimova@ya.ru>
Signed-off-by: Antoine Pitrou <antoine@python.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…tput formats of parquet reader
