Athena MAP support needs MAP_KEY_VALUE type for inner group #50

Open
SeanLMcCullough opened this issue Sep 14, 2020 · 0 comments
@SeanLMcCullough

After experimenting with the MAP type for Athena, it appears that the generated structure is not quite right.

Here is the schema output from parquet-tools for the MAP data generated by Kinesis Firehose:

  optional group my_data (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }

Note the MAP_KEY_VALUE annotation on repeated group map.

However, I generated the map data type with this schema:

...
"my_data": {
  "type": "MAP",
  "fields": {
    "map": {
      "repeated": true,
      "fields": {
        "key": {
          "type": keyType,
          "optional": true
        },
        "value": {
          "type": valueType,
          "optional": true
        }
      }
    }
  }
}
...

The library's output has the following schema, as reported by parquet-tools:

  optional group my_data (MAP) {
    repeated group map {
      optional binary key (STRING);
      optional binary value (STRING);
    }
  }

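Note that repeated group map omits the MAP_KEY_VALUE annotation in this schema. The key field is also optional here rather than required, which follows from "optional": true on key in the schema I supplied.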
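
To make the comparison concrete, the two parquet-tools dumps can be diffed field by field. This is only a sketch: parseSchemaDump and diffSchemas are hypothetical helpers written for this issue, not part of this library or of parquet-tools, and the regular expression only handles the simple line shapes shown in the dumps here.

```javascript
// Hypothetical helper: parse a parquet-tools schema dump into
// Map(field path -> { repetition, type, annotation }).
function parseSchemaDump(text) {
  const fields = new Map();
  const stack = []; // names of the enclosing groups
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;
    if (line.startsWith("}")) {
      stack.pop(); // leaving a group
      continue;
    }
    // Matches e.g. "repeated group map (MAP_KEY_VALUE) {"
    // or "required binary key (STRING);"
    const m = line.match(
      /^(required|optional|repeated)\s+(\w+)\s+(\w+)(?:\s+\((\w+)\))?\s*([{;])$/
    );
    if (!m) continue; // skip lines this sketch does not understand
    const [, repetition, type, name, annotation = null, terminator] = m;
    fields.set([...stack, name].join("."), { repetition, type, annotation });
    if (terminator === "{") stack.push(name); // entering a group
  }
  return fields;
}

// Hypothetical helper: list every attribute that differs between two dumps.
function diffSchemas(expected, actual) {
  const differences = [];
  for (const [path, exp] of expected) {
    const act = actual.get(path);
    if (!act) {
      differences.push(`${path}: missing`);
      continue;
    }
    for (const key of ["repetition", "type", "annotation"]) {
      if (exp[key] !== act[key]) {
        differences.push(`${path}: ${key} ${exp[key]} -> ${act[key]}`);
      }
    }
  }
  return differences;
}

// Schema dump of the Kinesis Firehose data (copied from this issue).
const firehoseDump = `
  optional group my_data (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required binary key (STRING);
      optional binary value (STRING);
    }
  }
`;

// Schema dump of the data written by this library (copied from this issue).
const libraryDump = `
  optional group my_data (MAP) {
    repeated group map {
      optional binary key (STRING);
      optional binary value (STRING);
    }
  }
`;

const differences = diffSchemas(
  parseSchemaDump(firehoseDump),
  parseSchemaDump(libraryDump)
);
console.log(differences.join("\n"));
// my_data.map: annotation MAP_KEY_VALUE -> null
// my_data.map.key: repetition required -> optional
```

The diff reports two mismatches: the missing MAP_KEY_VALUE annotation on the inner group, and the repetition of key. I have not verified which of the two the Glue crawler actually reacts to.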

As a result, the AWS Glue crawler sees the two schemas differently.
For the data generated by Kinesis Firehose, the schema parsed by Glue appears as follows:
[Screenshot: Screen Shot 2020-09-14 at 3 01 10 pm]

However, for the data generated by this library, the schema parsed by Glue appears as follows:
[Screenshot: Screen Shot 2020-09-14 at 3 01 01 pm]

I am unsure whether I am using the MAP part of this library incorrectly, however, as it is an undocumented feature. The structure of this schema is based on Parquet files generated by a Kinesis Firehose pipeline.
