avro2athena

Generates a create table statement for AWS Athena from an avro schema.
The statement can be used to create an Athena table from (partitioned) avro files.

Usage

See the example in run_example.py. It works in python3.7 with avro-python3==1.9.1.
The partition statement is optional. It should be of the form PARTITIONED BY (year string, month string, day string). If you don't have partitions, set partition_statement = ''.
It is assumed that the most outer schema type is Record.
Aliases in the avro schema are allowed.
The schema tree is analyzed recursively, so trees of arbitrary depth are allowed.

Example

Taking the following standard example schema as input:

{
  "namespace": "com.linkedin.haivvreo",
  "name": "test_serializer",
  "type": "record",
  "fields": [
    { "name":"string1", "type":"string" },
    { "name":"int1", "type":"int" },
    { "name":"tinyint1", "type":"int" },
    { "name":"smallint1", "type":"int" },
    { "name":"bigint1", "type":"long" },
    { "name":"boolean1", "type":"boolean" },
    { "name":"float1", "type":"float" },
    { "name":"double1", "type":"double" },
    { "name":"list1", "type":{"type":"array", "items":"string"} },
    { "name":"map1", "type":{"type":"map", "values":"int"} },
    { "name":"struct1", "type":{"type":"record", "name":"struct1_name", "fields": [
          { "name":"sInt", "type":"int" }, { "name":"sBoolean", "type":"boolean" }, { "name":"sString", "type":"string" } ] } },
    { "name":"union1", "type":["float", "boolean", "string"] },
    { "name":"enum1", "type":{"type":"enum", "name":"enum1_values", "symbols":["BLUE","RED", "GREEN"]} },
    { "name":"nullableint", "type":["int", "null"] },
    { "name":"bytes1", "type":"bytes" },
    { "name":"fixed1", "type":{"type":"fixed", "name":"threebytes", "size":3} }
  ] }

we obtain the following Athena create table output:

    CREATE EXTERNAL TABLE IF NOT EXISTS 
    `my_database`.`my_table`
    (`string1` string, `int1` int, `tinyint1` int, `smallint1` int, `bigint1` bigint, `boolean1` boolean, `float1` float, `double1` double, `list1` array<string>, `map1` map<string,int>, `struct1` struct<sint:int,sboolean:boolean,sstring:string>, `union1` float, `enum1` string, `nullableint` int, `bytes1` bytes, `fixed1` string) 
    PARTITIONED BY (year string, month string, day string)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    WITH SERDEPROPERTIES ('avro.schema.literal'='{"namespace": "com.linkedin.haivvreo", "name": "test_serializer", "type": "record", "fields": [{"name": "string1", "type": "string"}, {"name": "int1", "type": "int"}, {"name": "tinyint1", "type": "int"}, {"name": "smallint1", "type": "int"}, {"name": "bigint1", "type": "long"}, {"name": "boolean1", "type": "boolean"}, {"name": "float1", "type": "float"}, {"name": "double1", "type": "double"}, {"name": "list1", "type": {"type": "array", "items": "string"}}, {"name": "map1", "type": {"type": "map", "values": "int"}}, {"name": "struct1", "type": {"type": "record", "name": "struct1_name", "fields": [{"name": "sInt", "type": "int"}, {"name": "sBoolean", "type": "boolean"}, {"name": "sString", "type": "string"}]}}, {"name": "union1", "type": ["float", "boolean", "string"]}, {"name": "enum1", "type": {"type": "enum", "name": "enum1_values", "symbols": ["BLUE", "RED", "GREEN"]}}, {"name": "nullableint", "type": ["int", "null"]}, {"name": "bytes1", "type": "bytes"}, {"name": "fixed1", "type": {"type": "fixed", "name": "threebytes", "size": 3}}]}') 
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' 
    LOCATION 's3://my_bucket/my_folder/'

Translations of data types

For a Primitive, long is translated to bigint.
Array becomes array<>.
Map becomes map<>.
Record becomes struct<>.
For a Union, the first non null type is chosen.
Enum and Fixed become string, respectively.

Some hints for importing avro files as AWS Athena table

Unions with mutliple primitive types: an example is ["float", "boolean", "string"]. In general, these kind of data types might cause trouble in Athena. Our schema creator will pick the first element float form the union. Consider changing such multiple types to string.
Infer the schema directly from .avro data files: see util/avro_file_schema_parser.py for how to do that from a single file.
- Useful if, for whatever reason, the schema of the files is not available.
- Useful if Athena does not like the schema for some reason - even though it works with other systems. We experienced examples where inferring the schema from a data file solves the problem. In this case in create_athena_table_statement_from_avsc replace parse_literal_schema_from_file by infer_schema_from_avro_file.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
create_statement_creator		create_statement_creator
example_schemas		example_schemas
util		util
.gitignore		.gitignore
LICENSE		LICENSE
Readme.md		Readme.md
requirements.txt		requirements.txt
run_example.py		run_example.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

create_statement_creator

create_statement_creator

example_schemas

example_schemas

util

util

.gitignore

.gitignore

LICENSE

LICENSE

Readme.md

Readme.md

requirements.txt

requirements.txt

run_example.py

run_example.py

Repository files navigation

avro2athena

Usage

Example

Translations of data types

Some hints for importing avro files as AWS Athena table

About

Releases

Packages

Contributors 2

Languages

License

dataqube-GmbH/avro2athena

Folders and files

Latest commit

History

Repository files navigation

avro2athena

Usage

Example

Translations of data types

Some hints for importing avro files as AWS Athena table

About

Resources

License

Stars

Watchers

Forks

Languages