From 79e8b26b57252d2787374403dfd34deb70696482 Mon Sep 17 00:00:00 2001
From: Theo Hultberg Tolv
Date: Wed, 13 May 2020 16:25:16 +0200
Subject: [PATCH] Update json.md

The current documentation for the case.insensitive property is misleading or wrong. Athena does not require the data to have lower case keys, as is implied. It also left out a very important part of how to use the property, that without explicit mappings the properties will not be found.

What happens is that Athena will lower case keys. Column names will always be lower cased when you create the table through Athena (not sure if it's the same if you do it through Glue). This means that when `case.insensitive` is false Athena will look for a lower case key, but the serde will have preserved the casing of the keys, and you end up with NULL for all columns where the underlying key has upper case characters.

With the behaviour just described, the only reason to set this property is to get around duplicate key errors, and I've provided guidance for that in the documentation. By default, if you have the properties "URL" and "Url" you will get a duplicate key error (`HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key "url"`) because they will both be lower cased to the same string. By setting the property to false and providing mappings you can get around that problem.
---
 doc_source/json.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc_source/json.md b/doc_source/json.md
index 593596b..c256475 100644
--- a/doc_source/json.md
+++ b/doc_source/json.md
@@ -72,7 +72,7 @@ Optional\. When set to `TRUE`, lets you skip malformed JSON syntax\. The default
 Optional\. The default is `FALSE`\. When set to `TRUE`, allows the SerDe to replace the dots in key names with underscores\. For example, if the JSON dataset contains a key with the name `"a.b"`, you can use this property to define the column name to be `"a_b"` in Athena\.
By default \(without this SerDe\), Athena does not allow dots in column names\.
 
 **case\.insensitive**
-Optional\. By default, Athena requires that all keys in your JSON dataset use lowercase\. The default is `TRUE`\. When set to `TRUE`, the SerDe converts all uppercase columns to lowercase\. Using `WITH SERDEPROPERTIES ("case.insensitive"= FALSE;)` allows you to use case\-sensitive key names in your data\.
+Optional\. The default is `TRUE`\. When set to `TRUE`, the SerDe converts all uppercase keys to lowercase\. Using `WITH SERDEPROPERTIES ("case.insensitive" = "FALSE")` allows you to use case\-sensitive key names in your data\. For every key that is not already all\-lowercase, you must also provide a mapping from the column name to the property name, e\.g\. `WITH SERDEPROPERTIES ("case.insensitive" = "FALSE", "mapping.userid" = "userId")`\. If you have two keys that are the same when lower cased, you can use this property to map them to different names, e\.g\. `WITH SERDEPROPERTIES ("case.insensitive" = "FALSE", "mapping.url1" = "URL", "mapping.url2" = "Url")`\.
 
 **ColumnToJsonKeyMappings**
 Optional\. Maps column names to JSON keys that aren't identical to the column names\. This is useful when the JSON data contains keys that are [keywords](reserved-words.md)\. For example, if you have a JSON key named `timestamp`, set this parameter to `{"ts": "timestamp"}` to map this key to a column named `ts`\. This parameter takes values of type string\. It uses the following key pattern: `^\S+$` and the following value pattern: `^(?!\s*$).+`
@@ -170,4 +170,4 @@ CREATE external TABLE complex_json (
 )
 ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
 LOCATION 's3://mybucket/myjsondata/';
-```
\ No newline at end of file
+```
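
Note for reviewers: the new `case.insensitive` text gives the SerDe properties only as inline fragments. A complete table definition using them might look like the sketch below. The table name `example_events` and the columns `userid`, `url1`, and `url2` are hypothetical and chosen only for illustration. The SerDe class and S3 location are copied from the existing example in `json.md`, and the property names and values are the ones used in this patch. This has not been run against Athena.

```sql
-- Hypothetical table: example_events and its columns are illustrative only.
CREATE EXTERNAL TABLE example_events (
  userid string,  -- mapped to the JSON key "userId"
  url1   string,  -- mapped to the JSON key "URL"
  url2   string   -- mapped to the JSON key "Url"
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  -- Preserve key casing so "URL" and "Url" do not collide as "url".
  "case.insensitive" = "FALSE",
  -- Every key that is not all-lowercase needs an explicit mapping.
  "mapping.userid" = "userId",
  "mapping.url1" = "URL",
  "mapping.url2" = "Url"
)
LOCATION 's3://mybucket/myjsondata/';
```

Per the commit message, leaving out the `mapping.*` entries while `case.insensitive` is `FALSE` would be expected to yield NULL in every column whose underlying key contains uppercase characters.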