Commit
Merge pull request #17 from bxparks/develop
0.3: support quoted values of BOOLEAN, INTEGER and FLOAT types
bxparks committed Dec 17, 2018
2 parents a3dd1c1 + 1678335 commit fbd71cb
Showing 7 changed files with 396 additions and 60 deletions.
44 changes: 26 additions & 18 deletions CHANGELOG.md
@@ -1,29 +1,37 @@
# Changelog

* 0.3 (2018-12-17)
    * Tighten TIMESTAMP and DATE validation (thanks jtschichold@).
    * Inspect the internals of STRING values to infer BOOLEAN, INTEGER or FLOAT
      types (thanks jtschichold@).
    * Handle conversion of these string types when mixed with their non-quoted
      equivalents, matching the conversion logic followed by 'bq load'.
* 0.2.1 (2018-07-18)
    * Add `anonymizer.py` script to create anonymized data files for
      benchmarking.
    * Add benchmark numbers to README.md.
    * Add `DEVELOPER.md` file to record how to upload to PyPI.
    * Fix some minor warnings from pylint3.
* 0.2.0 (2018-02-10)
    * Add support for `DATE` and `TIME` types.
    * Update type conversion rules to be more compatible with **bq load**.
    * Allow `DATE`, `TIME` and `TIMESTAMP` to gracefully degrade to `STRING`.
    * Allow type conversions of elements within arrays (e.g. array of
      `INTEGER` and `FLOAT`, or array of mixed `DATE`, `TIME`, or
      `TIMESTAMP` elements).
    * Better detection of invalid values (e.g. arrays of arrays).
* 0.1.6 (2018-01-26)
    * Pass along command line arguments to `generate-schema`.
* 0.1.5 (2018-01-25)
    * Updated installation instructions for MacOS.
* 0.1.4 (2018-01-23)
    * Attempt #3 to fix exception during pip3 install.
* 0.1.3 (2018-01-23)
    * Attempt #2 to fix exception during pip3 install.
* 0.1.2 (2018-01-23)
    * Attempt to fix exception during pip3 install. Didn't work. Pulled.
* 0.1.1 (2018-01-03)
    * Install `generate-schema` script in `/usr/local/bin`.
* 0.1 (2018-01-02)
    * Initial release to PyPI.
50 changes: 35 additions & 15 deletions README.md
@@ -10,7 +10,7 @@ Usage:
$ generate-schema < file.data.json > file.schema.json
```

Version: 0.3 (2018-12-17)

## Background

@@ -109,9 +109,9 @@ This is essentially what the `generate-schema` command does.

**3) Python script**

If you retrieved this code from its
[GitHub repository](https://github.com/bxparks/bigquery-schema-generator),
then you can invoke the Python script directly:
```
$ ./generate_schema.py < file.data.json > file.schema.json
```
@@ -121,21 +121,33 @@ $ ./generate_schema.py < file.data.json > file.schema.json
The resulting schema file can be given to the **bq load** command using the
`--schema` flag:
```
$ bq load --source_format NEWLINE_DELIMITED_JSON \
--ignore_unknown_values \
--schema file.schema.json \
mydataset.mytable \
file.data.json
```

where `mydataset.mytable` is the target table in BigQuery.

For debugging purposes, here is the equivalent `bq load` command using schema
autodetection:

```
$ bq load --source_format NEWLINE_DELIMITED_JSON \
--ignore_unknown_values \
    --autodetect \
mydataset.mytable \
file.data.json
```

A useful flag for `bq load` is `--ignore_unknown_values`, which causes `bq
load` to ignore fields in the input data which are not defined in the schema.
When `generate_schema.py` detects an inconsistency in the definition of a
particular field in the input data, it removes the field from the schema
definition. Without the `--ignore_unknown_values` flag, `bq load` fails when
the inconsistent data record is read. Another useful flag during development
and debugging is `--replace`, which replaces any existing BigQuery table.

After the BigQuery table is loaded, the schema can be retrieved using:

@@ -238,7 +250,7 @@ $ generate-schema --debugging_interval 50 < file.data.json > file.schema.json

Instead of printing out the BigQuery schema, the `--debugging_map` prints out
the bookkeeping metadata map which is used internally to keep track of the
various fields and their types that were inferred using the data file. This
flag is intended to be used for debugging.

@@ -282,7 +294,7 @@ compatibility rules implemented by **bq load**:
upgraded to a `FLOAT`
* the reverse does not happen, once a field is a `FLOAT`, it will remain a
`FLOAT`
* conflicting `TIME`, `DATE`, `TIMESTAMP` types upgrade to `STRING`
* if a field is determined to have one type of "time" in one record, then
subsequently a different "time" type, then the field will be assigned a
`STRING` type
@@ -299,6 +311,12 @@ compatibility rules implemented by **bq load**:
* we follow the same logic as **bq load** and always infer these as
`TIMESTAMP`

The BigQuery loader looks inside string values to determine if they are actually
BOOLEAN, INTEGER or FLOAT types instead. In other words, `"True"` is considered
a BOOLEAN type, `"1"` is considered an INTEGER type, and `"2.1"` is considered a
FLOAT type. Luigi Mori (jtschichold@) added additional logic to replicate the
type conversion logic used by `bq load` for these strings.
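As a standalone sketch of that string inspection (copying the matcher patterns from `generate_schema.py` shown in this commit, rather than importing the package), the quoted-type inference looks like this:

```python
import re

# Patterns copied from SchemaGenerator in this commit; Q* names mark
# "quoted" variants of the primitive types.
INTEGER_MATCHER = re.compile(r'^[-]?\d+$')
FLOAT_MATCHER = re.compile(r'^[-]?\d+\.\d+$')

def infer_quoted_type(value):
    """Classify a JSON string value the way the 0.3 generator does."""
    if INTEGER_MATCHER.match(value):
        return 'QINTEGER'   # quoted integer, e.g. "1"
    if FLOAT_MATCHER.match(value):
        return 'QFLOAT'     # quoted float, e.g. "2.1"
    if value.lower() in ('true', 'false'):
        return 'QBOOLEAN'   # quoted boolean, e.g. "True"
    return 'STRING'

print(infer_quoted_type('1'))     # QINTEGER
print(infer_quoted_type('2.1'))   # QFLOAT
print(infer_quoted_type('True'))  # QBOOLEAN
print(infer_quoted_type('abc'))   # STRING
```

The quoted types are later collapsed back to their unquoted equivalents when the final schema is emitted (see `flatten_schema_map` below).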

## Examples

Here is an example of a single JSON data record on the STDIN (the `^D` below
@@ -387,14 +405,16 @@ took 77s on a Dell Precision M4700 laptop with an Intel Core i7-3840QM CPU @
This project was initially developed on Ubuntu 17.04 using Python 3.5.3. I have
tested it on:

* Ubuntu 18.04, Python 3.6.7
* Ubuntu 17.10, Python 3.6.3
* Ubuntu 17.04, Python 3.5.3
* Ubuntu 16.04, Python 3.5.2
* MacOS 10.13.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)

## Authors

* Created by Brian T. Park (brian@xparks.net).
* Additional type inference logic by Luigi Mori (jtschichold@).

## License

102 changes: 86 additions & 16 deletions bigquery_schema_generator/generate_schema.py
@@ -50,11 +50,15 @@ class SchemaGenerator:
r'(([+-]\d{1,2}(:\d{1,2})?)|Z)?$')

# Detect a DATE field of the form YYYY-[M]M-[D]D.
DATE_MATCHER = re.compile(
r'^\d{4}-(?:[1-9]|0[1-9]|1[012])-(?:[1-9]|0[1-9]|[12][0-9]|3[01])$')

# Detect a TIME field of the form [H]H:[M]M:[S]S[.DDDDDD]
TIME_MATCHER = re.compile(r'^\d{1,2}:\d{1,2}:\d{1,2}(\.\d{1,6})?$')

INTEGER_MATCHER = re.compile(r'^[-]?\d+$')
FLOAT_MATCHER = re.compile(r'^[-]?\d+\.\d+$')
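The tightened `DATE_MATCHER` rejects out-of-range months and days that the old `\d{1,2}` pattern accepted, while still allowing single-digit months and days. A quick standalone sanity check (pattern copied from above):

```python
import re

# The tightened DATE pattern introduced in 0.3, copied from above.
DATE_MATCHER = re.compile(
    r'^\d{4}-(?:[1-9]|0[1-9]|1[012])-(?:[1-9]|0[1-9]|[12][0-9]|3[01])$')

print(bool(DATE_MATCHER.match('2018-12-17')))  # True
print(bool(DATE_MATCHER.match('2018-1-3')))    # True: [M]M and [D]D allowed
print(bool(DATE_MATCHER.match('2018-13-17')))  # False: month 13 rejected
print(bool(DATE_MATCHER.match('2018-12-32')))  # False: day 32 rejected
```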

def __init__(self,
keep_nulls=False,
debugging_interval=1000,
@@ -136,8 +140,8 @@ def deduce_schema_for_line(self, json_object, schema_map):
schema_entry = schema_map.get(key)
try:
new_schema_entry = self.get_schema_entry(key, value)
merged_schema_entry = self.merge_schema_entry(
schema_entry, new_schema_entry)
except Exception as e:
self.log_error(str(e))
continue
@@ -200,8 +204,8 @@ def merge_schema_entry(self, old_schema_entry, new_schema_entry):
elif old_mode == 'REPEATED' and new_mode == 'NULLABLE':
# TODO: Maybe remove this warning output. It was helpful during
# development, but maybe it's just natural.
self.log_error(
'Leaving schema for "%s" as REPEATED RECORD' % old_name)

# RECORD type needs a recursive merging of sub-fields. We merge into
# the 'old_schema_entry' which assumes that the 'old_schema_entry'
@@ -240,6 +244,8 @@ def get_schema_entry(self, key, value):
object, instead of a primitive.
"""
value_mode, value_type = self.infer_bigquery_type(value)

# yapf: disable
if value_type == 'RECORD':
# recursively figure out the RECORD
fields = OrderedDict()
@@ -284,6 +290,7 @@ def get_schema_entry(self, key, value):
('name', key),
('type', value_type),
]))])
# yapf: enable
return schema_entry

def infer_bigquery_type(self, node_value):
@@ -300,8 +307,8 @@ def infer_bigquery_type(self, node_value):
array_type = self.infer_array_type(node_value)
if not array_type:
raise Exception(
"All array elements must be the same compatible type: %s" %
node_value)

# Disallow array of special types (with '__' not supported).
# EXCEPTION: allow (REPEATED __empty_record) ([{}]) because it is
@@ -326,6 +333,12 @@ def infer_value_type(self, value):
return 'DATE'
elif self.TIME_MATCHER.match(value):
return 'TIME'
elif self.INTEGER_MATCHER.match(value):
return 'QINTEGER' # quoted integer
elif self.FLOAT_MATCHER.match(value):
return 'QFLOAT' # quoted float
elif value.lower() in ['true', 'false']:
return 'QBOOLEAN' # quoted boolean
else:
return 'STRING'
# Python 'bool' is a subclass of 'int' so we must check it first
@@ -403,27 +416,81 @@ def run(self):

def convert_type(atype, btype):
"""Return the compatible type between 'atype' and 'btype'. Return 'None'
if there is no compatible type. Type conversions (in order of precedence)
are:
* type + type => type
* [Q]BOOLEAN + [Q]BOOLEAN => BOOLEAN
* [Q]INTEGER + [Q]INTEGER => INTEGER
* [Q]FLOAT + [Q]FLOAT => FLOAT
* QINTEGER + QFLOAT => QFLOAT
* QFLOAT + QINTEGER => QFLOAT
* [Q]INTEGER + [Q]FLOAT => FLOAT (except QINTEGER + QFLOAT)
* [Q]FLOAT + [Q]INTEGER => FLOAT (except QFLOAT + QINTEGER)
* (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) +
(DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) => STRING
"""
# type + type => type
if atype == btype:
return atype

# [Q]BOOLEAN + [Q]BOOLEAN => BOOLEAN
if atype == 'BOOLEAN' and btype == 'QBOOLEAN':
return 'BOOLEAN'
if atype == 'QBOOLEAN' and btype == 'BOOLEAN':
return 'BOOLEAN'

# [Q]INTEGER + [Q]INTEGER => INTEGER
if atype == 'QINTEGER' and btype == 'INTEGER':
return 'INTEGER'
if atype == 'INTEGER' and btype == 'QINTEGER':
return 'INTEGER'

# [Q]FLOAT + [Q]FLOAT => FLOAT
if atype == 'QFLOAT' and btype == 'FLOAT':
return 'FLOAT'
if atype == 'FLOAT' and btype == 'QFLOAT':
return 'FLOAT'

# QINTEGER + QFLOAT => QFLOAT
if atype == 'QINTEGER' and btype == 'QFLOAT':
return 'QFLOAT'

# QFLOAT + QINTEGER => QFLOAT
if atype == 'QFLOAT' and btype == 'QINTEGER':
return 'QFLOAT'

# [Q]INTEGER + [Q]FLOAT => FLOAT (except QINTEGER + QFLOAT => QFLOAT)
if atype == 'INTEGER' and btype == 'FLOAT':
return 'FLOAT'
if atype == 'INTEGER' and btype == 'QFLOAT':
return 'FLOAT'
if atype == 'QINTEGER' and btype == 'FLOAT':
return 'FLOAT'

# [Q]FLOAT + [Q]INTEGER => FLOAT (except QFLOAT + QINTEGER => QFLOAT)
if atype == 'FLOAT' and btype == 'INTEGER':
return 'FLOAT'
if atype == 'FLOAT' and btype == 'QINTEGER':
return 'FLOAT'
if atype == 'QFLOAT' and btype == 'INTEGER':
return 'FLOAT'

# All remaining combination of:
# (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) +
# (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) => STRING
if is_string_type(atype) and is_string_type(btype):
return 'STRING'

return None


def is_string_type(thetype):
"""Returns true if the type is one of: STRING, TIMESTAMP, DATE, or
TIME."""
return thetype in [
'STRING', 'TIMESTAMP', 'DATE', 'TIME', 'QINTEGER', 'QFLOAT', 'QBOOLEAN'
]
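The precedence rules in the `convert_type()` docstring can be condensed into a table-driven sketch. This is a hypothetical restatement for illustration, not the module's actual if-chain implementation:

```python
# Hypothetical table-driven restatement of the convert_type() precedence
# rules documented above. The real 0.3 implementation uses explicit
# pairwise if-statements; the behavior sketched here should match it.
STRING_TYPES = {'STRING', 'TIMESTAMP', 'DATE', 'TIME',
                'QINTEGER', 'QFLOAT', 'QBOOLEAN'}

def convert_type(atype, btype):
    """Return the compatible type for the pair, or None if incompatible."""
    if atype == btype:
        return atype                       # type + type => type
    pair = {atype, btype}
    if pair == {'BOOLEAN', 'QBOOLEAN'}:
        return 'BOOLEAN'
    if pair == {'INTEGER', 'QINTEGER'}:
        return 'INTEGER'
    if pair == {'FLOAT', 'QFLOAT'}:
        return 'FLOAT'
    if pair == {'QINTEGER', 'QFLOAT'}:
        return 'QFLOAT'                    # both quoted: stay quoted
    if pair <= {'INTEGER', 'QINTEGER', 'FLOAT', 'QFLOAT'}:
        return 'FLOAT'                     # mixed integer/float widens
    if pair <= STRING_TYPES:
        return 'STRING'                    # string-like types degrade
    return None                            # e.g. RECORD vs INTEGER

print(convert_type('INTEGER', 'QFLOAT'))   # FLOAT
print(convert_type('QINTEGER', 'QFLOAT'))  # QFLOAT
print(convert_type('DATE', 'QBOOLEAN'))    # STRING
print(convert_type('RECORD', 'INTEGER'))   # None
```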


def flatten_schema_map(schema_map, keep_nulls=False):
@@ -433,8 +500,8 @@ def flatten_schema_map(schema_map, keep_nulls=False):
data.
"""
if not isinstance(schema_map, dict):
raise Exception(
"Unexpected type '%s' for schema_map" % type(schema_map))

# Build the BigQuery schema from the internal 'schema_map'.
schema = []
@@ -466,6 +533,8 @@ def flatten_schema_map(schema_map, keep_nulls=False):
else:
# Recursively flatten the sub-fields of a RECORD entry.
new_value = flatten_schema_map(value, keep_nulls)
elif key == 'type' and value in ['QINTEGER', 'QFLOAT', 'QBOOLEAN']:
new_value = value[1:]
else:
new_value = value
new_info[key] = new_value
@@ -510,7 +579,8 @@ def main():
default=1000)
parser.add_argument(
'--debugging_map',
help=
'Print the metadata schema_map instead of the schema for debugging',
action="store_true")
args = parser.parse_args()

1 change: 0 additions & 1 deletion scripts/generate-schema

This file was deleted.

11 changes: 8 additions & 3 deletions setup.py
@@ -14,13 +14,18 @@
long_description = 'BigQuery schema generator.'

setup(name='bigquery-schema-generator',
version='0.3',
description='BigQuery schema generator',
long_description=long_description,
url='https://github.com/bxparks/bigquery-schema-generator',
author='Brian T. Park',
author_email='brian@xparks.net',
license='Apache 2.0',
packages=['bigquery_schema_generator'],
python_requires='~=3.5',
entry_points={
'console_scripts': [
'generate-schema = bigquery_schema_generator.generate_schema:main'
]
}
)
