Commit
Merge pull request #17 from bxparks/develop
0.3: support quoted values of BOOLEAN, INTEGER and FLOAT types
bxparks committed Dec 17, 2018
2 parents a3dd1c1 + 1678335 commit fbd71cb
Showing 7 changed files with 396 additions and 60 deletions.
44 changes: 26 additions & 18 deletions CHANGELOG.md
@@ -1,29 +1,37 @@
# Changelog

* 0.3 (2018-12-17)
    * Tighten TIMESTAMP and DATE validation (thanks jtschichold@).
    * Inspect the internals of STRING values to infer BOOLEAN, INTEGER or FLOAT
      types (thanks jtschichold@).
    * Handle conversion of these string types when mixed with their non-quoted
      equivalents, matching the conversion logic followed by 'bq load'.
* 0.2.1 (2018-07-18)
    * Add `anonymizer.py` script to create anonymized data files for
      benchmarking.
    * Add benchmark numbers to README.md.
    * Add `DEVELOPER.md` file to record how to upload to PyPI.
    * Fix some minor warnings from pylint3.
* 0.2.0 (2018-02-10)
    * Add support for `DATE` and `TIME` types.
    * Update type conversion rules to be more compatible with **bq load**.
    * Allow `DATE`, `TIME` and `TIMESTAMP` to gracefully degrade to `STRING`.
    * Allow type conversions of elements within arrays (e.g. array of
      `INTEGER` and `FLOAT`, or array of mixed `DATE`, `TIME`, or
      `TIMESTAMP` elements).
    * Better detection of invalid values (e.g. arrays of arrays).
* 0.1.6 (2018-01-26)
    * Pass along command line arguments to `generate-schema`.
* 0.1.5 (2018-01-25)
    * Updated installation instructions for MacOS.
* 0.1.4 (2018-01-23)
    * Attempt #3 to fix exception during pip3 install.
* 0.1.3 (2018-01-23)
    * Attempt #2 to fix exception during pip3 install.
* 0.1.2 (2018-01-23)
    * Attempt to fix exception during pip3 install. Didn't work. Pulled.
* 0.1.1 (2018-01-03)
    * Install `generate-schema` script in `/usr/local/bin`.
* 0.1 (2018-01-02)
    * Initial release to PyPI.
50 changes: 35 additions & 15 deletions README.md
@@ -10,7 +10,7 @@ Usage:
$ generate-schema < file.data.json > file.schema.json
```

Version: 0.3 (2018-12-17)

## Background

@@ -109,9 +109,9 @@ This is essentially what the `generate-schema` command does.

**3) Python script**

If you retrieved this code from its
[GitHub repository](https://github.com/bxparks/bigquery-schema-generator),
then you can invoke the Python script directly:
```
$ ./generate_schema.py < file.data.json > file.schema.json
```
@@ -121,21 +121,33 @@ $ ./generate_schema.py < file.data.json > file.schema.json
The resulting schema file can be given to the **bq load** command using the
`--schema` flag:
```
$ bq load --source_format NEWLINE_DELIMITED_JSON \
--ignore_unknown_values \
--schema file.schema.json \
mydataset.mytable \
file.data.json
```

where `mydataset.mytable` is the target table in BigQuery.

For debugging purposes, here is the equivalent `bq load` command using schema
autodetection:

```
$ bq load --source_format NEWLINE_DELIMITED_JSON \
--ignore_unknown_values \
    --autodetect \
mydataset.mytable \
file.data.json
```

A useful flag for `bq load` is `--ignore_unknown_values`, which causes `bq
load` to ignore fields in the input data which are not defined in the schema.
When `generate_schema.py` detects an inconsistency in the definition of a
particular field in the input data, it removes the field from the schema
definition. Without the `--ignore_unknown_values` flag, `bq load` fails when
the inconsistent data record is read. Another useful flag during development
and debugging is `--replace`, which replaces any existing BigQuery table.

After the BigQuery table is loaded, the schema can be retrieved using:

@@ -238,7 +250,7 @@ $ generate-schema --debugging_interval 50 < file.data.json > file.schema.json

Instead of printing out the BigQuery schema, the `--debugging_map` prints out
the bookkeeping metadata map which is used internally to keep track of the
various fields and their types that were inferred using the data file. This
flag is intended to be used for debugging.

@@ -282,7 +294,7 @@ compatibility rules implemented by **bq load**:
upgraded to a `FLOAT`
* the reverse does not happen, once a field is a `FLOAT`, it will remain a
`FLOAT`
* conflicting `TIME`, `DATE`, `TIMESTAMP` types upgrade to `STRING`
* if a field is determined to have one type of "time" in one record, then
subsequently a different "time" type, then the field will be assigned a
`STRING` type
@@ -299,6 +311,12 @@ compatibility rules implemented by **bq load**:
* we follow the same logic as **bq load** and always infer these as
`TIMESTAMP`

The BigQuery loader looks inside string values to determine if they are actually
BOOLEAN, INTEGER or FLOAT types instead. In other words, `"True"` is considered
a BOOLEAN type, `"1"` is considered an INTEGER type, and `"2.1"` is considered a
FLOAT type. Luigi Mori (jtschichold@) added additional logic to replicate the
type conversion logic used by `bq load` for these strings.
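As a standalone sketch of that string inspection (copying the matcher patterns from `generate_schema.py` shown in this commit, rather than importing the package), the quoted-type inference looks like this:

```python
import re

# Patterns copied from SchemaGenerator in this commit; Q* names mark
# "quoted" variants of the primitive types.
INTEGER_MATCHER = re.compile(r'^[-]?\d+$')
FLOAT_MATCHER = re.compile(r'^[-]?\d+\.\d+$')

def infer_quoted_type(value):
    """Classify a JSON string value the way the 0.3 generator does."""
    if INTEGER_MATCHER.match(value):
        return 'QINTEGER'   # quoted integer, e.g. "1"
    if FLOAT_MATCHER.match(value):
        return 'QFLOAT'     # quoted float, e.g. "2.1"
    if value.lower() in ('true', 'false'):
        return 'QBOOLEAN'   # quoted boolean, e.g. "True"
    return 'STRING'

print(infer_quoted_type('1'))     # QINTEGER
print(infer_quoted_type('2.1'))   # QFLOAT
print(infer_quoted_type('True'))  # QBOOLEAN
print(infer_quoted_type('abc'))   # STRING
```

The quoted types are later collapsed back to their unquoted equivalents when the final schema is emitted (see `flatten_schema_map` below).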

## Examples

Here is an example of a single JSON data record on the STDIN (the `^D` below
@@ -387,14 +405,16 @@ took 77s on a Dell Precision M4700 laptop with an Intel Core i7-3840QM CPU @
This project was initially developed on Ubuntu 17.04 using Python 3.5.3. I have
tested it on:

* Ubuntu 18.04, Python 3.6.7
* Ubuntu 17.10, Python 3.6.3
* Ubuntu 17.04, Python 3.5.3
* Ubuntu 16.04, Python 3.5.2
* MacOS 10.13.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)

## Authors

* Created by Brian T. Park (brian@xparks.net).
* Additional type inference logic by Luigi Mori (jtschichold@).

## License

102 changes: 86 additions & 16 deletions bigquery_schema_generator/generate_schema.py
@@ -50,11 +50,15 @@ class SchemaGenerator:
r'(([+-]\d{1,2}(:\d{1,2})?)|Z)?$')

# Detect a DATE field of the form YYYY-[M]M-[D]D.
DATE_MATCHER = re.compile(
r'^\d{4}-(?:[1-9]|0[1-9]|1[012])-(?:[1-9]|0[1-9]|[12][0-9]|3[01])$')

# Detect a TIME field of the form [H]H:[M]M:[S]S[.DDDDDD]
TIME_MATCHER = re.compile(r'^\d{1,2}:\d{1,2}:\d{1,2}(\.\d{1,6})?$')

INTEGER_MATCHER = re.compile(r'^[-]?\d+$')
FLOAT_MATCHER = re.compile(r'^[-]?\d+\.\d+$')
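The tightened `DATE_MATCHER` rejects out-of-range months and days that the old `\d{1,2}` pattern accepted, while still allowing single-digit months and days. A quick standalone sanity check (pattern copied from above):

```python
import re

# The tightened DATE pattern introduced in 0.3, copied from above.
DATE_MATCHER = re.compile(
    r'^\d{4}-(?:[1-9]|0[1-9]|1[012])-(?:[1-9]|0[1-9]|[12][0-9]|3[01])$')

print(bool(DATE_MATCHER.match('2018-12-17')))  # True
print(bool(DATE_MATCHER.match('2018-1-3')))    # True: [M]M and [D]D allowed
print(bool(DATE_MATCHER.match('2018-13-17')))  # False: month 13 rejected
print(bool(DATE_MATCHER.match('2018-12-32')))  # False: day 32 rejected
```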

def __init__(self,
keep_nulls=False,
debugging_interval=1000,
@@ -136,8 +140,8 @@ def deduce_schema_for_line(self, json_object, schema_map):
schema_entry = schema_map.get(key)
try:
new_schema_entry = self.get_schema_entry(key, value)
merged_schema_entry = self.merge_schema_entry(
schema_entry, new_schema_entry)
except Exception as e:
self.log_error(str(e))
continue
@@ -200,8 +204,8 @@ def merge_schema_entry(self, old_schema_entry, new_schema_entry):
elif old_mode == 'REPEATED' and new_mode == 'NULLABLE':
# TODO: Maybe remove this warning output. It was helpful during
# development, but maybe it's just natural.
self.log_error(
'Leaving schema for "%s" as REPEATED RECORD' % old_name)

# RECORD type needs a recursive merging of sub-fields. We merge into
# the 'old_schema_entry' which assumes that the 'old_schema_entry'
@@ -240,6 +244,8 @@ def get_schema_entry(self, key, value):
object, instead of a primitive.
"""
value_mode, value_type = self.infer_bigquery_type(value)

# yapf: disable
if value_type == 'RECORD':
# recursively figure out the RECORD
fields = OrderedDict()
@@ -284,6 +290,7 @@ def get_schema_entry(self, key, value):
('name', key),
('type', value_type),
]))])
# yapf: enable
return schema_entry

def infer_bigquery_type(self, node_value):
@@ -300,8 +307,8 @@ def infer_bigquery_type(self, node_value):
array_type = self.infer_array_type(node_value)
if not array_type:
raise Exception(
"All array elements must be the same compatible type: %s" %
node_value)

# Disallow array of special types (with '__' not supported).
# EXCEPTION: allow (REPEATED __empty_record) ([{}]) because it is
@@ -326,6 +333,12 @@ def infer_value_type(self, value):
return 'DATE'
elif self.TIME_MATCHER.match(value):
return 'TIME'
elif self.INTEGER_MATCHER.match(value):
return 'QINTEGER' # quoted integer
elif self.FLOAT_MATCHER.match(value):
return 'QFLOAT' # quoted float
elif value.lower() in ['true', 'false']:
return 'QBOOLEAN' # quoted boolean
else:
return 'STRING'
# Python 'bool' is a subclass of 'int' so we must check it first
@@ -403,27 +416,81 @@ def run(self):

def convert_type(atype, btype):
"""Return the compatible type between 'atype' and 'btype'. Return 'None'
if there is no compatible type. Type conversions (in order of precedence)
are:
* type + type => type
* [Q]BOOLEAN + [Q]BOOLEAN => BOOLEAN
* [Q]INTEGER + [Q]INTEGER => INTEGER
* [Q]FLOAT + [Q]FLOAT => FLOAT
* QINTEGER + QFLOAT => QFLOAT
* QFLOAT + QINTEGER => QFLOAT
* [Q]INTEGER + [Q]FLOAT => FLOAT (except QINTEGER + QFLOAT)
* [Q]FLOAT + [Q]INTEGER => FLOAT (except QFLOAT + QINTEGER)
* (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) +
(DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) => STRING
"""
# type + type => type
if atype == btype:
return atype

# [Q]BOOLEAN + [Q]BOOLEAN => BOOLEAN
if atype == 'BOOLEAN' and btype == 'QBOOLEAN':
return 'BOOLEAN'
if atype == 'QBOOLEAN' and btype == 'BOOLEAN':
return 'BOOLEAN'

# [Q]INTEGER + [Q]INTEGER => INTEGER
if atype == 'QINTEGER' and btype == 'INTEGER':
return 'INTEGER'
if atype == 'INTEGER' and btype == 'QINTEGER':
return 'INTEGER'

# [Q]FLOAT + [Q]FLOAT => FLOAT
if atype == 'QFLOAT' and btype == 'FLOAT':
return 'FLOAT'
if atype == 'FLOAT' and btype == 'QFLOAT':
return 'FLOAT'

# QINTEGER + QFLOAT => QFLOAT
if atype == 'QINTEGER' and btype == 'QFLOAT':
return 'QFLOAT'

# QFLOAT + QINTEGER => QFLOAT
if atype == 'QFLOAT' and btype == 'QINTEGER':
return 'QFLOAT'

# [Q]INTEGER + [Q]FLOAT => FLOAT (except QINTEGER + QFLOAT => QFLOAT)
if atype == 'INTEGER' and btype == 'FLOAT':
return 'FLOAT'
if atype == 'INTEGER' and btype == 'QFLOAT':
return 'FLOAT'
if atype == 'QINTEGER' and btype == 'FLOAT':
return 'FLOAT'

# [Q]FLOAT + [Q]INTEGER => FLOAT (except QFLOAT + QINTEGER => QFLOAT)
if atype == 'FLOAT' and btype == 'INTEGER':
return 'FLOAT'
if atype == 'FLOAT' and btype == 'QINTEGER':
return 'FLOAT'
if atype == 'QFLOAT' and btype == 'INTEGER':
return 'FLOAT'

# All remaining combination of:
# (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) +
# (DATE, TIME, TIMESTAMP, QBOOLEAN, QINTEGER, QFLOAT, STRING) => STRING
if is_string_type(atype) and is_string_type(btype):
return 'STRING'

return None


def is_string_type(thetype):
"""Returns true if the type is one of: STRING, TIMESTAMP, DATE, or
TIME."""
return thetype in [
'STRING', 'TIMESTAMP', 'DATE', 'TIME', 'QINTEGER', 'QFLOAT', 'QBOOLEAN'
]
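The precedence rules in the `convert_type()` docstring can be condensed into a table-driven sketch. This is a hypothetical restatement for illustration, not the module's actual if-chain implementation:

```python
# Hypothetical table-driven restatement of the convert_type() precedence
# rules documented above. The real 0.3 implementation uses explicit
# pairwise if-statements; the behavior sketched here should match it.
STRING_TYPES = {'STRING', 'TIMESTAMP', 'DATE', 'TIME',
                'QINTEGER', 'QFLOAT', 'QBOOLEAN'}

def convert_type(atype, btype):
    """Return the compatible type for the pair, or None if incompatible."""
    if atype == btype:
        return atype                       # type + type => type
    pair = {atype, btype}
    if pair == {'BOOLEAN', 'QBOOLEAN'}:
        return 'BOOLEAN'
    if pair == {'INTEGER', 'QINTEGER'}:
        return 'INTEGER'
    if pair == {'FLOAT', 'QFLOAT'}:
        return 'FLOAT'
    if pair == {'QINTEGER', 'QFLOAT'}:
        return 'QFLOAT'                    # both quoted: stay quoted
    if pair <= {'INTEGER', 'QINTEGER', 'FLOAT', 'QFLOAT'}:
        return 'FLOAT'                     # mixed integer/float widens
    if pair <= STRING_TYPES:
        return 'STRING'                    # string-like types degrade
    return None                            # e.g. RECORD vs INTEGER

print(convert_type('INTEGER', 'QFLOAT'))   # FLOAT
print(convert_type('QINTEGER', 'QFLOAT'))  # QFLOAT
print(convert_type('DATE', 'QBOOLEAN'))    # STRING
print(convert_type('RECORD', 'INTEGER'))   # None
```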


def flatten_schema_map(schema_map, keep_nulls=False):
@@ -433,8 +500,8 @@ def flatten_schema_map(schema_map, keep_nulls=False):
data.
"""
if not isinstance(schema_map, dict):
raise Exception(
"Unexpected type '%s' for schema_map" % type(schema_map))

# Build the BigQuery schema from the internal 'schema_map'.
schema = []
@@ -466,6 +533,8 @@ def flatten_schema_map(schema_map, keep_nulls=False):
else:
# Recursively flatten the sub-fields of a RECORD entry.
new_value = flatten_schema_map(value, keep_nulls)
elif key == 'type' and value in ['QINTEGER', 'QFLOAT', 'QBOOLEAN']:
new_value = value[1:]
else:
new_value = value
new_info[key] = new_value
@@ -510,7 +579,8 @@ def main():
default=1000)
parser.add_argument(
'--debugging_map',
help=
'Print the metadata schema_map instead of the schema for debugging',
action="store_true")
args = parser.parse_args()

1 change: 0 additions & 1 deletion scripts/generate-schema

This file was deleted.

11 changes: 8 additions & 3 deletions setup.py
@@ -14,13 +14,18 @@
long_description = 'BigQuery schema generator.'

setup(name='bigquery-schema-generator',
version='0.3',
description='BigQuery schema generator',
long_description=long_description,
url='https://github.com/bxparks/bigquery-schema-generator',
author='Brian T. Park',
author_email='brian@xparks.net',
license='Apache 2.0',
packages=['bigquery_schema_generator'],
python_requires='~=3.5',
entry_points={
'console_scripts': [
'generate-schema = bigquery_schema_generator.generate_schema:main'
]
}
)
