Merge pull request #62 from bxparks/develop
merge 1.3 into master
bxparks committed Dec 5, 2020
2 parents 0f63dd0 + e5a50af commit d5c3cd3
Showing 8 changed files with 1,474 additions and 165 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,11 @@
# Changelog

* Unreleased
* 1.3 (2020-12-05)
* Allow an existing schema file to be specified using
`--existing_schema_path` flag, so that new data can be merged into it.
See #40, #57, and #61.
(Thanks to abroglesc@ and bozzzzo@).
* 1.2 (2020-10-27)
* Print full path of nested JSON elements in error messages (See #52;
thanks abroglesc@).
2 changes: 1 addition & 1 deletion Makefile
@@ -6,7 +6,7 @@ tests:
python3 -m unittest

flake8:
flake8 bigquery_schema_generator \
flake8 bigquery_schema_generator tests \
--count \
--ignore W503 \
--show-source \
93 changes: 69 additions & 24 deletions README.md
@@ -12,7 +12,7 @@ $ generate-schema < file.data.json > file.schema.json
$ generate-schema --input_format csv < file.data.csv > file.schema.json
```

Version: 1.2 (2020-10-27)
Version: 1.3 (2020-12-05)

Changelog: [CHANGELOG.md](CHANGELOG.md)

@@ -235,13 +235,14 @@ as shown by the `--help` flag below.

Print the built-in help strings:

```
```bash
$ generate-schema --help
usage: generate_schema.py [-h] [--input_format INPUT_FORMAT] [--keep_nulls]
[--quoted_values_are_strings] [--infer_mode]
[--debugging_interval DEBUGGING_INTERVAL]
[--debugging_map] [--sanitize_names]
[--ignore_invalid_lines]
usage: generate-schema [-h] [--input_format INPUT_FORMAT] [--keep_nulls]
[--quoted_values_are_strings] [--infer_mode]
[--debugging_interval DEBUGGING_INTERVAL]
[--debugging_map] [--sanitize_names]
[--ignore_invalid_lines]
[--existing_schema_path EXISTING_SCHEMA_PATH]

Generate BigQuery schema from JSON or CSV file.

@@ -261,6 +262,10 @@ optional arguments:
standard
--ignore_invalid_lines
Ignore lines that cannot be parsed instead of stopping
--existing_schema_path EXISTING_SCHEMA_PATH
File that contains the existing BigQuery schema for a
table. This can be fetched with: `bq show --schema
<project_id>:<dataset>:<table_name>
```

#### Input Format (`--input_format`)
@@ -282,7 +287,7 @@ array or empty record as its value, the field is suppressed in the schema file.
This flag enables this field to be included in the schema file.

In other words, using a data file containing just nulls and empty values:
```
```bash
$ generate_schema
{ "s": null, "a": [], "m": {} }
^D
@@ -291,7 +296,7 @@ INFO:root:Processed 1 lines
```
With the `keep_nulls` flag, we get:
```
```bash
$ generate-schema --keep_nulls
{ "s": null, "a": [], "m": {} }
^D
@@ -331,7 +336,7 @@ consistent with the algorithm used by `bq load`. However, for the `BOOLEAN`,
normal strings instead. This flag disables type inference for `BOOLEAN`,
`INTEGER` and `FLOAT` types inside quoted strings.
```
```bash
$ generate-schema
{ "name": "1" }
^D
@@ -365,6 +370,12 @@ feature for JSON files, but too difficult to implement in practice because
fields are often completely missing from a given JSON record (instead of
explicitly being defined to be `null`).
In addition to the above, this option, when used in conjunction with
`--existing_schema_path`, allows fields to be relaxed from REQUIRED to
NULLABLE if they were REQUIRED in the existing schema and null values are found
in the new data that we are inferring a schema from. In this case it can be
used with either `--input_format`, CSV or JSON.
See [Issue #28](https://github.com/bxparks/bigquery-schema-generator/issues/28)
for implementation details.
@@ -374,7 +385,7 @@ By default, the `generate_schema.py` script prints a short progress message
every 1000 lines of input data. This interval can be changed using the
`--debugging_interval` flag.
```
```bash
$ generate-schema --debugging_interval 50 < file.data.json > file.schema.json
```
@@ -385,7 +396,7 @@ the bookkeeping metadata map which is used internally to keep track of the
various fields and their types that were inferred using the data file. This
flag is intended to be used for debugging.
```
```bash
$ generate-schema --debugging_map < file.data.json > file.schema.json
```
Expand All @@ -411,9 +422,9 @@ generate the schema file. The transformations are:
My recollection is that the `bq load` command does *not* normalize the JSON key
names. Instead it prints an error message. So the `--sanitize_names` flag is
useful mostly for CSV files. For JSON files, you'll have to do a second pass
through the data files to cleanup the column names anyway. See [Issue
#14](https://github.com/bxparks/bigquery-schema-generator/issues/14) and [Issue
#33](https://github.com/bxparks/bigquery-schema-generator/issues/33).
through the data files to clean up the column names anyway. See
[Issue #14](https://github.com/bxparks/bigquery-schema-generator/issues/14) and
[Issue #33](https://github.com/bxparks/bigquery-schema-generator/issues/33).
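The kind of normalization involved can be sketched in a few lines of Python (a hypothetical `sanitize_name` helper, not the script's actual implementation; it assumes BigQuery's rule that column names may contain only letters, digits, and underscores, up to 128 characters):

```python
import re

def sanitize_name(name: str, max_length: int = 128) -> str:
    """Replace characters that BigQuery rejects in column names with
    underscores, and truncate to the maximum allowed length."""
    return re.sub(r'[^a-zA-Z0-9_]', '_', name)[:max_length]

print(sanitize_name('total price (usd)'))  # -> total_price__usd_
```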
#### Ignore Invalid Lines (`--ignore_invalid_lines`)
@@ -432,14 +443,46 @@ does throw an exception on a given line, we would not be able to catch it and
continue processing. Fortunately, CSV files are fairly robust, and the schema
deduction logic will handle any missing or extra columns gracefully.
Fixes [Issue
#49](https://github.com/bxparks/bigquery-schema-generator/issues/49).
Fixes
[Issue #49](https://github.com/bxparks/bigquery-schema-generator/issues/49).
#### Existing Schema Path (`--existing_schema_path`)
There are cases where we would like to start from an existing BigQuery table
schema rather than starting from scratch with a new batch of data that we would
like to load. In that case, we can specify the path to a local file on disk
that contains the existing BigQuery table schema. Such a file can be generated
with the following `bq show --schema` command:
```bash
bq show --schema <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME> > existing_table_schema.json
```
We can then run `generate-schema` with the additional option:
```bash
--existing_schema_path existing_table_schema.json
```
There is some subtle interaction between the `--existing_schema_path` and fields
which are marked with a `mode` of `REQUIRED` in the existing schema. If the new
data contains a `null` value (either in a CSV or JSON data file), it is not
clear if the schema should be changed to `mode=NULLABLE` or whether the new data
should be ignored and the schema should remain `mode=REQUIRED`. The choice is
determined by overloading the `--infer_mode` flag:
* If `--infer_mode` is given, the new schema is allowed to revert to
  `NULLABLE`.
* If `--infer_mode` is not given, the offending new record is ignored and the
  new schema remains `REQUIRED`.
See discussion in
[PR #57](https://github.com/bxparks/bigquery-schema-generator/pull/57) for
more details.
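The REQUIRED-to-NULLABLE decision described above can be sketched like this (a hypothetical helper illustrating the rule, not the actual implementation):

```python
def relax_required_fields(existing_schema, fields_with_nulls, infer_mode):
    """Return a copy of the schema in which REQUIRED fields that saw null
    values in the new data are relaxed to NULLABLE, but only when
    --infer_mode is in effect; otherwise they stay REQUIRED."""
    merged = []
    for field in existing_schema:
        field = dict(field)  # do not mutate the caller's schema
        if (field.get('mode') == 'REQUIRED'
                and field['name'] in fields_with_nulls
                and infer_mode):
            field['mode'] = 'NULLABLE'
        merged.append(field)
    return merged

existing = [{'name': 'id', 'type': 'INTEGER', 'mode': 'REQUIRED'}]
relax_required_fields(existing, {'id'}, infer_mode=True)
# -> [{'name': 'id', 'type': 'INTEGER', 'mode': 'NULLABLE'}]
```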
## Schema Types
### Supported Types
The **bq show --schema** command produces a JSON schema file that uses the
The `bq show --schema` command produces a JSON schema file that uses the
older [Legacy SQL date types](https://cloud.google.com/bigquery/data-types).
For compatibility, the **generate-schema** script will also generate a schema file
using the legacy data types.
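For reference, a few Standard SQL type names and the corresponding legacy names that appear in the generated JSON schema file (an illustrative subset, not an exhaustive table):

```python
# Standard SQL type name -> legacy type name used in JSON schema files.
LEGACY_TYPE_NAME = {
    'BOOL': 'BOOLEAN',
    'INT64': 'INTEGER',
    'FLOAT64': 'FLOAT',
    'STRUCT': 'RECORD',
    'STRING': 'STRING',
    'DATE': 'DATE',
    'TIMESTAMP': 'TIMESTAMP',
}
```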
@@ -534,7 +577,7 @@ compatibility rules implemented by **bq load**:
Here is an example of a single JSON data record on the STDIN (the `^D` below
means typing Control-D, which indicates "end of file" under Linux and MacOS):
```
```bash
$ generate-schema
{ "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
^D
@@ -569,7 +612,7 @@ INFO:root:Processed 1 lines
```
In most cases, the data file will be stored in a file:
```
```bash
$ cat > file.data.json
{ "a": [1, 2] }
{ "i": 3 }
@@ -596,7 +639,7 @@ $ cat file.schema.json
Here is the schema generated from a CSV input file. The first line is the header
containing the names of the columns, and the schema lists the columns in the
same order as the header:
```
```bash
$ generate-schema --input_format csv
e,b,c,d,a
1,x,true,,2.0
@@ -634,7 +677,7 @@ INFO:root:Processed 3 lines
```
Here is an example of the schema generated with the `--infer_mode` flag:
```
```bash
$ generate-schema --input_format csv --infer_mode
name,surname,age
John
@@ -701,15 +744,15 @@ json.dump(schema, output_file, indent=2)
I wrote the `bigquery_schema_generator/anonymize.py` script to create an
anonymized data file `tests/testdata/anon1.data.json.gz`:
```
```bash
$ ./bigquery_schema_generator/anonymize.py < original.data.json \
> anon1.data.json
$ gzip anon1.data.json
```
This data file is 290MB (5.6MB compressed) with 103080 data records.
Generating the schema using
```
```bash
$ bigquery_schema_generator/generate_schema.py < anon1.data.json \
> anon1.schema.json
```
@@ -748,6 +791,8 @@ and 3.8.
* Bug fix in `--sanitize_names` by Riccardo M. Cefala (riccardomc@).
* Print full path of nested JSON elements in error messages, by Austin Brogle
(abroglesc@).
* Allow an existing schema file to be specified using `--existing_schema_path`,
by Austin Brogle (abroglesc@) and Bozo Dragojevic (bozzzzo@).
## License
