Merge pull request #62 from bxparks/develop
merge 1.3 into master
bxparks committed Dec 5, 2020
2 parents 0f63dd0 + e5a50af commit d5c3cd3
Showing 8 changed files with 1,474 additions and 165 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,11 @@
# Changelog

* Unreleased
* 1.3 (2020-12-05)
* Allow an existing schema file to be specified using
`--existing_schema_path` flag, so that new data can be merged into it.
See #40, #57, and #61.
(Thanks to abroglesc@ and bozzzzo@).
* 1.2 (2020-10-27)
* Print full path of nested JSON elements in error messages (See #52;
thanks abroglesc@).
2 changes: 1 addition & 1 deletion Makefile
@@ -6,7 +6,7 @@ tests:
python3 -m unittest

flake8:
flake8 bigquery_schema_generator \
flake8 bigquery_schema_generator tests \
--count \
--ignore W503 \
--show-source \
93 changes: 69 additions & 24 deletions README.md
@@ -12,7 +12,7 @@ $ generate-schema < file.data.json > file.schema.json
$ generate-schema --input_format csv < file.data.csv > file.schema.json
```

Version: 1.2 (2020-10-27)
Version: 1.3 (2020-12-05)

Changelog: [CHANGELOG.md](CHANGELOG.md)

@@ -235,13 +235,14 @@ as shown by the `--help` flag below.

Print the built-in help strings:

```
```bash
$ generate-schema --help
usage: generate_schema.py [-h] [--input_format INPUT_FORMAT] [--keep_nulls]
[--quoted_values_are_strings] [--infer_mode]
[--debugging_interval DEBUGGING_INTERVAL]
[--debugging_map] [--sanitize_names]
[--ignore_invalid_lines]
usage: generate-schema [-h] [--input_format INPUT_FORMAT] [--keep_nulls]
[--quoted_values_are_strings] [--infer_mode]
[--debugging_interval DEBUGGING_INTERVAL]
[--debugging_map] [--sanitize_names]
[--ignore_invalid_lines]
[--existing_schema_path EXISTING_SCHEMA_PATH]

Generate BigQuery schema from JSON or CSV file.

@@ -261,6 +262,10 @@ optional arguments:
standard
--ignore_invalid_lines
Ignore lines that cannot be parsed instead of stopping
--existing_schema_path EXISTING_SCHEMA_PATH
File that contains the existing BigQuery schema for a
table. This can be fetched with: `bq show --schema
<project_id>:<dataset>:<table_name>
```

#### Input Format (`--input_format`)
@@ -282,7 +287,7 @@ array or empty record as its value, the field is suppressed in the schema file.
This flag enables this field to be included in the schema file.

In other words, using a data file containing just nulls and empty values:
```
```bash
$ generate_schema
{ "s": null, "a": [], "m": {} }
^D
@@ -291,7 +296,7 @@ INFO:root:Processed 1 lines
```
With the `keep_nulls` flag, we get:
```
```bash
$ generate-schema --keep_nulls
{ "s": null, "a": [], "m": {} }
^D
@@ -331,7 +336,7 @@ consistent with the algorithm used by `bq load`. However, for the `BOOLEAN`,
normal strings instead. This flag disables type inference for `BOOLEAN`,
`INTEGER` and `FLOAT` types inside quoted strings.
```
```bash
$ generate-schema
{ "name": "1" }
^D
@@ -365,6 +370,12 @@ feature for JSON files, but too difficult to implement in practice because
fields are often completely missing from a given JSON record (instead of
explicitly being defined to be `null`).
In addition to the above, this option, when used in conjunction with
`--existing_schema_path`, allows fields to be relaxed from REQUIRED to
NULLABLE if they were REQUIRED in the existing schema and null values are found
in the new data that we are inferring a schema from. In this case it can be
used with either `--input_format`, CSV or JSON.
See [Issue #28](https://github.com/bxparks/bigquery-schema-generator/issues/28)
for implementation details.
@@ -374,7 +385,7 @@ By default, the `generate_schema.py` script prints a short progress message
every 1000 lines of input data. This interval can be changed using the
`--debugging_interval` flag.
```
```bash
$ generate-schema --debugging_interval 50 < file.data.json > file.schema.json
```
@@ -385,7 +396,7 @@ the bookkeeping metadata map which is used internally to keep track of the
various fields and their types that were inferred using the data file. This
flag is intended to be used for debugging.
```
```bash
$ generate-schema --debugging_map < file.data.json > file.schema.json
```
Expand All @@ -411,9 +422,9 @@ generate the schema file. The transformations are:
My recollection is that the `bq load` command does *not* normalize the JSON key
names. Instead it prints an error message. So the `--sanitize_names` flag is
useful mostly for CSV files. For JSON files, you'll have to do a second pass
through the data files to cleanup the column names anyway. See [Issue
#14](https://github.com/bxparks/bigquery-schema-generator/issues/14) and [Issue
#33](https://github.com/bxparks/bigquery-schema-generator/issues/33).
through the data files to clean up the column names anyway. See
[Issue #14](https://github.com/bxparks/bigquery-schema-generator/issues/14) and
[Issue #33](https://github.com/bxparks/bigquery-schema-generator/issues/33).
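The kind of normalization involved can be sketched in a few lines of Python (a hypothetical `sanitize_name` helper, not the script's actual implementation; it assumes BigQuery's rule that column names may contain only letters, digits, and underscores, up to 128 characters):

```python
import re

def sanitize_name(name: str, max_length: int = 128) -> str:
    """Replace characters that BigQuery rejects in column names with
    underscores, and truncate to the maximum allowed length."""
    return re.sub(r'[^a-zA-Z0-9_]', '_', name)[:max_length]

print(sanitize_name('total price (usd)'))  # -> total_price__usd_
```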
#### Ignore Invalid Lines (`--ignore_invalid_lines`)
@@ -432,14 +443,46 @@ does throw an exception on a given line, we would not be able to catch it and
continue processing. Fortunately, CSV files are fairly robust, and the schema
deduction logic will handle any missing or extra columns gracefully.
Fixes [Issue
#49](https://github.com/bxparks/bigquery-schema-generator/issues/49).
Fixes
[Issue #49](https://github.com/bxparks/bigquery-schema-generator/issues/49).
#### Existing Schema Path (`--existing_schema_path`)
There are cases where we would like to start from an existing BigQuery table
schema rather than starting from scratch with a new batch of data that we would
like to load. In that case, we can specify the path to a local file on disk
that contains the existing BigQuery table schema. Such a file can be generated
with the following `bq show --schema` command:
```bash
bq show --schema <PROJECT_ID>:<DATASET_NAME>.<TABLE_NAME> > existing_table_schema.json
```
We can then run `generate-schema` with the additional option:
```bash
--existing_schema_path existing_table_schema.json
```
There is some subtle interaction between the `--existing_schema_path` and fields
which are marked with a `mode` of `REQUIRED` in the existing schema. If the new
data contains a `null` value (either in a CSV or JSON data file), it is not
clear if the schema should be changed to `mode=NULLABLE` or whether the new data
should be ignored and the schema should remain `mode=REQUIRED`. The choice is
determined by overloading the `--infer_mode` flag:
* If `--infer_mode` is given, the new schema is allowed to revert to
  `NULLABLE`.
* If `--infer_mode` is not given, the offending new record is ignored and the
  new schema remains `REQUIRED`.
See discussion in
[PR #57](https://github.com/bxparks/bigquery-schema-generator/pull/57) for
more details.
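The REQUIRED-to-NULLABLE decision described above can be sketched like this (a hypothetical helper illustrating the rule, not the actual implementation):

```python
def relax_required_fields(existing_schema, fields_with_nulls, infer_mode):
    """Return a copy of the schema in which REQUIRED fields that saw null
    values in the new data are relaxed to NULLABLE, but only when
    --infer_mode is in effect; otherwise they stay REQUIRED."""
    merged = []
    for field in existing_schema:
        field = dict(field)  # do not mutate the caller's schema
        if (field.get('mode') == 'REQUIRED'
                and field['name'] in fields_with_nulls
                and infer_mode):
            field['mode'] = 'NULLABLE'
        merged.append(field)
    return merged

existing = [{'name': 'id', 'type': 'INTEGER', 'mode': 'REQUIRED'}]
relax_required_fields(existing, {'id'}, infer_mode=True)
# -> [{'name': 'id', 'type': 'INTEGER', 'mode': 'NULLABLE'}]
```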
## Schema Types
### Supported Types
The **bq show --schema** command produces a JSON schema file that uses the
The `bq show --schema` command produces a JSON schema file that uses the
older [Legacy SQL date types](https://cloud.google.com/bigquery/data-types).
For compatibility, the **generate-schema** script will also generate a schema file
using the legacy data types.
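For reference, a few Standard SQL type names and the corresponding legacy names that appear in the generated JSON schema file (an illustrative subset, not an exhaustive table):

```python
# Standard SQL type name -> legacy type name used in JSON schema files.
LEGACY_TYPE_NAME = {
    'BOOL': 'BOOLEAN',
    'INT64': 'INTEGER',
    'FLOAT64': 'FLOAT',
    'STRUCT': 'RECORD',
    'STRING': 'STRING',
    'DATE': 'DATE',
    'TIMESTAMP': 'TIMESTAMP',
}
```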
@@ -534,7 +577,7 @@ compatibility rules implemented by **bq load**:
Here is an example of a single JSON data record on the STDIN (the `^D` below
means typing Control-D, which indicates "end of file" under Linux and MacOS):
```
```bash
$ generate-schema
{ "s": "string", "b": true, "i": 1, "x": 3.1, "t": "2017-05-22T17:10:00-07:00" }
^D
@@ -569,7 +612,7 @@ INFO:root:Processed 1 lines
```
In most cases, the data file will be stored in a file:
```
```bash
$ cat > file.data.json
{ "a": [1, 2] }
{ "i": 3 }
@@ -596,7 +639,7 @@ $ cat file.schema.json
Here is the schema generated from a CSV input file. The first line is the header
containing the names of the columns, and the schema lists the columns in the
same order as the header:
```
```bash
$ generate-schema --input_format csv
e,b,c,d,a
1,x,true,,2.0
@@ -634,7 +677,7 @@ INFO:root:Processed 3 lines
```
Here is an example of the schema generated with the `--infer_mode` flag:
```
```bash
$ generate-schema --input_format csv --infer_mode
name,surname,age
John
@@ -701,15 +744,15 @@ json.dump(schema, output_file, indent=2)
I wrote the `bigquery_schema_generator/anonymize.py` script to create an
anonymized data file `tests/testdata/anon1.data.json.gz`:
```
```bash
$ ./bigquery_schema_generator/anonymize.py < original.data.json \
> anon1.data.json
$ gzip anon1.data.json
```
This data file is 290MB (5.6MB compressed) with 103080 data records.
Generating the schema using
```
```bash
$ bigquery_schema_generator/generate_schema.py < anon1.data.json \
> anon1.schema.json
```
@@ -748,6 +791,8 @@ and 3.8.
* Bug fix in `--sanitize_names` by Riccardo M. Cefala (riccardomc@).
* Print full path of nested JSON elements in error messages, by Austin Brogle
(abroglesc@).
* Allow an existing schema file to be specified using `--existing_schema_path`,
by Austin Brogle (abroglesc@) and Bozo Dragojevic (bozzzzo@).
## License
