Merge pull request #88 from bxparks/develop
merge 1.5.1 into master
bxparks committed Dec 4, 2022
2 parents 2830dd0 + a564447 commit 2d983fa
Showing 11 changed files with 251 additions and 41 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pythonpackage.yml
@@ -19,7 +19,7 @@ jobs:
python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,6 +1,12 @@
# Changelog

* Unreleased
* 1.5.1 (2022-12-04)
    * Add `examples/*.py` to demonstrate how to use `SchemaGenerator` as a
      library.
    * Update README.md to state that `bq load --autodetect` uses the first
      500 records. Previously, it scanned only the first 100 records.
    * This is a maintenance release with no new features or bug fixes.
* 1.5 (2021-11-14)
    * Make the column order in the BQ schema file match the order of appearance
      in the JSON data file using the `--preserve_input_sort_order` flag.
2 changes: 1 addition & 1 deletion Makefile
@@ -6,7 +6,7 @@ tests:
python3 -m unittest

flake8:
flake8 bigquery_schema_generator tests \
flake8 bigquery_schema_generator tests examples \
--count \
--ignore W503 \
--show-source \
161 changes: 123 additions & 38 deletions README.md
@@ -4,17 +4,18 @@

This script generates the BigQuery schema from the newline-delimited data
records on the STDIN. The records can be in JSON format or CSV format. The
BigQuery data importer (`bq load`) uses only the first 100 lines when the schema
auto-detection feature is enabled. In contrast, this script uses all data
records to generate the schema.
BigQuery data importer (`bq load`) uses only the
[first 500 records](https://cloud.google.com/bigquery/docs/schema-detect)
when the schema auto-detection feature is enabled. In contrast, this script uses
all data records to generate the schema.
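
The difference between sampling and a full scan can be illustrated with a toy sketch. This is not the script's actual algorithm: the helpers `infer_type` and `infer_schema`, the type names, and the single INTEGER-to-FLOAT widening rule are all simplified assumptions made up for illustration.

```python
import json


def infer_type(value):
    """Map a JSON value to a coarse type name (toy rules, for illustration only)."""
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "STRING"


def infer_schema(lines, limit=None):
    """Infer {field: type} from newline-delimited JSON.

    If `limit` is given, only the first `limit` records are examined,
    which mimics a sampling-based detector.
    """
    schema = {}
    for i, line in enumerate(lines):
        if limit is not None and i >= limit:
            break
        for key, value in json.loads(line).items():
            new_type = infer_type(value)
            old_type = schema.get(key)
            if old_type is None:
                schema[key] = new_type
            elif {old_type, new_type} == {"INTEGER", "FLOAT"}:
                # Toy widening rule: mixing INTEGER and FLOAT yields FLOAT.
                schema[key] = "FLOAT"
    return schema


# 500 records that look like integers, then one late record that does not.
records = ['{"x": 1}'] * 500 + ['{"x": 1.5, "note": "late"}']

sampled = infer_schema(records, limit=500)  # sees only the first 500 records
full = infer_schema(records)                # sees every record

print(sampled)
print(full)
```

In this sketch the sampled pass never sees the last record, so it misses both the wider type for `x` and the extra `note` field.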

Usage:
```
$ generate-schema < file.data.json > file.schema.json
$ generate-schema --input_format csv < file.data.csv > file.schema.json
```

**Version**: 1.5 (2021-11-14)
**Version**: 1.5.1 (2022-12-04)

**Changelog**: [CHANGELOG.md](CHANGELOG.md)

@@ -24,6 +25,8 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
* [Installation](#Installation)
* [Ubuntu Linux](#UbuntuLinux)
* [MacOS](#MacOS)
* [MacOS 11 (Big Sur)](#MacOS11)
* [MacOS 10.14 (Mojave)](#MacOS1014)
* [Usage](#Usage)
* [Command Line](#CommandLine)
* [Schema Output](#SchemaOutput)
@@ -42,10 +45,11 @@ $ generate-schema --input_format csv < file.data.csv > file.schema.json
(`--preserve_input_sort_order`)](#PreserveInputSortOrder)
* [Using as a Library](#UsingAsLibrary)
* [`SchemaGenerator.run()`](#SchemaGeneratorRun)
* [`SchemaGenerator.deduce_schema()`](#SchemaGeneratorDeduceSchema)
* [`SchemaGenerator.deduce_schema()` with File](#SchemaGeneratorDeduceSchemaFromFile)
* [`SchemaGenerator.deduce_schema()` with Dict](#SchemaGeneratorDeduceSchemaFromDict)
* [Schema Types](#SchemaTypes)
* [Supported Types](#SupportedTypes)
* [Type Inferrence](#TypeInferrence)
* [Type Inference](#TypeInference)
* [Examples](#Examples)
* [Benchmarks](#Benchmarks)
* [System Requirements](#SystemRequirements)
@@ -66,7 +70,7 @@ schema can be defined manually or the schema can be
[auto-detected](https://cloud.google.com/bigquery/docs/schema-detect#auto-detect).

When the auto-detect feature is used, the BigQuery data importer examines only
the [first 100 records](https://cloud.google.com/bigquery/docs/schema-detect)
the [first 500 records](https://cloud.google.com/bigquery/docs/schema-detect)
of the input data. In many cases, this is sufficient
because the data records were dumped from another database and the exact schema
of the source table was known. However, for data extracted from a service
@@ -127,7 +131,7 @@ depending on how your Python environment is configured. See below for
some notes for Ubuntu Linux and MacOS.

<a name="UbuntuLinux"></a>
### Ubuntu Linux (18.04, 20.04)
### Ubuntu Linux (18.04, 20.04, 22.04)

After running `pip3 install bigquery_schema_generator`, the `generate-schema`
script may be installed in one of the following locations:
@@ -138,27 +142,59 @@ script may be installed in one of the following locations:
* `$HOME/.virtualenvs/{your_virtual_env}/bin/generate-schema`

<a name="MacOS"></a>
### MacOS (10.14 Mojave)
### MacOS

I don't use my Mac for software development these days, and I won't upgrade to
Catalina (10.15) or later, but here are some notes if they help.
I don't have any Macs which are able to run the latest macOS, and I don't use
them much for software development these days, but here are some notes if they
help.

If you installed Python from
[Python Releases for Mac OS X](https://www.python.org/downloads/mac-osx/),
then `/usr/local/bin/pip3` is a symlink to
`/Library/Frameworks/Python.framework/Versions/3.6/bin/pip3`. So
`generate-schema` is installed at
<a name="MacOS11"></a>
#### MacOS 11 (Big Sur)

I believe Big Sur comes preinstalled with Python 3.8. If you install
`bigquery_schema_generator` using:

```
$ pip3 install --user bigquery_schema_generator
```

then the `generate-schema` wrapper script will be installed at:

```
/Users/{your-login}/Library/Python/3.8/bin/generate-schema
```

<a name="MacOS1014"></a>
#### MacOS 10.14 (Mojave)

This MacOS version comes with Python 2.7 only. To install Python 3, you can
install using:

1) Downloading the [macOS installer directly from
Python.org](https://www.python.org/downloads/macos/).

The python3 binary will be located at `/usr/local/bin/python3`, and
`/usr/local/bin/pip3` is a symlink to
`/Library/Frameworks/Python.framework/Versions/3.6/bin/pip3`.

So running

```
$ pip3 install --user bigquery_schema_generator
```

will install `generate-schema` at
`/Library/Frameworks/Python.framework/Versions/3.6/bin/generate-schema`.

The Python installer updates `$HOME/.bash_profile` to add
`/Library/Frameworks/Python.framework/Versions/3.6/bin` to the `$PATH`
environment variable. So you should be able to run the `generate-schema`
command without typing in the full path.

You can install Python3 using
[Homebrew](https://docs.brew.sh/Homebrew-and-Python). In this environment, the
`generate-schema` script will probably be installed in `/usr/local/bin` but I'm
not completely certain.
2) Using [Homebrew](https://docs.brew.sh/Homebrew-and-Python).

In this environment, the `generate-schema` script will probably be installed in
`/usr/local/bin` but I'm not completely certain.
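
If the script does not seem to be on your `$PATH`, a small check like the following may help locate it. This is only a sketch: the helper name `find_generate_schema` is made up for illustration, and the fallback directories are an assumption that holds for typical Linux and macOS `pip3` installs, not a guarantee.

```python
import shutil
import site
import sysconfig
from pathlib import Path


def find_generate_schema():
    """Return the path of the `generate-schema` script if found, else None."""
    # First look through the directories listed in $PATH.
    found = shutil.which("generate-schema")
    if found:
        return found

    # Then try the interpreter's scripts directory and the per-user
    # `bin` directory (where `pip3 install --user` usually puts scripts
    # on Linux and macOS).
    candidates = [
        Path(sysconfig.get_path("scripts")),
        Path(site.getuserbase()) / "bin",
    ]
    for directory in candidates:
        candidate = directory / "generate-schema"
        if candidate.is_file():
            return str(candidate)
    return None


location = find_generate_schema()
print(location or "generate-schema not found; check your pip3 install location")
```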

<a name="Usage"></a>
## Usage
@@ -665,42 +701,56 @@ generator = SchemaGenerator(
    ignore_invalid_lines=ignore_invalid_lines,
    preserve_input_sort_order=preserve_input_sort_order,
)
generator.run(input_file=input_file, output_file=output_file)
FILENAME = "..."
with open(FILENAME) as input_file:
    generator.run(input_file=input_file, output_file=output_file)
```
The `input_format` is one of `json`, `csv`, and `dict` as described in the
[Input Format](#InputFormat) section above. The `input_file` must match the
format given by this parameter.
See the `TestSchemaGeneratorDeduce.test_run_with_input_and_output()` test
case in [examples/test_generate_schema.py](examples/test_generate_schema.py) for
an example of an `input_file` of type `json`.
See [generatorrun.py](examples/generatorrun.py) for an example.
<a name="SchemaGeneratorDeduceSchema"></a>
#### `SchemaGenerator.deduce_schema()`
<a name="SchemaGeneratorDeduceSchemaFromFile"></a>
#### `SchemaGenerator.deduce_schema()` from File
If you need to process the generated schema programmatically, use the
`deduce_schema()` method and process the resulting `schema_map` and `error_log`
data structures like this:
```python
import json
import logging
import sys
from bigquery_schema_generator.generate_schema import SchemaGenerator
...
FILENAME = "jsonfile.json"
generator = SchemaGenerator(
    ...(same as above)...
    input_format='json',
    quoted_values_are_strings=True,
)
with open(FILENAME) as file:
    schema_map, error_logs = generator.deduce_schema(file)
schema_map, error_logs = generator.deduce_schema(input_data=input_data)
# Print errors if desired.
for error in error_logs:
    logging.info("Problem on line %s: %s", error['line_number'], error['msg'])
schema = generator.flatten_schema(schema_map)
json.dump(schema, output_file, indent=2)
json.dump(schema, sys.stdout, indent=2)
print()
```
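
Note that Python's root logger defaults to the `WARNING` level, so `logging.info()` calls like the one above print nothing unless logging is configured first. The sketch below shows one way to do that. The `error_logs` list here is a hypothetical stand-in with the `line_number` and `msg` keys used above, not real output from `deduce_schema()`, and `format_error` is a helper invented for this example.

```python
import logging
import sys

# Without this call the root logger stays at WARNING and logging.info()
# messages are silently dropped.
logging.basicConfig(
    stream=sys.stderr,
    level=logging.INFO,
    format="%(levelname)s: %(message)s",
)

# Hypothetical sample standing in for the list returned by deduce_schema().
error_logs = [
    {"line_number": 3, "msg": "sample message for illustration"},
]


def format_error(error):
    """Render one error entry as a single human-readable line."""
    return "Problem on line %s: %s" % (error["line_number"], error["msg"])


for error in error_logs:
    logging.info(format_error(error))
```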
The `deduce_schema()` now supports starting from an existing `schema_map`
See [csvreader.py](examples/csvreader.py) and
[jsonreader.py](examples/jsonreader.py) for 2 examples.
The `deduce_schema()` also supports starting from an existing `schema_map`
instead of starting from scratch. This is the internal version of the
`--existing_schema_path` functionality.
@@ -714,9 +764,36 @@ schema_map2, error_logs = generator.deduce_schema(
The `input_data` must match the `input_format` given in the constructor. The
format is described in the [Input Format](#InputFormat) section above.
See the `TestSchemaGeneratorDeduce.test_deduce_schema_with_dict_input()` test
case in [examples/test_generate_schema.py](examples/test_generate_schema.py) for
an example of an `input_data` of type `dict`.
<a name="SchemaGeneratorDeduceSchemaFromDict"></a>
#### `SchemaGenerator.deduce_schema()` from Dict
If the JSON data set has already been read into memory as a list of Python
`dict` objects, the `SchemaGenerator` can process that too, like this:
```Python
import json
import logging
import sys
from bigquery_schema_generator.generate_schema import SchemaGenerator
generator = SchemaGenerator(input_format='dict')
input_data = [
    {
        's': 'string',
        'b': True,
    },
    {
        'd': '2021-08-18',
        'x': 3.1
    },
]
schema_map, error_logs = generator.deduce_schema(input_data)
schema = generator.flatten_schema(schema_map)
json.dump(schema, sys.stdout, indent=2)
print()
```
See [dictreader.py](examples/dictreader.py) for an example.
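
If the records start out as newline-delimited JSON text rather than Python objects, they can be converted to a list of `dict` first using only the standard library. The following is a minimal sketch: the helper name `read_records` is hypothetical, and the in-memory `io.StringIO` sample stands in for a real file.

```python
import io
import json


def read_records(lines):
    """Parse newline-delimited JSON into a list of dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Sample data; in practice `lines` could be an open text file.
sample = io.StringIO(
    '{"s": "string", "b": true}\n'
    '\n'
    '{"d": "2021-08-18", "x": 3.1}\n'
)
input_data = read_records(sample)
print(input_data)

# `input_data` now has the same shape as the list in the example above and
# could be passed to deduce_schema() on a generator created with
# input_format='dict'.
```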
<a name="SchemaTypes"></a>
## Schema Types
@@ -773,8 +850,8 @@ The following types are _not_ supported at all:
* `BYTES`
* `DATETIME` (unable to distinguish from `TIMESTAMP`)
<a name="TypeInferrence"></a>
### Type Inferrence Rules
<a name="TypeInference"></a>
### Type Inference Rules
The `generate-schema` script attempts to emulate the various type conversion and
compatibility rules implemented by **bq load**:
@@ -977,16 +1054,24 @@ now requires Python 3.6 or higher, I think mostly due to the use of f-strings.
I have tested it on:
* Ubuntu 22.04, Python 3.10.6
* Ubuntu 20.04, Python 3.8.5
* Ubuntu 18.04, Python 3.7.7
* Ubuntu 18.04, Python 3.6.7
* Ubuntu 17.10, Python 3.6.3
* MacOS 10.14.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)
* MacOS 10.13.2, [Python 3.6.4](https://www.python.org/downloads/release/python-364/)
* MacOS 11.7.1 (Big Sur), Python 3.8.9
* MacOS 10.14.2 (Mojave), Python 3.6.4
* MacOS 10.13.2 (High Sierra), Python 3.6.4
The GitHub Actions continuous integration pipeline validates on Python 3.6, 3.7
and 3.8.
The unit tests are invoked with the `$ make tests` target and depend only on the
built-in Python `unittest` package.
The coding style check is invoked using `$ make flake8` and depends on the
`flake8` package. It can be installed using `$ pip3 install --user flake8`.
<a name="License"></a>
## License
2 changes: 1 addition & 1 deletion bigquery_schema_generator/version.py
@@ -1 +1 @@
__version__ = '1.5'
__version__ = '1.5.1'
5 changes: 5 additions & 0 deletions examples/csvfile.csv
@@ -0,0 +1,5 @@
name,surname,age
John,Smith,23
Michael,Johnson,27
Maria,Smith,30
Joanna,Anders,21
37 changes: 37 additions & 0 deletions examples/csvreader.py
@@ -0,0 +1,37 @@
#!/usr/bin/env python3
#
# Example of using SchemaGenerator as a library instead of a command line
# script. Read the CSV file named 'csvfile.csv' in the current directory, deduce
# its schema, and print it out on the stdout.
#
# This is the equivalent of:
# $ generate-schema
# --input_format=csv
# --infer_mode
# --quoted_values_are_strings
# --sanitize_names
# < csvfile.csv

import json
import logging
import sys
from bigquery_schema_generator.generate_schema import SchemaGenerator

FILENAME = "csvfile.csv"

generator = SchemaGenerator(
    input_format='csv',
    infer_mode=True,
    quoted_values_are_strings=True,
    sanitize_names=True,
)

with open(FILENAME) as file:
    schema_map, errors = generator.deduce_schema(file)

for error in errors:
    logging.info("Problem on line %s: %s", error['line_number'], error['msg'])

schema = generator.flatten_schema(schema_map)
json.dump(schema, sys.stdout, indent=2)
print()
25 changes: 25 additions & 0 deletions examples/dictreader.py
@@ -0,0 +1,25 @@
#!/usr/bin/env python3
#
# Example of using SchemaGenerator programmatically instead of a command line
# script. This example consumes a JSON data set that has *already* been read
# into memory as a Python array of dict.

import json
import sys
from bigquery_schema_generator.generate_schema import SchemaGenerator

generator = SchemaGenerator(input_format='dict')
input_data = [
    {
        's': 'string',
        'b': True,
    },
    {
        'd': '2021-08-18',
        'x': 3.1
    },
]
schema_map, error_logs = generator.deduce_schema(input_data)
schema = generator.flatten_schema(schema_map)
json.dump(schema, sys.stdout, indent=2)
print()
