Source File: fix schema generation for json files containing an array #16772

davydov-d
Collaborator

@davydov-d davydov-d commented Sep 15, 2022

What

https://github.com/airbytehq/oncall/issues/547
When the schema of a JSON file is discovered and the root element of the file is an array, discovery returns the schema of the array itself rather than the schema of its elements.

How

Instead, return the schema of the dict elements inside the array.
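
A minimal sketch of the idea, not the connector's actual code: `unwrap_array_schema` is a hypothetical helper name, and the schema dicts below are hand-written for illustration rather than produced by the schema builder. When the root schema describes an array, the helper returns the schema of the elements; otherwise it returns the schema unchanged.

```python
def unwrap_array_schema(schema: dict) -> dict:
    """Return the element schema when the root schema describes an array.

    Hypothetical helper for illustration only.
    """
    if "items" in schema:
        # Root of the file is a list, e.g. [{...}, {...}]:
        # emit the schema of the elements, not of the list.
        return schema["items"]
    return schema


# Hand-written schema for a file whose root is [{"id": 1}, {"id": 2}]
array_schema = {
    "type": "array",
    "items": {"type": "object", "properties": {"id": {"type": "integer"}}},
}
# Hand-written schema for a file whose root is {"id": 1}
object_schema = {"type": "object", "properties": {"id": {"type": "integer"}}}

print(unwrap_array_schema(array_schema))
print(unwrap_array_schema(object_schema))
```

Both calls print the same object schema, which is the shape a stream is expected to expose.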

@github-actions github-actions bot added area/connectors Connector related issues area/documentation Improvements or additions to documentation labels Sep 15, 2022
@davydov-d
Collaborator Author

davydov-d commented Sep 15, 2022

/test connector=connectors/source-file

🕑 connectors/source-file https://github.com/airbytehq/airbyte/actions/runs/3060038973
✅ connectors/source-file https://github.com/airbytehq/airbyte/actions/runs/3060038973
Python tests coverage:

Name                      Stmts   Miss  Cover
---------------------------------------------
source_file/__init__.py       2      0   100%
source_file/client.py       275     42    85%
source_file/source.py        51     27    47%
---------------------------------------------
TOTAL                       328     69    79%
Name                      Stmts   Miss  Cover
---------------------------------------------
source_file/source.py        51      0   100%
source_file/__init__.py       2      0   100%
source_file/client.py       275     34    88%
---------------------------------------------
TOTAL                       328     34    90%
	 Name                                                 Stmts   Miss  Cover   Missing
	 ----------------------------------------------------------------------------------
	 source_acceptance_test/base.py                          10      4    60%   15-18
	 source_acceptance_test/config.py                        83      6    93%   78-80, 84-86
	 source_acceptance_test/conftest.py                     164    164     0%   6-282
	 source_acceptance_test/plugin.py                        48     48     0%   6-104
	 source_acceptance_test/tests/test_core.py              329    111    66%   39, 50-58, 63-70, 74-75, 79-80, 164, 202-219, 228-236, 240-245, 251, 284-289, 327-334, 374-376, 379, 439-448, 477-478, 484, 487, 520-530, 543-568, 573-577
	 source_acceptance_test/tests/test_full_refresh.py       52      2    96%   34, 65
	 source_acceptance_test/tests/test_incremental.py       121     25    79%   21-23, 29-31, 36-43, 48-61, 208-216
	 source_acceptance_test/utils/asserts.py                 37      2    95%   57-58
	 source_acceptance_test/utils/common.py                  77     17    78%   15-16, 24-30, 47-54, 64, 67
	 source_acceptance_test/utils/compare.py                 62     23    63%   21-51, 68, 97-99
	 source_acceptance_test/utils/connector_runner.py       110     48    56%   23-26, 32, 36, 39-64, 67-69, 72-74, 77-79, 82-84, 87-89, 92-110, 144-146
	 source_acceptance_test/utils/json_schema_helper.py     105     13    88%   30-31, 38, 41, 65-68, 96, 120, 190-192
	 ----------------------------------------------------------------------------------
	 TOTAL                                                 1325    463    65%

Build Passed

Test summary info:

=========================== short test summary info ============================
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/source_acceptance_test/plugin.py:60: Skipping TestIncremental.test_two_sequential_reads because not found in the config
=================== 26 passed, 1 skipped in 65.29s (0:01:05) ===================

if "items" in result:
    # this means we have a json list e.g. [{...}, {...}]
    # but need to emit schema of an inside dict
    result = result["items"]
Contributor

This looks good, but is there anything to do, or any other validation needed, if the json file is an array of anything other than objects?

For example, if the json file is just ["apple", "orange", "kiwi"] this is probably invalid. (maybe a validation in check should be added here, but feel free to make a different issue for this and tackle it separately)

Collaborator Author

seems to be valid

>>> import genson
>>> builder = genson.SchemaBuilder()
>>> builder.add_object(['a', 'b', 'c'])
>>> builder.to_schema()["items"]
{'type': 'string'}

Contributor

Yeah the schema is technically valid for this, but what I mean is this will also cause issues with normalization since we expect each stream to have an object with properties as its schema (in this case it will just be a string). We probably shouldn't accept json files that aren't objects or an array of objects. Does that make sense?
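
The check suggested here could look roughly like the sketch below. It is only an illustration of the rule "accept an object or an array of objects": `is_supported_json_root` is a hypothetical name, not a function in the connector, and treating an empty array as acceptable is an assumption.

```python
import json


def is_supported_json_root(data) -> bool:
    """Return True if parsed JSON is an object or an array of objects.

    Hypothetical helper for illustration only.
    """
    if isinstance(data, dict):
        return True
    if isinstance(data, list):
        # Assumption: an empty list passes, since all() of nothing is True.
        return all(isinstance(item, dict) for item in data)
    return False


print(is_supported_json_root(json.loads('[{"fruit": "apple"}, {"fruit": "kiwi"}]')))
print(is_supported_json_root(json.loads('["apple", "orange", "kiwi"]')))
```

The first call accepts the array of objects; the second rejects the array of strings, whose element schema would be a bare string with no properties.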

Collaborator Author

oh I see now
sure it makes sense, I will create a separate issue to work on this problem

@davydov-d
Collaborator Author

davydov-d commented Sep 19, 2022

/publish connector=connectors/source-file

🕑 Publishing the following connectors:
connectors/source-file
https://github.com/airbytehq/airbyte/actions/runs/3080911294


Connector Did it publish? Were definitions generated?
connectors/source-file

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@davydov-d
Collaborator Author

davydov-d commented Sep 19, 2022

/publish connector=connectors/source-file-secure

🕑 Publishing the following connectors:
connectors/source-file-secure
https://github.com/airbytehq/airbyte/actions/runs/3081050614


Connector Did it publish? Were definitions generated?
connectors/source-file-secure

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@davydov-d davydov-d merged commit d780141 into master Sep 19, 2022
@davydov-d davydov-d deleted the ddavydov/#547-oncall-source-file-fix-schema-generation-for-json branch September 19, 2022 09:18
robbinhan pushed a commit to robbinhan/airbyte that referenced this pull request Sep 29, 2022
…airbytehq#16772)

* airbytehq#547 oncall Source File: fix schema generation for json files containing arrays

* source file: upda changelog

* airbytehq#547 oncall: source file - upgrade source-file-secure

* auto-bump connector version [ci skip]

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/file connectors/source/file-secure
5 participants