FIX: Nested Records of objects of objects.#110
Conversation
26bd2f4 to
7f8857e
Compare
tests/test_denest.py
Outdated
|
|
||
| root_table_checked = False | ||
| nested_table_checked = False | ||
| nested_nested_table_checked = False |
There was a problem hiding this comment.
Should we split up the test in three parts like this in the same test? We could place these three in separate test functions or just have one loop through denested in one test function in my opinion. What do you think is the way to go here?
There was a problem hiding this comment.
I like that idea. I'll refactor for that. My main concern originally with going that route, was that there'll be a modicum of "helper" code around these tests which will be obfuscating the actual test. Additionally, I was worried that previously we did really precise tests which looked for specific things and hence had specific schemas. The bug we're experiencing is because we didn't have a gnarlier schema and records to pick apart etc. As such, I'm going to plan on the three tests sharing the same schema and records.
That all sound right to you?
tests/test_denest.py
Outdated
|
|
||
| assert {} == table_batch['streamed_schema']['schema']['properties'] | ||
|
|
||
| assert 5 == len(table_batch['records']) |
There was a problem hiding this comment.
This is not really related to this PR:
When I was running the tests and looking at the printed denested data I started thinking about this line. Why is there empty dicts ({}) added to the records field in the _denest_record function? Even if the record doesn't contain any items we still add an empty dict. Is that desired?
We could change this with a simple check in _denest_record when appending denested records:
if denested_record != {}:
records_map[table_path].append(denested_record)If we want to change this we should do it in another PR. Also, this change makes some length asserts fail and that needs to be addressed as well in that case.
There was a problem hiding this comment.
Yeah, it's necessary. So if you get rid of that, then there's no way for our code to create a key for the denested values to relate back to.
ie, think of if you had a schema like this:
{"properties":
{"a": {"items": [...]},
"b": {"items": [...]}}}So a and b relate to each other in that they're on the same parent right? But when we denest, this'll turn into:
root: {}
root__a: {...}
root__b: {...}
Since we generate uuids for tables which don't have key_properties set, we can then relate the child tables and records back to each other in postgres.
If we get rid of the record, it indicates to everything else that there's just nothing there...
(now, mind you, the code above here where we're setting up the Singer Stream etc., actually injects a key_properties and uuids etc. so you can't ever have this case of empty tables. There'll always be the StitchData _sdc columns/fields/properties)
There was a problem hiding this comment.
Ah I see, thanks for the explanation! Makes perfect sense 👍
tests/test_denest.py
Outdated
| if ('a', 'b') == table_batch['streamed_schema']['path']: | ||
| nested_table_checked = True | ||
|
|
||
| assert {'type': 'integer'} == table_batch['streamed_schema']['schema']['properties'][('c', 'd')] |
There was a problem hiding this comment.
Needs to be assert {'type': ['integer']} == ... to match result from table_batch
tests/test_denest.py
Outdated
| assert 7 == len(table_batch['records']) | ||
|
|
||
| for record in table_batch['records']: | ||
| assert 'integer' == record[('c', 'd')][0] |
There was a problem hiding this comment.
Above this line I needed to add:
if record == {}:
continuedue to one empty dict. Accessing key ('c', 'd') will otherwise fail.
tests/test_denest.py
Outdated
| if ('a', 'b', 'c', 'e') == table_batch['streamed_schema']['path']: | ||
| nested_nested_table_checked = True | ||
|
|
||
| assert {'type': 'string'} == table_batch['streamed_schema']['schema']['properties'][('f',)] |
There was a problem hiding this comment.
Same goes for line under but instead: 'boolean' -> ['boolean']
tests/test_denest.py
Outdated
|
|
||
| for record in table_batch['records']: | ||
| assert 'string' == record[('f',)][1] | ||
| assert record[('g',)][1] |
There was a problem hiding this comment.
I would change this to:
for record in table_batch['records']:
assert 'string' == record[('f',)][0]
assert str == type(record[('f',)][1])
assert 'boolean' == record[('g',)][0]
assert bool == type(record[('g',)][1])There was a problem hiding this comment.
Just to have it consistent with what we have in nested_table_checked loop:
assert 'integer' == record[('c', 'd')][0]
assert int == type(record[('c', 'd')][1])
tests/test_denest.py
Outdated
| assert nested_nested_table_checked | ||
|
|
||
| for table_batch in denested: | ||
| assert [] == errors(table_batch) |
There was a problem hiding this comment.
At the end maybe we should add:
print('PASSED')
print()I have seen this in other test functions so to keep the consistency we should add that as well.
There was a problem hiding this comment.
Ah, I'd actually vote to remove it from the other tests. It's a relic from when we didn't have Pytest setup to be as helpful
CStejmar
left a comment
There was a problem hiding this comment.
With some of the changes I proposed in comments I get this to run and pass all tests with my fix in #111. If you fix those that are crucial we could merge this. I'll post my version of the test function below that I got it working with and you can check if I made any mistakes or errors. Great work with setting this test up so quickly @AlexanderMann! To me it seems to cover all complex nesting cases 👍. I tried it on master as well and there it failed with (as expected):
> assert {'type': ['integer']} == table_batch['streamed_schema']['schema']['properties'][('c', 'd')]
E KeyError: ('c', 'd')
tests/test_denest.py:245: KeyError
------------------------------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------------------------------
denested: [{'streamed_schema': {'type': 'TABLE_SCHEMA', 'path': (), 'level': None, 'key_properties': [], 'mappings': [], 'schema': {'type': 'object', 'additionalProperties': False, 'properties': {}}}, 'records': [{}, {}, {}, {}, {}]}, {'streamed_schema': {'type': 'TABLE_SCHEMA', 'path': ('a', 'b', 'c', 'e'), 'level': 1, 'key_properties': [], 'mappings': [], 'schema': {'type': 'object', 'additionalProperties': False, 'properties': {('f',): {'type': ['string']}, ('g',): {'type': ['boolean']}, ('_sdc_sequence',): {'type': ['null', 'integer']}, ('_sdc_level_0_id',): {'type': ['integer']}, ('_sdc_level_1_id',): {'type': ['integer']}}}}, 'records': []}, {'streamed_schema': {'type': 'TABLE_SCHEMA', 'path': ('a', 'b'), 'level': 0, 'key_properties': [], 'mappings': [], 'schema': {'type': 'object', 'additionalProperties': False, 'properties': {('a', 'b', 'c', 'd'): {'type': ['integer']}, ('_sdc_sequence',): {'type': ['null', 'integer']}, ('_sdc_level_0_id',): {'type': ['integer']}}}}, 'records': [{('c', 'd'): ('integer', 1)}, {('c', 'd'): ('integer', 12)}, {('c', 'd'): ('integer', 123)}, {('c', 'd'): ('integer', 1234)}, {('c', 'd'): ('integer', 12345)}, {('c', 'd'): ('integer', 123456)}, {}]}]
table batch properties: {('a', 'b', 'c', 'd'): {'type': ['integer']}, ('_sdc_sequence',): {'type': ['null', 'integer']}, ('_sdc_level_0_id',): {'type': ['integer']}}
=================================================================================== 1 failed, 72 passed in 54.83 seconds ===================================================================================
Modified test function
def test__records__nested():
denested = denest.to_table_batches({
"properties": {
"a": {"type": "object",
"properties": {
"b": {
"type": "array",
"items": {
"type": "object",
"properties": {
"c": {
"type": "object",
"properties": {
"d": {"type": "integer"},
"e": {"type": "array",
"items": {"type": "object",
"properties": {
"f": {"type": "string"},
"g": {"type": "boolean"}}}}}}}}}}}}},
[],
[{"a": {"b": []}},
{"a": {"b": [{"c": {"d": 1}}]}},
{"a": {"b": [{"c": {"d": 12}},
{"c": {"d": 123}}]}},
{"a": {"b": [{"c": {"d": 1234}},
{"c": {"d": 12345}},
{"c": {"d": 123456}}]}},
{"a": {"b": [{"c": {"e": [{"f": "hello",
"g": True},
{"f": "goodbye",
"g": True}]}}]}}])
print('denested:', denested)
assert 3 == len(denested)
root_table_checked = False
nested_table_checked = False
nested_nested_table_checked = False
for table_batch in denested:
if tuple() == table_batch['streamed_schema']['path']:
root_table_checked = True
assert {} == table_batch['streamed_schema']['schema']['properties']
assert 5 == len(table_batch['records'])
for record in table_batch['records']:
assert {} == record
for table_batch in denested:
if ('a', 'b') == table_batch['streamed_schema']['path']:
nested_table_checked = True
assert {'type': ['integer']} == table_batch['streamed_schema']['schema']['properties'][('c', 'd')]
assert 7 == len(table_batch['records'])
for record in table_batch['records']:
# Don't try to access key "('c', 'd')" if record is empty
if record == {}:
continue
assert 'integer' == record[('c', 'd')][0]
assert int == type(record[('c', 'd')][1])
for table_batch in denested:
if ('a', 'b', 'c', 'e') == table_batch['streamed_schema']['path']:
nested_nested_table_checked = True
assert {'type': ['string']} == table_batch['streamed_schema']['schema']['properties'][('f',)]
assert {'type': ['boolean']} == table_batch['streamed_schema']['schema']['properties'][('g',)]
assert 2 == len(table_batch['records'])
for record in table_batch['records']:
assert 'string' == record[('f',)][0]
assert str == type(record[('f',)][1])
assert 'boolean' == record[('g',)][0]
assert bool == type(record[('g',)][1])
assert root_table_checked
assert nested_table_checked
assert nested_nested_table_checked
for table_batch in denested:
assert [] == errors(table_batch)
print('PASSED')
print()| "g": True}, | ||
| {"f": "goodbye", | ||
| "g": True}]}}]}}]) | ||
| NESTED_SCHEMA = { |
| "f": {"type": "string"}, | ||
| "g": {"type": "boolean"}}}}}}}}}}}}} | ||
|
|
||
| NESTED_RECORDS = [{"a": {"b": []}}, |
|
|
||
|
|
||
| def test__records__nested__tables(): | ||
| denested = denest.to_table_batches(NESTED_SCHEMA, [], NESTED_RECORDS) |
tests/test_denest.py
Outdated
| def _get_table_batch_with_path(table_batches, path): | ||
| for table_batch in table_batches: | ||
| if path == table_batch['streamed_schema']['path']: | ||
| return path |
There was a problem hiding this comment.
Here we need to return table_batch instead of path otherwise we fail the tests on trying to access 'schema' etc. I know you meant do that since the function name is _get_table_batch_with_path :).
There was a problem hiding this comment.
🤦♂️ this has been a hilariously tough thing for me to do as I don't actually have the working code etc.
There was a problem hiding this comment.
Yeah I can imagine that 🙂. Anyway, great work! 🚀
CStejmar
left a comment
There was a problem hiding this comment.
Fix the problem in _get_table_batch_with_path that I commented on and we are good to go! I like this refactoring 👍. I will approve! All tests pass with my changes from PR #111 and fails on current master 👍 .
Here is what is failing without my denesting fix:
================================================================================================= FAILURES =================================================================================================
_________________________________________________________________________________ test__records__nested__child_table__a_b __________________________________________________________________________________
def test__records__nested__child_table__a_b():
denested = denest.to_table_batches(NESTED_SCHEMA, [], NESTED_RECORDS)
table_batch = _get_table_batch_with_path(denested,
('a', 'b'))
> assert {'type': ['integer']} == table_batch['streamed_schema']['schema']['properties'][('c', 'd')]
E KeyError: ('c', 'd')
tests/test_denest.py:255: KeyError
_______________________________________________________________________________ test__records__nested__child_table__a_b_c_e ________________________________________________________________________________
def test__records__nested__child_table__a_b_c_e():
denested = denest.to_table_batches(NESTED_SCHEMA, [], NESTED_RECORDS)
table_batch = _get_table_batch_with_path(denested,
('a', 'b', 'c', 'e'))
assert {'type': ['string']} == table_batch['streamed_schema']['schema']['properties'][('f',)]
assert {'type': ['boolean']} == table_batch['streamed_schema']['schema']['properties'][('g',)]
> assert 2 == len(table_batch['records'])
E assert 2 == 0
E + where 0 = len([])
tests/test_denest.py:275: AssertionError
=================================================================================== 2 failed, 74 passed in 53.54 seconds ===================================================================================
which is as expected. It passes test__records__nested__tables() and test__records__nested__root_empty() which is expected as well.
Motivation
#109
This PR exists to lay down test cases which our denesting should handle but currently doesn't.
Suggested Musical Pairing
Springtime in Milwaukee.