Skip to content

FIX: Nested Records of objects of objects.#110

Merged
AlexanderMann merged 4 commits intodatamill-co:masterfrom
AlexanderMann:fix/nested-records--array-object-object
Apr 11, 2019
Merged

FIX: Nested Records of objects of objects.#110
AlexanderMann merged 4 commits intodatamill-co:masterfrom
AlexanderMann:fix/nested-records--array-object-object

Conversation

@AlexanderMann
Copy link
Copy Markdown
Collaborator

@AlexanderMann AlexanderMann commented Apr 4, 2019

Motivation

#109

This PR exists to lay down test cases which our denesting should handle but currently doesn't.

Suggested Musical Pairing

Springtime in Milwaukee.

@AlexanderMann AlexanderMann force-pushed the fix/nested-records--array-object-object branch from 26bd2f4 to 7f8857e Compare April 8, 2019 20:34
@AlexanderMann AlexanderMann marked this pull request as ready for review April 8, 2019 20:36

root_table_checked = False
nested_table_checked = False
nested_nested_table_checked = False
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we split up the test in three parts like this in the same test? We could place these three in separate test functions or just have one loop through denested in one test function in my opinion. What do you think is the way to go here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that idea. I'll refactor for that. My main concern originally with going that route, was that there'll be a modicum of "helper" code around these tests which will be obfuscating the actual test. Additionally, I was worried that previously we did really precise tests which looked for specific things and hence had specific schemas. The bug we're experiencing is because we didn't have a gnarlier schema and records to pick apart etc. As such, I'm going to plan on the three tests sharing the same schema and records.

That all sound right to you?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds perfect!


assert {} == table_batch['streamed_schema']['schema']['properties']

assert 5 == len(table_batch['records'])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really related to this PR:
When I was running the tests and looking at the printed denested data I started thinking about this line. Why is there empty dicts ({}) added to the records field in the _denest_record function? Even if the record doesn't contain any items we still add an empty dict. Is that desired?

We could change this with a simple check in _denest_record when appending denested records:

if denested_record != {}:
        records_map[table_path].append(denested_record)

If we want to change this we should do it in another PR. Also, this change makes some length asserts fail and that needs to be addressed as well in that case.

Copy link
Copy Markdown
Collaborator Author

@AlexanderMann AlexanderMann Apr 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's necessary. So if you get rid of that, then there's no way for our code to create a key for the denested values to relate back to.

ie, think of if you had a schema like this:

{"properties":
  {"a": {"items": [...]},
   "b": {"items": [...]}}}

So a and b relate to each other in that they're on the same parent right? But when we denest, this'll turn into:

root: {}
root__a: {...}
root__b: {...}

Since we generate uuids for tables which don't have key_properties set, we can then relate the child tables and records back to each other in postgres.

If we get rid of the record, it indicates to everything else that there's just nothing there...

(now, mind you, the code above here where we're setting up the Singer Stream etc., actually injects a key_properties and uuids etc. so you can't ever have this case of empty tables. There'll always be the StitchData _sdc columns/fields/properties)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, thanks for the explanation! Makes perfect sense 👍

if ('a', 'b') == table_batch['streamed_schema']['path']:
nested_table_checked = True

assert {'type': 'integer'} == table_batch['streamed_schema']['schema']['properties'][('c', 'd')]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be assert {'type': ['integer']} == ... to match result from table_batch

assert 7 == len(table_batch['records'])

for record in table_batch['records']:
assert 'integer' == record[('c', 'd')][0]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above this line I needed to add:

if record == {}:
    continue

due to one empty dict. Accessing key ('c', 'd') will otherwise fail.

if ('a', 'b', 'c', 'e') == table_batch['streamed_schema']['path']:
nested_nested_table_checked = True

assert {'type': 'string'} == table_batch['streamed_schema']['schema']['properties'][('f',)]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'string' -> ['string']

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same goes for line under but instead: 'boolean' -> ['boolean']


for record in table_batch['records']:
assert 'string' == record[('f',)][1]
assert record[('g',)][1]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change this to:

for record in table_batch['records']:
    assert 'string' == record[('f',)][0]
    assert str == type(record[('f',)][1])

    assert 'boolean' == record[('g',)][0]
    assert bool == type(record[('g',)][1])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to have it consistent with what we have in nested_table_checked loop:

assert 'integer' == record[('c', 'd')][0]
assert int == type(record[('c', 'd')][1])

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call

assert nested_nested_table_checked

for table_batch in denested:
assert [] == errors(table_batch)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the end maybe we should add:

print('PASSED')
print()

I have seen this in other test functions so to keep the consistency we should add that as well.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I'd actually vote to remove it from the other tests. It's a relic from when we didn't have Pytest setup to be as helpful

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok great! Then we remove it :)

Copy link
Copy Markdown

@CStejmar CStejmar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some of the changes I proposed in comments I get this to run and pass all tests with my fix in #111. If you fix those that are crucial we could merge this. I'll post my version of the test function below that I got it working with and you can check if I made any mistakes or errors. Great work with setting this test up so quickly @AlexanderMann! To me it seems to cover all complex nesting cases 👍. I tried it on master as well and there it failed with (as expected):

>               assert {'type': ['integer']} == table_batch['streamed_schema']['schema']['properties'][('c', 'd')]
E               KeyError: ('c', 'd')

tests/test_denest.py:245: KeyError
------------------------------------------------------------------------------------------- Captured stdout call -------------------------------------------------------------------------------------------
denested: [{'streamed_schema': {'type': 'TABLE_SCHEMA', 'path': (), 'level': None, 'key_properties': [], 'mappings': [], 'schema': {'type': 'object', 'additionalProperties': False, 'properties': {}}}, 'records': [{}, {}, {}, {}, {}]}, {'streamed_schema': {'type': 'TABLE_SCHEMA', 'path': ('a', 'b', 'c', 'e'), 'level': 1, 'key_properties': [], 'mappings': [], 'schema': {'type': 'object', 'additionalProperties': False, 'properties': {('f',): {'type': ['string']}, ('g',): {'type': ['boolean']}, ('_sdc_sequence',): {'type': ['null', 'integer']}, ('_sdc_level_0_id',): {'type': ['integer']}, ('_sdc_level_1_id',): {'type': ['integer']}}}}, 'records': []}, {'streamed_schema': {'type': 'TABLE_SCHEMA', 'path': ('a', 'b'), 'level': 0, 'key_properties': [], 'mappings': [], 'schema': {'type': 'object', 'additionalProperties': False, 'properties': {('a', 'b', 'c', 'd'): {'type': ['integer']}, ('_sdc_sequence',): {'type': ['null', 'integer']}, ('_sdc_level_0_id',): {'type': ['integer']}}}}, 'records': [{('c', 'd'): ('integer', 1)}, {('c', 'd'): ('integer', 12)}, {('c', 'd'): ('integer', 123)}, {('c', 'd'): ('integer', 1234)}, {('c', 'd'): ('integer', 12345)}, {('c', 'd'): ('integer', 123456)}, {}]}]
table batch properties:  {('a', 'b', 'c', 'd'): {'type': ['integer']}, ('_sdc_sequence',): {'type': ['null', 'integer']}, ('_sdc_level_0_id',): {'type': ['integer']}}
=================================================================================== 1 failed, 72 passed in 54.83 seconds ===================================================================================

Modified test function

def test__records__nested():
    denested = denest.to_table_batches({
        "properties": {
            "a": {"type": "object",
                  "properties": {
                      "b": {
                          "type": "array",
                          "items": {
                              "type": "object",
                              "properties": {
                                  "c": {
                                      "type": "object",
                                      "properties": {
                                          "d": {"type": "integer"},
                                          "e": {"type": "array",
                                                "items": {"type": "object",
                                                          "properties": {
                                                              "f": {"type": "string"},
                                                              "g": {"type": "boolean"}}}}}}}}}}}}},
        [],
        [{"a": {"b": []}},
         {"a": {"b": [{"c": {"d": 1}}]}},
         {"a": {"b": [{"c": {"d": 12}},
                      {"c": {"d": 123}}]}},
         {"a": {"b": [{"c": {"d": 1234}},
                      {"c": {"d": 12345}},
                      {"c": {"d": 123456}}]}},
         {"a": {"b": [{"c": {"e": [{"f": "hello",
                                    "g": True},
                                   {"f": "goodbye",
                                    "g": True}]}}]}}])

    print('denested:', denested)

    assert 3 == len(denested)

    root_table_checked = False
    nested_table_checked = False
    nested_nested_table_checked = False

    for table_batch in denested:
        if tuple() == table_batch['streamed_schema']['path']:
            root_table_checked = True

            assert {} == table_batch['streamed_schema']['schema']['properties']

            assert 5 == len(table_batch['records'])

            for record in table_batch['records']:
                assert {} == record

    for table_batch in denested:
        if ('a', 'b') == table_batch['streamed_schema']['path']:
            nested_table_checked = True

            assert {'type': ['integer']} == table_batch['streamed_schema']['schema']['properties'][('c', 'd')]

            assert 7 == len(table_batch['records'])

            for record in table_batch['records']:
                # Don't try to access key "('c', 'd')" if record is empty
                if record == {}:
                    continue
                assert 'integer' == record[('c', 'd')][0]
                assert int == type(record[('c', 'd')][1])

    for table_batch in denested:
        if ('a', 'b', 'c', 'e') == table_batch['streamed_schema']['path']:
            nested_nested_table_checked = True

            assert {'type': ['string']} == table_batch['streamed_schema']['schema']['properties'][('f',)]
            assert {'type': ['boolean']} == table_batch['streamed_schema']['schema']['properties'][('g',)]

            assert 2 == len(table_batch['records'])

            for record in table_batch['records']:
                assert 'string' == record[('f',)][0]
                assert str == type(record[('f',)][1])

                assert 'boolean' == record[('g',)][0]
                assert bool == type(record[('g',)][1])

    assert root_table_checked
    assert nested_table_checked
    assert nested_nested_table_checked

    for table_batch in denested:
        assert [] == errors(table_batch)

    print('PASSED')
    print()

@AlexanderMann
Copy link
Copy Markdown
Collaborator Author

@CStejmar 🏓

"g": True},
{"f": "goodbye",
"g": True}]}}]}}])
NESTED_SCHEMA = {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

"f": {"type": "string"},
"g": {"type": "boolean"}}}}}}}}}}}}}

NESTED_RECORDS = [{"a": {"b": []}},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍



def test__records__nested__tables():
denested = denest.to_table_batches(NESTED_SCHEMA, [], NESTED_RECORDS)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

def _get_table_batch_with_path(table_batches, path):
for table_batch in table_batches:
if path == table_batch['streamed_schema']['path']:
return path
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we need to return table_batch instead of path otherwise we fail the tests on trying to access 'schema' etc. I know you meant do that since the function name is _get_table_batch_with_path :).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦‍♂️ this has been a hilariously tough thing for me to do as I don't actually have the working code etc.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I can imagine that 🙂. Anyway, great work! 🚀

Copy link
Copy Markdown

@CStejmar CStejmar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix the problem in _get_table_batch_with_path that I commented on and we are good to go! I like this refactoring 👍. I will approve! All tests pass with my changes from PR #111 and fails on current master 👍 .

Here is what is failing without my denesting fix:

================================================================================================= FAILURES =================================================================================================
_________________________________________________________________________________ test__records__nested__child_table__a_b __________________________________________________________________________________

    def test__records__nested__child_table__a_b():
        denested = denest.to_table_batches(NESTED_SCHEMA, [], NESTED_RECORDS)
        table_batch = _get_table_batch_with_path(denested,
                                                 ('a', 'b'))
    
>       assert {'type': ['integer']} == table_batch['streamed_schema']['schema']['properties'][('c', 'd')]
E       KeyError: ('c', 'd')

tests/test_denest.py:255: KeyError
_______________________________________________________________________________ test__records__nested__child_table__a_b_c_e ________________________________________________________________________________

    def test__records__nested__child_table__a_b_c_e():
        denested = denest.to_table_batches(NESTED_SCHEMA, [], NESTED_RECORDS)
        table_batch = _get_table_batch_with_path(denested,
                                                 ('a', 'b', 'c', 'e'))
    
        assert {'type': ['string']} == table_batch['streamed_schema']['schema']['properties'][('f',)]
        assert {'type': ['boolean']} == table_batch['streamed_schema']['schema']['properties'][('g',)]
    
>       assert 2 == len(table_batch['records'])
E       assert 2 == 0
E        +  where 0 = len([])

tests/test_denest.py:275: AssertionError
=================================================================================== 2 failed, 74 passed in 53.54 seconds ===================================================================================

which is as expected. It passes test__records__nested__tables() and test__records__nested__root_empty() which is expected as well.

@AlexanderMann AlexanderMann merged commit 8d34050 into datamill-co:master Apr 11, 2019
@AlexanderMann AlexanderMann deleted the fix/nested-records--array-object-object branch April 11, 2019 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants