Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to emit the root document if the desired sub-document does not exist #189

Merged
merged 7 commits into from
Jun 25, 2021

Conversation

snopoke
Copy link
Contributor

@snopoke snopoke commented Jun 24, 2021

The focus of the changes here is to allow exporting data from the messaging event API even when there are no messages. Each event may have multiple messages so in order to get one row per message we need an expression like this:

messaging-event.messages.[*]

However if there are no messages, as is the case when the event errors, we still want to export one row for the event. To accomplish this a new command line option is added to the CLI:

--export-root-if-no-subdocument
    Use this when you are exporting a nested document e.g. form.form..case, messaging-event.messages.[*]
    And you want to have a record exported even if the nested document does not exist or is empty.

When this option is used in conjunction with a Data Source query like those given in the examples, the minilnq expression
is adjusted so that if the main expression returns no results, the root document is used instead:

- Reference('messages.[*]')
+ Apply(Reference('_or_raw'), Reference('messages.[*]'), Reference("$"))

The _or_raw function is similar to or but it does not fully unwrap the values.

Copy link
Member

@czue czue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good and feature makes sense. Just had some small suggestions on testing.

Reference("id"), Reference('baz'), Reference('$.id'), Reference('$.foo.id'), Reference('$.foo.name')
]))
self.check_case(mmap.eval(env), [
['1.bar.1.bar.[0]', 'a1', '1', '1.bid', 'zip'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id is parent_id.key.child_id? and then if no child id is present then it's parent_id.key.parent_id.key[index]?

(mostly just confused by the first entry in this list)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually if I'm being honest I'm having a hard time wrapping my head around several of these test cases. I appreciate the density of the test, but I wonder if some comments or named test cases might help a bit? Or if you think it's clear and the problem is my own understanding of minilinq that's fair too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is confusing and results from both a bug in jsonpath-rw and a bug in commcare-export:

I would like to fix these issues but it would results in breaking certain exports (any that are using the feature that allows you to specify the root doc expression in the Data Source column.

I'm going to think about how we could do the rollout. If you have thought's I'd be keen to know.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way they break is that the IDs change, yeah? So if one were using SQL, and, e.g. a case was modified, then you could end up with duplicate rows for the children?

Are you worried about migrating exports that Dimagi maintains or breaking external usages? In general introducing breaking changes seems maybe acceptable so long as it's a major version bump and it's clearly documented (including how to fix - maybe in this case just by wiping your data and starting over?). Certainly the expectation would be that one should read release notes for any major library upgrade.

We could also add a warning statement in a point release to log when someone would be affected by the upgrade.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way they break is that the IDs change, yeah? So if one were using SQL, and, e.g. a case was modified, then you could end up with duplicate rows for the children?

That's correct.

Are you worried about migrating exports that Dimagi maintains or breaking external usages? In general introducing breaking changes seems maybe acceptable so long as it's a major version bump and it's clearly documented (including how to fix - maybe in this case just by wiping your data and starting over?). Certainly the expectation would be that one should read release notes for any major library upgrade.

I'm don't know of any that Dimagi maintains which use this feature. But if Dimagi is maintaining it we could work around the issue. I'm mostly concerned about those 'in the wild'.

We could also add a warning statement in a point release to log when someone would be affected by the upgrade.

Yes, that's also a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually if I'm being honest I'm having a hard time wrapping my head around several of these test cases. I appreciate the density of the test, but I wonder if some comments or named test cases might help a bit? Or if you think it's clear and the problem is my own understanding of minilinq that's fair too.

Refactored tests: 4d194d2 (much better, thanks for the prompt)

['1.bar.1.bar.[0]', 'a1', '1', '1.bid', 'zip'],
['1.bar.bazzer', 'a2', '1', '1.bid', 'zip'],
['2', [], '2', '2.bid', 'zap'],
['3.bar.3.bar.[0]', [], '3', '3.bid', 'mip'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also find it odd how different the id is in these two examples. it's because the dict counts as an entry but an empty list does not?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your confusion is well deserved. There are some 'bugs' and oddities in commcare-export as well as jsonpath-rw that cause this. I pushed a test to master earlier in the day to 'document' the weirdness: https://github.com/dimagi/commcare-export/blob/master/tests/test_minilinq.py#L58

I am planning to replace jsonpath-rw with jsonpath-ng which has a bunch of bugfixes and improvements and will resolve this specific issue (for this case 'id' will become '3').

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks I missed that. Sounds like I can mostly ignore the first column in these then? If so then everything is much clearer.

Copy link
Member

@czue czue Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(could we remove the id column from this test if it's already tested elsewhere and not part of what's being tested here?)

edit: actually I don't think this is a good idea since it is relevant to disambiguate what the base object is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commcare_export/excel_query.py Outdated Show resolved Hide resolved
value_or_root = get_value_or_root_expression('bar.[*]')
flatmap = FlatMap(source=Literal(data), body=value_or_root)
mmap = Map(source=flatmap, body=List([
Reference("id"), Reference('baz'), Reference('$.id'), Reference('$.foo.id'), Reference('$.foo.name')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be worth adding a test case where baz exists on the parent (and the child is empty)? Is the behavior there that it would use it or that it would remain empty if it has fallen back to the parent emit (my guess is the desired behavior is to be empty)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question, it would actually take the value from the parent:

data = {"id": 6, "foo": {'id': 'bid', 'name': 'mop'}, "baz": "root_bazz"}
# id, baz, $.id, $.foo.id, $.foo.name
result = ['6', 'root_bazz', '6', '6.bid', 'mop']

That definitely isn't great. I have no idea how to work around that. I might be able to hack it by setting a special var in the env if we're in the case where we're using the root doc and then change the JsonPathEnv to only allow 'root' expressions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm - I guess it's not possible to emit something like {'$': root} instead of the root document itself? (not sure if that even makes sense)

I don't know that it's an important enough edge-case to worry about, though does seem worth at least documenting the "expected" behavior in a test

Copy link
Contributor Author

@snopoke snopoke Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not loving this but.... 8ac159f

This works around the special case where we have replaced the doc
with the root doc. In this case any field reference that is not
'root based' should be ignored.
@snopoke snopoke mentioned this pull request Jun 25, 2021
@czue
Copy link
Member

czue commented Jun 25, 2021

looks like the other test now also needs to be updated

Copy link
Member

@czue czue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good by me once tests pass. Thanks for filling in the gaps in my knowledge!

Reference("id"), Reference('baz'), Reference('$.id'), Reference('$.foo.id'), Reference('$.foo.name')
]))
self.check_case(mmap.eval(env), [
['1.bar.1.bar.[0]', 'a1', '1', '1.bid', 'zip'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way they break is that the IDs change, yeah? So if one were using SQL, and, e.g. a case was modified, then you could end up with duplicate rows for the children?

Are you worried about migrating exports that Dimagi maintains or breaking external usages? In general introducing breaking changes seems maybe acceptable so long as it's a major version bump and it's clearly documented (including how to fix - maybe in this case just by wiping your data and starting over?). Certainly the expectation would be that one should read release notes for any major library upgrade.

We could also add a warning statement in a point release to log when someone would be affected by the upgrade.

Copy link
Member

@czue czue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for test cleanup!

@snopoke snopoke merged commit a06e219 into master Jun 25, 2021
@snopoke snopoke deleted the sk/value-or-root branch June 25, 2021 12:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants