Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

@odk/xpath: initial XPath implementation #3

Merged
merged 32 commits into from
Oct 19, 2023
Merged

Conversation

eyelidlessness
Copy link
Member

@eyelidlessness eyelidlessness commented Oct 17, 2023

What's in the box

An XPath implementation, covering almost all of the behavior in Enketo's current evaluator (openrosa-xpath-evaluator [ORXE]) with a few very specific exceptions:

  • Backwards-compatible support for custom functions. The interface provides a different mechanism which should support this use case if we decide it's important.
  • A small handful of test failures where the actual expected behavior isn't totally clear, namely:
    • Date[time] strings lacking expected zero padding (i.e. date("1970-1-2")). I have a vague memory where I had thought I saw discussion of this, but I've since been unable to find it.
    • Bare equality comparison of date and datetime strings outside of any casting context (i.e. "2018-06-25" = "2018-06-25T00:00:00.000-07:00" as a complete expression). I think a case could be made for this, but it creates a grammar ambiguiuty I'm not entirely comfortable with. And supporting it would require some compromises I'd prefer to justify before diving in for a fix.
    • Formatting date[time] strings expressed as a colloquial/localized format (i.e. format-date('Mon Oct 16 2023 15:57:39 GMT-0700 (Pacific Daylight Time)', '%e | %a')). The string as provided is a serialization produced (and parseable) by JavaScript's Date object; my suspicion is that's why this is supported. If it is actually expected, it shouldn't be a huge lift, but my inclination is to avoid it otherwise.
    • Exponent notation in numeric literals (i.e. int(7.922021953507237e-12)). This is already supported when casting from a string literal, but as a numeric literal it deviates from the XPath grammar. I was able to find JavaRosa test cases for the string-number cast, but I didn't see any for the numeric literal case. Again, my suspicion as to why this was supported in Enketo is the likely casting mechanism (Number(sourceLiteral) or similar). If we do want to support this, casting isn't so much an issue, but it would either require deviation in the tree-sitter grammar as it's currently a parse error, or some effort towards handling it as a parse error.
    • Ambiguity between runtimes on support for time strings with greater than 60 seconds (i.e. decimal-time("06:00:60.000-07:00")). The tests brought in from Enketo expect this to be invalid. Chrome/Chromium/Node and Firefox all treat it as invalid, Safari/WebKit does not.
  • Possibly a lingering set of time-sensitive bugs. But I have a feeling that was actually a discrepancy between CI's time zone and the time zone specified at runtime.

What else is in the box:

  • As implied above: Node support (tested with jsdom, the default DOM compatibility layer used by Vitest; others are currently untested and may vary significantly depending on their divergence from native DOM implementations). As should be obvious, the implementation is currently explicitly tied to DOM APIs—but notably, it is not tied to an underlying XPath implementation.
  • Also as implied above: time zone is parameterized at runtime, rather than derived entirely from the environment (though that remains the default).
  • Conversion of all of Enketo/ORXE's integration tests to TypeScript, along with some significant improvements to test isolation (i.e. fixture setup is test/suite local, rather than implicitly test-run-global) and clarity of test parameterization. All of this, I believe, could pretty easily make its way back to ORXE if we so choose.
  • A pretty thorough exercise/validation of tree-sitter-xpath from a correctness/completion standpoint.

What's not in the box

Several functions specified in the ODK spec are not yet implemented, as they were not in ORXE:

  • current
  • indexed-repeat
  • instance
  • pulldata
  • version

These were all implemented in Enketo Core, for reasons either specified there in comments or for somewhat guessable reasons. Whatever the historical reasons, I believe these can all reasonably be implemented here. I chose to defer their implementation to limit the already quite sizeable scope of this current init, and to leave them open for design discussion as some probably warrant (and as they may dovetail with corresponding work on related XForms-spec functionality).

Why/goals

Normally I like to start PR notes like this on "why?" But I think the above "what?" fit better first in this case.

  1. This was originally going to be a personal exercise, to validate tree-sitter-xpath (as in, "it's not a library if it doesn't have users"). This remained a goal as the implementation felt more worth pursuing for its own sake. (Conclusion: I believe this has accomplished that!)
  2. Accomplish a real/tangible/usable/completeable thing in this early effort toward a potential rewrite. (Conclusion: here's the PR, complete-ish pending failures discussed above and likely cleanup/rework in review)
  3. Accomplish a thing which could potentially also benefit Enketo. (Conclusion: apart from lacking support for custom functions, it should be a drop-in replacement. It would be more work to integrate support for the remaining functions.)
  4. Establish some early precedent for certain principles of design and approach (details below).

Notes on design and approach

Domain focus

I wanted to take this as an early opportunity to establish focus on the problem domain as a key design principle. In this case, "the problem domain" is a set of technical specifications, which isn't as great an exercise of the principle as some other areas we'll get to. But it's a principle that's informed several aspects of the design.

Naming: as much as possible, where a name is used in the pertinent specs or where the spec language suggests obvious naming for particular concepts, that is the naming used in code. Where derivative types or values are used, their names are intentionally derivative of the underlying spec concepts. (There is at least one particular area of concern on this front, LocationPathEvaluation, which is further discussed in code comments.)

Domain logic: as much as possible, implementation logic is intended to follow the language and/or inferrable spirit of the respective aspects of the specs under implementation. While comments are presently sparse, should we desire more extensive commentary the expectation is that spec language should be able to reasonably describe these implementation details. There are cases where implementation is more platform specific, generally where the underlying APIs have important semantic differences. And in a few cases (which we can expect to grow as appropriate), there are implementation considerations specifically for their performance characteristics.

Compliance: specific effort has been made to ensure layers of implementation are compliant with the specs they’re concerned with; where specs differ (ie where the ODK XForms spec notes exceptions to the standards they reference), those divergences are achieved by explicit extension and composition mechanisms rather than baked into lower level spec-conforming aspects. A notable example is datetimes: where they have implications for grammar or function behavior, those are achieved at a higher level than the implementation of underlying XPath data types and built-in functions. And at least some preliminary effort has been made to position future extensibility as analogous to the mechanisms in underlying/related specs (for example some notion of support for function namespacing).

Scoped configurability/extensibility

To the extent the evaluator is sensitive to contextual state—and to the extent that its behavior is extensible—this is achieved in a local instantiation of the evaluation environment. There is still room for improvement on this approach! But the intent is to establish clear patterns of contextual customization from the outset, and to eschew hard coding non-spec assumptions into spec-focused code as a general rule. Exceptions to that rule should ideally have to meet a high bar to be justified.

Performance

The design as implemented is performance-minded, if not currently optimal. Which is to say that explicit care was taken to choose a general approach and specific aspects of implementation, which are some combination of:

  • Known to perform well in the general case (strong emphasis on monomorphization, using known-fast platform capabilities)
  • Likely to perform well by deferring or avoiding work where possible (eg lazy evaluation, obvious opportunities to short-circuit)
  • Likely to perform well by reducing redundant work (eg caching likely-expensive computations)

Early prototyping revealed several clear optimization opportunities, which informed further efforts, but for the most part I put that on hold as I focused on “make it work” and “make it right”. TL;DR, this is probably fast-ish, can probably be even faster, and was intentionally built with further optimization in mind.

To set/inspire initial expectations: in my early prototyping, I saw significantly better overall performance than the browsers' native XPath evaluators for the same expressions. Having been some time since I last measured, I expect performance has degraded from that high bar. But I also believe "faster than native" is a realistic potentiality, if we choose to priotirize that.

Regardless, I want to make benchmarking a part of our automated suite of quality control tools, but I think that's best saved for later.

Maintainability and beyond

Here I don't (only) mean the ability to reasonably deliver bug fixes, or add functionality. Although I do think this is a good first effort on that front.

I also wanted a design which lends itself to tooling to aid future development and integration. The tooling we already use, e.g. code navigation, runtime debugging. And tooling we may want to develop for our users or ourselves, e.g. potential to introduce static analysis to help understand expressions' behavior, or to document information flow and relationships between parts of a form. Or even to, say, help building something like a form builder 😉

eyelidlessness and others added 28 commits October 18, 2023 14:31
Generally defer to `number` function (overridden in xforms function library) for date/datetime-like values in argument positions and numeric comparisons
Whatever cache it’s hitting is probably wrong
This seems like it was probably the cause of several time-of-day-specific test failures. Setting it (temporarily, I hope) in CI should probably mean we can keep testing WebKit (which isn’t otherwise affected by e.g. the `TZ` environment variable like the other browsers)
Annotations might be good for pointing to the remaining test failures for discussion
I’ve missed formatting so many times without this. It should also be a check in CI.
Fix build p2
Tests are left as they’re still part of the original suite
@lognaturel
Copy link
Member

Date[time] strings lacking expected zero padding (i.e. date("1970-1-2")).

This does work in JavaRosa so there are definitely forms using it. It would be good to support but seems fine to just file an issue.

Bare equality comparison of date and datetime strings outside of any casting context (i.e. "2018-06-25" = "2018-06-25T00:00:00.000-07:00" as a complete expression)

I think this is expected to be string comparison and evaluate to false().

my suspicion is that's why this is supported

Yes, I don't think that's expected to be supported. Or very useful, frankly!

exponent notation in numeric literals (i.e. int(7.922021953507237e-12)

I don't think it's expected to be supported.

decimal-time("06:00:60.000-07:00")

I'd expect it to be invalid. JavaRosa agrees.

@@ -0,0 +1 @@
# @odk/xpath
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not have changelogs if we can help it. Seems git and Github are sufficient sources of history. I know some people really want the changes to be baked into the source but I am not one of them.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also discussed in chat: I believe this is just a placeholder and will hopefully be updated by changeset when preparing a release (this is how it appears from other packages using the tool). I also don't want to keep manual changelogs, so if that turns out to be wrong I'd gladly remove it (and others)

- Context-based extensibility of behavior beyond the XPath 1.0 function library
- [ODK XForms](https://getodk.github.io/xforms-spec/) XPath extensions
- Non-browser environments (provided a subset of standard DOM APIs)[^1]
- Minor accommodations for certain [Enketo](https://github.com/enketo/openrosa-xpath-evaluator) function aliases[^2]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to monorepo? I think we're still planning on archiving that repo shortly.


The DOM standard is very complex, and compatibility libraries sometimes differ in terms of standards compliance and compatibility with the major browser implementations. It is likely, perhaps inevitable, that there will be edge cases and minor compatiblity issues when using such a library. At time of writing, no such issues are known in our tested environments[^1], but a couple of minor issues have already been found and fixed during the initial test setup.

[^1]: Currently tested in Node, with [jsdom](https://github.com/jsdom/jsdom). There may be issues with other runtimes and/or DOM compatibility APIs. Node compatibility **does not** require any native extensions. DOM compatibility **does not** require any underlying XPath evaluation functionality (though it does currently rely on global constants like `XPathResult`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a stylistic nit but I feel like all this content should be inline rather than in footnotes 1. ^1 could be a Environments tested in section.

Other than that, great readme. 👍

Footnotes

  1. I love DFW

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to augment with some of the design considerations from the PR description and some things we discussed about deferring to native like the current Enketo evaluator does vs not

@eyelidlessness
Copy link
Member Author

I think I've addressed everything here except design notes in the README. I'll merge for now and add an issue to expand on that.

@eyelidlessness eyelidlessness merged commit 575ff32 into main Oct 19, 2023
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants