Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get the span as a result of a match #242

Closed
charles-paperman opened this issue Sep 4, 2023 · 8 comments · Fixed by #275
Closed

Get the span as a result of a match #242

charles-paperman opened this issue Sep 4, 2023 · 8 comments · Fixed by #275
Assignees
Labels
area: result Improvements in query result reporting good first issue Good for newcomers help wanted External contributions welcome type: feature New feature or request
Milestone

Comments

@charles-paperman
Copy link
Collaborator

Given a string in RAM, it would be nice to collect all the span of the result so that they could be parsed and loaded with an external solution.

@github-actions github-actions bot added the acceptance: triage Waiting for owner's input label Sep 4, 2023
@github-actions
Copy link

github-actions bot commented Sep 4, 2023

Tagging @V0ldek for notifications

@V0ldek
Copy link
Member

V0ldek commented Sep 5, 2023

We already have matches which returns a Match that includes the MatchSpan, but it includes the copied bytes as well. I understand the request is about a zero-copy endpoint.

Unfortunately, we can't exactly get zero-copy, because of the additional work required to get the end of a match – starting indices are easy (hence indices exists), but to get the end we need to store the matched bytes and then trim the trailing whitespace; so the copy would happen anyway, even if we didn't return it to the caller.

However, if the purpose is for external to parse the results, then trimming the whitespace is probably not required. I'd expect any parser to just discard the trailing whitespace. Maybe we can expose this and explicitly document that it's an approximate span, where the ends might point to some whitespace after the actual match. Call it approx_spans? That would be significantly faster than the full matches function now.

@V0ldek V0ldek added mod: engine area: result Improvements in query result reporting type: feature New feature or request labels Sep 5, 2023
@V0ldek
Copy link
Member

V0ldek commented Sep 5, 2023

To make the issue concrete, there are currently two real cases where a match includes whitespace that needs to be trimmed. One, if the JSON is weird and has spaces before commas delimiting items in an object or list:

{
  "a": 42   ,
  "b": 43
}

with query $.a would match 42␣␣␣. or

[ 42  , 43 ]

where $[0] would match 42␣␣␣.

This is a corner-case, since no sane person or formatter writes JSONs like that. The other happens all the time, however, and it's when the match happens at the end of an object/list and so it will include all whitespace until the parent's closing brace/bracket:

{
  "a": {
    "b": 42
  }
}

where $.a.b would match 42⏎␣␣.

In the last case, the correct span to report is 20-21, but to get that we need to trim the whitespace after the match. In the fast approximate span endpoint the report would be 20-25, which would include all the whitespace.

@V0ldek
Copy link
Member

V0ldek commented Sep 5, 2023

@charles-paperman Is there value in adding this to rq proper as an output mode or is the use case purely for the library?

@charles-paperman
Copy link
Collaborator Author

I believe it is purely for the lib. I don't think reporting the span in rq makes a lot of sense.

@V0ldek
Copy link
Member

V0ldek commented Sep 6, 2023

So the API here would be:

fn approx_spans<I, S>(&self, input: &I, sink: &mut S) -> Result<(), EngineError>
       where I: Input,
             S: Sink<MatchSpan>;

Internally, a different recorder that would eschew all the byte-copying and whitespace trimming and work with reported offsets only. It'd still need to use the stack to make sure the responses are ordered properly, we should follow a similar design to NodesRecorder.

The documentation MUST explicitly and prominently state that the spans will be inaccurate at the ends, but with the guarantee that only trailing JSON whitespace characters are included.

@V0ldek V0ldek added this to the v1.0.0 milestone Sep 6, 2023
@github-actions github-actions bot added acceptance: go ahead Reviewed, implementation can start and removed acceptance: triage Waiting for owner's input labels Sep 6, 2023
@V0ldek V0ldek added help wanted External contributions welcome good first issue Good for newcomers labels Sep 6, 2023
@charles-paperman
Copy link
Collaborator Author

Efficient approximate copy-free span would be neat!

@V0ldek V0ldek self-assigned this Sep 18, 2023
V0ldek added a commit that referenced this issue Sep 20, 2023
- Engine can return an approximate span of the match,
    where "approximate" means the start index is correct,
    but the end index might include trailing whitespace after the match.
- This mode is much faster that full `matches`, close to the performance
    of `count`, especially for large result sets.
- This is a library-only feature.

Ref: #242
V0ldek added a commit that referenced this issue Sep 20, 2023
- Engine can return an approximate span of the match,
    where "approximate" means the start index is correct,
    but the end index might include trailing whitespace after the match.
- This mode is much faster that full `matches`, close to the performance
    of `count`, especially for large result sets.
- This is a library-only feature.

Ref: #242
V0ldek added a commit that referenced this issue Sep 20, 2023
- Engine can return an approximate span of the match,
    where "approximate" means the start index is correct,
    but the end index might include trailing whitespace after the match.
- This mode is much faster that full `matches`, close to the performance
    of `count`, especially for large result sets.
- This is a library-only feature.

Ref: #242
V0ldek added a commit that referenced this issue Sep 20, 2023
- Engine can return an approximate span of the match,
    where "approximate" means the start index is correct,
    but the end index might include trailing whitespace after the match.
- This mode is much faster that full `matches`, close to the performance
    of `count`, especially for large result sets.
- This is a library-only feature.

Ref: #242
@github-actions github-actions bot removed the acceptance: go ahead Reviewed, implementation can start label Sep 20, 2023
@V0ldek
Copy link
Member

V0ldek commented Sep 20, 2023

Implemented in v0.8.1 as approximate_spans.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: result Improvements in query result reporting good first issue Good for newcomers help wanted External contributions welcome type: feature New feature or request
Projects
Status: Released
Development

Successfully merging a pull request may close this issue.

2 participants