Skip to content
This repository has been archived by the owner on Mar 8, 2020. It is now read-only.

Node position offset is not bytes #122

Closed
m09 opened this issue May 28, 2019 · 6 comments · Fixed by #123
Closed

Node position offset is not bytes #122

m09 opened this issue May 28, 2019 · 6 comments · Fixed by #123
Assignees
Labels

Comments

@m09
Copy link

m09 commented May 28, 2019

In Java, the position is returned based on the offset in the utf8 content, as shown here (no variation between an ascii string and a more complex utf8 string):

  • ascii string, the end offset of the highlighted string is 381:
    java-ascii
  • more complex utf8 string, the end offset of the highlighted string is 381:
    java-utf8

In JS, the position is returned based on the offset in the bytes content, as shown here (variation between an ascii string and a more complex utf8 string):

  • ascii string, the end offset of the highlighted string is 93:
    js-ascii
  • more complex utf8 string, the end offset of the highlighted string is 97:
    js-utf8
@dennwc
Copy link
Member

dennwc commented May 28, 2019

@m09 Thanks for reporting it! It is indeed a bug in the Java driver. Moving to that repository.

@dennwc dennwc transferred this issue from bblfsh/sdk May 28, 2019
@dennwc dennwc self-assigned this May 28, 2019
@dennwc dennwc added the bug label May 28, 2019
@bzz
Copy link
Contributor

bzz commented May 29, 2019

no variation between an ascii string and a more complex utf8 string

It is indeed a bug in the Java driver.

@dennwc Would you be so kind to elaborate, why do you think it's not a bug in JS driver instead?

@dennwc
Copy link
Member

dennwc commented May 29, 2019

@bzz Babelfish defines positions as byte offsets, and it's consistent with how JS does this, according to the bug report:

In JS, the position is returned based on the offset in the bytes content

@m09
Copy link
Author

m09 commented May 29, 2019

I should add that we maintain custom logic to do the conversion (since the bytes offset is rarely what we want) in several projects right now. I think an option to specify whether we want utf8 or bytes position would be appreciated by the ML team (and utf8 default would be the handiest for us).

@bzz
Copy link
Contributor

bzz commented May 29, 2019

@bzz Babelfish defines positions as byte offsets, and it's consistent with how JS does this, according to the bug report:

In JS, the position is returned based on the offset in the bytes content

I'm playing the devil's advocate here, so bare with me for a while pease - my question was more like: how is the user supposed to figure this one out, without asking? :)

I know the answer, https://doc.bblf.sh/uast/semantic-uast.html#position but I also think it may be a signal for us, that it's something not clear/under-communicated to our users (even given the fact of bblfsh/libuast#102)

I also presume that this issue could be an opportunity to verify and post results of this behavior across all the drivers - who knows, may be as with string and number literals before, it could be treated differently in more different drivers. WDYT?

@bzz bzz changed the title Position logic is inconsistent regarding encoding Node position offset is not bytes May 29, 2019
@dennwc
Copy link
Member

dennwc commented May 29, 2019

@m09 I'm working on the conversion function that returns UTF-8 offsets for the Python client, but it may take some time.

@bzz You are right, the docs should give a bit more insight into how positions work. Also, it's definitely worth adding a test for UTF-8 offsets. I remember doing a pass over some drivers, but maybe I missed Java that time.

dennwc pushed a commit to dennwc/java-driver that referenced this issue May 30, 2019
…h#122

Signed-off-by: Denys Smirnov <denys@sourced.tech>
dennwc pushed a commit to dennwc/java-driver that referenced this issue May 30, 2019
…h#122

Signed-off-by: Denys Smirnov <denys@sourced.tech>
dennwc pushed a commit that referenced this issue May 31, 2019
Signed-off-by: Denys Smirnov <denys@sourced.tech>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants