This repository has been archived by the owner on Mar 8, 2020. It is now read-only.

[feature] Include the information about full tokens #291

Open
vmarkovtsev opened this issue Aug 14, 2018 · 6 comments

Comments

@vmarkovtsev

Strings, comments, etc. have their Token set to the inner value of the token. E.g. in Python
"hello" has Token hello (no quotes). This is all good and logical.

However, we discard information about the real token: quote characters, comment characters, etc.
That information is needed to reproduce the original source code from a UAST. I have two proposals:

  1. Add "FullToken" for those nodes which need it.
  2. Add "TokenPrefix" and "TokenSuffix".
@juanjux
Contributor

juanjux commented Aug 14, 2018

Comments should have the character used, prefix and suffix in the semantic UAST "Comment" object. For strings, at least in the Python and Ruby drivers, unfortunately the native AST doesn't provide the string type so this won't be possible for all drivers unless we parse the source code ourselves.

I'll leave this open just in case we find a workable solution in the future.

@dennwc dennwc self-assigned this Aug 14, 2018
@vmarkovtsev
Author

vmarkovtsev commented Aug 14, 2018

The current workaround is simple: I look at the difference between file_contents[start_position.offset:end_position.offset] and Token and record prefixes and suffixes.
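A minimal sketch of that workaround, assuming the node offsets index directly into the same string (or bytes object) that holds the file contents. The function name `split_token` and its parameters are hypothetical and are not part of any Babelfish client API:

```python
from typing import Optional, Tuple


def split_token(
    file_contents: str, start_offset: int, end_offset: int, token: str
) -> Optional[Tuple[str, str]]:
    """Return the (prefix, suffix) that surround `token` in the node's source span.

    Hypothetical helper: `start_offset` and `end_offset` stand in for
    start_position.offset and end_position.offset of a UAST node.
    """
    full = file_contents[start_offset:end_offset]
    idx = full.find(token)
    if idx < 0:
        # The token is not a literal substring of the span,
        # e.g. because escape sequences were decoded by the driver.
        return None
    return full[:idx], full[idx + len(token):]


source = 'x = "hello"  # hi\n'
print(split_token(source, 4, 11, "hello"))  # string literal
print(split_token(source, 13, 17, "hi"))    # comment
```

If the offsets are byte offsets, the same slicing would have to be done on the encoded file contents rather than on a decoded string.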

@dennwc
Member

dennwc commented Aug 14, 2018

Token as a concept won't work in the long run, so I think we should provide a helper that selects source file content based on node positions, as @vmarkovtsev mentioned.

For example, what is the token of do ... while? This will get more and more complex once we start working with semantic concepts for classes.

@juanjux
Contributor

juanjux commented Aug 14, 2018

They work pretty well... for identifiers and literals. For statements and reserved words, as you proved, they're problematic (the same happens with "from x import y" in Python, which is a single node with children).

Maybe we should make a distinction between a token and a representation.

@dennwc
Member

dennwc commented Aug 14, 2018

The token is something that exists in the source code. Egor mentioned a few times that he expects tokens to be valid for all node types, which cannot be the case with the current model.

I would rather go with semantic concepts, so Comments have text, prefix, etc., and String (literal) has a value and quotes. Tokens can be provided with positional info. Since UAST v2 allows more than 2 positional fields, we can define a few more to represent start/end positions of different keywords in the statement.
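As a rough illustration of the semantic-concept idea, the sketch below models a comment and a string literal as plain Python dataclasses. The class and field names are hypothetical and are not taken from the UAST v2 schema; escaping inside string values is deliberately ignored:

```python
from dataclasses import dataclass


@dataclass
class Comment:
    """Hypothetical semantic comment: inner text plus its delimiters."""
    text: str
    prefix: str = ""
    suffix: str = ""

    def source(self) -> str:
        # Rebuild the full token as it would appear in the source file.
        return f"{self.prefix}{self.text}{self.suffix}"


@dataclass
class String:
    """Hypothetical semantic string literal: value plus its quote character."""
    value: str
    quote: str = '"'

    def source(self) -> str:
        # Does not re-escape the value; a real implementation would have to.
        return f"{self.quote}{self.value}{self.quote}"


print(Comment(text="hi", prefix="# ").source())
print(String(value="hello", quote="'").source())
```

With objects like these, the inner value and the full token are both recoverable without a separate FullToken field.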

@juanjux
Contributor

juanjux commented Aug 14, 2018

Even with semantic objects, it would be nice to keep the concept, either as a single unified name or as some kind of field metadata, so XPath queries don't have to match every semantic object to retrieve a different field in each, which is what happens now, as @smacker said the other day.
