
[flèche] Store ranges in protocol-native encoding #624

Merged
merged 6 commits into ejgallego:main on May 23, 2024

Conversation

@ineol (Collaborator) commented Nov 24, 2023

Following discussion on #624, we now store the Flèche locations in
protocol encoding instead of in Unicode character points.

This avoids conversions on all protocol calls. Still, conversion back
to UTF-8 offsets is sometimes needed when requests want to access
Contents.t.

Fixes #620 fixes #621

lsp/core.mli Outdated
(** Translate a UTF-16 LSP position into a Fleche position. *)
val lsp_point_to_doc_point : doc:Fleche.Doc.t -> int * int -> int * int

(** Translate a Fleche position into a UTF-16 LSP position. *)
val doc_point_to_lsp_point : doc:Fleche.Doc.t -> Lang.Point.t -> Lang.Point.t
Owner

Thanks for the PR!

I think we shouldn't need this function at all. The convention in the codebase is that types carry the encoding. Thus:

  • Loc.t: Coq native position, in UTF-8 code units (all offsets)
  • Lang.Point.t: LSP native position, in UTF-16 code units (only character, not offset)

Thus, the conversion from Loc.t to Range.t is the point where we should handle this. That way, the types guarantee the encoding.
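To make the direction concrete, here is a minimal sketch of such a conversion. The function name is hypothetical (not coq-lsp's actual API), and it assumes OCaml >= 4.14 for `String.get_utf_8_uchar`; it maps a UTF-8 byte offset within a line to a UTF-16 code-unit column.

```ocaml
(* Hypothetical helper, not coq-lsp's actual API: convert a UTF-8 byte
   offset within [line] into a UTF-16 code-unit column.
   Requires OCaml >= 4.14 for [String.get_utf_8_uchar]. *)
let utf8_offset_to_utf16 ~line ~byte_off =
  let rec go i units =
    if i >= byte_off then units
    else
      let d = String.get_utf_8_uchar line i in
      let u = Uchar.utf_decode_uchar d in
      (* code points above U+FFFF need a surrogate pair: 2 UTF-16 units *)
      let units = units + (Uchar.utf_16_byte_length u / 2) in
      go (i + Uchar.utf_decode_length d) units
  in
  go 0 0

let () =
  (* in "héllo", 'é' is 2 UTF-8 bytes but a single UTF-16 unit, so
     byte offset 3 (just past 'é') maps to UTF-16 column 2 *)
  assert (utf8_offset_to_utf16 ~line:"h\xc3\xa9llo" ~byte_off:3 = 2)
```

Doing this once, at Loc.t -> Range.t construction time, is what lets every downstream consumer trust the encoding from the type alone.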

Collaborator Author

I think the problem is that Lang.Point.t is used as Unicode code point positions rather than UTF-16 positions in many places.

In general, Fleche is not UTF-16 aware, and I don't know that it should be. Also, if we later want to be parametric over the encoding, in order to use the encoding-negotiation feature of the LSP protocol, this will make things more complicated.

My initial plan was to try to do the UTF-16/unicode translation in the controller directory, but that may be a bad idea.

Maybe we could have Lsp.Point and Lsp.Range for the UTF-16 encodings?
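A minimal sketch of what such a split could look like (module and field names here are hypothetical illustrations, not actual coq-lsp code): the key idea is that the encoding lives in the type.

```ocaml
(* Hypothetical sketch of the proposed split; not the actual coq-lsp
   modules. A value's type says which encoding its columns use. *)
module Lsp_point = struct
  type t =
    { line : int
    ; character : int (* UTF-16 code units, as the LSP spec mandates *)
    }
end

module Lang_point = struct
  type t =
    { line : int
    ; character : int (* Unicode code points *)
    ; offset : int    (* UTF-8 byte offset into the document *)
    }
end
```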

Owner

Indeed, at first I had Lang.Point.t be Unicode code points; that is wrong, as they are supposed to be protocol-level locations, which should be UTF-16. I don't know why I did this; I guess I was just lazy / focused on more important stuff. The intent for Lang.Point.t is that these are "LSP" locations.

I think doing the translation in controller is right, at least for input positions that are served to requests; I'm still struggling to understand what the best choice is in general, but now that you mention it, having Lsp.Point and Lsp.Range could indeed help.

But there is a tricky factor here: we need to provide plugin writers with text-manipulation functions that are easy to use, and I'm afraid I don't have enough experience in this domain to understand what is best, even more so now that the LSP protocol has charset negotiation.

Let's recap the setup today:

  • Coq will only accept UTF-8 encoded buffers.
  • Doc.contents : Contents.t is encoded in UTF-8 (as Coq natively expects)
  • Coq's Loc.t offsets are UTF-8 byte offsets
  • Lang.Point.t is in Unicode code points, but that's a bug; the intent was for it to be in UTF-16 code units
  • Locations inside fleche are stored in a protocol-based format

So we have several options on how to move forward:

  1. we fix Lang.Point.t to be protocol-level locations (UTF-16 code units for now), and keep the rest as is
  2. we update Contents.t to have a UTF-32 representation of the document, introduce LSP.Point.t, and keep Lang.Point.t as Unicode code points
  3. we fix Lang.Point.t to be UTF-8, and do the conversion at the protocol level (though this has the drawback of exposing some internals)

A nice property of 1 is that once more clients start to support encoding negotiation, we could actually skip the conversion: we could have Lang.Point.t be UTF-8 code units, so things would match and we could avoid the conversion overhead, which is often non-trivial.

However, the main question I have is: what is the best API for people to manipulate Contents.t? It seems to me that the best in this case would be for things to be encoded in UTF-32; that way, things like "look back 3 chars" are easy and work in OCaml as people expect.

We could make the conversion to UTF-32 lazier by requiring plugins to set up a "view" of the document: access to Contents.t can then only happen via a conversion function that takes the range (or point) and returns the set of converted lines.
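As an illustration of that lazy-view idea (all names here are hypothetical, not coq-lsp code; assumes OCaml >= 4.14): the document stays UTF-8, and only the lines covered by the requested span are decoded into code points.

```ocaml
(* Hypothetical sketch of the lazy "view" idea: Contents stays UTF-8,
   and only the lines a plugin actually asks for are decoded into a
   UTF-32-like array of code points. Not actual coq-lsp code. *)
type contents = { lines : string array } (* UTF-8, one entry per line *)

(* Decode one UTF-8 line into an array of code points. *)
let decode_line (s : string) : Uchar.t array =
  let acc = ref [] and i = ref 0 in
  while !i < String.length s do
    let d = String.get_utf_8_uchar s !i in
    acc := Uchar.utf_decode_uchar d :: !acc;
    i := !i + Uchar.utf_decode_length d
  done;
  Array.of_list (List.rev !acc)

(* The "view": decode only lines [first_line..last_line], inclusive. *)
let view ~contents ~first_line ~last_line =
  Array.init (last_line - first_line + 1) (fun i ->
      decode_line contents.lines.(first_line + i))
```

The conversion cost is then proportional to the span a plugin touches, not to the document size.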

But indeed, it is hard to predict what will work best, and what the overheads are; I'm going to sleep on it a bit more.

For now I'll cherry-pick the fix to hover. Thanks for your thoughts and for looking into this.

Collaborator Author

> So we have several options on how to move forward:
>
>   1. we fix Lang.Point.t to be protocol-level locations (UTF-16 code units for now), and keep the rest as is
>   2. we update Contents.t to have a UTF-32 representation of the document, introduce LSP.Point.t, and keep Lang.Point.t as Unicode code points
>   3. we fix Lang.Point.t to be UTF-8, and do the conversion at the protocol level (though this has the drawback of exposing some internals)

I'm a bit confused: to whom do we expose the internals?

> A nice property of 1 is that once more clients start to support encoding negotiation, we could actually skip the conversion: we could have Lang.Point.t be UTF-8 code units, so things would match and we could avoid the conversion overhead, which is often non-trivial.

This is also a property of 3, isn't it?

> However, the main question I have is: what is the best API for people to manipulate Contents.t? It seems to me that the best in this case would be for things to be encoded in UTF-32; that way, things like "look back 3 chars" are easy and work in OCaml as people expect.

I am skeptical: it's 2023, people should expect to deal with Unicode. And while UTF-32 makes it easy to deal with code points, in the context of an editor the right notion is the grapheme cluster. I think an API that allows the plugin developer to consciously make these choices would be better in the long run.
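To illustrate the distinction being drawn here (a standalone example, not coq-lsp code; assumes OCaml >= 4.14): a single user-perceived character can span several code points, so stepping back "3 chars" in code points can land in the middle of a grapheme cluster.

```ocaml
(* "é" written as 'e' + U+0301 (combining acute accent): one grapheme
   cluster on screen, but two code points and three UTF-8 bytes. *)
let count_code_points s =
  let n = ref 0 and i = ref 0 in
  while !i < String.length s do
    i := !i + Uchar.utf_decode_length (String.get_utf_8_uchar s !i);
    incr n
  done;
  !n

let () =
  let e_acute = "e\xcc\x81" in
  assert (String.length e_acute = 3);    (* UTF-8 bytes *)
  assert (count_code_points e_acute = 2) (* code points; 1 grapheme *)
```

Grapheme segmentation itself needs Unicode tables (e.g. an external library such as uuseg), which is part of why exposing the choice to the plugin author rather than baking one unit into the API is attractive.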

Owner

> I'm a bit confused: to whom do we expose the internals?

Sorry, I was sloppy; I was thinking of the need, in that case, for the protocol layer to access the line in order to compute the character offset. I regard this part as internal, but indeed it doesn't seem like a big deal.

> This is also a property of 3, isn't it?

Yes!

> I am skeptical: it's 2023, people should expect to deal with Unicode.

I fully agree; on the other hand, it'd be great if we could have some typing to prevent mishandling of this, but no need to over-engineer now.

> And while UTF-32 makes it easy to deal with code points, in the context of an editor the right notion is the grapheme cluster.

Yes; however, I think that in the context of the server, grapheme clusters are not so relevant, right?

> I think an API that allows the plugin developer to consciously make these choices would be better in the long run.

I do agree. @ineol, so if I understand correctly, you think that option 3, while keeping all the internal locations and text buffers in UTF-8 code units, is the path to take, right?

Owner

Hi @ineol , sorry for the delay w.r.t. this, I will take care of it soon.

After a bit of thinking, I believe the solution that makes most sense is to have Lang.Range.t contain positions that are native to the protocol, with the conversion happening at the point of range / position generation.

I think this solution makes sense due to:

  • it is much better to do the conversion once (at a single point) than to have to do it all over the place; after some experiments, this did matter quite a bit
  • this is not in conflict with encoding negotiation at the protocol level, as encoding negotiation happens at server init time. VSCode still only offers UTF-16, so I haven't added support for utf8, but protocol2coq and coq2protocol could read the setting and choose the right encoding (and hopefully become the identity at some point with VSCode)
  • we can document this, and provide programmers with protocol2byte etc. functions, so they can manipulate text as they desire.

WDYT?

Collaborator Author

Hi @ejgallego, thank you for following up! Your proposal seems reasonable to me. For the other direction (the client sending a range to the server), where should the conversion happen?

Owner

Types that come from the client don't need conversion as they are already in protocol format, so request handlers can use Range.t normally with the incoming points.

Owner

Where care is needed in this approach is when manipulation of Contents.t is required, as Contents.t is in UTF-8 but Range.t is in the protocol encoding; for that, the idea is to provide all the needed functions in the Contents.t API, so that the conversion, for example for Contents.slice, is handled there.

In general, as we discussed, it is recommended that request handlers use a byte-based index, so they need to call protocol2byte if they need it.
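Here is a sketch of what such a protocol2byte could look like for a UTF-16 protocol encoding; the name is borrowed from the discussion above, but this body is an illustrative assumption, not the actual coq-lsp implementation (assumes OCaml >= 4.14). It maps a UTF-16 code-unit column back to a UTF-8 byte offset, which can then index the UTF-8 Contents.t.

```ocaml
(* Hedged sketch: map a UTF-16 code-unit column back to a UTF-8 byte
   offset within [line], so a request handler can index the UTF-8
   document contents directly. Not the actual coq-lsp implementation.
   Requires OCaml >= 4.14. *)
let protocol2byte ~line ~utf16_col =
  let rec go byte units =
    if units >= utf16_col || byte >= String.length line then byte
    else
      let d = String.get_utf_8_uchar line byte in
      let u = Uchar.utf_decode_uchar d in
      go
        (byte + Uchar.utf_decode_length d)
        (units + (Uchar.utf_16_byte_length u / 2))
  in
  go 0 0

let () =
  (* in "héllo", UTF-16 column 2 (just past 'é') is byte offset 3 *)
  assert (protocol2byte ~line:"h\xc3\xa9llo" ~utf16_col:2 = 3)
```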

ejgallego added a commit to ineol/coq-lsp that referenced this pull request May 22, 2024
Following discussion on ejgallego#624, we now store the Flèche locations in
protocol-level format instead of in Unicode character points.

This avoids conversions on all protocol calls; conversion back to a UTF-8
based encoding is sometimes needed to manipulate `Contents.t`.
@ejgallego ejgallego changed the title [Hover] Follow LSP and treat positions as UTF16 code units [flèche] Store ranges in protocol-native format. May 22, 2024
@ejgallego ejgallego changed the title [flèche] Store ranges in protocol-native format. [flèche] Store ranges in protocol-native format May 22, 2024
ejgallego added a commit to ineol/coq-lsp that referenced this pull request May 22, 2024
Following discussion on ejgallego#624, we now store the Flèche locations in
protocol-level format instead of in Unicode character points.

This avoids conversions on all protocol calls; conversion back to a UTF-8
based encoding is sometimes needed to manipulate `Contents.t`.
This is in anticipation of making Lang use protocol-level locations.
@ejgallego (Owner)

@ineol, this is almost ready IMO, modulo some proper cleanup. Let me know if you have any thoughts; if not, I'll go ahead and merge after the cleanups.

IMVHO it seems to work quite well.

ejgallego added a commit to ineol/coq-lsp that referenced this pull request May 22, 2024
Following discussion on ejgallego#624, we now store the Flèche locations in
protocol-level format instead of in Unicode character points.

This avoids conversions on all protocol calls; conversion back to a UTF-8
based encoding is sometimes needed to manipulate `Contents.t`.

Fixes ejgallego#620 fixes ejgallego#621
@ejgallego ejgallego changed the title [flèche] Store ranges in protocol-native format [flèche] Store ranges in protocol-native encoding May 22, 2024
@ejgallego ejgallego merged commit 5c735a7 into ejgallego:main May 23, 2024
13 checks passed