
[flèche] Store ranges in protocol-native encoding #624

Merged
merged 6 commits into ejgallego:main on May 23, 2024

Conversation

@ineol (Collaborator) commented Nov 24, 2023

Following discussion on #624, we now store the Flèche locations in
protocol encoding instead of in Unicode character points.

This avoids conversions on all protocol calls. Still, conversion back
to UTF-8 offsets is sometimes needed when requests want to access
Contents.t.

Fixes #620 fixes #621

lsp/core.mli Outdated
(** Translate a UTF-16 LSP position into a Fleche position. *)
val lsp_point_to_doc_point : doc:Fleche.Doc.t -> int * int -> int * int

(** Translate a Fleche position into a UTF-16 LSP position. *)
val doc_point_to_lsp_point : doc:Fleche.Doc.t -> Lang.Point.t -> Lang.Point.t
Owner

Thanks for the PR!

I think we shouldn't need this function at all. The convention in the codebase is that types carry the encoding. Thus:

  • Loc.t: Coq native position, in UTF-8 code units (all offsets)
  • Lang.Point.t: LSP native position, in UTF-16 code units (only character, not offset)

Thus, the conversion from Loc.t to Range.t is the point where we should handle this. That way, the types guarantee the encoding.
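To make the direction concrete, here is a minimal sketch of such a conversion. The function name is hypothetical (not coq-lsp's actual API), and it assumes OCaml >= 4.14 for `String.get_utf_8_uchar`; it maps a UTF-8 byte offset within a line to a UTF-16 code-unit column.

```ocaml
(* Hypothetical helper, not coq-lsp's actual API: convert a UTF-8 byte
   offset within [line] into a UTF-16 code-unit column.
   Requires OCaml >= 4.14 for [String.get_utf_8_uchar]. *)
let utf8_offset_to_utf16 ~line ~byte_off =
  let rec go i units =
    if i >= byte_off then units
    else
      let d = String.get_utf_8_uchar line i in
      let u = Uchar.utf_decode_uchar d in
      (* code points above U+FFFF need a surrogate pair: 2 UTF-16 units *)
      let units = units + (Uchar.utf_16_byte_length u / 2) in
      go (i + Uchar.utf_decode_length d) units
  in
  go 0 0

let () =
  (* in "héllo", 'é' is 2 UTF-8 bytes but a single UTF-16 unit, so
     byte offset 3 (just past 'é') maps to UTF-16 column 2 *)
  assert (utf8_offset_to_utf16 ~line:"h\xc3\xa9llo" ~byte_off:3 = 2)
```

Doing this once, at Loc.t -> Range.t construction time, is what lets every downstream consumer trust the encoding from the type alone.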

Collaborator Author

I think the problem is that Lang.Point.t is used as Unicode code point positions rather than UTF-16 positions in many places.

In general, Fleche is not UTF-16 aware, and I don't know that it should be. Also, if we later want to be parametric over the encoding, in order to use the encoding-negotiation feature of the LSP protocol, this will make things more complicated.

My initial plan was to try to do the UTF-16/unicode translation in the controller directory, but that may be a bad idea.

Maybe we could have Lsp.Point and Lsp.Range for the UTF-16 encodings?
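A minimal sketch of what such a split could look like (module and field names here are hypothetical illustrations, not actual coq-lsp code): the key idea is that the encoding lives in the type.

```ocaml
(* Hypothetical sketch of the proposed split; not the actual coq-lsp
   modules. A value's type says which encoding its columns use. *)
module Lsp_point = struct
  type t =
    { line : int
    ; character : int (* UTF-16 code units, as the LSP spec mandates *)
    }
end

module Lang_point = struct
  type t =
    { line : int
    ; character : int (* Unicode code points *)
    ; offset : int    (* UTF-8 byte offset into the document *)
    }
end
```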

Owner

Indeed, at first I had Lang.Point.t be Unicode code points; that is wrong, as they are supposed to be protocol-level locations, which should be UTF-16. I don't know why I did this; I guess I was just lazy / focused on more important stuff. The intent for Lang.Point.t is that these are "LSP" locations.

I think doing the translation in controller is right, at least for input positions that are served to requests; I'm still struggling to understand what the best choice is in general, but now that you mention it, having Lsp.Point and Lsp.Range could indeed help.

But there is a tricky factor here: we need to provide plugin writers with text-manipulation functions that are easy to use, and I'm afraid I don't have enough experience in this domain to understand what is best, even more so now that the LSP protocol has charset negotiation.

Let's recap the setup today:

  • Coq will only accept UTF-8 encoded buffers.
  • Doc.contents : Contents.t is encoded in UTF-8 (as Coq natively expects)
  • Coq's Loc.t offsets are UTF-8 byte offsets
  • Lang.Point.t is in Unicode code points, but that's a bug; the intent was for it to be in UTF-16 code units
  • Locations inside fleche are stored in a protocol-based format

So we have several options on how to move forward:

  1. we fix Lang.Point.t to be protocol-level locations (UTF-16 code units for now), and keep the rest as is
  2. we update Contents.t to have a UTF-32 representation of the document, introduce LSP.Point.t, and keep Lang.Point.t as Unicode code points
  3. we fix Lang.Point.t to be UTF-8, and do the conversion at the protocol level (though this has the drawback of exposing some internals)

A nice property of 1 is that once more clients start to support encoding negotiation, we could actually skip the conversion: we could have Lang.Point.t be UTF-8 code units, so things would match and we could avoid the conversion overhead, which is often non-trivial.

However, the main question I have is: what is the best API for people to manipulate Contents.t? It seems to me that the best in this case would be for things to be encoded in UTF-32; that way, things like "look back 3 chars" are easy and work in OCaml as people expect.

We could make the conversion to UTF-32 lazier by requiring plugins to set up a "view" of the document: access to Contents.t can then only happen via a conversion function that takes the range (or point) and returns the set of converted lines.
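As an illustration of that lazy-view idea (all names here are hypothetical, not coq-lsp code; assumes OCaml >= 4.14): the document stays UTF-8, and only the lines covered by the requested span are decoded into code points.

```ocaml
(* Hypothetical sketch of the lazy "view" idea: Contents stays UTF-8,
   and only the lines a plugin actually asks for are decoded into a
   UTF-32-like array of code points. Not actual coq-lsp code. *)
type contents = { lines : string array } (* UTF-8, one entry per line *)

(* Decode one UTF-8 line into an array of code points. *)
let decode_line (s : string) : Uchar.t array =
  let acc = ref [] and i = ref 0 in
  while !i < String.length s do
    let d = String.get_utf_8_uchar s !i in
    acc := Uchar.utf_decode_uchar d :: !acc;
    i := !i + Uchar.utf_decode_length d
  done;
  Array.of_list (List.rev !acc)

(* The "view": decode only lines [first_line..last_line], inclusive. *)
let view ~contents ~first_line ~last_line =
  Array.init (last_line - first_line + 1) (fun i ->
      decode_line contents.lines.(first_line + i))
```

The conversion cost is then proportional to the span a plugin touches, not to the document size.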

But indeed, it is hard to predict what will work best, and what the overheads are; I'm going to sleep on it a bit more.

For now I'll cherry-pick the fix to hover. Thanks for your thoughts and for looking into this.

Collaborator Author

> So we have several options on how to move forward:
>
>   1. we fix Lang.Point.t to be protocol-level locations (UTF-16 code units for now), and keep the rest as is
>   2. we update Contents.t to have a UTF-32 representation of the document, introduce LSP.Point.t, and keep Lang.Point.t as Unicode code points
>   3. we fix Lang.Point.t to be UTF-8, and do the conversion at the protocol level (though this has the drawback of exposing some internals)

I'm a bit confused: to whom do we expose the internals?

> A nice property of 1 is that once more clients start to support encoding negotiation, we could actually skip the conversion: we could have Lang.Point.t be UTF-8 code units, so things would match and we could avoid the conversion overhead, which is often non-trivial.

This is also a property of 3, isn't it?

> However, the main question I have is: what is the best API for people to manipulate Contents.t? It seems to me that the best in this case would be for things to be encoded in UTF-32; that way, things like "look back 3 chars" are easy and work in OCaml as people expect.

I am skeptical: it's 2023, people should expect to deal with Unicode. And while UTF-32 makes it easy to deal with code points, in the context of an editor the right notion is the grapheme cluster. I think an API that allows the plugin developer to consciously make these choices would be better in the long run.
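To illustrate the distinction being drawn here (a standalone example, not coq-lsp code; assumes OCaml >= 4.14): a single user-perceived character can span several code points, so stepping back "3 chars" in code points can land in the middle of a grapheme cluster.

```ocaml
(* "é" written as 'e' + U+0301 (combining acute accent): one grapheme
   cluster on screen, but two code points and three UTF-8 bytes. *)
let count_code_points s =
  let n = ref 0 and i = ref 0 in
  while !i < String.length s do
    i := !i + Uchar.utf_decode_length (String.get_utf_8_uchar s !i);
    incr n
  done;
  !n

let () =
  let e_acute = "e\xcc\x81" in
  assert (String.length e_acute = 3);    (* UTF-8 bytes *)
  assert (count_code_points e_acute = 2) (* code points; 1 grapheme *)
```

Grapheme segmentation itself needs Unicode tables (e.g. an external library such as uuseg), which is part of why exposing the choice to the plugin author rather than baking one unit into the API is attractive.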

Owner

> I'm a bit confused: to whom do we expose the internals?

Sorry, I was sloppy; I was thinking of the need, in that case, for the protocol layer to access the line in order to compute the character offset. I regard this part as internal, but indeed it doesn't seem like a big deal.

> This is also a property of 3, isn't it?

Yes!

> I am skeptical: it's 2023, people should expect to deal with Unicode.

I fully agree; on the other hand, it'd be great if we could have some typing to prevent mishandling of this, but no need to over-engineer now.

> And while UTF-32 makes it easy to deal with code points, in the context of an editor the right notion is the grapheme cluster.

Yes; however, I think that in the context of the server, grapheme clusters are not so relevant, right?

> I think an API that allows the plugin developer to consciously make these choices would be better in the long run.

I do agree. @ineol, so if I understand correctly, you think that option 3, while keeping all the internal locations and text buffers in UTF-8 code units, is the path to take, right?

Owner

Hi @ineol , sorry for the delay w.r.t. this, I will take care of it soon.

After a bit of thinking, I believe the solution that makes most sense is to have Lang.Range.t contain positions that are native to the protocol, with the conversion happening at the point of range / position generation.

I think this solution makes sense due to:

  • it is much better to do the conversion once (at a single point) than to have to do it all over the place; after some experiments, this did matter quite a bit
  • this is not in conflict with encoding negotiation at the protocol level, as encoding negotiation happens at server init time. VSCode still only offers UTF-16, so I haven't added support for utf8, but protocol2coq and coq2protocol could read the setting and choose the right encoding (and hopefully become the identity at some point with VSCode)
  • we can document this, and provide programmers with protocol2byte etc. functions, so they can manipulate text as they desire.

WDYT?

Collaborator Author

Hi @ejgallego, thank you for following up! Your proposal seems reasonable to me. For the other direction (the client sending a range to the server), where should the conversion happen?

Owner

Types that come from the client don't need conversion as they are already in protocol format, so request handlers can use Range.t normally with the incoming points.

Owner

Where care is needed in this approach is when manipulation of Contents.t is required, as Contents.t is in UTF-8 but Range.t is in the protocol encoding; for that, the idea is to provide all the needed functions in the Contents.t API, so that the conversion, for example for Contents.slice, is handled there.

In general, as we discussed, it is recommended that request handlers use a byte-based index, so they need to call protocol2byte if they need it.
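Here is a sketch of what such a protocol2byte could look like for a UTF-16 protocol encoding; the name is borrowed from the discussion above, but this body is an illustrative assumption, not the actual coq-lsp implementation (assumes OCaml >= 4.14). It maps a UTF-16 code-unit column back to a UTF-8 byte offset, which can then index the UTF-8 Contents.t.

```ocaml
(* Hedged sketch: map a UTF-16 code-unit column back to a UTF-8 byte
   offset within [line], so a request handler can index the UTF-8
   document contents directly. Not the actual coq-lsp implementation.
   Requires OCaml >= 4.14. *)
let protocol2byte ~line ~utf16_col =
  let rec go byte units =
    if units >= utf16_col || byte >= String.length line then byte
    else
      let d = String.get_utf_8_uchar line byte in
      let u = Uchar.utf_decode_uchar d in
      go
        (byte + Uchar.utf_decode_length d)
        (units + (Uchar.utf_16_byte_length u / 2))
  in
  go 0 0

let () =
  (* in "héllo", UTF-16 column 2 (just past 'é') is byte offset 3 *)
  assert (protocol2byte ~line:"h\xc3\xa9llo" ~utf16_col:2 = 3)
```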

ejgallego added a commit to ineol/coq-lsp that referenced this pull request May 22, 2024
Following discussion on ejgallego#624, we now store the Flèche locations in
protocol-level format instead of in Unicode character points.

This avoids conversions on all protocol calls; conversion back to a UTF-8
based encoding is sometimes needed to manipulate `Contents.t`.
@ejgallego ejgallego changed the title [Hover] Follow LSP and treat positions as UTF16 code units [flèche] Store ranges in protocol-native format. May 22, 2024
@ejgallego ejgallego changed the title [flèche] Store ranges in protocol-native format. [flèche] Store ranges in protocol-native format May 22, 2024
ejgallego added a commit to ineol/coq-lsp that referenced this pull request May 22, 2024
Following discussion on ejgallego#624, we now store the Flèche locations in
protocol-level format instead of in Unicode character points.

This avoids conversions on all protocol calls; conversion back to a UTF-8
based encoding is sometimes needed to manipulate `Contents.t`.
This is in anticipation of making Lang use protocol-level locations.
@ejgallego (Owner)

@ineol, this is almost ready IMO, modulo some proper cleanup. Let me know if you have any thoughts; if not, I'll go ahead and merge after the cleanups.

IMVHO it seems to work quite well.

ejgallego added a commit to ineol/coq-lsp that referenced this pull request May 22, 2024
Following discussion on ejgallego#624, we now store the Flèche locations in
protocol-level format instead of in Unicode character points.

This avoids conversions on all protocol calls; conversion back to a UTF-8
based encoding is sometimes needed to manipulate `Contents.t`.

Fixes ejgallego#620 fixes ejgallego#621
@ejgallego ejgallego changed the title [flèche] Store ranges in protocol-native format [flèche] Store ranges in protocol-native encoding May 22, 2024
@ejgallego ejgallego merged commit 5c735a7 into ejgallego:main May 23, 2024
13 checks passed