Support converting to and from UTF-16 code unit indices. #26

cessen · 2019-10-05T01:44:02Z

Some software and APIs, especially on the Windows platform, still operate in terms of UTF-16 code units when working with text. One example is the Language Server Protocol, which specifies text offsets in UTF-16 code units.

Being able to interoperate efficiently with these APIs is useful, especially for code editors. To better support this use case, and to round out Ropey's Unicode support generally, we should provide a way to convert between Unicode scalar value indices (which Ropey uses) and UTF-16 code unit indices.
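To make the two index spaces concrete, here is a minimal std-only sketch of the conversion over a plain `&str`. It is not Ropey's implementation (a rope would use tracked counts rather than a linear scan), and the function names `char_to_utf16_idx` and `utf16_to_char_idx` are made up for illustration.

```rust
/// Convert a char (Unicode scalar value) index into a UTF-16 code unit index.
/// A char index past the end of the text clamps to the total length.
fn char_to_utf16_idx(text: &str, char_idx: usize) -> usize {
    // Chars outside the Basic Multilingual Plane take two code units,
    // all others take one.
    text.chars().take(char_idx).map(|c| c.len_utf16()).sum()
}

/// Convert a UTF-16 code unit index into a char index.
/// In this sketch, an index that lands in the middle of a surrogate pair
/// maps to the char that owns the pair, and an index past the end clamps
/// to the total char count.
fn utf16_to_char_idx(text: &str, utf16_idx: usize) -> usize {
    let mut units = 0;
    for (i, c) in text.chars().enumerate() {
        units += c.len_utf16();
        if units > utf16_idx {
            return i;
        }
    }
    text.chars().count()
}
```

For the text `"a😀b"` the emoji occupies one char index but two UTF-16 code units, so every index after it differs by one between the two schemes.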

I think tracking the data for this can be done with relatively little overhead. I'm not as concerned that the actual conversions be super optimized (though they should still be efficient). But I don't want the feature to negatively impact clients that aren't using it.

Update:
This feature is now implemented and in master, and will be in the next release.

cessen · 2019-10-06T00:19:56Z

Started work on this in the utf16 branch.

mcobzarenco · 2020-03-15T16:45:03Z

@cessen First, awesome crate! -- I'm using it in a text editor and I would like to implement LSP support. I was curious how far you think the utf16 branch is from a complete implementation. I am happy to help with this in any way -- I will certainly be testing quite a bit :-)

cessen · 2020-03-15T21:14:47Z

@mcobzarenco Thanks for the kind words!

I don't think it's too far off. IIRC, I already have it tracking the indices in the tree, so the remaining work is implementing the conversion functions. So, I guess it's about 50% done. But it's not a big feature--the index tracking only took a day or two, IIRC.
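As a rough illustration of what "tracking the indices in the tree" can look like, the sketch below keeps a small summary per chunk of text and combines summaries when moving up the tree. The `TextSummary` type and its fields are hypothetical and are not taken from Ropey's source; this only shows the general idea under that assumption.

```rust
/// Hypothetical per-node summary: how many chars and how many UTF-16
/// code units the text under a node contains.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct TextSummary {
    chars: usize,
    utf16_units: usize,
}

impl TextSummary {
    /// Summarize a leaf chunk by scanning it once.
    fn of_text(text: &str) -> Self {
        let mut summary = TextSummary::default();
        for c in text.chars() {
            summary.chars += 1;
            summary.utf16_units += c.len_utf16();
        }
        summary
    }

    /// Summary of two adjacent pieces of text. An internal node's summary
    /// would be the combination of its children's summaries.
    fn combine(self, other: Self) -> Self {
        TextSummary {
            chars: self.chars + other.chars,
            utf16_units: self.utf16_units + other.utf16_units,
        }
    }
}
```

Because the counts are additive, a conversion can skip whole subtrees by comparing the target index against each child's summary and only scan text inside a single leaf.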

I got really busy part way through implementation, which is why it stalled. And then I just never got back to it. But I'm on break now, and since there's someone with an actual use-case (you!) that definitely provides motivation. So I'll get back on this soon. Thanks for the push!

I would love help with testing. I'll post here again once I've got an initial version working. If you'd be willing to test at that point I would super appreciate it.

mcobzarenco · 2020-03-16T13:11:33Z

@cessen Thanks for the detailed response and cannot wait to try it out ☺️

cessen · 2020-03-18T12:08:03Z

I've implemented part of the functionality now. On Rope there are now the following methods:

len_utf16_code_units()
char_to_utf16_code_unit()

Next up is utf16_code_unit_to_char(), which is currently just stubbed out. And after that, adding the same methods to RopeSlice.

cessen · 2020-03-18T12:42:35Z

@mcobzarenco And now Rope::utf16_code_unit_to_char() is implemented. Hopefully this should be enough to start testing with.

Next up is to implement them on RopeSlice as well.

And I think that will be it...? I don't expect these to be used in performance critical areas, so I'd like to keep the API surface area minimal, and just have people do multi-stage conversions if they need to e.g. go from byte or line to utf16 code unit or whatnot. Does that seem reasonable to you?
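A multi-stage conversion of the kind described here could look roughly like the following std-only sketch, which goes from a byte offset to a char index and then to a UTF-16 code unit index. It works on a plain `&str` rather than a rope, and the helper names are made up for illustration.

```rust
/// Stage 1: byte offset -> char index. A byte offset that falls inside a
/// multi-byte char maps to the index of that char.
fn byte_to_char(text: &str, byte_idx: usize) -> usize {
    text.char_indices()
        // Count the chars that end at or before the byte offset.
        .take_while(|&(start, c)| start + c.len_utf8() <= byte_idx)
        .count()
}

/// Stage 2: char index -> UTF-16 code unit index.
fn char_to_utf16(text: &str, char_idx: usize) -> usize {
    text.chars().take(char_idx).map(|c| c.len_utf16()).sum()
}

/// Both stages chained: byte offset -> UTF-16 code unit index.
fn byte_to_utf16(text: &str, byte_idx: usize) -> usize {
    char_to_utf16(text, byte_to_char(text, byte_idx))
}
```

Chaining keeps the API to a single pair of conversions per index type, at the cost of two lookups instead of one.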

cessen · 2020-03-18T20:35:22Z

@mcobzarenco All functionality is now implemented on both Rope and RopeSlice. If you could please test it when you get the chance, I'd really appreciate it! Also, if you have any feedback on the docs or API, please don't hesitate to let me know.

Thank you!

mcobzarenco · 2020-03-18T21:42:20Z

That's great to hear, thanks @cessen!

> I don't expect these to be used in performance critical areas, so I'd like to keep the API surface area minimal

Agreed -- I hope keeping track of UTF-16 code units doesn't add too much overhead by itself.

> [...] and just have people do multi-stage conversions if they need to e.g. go from byte or line to utf16 code unit or whatnot. Does that seem reasonable to you?

It seems very reasonable to me -- as you mentioned, the main use case of this feature is to interact with external APIs such as LSP. That is a slow, asynchronous operation, so the conversion to UTF-16 should only be used when converting a response, not inside a hot loop etc.
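For reference, an LSP-style position is a zero-based line plus an offset within that line, which by default is counted in UTF-16 code units. The sketch below converts such a position into a char index over a plain `&str`. The `Utf16Position` struct is a hypothetical stand-in for the protocol's `Position`, and the sketch assumes lines are separated by `\n` only.

```rust
/// Hypothetical stand-in for an LSP `Position`: zero-based line, and a
/// zero-based offset within that line counted in UTF-16 code units.
struct Utf16Position {
    line: usize,
    character: usize,
}

/// Convert a position into a char index into `text`, or `None` if the
/// line does not exist. Assumes '\n' is the only line separator.
fn position_to_char_idx(text: &str, pos: &Utf16Position) -> Option<usize> {
    let mut char_idx = 0;
    let mut line_no = 0;
    for line in text.split_inclusive('\n') {
        if line_no == pos.line {
            let mut units = 0;
            for c in line.chars() {
                // Stop at the line break (offsets past the end of the line
                // clamp to the line's length) or once the next char would
                // go past the requested offset (mid-pair rounds down).
                if c == '\n' || units + c.len_utf16() > pos.character {
                    break;
                }
                units += c.len_utf16();
                char_idx += 1;
            }
            return Some(char_idx);
        }
        char_idx += line.chars().count();
        line_no += 1;
    }
    // The empty final line: empty text, or text ending in a line break.
    if line_no == pos.line && (text.is_empty() || text.ends_with('\n')) {
        Some(char_idx)
    } else {
        None
    }
}
```

With a rope, the line lookup and the offset conversion would each be an index query rather than a scan, but the two-step shape stays the same.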

> @mcobzarenco All functionality is now implemented on both Rope and RopeSlice. If you could please test it when you get the chance, I'd really appreciate it! Also, if you have any feedback on the docs or API, please don't hesitate to let me know.

💯 I will do, although it may have to wait for the weekend or so due to work responsibilities. I will post my preliminary experience / results here.

cessen · 2020-03-19T01:43:38Z

Yeah, no rush at all! Please don't feel pressured. I really appreciate any time you contribute to this.

cessen · 2020-03-21T14:00:30Z

Just a quick note: I've merged this into the master branch now, and deleted the utf16 branch. So you can just test directly from master now.

cessen · 2020-04-03T22:38:06Z

@mcobzarenco Again, no rush, but I'm curious if you've had a chance to poke at this yet?

mcobzarenco · 2020-04-13T13:35:17Z

@cessen My apologies for the really belated reply, it's been hectic -- I haven't had a chance to come back around to this, sorry, but I have some time over the next week.

cessen · 2020-04-15T00:42:19Z

@mcobzarenco No worries! Looking forward to your feedback (and maybe bug reports!).

cessen · 2020-05-01T05:49:59Z

@mcobzarenco Again, no rush, but just curious if you've had a chance to poke at this yet?

(Edit: also, just realized this was freakishly similar to my earlier comment, ha ha. Guess my phrasing is pretty consistent.)

cessen · 2020-05-20T00:57:47Z

@mcobzarenco I hope you're doing okay amidst the pandemic.

Secondarily: any progress on testing this out? If you don't expect to get around to it any time soon, that's totally fine, but please let me know. I'd like to get this released in not too long, so I just want to know if I should move forward or wait a little longer for your testing.

Thanks!

mcobzarenco · 2020-05-20T15:03:02Z

@cessen Thanks for asking -- I'm muddling through the pandemic, like most I guess.. I hope you're doing alright.

Also thank you for chasing me and I do apologise for being unresponsive. I've wanted to come back with an extensive description of my experience using the new utf-16 tracking as part of building an LSP integration (rust-analyzer) for a text editor.

As it turned out, that was a more distant goal than I thought when I first looked at whether ropey supports utf-16, and unfortunately I struggled to allocate as much time as I wished. On the flip side, I've recently built a prototype that works 💯. I've had to make progress on other unrelated features (and bugs) to get to the LSP integration -- my only new year's resolution is to switch from Emacs to it -- still very much committed :-D

I'll certainly keep you in the loop of how it goes -- even if it's taking a lot longer :-/

cessen · 2020-05-21T06:22:35Z

@mcobzarenco Glad you're doing okay! Luckily, I'm doing well so far. :-)

Yeah, that makes sense that something like LSP integration would take quite a bit of time. I don't think I want to wait that long to make the next release, however.

So if you have the time, could you just look over the docs for master and double-check that the utf16 APIs seem right for your use-case? I'm also open to bike shedding on the method names—I think they're fine as-is, but it would be great to make them shorter if possible without losing too much clarity.

In any case, when you do get around to the LSP stuff, please do file bug reports if you run into any issues. I already have reasonable test coverage, I think, but... there's always room for bugs in code this complex! And real-world usage has a way of exposing such things. :-)

cessen · 2020-06-14T04:59:39Z

Just released Ropey 1.2.0, which includes this feature.

If anyone encounters any bugs with this, please don't hesitate to open a new issue!

cessen added the enhancement label Oct 5, 2019

cessen self-assigned this Oct 5, 2019

cessen closed this as completed Jun 14, 2020

schrieveslaach mentioned this issue Oct 8, 2024

Fix LSP non-ascii characters offset issues. nushell/nushell#14002

Merged
