Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support converting to and from UTF-16 code unit indices. #26

Closed
cessen opened this issue Oct 5, 2019 · 18 comments
Closed

Support converting to and from UTF-16 code unit indices. #26

cessen opened this issue Oct 5, 2019 · 18 comments
Assignees

Comments

@cessen
Copy link
Owner

cessen commented Oct 5, 2019

Some software and APIs, especially on the Windows platform, still operate in terms of UTF-16 code units when working with text. One example is the Language Server Protocol, which specifies text offsets in UTF-16 code units.

Being able to efficiently interoperate with these APIs is useful, especially for code editors. To better support this use-case, and to generally round-out Ropey's Unicode support, we should provide a way to convert between Unicode scalar value indices (which Ropey uses) and UTF-16 code unit indices.

I think tracking the data for this can be done with relatively little overhead. I'm not as concerned that the actual conversions be super optimized (though they should still be efficient). But I don't want the feature to negatively impact clients that aren't using it.

Update:
This feature is now implemented and in master, and will be in the next release.

@cessen cessen self-assigned this Oct 5, 2019
@cessen
Copy link
Owner Author

cessen commented Oct 6, 2019

Started work on this in the utf16 branch.

@mcobzarenco
Copy link

@cessen First, awesome crate! -- I'm using it in a text editor and I would like to implement LSP support. I was curious how far do you think the utf16 branch is from a complete implementation. I am happy to help with this in any way -- I will certainly be testing quite a bit :-)

@cessen
Copy link
Owner Author

cessen commented Mar 15, 2020

@mcobzarenco Thanks for the kind words!

I don't think it's too far off. IIRC, I already have it tracking the indices in the tree, so the remaining work is implementing the conversion functions. So, I guess it's about 50% done. But it's not a big feature--the index tracking only took a day or two, IIRC.

I got really busy part way through implementation, which is why it stalled. And then I just never got back to it. But I'm on break now, and since there's someone with an actual use-case (you!) that definitely provides motivation. So I'll get back on this soon. Thanks for the push!

I would love help with testing. I'll post here again once I've got an initial version working. If you'd be willing to test at that point I would super appreciate it.

@mcobzarenco
Copy link

@cessen Thanks for the detailed response and cannot wait to try it out ☺️

@cessen
Copy link
Owner Author

cessen commented Mar 18, 2020

I've implemented part of the functionality now. On Rope there are now the following methods:

  • len_utf16_code_units()
  • char_to_utf16_code_unit()

Next up is utf16_code_unit_to_char(), which is currently just stubbed out. And after that, adding the same methods to RopeSlice.

@cessen
Copy link
Owner Author

cessen commented Mar 18, 2020

@mcobzarenco And now Rope::utf16_code_unit_to_char() is implemented. Hopefully this should be enough to start testing with.

Next up is to implement them on RopeSlice as well.

And I think that will be it...? I don't expect these to be used in performance critical areas, so I'd like to keep the API surface area minimal, and just have people do multi-stage conversions if they need to e.g. go from byte or line to utf16 code unit or whatnot. Does that seem reasonable to you?

@cessen
Copy link
Owner Author

cessen commented Mar 18, 2020

@mcobzarenco All functionality is now implemented on both Rope and RopeSlice. If you could please test it when you get the chance, I'd really appreciate it! Also, if you have any feedback on the docs or API, please don't hesitate to let me know.

Thank you!

@mcobzarenco
Copy link

That's great to hear, thanks @cessen!

I don't expect these to be used in performance critical areas, so I'd like to keep the API surface area minimal

Agreed -- I hope keeping track of utf16 code points doesn't add too much of an overhead by itself.

[...] and just have people do multi-stage conversions if they need to e.g. go from byte or line to utf16 code unit or whatnot. Does that seem reasonable to you?

It seems very reasonable to me -- as you mentioned the main use case of this feature is to interact with external APIs such at LSP's. That is a slow asynchronous operation, the conversion to UTF-16 should only be used when converting a response, not inside a hot loop etc.

@mcobzarenco All functionality is now implemented on both Rope and RopeSlice. If you could please test it when you get the chance, I'd really appreciate it! Also, if you have any feedback on the docs or API, please don't hesitate to let me know.

💯 I will do, although it may have to wait for the weekend or so due to work responsibilities. I will post my preliminary experience / results here.

@cessen
Copy link
Owner Author

cessen commented Mar 19, 2020

Yeah, no rush at all! Please don't feel pressured. I really appreciate any time you contribute to this.

@cessen
Copy link
Owner Author

cessen commented Mar 21, 2020

Just a quick note: I've merged this into the master branch now, and deleted the utf16 branch. So you can just test directly from master now.

@cessen
Copy link
Owner Author

cessen commented Apr 3, 2020

@mcobzarenco Again, no rush, but I'm curious if you've had a chance to poke at this yet?

@mcobzarenco
Copy link

@cessen My apologies for the really belated reply, it's been hectic --- I haven't had a chance to come back around to this, sorry, but I have some time over the next week

@cessen
Copy link
Owner Author

cessen commented Apr 15, 2020

@mcobzarenco No worries! Looking forward to your feedback (and maybe bug reports!).

@cessen
Copy link
Owner Author

cessen commented May 1, 2020

@mcobzarenco Again, no rush, but just curious if you've had a chance to poke at this yet?

(Edit: also, just realized this was freakishly similar to my earlier comment, ha ha. Guess my phrasing is pretty consistent.)

@cessen
Copy link
Owner Author

cessen commented May 20, 2020

@mcobzarenco I hope you're doing okay amidst the pandemic.

Secondarily: any progress on testing this out? If you don't expect to get around to it any time soon, that's totally fine, but please let me know. I'd like to get this released in not too long, so I just want to know if I should move forward or wait a little longer for your testing.

Thanks!

@mcobzarenco
Copy link

@cessen Thanks for asking -- I'm muddling through the pandemic, like most I guess.. I hope you're doing alright.

Also thank you for chasing me and I do apologise for being unresponsive. I've wanted to come back with an extensive description of my experience using the new utf-16 tracking as part of building an LSP integration (rust analyser) for a text editor.

As it turned out, that was a more distant goal than I thought when I first looked at whether ropey supports utf-16 and unfortunately I struggled to allocate as much time as I wished. On the flip side, I've recently built a prototype that to works all 💯. I've had to make progress on other unrelated features (and bugs) to get to the LSP integration -- my only year's resolution is to switch from Emacs to it -- still very much committed :-D

I'll certainly keep you in the loop of how it goes -- even if it's taking a lot longer :-/

@cessen
Copy link
Owner Author

cessen commented May 21, 2020

@mcobzarenco Glad you're doing okay! Luckily, I'm doing well so far. :-)

Yeah, that makes sense that something like LSP integration would take quite a bit of time. I don't think I want to wait that long to make the next release, however.

So if you have the time, could you just look over the docs for master and double-check that the utf16 APIs seem right for your use-case? I'm also open to bike shedding on the method names—I think they're fine as-is, but it would be great to make them shorter if possible without losing too much clarity.

In any case, when you do get around to the LSP stuff, please do file bug reports if you run into any issues. I already have reasonable test coverage, I think, but... there's always room for bugs in code this complex! And real-world usage has a way of exposing such things. :-)

@cessen
Copy link
Owner Author

cessen commented Jun 14, 2020

Just released Ropey 1.2.0, which includes this feature.

If anyone encounters any bugs with this, please don't hesitate to open a new issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants