Reading non-ASCII characters encoded as UTF-8 doesn’t work well #19

ftxqxd · 2014-12-12T02:55:33Z

I’m not sure what the precise issue is, but loading UTF-8-encoded files with non-ASCII characters in them doesn’t work well. As a simple example, writing ÿ to a file with a different editor and then loading that file with Iota shows garbled text.


gchp · 2014-12-12T09:10:20Z

Yeah, if you open view.rs too, you'll see there's a funny character in there (→) that gets displayed all wrong. Something needs to be added that checks special characters like this before drawing them with rustbox. I haven't spent much time investigating non-ASCII characters in Rust yet, so I'm not sure exactly what I need to do to make this work. If anyone has any pointers, I'd love to hear them!

ftxqxd · 2014-12-12T09:11:58Z

I experimented a little and made a short patch that fixed this. I guess I should submit a PR? Edit: It’s actually looking a little trickier than I thought. I wonder if making Line store a String instead of a Vec<u8> would be a reasonable solution?

gchp · 2014-12-12T09:30:38Z

Yep, PRs are more than welcome!

I actually used to have Line store a String, but I moved away from it because I thought Vec<u8> was more sensible and more efficient. I'm open to being convinced otherwise, though. What benefits are you seeing in using String instead of Vec<u8>?

ftxqxd · 2014-12-12T09:41:33Z

String enforces that its contents are valid UTF-8, and is designed to work with characters rather than bytes. This is both a good and a bad thing. It’s good because you get methods for working with characters instead of bytes, which is what you normally want when showing or editing text. It’s bad because a String cannot be safely constructed from bytes that aren’t valid UTF-8, so a file containing invalid UTF-8 becomes hard to handle. Rendering invalid characters on a terminal doesn’t really make sense anyway, but it could be useful to show some kind of replacement instead, and using String might complicate that. I personally think it’s fine to start with only basic UTF-8 support, and then later (if there’s demand) move on to supporting invalid UTF-8 and even other encodings.
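To make the trade-off concrete, here is a minimal sketch of the two ways a line's raw bytes can be turned into text. It uses the current Rust standard library rather than the API as it existed when this thread was written, and the helper names `decode_strict` and `decode_lossy` are hypothetical, not taken from Iota.

```rust
use std::borrow::Cow;

/// Strict decoding: returns `None` unless `bytes` is valid UTF-8.
/// (Hypothetical helper, for illustration only.)
fn decode_strict(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

/// Lossy decoding: every invalid sequence is replaced with U+FFFD,
/// the Unicode replacement character, so the result is always valid text.
/// (Hypothetical helper, for illustration only.)
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

The strict form matches the "basic UTF-8 support" idea above, while the lossy form is one possible way to show a replacement for invalid input. Note that a lossy round trip does not preserve the original bytes, so saving such a file would need extra care.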

String shouldn’t be any less efficient as far as I know: it’s internally represented as a Vec<u8>. I guess some things like character indexing and character length are O(n) instead of O(1), but I personally think that correctness is more important than a slight speed boost in this case, and the performance shouldn’t be affected too much since it’s on a per-line basis (and lines aren’t usually that long).
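The O(n) cost mentioned here comes from UTF-8 being a variable-width encoding: finding the nth character means walking the string from the start. A rough sketch in current Rust, with hypothetical helper names (`char_len`, `byte_offset`, `insert_at_char`) that are not from Iota:

```rust
/// Number of characters (Unicode scalar values) in `line`.
/// O(n), unlike `line.len()`, which is the O(1) byte length.
fn char_len(line: &str) -> usize {
    line.chars().count()
}

/// Byte offset at which the `idx`-th character starts,
/// or `None` if the line has `idx` or fewer characters. O(n).
fn byte_offset(line: &str, idx: usize) -> Option<usize> {
    line.char_indices().nth(idx).map(|(offset, _)| offset)
}

/// Insert `ch` before the `idx`-th character, appending when `idx`
/// is at or past the end of the line.
fn insert_at_char(line: &mut String, idx: usize, ch: char) {
    let offset = byte_offset(line, idx).unwrap_or(line.len());
    line.insert(offset, ch);
}
```

For example, the line `aÿ→b` is 4 characters but 7 bytes long, so a cursor column has to be converted to a byte offset before editing. This sketch counts scalar values only and does not account for combining marks or wide glyphs, which affect how many terminal cells a line occupies.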

gchp · 2014-12-12T13:10:20Z

OK, these are good points. This is definitely worth looking into. Converting to String shouldn't be a huge task. I'll give it a go and see if it solves our problem.

This changes the editor to represent lines as `String`s. It’s possible that this will have to change in the future if we want to support reading non-UTF-8 files, but I think this is a good fix for now at least. It also changes the representation of keys to be either a character or a ‘special’ key that doesn’t represent a character. This means that all characters should now be typable. Fixes gchp#19. Fixes gchp#17.
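As an illustration of the key representation this message describes, a key can be modelled as an enum with one variant carrying any `char` and separate variants for keys that produce no character. This is only a sketch under assumptions: the type name `Key`, its variants, and `apply_key` are hypothetical and may not match the actual commit.

```rust
/// A key press: either a typed character or a 'special' key.
/// (Hypothetical type; the variants are illustrative.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Key {
    Char(char),
    Enter,
    Backspace,
}

/// Apply a key to a single line of text.
fn apply_key(line: &mut String, key: Key) {
    match key {
        // Any Unicode scalar value can be typed, not just a fixed set.
        Key::Char(c) => line.push(c),
        // `pop` removes one whole character, never a partial byte sequence.
        Key::Backspace => {
            line.pop();
        }
        // Line splitting is out of scope for this sketch.
        Key::Enter => {}
    }
}
```

Because `Key::Char` wraps a `char`, no table of supported characters is needed, which is presumably what makes every character typable.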

gchp added the bug label Dec 12, 2014

gchp added the enhancement label Dec 12, 2014

Arcterus mentioned this issue Dec 12, 2014

Stop hardcoding supported characters #20

Closed

ftxqxd mentioned this issue Dec 13, 2014

Generally improve Unicode support #29

Closed

gchp closed this as completed in 0e30c73 Dec 14, 2014
