Reading non-ASCII characters encoded as UTF-8 doesn’t work well #19

Closed
ftxqxd opened this issue Dec 12, 2014 · 5 comments

Comments

@ftxqxd
Contributor

ftxqxd commented Dec 12, 2014

I’m not sure what the precise issue is, but loading UTF-8-encoded files with non-ASCII characters in them doesn’t work well. As a simple example, writing ÿ to a file with a different editor and then loading that file with Iota shows garbled text.
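The thread does not confirm the exact cause, but one plausible explanation is that each byte of the file is drawn as if it were its own character. The sketch below (in current stable Rust, with hypothetical helper names `decode_bytewise` and `decode_utf8`) shows how the two UTF-8 bytes of ÿ turn into two unrelated characters when handled that way:

```rust
/// Treat every byte as its own character (effectively Latin-1).
/// This is an assumed model of the bug, not Iota's actual code.
fn decode_bytewise(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Decode the whole slice as UTF-8, returning None if it is invalid.
fn decode_utf8(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    // U+00FF (ÿ) is encoded in UTF-8 as the two bytes 0xC3 0xBF.
    let bytes = "ÿ".as_bytes();
    println!("{:?}", bytes);
    // Byte-wise decoding yields two characters instead of one.
    println!("{}", decode_bytewise(bytes));
    // Decoding as UTF-8 recovers the original character.
    println!("{:?}", decode_utf8(bytes));
}
```

Decoding the bytes together as UTF-8, rather than one at a time, is what keeps a multi-byte character intact.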

@gchp
Owner

gchp commented Dec 12, 2014

Yeah, if you open view.rs too you'll see there's a funny character in there which gets displayed all wrong. Some handling for special characters like this needs to be added before they are drawn with rustbox. I haven't spent much time investigating non-ASCII characters in Rust yet, so I'm not sure exactly what I need to do to make this work. If anyone has any pointers I'd love to hear them!

@gchp gchp added the bug label Dec 12, 2014
@ftxqxd
Contributor Author

ftxqxd commented Dec 12, 2014

I experimented a little and made a short patch that fixed this. I guess I should submit a PR? Edit: It’s actually looking a little trickier than I thought. I wonder if making Line store a String instead of a Vec<u8> would be a reasonable solution?

@gchp
Owner

gchp commented Dec 12, 2014

Yep, PRs are more than welcome!

I actually used to have Line store a String, but I moved away from it as I thought Vec<u8> was more sensible and more efficient. I'm open to being convinced otherwise, though. What benefits are you seeing with using String instead of Vec<u8>?

@ftxqxd
Contributor Author

ftxqxd commented Dec 12, 2014

String enforces that its contents are valid UTF-8, and is designed to work with characters rather than bytes. This is both a good and a bad thing: it’s good because you get methods for working with characters instead of bytes, which is what you normally want when showing/editing text, but it’s bad because a String cannot be safely constructed or used when its contents are invalid UTF-8, so input that isn’t valid UTF-8 becomes hard to handle. Rendering invalid characters on a terminal doesn’t really make any sense anyway, but it could be useful to show some kind of replacement instead, and using String might somehow complicate that. I personally think it’s fine to start with only basic UTF-8 support, and then later (if there is demand) move on to support for invalid UTF-8 and even other encodings.
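To make the trade-off concrete, here is a minimal sketch in current stable Rust (the thread predates Rust 1.0, so the API at the time may have differed). The helper names `load_strict` and `load_lossy` are hypothetical; `String::from_utf8` and `String::from_utf8_lossy` are standard library functions. The strict version rejects invalid input, while the lossy one substitutes U+FFFD, which is one way to show "some kind of replacement":

```rust
use std::borrow::Cow;
use std::string::FromUtf8Error;

/// Strict loading: fails if the bytes are not valid UTF-8.
fn load_strict(bytes: Vec<u8>) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Lossy loading: invalid sequences become U+FFFD (the replacement character).
fn load_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    // 0xFF can never appear in valid UTF-8.
    let bad: Vec<u8> = vec![b'a', 0xFF, b'b'];

    match load_strict(bad.clone()) {
        Ok(s) => println!("valid: {}", s),
        Err(e) => println!("rejected: {}", e),
    }

    // The lossy version always produces a displayable string.
    println!("lossy: {}", load_lossy(&bad));
}
```

Note that the lossy conversion discards the original bytes, so an editor that saved such a buffer would not write back the same file; that is one reason full support for invalid UTF-8 is a separate, harder problem.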

String shouldn’t be any less efficient as far as I know: it’s internally represented as a Vec<u8>. I guess some things like character indexing and character length are O(n) instead of O(1), but I personally think that correctness is more important than a slight speed boost in this case, and the performance shouldn’t be affected too much since it’s on a per-line basis (and lines aren’t usually that long).
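The cost difference mentioned above can be sketched as follows, again in current stable Rust with hypothetical helper names. Byte length is stored and read in O(1), whereas counting characters or finding the byte offset of the n-th character requires walking the string:

```rust
/// Number of characters (Unicode scalar values) in the line: O(n).
fn char_len(line: &str) -> usize {
    line.chars().count()
}

/// Byte offset at which the n-th character starts, if it exists: O(n).
fn byte_offset_of_char(line: &str, n: usize) -> Option<usize> {
    line.char_indices().nth(n).map(|(offset, _)| offset)
}

fn main() {
    let line = "aÿb";
    // `len` is the byte length and is O(1); ÿ takes two bytes.
    println!("bytes: {}", line.len());
    println!("chars: {}", char_len(line));
    // The third character starts after 1 + 2 bytes.
    println!("offset of char 2: {:?}", byte_offset_of_char(line, 2));
}
```

Since each call only scans a single line, the linear cost stays small for typical line lengths, which matches the argument made in the comment.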

@gchp
Owner

gchp commented Dec 12, 2014

Ok, these are some good points. This is definitely worth looking into. Converting to use String shouldn't be a huge task. I'll give it a go and see if it solves our problem.

ftxqxd added a commit to ftxqxd/iota that referenced this issue Dec 13, 2014
This changes the editor to represent lines as `String`s. It’s possible that this
will have to change in the future if we want to support reading non-UTF-8 files,
but I think this is a good fix for now at least.

It also changes the representation of keys to be either a character or a
‘special’ key that doesn’t represent a character. This means that all characters
should now be typable.

Fixes gchp#19.
Fixes gchp#17.
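
The commit itself is not shown in this thread, so the following is only an illustrative sketch of the idea described in the message: a key is either a character or a "special" key. All type, variant, and function names here are hypothetical and are not taken from Iota's source.

```rust
/// Keys that do not represent a character (hypothetical set).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SpecialKey {
    Enter,
    Backspace,
    Left,
    Right,
}

/// A key press is either a typed character or a special key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Key {
    Char(char),
    Special(SpecialKey),
}

/// Apply a key press to a line stored as a `String`.
fn apply_key(line: &mut String, key: Key) {
    match key {
        // Any `char`, ASCII or not, is appended as valid UTF-8.
        Key::Char(c) => line.push(c),
        // `String::pop` removes a whole character, not a single byte.
        Key::Special(SpecialKey::Backspace) => {
            line.pop();
        }
        // Other special keys do not change the text in this sketch.
        Key::Special(_) => {}
    }
}

fn main() {
    let mut line = String::new();
    for key in [
        Key::Char('a'),
        Key::Char('ÿ'),
        Key::Special(SpecialKey::Left),
        Key::Special(SpecialKey::Right),
        Key::Special(SpecialKey::Enter),
        Key::Special(SpecialKey::Backspace),
    ] {
        apply_key(&mut line, key);
    }
    println!("{}", line);
}
```

Because `Key::Char` carries a full `char`, nothing in this design restricts input to ASCII, which is consistent with the claim that all characters should become typable.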
@gchp gchp closed this as completed in 0e30c73 Dec 14, 2014