Skip to content

StringIO improvements #2096

Merged
merged 1 commit into from Mar 27, 2014

4 participants

@devinus
devinus commented Mar 6, 2014

I've begun StringIO improvements, and I've had several rewrites (e.g. https://gist.github.com/devinus/c8fc12b63f1c73aedbae) that have all led me to question my own interpretation of the IO protocol and appreciate some design decisions that @mururu took.

However, there are several performance improvements we can still make. For example, @josevalim has confirmed that we only need to care about :latin1 (terribly named, but this is what's used when using binread/binwrite because there's no concept of multibyte codepoints) and :unicode (which should only be UTF-8), but I wanted to go on a line by line basis with @mururu to confirm behaviour and maybe @josevalim can chime in when necessary.

I've already fixed several potential bugs that I'd like a second pair of eyes to double check.

@devinus
devinus commented Mar 6, 2014

First thought: Since we only support :latin1 and UTF-8 when using :unicode, this can be changed to a simple loop that uses << char :: utf8, rest :: binary >> and byte_size to calculate the size.

@devinus
devinus commented Mar 6, 2014

Second thought: I need to confirm exactly what other IO servers do, but with preliminary testing the file IO server does NOT return something like :invalid_unicode (devinus@9fc7a1c#diff-b27997fe2e28aaac3723720c70523ea1L228) when there's invalid unicode, so our behaviour is different from the norm. Is this ok?

@devinus
devinus commented Mar 6, 2014

Third thought: The entire get_line implementation can be binary based and not convert to a list. All we need ot check for is <<"\r\n">> or <<"\n">>. See: https://github.com/erlang/otp/blob/21095e6830f37676dd29c33a590851ba2c76499b/lib/stdlib/src/io_lib.erl#L744

@devinus
devinus commented Mar 6, 2014

Forth thought: These two :unicode.characters_to_binary can fail, leaving the server in a bad state (input and output will be an error tuple): devinus@9fc7a1c#diff-b27997fe2e28aaac3723720c70523ea1L259

@devinus
devinus commented Mar 6, 2014

Sixth thought: Do we really need to do these types of converstions: https://github.com/devinus/elixir/blob/9fc7a1c43f0a99dfd62e2ade69d58306f503e449/lib/elixir/lib/string_io.ex#L347

It seems the server will only ever get binaries passed to it from my testing.

@mururu
mururu commented Mar 6, 2014

@devinus, thanks!

we only need to care about :latin1 (terribly named, but this is what's used when using binread/binwrite because there's no concept of multibyte codepoints) and :unicode (which should only be UTF-8)

Ah, it seems to fix some ugly points of the current implementation.

Sorry, I'm just going to bed now. I will reply to respective comments tommorow.

@ericmj ericmj commented on the diff Mar 6, 2014
lib/elixir/lib/string_io.ex
{ input, "" }
end
defp do_get_chars(input, :latin1, n) do
- if byte_size(input) < n do
- { input, "" }
- else
- :erlang.split_binary(input, n)
- end
+ << chars :: [ binary, size(n) ], rest :: binary >> = input
+ { chars, rest }
@ericmj
Elixir member
ericmj added a note Mar 6, 2014

Shouldn't :binary.split/2 be used here?

From the binary module docs:

This module contains functions for manipulating byte-oriented binaries. Although the majority of functions could be implemented using bit-syntax, the functions in this library are highly optimized and are expected to either execute faster or consume less memory (or both) than a counterpart written in pure Erlang.

@devinus
devinus added a note Mar 6, 2014

@ericmj :binary.split/2 is for splitting based on a pattern, we're splitting at a specific size.

@ericmj
Elixir member
ericmj added a note Mar 6, 2014

@devinus Ah, okay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@josevalim
Elixir member

Gentlemen, is this good to go?

@devinus
devinus commented Mar 13, 2014

@josevalim Waiting on a response from @mururu

@mururu
mururu commented Mar 17, 2014

First thought: Since we only support :latin1 and UTF-8 when using :unicode, this can be changed to a simple loop that uses << char :: utf8, rest :: binary >> and byte_size to calculate the size.

I agree. I use :file_io_server.count_and_find only to handle the other encodings.

@mururu
mururu commented Mar 17, 2014

Second thought: I need to confirm exactly what other IO servers do, but with preliminary testing the file IO server does NOT return something like :invalid_unicode (devinus@9fc7a1c#diff-b27997fe2e28aaac3723720c70523ea1L228) when there's invalid unicode, so our behaviour is different from the norm. Is this ok?

I don't understand problem here fully, but the following example seems to be a counterexample.

iex(1)> {:ok,a}=StringIO.start(<<255,255,255>>)
{:ok, #PID<0.43.0>}
iex(2)> IO.getn(a, "", 1)
{:error, :invalid_unicode}
iex(1)> a=File.open!("a.txt", [:write])
#PID<0.43.0>
iex(2)> IO.binwrite a, <<255,255,255>>
:ok
iex(3)> a=File.open!("a.txt", [:read,:utf8])
#PID<0.46.0>
iex(4)> IO.getn(a, "", 1)
{:error, :invalid_unicode}
@mururu
mururu commented Mar 17, 2014

Third thought: The entire get_line implementation can be binary based and not convert to a list. All we need ot check for is <<"\r\n">> or <<"\n">>. See: https://github.com/erlang/otp/blob/21095e6830f37676dd29c33a590851ba2c76499b/lib/stdlib/src/io_lib.erl#L744

IIRC, the reason is only that collect_line is also used by get_until which handles chars as charlists. So it actually doesn't need to convert to a list.

@mururu
mururu commented Mar 17, 2014

Forth thought: These two :unicode.characters_to_binary can fail, leaving the server in a bad state (input and output will be an error tuple): devinus@9fc7a1c#diff-b27997fe2e28aaac3723720c70523ea1L259

Its arguments is the return values of :unicode.characters_to_list(input, encoding), so I thought it doesn't failed. If it is wrong, we should check errors.

@mururu
mururu commented Mar 17, 2014

Sixth thought: Do we really need to do these types of converstions: https://github.com/devinus/elixir/blob/9fc7a1c43f0a99dfd62e2ade69d58306f503e449/lib/elixir/lib/string_io.ex#L347

It is used to convert prompts and in put_chars. Prompts can be either charlist or binary. But in put_chars the server seems to get only binaries. So in put_chars, we don't need the conversion.

@mururu
mururu commented Mar 17, 2014

@devinus Sorry for the late reply.

@mururu
mururu commented Mar 17, 2014

Maybe we should have an option to specify its internal encoding like File.open

@josevalim
Elixir member

I would go only with unicode encoding support. We can add other encoding if someone eventually needs in the future.

@mururu
mururu commented Mar 17, 2014

Ah, I see.

@josevalim
Elixir member

Guys, is this good to go? v0.13 is upcoming and I'd like to merge this before. After this is merged, there are some modifications we need to do the API. :)

@devinus
devinus commented Mar 26, 2014

The current PR fixes some bugs, but that's it.

@josevalim josevalim merged commit e8ea196 into elixir-lang:v0.13 Mar 27, 2014

1 check failed

Details default The Travis CI build failed
@josevalim
Elixir member

:heart: :green_heart: :blue_heart: :yellow_heart: :purple_heart:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.