Can't print "É" character read with `File.read/1` #7599

Closed
lobo-tuerto opened this Issue Apr 25, 2018 · 10 comments

lobo-tuerto commented Apr 25, 2018

I already tried asking in the #beginners Slack channel and googling it, with no luck either way, so I'll just ask here.

Environment

  • Elixir & Erlang versions (elixir --version):
    Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:10] [hipe] [kernel-poll:false]
    Elixir 1.6.4 (compiled with OTP 20)

  • Operating system: Manjaro Linux 17.1.8

Current behavior

So, I'm currently reading The Little Elixir & OTP Guidebook and trying out the ID3 tag reader program.
I have an MP3 file that has an accented character in the title, like this: "Éso".

This is the program that parses the ID3 tag for an MP3 file:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        #IO.puts "#{artist} - #{title} (#{album} #{year})"

        IO.inspect String.codepoints(title)
        IO.puts title

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end
end

Executing it with the problem file (the one with accented chars in its title), gives:

iex(1)> ID3Parser.parse "/home/victor/Music/some-song.mp3"
[
  <<201>>,
  "s",
  "o",
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>, 
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>
]
** (ArgumentError) argument error
    (stdlib) :io.put_chars(:standard_io, :unicode, [<<201, 115, 111, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>>, 10])

As you can see, the problem is that first <<201>> byte. String.codepoints should be showing an "É" in its place.

I found that:

iex(6)> "É" <> <<0>>   
<<195, 137, 0>>

iex(9)> "É" == <<201::utf8>>
true
iex(10)> "É" == <<195, 137>> 
true

iex(11)> IO.puts <<201::utf8>>
É
:ok

iex(12)> IO.puts <<201>>
** (ArgumentError) argument error
    (stdlib) :io.put_chars(:standard_io, :unicode, [<<201>>, 10])
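
The experiments above come down to the difference between a lone byte 201 and the UTF-8 encoding of code point 201. A minimal sketch using String.valid?/1 (the variable names are only for illustration):

```elixir
# Byte 201 on its own: "É" in ISO-8859-1, but not a valid UTF-8 sequence.
latin1_e = <<201>>

# Code point 201 encoded as UTF-8: two bytes, 195 and 137.
utf8_e = <<201::utf8>>

IO.inspect(String.valid?(latin1_e))       # false
IO.inspect(String.valid?(utf8_e))         # true
IO.inspect(:binary.bin_to_list(utf8_e))   # [195, 137]
```

This is consistent with the results above, where only the second form could be printed with IO.puts.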

Expected behavior

I don't know what exactly the expected behavior should be.

  • Should that <<201>> be read as <<195, 137>> by File.read/1?
  • Should that <<201>> be interpreted as utf8 by IO.puts?
  • Should I be able to specify the encoding on IO.puts...?
  • Or maybe I'm making a terrible beginner's mistake. :(

lobo-tuerto commented Apr 25, 2018

Someone pointed out in the Slack channel that the ID3 tag might be encoded in :latin1.

They kindly pointed me to: :unicode.characters_to_binary(your_string, :latin1)

iex(3)> a = :unicode.characters_to_binary(<<201>>, :latin1)
"É"
iex(4)> a
"É"
iex(5)> a <> <<0>>
<<195, 137, 0>>
iex(6)> IO.puts a
É
:ok

So now I know how to keep my IO.puts from failing.
But still, the question is: Can this situation be improved somehow?
I was really confused by the errors I got above, and didn't have any pointers as to what was really going on, or how to fix it.

Could something like this be a good idea? (notice the extra ::latin1 to specify the encoding)

<<"TAG",
  title::binary - size(30)::latin1,
  artist::binary - size(30),
  album::binary - size(30),
  year::binary - size(4),
  _rest::binary>> = id3_tag

Or title::binary-size(30)-latin1

Member

josevalim commented Apr 25, 2018

Unfortunately there isn't much we can do regarding the error. The message is sent to the IO device and it is coming from within Erlang/OTP. I have started working on a PR to potentially convert this error into badencoding but it may take a while for it to get merged.

I don't think latin1 would fix it, because binaries are just a bunch of bytes: they don't have a separate tag telling us the encoding, and I don't think matching should convert from one to the other transparently.
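
To illustrate the point that a binary carries no encoding tag, here is a small sketch in which the same two bytes give different text depending on which encoding the reader assumes. The variable names are illustrative only:

```elixir
bytes = <<195, 137>>

# Read as UTF-8, these two bytes are the single code point 201 ("É").
IO.inspect(String.to_charlist(bytes), charlists: :as_lists)      # [201]

# Read as Latin-1, the same bytes are two characters (code points 195 and 137),
# which :unicode.characters_to_binary/2 re-encodes as four UTF-8 bytes.
as_latin1 = :unicode.characters_to_binary(bytes, :latin1)
IO.inspect(String.to_charlist(as_latin1), charlists: :as_lists)  # [195, 137]
IO.inspect(byte_size(as_latin1))                                 # 4
```

Nothing in `bytes` itself says which reading is the intended one; that knowledge has to come from the file format or the caller.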

Thanks for the report!

josevalim closed this Apr 25, 2018


lobo-tuerto commented Apr 25, 2018

I'll leave the version that doesn't raise errors here, just in case someone else is looking for answers:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        # ID3v1 text fields are nominally ISO-8859-1 (Latin-1), so convert each one to UTF-8.
        [title, artist, album, year] =
          Enum.map([title, artist, album, year], &from_latin1/1)

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end

  defp from_latin1(string) do
    :unicode.characters_to_binary(string, :latin1)
  end
end
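
One thing the version above leaves in place is the NUL padding that showed up in the String.codepoints output earlier. A possible sketch for decoding a single field, assuming the padding is NUL bytes as in this file (ID3FieldSketch and decode/1 are hypothetical names, not from the book or this thread):

```elixir
defmodule ID3FieldSketch do
  # Hypothetical helper: keep the bytes before the first NUL (if any),
  # then convert them from Latin-1 to UTF-8.
  def decode(field) when is_binary(field) do
    [text | _padding] = :binary.split(field, <<0>>)
    :unicode.characters_to_binary(text, :latin1)
  end
end

# A 30-byte title field like the one in the report: "Éso" in Latin-1 plus 27 NUL bytes.
title = <<201, 115, 111>> <> :binary.copy(<<0>>, 27)

IO.puts(ID3FieldSketch.decode(title))   # Éso
```

Fields padded with spaces instead of NULs would need an extra trim step, which this sketch does not attempt.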

NobbZ commented Apr 25, 2018

@lobo-tuerto be careful. According to a quick Google search, the encoding of the text in ID3 tags is not specified, so the next file you check could be utf8 or ASCII or anything else. That said, ASCII, latin1 and utf8 seem to be the most common encodings in the wild according to the same search...
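
Given that caveat, one possible heuristic (an assumption for illustration, not something the ID3 format guarantees) is to keep a field as-is when it is already valid UTF-8 and fall back to Latin-1 otherwise. TagTextSketch and to_utf8/1 are hypothetical names:

```elixir
defmodule TagTextSketch do
  # Heuristic only: valid UTF-8 (which includes plain ASCII) is kept unchanged,
  # anything else is assumed to be Latin-1 and converted.
  def to_utf8(bin) when is_binary(bin) do
    if String.valid?(bin) do
      bin
    else
      :unicode.characters_to_binary(bin, :latin1)
    end
  end
end

IO.inspect(TagTextSketch.to_utf8(<<201, 115, 111>>) == "Éso")   # true (Latin-1 input)
IO.inspect(TagTextSketch.to_utf8("Éso") == "Éso")               # true (already UTF-8)
```

The heuristic can misfire: some Latin-1 byte sequences also happen to be valid UTF-8, and other legacy encodings are not detected at all.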


lobo-tuerto commented Apr 25, 2018

@NobbZ Gotcha, thanks for the warning.

I know this is a special case since I have been wrestling with encoding problems for a long time (I'm from Mexico). I just wanted to illustrate a specific solution to this general problem.

I was wondering why something like this won't work though:

title::binary-size(30)-utf8
# or
title::size(30)-utf8

Since this does:

<<201::utf8>> == <<195, 137>>

It says you can't mix those.

For the first one:

** (CompileError) iex:7: conflicting type specification for bit field: "binary" and "utf8"

For the second one:

** (CompileError) iex:11: size and unit are not supported on utf types

I guess I'm just looking for facilities to ease working with, or converting between, different encodings.
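
For context on the two compile errors above: in a binary pattern the utf8 modifier matches one code point at a time and binds it as an integer, so it takes no size and does not transcode. A small sketch (variable names are illustrative):

```elixir
# On a UTF-8 string, utf8 pulls off the first code point as an integer.
<<cp::utf8, rest::binary>> = "Éso"
IO.inspect(cp)     # 201
IO.inspect(rest)   # "so"

# On Latin-1 bytes the same pattern simply does not match,
# because byte 201 followed by 115 is not a valid UTF-8 sequence.
latin1 = <<201, 115, 111>>

result =
  case latin1 do
    <<_cp::utf8, _rest::binary>> -> :utf8_prefix
    _ -> :not_utf8
  end

IO.inspect(result)   # :not_utf8
```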


lobo-tuerto commented Apr 25, 2018

@NobbZ It seems that ISO-8859-1 (Latin 1) is indeed the default encoding format for ID3v1 after all.

From Wikipedia:

ID3v2

...The internationalization problem was solved by allowing the encoding of strings not only in ISO-8859-1, but also in Unicode.

From TagLib documentation:

ID3v1 should in theory always contain ISO-8859-1 (Latin1) data. In practice it does not. TagLib by default only supports ISO-8859-1 data in ID3v1 tags.

But I agree, you can put whatever you want in those bytes (WinAmp did exactly this).

Contributor

OvermindDL1 commented Apr 25, 2018

title::binary-size(30)-utf8
# or
title::size(30)-utf8

This treats the binary as if it were already UTF-8 (it wouldn't know what to convert it from otherwise: is it ASCII? Extended ASCII? UTF-16? WTF-16? Etc.?)

<<201::utf8>> == <<195, 137>>

That is an integer code point encoded as UTF-8, which the Erlang BEAM has specialized handling for (it used not to; that is a fairly recent feature).

> It seems that ISO-8859-1 (Latin 1) is indeed the default encoding format for ID3v1 after all.

Definitely don't want to use UTF-8 as the output then, treat it purely as ISO-8859-1. Remember the old creed of 'Be lenient in what you accept and strict in what you put out'. :-)


lobo-tuerto commented Apr 25, 2018

@OvermindDL1 Hmm, what I'm doing is converting those bytes from a Latin-1 encoding to a UTF-8 one with:

:unicode.characters_to_binary(string, :latin1)

And now IO.puts works as intended.
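
A short sketch of that conversion in both directions, using the three-argument form :unicode.characters_to_binary/3 to spell out the input and output encodings (the variable names are illustrative):

```elixir
latin1 = <<201, 115, 111>>

# Latin-1 -> UTF-8, the same conversion as the two-argument call above.
utf8 = :unicode.characters_to_binary(latin1, :latin1, :utf8)
IO.inspect(utf8 == "Éso")     # true

# UTF-8 -> Latin-1 gives back the original bytes.
back = :unicode.characters_to_binary(utf8, :utf8, :latin1)
IO.inspect(back == latin1)    # true

# Characters outside Latin-1 cannot be converted back; the call then
# returns an error tuple instead of a binary.
euro = :unicode.characters_to_binary("€", :utf8, :latin1)
IO.inspect(match?({:error, _, _}, euro))   # true
```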

When you said:

> Definitely don't want to use UTF-8 as the output then, treat it purely as ISO-8859-1.

How should I go about that?

Contributor

OvermindDL1 commented Apr 25, 2018

> How should I go about that?

For printing, that is fine if the terminal is in UTF-8 (not all are). Ideally you'd convert to whatever encoding the connected terminal uses. If you are just going to output the tags, I'd treat them as opaque binaries and just pass them through. So it all depends on your setup and what you are wanting to accomplish. :-)
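
As a sketch of the "opaque binary" approach: File.write/2 and File.read/1 move bytes without any transcoding, so a Latin-1 field can be passed through untouched and inspected as raw bytes. The temporary file name here is made up for the example:

```elixir
# Hypothetical scratch file, used only for this demonstration.
path = Path.join(System.tmp_dir!(), "id3_passthrough_sketch.bin")

raw_title = <<201, 115, 111>>

:ok = File.write(path, raw_title)
{:ok, read_back} = File.read(path)

# The bytes come back exactly as written; no encoding is applied either way.
IO.inspect(read_back == raw_title)              # true
IO.inspect(read_back, binaries: :as_binaries)   # <<201, 115, 111>>

File.rm(path)
```

Conversion is then only needed at the point where the bytes are shown to a person, and the right target depends on the terminal, as noted above.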


lobo-tuerto commented Apr 25, 2018

@OvermindDL1 Gotcha, thanks!
