Can't print "É" character read with `File.read/1` #7599

lobo-tuerto · 2018-04-25T04:47:21Z

I already tried asking in the #beginners Slack channel and googling it, with no luck either way, so I'll just ask here.

Environment

Elixir & Erlang versions (elixir --version):
Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:10] [hipe] [kernel-poll:false]
Elixir 1.6.4 (compiled with OTP 20)
Operating system: Manjaro Linux 17.1.8

Current behavior

I'm currently reading The Little Elixir & OTP Guidebook and trying out the ID3 tag reader program.
I have an MP3 file with an accented character in its title, like this: "Éso".

This is the program that parses the ID3 tag for an MP3 file:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        #IO.puts "#{artist} - #{title} (#{album} #{year})"

        IO.inspect String.codepoints(title)
        IO.puts title

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end
end

Executing it with the problem file (the one with accented chars in its title), gives:

iex(1)> ID3Parser.parse "/home/victor/Music/some-song.mp3"
[
  <<201>>,
  "s",
  "o",
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>, 
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>,
  <<0>>
]
** (ArgumentError) argument error
    (stdlib) :io.put_chars(:standard_io, :unicode, [<<201, 115, 111, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>>, 10])

As you can see, the problem is that first <<201>> byte. I expected String.codepoints to show an "É" in its place.

I found that:

iex(6)> "É" <> <<0>>   
<<195, 137, 0>>

iex(9)> "É" == <<201::utf8>>
true
iex(10)> "É" == <<195, 137>> 
true

iex(11)> IO.puts <<201::utf8>>
É
:ok

iex(12)> IO.puts <<201>>
** (ArgumentError) argument error
    (stdlib) :io.put_chars(:standard_io, :unicode, [<<201>>, 10])

Expected behavior

I don't know what exactly the expected behavior should be.

Should that <<201>> be read as <<195, 137>> by File.read/1?
Should that <<201>> be interpreted as utf8 by IO.puts?
Should I be able to specify the encoding on IO.puts...?
Or maybe I'm making a terrible beginner's mistake. :(

lobo-tuerto · 2018-04-25T05:15:05Z

Someone in the Slack channel pointed out that the ID3 tag might be encoded in :latin1, and kindly pointed me to :unicode.characters_to_binary(your_string, :latin1):

iex(3)> a = :unicode.characters_to_binary(<<201>>, :latin1)
"É"
iex(4)> a
"É"
iex(5)> a <> <<0>>
<<195, 137, 0>>
iex(6)> IO.puts a
É
:ok

So now I know how to keep my IO.puts from failing.
But the question remains: can this situation be improved somehow?
I was really confused by the errors above, and had no pointers as to what was really going on or how to fix it.

Could something like this be a good idea? (notice the extra ::latin1 to specify the encoding)

<<"TAG",
  title::binary-size(30)::latin1,
  artist::binary-size(30),
  album::binary-size(30),
  year::binary-size(4),
  _rest::binary>> = id3_tag

Or title::binary-size(30)-latin1

josevalim · 2018-04-25T07:50:33Z

Unfortunately there isn't much we can do regarding the error. The message is sent to the IO device and it is coming from within Erlang/OTP. I have started working on a PR to potentially convert this error into badencoding but it may take a while for it to get merged.

I don't think latin1 would fix it, because binaries are just a bunch of bytes: they don't have a separate tag telling us the encoding, and I don't think matching should transparently convert from one encoding to another.

Thanks for the report!
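
To illustrate the point that a binary carries no encoding information, here is a minimal sketch (the variable names are made up for this example). It checks the bytes from the report with String.valid?/1 and then converts them explicitly, treating the input as Latin-1:

```elixir
# "Éso" as ISO-8859-1 (Latin-1) bytes, as found in the ID3 tag from the report.
latin1_bytes = <<201, 115, 111>>

# 201 is a UTF-8 lead byte that must be followed by a continuation byte,
# so these bytes are not valid UTF-8.
IO.inspect(String.valid?(latin1_bytes))
# prints: false

# Explicit conversion: interpret the bytes as Latin-1 and re-encode as UTF-8.
utf8 = :unicode.characters_to_binary(latin1_bytes, :latin1)

IO.inspect(String.valid?(utf8))
# prints: true
IO.inspect(utf8 == <<195, 137, 115, 111>>)
# prints: true
```

Nothing in latin1_bytes itself says "Latin-1"; that knowledge has to come from the file format or from the caller.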

lobo-tuerto · 2018-04-25T14:14:34Z

I'll leave the version that doesn't raise errors here, in case someone else is looking for answers:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128

        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3

        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag

        to_convert = [title, artist, album, year]
        [title, artist, album, year] = Enum.map(to_convert, fn tag -> from_latin1(tag) end)

        IO.puts "#{artist} - #{title} (#{album} #{year})"

      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end

  defp from_latin1(string) do
    :unicode.characters_to_binary(string, :latin1)
  end
end
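
One thing the version above still does is print the NUL padding that fills each 30-byte field (visible as the <<0>> entries in the original output). A possible sketch for dropping it before converting, assuming the field text ends at the first zero byte; ID3Field and decode/1 are hypothetical names, not part of any library:

```elixir
defmodule ID3Field do
  # Hypothetical helper: keeps only the bytes before the first NUL (0) byte,
  # then converts the remaining Latin-1 bytes to UTF-8.
  def decode(field) when is_binary(field) do
    field
    |> strip_padding()
    |> :unicode.characters_to_binary(:latin1)
  end

  # :binary.split/2 splits at the first match only; if there is no NUL byte
  # it returns the whole field as the single list element.
  defp strip_padding(field) do
    [text | _padding] = :binary.split(field, <<0>>)
    text
  end
end

# A 30-byte title field: "Éso" in Latin-1 followed by 27 NUL bytes.
title = <<201, 115, 111>> <> :binary.copy(<<0>>, 27)

IO.inspect(byte_size(title))
# prints: 30
IO.inspect(ID3Field.decode(title) == <<195, 137, 115, 111>>)
# prints: true
```

Some taggers pad with spaces instead of zero bytes, so a String.trim_trailing/1 after the conversion may also be worth considering.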

NobbZ · 2018-04-25T14:41:36Z

@lobo-tuerto be careful. According to a quick Google search, the encoding of the text in ID3 tags is not specified, so the next file you check could be utf8 or ASCII or anything else. ASCII, latin1 and utf8 seem to be the most common encodings in the wild, according to the same search.
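
A rough sketch of how that uncertainty could be handled, assuming only UTF-8 (which includes plain ASCII) and Latin-1 need to be told apart. TagText and to_utf8/1 are hypothetical names, and this is a heuristic, not a real encoding detector:

```elixir
defmodule TagText do
  # Hypothetical helper. Bytes that are already valid UTF-8 are kept as-is;
  # anything else is assumed (not known!) to be Latin-1 and converted.
  def to_utf8(bytes) when is_binary(bytes) do
    if String.valid?(bytes) do
      bytes
    else
      :unicode.characters_to_binary(bytes, :latin1)
    end
  end
end

IO.inspect(TagText.to_utf8("abc"))
# prints: "abc"
IO.inspect(TagText.to_utf8(<<195, 137>>) == <<195, 137>>)
# prints: true (already UTF-8, left untouched)
IO.inspect(TagText.to_utf8(<<201>>) == <<195, 137>>)
# prints: true (not valid UTF-8, converted from Latin-1)
```

The heuristic can guess wrong: some Latin-1 byte sequences happen to be valid UTF-8 as well, and those would be passed through unconverted.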

lobo-tuerto · 2018-04-25T14:56:04Z

@NobbZ Gotcha, thanks for the warning.

I know this is a special case since I have been wrestling with encoding problems for a long time (I'm from Mexico). I just wanted to illustrate a specific solution to this general problem.

I was wondering why something like this won't work though:

title::binary-size(30)-utf8
# or
title::size(30)-utf8

Since this does:

<<201::utf8>> == <<195, 137>>

It says you can't mix those.

For the first one:

** (CompileError) iex:7: conflicting type specification for bit field: "binary" and "utf8"

For the second one:

** (CompileError) iex:11: size and unit are not supported on utf types

I guess I'm just looking for facilities to ease working with, or converting between, different encodings.
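
For reference, a small sketch of what the utf8 modifier does: it handles exactly one code point at a time, whose encoded length varies, which fits the two compile errors above. Utf8Demo and first_codepoint/1 are names invented for this example:

```elixir
defmodule Utf8Demo do
  # In a pattern, ::utf8 matches one UTF-8 encoded code point and binds it
  # as an integer. It does not convert from any other encoding.
  def first_codepoint(<<cp::utf8, rest::binary>>), do: {:ok, cp, rest}
  def first_codepoint(_other), do: :error
end

# UTF-8 input: the two bytes 195, 137 decode to code point 201 ("É").
IO.inspect(Utf8Demo.first_codepoint(<<195, 137, 115, 111>>))
# prints: {:ok, 201, "so"}

# Latin-1 input: a lone 201 byte is not valid UTF-8, so the match fails.
IO.inspect(Utf8Demo.first_codepoint(<<201, 115, 111>>))
# prints: :error

# When constructing a binary, ::utf8 encodes an integer code point.
IO.inspect(<<201::utf8>> == <<195, 137>>)
# prints: true
```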

lobo-tuerto · 2018-04-25T15:19:19Z

@NobbZ It seems that ISO-8859-1 (Latin 1) is indeed the default encoding format for ID3v1 after all.

From Wikipedia:

ID3v2

...The internationalization problem was solved by allowing the encoding of strings not only in ISO-8859-1, but also in Unicode.

From TagLib documentation:

ID3v1 should in theory always contain ISO-8859-1 (Latin1) data. In practice it does not. TagLib by default only supports ISO-8859-1 data in ID3v1 tags.

But I agree, you can put whatever you want in those bytes (WinAmp did exactly this).

OvermindDL1 · 2018-04-25T15:20:34Z

> title::binary-size(30)-utf8
> # or
> title::size(30)-utf8

This is treating the binary as if it is already utf8 (it wouldn't know how to convert it from otherwise, is it ascii? extended ascii? UTF16? WTF16? Etc...?)

> <<201::utf8>> == <<195, 137>>

That is an integer code point that is encoded as UTF-8, which the Erlang BEAM has specialized handling for (it used not to; that is a fairly recent feature).

> It seems that ISO-8859-1 (Latin 1) is indeed the default encoding format for ID3v1 after all.

Definitely don't want to use UTF-8 as the output then, treat it purely as ISO-8859-1. Remember the old creed of 'Be lenient in what you accept and strict in what you put out'. :-)

lobo-tuerto · 2018-04-25T16:00:11Z

@OvermindDL1 Hmm, what I'm doing is converting those bytes from a Latin-1 encoding to a UTF-8 one with:

:unicode.characters_to_binary(string, :latin1)

And now IO.puts works as intended.

When you said:

> Definitely don't want to use UTF-8 as the output then, treat it purely as ISO-8859-1.

How should I go about that?

OvermindDL1 · 2018-04-25T16:20:03Z

> How should I go about that?

For printing, that is fine if the terminal is in UTF-8 (not all are). Optimally you'd convert to whatever encoding the connected terminal uses. If you are just going to output the tags I'd treat them as opaque binaries and just pass them through. So it all depends on your setup and what you are wanting to accomplish. :-)
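
If the output really does need to be ISO-8859-1 again, the conversion can be run in the other direction. A minimal sketch, assuming the text is held as UTF-8 internally; Latin1Out and encode/1 are hypothetical names:

```elixir
defmodule Latin1Out do
  # Hypothetical helper: re-encodes a UTF-8 string as Latin-1 bytes.
  # :unicode.characters_to_binary/3 returns a tuple instead of a binary
  # when the input cannot be converted, e.g. for characters outside Latin-1.
  def encode(utf8) when is_binary(utf8) do
    case :unicode.characters_to_binary(utf8, :utf8, :latin1) do
      bytes when is_binary(bytes) -> {:ok, bytes}
      _error_or_incomplete -> :error
    end
  end
end

# "Éso" in UTF-8 becomes the original three Latin-1 bytes.
IO.inspect(Latin1Out.encode(<<195, 137, 115, 111>>))
# prints: {:ok, <<201, 115, 111>>}

# The euro sign (U+20AC) has no Latin-1 representation.
IO.inspect(Latin1Out.encode(<<226, 130, 172>>))
# prints: :error
```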

lobo-tuerto · 2018-04-25T16:22:45Z

@OvermindDL1 Gotcha, thanks!

josevalim closed this as completed Apr 25, 2018
