Can't print "É" character read with File.read/1
#7599
Comments
Someone pointed out in the Slack channel that the ID3 tag might be encoded in Latin-1, and kindly pointed me to:

iex(3)> a = :unicode.characters_to_binary(<<201>>, :latin1)
"É"
iex(4)> a
"É"
iex(5)> a <> <<0>>
<<195, 137, 0>>
iex(6)> IO.puts a
É
:ok

So now I know how to make my program work. Could something like this be a good idea? (Notice the extra ::latin1 on title.)

<<"TAG",
  title::binary - size(30)::latin1,
  artist::binary - size(30),
  album::binary - size(30),
  year::binary - size(4),
  _rest::binary>> = id3_tag

Or
|
Unfortunately there isn't much we can do regarding the error. The message is sent to the IO device and the error comes from within Erlang/OTP. I have started working on a PR to potentially convert this error into badencoding, but it may take a while for it to get merged.
I don't think latin1 would fix it: binaries are just a bunch of bytes with no separate tag telling us the encoding, and I don't think matching should transparently convert from one encoding to another. Thanks for the report! |
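To illustrate the point that a binary carries no encoding information, here is a minimal sketch using the byte from this report. The variable names are only illustrative; `String.valid?/1` and `:unicode.characters_to_binary/2` are the standard Elixir and Erlang functions.

```elixir
# A binary is only bytes; what they mean depends on the encoding you assume.
latin1_bytes = <<201>>                     # "É" when read as ISO-8859-1

# On its own, <<201>> is not valid UTF-8, so Elixir does not treat it as a string.
String.valid?(latin1_bytes)                # false

# Declaring the input as Latin-1 and re-encoding yields valid UTF-8.
utf8 = :unicode.characters_to_binary(latin1_bytes, :latin1)
utf8 == <<195, 137>>                       # true
String.valid?(utf8)                        # true
```

The conversion only works because the caller states the source encoding; nothing in the bytes themselves says so.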
I'll leave here the version that doesn't throw errors, just in case someone else is looking for answers:

defmodule ID3Parser do
  def parse(file_name) do
    case File.read(file_name) do
      {:ok, mp3} ->
        mp3_byte_size = byte_size(mp3) - 128
        <<_::binary-size(mp3_byte_size), id3_tag::binary>> = mp3
        <<"TAG",
          title::binary-size(30),
          artist::binary-size(30),
          album::binary-size(30),
          year::binary-size(4),
          _rest::binary>> = id3_tag
        to_convert = [title, artist, album, year]
        [title, artist, album, year] = Enum.map(to_convert, fn tag -> from_latin1(tag) end)
        IO.puts "#{artist} - #{title} (#{album} #{year})"
      _ ->
        IO.puts "Couldn't open #{file_name}"
    end
  end

  defp from_latin1(string) do
    :unicode.characters_to_binary(string, :latin1)
  end
end
|
@lubo-tuerto be careful. According to a quick Google search, the encoding of the text in ID3 is not specified, so the next file you check could be UTF-8 or ASCII or anything else. That said, ASCII, latin1 and UTF-8 seem to be the most common encodings in the wild according to the same search... |
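One possible way to cope with that uncertainty is a fallback heuristic: keep the bytes when they are already valid UTF-8 (which covers ASCII too), and otherwise assume Latin-1. The sketch below is an assumption, not something proposed in this thread; `TagText` and `to_utf8/1` are hypothetical names.

```elixir
defmodule TagText do
  # Hypothetical helper, not from the thread: guess the encoding of a tag field.
  # Valid UTF-8 (which includes plain ASCII) is kept as-is; anything else is
  # assumed to be Latin-1, where every byte maps to exactly one code point.
  def to_utf8(bytes) when is_binary(bytes) do
    if String.valid?(bytes) do
      bytes
    else
      :unicode.characters_to_binary(bytes, :latin1)
    end
  end
end

TagText.to_utf8("Eso")              # ASCII, returned unchanged
TagText.to_utf8(<<195, 137>>)       # already UTF-8, returned unchanged
TagText.to_utf8(<<201, 115, 111>>)  # Latin-1 bytes, re-encoded to "Éso"
```

This remains a guess: Latin-1 text whose bytes happen to form valid UTF-8 would be misread, and other encodings are not detected at all.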
@NobbZ Gotcha, thanks for the warning. I know this is a special case since I have been wrestling with encoding problems for a long time (I'm from Mexico). I just wanted to illustrate a specific solution to this general problem. I was wondering why something like this won't work, though:
title::binary-size(30)-utf8
# or
title::size(30)-utf8
Since this does:
<<201::utf8>> == <<195, 137>>
It says you can't mix those. For the first one:
For the second one:
I guess I'm just looking for facilities to ease working with, or converting between, different encodings. |
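As far as I understand the bitstring syntax, the utf8 modifier describes a single code point (an integer that takes one to four bytes), which is why it cannot be combined with binary or with an explicit size. A small sketch of what it does support, with illustrative variable names:

```elixir
# Building: the integer 201 (code point U+00C9) is encoded as two UTF-8 bytes.
<<201::utf8>> == <<195, 137>>              # true

# Matching: ::utf8 decodes exactly one code point from the front of a binary.
<<first::utf8, rest::binary>> = "Éso"
first                                      # 201
rest                                       # "so"

# The same pattern does not match raw Latin-1 bytes, because a lone 201
# followed by an ASCII byte is not a valid UTF-8 sequence.
latin1_ok? = match?(<<_::utf8, _::binary>>, <<201, 115, 111>>)
latin1_ok?                                 # false
```

So the modifier decodes or encodes UTF-8; it does not convert from another encoding.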
@NobbZ It seems that ISO-8859-1 (Latin 1) is indeed the default encoding format for ID3v1 after all. From Wikipedia:
From TagLib documentation:
But I agree, you can put whatever you want in those bytes (WinAmp did exactly this). |
This is treating the binary as if it is already utf8 (it wouldn't know how to convert it from otherwise, is it ascii? extended ascii? UTF16? WTF16? Etc...?)
That is an integer code point that is encoded as UTF-8, which the Erlang BEAM has specialized handling for (it used not to; that is a fairly recent feature).
Definitely don't want to use UTF-8 as the output then, treat it purely as ISO-8859-1. Remember the old creed of 'Be lenient in what you accept and strict in what you put out'. :-) |
@OvermindDL1 Hmm, what I'm doing is converting those bytes from a Latin-1 encoding to a UTF-8 one with:
:unicode.characters_to_binary(string, :latin1)
And now
When you said:
How should I go about that? |
For printing, that is fine if the terminal is in UTF-8 (not all are). Optimally you'd convert to whatever format the connected terminal is. If you are just going to output the tags I'd treat it as opaque binary and just pass it through. So it all depends on your setup and what you are wanting to accomplish. :-) |
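If the output device does expect something other than UTF-8, the three-argument form of `:unicode.characters_to_binary/3` can re-encode a string for it. The sketch below assumes the target encoding is already known; `OutputEncoding` and `encode/2` are hypothetical names, and detecting the terminal's encoding is left out.

```elixir
defmodule OutputEncoding do
  # Hypothetical helper, not from the thread: re-encode a UTF-8 string for a
  # device that expects another encoding (only :latin1 and :utf8 here).
  def encode(utf8_string, target) when target in [:latin1, :utf8] do
    case :unicode.characters_to_binary(utf8_string, :utf8, target) do
      bytes when is_binary(bytes) -> {:ok, bytes}
      # Invalid input, or a character the target encoding cannot represent.
      {:error, _converted, _rest} -> {:error, :cannot_encode}
      # The input ended in the middle of a multi-byte sequence.
      {:incomplete, _converted, _rest} -> {:error, :incomplete_input}
    end
  end
end

OutputEncoding.encode("Éso", :latin1)   # {:ok, <<201, 115, 111>>}
OutputEncoding.encode("Éso", :utf8)     # {:ok, "Éso"}
OutputEncoding.encode("日本", :latin1)   # {:error, :cannot_encode}
```

Returning a tagged tuple lets the caller decide what to do with text that cannot be represented, instead of failing while printing.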
@OvermindDL1 Gotcha, thanks! |
I already tried asking in the #beginners Slack channel, no luck, googling it, also no luck, so I'll just ask here.
Environment
Elixir & Erlang versions (elixir --version):
Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:10] [hipe] [kernel-poll:false]
Elixir 1.6.4 (compiled with OTP 20)
Operating system: Manjaro Linux 17.1.8
Current behavior
So, I'm currently reading The Little Elixir & OTP Guidebook and trying out the ID3 tag reader program.
I have an MP3 file that has an accented character in the title, like this: "Éso".
This is the program that parses the ID3 tag for an MP3 file:
Executing it with the problem file (the one with accented chars in its title), gives:
As you can see, the problem is that first <<201>> number. String.codepoints should be showing an "É" in its place. I found that:
Expected behavior
I don't know what exactly the expected behavior should be.
Should <<201>> be read as <<195, 137>> by File.read/1?
Should <<201>> be interpreted as utf8 by IO.puts?
Should IO.puts ...?