Text strings encoding #67

Closed
vokimon opened this Issue Jun 25, 2014 · 9 comments


vokimon commented Jun 25, 2014

The API documentation does not state how to deal with non-ASCII text in metadata strings such as title, author...

While Vorbis comments (Ogg, Speex, FLAC) are UTF-8, ID3v1 is Latin-1 and ID3v2 might be Latin-1, UTF-8, UCS or UTF-16LE depending on a flag. I have no idea which encoding is used for metadata in other formats such as WAV or AIFF.

Anyway libsndfile should provide either:

  • text strings in a common universal text encoding such as utf-8, or
  • means to know the text encoding to deal with.

Could we assume the former? Is there any means for the latter?

@vokimon vokimon referenced this issue in vokimon/python-wavefile Jun 25, 2014

Closed

Safe non-ascii text in tags #7

erikd Jun 25, 2014

Owner

Well, strings stored in WAV and AIFF files are ASCII. Fortunately that is a valid subset of UTF-8.

Like you say, Vorbis and FLAC comments are UTF-8 and libsndfile doesn't handle ID3 tags at all, so it's safe to assume that all string metadata is UTF-8 encoded.
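
A minimal Python sketch of the "ASCII is a valid subset of UTF-8" point. The tag value is made up for illustration, and nothing here calls libsndfile:

```python
# Hypothetical tag value containing only ASCII characters.
title = "Moonlight Sonata"

# The same text encoded as ASCII and as UTF-8 gives identical bytes.
ascii_bytes = title.encode("ascii")
utf8_bytes = title.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# So bytes written as ASCII can always be decoded as UTF-8.
round_trip = ascii_bytes.decode("utf-8")
```

This is why a reader that assumes UTF-8 should lose nothing on files whose tags are plain ASCII.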

vokimon Jun 26, 2014

Thanks for the answer, Erik. After many years of using your nice software in CLAM and other audio-related projects, it is cool to talk with you. :-)

So, ID3 is not in the equation, and generally using UTF-8 for reading seems not to be a big deal. But what about writing? Would using UTF-8 to encode text inside WAV and AIFF break other readers? Should I prevent my users from doing that by raising a text-encoding exception? Or should I let them take over the WAV world with UTF-8? :-)

And, regarding the API documentation, I think it would be nice to add a notice that strings are set and retrieved in the text encoding required by the format: ASCII for WAV and AIFF, and UTF-8 for Vorbis and FLAC.
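
To make the two options for writing concrete, here is a rough sketch of what a wrapper could do. The function name `encode_tag`, the container names and the `strict` flag are all hypothetical; this is not the python-wavefile or libsndfile API:

```python
# Containers whose tags are ASCII according to the discussion above.
ASCII_ONLY_FORMATS = {"wav", "aiff"}


def encode_tag(text: str, container: str, strict: bool = True) -> bytes:
    """Encode tag text for a container name such as "wav" or "flac".

    With strict=True, non-ASCII text for WAV/AIFF raises UnicodeEncodeError.
    With strict=False, UTF-8 is written everywhere.
    """
    if strict and container.lower() in ASCII_ONLY_FORMATS:
        return text.encode("ascii")
    return text.encode("utf-8")
```

With `strict=True` the user gets an exception for a title such as "Café" in a WAV file; with `strict=False` the same title is written as UTF-8 bytes.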

erikd Jun 26, 2014

Owner

So here is where I have to confess that I have really not thought about this very much at all.

Some time ago (2009 according to the git logs) I was forced to deal with non-ASCII file names on Windows, which resulted in the addition of the Windows-specific function sf_wchar_open(). Non-ASCII filenames were never a problem on Linux or Mac because these OSes mostly used locale-specific encodings like Latin-1 before switching to UTF-8, which afaiac JustWorks (tm), which meant that I could mostly avoid the issue.

The other way I avoided the issues of text encoding is by doing a lot of my more recent code in a language other than C (e.g. Haskell), which has three string types: String, Text and ByteString (the first two of which support Unicode).

So, to answer the question: yes, I think it should be safe to use UTF-8 to encode text embedded in WAV and AIFF files, because any code that just passes this text around (i.e. without processing it) and assumes it to be ASCII should just work. Obviously there could be code out there that does something somewhat silly like masking off the top bit of each character, but that is rather unlikely.
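
As an illustration of the two cases (byte-transparent code versus code that masks the top bit), here is a small Python sketch with a made-up tag value:

```python
# Hypothetical tag text "Señor", written with an escape for the n-tilde.
original = "Se\u00f1or".encode("utf-8")  # b'Se\xc3\xb1or'

# Code that only copies the bytes around leaves the UTF-8 intact.
passed_through = bytes(original)

# Code that forces 7-bit ASCII by masking the top bit corrupts it.
masked = bytes(b & 0x7F for b in original)  # b'SeC1or'
```

The first path decodes back to the original text; the second turns the two-byte sequence for "ñ" into the unrelated ASCII characters "C1".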

Yes, I will update the documentation (probably with a FAQ entry and a link from the docs to the FAQ) to the effect that UTF-8 should be safe for these functions before I close this issue.

vokimon Jun 26, 2014

Yes, well... it is safe in the sense that libsndfile just handles bytes as bytes, and when you write some encoding, say UTF-8, you can retrieve the same bytes back safely... as long as you know that it was UTF-8. What about everybody else? Other software may interpret my UTF-8 as if it were, say, Latin-1, or write Latin-1 in its files. That's why I was joking about taking over the world: if python-wavefile transparently encodes any non-ASCII character as UTF-8, I am placing a stake by setting a convention outside the standard. If libsndfile recommends using UTF-8 as well, that could settle a "wider convention" and a higher stake, given the many applications that use it. But that's still a convention outside the standard, anyway. Moreover, libsndfile does not encode/decode transparently, so users must do the encoding and decoding themselves. In summary, even if I like the idea of pushing UTF-8, I am not sure I have the right or the market force to do so :-) But that's now my philosophical problem; I won't bother you more with that ;-)
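
The interoperability risk can be shown in a few lines of Python. The tag value is made up, and the "other reader" is simulated by decoding with Latin-1:

```python
# Hypothetical tag text "Café", written to the file as UTF-8.
written = "Caf\u00e9".encode("utf-8")  # b'Caf\xc3\xa9'

# A reader that knows the convention gets the text back.
read_as_utf8 = written.decode("utf-8")

# A reader that assumes Latin-1 sees two characters instead of one.
read_as_latin1 = written.decode("latin-1")  # 'CafÃ©'
```

No error is raised in the second case, so the mismatch goes unnoticed until someone looks at the text.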

Leaving this issue aside: about Unicode filenames, in Python I am just encoding any non-encoded string the user passes to the encoding the file system uses, which is quite straightforward in Python. So I guess it should be safe even without using your wchar version of the open function, but I couldn't test it on a Windows host to see how it actually works.
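
A sketch of that filename handling, assuming Python 3, where `os.fsencode` and `os.fsdecode` apply the file system encoding. The helper name `filename_to_bytes` is hypothetical, and whether the resulting bytes open the right file through the char-based libsndfile call on Windows is exactly the untested part:

```python
import os
import sys


def filename_to_bytes(name):
    """Return the byte form of a filename using the file system encoding.

    Accepts str or bytes; bytes are returned unchanged by os.fsencode.
    """
    return os.fsencode(name)


# The encoding Python will use for the conversion on this host.
fs_encoding = sys.getfilesystemencoding()

encoded = filename_to_bytes("take1.wav")
decoded = os.fsdecode(encoded)
```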

@erikd erikd added the NextRelease label Jun 27, 2014

erikd Nov 30, 2014

Owner

Just re-read this ticket and I'm not sure if there is anything I actually need to do. Is that right?

vokimon Dec 1, 2014

Nope, but maybe bring the conclusions of this discussion into the documentation as a notice about using non-ASCII encodings in both filenames and tags. As I understood it:

  • Filenames are byte-encoded using the file system charset in the char version of the functions, and UTF-16 (little-endian) in the wchar_t flavor for Windows.
  • Tags are UTF-8 encoded in Ogg files.
  • For writing tags in WAV files, plain ASCII is recommended if you want to avoid problems for reader programs. Or use UTF-8 if you want to support the utf8-everywhere campaign.
  • For reading WAV tags, try to decode them as UTF-8 and watch for decoding errors.
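
The last point could look roughly like this in Python. The function name `decode_tag` is hypothetical, and falling back to Latin-1 is an assumption added here; the thread only says to watch for decoding errors:

```python
def decode_tag(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode tag bytes as UTF-8, falling back to another encoding.

    The Latin-1 fallback is an assumption, not something libsndfile does.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this cannot fail.
        return raw.decode(fallback)
```

Plain ASCII and valid UTF-8 take the first path; bytes such as `b"Caf\xe9"`, which are not valid UTF-8, take the fallback.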

erikd Mar 17, 2015

Owner

@vokimon Please check out the above commit and let me know if it's up to scratch.

erikd Mar 20, 2015

Owner

Thanks @vokimon. I'll mark this as fixed.

@erikd erikd closed this Mar 20, 2015

camlorn Oct 22, 2015

I think you can kill sf_wchar_open by just always decoding from UTF-8; see MultiByteToWideChar.
This would give a uniform interface across all platforms and mirrors the approach I'm taking in my software: since every other platform is already UTF-8, define as part of the API that incoming strings need to be UTF-8 and decode as appropriate.
Just a thought, but I think this might be better than a Windows-specific function. It's compatible with ASCII, so it won't break any existing software, and it should be an easy patch.
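
For illustration only, the conversion this proposal relies on can be mimicked in pure Python. The helper name `utf8_path_to_wide` is hypothetical; in C on Windows the equivalent step would be a call to MultiByteToWideChar with the CP_UTF8 code page, and this sketch is not libsndfile code:

```python
def utf8_path_to_wide(path_utf8: bytes) -> bytes:
    """Convert a UTF-8 path to the bytes of a NUL-terminated UTF-16LE string.

    Raises UnicodeDecodeError if the input is not valid UTF-8.
    """
    text = path_utf8.decode("utf-8")
    # Windows wide strings are UTF-16 little-endian, ended by a 16-bit NUL.
    return text.encode("utf-16-le") + b"\x00\x00"
```

An ASCII-only path converts without loss, which is the backward-compatibility argument made above.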
