Fails on filenames that use a character encoding different from the system #28

Open
StyXman opened this issue Jul 29, 2020 · 10 comments
Labels
bug Something isn't working

Comments


StyXman commented Jul 29, 2020

I have a friend whose audio collection predates the general availability of UTF-8 on operating systems. He also has a lot of music with band, album and song names that include non-ASCII characters. Combine those two and you get:

Traceback (most recent call last):
  File "/usr/bin/collectiongain", line 6, in <module>
    collectiongain()
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 341, in collectiongain
    do_collectiongain(args[0], opts.ref_level, opts.force, opts.dry_run,
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 274, in do_collectiongain
    collect_files(music_dir, files, visited_cache,
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 117, in collect_files
    print("  [%i] %s |" % (i, filepath), end='')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udced' in position 49: surrogates not allowed

Note that these are valid filenames from the OS's point of view (on Unix, a file name may contain any byte except NUL, 0x00, and /); they are just not valid UTF-8. Yes, he could sit down and rename all those files and directories, but I guess he won't be the only one with this problem.
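To make the failure concrete, here is a minimal sketch. It assumes the names were written in a single-byte encoding such as Latin-1, which is a guess on my part. The file name and the `find_non_utf8` helper are hypothetical and not part of rgain3. The first half reproduces the same `UnicodeEncodeError` without touching the filesystem; the second half lists the entries in a tree that would trigger it.

```python
import os

# Hypothetical name: the byte 0xED is "í" in Latin-1 but is not valid UTF-8 here.
raw = b"D\xeda.mp3"

# PEP 383: undecodable bytes become lone surrogates (0xED -> U+DCED).
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))  # 'D\udceda.mp3'

# Encoding strictly, as a strict UTF-8 output stream does, fails.
error = None
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc
print(error.reason)  # surrogates not allowed


def find_non_utf8(root):
    """Return paths (as bytes) under root whose last component is not valid UTF-8."""
    hits = []
    # Walking with a bytes root makes os.walk yield raw bytes names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for entry in dirnames + filenames:
            try:
                entry.decode("utf-8")
            except UnicodeDecodeError:
                hits.append(os.path.join(dirpath, entry))
    return hits
```

Something like `for p in find_non_utf8("/path/to/music"): print(ascii(p))` would then list the affected entries; `ascii()` is used because it never raises on such names.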

OTOH, you could say 'go fix your filenames' and we will understand. Cheers!

Owner

chaudum commented Sep 3, 2020

Thanks for reporting.

Non-UTF-8 file names are definitely something the script should be able to deal with. You're probably right that your friend won't be the only one.

This problem should be solvable by making use of PEP 383.
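As a rough sketch of what PEP 383 offers (not the actual fix, and `display_name` is a hypothetical helper that does not exist in rgain3): the `surrogateescape` error handler lets undecodable bytes round-trip through `str` unchanged, so the name can be kept as-is for file access and only a sanitized copy is printed.

```python
# A raw name containing a byte (0xED) that is not valid UTF-8 in this position.
raw = b"D\xeda.mp3"

# Decoding with surrogateescape never fails; 0xED becomes the lone surrogate U+DCED.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = name.encode("utf-8", errors="surrogateescape")


def display_name(path: str) -> str:
    """Return a copy of path that a strict UTF-8 stream can print.

    Escaped bytes are recovered first, then each undecodable byte is
    shown as U+FFFD (the replacement character).
    """
    return path.encode("utf-8", "surrogateescape").decode("utf-8", "replace")


print(roundtrip == raw)           # True
print(ascii(display_name(name)))  # 'D\ufffda.mp3'
```

`os.fsdecode()` and `os.fsencode()` apply the same handler using the filesystem encoding on POSIX systems, so paths returned by `os.listdir()` or `os.walk()` can still be opened even when they cannot be printed.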

chaudum added the bug label Sep 3, 2020

Owner

chaudum commented Oct 27, 2020

This regression was probably introduced by commit 6de7740.

@StyXman could you try a Python 3-compatible version from before that commit?

git clone https://github.com/chaudum/rgain.git
cd rgain
git checkout aef5bde971c204d46e11a5f808aa4152cefa9687
python3 -m venv env
env/bin/python -m pip install -Ue .

Owner

chaudum commented Nov 10, 2020

@StyXman Unfortunately I have not been able to reproduce your issue yet. I tried creating files with random bytes as filenames, but that did not trigger it either; instead I ran into a different issue:

$ python
Python 3.8.6 (default, Sep 25 2020, 09:36:53) 
[GCC 10.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir()
['album-tag.mp3']
>>> os.rename('album-tag.mp3', os.urandom(4)+b'.mp3')
>>> os.listdir()
['\udcdb\udcc3\udcc0L.mp3']
$ env/bin/collectiongain /tmp/tmp.iEg1y395Tw
Collecting files ...
  [1] ���L.mp3 |Test Album
Dispatching jobs ...
Now waiting for results ...
Unfortunately, there were some errors:
Test Album:Checking for Replay Gain information ...
  /tmp/tmp.iEg1y395Tw/���L.mp3:none
Calculating Replay Gain information ...
Traceback (most recent call last):
  File "/home/christian/sandbox/chaudum/rgain/rgain3/replaygain.py", line 112, in do_gain
    tracks_data, albumdata = calculate_gain(files, ref_level)
  File "/home/christian/sandbox/chaudum/rgain/rgain3/replaygain.py", line 53, in calculate_gain
    rg.start()
  File "/home/christian/sandbox/chaudum/rgain/rgain3/lib/rgcalc.py", line 93, in start
    if not self._next_file():
  File "/home/christian/sandbox/chaudum/rgain/rgain3/lib/rgcalc.py", line 184, in _next_file
    self.src.set_property("location", fname)
TypeError: could not convert '/tmp/tmp.iEg1y395Tw/\udcdb\udcc3\udcc0L.mp3' to type 'gchararray' when setting property 'GstFileSrc.location'


0 successful, 1 failed.
All finished.
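The TypeError above suggests that a str containing lone surrogates cannot be converted to the UTF-8 C string the `location` property expects. One possible direction, sketched below under assumptions, is to hand GStreamer a percent-encoded `file://` URI built from the raw bytes instead of a plain path. `bytes_to_file_uri` is a hypothetical helper, and whether rgain3's pipeline can be switched to a URI-based source element is untested.

```python
import os
from urllib.parse import quote, unquote_to_bytes


def bytes_to_file_uri(raw_path: bytes) -> str:
    """Build an ASCII-only file:// URI from an absolute path given as bytes."""
    # quote() percent-encodes every byte outside the unreserved set; "/" stays as-is.
    return "file://" + quote(raw_path)


def path_to_file_uri(path: str) -> str:
    """Same as above, starting from a str path as returned by os.walk()."""
    # os.fsencode() recovers the original bytes behind any surrogate escapes.
    return bytes_to_file_uri(os.fsencode(os.path.abspath(path)))


uri = bytes_to_file_uri(b"/tmp/D\xeda.mp3")
print(uri)  # file:///tmp/D%EDa.mp3

# The original bytes can be recovered from the URI path.
print(unquote_to_bytes(uri[len("file://"):]))  # b'/tmp/D\xeda.mp3'
```

The resulting string is pure ASCII, so it converts to a C string without trouble regardless of the encoding of the underlying file name.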

Owner

chaudum commented Nov 10, 2020

Could you provide information about your Python version and encoding?

python --version

python -c "import sys; print(sys.getfilesystemencoding(), sys.getdefaultencoding())"

locale
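If it is easier, the same information (plus the error handlers in effect, which matter here) can be collected in one go from Python. This is only a convenience sketch; `encoding_report` is a made-up name.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the interpreter settings that affect file name handling."""
    return {
        "python": sys.version.split()[0],
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Usually "surrogateescape" on POSIX (PEP 383).
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        "default_encoding": sys.getdefaultencoding(),
        # getattr() guards against stdout replacements that lack these attributes.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "stdout_errors": getattr(sys.stdout, "errors", None),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


for key, value in encoding_report().items():
    print(f"{key}: {value}")
```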

Owner

chaudum commented Jan 26, 2021

> Could you provide information about your Python version and encoding?

@StyXman ⬆️

Author

StyXman commented Jan 27, 2021

Sorry, busy with life :(

mdione@diablo:~$ python3
Python 3.9.1+ (default, Jan 10 2021, 15:42:50)
[GCC 10.2.1 20201224] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ
environ({'LANGUAGE': 'en_US:es:fr:it', 'LANG': 'en_US.UTF-8', 'LC_TIME': 'es_AR.UTF-8'})

I was pretty sure at least LC_ALL would be en_US.UTF-8. I guess LANG is picked up instead?

Author

StyXman commented Jan 27, 2021

Ah:

mdione@diablo:~$ python3 -c "import sys; print(sys.getfilesystemencoding(), sys.getdefaultencoding())"
utf-8 utf-8
mdione@diablo:~$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:es:fr:it
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME=es_AR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Owner

chaudum commented Jan 27, 2021

Thanks, I will try again to reproduce the issue on my machine.


brettpim commented Mar 6, 2024

I am also having this problem. My OS is Ubuntu 22.04.4. I installed rgain via apt install replaygain

My failing output:

Collecting files ...
Traceback (most recent call last):
  File "/usr/bin/collectiongain", line 6, in <module>
    collectiongain()
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 341, in collectiongain
    do_collectiongain(args[0], opts.ref_level, opts.force, opts.dry_run,
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 274, in do_collectiongain
    collect_files(music_dir, files, visited_cache,
  File "/usr/lib/python3/dist-packages/rgain3/script/collectiongain.py", line 117, in collect_files
    print("  [%i] %s |" % (i, filepath), end='')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcea' in position 53: surrogates not allowed

python3 --version:

Python 3.10.12

python3 -c "import sys; print(sys.getfilesystemencoding(), sys.getdefaultencoding())":

utf-8 utf-8

locale:

LANG=en_CA.UTF-8
LANGUAGE=en_CA:en
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

I am happy to report any other information that can help diagnose this problem.


brettpim commented Mar 6, 2024

> Thanks for reporting.
>
> Non UTF-8 file names are definitely something the script should be able to deal with. You're probably right, that your friend won't be the only one.
>
> This problem should be solvable by making use of PEP 383.

How can I try to use PEP 383 to try to solve this issue?
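For what it is worth, my current understanding is that the crash comes from the output stream's error handler rather than from the file names themselves. The sketch below uses an in-memory stream as a stand-in for stdout, and `write_name` is a hypothetical helper. It shows that the same name fails under the default `strict` handler and goes through under `surrogateescape`.

```python
import io


def write_name(name: str, errors: str) -> bytes:
    """Write name to an in-memory UTF-8 text stream and return the bytes produced."""
    buffer = io.BytesIO()
    stream = io.TextIOWrapper(buffer, encoding="utf-8", errors=errors)
    stream.write(name)
    stream.flush()
    return buffer.getvalue()


# A name as os.walk() would return it for the raw bytes b"D\xeda.mp3".
name = "D\udceda.mp3"

try:
    write_name(name, "strict")
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)  # strict: surrogates not allowed

# surrogateescape writes the original byte back out unchanged.
print(write_name(name, "surrogateescape"))   # b'D\xeda.mp3'

# backslashreplace keeps the output valid UTF-8 and shows the escape instead.
print(write_name(name, "backslashreplace"))  # b'D\\udceda.mp3'
```

If that is right, running the script with the environment variable `PYTHONIOENCODING=utf-8:surrogateescape`, or calling `sys.stdout.reconfigure(errors="surrogateescape")` (Python 3.7+) early in the script, might get past the `print()` call. I have not verified this against collectiongain, and it would not address the GStreamer TypeError shown earlier in this thread.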

3 participants