[BUG] Problem with special characters in file name #663

Closed
3 tasks done
Fakeaccount12312 opened this issue Sep 18, 2022 · 5 comments · Fixed by #746
Labels
bug Something isn't working

Comments

@Fakeaccount12312

  • I am reporting a bug.
  • I am running the latest version of BDfR
  • I have read the "Opening an issue" guidelines

Description

I noticed that bdfr crashes when attempting to download posts whose titles contain special characters that some file systems do not allow in file names (like "*\~). For example, this post (ignore the necrophilia joke):
https://www.reddit.com/r/197/comments/xdi48h/
I tried to clone it with this command:
python -m bdfr clone -v -l https://www.reddit.com/r/197/comments/xdi48h/ ""
And got this error:

[2022-09-18 23:48:58,201 - bdfr.connector - DEBUG] - Setting maximum download wait time to 120 seconds
[2022-09-18 23:48:58,201 - bdfr.connector - DEBUG] - Setting datetime format string to ISO
[2022-09-18 23:48:58,202 - bdfr.connector - DEBUG] - Disabling the following modules: 
[2022-09-18 23:48:58,202 - bdfr.connector - DEBUG] - Using unauthenticated Reddit instance
[2022-09-18 23:48:59,541 - bdfr.downloader - DEBUG] - Attempting to download submission xdi48h
[2022-09-18 23:49:02,853 - bdfr.downloader - DEBUG] - Using YtdlpFallback with url https://v.redd.it/aq4adlbbuon91
[2022-09-18 23:49:11,267 - bdfr.downloader - ERROR] - Failed to write file in submission xdi48h to /media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4: [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4'
[2022-09-18 23:49:11,267 - bdfr.archive_entry.submission_archive_entry - DEBUG] - Retrieving full comment tree for submission xdi48h
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/__main__.py", line 154, in <module>
    cli()
  File "/usr/lib/python3/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/__main__.py", line 120, in cli_clone
    reddit_scraper.download()
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/cloner.py", line 21, in download
    self.write_entry(submission)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 75, in write_entry
    self._write_entry_json(archive_entry)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 87, in _write_entry_json
    self._write_content_to_disk(resource, content)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 102, in _write_content_to_disk
    with open(file_path, 'w', encoding="utf-8") as file:
OSError: [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.json'

Note how downloading just fails, but archiving crashes bdfr, which is more annoying when downloading multiple posts.

I noticed this only happens when downloading to my USB stick, so it isn't easily reproducible: I work on Linux with old hardware and custom-formatted the USB stick as exFAT. Therefore I'd suggest a more general way to deal with this kind of problem.

My suggestion would be one of the following: an option that automatically removes these problematic characters from file names; a simple catch so that archiving fails for this file but doesn't crash the whole process; or replacing these characters when the operation fails and trying again. The second would obviously be the easiest to implement and makes the problem more manageable; the first is more of a feature request.
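
The catch and the retry suggestions can be sketched together as below. This is only an illustration, not BDFR's actual code: `sanitise_name` and `write_text_with_fallback` are hypothetical helpers, and the character set assumes Windows-style naming rules, under which the colon in `fun :3` from the logs above is not allowed.

```python
import logging
import re
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

# Characters that Windows-style filesystems (exFAT, NTFS, FAT32) refuse,
# plus ASCII control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitise_name(name: str, replacement: str = "_") -> str:
    """Hypothetical helper: swap out characters exFAT would reject."""
    cleaned = INVALID_CHARS.sub(replacement, name)
    # Windows naming rules also disallow a trailing dot or space.
    return cleaned.rstrip(". ") or replacement


def write_text_with_fallback(file_path: Path, content: str) -> Optional[Path]:
    """Try the original name, then a sanitised one; never raise OSError."""
    fallback = file_path.with_name(sanitise_name(file_path.name))
    candidates = [file_path] if fallback == file_path else [file_path, fallback]
    for candidate in candidates:
        try:
            with open(candidate, "w", encoding="utf-8") as file:
                file.write(content)
            return candidate
        except OSError as e:
            # Log and move on instead of crashing the whole run.
            logger.error("Failed to write %s: %s", candidate, e)
    return None
```

A caller would check for `None` and continue with the next submission, so a single bad file name no longer ends the whole clone run.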

Command

python -m bdfr clone -v -l https://www.reddit.com/r/197/comments/xdi48h/ ""

Environment (please complete the following information):

  • OS: Linux Lite 6
  • Python version: 3.10.4
  • Fresh install

Logs

[2022-09-19 00:17:58,009 - bdfr.connector - DEBUG] - Setting maximum download wait time to 120 seconds
[2022-09-19 00:17:58,010 - bdfr.connector - DEBUG] - Setting datetime format string to ISO
[2022-09-19 00:17:58,011 - bdfr.connector - DEBUG] - Disabling the following modules: 
[2022-09-19 00:17:58,011 - bdfr.connector - Level 9] - Created download filter
[2022-09-19 00:17:58,011 - bdfr.connector - Level 9] - Created time filter
[2022-09-19 00:17:58,011 - bdfr.connector - Level 9] - Created sort filter
[2022-09-19 00:17:58,011 - bdfr.connector - Level 9] - Create file name formatter
[2022-09-19 00:17:58,012 - bdfr.connector - DEBUG] - Using unauthenticated Reddit instance
[2022-09-19 00:17:58,013 - bdfr.connector - Level 9] - Created site authenticator
[2022-09-19 00:17:58,013 - bdfr.connector - Level 9] - Retrieved subreddits
[2022-09-19 00:17:58,014 - bdfr.connector - Level 9] - Retrieved multireddits
[2022-09-19 00:17:58,014 - bdfr.connector - Level 9] - Retrieved user data
[2022-09-19 00:17:58,014 - bdfr.connector - Level 9] - Retrieved submissions for given links
[2022-09-19 00:17:59,011 - bdfr.downloader - DEBUG] - Attempting to download submission xdi48h
[2022-09-19 00:18:04,075 - bdfr.downloader - DEBUG] - Using YtdlpFallback with url https://v.redd.it/aq4adlbbuon91
[2022-09-19 00:18:12,746 - bdfr.downloader - ERROR] - [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4'
Traceback (most recent call last):
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/downloader.py", line 110, in _download_submission
    with open(destination, 'wb') as file:
OSError: [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4'
[2022-09-19 00:18:12,746 - bdfr.downloader - ERROR] - Failed to write file in submission xdi48h to /media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4: [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.mp4'
[2022-09-19 00:18:12,747 - bdfr.archive_entry.submission_archive_entry - DEBUG] - Retrieving full comment tree for submission xdi48h
[2022-09-19 00:18:12,751 - root - ERROR] - Scraper exited unexpectedly
Traceback (most recent call last):
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/__main__.py", line 120, in cli_clone
    reddit_scraper.download()
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/cloner.py", line 21, in download
    self.write_entry(submission)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 75, in write_entry
    self._write_entry_json(archive_entry)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 87, in _write_entry_json
    self._write_content_to_disk(resource, content)
  File "/home/anon/.local/lib/python3.10/site-packages/bdfr/archiver.py", line 102, in _write_content_to_disk
    with open(file_path, 'w', encoding="utf-8") as file:
OSError: [Errno 22] Invalid argument: '/media/anon/INTENSO/197/Cookieman077_fun :3_xdi48h.json'
Fakeaccount12312 added the bug label Sep 18, 2022

@Fakeaccount12312
Author

I am pretty sure this is already handled on Windows in some way, but I currently don't have access to a Windows machine to check.

@Serene-Arc
Collaborator

We do have logic to screen out invalid characters in file names, though I have no clue what is in that filename that is tripping Windows up.

@Serene-Arc
Collaborator

I didn't notice the first time, but upon rereading the bug report, it seems that the error comes from using Linux to download the files to an exFAT filesystem. Linux is much looser about the characters it allows in file names, but exFAT is a Microsoft creation that conforms to Windows rules. You're essentially tricking the BDFR into believing that it should follow Linux naming conventions, and then it errors out when the filesystem says no.

The erroring out I can address but the other issue isn't really something that can be easily fixed, and certainly not automatically. I don't know of a way to gather the information on the filesystem that is being written to, especially since there are options like SAMBA, NFS, USBs, disks, and a whole bunch of other network protocols for filesharing that may or may not expose the underlying filesystem to the user or our queries.

Maybe being able to force the Windows naming conventions in the configuration file would be a suitable workaround? I'm not sure.
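
As a rough sketch of what such a workaround could look like: the naming scheme is normally guessed from the host OS, and a user-supplied value overrides the guess. The function names and the `forced_scheme` parameter are hypothetical and are not taken from BDFR's code or configuration format.

```python
import platform
import re
from typing import Optional

# Characters disallowed under Windows naming rules (these also apply to exFAT).
WINDOWS_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
# Linux-native filesystems only refuse the path separator and NUL.
LINUX_INVALID = re.compile(r'[/\x00]')


def pick_scheme(forced_scheme: Optional[str] = None) -> str:
    """Use the forced scheme if one is configured, else guess from the host OS."""
    if forced_scheme is not None:
        if forced_scheme not in ("windows", "linux"):
            raise ValueError(f"Unknown naming scheme: {forced_scheme}")
        return forced_scheme
    return "windows" if platform.system() == "Windows" else "linux"


def format_file_name(name: str, forced_scheme: Optional[str] = None) -> str:
    """Strip the characters that the selected scheme does not allow."""
    scheme = pick_scheme(forced_scheme)
    pattern = WINDOWS_INVALID if scheme == "windows" else LINUX_INVALID
    return pattern.sub("", name)
```

With the scheme forced to `"windows"`, a name like `fun :3_xdi48h.mp4` loses its colon even on a Linux host, which is what writing to exFAT or sharing over SMB needs.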

@Botts85
Contributor

Botts85 commented Jan 3, 2023

Woah, thanks for the heads up on that @Serene-Arc.

I had been noticing files with incorrect names and was flabbergasted as to why they were downloading that way while the BDFR logs stayed silent about it.

It turns out that smbutil was transparently renaming them to macOS-friendly names. Thus, when I listed the directory in macOS zsh, it would show the wrong names. So tonight I SSH'd into the Linux VM that I run BDFR on, listed the directory there, and the names are perfect.

Providing an option to force the Windows naming in the configuration file would likely rectify the situation for any users accessing BDFR's results over SMB.

@Fakeaccount12312
Author

Fakeaccount12312 commented Jun 11, 2023

Thanks for adding an option to fix this issue, really appreciate your work.
