Apostrophe error #411

Closed
ceonelson opened this issue Apr 22, 2022 · 6 comments · Fixed by #416
Labels
bug Something isn't working

Comments

@ceonelson
Contributor

Win10 / 2.4

If I check a video, the name shows correctly in Tartube:

image

python3 D:_YT\yt-dlp-20220408 --newline -i --hls-prefer-native --write-description --write-info-json --write-annotations --cookies D:/_ytt/cookies.txt --write-thumbnail --merge-output-format mkv --write-sub --embed-thumbnail --add-metadata --windows-filenames --convert-thumbnails jpg --sub-lang en --output D:/_ytt/Comedy/%(uploader)s - (%(upload_date)s) - %(title)s - %(id)s - [%(format_id)s#%(height)sp].%(ext)s --get-comments --extractor-args youtube:comment_sort=top -f bestvideo[ext=webm][height<=?480][fps<=?30]+bestaudio[ext=webm]/bestvideo[height<=?480][fps<=?30]+bestaudio/best --dump-json --download-archive D:/_ytt/Comedy/ytdl-archive.txt https://www.youtube.com/watch?v=GVtEzGZP-_s
[Comedy] <Simulated download of: 'Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p]'>

But upon downloading it, the apostrophe gets messed up:

image

python3 D:_YT\yt-dlp-20220408 --newline -i --hls-prefer-native --write-description --write-info-json --write-annotations --cookies D:/_ytt/cookies.txt --write-thumbnail --merge-output-format mkv --write-sub --embed-thumbnail --add-metadata --windows-filenames --convert-thumbnails jpg --sub-lang en --output D:/_ytt/Comedy/%(uploader)s - (%(upload_date)s) - %(title)s - %(id)s - [%(format_id)s#%(height)sp].%(ext)s --get-comments --extractor-args youtube:comment_sort=top -f bestvideo[ext=webm][height<=?480][fps<=?30]+bestaudio[ext=webm]/bestvideo[height<=?480][fps<=?30]+bestaudio/best --download-archive D:/_ytt/Comedy/ytdl-archive.txt https://www.youtube.com/watch?v=GVtEzGZP-_s
[youtube] GVtEzGZP-_s: Downloading webpage
[youtube] GVtEzGZP-_s: Downloading android player API JSON
[youtube] Downloading comment section API JSON
[youtube] Downloading ~140 comments
[youtube] Sorting comments by top comments
[youtube] Downloading comment API JSON page 1 (0/140)
[youtube] Downloading comment API JSON reply thread 1 (6/140)
[youtube] Downloading comment API JSON reply thread 2 (9/140)
[youtube] Downloading comment API JSON reply thread 3 (15/140)
[youtube] Downloading comment API JSON reply thread 4 (23/140)
[youtube] Downloading comment replies API JSON page 1 (33/140)
[youtube] Downloading comment API JSON page 2 (47/140)
[youtube] Downloading comment API JSON reply thread 1 (49/140)
[youtube] Downloading comment replies API JSON page 1 (59/140)
[youtube] Downloading comment API JSON reply thread 2 (79/140)
[youtube] Downloading comment API JSON reply thread 3 (82/140)
[youtube] Downloading comment API JSON page 3 (96/140)
[youtube] Downloading comment API JSON reply thread 1 (104/140)
[youtube] Downloading comment API JSON reply thread 2 (109/140)
[youtube] Downloading comment API JSON reply thread 3 (111/140)
[youtube] Downloading comment API JSON reply thread 4 (113/140)
[youtube] Extracted 114 comments
[info] GVtEzGZP-_s: Downloading 1 format(s): 244+251
[info] Writing video description to: D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].description
[info] Downloading video thumbnail 41 ...
[info] Writing video thumbnail 41 to: D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].webp
[info] Writing video metadata as JSON to: D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].info.json
WARNING: There are no annotations to write.
[ThumbnailsConvertor] Converting thumbnail "D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].webp" to jpg
Deleting original file D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].webp (pass -k to keep)
[download] Destination: D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].f244.webm
[download] 100% of 125.51MiB in 00:27
[download] Destination: D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].f251.webm
[download] 100% of 42.52MiB in 00:07
[Merger] Merging formats into "D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].mkv"
Deleting original file D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].f244.webm (pass -k to keep)
Deleting original file D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].f251.webm (pass -k to keep)
[Metadata] Adding metadata to "D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].mkv"
[EmbedThumbnail] ffmpeg: Adding thumbnail to "D:/_ytt/Comedy/Dinesh D'Souza - (20220421) - UNMASKED Dinesh D’Souza Podcast Ep315 - GVtEzGZP-_s - [244+251#480p].mkv"

Tried both with and without --windows-filenames, same result.

Any ideas? Thanks Axcore!

@ceonelson ceonelson added the bug Something isn't working label Apr 22, 2022
@axcore
Owner

axcore commented Apr 26, 2022

Tartube stores two names for every video: a name matching the video's filename, and a 'nickname' taken from the video's metadata.

Open the video's properties window. The name is at the top; the nickname is the one in the **Listed as** box. The nickname is also the one visible in your screenshots.

I checked that video's metadata, and it contains a character which is not supposed to be used as an apostrophe ( ’ ). Check it for yourself here.

The nickname is just for aesthetics; it doesn't affect Tartube operations or your filesystem. As far as I can tell, the text is being rendered correctly. You can write to the video's author, if you like, and politely suggest that they learn how to use their keyboard.

@ceonelson
Contributor Author

Why is it that it shows correctly when checking the video but doesn't once the video is downloaded?

In the JSON for that file, it gets correctly rendered by VSCode as an apostrophe (U+2019):
image

Also, it looks like it used to render correctly previously in Tartube:
image

(on a side note, that check script link reports as broken for the character search):
image

@ceonelson
Contributor Author

Hmm, so I pulled up the JSON for the last one that rendered correctly, and it appears that instead of storing the character literally in the file, it stores it as the escape sequence \u2019:
image

Any idea why that would be happening?
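
For reference, both forms are valid JSON for the same string. Here is a minimal sketch using only Python's standard `json` module (this is not yt-dlp's actual writer code, and the `title` value is just a shortened stand-in) showing how one string can serialize either way depending on `ensure_ascii`:

```python
import json

# Shortened stand-in for the video title; U+2019 is the curly apostrophe.
title = "D\u2019Souza"

# Default behaviour: non-ASCII characters are written as \uXXXX escapes,
# so the file is pure ASCII and the encoding used to read it hardly matters.
escaped = json.dumps(title)

# With ensure_ascii=False the character is written literally, so the file
# must be read back with the same encoding it was written in.
literal = json.dumps(title, ensure_ascii=False)

print(escaped)
print(json.loads(escaped) == json.loads(literal))
```

Either form decodes to the same string; the difference only shows up when the file containing the literal character is read with the wrong encoding.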

@ceonelson
Contributor Author

So I think I've figured out the root cause of this issue, and it's another change from yt-dlp's 20211227 release that seems to have broken things.

I went to my test db and reverted to that release of yt-dlp, and sure enough everything was fine in the name and description:
image

So I figured I'd work my way back up to the current yt-dlp release and see where things break, but I didn't have to go far as the January release no longer displayed correctly:
image

I went to yt-dlp and searched through all the issues for unicode and found the culprit:
yt-dlp/yt-dlp#2139

Which inspired this change to the code that, you guessed it, was made live in the January release:
yt-dlp/yt-dlp@45d86ab

Since this is the default behavior of yt-dlp moving forward, is it possible to have Tartube read the description and info.json files as Unicode so that the characters display correctly?

Thanks Axcore!

@ceonelson
Contributor Author

Spent a couple more hours playing around with this tonight, and it looks like the issue is that the JSON/description is being encoded to UTF-8 twice (first by yt-dlp, then by Tartube).

(I'm testing on the description file because that was the easiest for me to figure out how to hack a test for)
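
To illustrate what "encoded twice" means here, a minimal standalone sketch (plain Python, not Tartube code; the sample string is a shortened stand-in). It assumes the UTF-8 bytes are at some point decoded with cp1252 and then encoded to UTF-8 again:

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the character from the video title.
original = "D\u2019Souza"

# Step 1: the text is written to disk as UTF-8 (three bytes for U+2019).
utf8_bytes = original.encode("utf-8")

# Step 2: those bytes are decoded with the wrong codec, cp1252.
# Each of the three bytes becomes its own character.
misread = utf8_bytes.decode("cp1252")

# Step 3: the misread text is encoded to UTF-8 again ("double encoding").
double_encoded = misread.encode("utf-8")

# Undoing it: reverse the wrong decode, then decode properly.
repaired = double_encoded.decode("utf-8").encode("cp1252").decode("utf-8")

# ascii() keeps the output readable on any console encoding.
print(ascii(misread))
print(repaired == original)
```

The misread string is what shows up as `â€™` in place of the apostrophe.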

With the latest Tartube and 2022-04-08 yt-dlp, this is how the filename and description are displayed:
image

But if I change downloads.py#L4529 to not encode as UTF-8, it displays correctly on my system with an encoding detected as cp1252:
image

Or if I leave downloads.py alone, but change utils.py#L2855 to force the system encoding to UTF-8, it will also display correctly (but it breaks the filenames on Windows that way):
image
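
The "detected as cp1252" part can be checked outside Tartube. A minimal sketch, assuming the detection boils down to Python's locale default (I have not confirmed that this is what utils.py does; the helper name is made up):

```python
import codecs
import locale


def default_text_encoding() -> str:
    """Return the normalized name of the locale's preferred text encoding.

    This is the encoding open() has traditionally used when no
    encoding= argument is given.
    """
    name = locale.getpreferredencoding(False)
    # codecs.lookup() normalizes aliases, e.g. "UTF-8" -> "utf-8".
    return codecs.lookup(name).name


print(default_text_encoding())
```

On a Western-locale Windows install this typically prints `cp1252`, while most Linux and macOS systems report `utf-8`, which would explain why the problem only shows up on some machines.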

[an hour later]
Great success! (I hope)
Changing only [files.py line 86](https://github.com/axcore/tartube/blob/master/tartube/files.py#L86) and line 121 to force reading as UTF-8 seemed to do the trick and fix all the problems!
image
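
The idea behind that change, as a standalone sketch (this is not the actual code from files.py; the helper name `read_text` and the temp file are made up for illustration). A file written as UTF-8 is read once with an explicit `encoding="utf-8"` and once with cp1252, which is what the Windows default would be on my system:

```python
import os
import tempfile


def read_text(path: str, encoding: str) -> str:
    """Read a whole text file using an explicitly chosen encoding."""
    with open(path, "r", encoding=encoding) as fh:
        return fh.read()


description = "UNMASKED Dinesh D\u2019Souza Podcast"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "video.description")

    # Stand-in for yt-dlp writing the description file as UTF-8.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(description)

    good_text = read_text(path, "utf-8")   # forced UTF-8: round-trips
    bad_text = read_text(path, "cp1252")   # locale default: mojibake

print(good_text == description)
print(ascii(bad_text))
```

Passing the encoding explicitly makes the result independent of the system locale, so the same file reads the same way on Windows and Linux.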

Maybe this is the proper fix? Any thoughts on if it will break something else?

I'm going to submit a PR with that change in hopes it will work or be a base for a fix :)

Thanks Axcore!

@axcore
Owner

axcore commented May 5, 2022

Nice work @ceonelson, I think I would not have solved that without you.
