UnicodeDecodeError on accented characters #61

Closed
hdb opened this issue Feb 10, 2019 · 1 comment

hdb commented Feb 10, 2019

I am trying to run videogrep on a video that is mostly in English but occasionally contains brief lines in Spanish. Subtitles that contain non-English characters appear to raise a Unicode decode error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 19669: invalid continuation byte

A manual workaround is to find and replace the accented characters in the subtitle track with unaccented ones, but could this be handled programmatically, without altering the original subtitle file? I'm not sure how common it is for English-language subtitles to include correct non-English accent marks.
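
One possible reading of the error: byte 0xed is "í" in Latin-1 (and Windows-1252), so the subtitle file may simply be saved in a legacy encoding rather than UTF-8. Under that assumption, a minimal sketch of a programmatic fix is to try UTF-8 first and fall back to Latin-1 when decoding, leaving the file on disk untouched. The names `decode_subtitles` and `read_subtitles` are hypothetical helpers for illustration, not part of videogrep.

```python
from pathlib import Path


def decode_subtitles(raw, encodings=("utf-8", "latin-1")):
    """Decode subtitle bytes, trying each candidate encoding in order.

    The encoding list is an assumption: Latin-1 is a guess based on the
    0xed byte in the traceback, not something the file declares.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


def read_subtitles(path):
    """Read a subtitle file as text without modifying it on disk."""
    return decode_subtitles(Path(path).read_bytes())
```

For example, `decode_subtitles(b"Aqu\xed")` falls through to Latin-1 and keeps the accent, so nothing has to be stripped from the text.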

@bberenberg

I tried to work around this by running:
find . -type f -name '*.vtt' -print -exec iconv -c -f utf-8 -t ascii {} -o {} \;
but that didn't resolve the issue, which makes me think I'm overlooking something.

Also, with 80+ files, the lack of error handling means that I have no idea which file is causing the issue.
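
To narrow down which file is at fault, a small script can attempt a strict UTF-8 decode of every subtitle file and report the ones that fail, along with the offending byte offset. This is only a sketch: `first_bad_byte` and `find_undecodable` are hypothetical names, and the `*.vtt` pattern is taken from the command above.

```python
from pathlib import Path


def first_bad_byte(raw, encoding="utf-8"):
    """Return (offset, byte value) of the first undecodable byte, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start, raw[err.start]
    return None


def find_undecodable(root=".", pattern="*.vtt"):
    """List (path, offset, byte value) for each file that is not valid UTF-8."""
    problems = []
    for path in sorted(Path(root).rglob(pattern)):
        bad = first_bad_byte(path.read_bytes())
        if bad is not None:
            problems.append((path, bad[0], bad[1]))
    return problems


if __name__ == "__main__":
    for path, offset, value in find_undecodable():
        print(f"{path}: byte 0x{value:02x} at position {offset}")
```

If the reported files turn out to be in a legacy encoding such as Latin-1, that would also explain why converting with `-f utf-8` had no effect: the input would not be valid UTF-8 to begin with.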
