Subtitles validation #23

Open
Diaoul opened this issue Dec 9, 2012 · 4 comments

Diaoul commented Dec 9, 2012

I am trying to validate subtitles with this:

import codecs
import pysrt
from charade.universaldetector import UniversalDetector


def is_valid_subtitle(path):
    # Detect the encoding from the raw bytes of the file.
    u = UniversalDetector()
    with open(path, 'rb') as raw_file:
        for line in raw_file:
            u.feed(line)
    u.close()
    # Detection can return None; falling back to UTF-8 is an assumption.
    encoding = u.result['encoding'] or 'utf-8'
    # Undecodable bytes are replaced so that only structural errors are reported.
    with codecs.open(path, 'r', encoding=encoding, errors='replace') as source_file:
        try:
            for _ in pysrt.SubRipFile.stream(source_file, error_handling=pysrt.SubRipFile.ERROR_RAISE):
                pass
        except pysrt.Error:
            return False
        except UnicodeEncodeError:  # Workaround for https://github.com/byroot/pysrt/issues/12
            pass
    return True

Unfortunately, it fails for some subtitles even though the file is a valid subtitle. For example, this one: https://docs.google.com/open?id=0B2q9iBGZdj6qOXZrbFpiV2ozOHc
I think there should be different kinds of InvalidItem error. It could be subclassed to raise, in this case, an EmptyText error.

That said, I'm not sure this should raise an error at all: an empty text doesn't mean the item is invalid, only that it has no text.
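
To illustrate the subclassing idea, here is a minimal standalone sketch. It does not use pysrt itself: `Error` and `InvalidItem` are stand-ins for the library's classes, while `EmptyText`, `check_item`, and the `allow_empty_text` flag are hypothetical names made up for this example.

```python
class Error(Exception):
    """Stand-in for the library's base exception."""


class InvalidItem(Error):
    """Stand-in for the error raised when an item cannot be parsed."""


class EmptyText(InvalidItem):
    """Hypothetical subclass: the item parsed fine but its text is empty."""


def check_item(index, timing, text):
    # A real parser checks much more; this only distinguishes the two cases.
    if "-->" not in timing:
        raise InvalidItem("item %d: malformed timing line" % index)
    if not text.strip():
        raise EmptyText("item %d: empty text" % index)


def is_valid(items, allow_empty_text=True):
    for index, timing, text in items:
        try:
            check_item(index, timing, text)
        except EmptyText:
            # Must be caught before InvalidItem, since it is a subclass of it.
            if not allow_empty_text:
                return False
        except InvalidItem:
            return False
    return True
```

With a dedicated subclass, a caller can tolerate empty text while still rejecting malformed items, and existing code that catches `InvalidItem` keeps working unchanged.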

Author

Diaoul commented Dec 9, 2012

A convenience method in pysrt to check for valid subtitle files would be welcome.

Owner

byroot commented Dec 14, 2012

I'm not sure pysrt should consider an empty text an error.
All I can say right now is that it's not intended behavior; in fact, I never thought about that possibility.

I think it's reasonable to consider them valid, unless you know of some players that fail to parse them?

Owner

byroot commented Dec 14, 2012

I'm also OK with implementing something like pysrt.validate(path, encoding=None), but I'm not sure what the best behavior is:

  • Should I just return a boolean, or the error list?
  • If I fail to parse because of an encoding error, should I raise or consider the file invalid?

Your feedback on this one is welcome.
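
For discussion, the two questions above can be sketched as follows. This is a standalone approximation, not pysrt code: it works on raw bytes rather than a path, its structure check is far looser than a real SubRip parser, and `raise_on_encoding_error`, `TIMESTAMP_LINE`, and the `(block_number, message)` error format are all hypothetical choices.

```python
import re

# Loose pattern for "HH:MM:SS,mmm --> HH:MM:SS,mmm" (an approximation).
TIMESTAMP_LINE = re.compile(
    r"^\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}\s*-->\s*\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}"
)


def validate(data, encoding="utf-8", raise_on_encoding_error=True):
    """Return a list of (block_number, message); an empty list means valid."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError as exc:
        if raise_on_encoding_error:
            raise
        # Alternative policy: report the encoding problem as an invalid file.
        return [(0, "encoding error: %s" % exc)]

    errors = []
    text = text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    # Items are separated by one or more blank lines.
    for number, block in enumerate(re.split(r"\n\s*\n", text), start=1):
        lines = block.split("\n")
        if len(lines) < 2 or not lines[0].strip().isdigit():
            errors.append((number, "missing or non-numeric index"))
        elif not TIMESTAMP_LINE.match(lines[1].strip()):
            errors.append((number, "malformed timestamp line"))
    return errors


def is_valid(data, encoding="utf-8"):
    # The boolean form is just "no errors were collected".
    return not validate(data, encoding, raise_on_encoding_error=False)
```

Returning the error list costs little, since a boolean helper can be layered on top of it, whereas the reverse is not possible.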

Author

Diaoul commented Dec 14, 2012

My personal preference is pysrt.is_valid(path, ignore_encoding_errors=True), which checks for subtitle file errors. When there are encoding issues, most readers will just display unreadable characters but will read the file anyway, hence ignore_encoding_errors.
I think it's important to separate structure validation from encoding issues.

I wouldn't use the error list, but I agree it could be useful for some.
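
A rough sketch of what separating the two concerns could look like. It is not pysrt code: it takes raw bytes instead of a path, and `check_structure` is a hypothetical callable standing in for whatever structural check the library would run on the decoded text.

```python
def is_valid(data, check_structure, encoding="utf-8", ignore_encoding_errors=True):
    """Decode first, then run the structural check on the decoded text."""
    # "replace" swaps undecodable bytes for U+FFFD instead of raising, which
    # mirrors readers that show unreadable characters but still play the file.
    mode = "replace" if ignore_encoding_errors else "strict"
    try:
        text = data.decode(encoding, mode)
    except UnicodeDecodeError:
        # Only reachable in strict mode: the encoding problem makes it invalid.
        return False
    return check_structure(text)
```

Because decoding happens before the structural check, a badly encoded file only fails when the caller explicitly asks for strict decoding.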
