Encode error #12

Closed
limpbrains opened this Issue Mar 16, 2012 · 20 comments

I can't run srt on this file: http://dl.dropbox.com/u/1788271/Bones.S07E01.HDTVRip.srt
It is encoded in cp1251.
I get the following error:

Traceback (most recent call last):
  File "/usr/local/bin/srt", line 9, in <module>
    load_entry_point('pysrt==0.4.1', 'console_scripts', 'srt')()
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 190, in main
    SubRipShifter().run(sys.argv[1:])
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 118, in run
    self.arguments.action()
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 164, in break_lines
    self.input_file.break_lines(self.arguments.length)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/commands.py", line 177, in input_file
    encoding=encoding, error_handling=SubRipFile.ERROR_LOG)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtfile.py", line 131, in open
    new_file.read(source_file, error_handling=error_handling)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtfile.py", line 159, in read
    self.extend(self.stream(source_file, error_handling=error_handling))
  File "/usr/lib/python2.7/UserList.py", line 88, in extend
    self.data.extend(other)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtfile.py", line 190, in stream
    yield SubRipItem.from_lines(source)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtitem.py", line 79, in from_lines
    return cls(index, start, end, body, position)
  File "/usr/local/lib/python2.7/dist-packages/pysrt/srtitem.py", line 21, in __init__
    self.index = int(index)
UnicodeEncodeError: 'decimal' codec can't encode character u'\ufeff' in position 0: invalid decimal Unicode string
Owner

byroot commented Mar 17, 2012

Strange, I'm able to shift it without encoding error.

srt shift 20 russian.srt

Can you paste the whole command you typed ?

byroot was assigned Mar 17, 2012

Owner

byroot commented Apr 11, 2012

Well, a month without a reply, so I'm closing this issue.

Feel free to reopen it if you still have a problem.

byroot closed this Apr 11, 2012

Hi, sorry for the slow response.

srt shift 40s 33.srt
Traceback (most recent call last):
  File "/usr/local/bin/srt", line 9, in <module>
    load_entry_point('pysrt==0.4.1', 'console_scripts', 'srt')()
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/commands.py", line 192, in main
    SubRipShifter().run(sys.argv[1:])
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/commands.py", line 118, in run
    self.arguments.action()
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/commands.py", line 136, in shift
    self.input_file.shift(milliseconds=self.arguments.time_offset)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/commands.py", line 179, in input_file
    encoding=encoding, error_handling=SubRipFile.ERROR_LOG)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/srtfile.py", line 127, in open
    new_file.read(source_file, error_handling=error_handling)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/srtfile.py", line 155, in read
    self.extend(self.stream(source_file, error_handling=error_handling))
  File "/usr/lib/python2.7/UserList.py", line 88, in extend
    self.data.extend(other)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/srtfile.py", line 186, in stream
    yield SubRipItem.from_lines(source)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/srtitem.py", line 58, in from_lines
    return cls(index, start, end, body, position)
  File "/data/share/_films/Game of Thrones_S02E02/src/pysrt/pysrt/srtitem.py", line 21, in __init__
    self.index = int(index)
UnicodeEncodeError: 'decimal' codec can't encode character u'\ufeff' in position 0: invalid decimal Unicode string

python -V
Python 2.7.2+

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 11.10
Release: 11.10
Codename: oneiric

Owner

byroot commented Apr 16, 2012

Hmm, very strange... so it always happens, whatever the subtitle file?

And how did you install it? Because /data/share/_films/Game of Thrones_S02E02/src/ is a very strange location...

I've only tried it on a few files, all Russian, all UTF-8.
I installed it from git:
pip install -e git+https://github.com/byroot/pysrt.git#egg=pysrt

byroot reopened this Apr 16, 2012

Owner

byroot commented Apr 16, 2012

Ok, I still can't reproduce but now I'm almost sure that it's a BOM issue...

I will ask a friend on ubuntu to test that

Did you try the version released on PyPI?
pip install --upgrade pysrt

I confirm it is a BOM issue.
I successfully edited a file without a BOM, created with Notepad++.
I also tried the following command:
srt -e utf_8_sig ...
but it failed with the same error.
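
For reference, here is a minimal sketch of what the two codecs do to a BOM at the Python level. It is written for Python 3, the byte string is made up for illustration, and it says nothing about how pysrt itself handles the -e option.

```python
import codecs

# A made-up, tiny subtitle payload that starts with a UTF-8 byte order mark.
raw = codecs.BOM_UTF8 + b"1\r\n00:00:01,677 --> 00:00:04,145\r\nHello\r\n"

# The plain "utf-8" codec keeps the BOM as the character U+FEFF.
plain = raw.decode("utf-8")

# The "utf-8-sig" codec consumes a leading BOM while decoding.
sig = raw.decode("utf-8-sig")

assert plain.startswith("\ufeff")
assert sig.startswith("1")
```

So if the error persists with utf_8_sig, one plausible reading is that the requested encoding is not the one actually used to decode the file.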

Owner

byroot commented Apr 17, 2012

Pysrt is supposed to handle BOM correctly...

And the file you gave me is in cp1252, so why would it have a UTF-8 BOM?
Can you send me another file?

Diaoul commented Dec 9, 2012

I'm having the same issue
File is here: https://docs.google.com/open?id=0B2q9iBGZdj6qN29uUzBBQXNJM2c

byroot closed this in f780a06 Dec 14, 2012

Owner

byroot commented Dec 14, 2012

I finally found the issue, it was because chardet returned "UTF-8" and the encodings module was only aware of "utf-8".

My bad ...
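
A minimal sketch of that kind of mismatch, not the actual pysrt code: the table BOMS_BY_ENCODING and the helpers canonical_name and bom_for are hypothetical names chosen for illustration. It assumes a lookup keyed by lowercase codec names and normalizes the detector's answer with codecs.lookup before using it (Python 3).

```python
import codecs

# Hypothetical table keyed by canonical (lowercase) codec names.
BOMS_BY_ENCODING = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
}


def canonical_name(encoding):
    """Return Python's canonical name for an encoding label such as 'UTF-8'."""
    return codecs.lookup(encoding).name


def bom_for(encoding):
    """Return the BOM bytes for an encoding, or None if the table has none."""
    return BOMS_BY_ENCODING.get(canonical_name(encoding))


# A detector may report "UTF-8"; a naive lookup with that label misses.
assert BOMS_BY_ENCODING.get("UTF-8") is None
# Normalizing the label first finds the entry.
assert bom_for("UTF-8") == codecs.BOM_UTF8
```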

Diaoul commented Jan 13, 2013

Is this fixed in 0.4.4? Because I still have this error

Owner

byroot commented Jan 13, 2013

I think so. Do you still have the issue with the same file and pysrt 0.4.4?

Owner

byroot commented Jan 13, 2013

Oh shit ... confirmed, I'll fix that right now.

Owner

byroot commented Jan 13, 2013

Oh, I just forgot to release ...

Owner

byroot commented Jan 13, 2013

0.4.5 released with the fix.

Diaoul commented Jan 13, 2013

Thanks, that was fast :)

Diaoul commented Jan 13, 2013

I'm still having an error 😢
I added a print statement to see what's in lines here and I got this:

[u'\ufeff1\r\n', u'00:00:01,677 --> 00:00:04,145\r\n', u'Alors, sur quel genre de croisi\xe8re\r\n', u'allez-vous embarquer ?\r\n']

Diaoul commented Jan 13, 2013

Of course int(u'\ufeff1\r\n') fails
File can be downloaded on Addic7ed
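
The failure can be reproduced without pysrt. This is a small sketch for Python 3, where int() raises a plain ValueError instead of the UnicodeEncodeError seen under Python 2.7 in the tracebacks above; the helper name try_parse_index is made up for illustration.

```python
def try_parse_index(line):
    """Return the integer index parsed from a line, or None if int() rejects it."""
    try:
        return int(line)
    except ValueError:
        return None


# First line as printed above: a BOM (U+FEFF) glued to the subtitle index.
with_bom = "\ufeff1\r\n"
without_bom = "1\r\n"

# int() ignores the surrounding whitespace but not the BOM character.
assert try_parse_index(with_bom) is None
assert try_parse_index(without_bom) == 1
```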

Diaoul commented Jan 13, 2013

Sample code to reproduce the error:

from charade.universaldetector import UniversalDetector
import codecs
import pysrt

def is_valid_subtitle(path):
    u = UniversalDetector()
    for line in open(path, 'rb'):
        u.feed(line)
    u.close()
    encoding = u.result['encoding']
    source_file = codecs.open(path, 'rU', encoding=encoding, errors='replace')
    try:
        for _ in pysrt.SubRipFile.stream(source_file, error_handling=pysrt.SubRipFile.ERROR_RAISE):
            pass
    except pysrt.Error as e:
        if e.args[0] < 50:  # Error occurs within the 50 first lines
            return False
#    except UnicodeEncodeError:  # Workaround for https://github.com/byroot/pysrt/issues/12
#        pass
    return True
Owner

byroot commented Jan 13, 2013

Oh! It makes sense now. If you open the file yourself, pysrt does not strip the BOM.

Anyway chardet is integrated inside pysrt now.

Try something like:

def is_valid_subtitle(path):
    source_file = pysrt.SubRipFile._open_unicode_file(path)
    try:
        for _ in pysrt.SubRipFile.stream(source_file, error_handling=pysrt.SubRipFile.ERROR_RAISE):
            pass
    except pysrt.Error as e:
        if e.args[0] < 50:  # Error occurs within the 50 first lines
            return False
#    except UnicodeEncodeError:  # Workaround for https://github.com/byroot/pysrt/issues/12
#        pass
    return True
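
If you would rather keep opening the file yourself, another option is to drop the BOM before the lines reach the parser. A minimal sketch for Python 3: strip_bom is a hypothetical helper, not part of pysrt, and it assumes the consumer only iterates over the lines it is given.

```python
BOM = "\ufeff"


def strip_bom(lines):
    """Yield lines unchanged, except that a leading BOM is removed from the first one."""
    for number, line in enumerate(lines):
        if number == 0 and line.startswith(BOM):
            line = line[len(BOM):]
        yield line


# The lines printed earlier in this thread, as Python 3 strings.
lines = [
    "\ufeff1\r\n",
    "00:00:01,677 --> 00:00:04,145\r\n",
    "Alors, sur quel genre de croisi\xe8re\r\n",
    "allez-vous embarquer ?\r\n",
]

cleaned = list(strip_bom(lines))
assert int(cleaned[0]) == 1
```

Assuming SubRipFile.stream accepts any iterable of lines, the generator could be wrapped around source_file before it is passed in; I have not checked that against every pysrt version.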