CJK characters support for RTF parse #15

Closed
darkranger-red opened this Issue Oct 5, 2012 · 5 comments


@darkranger-red

Hello guys,

CJK means Chinese, Japanese, and Korean. Many older RTF writers don't store these characters as Unicode escapes, so using pyth to read CJK text from such documents raises "UnicodeDecodeError", because the CJK codecs actually encode each character as 4 hex digits (two \'xx escapes), not 2.
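To make the failure concrete, here is a minimal sketch (my own illustration, assuming cp936/GBK and the byte pair 0xCE 0xD2, which encodes a single Simplified Chinese character): decoding one byte at a time fails, while decoding the pair together works.

```python
# A pre-Unicode RTF file encodes ONE double-byte character as TWO
# \'xx escapes. Decoding each byte alone is a truncated sequence.
try:
    b"\xce".decode("cp936")          # lead byte alone: incomplete
except UnicodeDecodeError as exc:
    print("single byte fails:", exc.reason)

char = b"\xce\xd2".decode("cp936")   # both bytes decode to one character
print(len(char))
```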

I modified plugins/rtf15/reader.py to meet my own needs, but I hope someone can write better code to deal with this issue.

1) Add this import first:

from binascii import unhexlify

2) Add codepage number 936:

# All the ones named by number in my 2.6 encodings dir
_CODEPAGES_BY_NUMBER = dict(
    (x, "cp%s" % x) for x in (37, 1006, 1026, 1140, 1250, 1251, 1252, 1253, 1254, 1255,
                              1256, 1257, 1258, 424, 437, 500, 737, 775, 850, 852, 855,
                              856, 857, 860, 861, 862, 863, 864, 865, 866, 869, 874,
                              875, 932, 936, 949, 950))

3) Change errors to 'ignore':

def read(self, source, errors='ignore'):

4) After reading the first two hex digits, check whether a second \'xx escape follows:

            if next == "'":
                # ANSI escape, takes two hex digits
                chars.extend("ansi_escape")
                digits.extend(self.source.read(2))

                # For some Asian charsets, a double-byte character takes
                # two more hex digits in a second \'xx escape.
                # cp932: Japanese, cp936: Simplified Chinese,
                # cp949: Korean, cp950: Traditional Chinese
                if self.charset in ("cp932", "cp936", "cp949", "cp950"):
                    if self.source.read(2) == "\\'":
                        digits.extend(self.source.read(2))

                break
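Note that the patch above consumes its two-character lookahead even when it is not a `\'` escape, which loses text unless the stream is rewound. A hypothetical seek-based variant (the helper name `read_hex_escape` is mine, not pyth's) avoids that:

```python
import io

def read_hex_escape(src, double_byte):
    """Read the hex digits of a \\'xx escape whose backslash-quote has
    already been consumed; optionally absorb a second \\'yy escape."""
    digits = src.read(2)
    if double_byte:
        # Peek at the next two characters; a following \' marks the
        # second byte of a double-byte character.
        pos = src.tell()
        if src.read(2) == "\\'":
            digits += src.read(2)
        else:
            src.seek(pos)  # not an escape: rewind, keep the text intact
    return digits

# Stream positioned just after the first \' of \'ce\'d2:
src = io.StringIO("ce\\'d2 rest")
print(read_hex_escape(src, double_byte=True))
```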
    def handle_ansi_escape(self, code):
        cjk = code
        code = int(code, 16)

        if isinstance(self.charset, dict):
            uni_code = self.charset.get(code)
            if uni_code is None:
                char = u'?'
            else:
                char = unichr(uni_code)
        else:
            if code <= 255:
                # Single byte: decode it directly
                char = chr(code).decode(self.charset, self.reader.errors)
            else:
                # Double-byte character: decode both bytes together
                char = unhexlify(cjk).decode(self.charset, self.reader.errors)

        self.content.append(char)
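The decoding branch of that patch can be sketched as a standalone Python 3 function (the name `decode_ansi_escape` is hypothetical, not pyth's API): two hex digits decode as one byte, four decode as one double-byte character.

```python
from binascii import unhexlify

def decode_ansi_escape(hexdigits, charset, errors="ignore"):
    # Mirrors the patch: a value <= 255 is a single-byte character;
    # anything larger is the 4-hex-digit form of a double-byte one.
    code = int(hexdigits, 16)
    if code <= 255:
        return bytes([code]).decode(charset, errors)
    return unhexlify(hexdigits).decode(charset, errors)

print(decode_ansi_escape("41", "cp936"))    # ASCII survives unchanged
print(decode_ansi_escape("ced2", "cp936"))  # one CJK character
```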
@brendonh
Owner
brendonh commented Oct 5, 2012

This is definitely something I'd like to support, but I'm not sure how (if at all!) it's covered by the RTF specs. Can you give me a couple of example RTF files to test against?

@darkranger-red

OK, I will collect some files when I get back to the office on Monday.

@yairchu
yairchu commented Feb 20, 2015

Btw, maybe using an incremental decoder would be the right way?
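For what it's worth, the standard library does expose this: a minimal sketch of the idea with `codecs.getincrementaldecoder` (my own illustration, again assuming cp936 and the byte pair 0xCE 0xD2). The decoder buffers an incomplete lead byte until its trail byte arrives, so a reader could feed it one `\'xx` byte at a time without special-casing charsets.

```python
import codecs

decoder = codecs.getincrementaldecoder("cp936")()
first = decoder.decode(b"\xce")   # lead byte only: buffered, yields ""
second = decoder.decode(b"\xd2")  # trail byte completes the character
print(repr(first), repr(second))
```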

@brendonh
Owner

Multibyte codepages are fixed in 381a306 and your test file now works.

@brendonh brendonh closed this May 18, 2015