Return GB18030 for Simplified Chinese files. #33

atbest · 2014-09-12T02:06:20Z

GB2312 is a subset of GB18030 and covers fewer Chinese characters, which frequently leads to `UnicodeDecodeError`. Changing to GB18030 can fix the problem.
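
The failure mode described above can be reproduced with Python's built-in codecs alone. This is only a sketch: the sample text and the helper name `try_decode` are my own choices for illustration, not taken from the PR. The character 镕 (U+9555) is encodable in GBK and GB18030 but is outside the GB2312 repertoire.

```python
# Illustrative only: the sample text is an assumption, not taken from the PR.
text = "朱镕基"  # "镕" is in GBK/GB18030 but not in GB2312

# Bytes as they would appear in a GBK- or GB18030-encoded file.
data = text.encode("gb18030")


def try_decode(raw: bytes, encoding: str):
    """Return the decoded string, or None if the codec rejects the bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


gb2312_result = try_decode(data, "gb2312")    # rejected: "镕" has no GB2312 code
gb18030_result = try_decode(data, "gb18030")  # round-trips the original text
```

A detector that reports GB2312 for such a file leads the caller straight into the first case.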

sigmavirus24 · 2014-09-12T02:39:43Z

We've had this discussion at length on other pull requests and issues over on sigmavirus24/charade. This is not sufficient to "fix" this issue. To do this properly we need a separate state machine and prober for GB18030, which we cannot do yet.
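
For readers unfamiliar with what such a state machine has to recognize: GB18030 encodes characters as one, two, or four bytes. The following is a rough sketch of the byte-structure check only. It is not chardet's implementation, the function name is hypothetical, and a real prober would also need character-frequency analysis on top of it.

```python
def is_valid_gb18030_structure(data: bytes) -> bool:
    """Check that data is a well-formed run of 1-, 2-, or 4-byte GB18030 units.

    Only byte ranges are checked, not whether each code is actually assigned.
    """
    i, n = 0, len(data)
    while i < n:
        b1 = data[i]
        if b1 <= 0x7F:                  # single-byte form (ASCII range)
            i += 1
            continue
        if not 0x81 <= b1 <= 0xFE:      # this sketch rejects 0x80 and 0xFF
            return False
        if i + 1 >= n:                  # lead byte with nothing after it
            return False
        b2 = data[i + 1]
        if 0x40 <= b2 <= 0xFE and b2 != 0x7F:
            i += 2                      # two-byte form (GBK-compatible)
            continue
        if 0x30 <= b2 <= 0x39:          # second byte is a digit: four-byte form
            if i + 3 >= n:
                return False
            b3, b4 = data[i + 2], data[i + 3]
            if 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39:
                i += 4
                continue
        return False
    return True
```

For example, `is_valid_gb18030_structure("汉字𠀀".encode("gb18030"))` passes, with the last character taking the four-byte path.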

atbest · 2014-09-12T02:55:52Z

Yes, I know this is not a perfect solution. But at least it can help to solve many practical problems. The Java port juniversalchardet also uses this 'fix'.

sigmavirus24 · 2014-09-12T02:58:27Z

I'm not in favor of cheating our way to a solution. It's not the right way to do it, and I won't approve it. juniversalchardet is a separate project whose decisions do not affect this project's.

dan-blanchard · 2014-12-02T16:31:01Z

I'm going to close this PR in light of the objections from @sigmavirus24. If you want to create a new PR that adds an actual state machine and prober for GB18030, we'll gladly accept it.

ericlingit · 2018-08-27T06:52:37Z

W3C's technical recommendation specifies a GBK encoding to be inferred for streams labeled gb2312, which in turn uses a GB18030 decoder.
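
A minimal sketch of that behavior, using only Python's built-in codecs. The function name and the constant are hypothetical, and the label set is a small subset that I am assuming for illustration rather than the full list from the specification.

```python
# Assumption: a few of the labels that the specification maps to GBK.
GBK_LABELS = {"gb2312", "gbk", "x-gbk", "csgb2312"}


def decode_with_legacy_label(data: bytes, label: str) -> str:
    """Decode data, treating GBK-family labels as GB18030 for decoding."""
    normalized = label.strip().lower()
    # GB18030 is a superset of GBK and GB2312, so its decoder accepts all three.
    codec = "gb18030" if normalized in GBK_LABELS else normalized
    return data.decode(codec)
```

With this, bytes labeled `GB2312` that actually contain GBK-only characters decode without raising `UnicodeDecodeError`.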

honglei · 2020-11-28T18:29:16Z

Any news?

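```python
import chardet

# Workaround: use GBK (a superset of GB2312) whenever chardet reports GB2312.
cur_encoding = chardet.detect(content)['encoding']
if cur_encoding == 'GB2312':
    cur_encoding = 'GBK'
```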

dan-blanchard · 2022-06-29T03:34:33Z

> W3C's technical recommendation specifies a GBK encoding to be inferred for streams labeled gb2312, which in turn uses a GB18030 decoder.

It took me a long time to notice this comment (sorry), but I'm adding a new flag to apply the W3C-suggested legacy encoding renaming.

Commit caf0676: Return GB18030 for Simplified Chinese files.


dan-blanchard closed this Dec 2, 2014

dan-blanchard reopened this Dec 2, 2014

dan-blanchard closed this Dec 2, 2014

dan-blanchard mentioned this pull request Nov 18, 2015

Is it OK to replace GB2312 by GB18030? #79

Closed

atbest mentioned this pull request Nov 16, 2016

GB18030 for Chinese #94

Closed

atbest deleted the patch-1 branch December 13, 2016 07:05

grzhan mentioned this pull request May 13, 2018

On how the Python chardet library handles GB2312, GBK, and GB18030 grzhan/keng#1

Open

x1angli mentioned this pull request Dec 21, 2018

GB18030 encoded file incorrectly classified as GB2312 #168

Open

RaiKoHoff mentioned this pull request Mar 7, 2019

Encoding detection detects GB18030 instead of GB2312 rizonesoft/Notepad3#998

Closed
