Return GB18030 for Simplified Chinese files. #33

Closed
wants to merge 1 commit into from

atbest
Contributor

@atbest atbest commented Sep 12, 2014

GB2312 is a subset of GB18030 and covers far fewer Chinese characters, so decoding a file with the detected GB2312 encoding frequently raises `UnicodeDecodeError`. Returning GB18030 instead fixes the problem.
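
To make the failure mode concrete, here is a minimal sketch using only Python's built-in codecs. The sample string is an illustrative choice, not taken from the PR: the character "镕" exists in GBK and GB18030 but not in GB2312, so bytes containing it cannot be decoded as GB2312.

```python
# Illustrative sample: "镕" is not part of the GB2312 character set.
text = "朱镕基"
data = text.encode("gb18030")

# Decoding with the narrower GB2312 codec fails on the unsupported character.
try:
    data.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

# Decoding with the GB18030 superset round-trips the original text.
decoded = data.decode("gb18030")
print(gb2312_ok, decoded == text)
```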
@sigmavirus24
Member

We've had this discussion at length on other pull requests and issues over on sigmavirus24/charade. This is not sufficient to "fix" this issue. In order to do this properly, we need a separate State Machine and Prober for GB18030, which we cannot do yet.
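
For context on why a separate state machine would be needed: besides the one-byte and two-byte forms, GB18030 also defines a four-byte form, so its byte structure differs from GB2312's. The sketch below is only a structural well-formedness check written for illustration. It is not chardet's state machine or prober, and the function name `is_wellformed_gb18030` is hypothetical. A real prober would also need character-frequency analysis to assign a confidence.

```python
def is_wellformed_gb18030(data: bytes) -> bool:
    """Return True if data follows the GB18030 one-, two-, or four-byte layout."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:
            # One-byte form (ASCII range).
            i += 1
        elif 0x81 <= b <= 0xFE and i + 1 < n:
            b2 = data[i + 1]
            if 0x40 <= b2 <= 0xFE and b2 != 0x7F:
                # Two-byte form: trail byte in 0x40-0x7E or 0x80-0xFE.
                i += 2
            elif (0x30 <= b2 <= 0x39 and i + 3 < n
                  and 0x81 <= data[i + 2] <= 0xFE
                  and 0x30 <= data[i + 3] <= 0x39):
                # Four-byte form: lead, digit, lead, digit.
                i += 4
            else:
                return False
        else:
            # Invalid lead byte or truncated sequence.
            return False
    return True
```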

@atbest
Contributor Author

atbest commented Sep 12, 2014

Yes, I know this is not a perfect solution, but at least it helps solve many practical problems. The Java port juniversalchardet also uses this 'fix'.

@sigmavirus24
Member

I'm not in favor of cheating our way to a solution. It's not the right way to do it, and I won't approve it. juniversalchardet is a separate project whose decisions do not affect this project's.

@dan-blanchard
Member

I'm going to close this PR in light of the objections from @sigmavirus24. If you want to create a new PR that adds an actual state machine and prober for GB18030, we'll gladly accept it.

@ericlingit

W3C's technical recommendation specifies a GBK encoding to be inferred for streams labeled gb2312, which in turn uses a GB18030 decoder.

@honglei

honglei commented Nov 28, 2020

Any news?

import chardet

cur_encoding = chardet.detect(content)['encoding']
# GBK is a superset of GB2312, so decode GB2312 results with GBK instead
if cur_encoding == 'GB2312':
    cur_encoding = 'GBK'
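
The workaround above can be generalized into a small helper that swaps a legacy encoding name for its superset before decoding. This is a sketch under stated assumptions: `decode_detected` and `LEGACY_SUPERSETS` are hypothetical names, not part of chardet, and falling back to UTF-8 when no encoding was detected is an arbitrary choice. The detected name is passed in as a string, so the helper works with whatever detector produced it.

```python
from typing import Optional

# Hypothetical mapping: legacy name -> superset codec used for decoding.
LEGACY_SUPERSETS = {"gb2312": "gb18030", "gbk": "gb18030"}


def decode_detected(data: bytes, detected: Optional[str]) -> str:
    """Decode data, using a superset codec when a legacy name was detected."""
    # Assumption: fall back to UTF-8 if the detector returned no encoding.
    name = (detected or "utf-8").lower()
    return data.decode(LEGACY_SUPERSETS.get(name, name))
```

For example, bytes that a detector labeled as `GB2312` but that contain GBK-only characters decode cleanly with `decode_detected(data, "GB2312")`.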

@dan-blanchard
Member

> W3C's technical recommendation specifies a GBK encoding to be inferred for streams labeled gb2312, which in turn uses a GB18030 decoder.

It took me a long time to notice this comment (sorry), but I'm adding a new flag to apply the W3C-suggested legacy encoding renaming.

5 participants