Return GB18030 for Simplified Chinese files. #33
GB2312 is a subset of GB18030 and covers fewer Chinese characters, which frequently leads to `UnicodeDecodeError`. Changing to GB18030 can fix the problem.
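A minimal sketch of the failure described above, using only Python's built-in codecs. The sample string is an assumption chosen for illustration (it is not from this PR): "镕" is encodable in GBK and GB18030 but has no GB2312 code point, so bytes containing it cannot be decoded with the `gb2312` codec.

```python
# Illustrative sample text (an assumption, not taken from the PR):
# "镕" (U+9555) exists in GBK/GB18030 but not in GB2312.
text = "朱镕基"
data = text.encode("gb18030")

# Decoding with the narrower GB2312 codec fails on the extra character.
try:
    data.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

# The GB18030 codec, a superset of GB2312, decodes the same bytes.
decoded = data.decode("gb18030")
```

Here `gb2312_ok` ends up `False` while `decoded` round-trips to the original text, which is the practical case for returning the superset name.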
We've had this discussion at length on other pull requests and issues over on sigmavirus24/charade. This is not sufficient to "fix" this issue. In order to do this properly we need a separate State Machine and Prober for GB18030, which we cannot do yet.
Yes, I know this is not a perfect solution. But at least it can help to solve many practical problems. The java port
I'm not in favor of cheating our way to a solution. It's not the right way to do it, and I won't approve it.
I'm going to close this PR in light of the objections from @sigmavirus24. If you want to create a new PR that adds an actual state machine and prober for GB18030, we'll gladly accept it.
W3C's technical recommendation specifies a GBK encoding to be inferred for streams labeled gb2312, which in turn uses a GB18030 decoder.
Any news?

    cur_encoding = chardet.detect(content)['encoding']
    if cur_encoding == 'GB2312':
        cur_encoding = 'GBK'
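The workaround above could be wrapped in a small helper, sketched here under stated assumptions. The function names `rename_legacy_chinese` and `decode_with_superset` are hypothetical and not part of chardet; the detected name is passed in as a plain string so the sketch runs without the library installed. It maps to GB18030 as the PR proposes, whereas the comment above maps to GBK.

```python
def rename_legacy_chinese(encoding):
    """Return 'GB18030' when the detected name is 'GB2312' (hypothetical helper)."""
    if encoding is not None and encoding.upper() == "GB2312":
        return "GB18030"
    return encoding


def decode_with_superset(content, detected_encoding):
    """Decode bytes using the renamed encoding; fall back to UTF-8 if none was detected."""
    encoding = rename_legacy_chinese(detected_encoding) or "utf-8"
    return content.decode(encoding)
```

In practice the second argument would be something like `chardet.detect(content)['encoding']`, which can be `None` when detection fails; the UTF-8 fallback for that case is an assumption of this sketch.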
It took me a long time to notice this comment (sorry), but I'm adding a new flag to apply the W3C suggested legacy encoding renaming.