Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken CP949 state machine #268

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

HelloWorld017
Copy link

@HelloWorld017 HelloWorld017 commented Jul 21, 2022

Abstract

Current CP949 state machine has some false positives, and incorrectly marks valid CP949 texts as an error.
This PR rewrites the state transition table, to comply the CP949 Specification.

Details

These are some cases, which a false-positive error can occur in the current implementation.

  • (0xAD68)
    The first byte is classified as the class 8, as it is 0xAD. And in the START state, the class 8 makes an transition to the ERROR state. But this is a valid CP949.

  • (0xC652)
    The first byte is classified as the class 9, and the second byte is classified as the class 5. In the START state, the class 9 makes an transition to the State 6, and in the State 6, the class 5 makes an transition to the ERROR state. But this is a valid CP949.

Test

I have tested the state machine (To-Be) for the all characters in the CP949 with following code, and it successfully returned Success.
When I have tested it against the current implementation (As-Is), it shows Error! at byte 15479.

from chardet.codingstatemachine import CodingStateMachine
from chardet.mbcssm import CP949_SM_MODEL

sm = CodingStateMachine(CP949_SM_MODEL)

with open('./path/to/cp949-chars.txt', 'rb') as f:
    data = f.read()

for i, byte in enumerate(data):
    state = sm.next_state(byte)

    if state == 1:
        print("Error! at byte %d" % i)
        break

if state != 1:
  print("Success! :)")

I couldn't upload the cp949 characters to the test fixtures folder, as it will make the test fail because of the frequency-based probing, which will not successfully mark it as the CP949. (Because it is just a plain listing of the all possible characters of the CP949.)

@dan-blanchard
Copy link
Member

Do you have any test files that you could contribute that show that it failed before and would work now?

@HelloWorld017
Copy link
Author

HelloWorld017 commented Jul 25, 2022

I have added tests for coding state machines (test_coding_state_machine).
It's working for the most machines and points out the bug of the current CP949 state machine.

Drawbacks

  • I have commented out the ISO-2022-CN and EUCTW as the python does not supports them.
  • Actually, for the unicode machines, it seems that it fails. I'm not certain whether it is the bug of the state machine, or the bug of the test code.
    These are the case where they fails:
    • State 5 + u+1b00 for UTF-16LE
    • State 6 + u+1b00 for UTF-16BE
    • State START + u+10000 for UTF-8

Please let me know if there are something to improve, or if I have made some mistakes in the test code.

@HelloWorld017
Copy link
Author

I have limited testing charsets for the test case, which is added by this PR, as it would be better to be handled in another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants