Fix broken CP949 state machine #268

HelloWorld017 · 2022-07-21T16:54:03Z

Abstract

Current CP949 state machine has some false positives, and incorrectly marks valid CP949 texts as an error.
This PR rewrites the state transition table, to comply the CP949 Specification.

Details

These are some cases, which a false-positive error can occur in the current implementation.

춉 (0xAD68)
The first byte is classified as the class 8, as it is 0xAD. And in the START state, the class 8 makes an transition to the ERROR state. But this is a valid CP949.
힣 (0xC652)
The first byte is classified as the class 9, and the second byte is classified as the class 5. In the START state, the class 9 makes an transition to the State 6, and in the State 6, the class 5 makes an transition to the ERROR state. But this is a valid CP949.

Test

I have tested the state machine (To-Be) for the all characters in the CP949 with following code, and it successfully returned Success.
When I have tested it against the current implementation (As-Is), it shows Error! at byte 15479.

from chardet.codingstatemachine import CodingStateMachine
from chardet.mbcssm import CP949_SM_MODEL

sm = CodingStateMachine(CP949_SM_MODEL)

with open('./path/to/cp949-chars.txt', 'rb') as f:
    data = f.read()

for i, byte in enumerate(data):
    state = sm.next_state(byte)

    if state == 1:
        print("Error! at byte %d" % i)
        break

if state != 1:
  print("Success! :)")

I couldn't upload the cp949 characters to the test fixtures folder, as it will make the test fail because of the frequency-based probing, which will not successfully mark it as the CP949. (Because it is just a plain listing of the all possible characters of the CP949.)

dan-blanchard · 2022-07-22T20:31:37Z

Do you have any test files that you could contribute that show that it failed before and would work now?

HelloWorld017 · 2022-07-25T17:05:34Z

I have added tests for coding state machines (test_coding_state_machine).
It's working for the most machines and points out the bug of the current CP949 state machine.

Drawbacks

I have commented out the ISO-2022-CN and EUCTW as the python does not supports them.
Actually, for the unicode machines, it seems that it fails. I'm not certain whether it is the bug of the state machine, or the bug of the test code.
These are the case where they fails:
- State 5 + u+1b00 for UTF-16LE
- State 6 + u+1b00 for UTF-16BE
- State START + u+10000 for UTF-8

Please let me know if there are something to improve, or if I have made some mistakes in the test code.

HelloWorld017 · 2023-11-05T12:58:41Z

I have limited testing charsets for the test case, which is added by this PR, as it would be better to be handled in another PR.

dan-blanchard force-pushed the feature/cp949-class8 branch from 2efbe00 to b83279c Compare July 22, 2022 19:58

HelloWorld017 and others added 4 commits November 5, 2023 21:58

Fix broken CP949 state machine

aec8efb

Fix CP949 char len table

97c572e

Add coding state machine test code

ac07f69

Ignore unicode coding state machine tests

9159b2c

HelloWorld017 force-pushed the feature/cp949-class8 branch from a6fa082 to 9159b2c Compare November 5, 2023 12:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix broken CP949 state machine #268

Fix broken CP949 state machine #268

HelloWorld017 commented Jul 21, 2022 •

edited

dan-blanchard commented Jul 22, 2022

HelloWorld017 commented Jul 25, 2022 •

edited

HelloWorld017 commented Nov 5, 2023

Fix broken CP949 state machine #268

Are you sure you want to change the base?

Fix broken CP949 state machine #268

Conversation

HelloWorld017 commented Jul 21, 2022 • edited

Abstract

Details

Test

dan-blanchard commented Jul 22, 2022

HelloWorld017 commented Jul 25, 2022 • edited

Drawbacks

HelloWorld017 commented Nov 5, 2023

HelloWorld017 commented Jul 21, 2022 •

edited

HelloWorld017 commented Jul 25, 2022 •

edited