Problem parsing non-ascii letters inside json strings #59

@Jonatan-Nilsson

When I run the following code:

import json_stream

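em_dash = '—'  # chr(8212), U+2014; encodes to 3 bytes in UTF-8
payload = ('{"test": "' + em_dash*10 + '"}').encode('utf-8')

# 10-byte chunks split some of the 3-byte em dash sequences across chunk boundaries
chunk_size = 10

itr = (payload[i:i+chunk_size] for i in range(0, len(payload), chunk_size))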

data = json_stream.load(itr)
test = data['test']

print(test)

I get this error message:

OSError: I/O error while parsing (index 15): Custom { kind: Other, error: "incomplete utf-8 byte sequence from index 0" }

The problem seems to be in the Rust tokenizer. If I pass the Python tokenizer explicitly, json_stream.load(itr, tokenizer=json_stream.tokenizer.tokenize), it works.
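
For illustration, here is a minimal standard-library sketch of the general failure mode that the error message hints at: a multi-byte UTF-8 sequence cut in two by a chunk boundary. It does not use json_stream and makes no claim about how the Rust tokenizer actually reads its input; the helper names decode_each_chunk and decode_incrementally are made up for this example. Decoding each chunk on its own fails, while an incremental decoder that carries leftover bytes into the next chunk succeeds.

```python
import codecs

em_dash = "\u2014"  # same character as in the report; 3 bytes in UTF-8 (e2 80 94)
payload = ('{"test": "' + em_dash * 10 + '"}').encode("utf-8")
chunk_size = 10
chunks = [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]


def decode_each_chunk(chunks):
    """Hypothetical naive approach: decode every chunk independently."""
    return "".join(chunk.decode("utf-8") for chunk in chunks)


def decode_incrementally(chunks):
    """Hypothetical fix: buffer an incomplete trailing sequence until the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True raises if bytes are still pending at the end of the input
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


try:
    decode_each_chunk(chunks)
    naive_failed = False
except UnicodeDecodeError:
    # The second chunk ends with the first byte of an em dash
    naive_failed = True

text = decode_incrementally(chunks)
print("per-chunk decoding failed:", naive_failed)
print(text)
```

With this payload the 10-byte prefix fills the first chunk exactly, so the second chunk holds three complete em dashes plus the first byte of a fourth, which is where per-chunk decoding breaks.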
