@HyukjinKwon HyukjinKwon commented Jan 22, 2026

Rationale for this change

The JSON test utility GenerateAscii generated only ASCII characters. The tests should also cover proper UTF-8 and Unicode handling.

What changes are included in this PR?

Replaced ASCII-only generation with proper UTF-8 string generation that produces valid Unicode scalar values across all planes (BMP, SMP, SIP, planes 3-16), correctly encoded per RFC 3629.
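
For readers unfamiliar with the encoding rules, here is a minimal sketch of how a single Unicode scalar value maps to UTF-8 bytes under RFC 3629. It is not the code from this PR; the name AppendUtf8 and its signature are hypothetical and chosen only for illustration.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not the PR's actual code): appends the UTF-8 encoding
// of `cp` to `out`. Returns false, leaving `out` untouched, for surrogates
// (U+D800..U+DFFF) and values above U+10FFFF, which RFC 3629 forbids.
bool AppendUtf8(uint32_t cp, std::string* out) {
  if (cp >= 0xD800 && cp <= 0xDFFF) return false;
  if (cp > 0x10FFFF) return false;
  if (cp < 0x80) {
    // 1 byte: 0xxxxxxx
    out->push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    out->push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out->push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out->push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out->push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
  return true;
}
```

Rejecting surrogates matters because they are code points but not scalar values, so a generator that samples the BMP uniformly has to skip that range to stay valid.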

Are these changes tested?

Yes, the existing JSON tests cover this.

Are there any user-facing changes?

No, test-only.

@github-actions
⚠️ GitHub issue #48941 has been automatically assigned in GitHub to PR creator.

Comment on lines +181 to 183
// Using c_str() is safe here because generation excludes U+0000 (no embedded nulls).
// U+0000 can only exist in plane 0 (BMP), and BMP generation starts at U+0020.
return OK(writer.String(s.c_str()));

I think we can just call writer.String(s) actually.
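
To illustrate why the distinction matters, here is a small standalone sketch. It does not call the JSON writer at all; it only contrasts the length a NUL-scanning API would see with the length a size-aware API would see. The function names are hypothetical. Whether writer.String accepts a std::string directly depends on how the JSON library is configured (for RapidJSON, my understanding is that this overload requires RAPIDJSON_HAS_STDSTRING), so treat that part as an assumption.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length seen by an API that receives only a `const char*` and scans for NUL.
std::size_t LengthViaCStr(const std::string& s) {
  return std::strlen(s.c_str());
}

// Length seen by an API that receives the std::string (or pointer plus size).
std::size_t LengthViaString(const std::string& s) {
  return s.size();
}
```

For a string built as std::string("a\0b", 3), the first function reports 1 and the second reports 3. Passing the string with its size removes the need for the "no embedded nulls" invariant described in the code comment.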

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Jan 26, 2026

@pitrou pitrou left a comment

Was there a particular concern around this that led you to submit this PR?

In any case, I think it might be worth having a more generic helper in arrow/testing/random.h or somewhere similar. Something like:

std::string RandomUtf8String(int num_chars);
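
A rough sketch of what such a helper could look like, using only the standard library. This is an illustration, not an existing Arrow API: the seed parameter is an assumption added for reproducibility, and the lower bound of U+0020 mirrors the generation range mentioned in the code comment under review.

```cpp
#include <cstdint>
#include <random>
#include <string>

// Hypothetical sketch: returns `num_chars` random Unicode scalar values,
// UTF-8 encoded. `seed` is an assumed extra parameter for reproducibility.
std::string RandomUtf8String(int num_chars, uint32_t seed = 42) {
  std::mt19937 gen(seed);
  // Sample from U+0020..U+10FFFF with the 0x800 surrogates squeezed out.
  std::uniform_int_distribution<uint32_t> dist(0x20, 0x10FFFF - 0x800);
  std::string out;
  for (int i = 0; i < num_chars; ++i) {
    uint32_t cp = dist(gen);
    if (cp >= 0xD800) cp += 0x800;  // skip over U+D800..U+DFFF
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  }
  return out;
}
```

Sampling uniformly over all scalar values means most characters land in the supplementary planes, since they vastly outnumber the BMP. A real helper might instead pick a plane or an encoded length first so that 1-, 2-, and 3-byte sequences appear more often.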

@HyukjinKwon

There was a TODO, // FIXME generate UTF8, and I am trying to clear out those TODOs. Some of them are not really actionable or are overkill; I plan to sweep those away once I have addressed all the actionable ones.

Let me take a look at the suggestion. I am also happy to simply remove this TODO if that does not seem worthwhile.
