PyString::data(): return the internal representation of the Python unicode object #247

Merged
merged 6 commits into from
Feb 17, 2021

dgrunwald
Owner

This fixes #246: panic in PyString::to_string and PyString::to_string_lossy when a Python3 unicode string contains unpaired surrogates.

As an optimization, PyString::to_string keeps using PyUnicode_AsUTF8AndSize which allows Python to cache the UTF-8 representation.
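
To illustrate the failure mode behind #246: an unpaired surrogate has no valid UTF-8 encoding, so a strict conversion has to report an error, while a lossy conversion can substitute U+FFFD. The sketch below uses only the Rust standard library on raw UTF-16 code units; the helper names are hypothetical and are not part of this crate, and it does not show how the PR itself implements the fix.

```rust
// Illustrative, std-only sketch. `strict_utf16` and `lossy_utf16` are
// hypothetical helpers, not rust-cpython APIs.

/// Strict conversion: returns an error if the input contains an
/// unpaired surrogate (a code unit in 0xD800..=0xDFFF with no partner).
fn strict_utf16(units: &[u16]) -> Result<String, std::string::FromUtf16Error> {
    String::from_utf16(units)
}

/// Lossy conversion: never fails; each unpaired surrogate is replaced
/// with U+FFFD (the Unicode replacement character).
fn lossy_utf16(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

Returning a `Result` from the strict path, rather than unwrapping it, is what lets a caller decide between reporting the error and falling back to the lossy path.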

Collaborator

@markbt markbt left a comment

This looks good to me.

let data = py_string.cast_as::<PyString>(py).unwrap().data(py);
#[cfg(feature = "python3-sys")]
{
    if let PyStringData::Latin1(s) = data {
Collaborator

Is it worth adding tests for the other PyStringData variants? I assume non-Latin-1 text (e.g. something in Greek) would be Utf16 and non-BMP text (e.g. something with emojis) would be Utf32.
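
The assumption above matches CPython's flexible string representation (PEP 393), which stores each string at the narrowest width that fits its largest code point. A minimal sketch of that selection rule, useful for picking test inputs, might look like the following. The `Width` enum and `narrowest_width` function are hypothetical stand-ins, not the crate's `PyStringData`, and the mapping of these widths to the Latin1, Utf16, and Utf32 variants is assumed from the comment rather than confirmed here.

```rust
/// Hypothetical stand-in for the three storage widths; this is not the
/// crate's `PyStringData` type.
#[derive(Debug, PartialEq, Eq)]
enum Width {
    OneByte,   // all code points <= U+00FF (Latin-1 range)
    TwoBytes,  // all code points <= U+FFFF (Basic Multilingual Plane)
    FourBytes, // at least one code point above U+FFFF
}

/// Picks the narrowest width that can hold every code point in `s`.
fn narrowest_width(s: &str) -> Width {
    // An empty string has no code points; treat its maximum as 0.
    let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
    if max <= 0xFF {
        Width::OneByte
    } else if max <= 0xFFFF {
        Width::TwoBytes
    } else {
        Width::FourBytes
    }
}
```

Under this rule, "café" stays in the one-byte range, Greek text such as "αβγ" needs two bytes per code point, and a string containing an emoji such as "😀" needs four.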

@dgrunwald dgrunwald merged commit 40c815e into master Feb 17, 2021
Successfully merging this pull request may close these issues.

PyString::to_string panics on UnicodeDecodeError
2 participants