@abraemer
Contributor

Context

On Unix systems, paths may contain arbitrary bytes. Encoding such a path as a string fails if it contains bytes that are not valid UTF-8. I ran into this bug while scanning a large codebase with ScanCode, where it caused ScanCode to abort entirely without producing any output.

Summary of changes

This PR adds the "surrogateescape" error handler to the .encode call, so that bytes that cannot be encoded properly are escaped instead of raising an error. Since the encoding is only used to convert a string to a byte sequence for hashing, this simple change should be sufficient to fix the problem without further changes.
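For illustration, here is a minimal sketch of the failure and the fix, not the actual ScanCode code. The sample path, the helper names `strict_encode_fails` and `hash_path`, and the choice of SHA-1 are all assumptions made for this example.

```python
import hashlib

# Hypothetical path as raw bytes; 0xff can never appear in valid UTF-8.
raw_path = b"src/caf\xff.txt"

# Decoding with "surrogateescape" maps the undecodable byte to a lone
# surrogate (U+DCFF), similar to how Python represents such paths on Unix.
path = raw_path.decode("utf-8", errors="surrogateescape")


def strict_encode_fails(text):
    """Return True if a plain .encode("utf-8") raises UnicodeEncodeError."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def hash_path(text):
    """Hash a path string, tolerating escaped non-UTF-8 bytes.

    SHA-1 is an assumption here; any hash over the bytes works the same way.
    """
    data = text.encode("utf-8", errors="surrogateescape")
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    print("strict encode fails:", strict_encode_fails(path))
    print("digest:", hash_path(path))
```

With the error handler in place, the escaped surrogate is turned back into the original byte, so the hash is computed over the same bytes the filesystem reported.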

Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

Thanks! @abraemer LGTM

Keeping these links here for reference:

There were some formatting test failures; I've fixed them for you (since the tests do not run automatically for first-time contributors).

@AyanSinhaMahapatra AyanSinhaMahapatra merged commit 1dd162b into aboutcode-org:main Oct 22, 2025
12 checks passed