fix: handle paths with non-utf-8 bytes #91

abraemer · 2025-10-17T14:47:26Z

Context

On Unix systems, paths may contain arbitrary bytes. String encoding can therefore fail when a path contains bytes that are not valid UTF-8. I ran into this bug while scanning a large codebase with ScanCode, where it caused ScanCode to abort entirely without producing any output.

Summary of changes

This PR adds the "surrogateescape" error handler to the .encode call. Characters that cannot be encoded normally (the surrogate escapes Python uses to represent undecodable path bytes) are converted back to their original bytes instead of raising an error. Since we only use the encoding to convert a string to a byte sequence for hashing, this simple change should be sufficient to fix the problem and does not require further changes.
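For illustration, here is a minimal sketch of the behavior described above. The file name and the `path_digest` helper are hypothetical and are not taken from the changed code; SHA-1 is only a stand-in for whatever hash the project actually uses.

```python
import hashlib

# Hypothetical file name containing a byte (0xff) that is not valid UTF-8.
raw_name = b"caf\xff.txt"

# Python represents undecodable path bytes as lone surrogates
# (here U+DCFF) when decoding with the "surrogateescape" handler.
path = raw_name.decode("utf-8", errors="surrogateescape")

# A strict encode cannot represent the lone surrogate and raises.
try:
    path.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True


def path_digest(path: str) -> str:
    """Hypothetical helper: hash a path string, tolerating escaped bytes."""
    # "surrogateescape" turns each lone surrogate back into its original byte.
    data = path.encode("utf-8", errors="surrogateescape")
    return hashlib.sha1(data).hexdigest()


# The round trip restores the original bytes exactly.
encoded = path.encode("utf-8", errors="surrogateescape")
```

Because the round trip is lossless, two paths that differ only in their non-UTF-8 bytes still hash to different values.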

Signed-off-by: Adrian Braemer <adrian.braemer@tngtech.com>

AyanSinhaMahapatra

Thanks! @abraemer LGTM

Keeping these links here for reference:

There were some formatting test failures. I've fixed them for you, since the tests do not run automatically for first-time contributors.

abraemer and others added 3 commits October 17, 2025 16:36

fix: handle paths with non-utf-8 bytes

3d0e061

Signed-off-by: Adrian Braemer <adrian.braemer@tngtech.com>

Fix string quotes in test for non-UTF8 path

3b48b9b

Fix formatting in test_resource.py

47fb4dd

AyanSinhaMahapatra approved these changes Oct 22, 2025

AyanSinhaMahapatra merged commit 1dd162b into aboutcode-org:main Oct 22, 2025
12 checks passed
