feat: add lossy UTF-8 reading variants and deprecate strict APIs #73
Merged
Add `read_lines_lossy`, `read_to_string_lossy`, `read_to_bytes`, and async variants. Invalid UTF-8 sequences are replaced with U+FFFD instead of aborting the read.

- CLI defaults to lossy reading; add `--strict-utf8` flag for old behavior
- Existing `read_lines`/`read_to_string`/`read_to_string_async` deprecated
- 24 new tests for lossy replacement, CRLF handling, byte round-trip, and legacy strict failure modes
- Docs and examples updated to use new APIs
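
To make the difference between strict and lossy line reading concrete, here is a minimal std-only sketch. The helper `lines_lossy` is a hypothetical stand-in written for illustration; it is not OneIO's actual implementation and does not use the crate. It assumes that the lossy iterator strips `\n` and `\r\n` the same way `BufRead::lines()` does and leaves a bare `\r` in place.

```rust
use std::io::{self, BufRead, Cursor};

/// Hypothetical helper (not OneIO's code): yields lines from any buffered
/// reader, replacing invalid UTF-8 sequences with U+FFFD instead of failing.
fn lines_lossy<R: BufRead + Send>(
    mut reader: R,
) -> impl Iterator<Item = io::Result<String>> + Send {
    std::iter::from_fn(move || {
        let mut buf = Vec::new();
        match reader.read_until(b'\n', &mut buf) {
            Ok(0) => None, // end of input
            Ok(_) => {
                // Strip a trailing "\n" or "\r\n"; a bare "\r" is kept.
                if buf.last() == Some(&b'\n') {
                    buf.pop();
                    if buf.last() == Some(&b'\r') {
                        buf.pop();
                    }
                }
                Some(Ok(String::from_utf8_lossy(&buf).into_owned()))
            }
            Err(e) => Some(Err(e)),
        }
    })
}

fn main() {
    let data: &[u8] = b"valid\r\nbad: \xf3\nnext\n";

    // Strict reading: the line containing 0xf3 surfaces as an error.
    let strict: Vec<io::Result<String>> = Cursor::new(data).lines().collect();
    assert!(strict[1].is_err());

    // Lossy reading: the invalid byte becomes U+FFFD and iteration continues.
    let lossy: Vec<String> = lines_lossy(Cursor::new(data))
        .map(Result::unwrap)
        .collect();
    assert_eq!(lossy, ["valid", "bad: \u{FFFD}", "next"]);
}
```

Reading raw bytes with `read_until` and decoding each line separately keeps one bad byte from poisoning the rest of the stream, which is the behavior this PR describes.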
Pull request overview
Adds lossy UTF-8 read APIs to OneIO so callers can keep processing real-world inputs containing invalid UTF-8 bytes, while keeping the existing strict APIs available (but deprecated). Updates documentation, examples, CLI behavior, and tests to prefer the new lossy/default-safe paths.
Changes:
- Added new lossy text APIs (`read_lines_lossy`, `read_to_string_lossy`) plus byte-perfect helpers (`read_to_bytes`) with async counterparts.
- Deprecated strict UTF-8 APIs (`read_lines`, `read_to_string`, `read_to_string_async`) and migrated docs/examples/tests to the new defaults.
- Updated the CLI to default to lossy UTF-8 and added a `--strict-utf8` flag for the legacy strict behavior.
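
The CLI change in the last bullet can be pictured as a single decoding switch. The sketch below is an assumption-laden illustration: `decode` is a hypothetical function, and the hand-rolled flag check stands in for whatever argument parser the real `oneio` binary uses.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Hypothetical decoding switch mirroring the `--strict-utf8` flag:
/// strict mode rejects invalid UTF-8, lossy mode substitutes U+FFFD.
fn decode(bytes: &[u8], strict: bool) -> Result<Cow<'_, str>, Utf8Error> {
    if strict {
        std::str::from_utf8(bytes).map(Cow::Borrowed)
    } else {
        Ok(String::from_utf8_lossy(bytes))
    }
}

fn main() {
    // Assumption: a plain scan of argv; the real CLI may parse flags differently.
    let strict = std::env::args().any(|a| a == "--strict-utf8");
    let input = b"bad: \xf3";
    match decode(input, strict) {
        Ok(text) => println!("{text}"),
        Err(e) => {
            eprintln!("invalid UTF-8: {e}");
            std::process::exit(1);
        }
    }
}
```

Defaulting to the lossy branch means ordinary invocations keep producing output on malformed input, while scripts that depend on hard failures can opt back in with the flag.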
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tests/lossy_read_tests.rs | Adds a broad test suite for lossy line iteration and byte-perfect reads. |
| tests/basic_integration.rs | Switches baseline integration coverage to lossy APIs. |
| src/lib.rs | Adds crate-level lossy/bytes APIs and deprecates strict wrappers; updates docs usage. |
| src/client.rs | Implements lossy line iterator + new OneIo lossy/bytes methods; deprecates strict methods. |
| src/bin/oneio.rs | CLI defaults to lossy line reading and introduces --strict-utf8. |
| src/async_reader.rs | Adds async lossy string and byte-perfect read helpers; deprecates strict async string read. |
| specs/README.md | Adds the lossy-read spec to the spec index. |
| specs/02-lossy-read/README.md | New design spec documenting rationale, API plan, and testing strategy. |
| README.md | Updates README examples to use lossy APIs. |
| examples/test_crypto_provider.rs | Updates example to use lossy string read. |
| examples/async_read.rs | Updates async example to use lossy async string read. |
| CHANGELOG.md | Records new APIs, CLI behavior change, and deprecations under Unreleased. |
- Add `+ Send` bound to all `impl Iterator` returns (`to_lines_lossy`, `read_lines_lossy`)
- Add async lossy/bytes tests to `async_integration.rs`
- Replace `write_temp` with RAII `TempFile` (auto-deletes on drop)
- Fix doc comment to remove mention of non-existent async tests
- Fix `specs/README.md` markdown table separator
- Update spec bare CR expected result to match `BufRead::lines()` semantics
- Clarify CHANGELOG deprecation guidance for `get_reader` usage
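
The RAII `TempFile` mentioned above could look roughly like the following. This is a sketch under assumptions: the struct layout, the constructor signature, and the use of the system temp directory are all hypothetical, since the PR's actual test helper is not shown here.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical test helper: writes `contents` to a file in the system
/// temp directory and removes that file when the value is dropped.
struct TempFile {
    path: PathBuf,
}

impl TempFile {
    fn new(name: &str, contents: &[u8]) -> std::io::Result<Self> {
        let path = std::env::temp_dir().join(name);
        fs::write(&path, contents)?;
        Ok(Self { path })
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempFile {
    fn drop(&mut self) {
        // Best-effort cleanup; a failed removal must not panic inside drop.
        let _ = fs::remove_file(&self.path);
    }
}

fn main() -> std::io::Result<()> {
    let kept_path;
    {
        let tmp = TempFile::new("oneio_sketch_raii.txt", b"valid\nbad: \xf3\n")?;
        kept_path = tmp.path().to_path_buf();
        assert_eq!(fs::read(tmp.path())?, b"valid\nbad: \xf3\n");
    } // `tmp` goes out of scope here, so the file is deleted.
    assert!(!kept_path.exists());
    Ok(())
}
```

Tying cleanup to `Drop` means the file is also removed when an assertion panics and the test unwinds, which a manual `remove_file` at the end of the test would miss.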
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
tests/async_integration.rs:106
- This test also uses a fixed filename (`tests/_tmp_async_bytes.txt`), which can collide under parallel test execution and lead to flakiness. Use a per-test unique temp path (or a shared helper that creates unique temp files) instead of a constant path under `tests/`.
let tmp_path = "tests/_tmp_async_bytes.txt";
let _ = std::fs::remove_file(tmp_path);
let expected = b"valid\nbad: \xf3\nnext\n";
std::fs::write(tmp_path, expected).unwrap();
let bytes = oneio::read_to_bytes_async(tmp_path).await.unwrap();
assert_eq!(bytes, expected);
- Generate unique temp paths per async test (PID + counter) to avoid race conditions under parallel `cargo test`
- Rename CHANGELOG header to `[Unreleased]` for parser compatibility
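
A PID-plus-counter scheme like the one described in this commit can be sketched as follows. The function name `unique_tmp_path`, the file-name format, and the choice of the system temp directory are assumptions for illustration, not the code from the PR.

```rust
use std::collections::HashSet;
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};

// Process-wide counter; each call to `unique_tmp_path` takes the next value.
static COUNTER: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical helper: builds a temp path that is unique across tests in
/// this process (counter) and across concurrent test processes (PID).
fn unique_tmp_path(prefix: &str) -> PathBuf {
    let n = COUNTER.fetch_add(1, Ordering::Relaxed);
    let name = format!("{}_{}_{}.txt", prefix, std::process::id(), n);
    std::env::temp_dir().join(name)
}

fn main() {
    // Simulate tests running in parallel threads, as `cargo test` does.
    let handles: Vec<_> = (0..4)
        .map(|_| std::thread::spawn(|| unique_tmp_path("oneio_async")))
        .collect();
    let paths: HashSet<PathBuf> = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .collect();
    assert_eq!(paths.len(), 4); // no two threads received the same path
}
```

The atomic counter covers threads within one test binary, and the PID covers separate test binaries that run at the same time.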
Summary

Fixes #72: `read_lines` and `read_to_string` fail on real-world data containing non-UTF-8 bytes (e.g., Latin-1 in RIPE IRR databases). Adds lossy reading variants that replace invalid sequences with `U+FFFD` and continue processing instead of aborting mid-stream.

Changes

New APIs
- `read_lines_lossy` / `OneIo::read_lines_lossy`: line iterator with lossy UTF-8
- `read_to_string_lossy` / `OneIo::read_to_string_lossy`: full-file lossy read
- `read_to_bytes` / `OneIo::read_to_bytes`: byte-perfect `Vec<u8>` read
- `read_to_string_lossy_async` / `read_to_bytes_async`: async variants
- `OneIo::to_lines_lossy(reader)`: convert any reader to lossy line iterator

Deprecated (still functional)
- `read_lines` → use `read_lines_lossy`
- `read_to_string` → use `read_to_string_lossy` or `read_to_bytes`
- `read_to_string_async` → use `read_to_string_lossy_async` or `read_to_bytes_async`

CLI
- `--strict-utf8` flag for the old strict behavior

Library compatibility

No breaking API changes. Existing functions are deprecated with `#[deprecated]` but retain identical signatures and behavior. Callers can migrate at their own pace.

Testing
24 new tests covering:
- Lossy replacement of invalid bytes (`0xf3` → `U+FFFD`)
- Byte round-trip with `read_to_bytes`
- `Send` trait for cross-thread usage

Related
- `specs/02-lossy-read/README.md`
- `read_lines` fails on non-UTF-8 input — consider a lossy variant #72
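
As an illustration of the "Library compatibility" note, the deprecation pattern can be sketched with a pair of stand-in functions. The names `decode_strict` and `decode_lossy` are hypothetical and operate on byte slices for brevity; they are not OneIO's functions, which read from paths.

```rust
use std::string::FromUtf8Error;

/// Hypothetical lossy replacement: never fails, substitutes U+FFFD.
pub fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Hypothetical strict function kept for compatibility. The attribute only
/// adds a compiler warning at call sites; signature and behavior are unchanged.
#[deprecated(note = "use `decode_lossy` instead")]
pub fn decode_strict(bytes: &[u8]) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes.to_vec())
}

#[allow(deprecated)] // existing callers still compile; this silences the warning
fn main() {
    let input = b"bad: \xf3";
    assert!(decode_strict(input).is_err());
    assert_eq!(decode_lossy(input), "bad: \u{FFFD}");
}
```

Because `#[deprecated]` produces a warning rather than an error, downstream code keeps building and can migrate call by call.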