feat: add lossy UTF-8 reading variants and deprecate strict APIs #73

Merged: digizeph merged 3 commits into main from dev/lossy-read on May 12, 2026

@digizeph
Member

Summary

Fixes #72. read_lines and read_to_string fail on real-world data containing non-UTF-8 bytes (e.g., Latin-1 in RIPE IRR databases). This PR adds lossy reading variants that replace invalid sequences with U+FFFD and continue processing instead of aborting mid-stream.
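
To make the difference concrete, here is a minimal sketch using only the Rust standard library. The helper names `decode_strict` and `decode_lossy` are hypothetical and are not part of OneIO; they only illustrate the strict-versus-lossy behavior described above.

```rust
/// Strict decoding: returns an error at the first invalid UTF-8 sequence.
/// (Hypothetical helper for illustration, not a OneIO function.)
fn decode_strict(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy decoding: each invalid sequence is replaced with U+FFFD and
/// decoding continues with the bytes that follow.
/// (Hypothetical helper for illustration, not a OneIO function.)
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

With a Latin-1 byte such as 0xf3 in the input, `decode_strict` returns an error while `decode_lossy` yields a string with U+FFFD in place of the byte and all surrounding text intact.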

Changes

New APIs

  • read_lines_lossy / OneIo::read_lines_lossy — line iterator with lossy UTF-8
  • read_to_string_lossy / OneIo::read_to_string_lossy — full-file lossy read
  • read_to_bytes / OneIo::read_to_bytes — byte-perfect Vec<u8> read
  • read_to_string_lossy_async / read_to_bytes_async — async variants
  • OneIo::to_lines_lossy(reader) — convert any reader to lossy line iterator
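
One way a lossy line iterator over any reader could be built is sketched below with the standard library only. This is an assumption about the general technique (read raw bytes up to each newline, then decode lossily), not the code in src/client.rs; the name `lines_lossy` is hypothetical.

```rust
use std::io::{self, BufRead};

/// Hypothetical sketch: yields one String per line, replacing invalid UTF-8
/// with U+FFFD instead of returning an error for that line.
fn lines_lossy<R: BufRead>(mut reader: R) -> impl Iterator<Item = io::Result<String>> {
    std::iter::from_fn(move || {
        let mut buf = Vec::new();
        match reader.read_until(b'\n', &mut buf) {
            // Zero bytes read means end of input.
            Ok(0) => None,
            Ok(_) => {
                // Strip a trailing "\n" or "\r\n"; a bare "\r" is left in place.
                if buf.last() == Some(&b'\n') {
                    buf.pop();
                    if buf.last() == Some(&b'\r') {
                        buf.pop();
                    }
                }
                Some(Ok(String::from_utf8_lossy(&buf).into_owned()))
            }
            // I/O errors are still surfaced; only decoding is lossy.
            Err(e) => Some(Err(e)),
        }
    })
}
```

Because decoding happens per line on raw bytes, a bad byte on one line does not stop iteration over the lines that follow.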

Deprecated (still functional)

  • read_lines → use read_lines_lossy
  • read_to_string → use read_to_string_lossy or read_to_bytes
  • read_to_string_async → use read_to_string_lossy_async or read_to_bytes_async
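
As a rough illustration of how a deprecation with a migration hint looks in Rust, the sketch below uses hypothetical stand-in functions (`read_text_strict`, `read_text_lossy`, `legacy_caller`). The note text and signatures are assumptions and do not reproduce OneIO's actual attributes.

```rust
use std::string::FromUtf8Error;

/// Hypothetical strict function kept for compatibility. The compiler warns
/// at each call site and prints the note as migration guidance.
#[deprecated(note = "use `read_text_lossy`, or work with the raw bytes")]
pub fn read_text_strict(bytes: &[u8]) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes.to_vec())
}

/// Hypothetical lossy replacement suggested by the note above.
pub fn read_text_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Callers that are not ready to migrate can silence the warning locally.
#[allow(deprecated)]
pub fn legacy_caller(bytes: &[u8]) -> Result<String, FromUtf8Error> {
    read_text_strict(bytes)
}
```

The deprecated function keeps its signature and behavior, so existing callers continue to compile and only see a warning.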

CLI

  • Defaults to lossy UTF-8 reading (no longer exits on invalid bytes)
  • New --strict-utf8 flag for the old strict behavior
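
A simplified model of this mode switch is sketched below. The flag name comes from this PR; the functions `strict_utf8_requested` and `decode_line` are hypothetical and do not reflect how src/bin/oneio.rs parses arguments or reads input.

```rust
use std::string::FromUtf8Error;

/// Hypothetical check for the flag among already-collected arguments.
fn strict_utf8_requested(args: &[String]) -> bool {
    args.iter().any(|a| a == "--strict-utf8")
}

/// Decode one raw line under the selected mode:
/// strict returns an error on invalid bytes, lossy substitutes U+FFFD.
fn decode_line(raw: Vec<u8>, strict_utf8: bool) -> Result<String, FromUtf8Error> {
    if strict_utf8 {
        String::from_utf8(raw)
    } else {
        Ok(String::from_utf8_lossy(&raw).into_owned())
    }
}
```

In this model the default path can never fail on decoding, which matches the described change that the CLI no longer exits on invalid bytes unless strict mode is requested.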

Library compatibility

No breaking API changes. Existing functions are deprecated with #[deprecated] but retain identical signatures and behavior. Callers can migrate at their own pace.

Testing

24 new tests covering:

  • Latin-1 lossy replacement (0xf3 → U+FFFD)
  • CRLF, bare CR, no-trailing-newline handling
  • Continuation past bad bytes (the original bug)
  • Byte-perfect round-trip via read_to_bytes
  • Legacy strict APIs still fail as expected
  • Send trait for cross-thread usage
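
For reference, the line-terminator cases listed above can be checked against the standard library's `BufRead::lines()`, whose semantics a later commit in this PR cites for the bare CR case. The helpers `split_lines` and `assert_send` below are hypothetical test utilities, not code from tests/lossy_read_tests.rs.

```rust
use std::io::{BufRead, Cursor};

/// Split valid UTF-8 input with BufRead::lines().
/// The unwrap would panic on invalid UTF-8, so this is for valid input only.
fn split_lines(input: &[u8]) -> Vec<String> {
    Cursor::new(input).lines().map(|l| l.unwrap()).collect()
}

/// Compile-time check that a value can be moved across threads.
fn assert_send<T: Send>(_: &T) {}
```

`BufRead::lines()` strips a trailing "\n" or "\r\n", keeps a bare "\r" inside the line, and still yields a final line that has no trailing newline. A call such as `assert_send(&iterator)` fails to compile if the iterator type is not `Send`.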

Related

Add read_lines_lossy, read_to_string_lossy, read_to_bytes, and async
variants. Invalid UTF-8 sequences are replaced with U+FFFD instead
of aborting the read.

- CLI defaults to lossy reading; add --strict-utf8 flag for old behavior
- Existing read_lines/read_to_string/read_to_string_async deprecated
- 24 new tests for lossy replacement, CRLF handling, byte round-trip,
  and legacy strict failure modes
- Docs and examples updated to use new APIs
Contributor

Copilot AI left a comment

Pull request overview

Adds lossy UTF-8 read APIs to OneIO so callers can keep processing real-world inputs containing invalid UTF-8 bytes, while keeping the existing strict APIs available (but deprecated). Updates documentation, examples, CLI behavior, and tests to prefer the new lossy/default-safe paths.

Changes:

  • Added new lossy text APIs (read_lines_lossy, read_to_string_lossy) plus byte-perfect helpers (read_to_bytes) with async counterparts.
  • Deprecated strict UTF-8 APIs (read_lines, read_to_string, read_to_string_async) and migrated docs/examples/tests to the new defaults.
  • Updated the CLI to default to lossy UTF-8 and added a --strict-utf8 flag for the legacy strict behavior.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| tests/lossy_read_tests.rs | Adds a broad test suite for lossy line iteration and byte-perfect reads. |
| tests/basic_integration.rs | Switches baseline integration coverage to lossy APIs. |
| src/lib.rs | Adds crate-level lossy/bytes APIs and deprecates strict wrappers; updates docs usage. |
| src/client.rs | Implements lossy line iterator + new OneIo lossy/bytes methods; deprecates strict methods. |
| src/bin/oneio.rs | CLI defaults to lossy line reading and introduces --strict-utf8. |
| src/async_reader.rs | Adds async lossy string and byte-perfect read helpers; deprecates strict async string read. |
| specs/README.md | Adds the lossy-read spec to the spec index. |
| specs/02-lossy-read/README.md | New design spec documenting rationale, API plan, and testing strategy. |
| README.md | Updates README examples to use lossy APIs. |
| examples/test_crypto_provider.rs | Updates example to use lossy string read. |
| examples/async_read.rs | Updates async example to use lossy async string read. |
| CHANGELOG.md | Records new APIs, CLI behavior change, and deprecations under Unreleased. |

Comment thread src/client.rs
Comment thread src/client.rs
Comment thread src/lib.rs Outdated
Comment thread src/async_reader.rs
Comment thread tests/lossy_read_tests.rs Outdated
Comment thread tests/lossy_read_tests.rs Outdated
Comment thread specs/README.md Outdated
Comment thread specs/02-lossy-read/README.md Outdated
Comment thread CHANGELOG.md Outdated
- Add + Send bound to all impl Iterator returns (to_lines_lossy, read_lines_lossy)
- Add async lossy/bytes tests to async_integration.rs
- Replace write_temp with RAII TempFile (auto-deletes on drop)
- Fix doc comment to remove mention of non-existent async tests
- Fix specs/README.md markdown table separator
- Update spec bare CR expected result to match BufRead::lines() semantics
- Clarify CHANGELOG deprecation guidance for get_reader usage
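
The RAII temp-file idea mentioned in this commit can be sketched as follows. This `TempFile` is a hypothetical reconstruction of the pattern (write on creation, delete in `Drop`), not the helper from the test suite, and placing the file in the system temp directory is an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical RAII guard: the file exists for as long as the value lives.
struct TempFile {
    path: PathBuf,
}

impl TempFile {
    /// Create the file under the system temp directory and write `contents`.
    fn new(name: &str, contents: &[u8]) -> io::Result<Self> {
        let path = std::env::temp_dir().join(name);
        fs::write(&path, contents)?;
        Ok(TempFile { path })
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempFile {
    /// Best-effort cleanup; removal errors are ignored on purpose.
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.path);
    }
}
```

Because cleanup lives in `Drop`, the file is removed even when a test assertion panics before reaching an explicit delete call.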
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

tests/async_integration.rs:106

  • This test also uses a fixed filename (tests/_tmp_async_bytes.txt), which can collide under parallel test execution and lead to flakiness. Use a per-test unique temp path (or a shared helper that creates unique temp files) instead of a constant path under tests/.
    let tmp_path = "tests/_tmp_async_bytes.txt";
    let _ = std::fs::remove_file(tmp_path);
    let expected = b"valid\nbad: \xf3\nnext\n";
    std::fs::write(tmp_path, expected).unwrap();

    let bytes = oneio::read_to_bytes_async(tmp_path).await.unwrap();
    assert_eq!(bytes, expected);

Comment thread tests/async_integration.rs Outdated
Comment thread CHANGELOG.md Outdated
- Generate unique temp paths per async test (PID + counter) to avoid race
  conditions under parallel cargo test
- Rename CHANGELOG header to [Unreleased] for parser compatibility
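
A PID-plus-counter scheme like the one described in this commit might look like the sketch below. The function name `unique_temp_path`, the file-name layout, and the use of the system temp directory are assumptions for illustration, not the code in tests/async_integration.rs.

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Process-wide counter; each call to unique_temp_path takes the next value.
static COUNTER: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical helper: builds a path that differs per process (PID) and
/// per call (counter), so parallel tests do not share a file.
fn unique_temp_path(prefix: &str) -> PathBuf {
    let n = COUNTER.fetch_add(1, Ordering::Relaxed);
    let name = format!("{}_{}_{}.txt", prefix, std::process::id(), n);
    std::env::temp_dir().join(name)
}
```

The counter separates tests running on different threads of one test binary, and the PID separates test binaries that run at the same time.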
@digizeph digizeph enabled auto-merge May 12, 2026 20:49
@digizeph digizeph merged commit 23a9279 into main May 12, 2026
6 checks passed
@digizeph digizeph deleted the dev/lossy-read branch May 12, 2026 20:52
Development

Successfully merging this pull request may close these issues.

read_lines fails on non-UTF-8 input — consider a lossy variant

2 participants