
Conversation

@ppkarwasz
Contributor

The 7z file format specification defines only **unsigned numbers** (`UINT64`, `REAL_UINT64`, `UINT32`). However, the current implementation allows parsing methods like `readUint64`, `getLong`, and `getInt` to return negative values and then handles those inconsistently in downstream logic.

This PR introduces a safer and more specification-compliant number parsing model.

### Key changes

* **Strict unsigned number parsing** (see the sketch after this description)

  * Parsing methods now *never* return negative numbers.
  * `readUint64`, `readUint64ToIntExact`, `readRealUint64`, and `readUint32` follow the terminology from `7zFormat.txt`.
  * Eliminates scattered negative-value checks that previously compensated for parsing issues.

* **Improved header integrity validation**

  * Before large allocations, the size is now validated against the **actual available data in the header** as well as the memory limit.
  * Prevents unnecessary or unsafe allocations when the archive is corrupted or truncated.

* **Correct numeric type usage**

  * Some fields represent 7z numbers as 64-bit values but are constrained internally to Java `int` limits.
  * These are now declared as `int` to signal real constraints in our implementation.

* **Consistent error handling**
  Parsing now throws only three well-defined exception types:

  | Condition                                                  | Exception                                    |
  | ---------------------------------------------------------- | -------------------------------------------- |
  | Declared structure exceeds `maxMemoryLimitKiB`             | `MemoryLimitException`                       |
  | Missing data inside header (truncated or corrupt)          | `ArchiveException("Corrupted 7z archive")`   |
  | Unsupported numeric values (too large for implementation)  | `ArchiveException("Unsupported 7z archive")` |

  Note: `EOFException` is no longer used; a header with missing fields is not “EOF,” it is **corrupted**.

This PR lays groundwork for safer parsing and easier future maintenance by aligning number handling with the actual 7z specification and making header parsing behavior *predictable and robust*.
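
Below is a minimal sketch of this parsing contract, assuming the variable-length `NUMBER` encoding from `7zFormat.txt`; the method name and exception messages follow this description, and the real `SevenZFile` code may differ in detail:

```java
import java.nio.ByteBuffer;

import org.apache.commons.compress.archivers.ArchiveException;

final class Uint64ParsingSketch {

    /** Reads a 7z variable-length NUMBER (see 7zFormat.txt); never returns a negative value. */
    static long readUint64(final ByteBuffer header) throws ArchiveException {
        final int firstByte = getUnsignedByte(header);
        long value = 0;
        int mask = 0x80;
        for (int i = 0; i < 8; i++) {
            if ((firstByte & mask) == 0) {
                // The remaining low bits of the first byte supply the most significant bits.
                return value | (long) (firstByte & (mask - 1)) << (8 * i);
            }
            value |= (long) getUnsignedByte(header) << (8 * i);
            mask >>>= 1;
        }
        if (value < 0) {
            // The full 8-byte form can exceed Long.MAX_VALUE, which this implementation cannot represent.
            throw new ArchiveException("Unsupported 7z archive");
        }
        return value;
    }

    /** Missing header bytes are treated as corruption, not EOF. */
    private static int getUnsignedByte(final ByteBuffer header) throws ArchiveException {
        if (!header.hasRemaining()) {
            throw new ArchiveException("Corrupted 7z archive");
        }
        return header.get() & 0xFF;
    }
}
```

A `readUint64ToIntExact` variant would additionally reject values above `Integer.MAX_VALUE` with the same “Unsupported 7z archive” error.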

ppkarwasz and others added 6 commits October 18, 2025 00:25
The 7z file format specification defines only **unsigned numbers** (`UINT64`, `REAL_UINT64`, `UINT32`). However, the current implementation allows parsing methods like `readUint64`, `getLong`, and `getInt` to return negative values and then handles those inconsistently in downstream logic.

This PR introduces a safer and more specification-compliant number parsing model.

### Key changes

* **Strict unsigned number parsing**

  * Parsing methods now *never* return negative numbers.
  * `readUint64`, `readUint64ToIntExact`, `readRealUint64`, and `readUint32` follow the terminology from `7zFormat.txt`.
  * Eliminates scattered negative-value checks that previously compensated for parsing issues.

* **Improved header integrity validation**

  * Before large allocations, the size is now validated against the **actual available data in the header** as well as the memory limit.
  * Prevents unnecessary or unsafe allocations when the archive is corrupted or truncated.

* **Correct numeric type usage**

  * Some fields represent 7z numbers as 64-bit values but are constrained internally to Java `int` limits.
  * These are now declared as `int` to signal real constraints in our implementation.

* **Consistent error handling**
  Parsing now throws only three well-defined exception types:

  | Condition                                                              | Exception                                    |
  | ---------------------------------------------------------------------- | -------------------------------------------- |
  | Declared structure exceeds `maxMemoryLimitKiB`                         | `MemoryLimitException`                       |
  | Missing data inside header (truncated or corrupt)                      | `ArchiveException("Corrupted 7z archive")`   |
  | Unsupported numeric values (too large for implementation) | `ArchiveException("Unsupported 7z archive")` |

  Note: `EOFException` is no longer used: a header with missing fields is not “EOF,” it is **corrupted**.

This PR lays groundwork for safer parsing and easier future maintenance by aligning number handling with the actual 7z specification and making header parsing behavior *predictable and robust*.
@garydgregory garydgregory merged commit f1830f9 into commons_io_2_21_0 Oct 18, 2025
16 checks passed
@garydgregory garydgregory deleted the fix/7z-size-validation branch October 18, 2025 11:37
@garydgregory garydgregory changed the title from “7z: unsigned number parsing and improved header validation” to “7z: Unsigned number parsing and improved header validation” Oct 18, 2025
garydgregory added a commit that referenced this pull request Nov 7, 2025
* Migrate to Apache Commons IO 2.21.0-SNAPSHOT

* Avoid redirects for snapshots in this branch

* Use `IOUtils.checkIndexFromSize` for argument validation (#716)

This change replaces custom argument checks in all `InputStream` and `OutputStream` implementations with the common utility method `IOUtils.checkIndexFromSize`.
This ensures consistent validation logic across the codebase.

Some implementations are intentionally left unchanged: cases where the arguments are immediately delegated to methods such as `InputStream#read`, `OutputStream#write`, or `ByteBuffer#wrap`, which already perform equivalent checks internally.
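
A hedged sketch of the resulting pattern in a `read(byte[], int, int)` override; the parameter order of `IOUtils.checkIndexFromSize` is an assumption here (mirroring `java.util.Objects.checkFromIndexSize`), so treat this as illustrative rather than a copy of the real code:

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;

// Illustrative wrapper only; the real archive streams extend ArchiveInputStream.
class ValidatingInputStream extends FilterInputStream {

    ValidatingInputStream(final InputStream in) {
        super(in);
    }

    @Override
    public int read(final byte[] b, final int off, final int len) throws IOException {
        // Assumed argument order (offset, length, array length); rejects any
        // (off, len) window that does not fit into b.
        IOUtils.checkIndexFromSize(off, len, b.length);
        return len == 0 ? 0 : in.read(b, off, len);
    }
}
```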

* Set commons-parent to 88

* remove: make `TarUtils` final and clean up internal methods (#712)

* remove: make `TarUtils` final and clean up internal methods

`TarUtils` already had a private constructor and was only used within the `o.a.c.compress.archivers.tar` package. This PR makes the class explicitly final and simplifies its internal API:

* Mark `TarUtils` as `final`.
* Change `protected` methods to package-private (they were never usable outside the package due to the private constructor).
* Remove deprecated non-`public` methods.
* Simplify `parsePaxHeaders`:

  * Require a non-negative `headerSize`. The special value `-1` (“unlimited”) was only used in tests.
  * Update tests to supply a valid `headerSize`.
  * Standardize error handling for invalid PAX 0.0 sparse records: all parsing errors now throw `ArchiveException` (previously one case threw `IOException`).

A changelog entry is included even though these are internal changes, to give users a reference point in case any unintended side effects arise.

* fix: make `TarUtils` final

This change actually makes `TarUtils` final and introduces a JapiCmp configuration that ignores all `CLASS_NOW_FINAL` changes.

* fix: don't call tests with `Integer.MAX_VALUE`

* fix: bump `japicmp` to version `0.24.0`

This bumps `japicmp` to the new version `0.24.0`, which ignores changes to protected methods in non-extensible classes such as `TarUtils`.

* Update japicmp version from 0.24.0 to 0.24.1

* fix: clean up POM

* fix: enforcer rule to clean up temporary override

* Test `project.parent.version` resolution on CI

* Try ignoring white space

* Try newest Enforcer version

* Remove enforcer rule

---------

Co-authored-by: Gary Gregory <garydgregory@users.noreply.github.com>

* Add `AbstractArchiveBuilder` for unified archiver support (#721)

* Introduce `AbstractArchiveBuilder` for unified archiver support

This PR introduces a new `AbstractArchiveBuilder`, which serves as a shared foundation for all archive input streams, including `SevenZFile`, `TarFile`, and `ZipFile`.

This refactoring also leverages the new `ChannelOrigin` from Commons IO 2.21.0, streamlining the builder implementations for `SevenZFile`, `TarFile`, and `ZipFile`.

### Additional improvements

* **Constructor unification:** All constructors now delegate to a single constructor that accepts a builder.
* **Resource safety:** If an error occurs during construction (or within the builder’s `get()` method), the underlying resource is now always closed. Previously, resources provided via `SeekableByteChannel` were not closed on error, potentially leaving channels open and leaking them.
* **Deprecation cleanup:** All references to deprecated constructors and methods have been removed. Compatibility is verified through `LegacyConstructorsTest`, which ensures the builder pattern behaves equivalently to the removed constructors.

* fix: Checkstyle violation

* fix: remove unused members

* fix: Windows failure

* Disable JDK 26-ea tests (#724)

All JDK 26-ea tests currently fail because PMD does not yet support this version (see [job 52084346962](https://github.com/apache/commons-compress/actions/runs/18292910119/job/52084346962) for an example).

While it’s valuable to stay ahead of upcoming JDK releases, having test jobs that are guaranteed to fail for reasons outside of Commons Compress or the JDK itself adds little benefit.

This PR disables JDK 26-ea tests until pmd/pmd#5871 is resolved, which will likely require upgrading to ASM 9.9 (released only a few days ago). Tests can be re-enabled once PMD adds support for Java 26.

* Bump org.apache.commons:commons-parent from 88 to 89

No longer need to override JApiCmp

* [COMPRESS-711] Fix incorrect CPIO checksum verification (#725)

This PR fixes an issue in `CpioArchiveInputStream` where the checksum was computed using the wrong slice of the buffer. Previously, it always used bytes `0` through `len`, ignoring the specified `off` parameter. As a result, checksum verification could fail when `read()` was called with a non-zero offset.
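
To make the off-by-offset issue concrete, here is a minimal sketch of the two loop shapes (the CPIO “crc” format sums the data bytes; the field and variable names are assumptions, not the exact code):

```java
// Before: always started at index 0, so a non-zero offset corrupted the sum.
for (int i = 0; i < len; i++) {
    checksum += b[i] & 0xFF;
}

// After: sums exactly the bytes that were read into b[off .. off + len).
for (int i = off; i < off + len; i++) {
    checksum += b[i] & 0xFF;
}
```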

### Changes

* Corrected checksum calculation to use the actual range `off` through `off + len`.
* Added a regression test to ensure checksum verification works correctly with non-zero offsets.

* Sort members and remove test clutter

* Move action to fix section

* Fix Javadoc

* Add `ArchiveFile` abstraction for file-based archives (#709)

* feat: introduce `ArchiveFile` abstraction for file-based archives

This PR adds a new `ArchiveFile` interface to unify the handling of file-based archive utilities such as `SevenZFile`, `TarFile`, and `ZipFile`.

Although these classes target different archive formats, they share several core characteristics:

* All are `Closeable`.
* Each provides the same method to open an `InputStream` for a given entry (`InputStream getInputStream(T)` where `T extends ArchiveEntry`).
* Historically, their `getEntries()` methods returned incompatible types. This PR introduces a common `List<? extends T> entries()` method, aligning with `java.util.zip.ZipFile` in name but offering a modern return value.
* The `ZipFile#stream()` method (added in 1.28.0) is now part of this abstraction.

This change establishes a consistent, format-agnostic API for working with archive files, reducing duplication and improving discoverability for users.
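
A hedged sketch of the shape this abstraction implies; the exact generics, package, and the return type of `stream()` in the real `ArchiveFile` interface may differ:

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.stream.Stream;

import org.apache.commons.compress.archivers.ArchiveEntry;

public interface ArchiveFileSketch<E extends ArchiveEntry> extends Closeable {

    /** Lists the entries, aligning with java.util.zip.ZipFile#entries() in name only. */
    List<? extends E> entries();

    /** Opens an input stream for the data of the given entry. */
    InputStream getInputStream(E entry) throws IOException;

    /** Streams the entries, as ZipFile#stream() (added in 1.28.0) already does. */
    Stream<? extends E> stream();
}
```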

* fix: add `IOIterator<T>` interface

* fix: failing tests

* fix: comment on usage of `getNextTarEntry`

* empty commit to trigger CI

* empty commit to trigger CI (2)

* fix: use `asIterable()` to provide `unwrap()`

* Sort members

* Javadoc

Use longer lines

* Narrow test exception typing

* Use final

* Add configurable maximum entry name length (#710)

* feat: Add configurable maximum entry name length

This change introduces a configurable limit on archive entry names.

Although formats like **AR**, **CPIO**, and **TAR** permit arbitrarily long path names, real operating systems and file systems impose much stricter limits:

* Individual path segments (file names) are typically limited to 255 bytes.
* Full paths are usually capped at a few KiB (e.g. 1024 bytes on macOS via `PATH_MAX`).

#### What’s new

* Added a new common builder, `AbstractArchiveBuilder`, inserted in the hierarchy between archive stream builders and `AbstractStreamBuilder`.
* Introduced a new configuration option: `setMaxEntryLength` (see the usage sketch after this list).
* Default value is `Short.MAX_VALUE`, which is higher than any realistic OS limit.
* Enforced the limit across:

  * All `ArchiveInputStream` implementations
  * `SevenZFile`, `TarFile`, and `ZipFile`
* Added a dedicated test suite to verify:

  * Entry names up to `Short.MAX_VALUE` are handled correctly.
  * Entries exceeding a lowered limit result in an exception.
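
A usage sketch under the assumptions above; `setMaxEntryLength` is the option named in this description, while the rest is the existing `SevenZFile` builder API:

```java
import java.io.File;
import java.io.IOException;

import org.apache.commons.compress.archivers.sevenz.SevenZFile;

public class MaxEntryLengthExample {
    public static void main(final String[] args) throws IOException {
        final File archive = new File(args[0]);
        // Entry names longer than 4096 are rejected instead of being accepted silently
        // (the default limit is Short.MAX_VALUE).
        try (SevenZFile sevenZ = SevenZFile.builder()
                .setFile(archive)
                .setMaxEntryLength(4096)
                .get()) {
            sevenZ.getEntries().forEach(entry -> System.out.println(entry.getName()));
        }
    }
}
```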

* fix: reformat

* fix: AbstractArchiveBuilder javadoc

* fix: extract PAX_NAME_KEY and PAX_LINK_NAME_KEY

* fix: merge errors

* Empty commit to trigger CI

* fix: replace `readRange` + size check with `IOUtils.toByteArray`

Simplifies PAX header parsing by using `IOUtils.toByteArray` from Commons IO instead of `IOUtils.readRange` followed by a manual size check.

The previous size check was effectively unreachable: oversized PAX key/value pairs are validated earlier, and truncated reads are already handled by the input stream. This change improves code clarity and test coverage without altering behavior.

* simplify file name parsing in `SevenZFile`

This change streamlines file name parsing by:

* Reading directly from the existing `ByteBuffer` rather than copying data into a temporary byte array.
* Replacing the temporary byte array with a `StringBuilder` to accumulate the decoded UTF-16 characters.

This reduces intermediate allocations and simplifies the code path without changing behavior.
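
A hedged sketch of the decoding loop this describes, assuming the header `ByteBuffer` is already set to little-endian order as elsewhere in the 7z parser:

```java
import java.nio.ByteBuffer;

import org.apache.commons.compress.archivers.ArchiveException;

final class NameDecodingSketch {

    /** Decodes one NUL-terminated UTF-16LE file name directly from the header buffer. */
    static String readName(final ByteBuffer header) throws ArchiveException {
        final StringBuilder name = new StringBuilder();
        while (true) {
            if (header.remaining() < 2) {
                throw new ArchiveException("Corrupted 7z archive");
            }
            final char c = header.getChar(); // one UTF-16 code unit, honoring the buffer's byte order
            if (c == 0) {
                return name.toString();
            }
            name.append(c); // surrogate pairs arrive as two consecutive code units and append correctly
        }
    }
}
```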

* fix: PMD failure

* Restrict visibility of unintentionally exposed APIs (#730)

This PR removes or hides several members that leaked into the public surface, were never meant for external use and were **never released**. There is **no functional behavior change** to the library; this only corrects API visibility and duplication.

## What changed

* **`SevenZFile.SOFT_MAX_ARRAY_LENGTH`**

  * **Change:** Removed.
  * **Rationale:** Duplicates the constant already available in Commons IO’s `IOUtils`.

* **`SevenZFile.toNonNegativeInt(...)`**

  * **Change:** Visibility reduced (internal helper).
  * **Rationale:** Not part of the supported API; only used internally.

* **`SeekableInMemoryByteChannel.getSize()`**

  * **Change:** Removed public alias.
  * **Rationale:** Only used in tests; behavior diverges from `size()` after channel closure and shouldn’t be exposed.

* **`ElementValue.BYTES`**

  * **Change:** Migrated to caller class.
  * **Rationale:** Had a single call site in another package; not a public contract.

* Post merge clean ups

- Use final
- Fix errant curly
- Sort members
- Reduce vertical space

* Declare `IOException` on archive `InputStream` constructors (#731)

> [!CAUTION]
> **Source-incompatible** (callers may need to add `throws IOException` or a catch).
> **Binary-compatible** (the `throws` clause isn’t part of the JVM descriptor).

Several `ArchiveInputStream` implementations either

- must read/validate bytes up front (e.g., magic headers), or
- may fail immediately when the underlying stream is unreadable.

Today we’re inconsistent:

* Formats **without a global signature** (e.g., **CPIO**, **TAR**) historically didn’t read in the constructor, so no `IOException` was declared.
* Other formats that **do need early bytes** either wrapped `IOException` in `ArchiveException` (**ARJ**, **DUMP**) or deferred the read to the first `getNextEntry()` (**AR**, **ZIP**).

This makes error handling uneven for users and complicates eager validation.

* All archive `InputStream` constructors now declare `throws IOException`.

* **ARJ** and **DUMP**: stop wrapping `IOException` in `ArchiveException` during construction; propagate the original `IOException`.

* **AR**: move reading of the global signature into the constructor (eager validation).

No behavioral change is intended beyond surfacing `IOException` at construction time, where appropriate.

For the ARJ format this was discussed in #723 (comment).

> [!NOTE]
> Version `1.29.0` already introduces source-incompatible changes in other methods, by adding checked exceptions.
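
A minimal caller-side sketch of what this means in practice (the stream and entry types are the existing public ones; only the error-handling shape is the point):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.commons.compress.archivers.arj.ArjArchiveEntry;
import org.apache.commons.compress.archivers.arj.ArjArchiveInputStream;

public class EagerValidationExample {

    /** Construction-time failures now surface as IOException instead of being wrapped or deferred. */
    static ArjArchiveEntry firstEntry(final Path archive) throws IOException {
        try (InputStream in = Files.newInputStream(archive);
                ArjArchiveInputStream arj = new ArjArchiveInputStream(in)) {
            return arj.getNextEntry();
        }
    }
}
```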

* Javadoc

* Rename private instance variable

* 7z: enforce reference limits on `Folder` parsing (#729)

7z: enforce reference limits on `Folder` parsing

This change aligns `Folder` parsing with the limits defined in the official 7-Zip implementation
([`7zIn.cpp`](https://github.com/ip7z/7zip/blob/main/CPP/7zip/Archive/7z/7zIn.cpp)):

* Maximum coders per folder: **64**
* Maximum input streams per coder: **64**
* Maximum output streams per coder: **1**
* Maximum total input streams per folder: **64**

These bounds are consistent with the reference behavior and are safe because:

* Other 7z implementations use the same or stricter limits.
* No supported coder uses multiple inputs or outputs.
* Custom coder definitions are not supported in this implementation.

By enforcing these limits, the parser becomes simpler and more predictable,
and redundant dynamic size checks can be removed.
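
A sketch of these bounds as a single check; the constant and method names are illustrative, and the exact exception message may differ from the real parser:

```java
import org.apache.commons.compress.archivers.ArchiveException;

final class FolderLimitsSketch {

    // Limits taken from the reference 7-Zip implementation (7zIn.cpp).
    static final int MAX_CODERS_PER_FOLDER = 64;
    static final int MAX_INPUT_STREAMS_PER_CODER = 64;
    static final int MAX_OUTPUT_STREAMS_PER_CODER = 1;
    static final int MAX_INPUT_STREAMS_PER_FOLDER = 64;

    static void checkFolderLimits(final int coders, final int coderInputs,
            final int coderOutputs, final int totalInputs) throws ArchiveException {
        if (coders > MAX_CODERS_PER_FOLDER
                || coderInputs > MAX_INPUT_STREAMS_PER_CODER
                || coderOutputs > MAX_OUTPUT_STREAMS_PER_CODER
                || totalInputs > MAX_INPUT_STREAMS_PER_FOLDER) {
            throw new ArchiveException("Unsupported 7z archive");
        }
    }
}
```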

* Improve sparse file handling performance (#715)

* Improve sparse file handling performance

Previously, sparse files were processed recursively. On highly fragmented files, this led to deep recursion and significant inefficiency.

This change replaces the recursive approach with an iterative strategy, which scales better for files with many fragments. It also introduces generated tests that simulate sparse files with very high fragmentation to ensure correctness and performance under stress.

* fix: remove unused method

* fix: simplify input streams

* fix: error message

* Fix failing tests

* Sort members

* Simplify writing ustar trailer

- Javadoc

* ARJ: correct byte accounting and truncation errors (#723)

* ARJ: correct byte accounting and truncation errors

* `getBytesRead()` could drift from the actual archive size after a full read.
* Exceptions on truncation errors were inconsistent or missing.
* `DataInputStream` (big-endian) forced ad-hoc helpers for ARJ’s little-endian fields.

* **Accurate byte accounting:** count all consumed bytes across main/file headers, variable strings, CRCs, extended headers, and file data. `getBytesRead()` now matches the archive length at end-of-stream.
* **Consistent truncation handling:**

  * Truncation in the **main (archive) header**, read during construction, now throws an `ArchiveException` **wrapping** an `EOFException` (cause preserved).
  * Truncation in **file headers or file data** is propagated as a plain `EOFException` from `getNextEntry()`/`read()`.
* **Endianness refactor:** replace `DataInputStream` with `EndianUtils`, removing several bespoke helpers and making intent explicit.

* Add assertion that `getBytesRead()` equals the archive size after full consumption.
* Parameterized truncation tests at key boundaries (signature, basic/fixed header sizes, end of fixed/basic header, CRC, extended-header length, file data) verifying the exception contract above.
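
A hedged sketch of the pattern this implies: little-endian fields read via Commons IO's `EndianUtils`, with the consumed byte count tracked explicitly (the field and helper names are assumptions):

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.io.EndianUtils;

final class ByteAccountingSketch {

    private long bytesRead;

    /** Reads an unsigned little-endian 16-bit field and counts the two consumed bytes. */
    int readUInt16(final InputStream in) throws IOException {
        final int value = EndianUtils.readSwappedUnsignedShort(in);
        bytesRead += 2;
        return value;
    }

    /** Reads an unsigned little-endian 32-bit field and counts the four consumed bytes. */
    long readUInt32(final InputStream in) throws IOException {
        final long value = EndianUtils.readSwappedUnsignedInteger(in);
        bytesRead += 4;
        return value;
    }

    long getBytesRead() {
        return bytesRead;
    }
}
```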

* fix: failing legacy test

* fix: checkstyle error

* fix: remove `EndianUtils` static import

The static import makes it harder to distinguish calls that need to count bytes from those that do not.

* Fix failing test

* Sort methods

* Remove unused method

* Refactor common test pattern

Use final

* ARJ: strict header validation and selfExtracting option (#728)

* ARJ: correct byte accounting and truncation errors

* `getBytesRead()` could drift from the actual archive size after a full read.
* Exceptions on truncation errors were inconsistent or missing.
* `DataInputStream` (big-endian) forced ad-hoc helpers for ARJ’s little-endian fields.

* **Accurate byte accounting:** count all consumed bytes across main/file headers, variable strings, CRCs, extended headers, and file data. `getBytesRead()` now matches the archive length at end-of-stream.
* **Consistent truncation handling:**

  * Truncation in the **main (archive) header**, read during construction, now throws an `ArchiveException` **wrapping** an `EOFException` (cause preserved).
  * Truncation in **file headers or file data** is propagated as a plain `EOFException` from `getNextEntry()`/`read()`.
* **Endianness refactor:** replace `DataInputStream` with `EndianUtils`, removing several bespoke helpers and making intent explicit.

* Add assertion that `getBytesRead()` equals the archive size after full consumption.
* Parameterized truncation tests at key boundaries (signature, basic/fixed header sizes, end of fixed/basic header, CRC, extended-header length, file data) verifying the exception contract above.

* fix: failing legacy test

* fix: checkstyle error

* fix: remove `EndianUtils` static import

The static import makes it harder to distinguish calls that need to count bytes from those that do not.

* ARJ: strict header validation and `selfExtracting` option

Today `ArjArchiveInputStream` keeps scanning past invalid headers, assuming self-extracting stubs. That can hide corruption.

This PR:

* Introduces a `selfExtracting` ARJ archive option (default **false**).

  * **false:** no scanning; parse strictly from the first byte. Any invalid/truncated header fails fast.
  * **true:** scan only to locate the Main Archive Header (AMH), then switch to **strict mode**. All subsequent headers must be contiguous and valid.

**Behavioral change**

Previously, we might “skip over” bad data. Now we **only** allow a discovery scan for AMH (when opted in); everything after must validate or fail.

* fix: simplify rethrowing

* Fix failing test

* Sort methods

* Remove unused method

* Sort members

* Fix remove unused method

* Extract `MAX_BASIC_HEADER_SIZE` constant

* fix: exception message

* Javadoc

- Normalize error messages
- Use final
- Reduce vertical whitespace

* 7z: unsigned number parsing and improved header validation (#734)

* 7z: unsigned number parsing and improved header validation

The 7z file format specification defines only **unsigned numbers** (`UINT64`, `REAL_UINT64`, `UINT32`). However, the current implementation allows parsing methods like `readUint64`, `getLong`, and `getInt` to return negative values and then handles those inconsistently in downstream logic.

This PR introduces a safer and more specification-compliant number parsing model.

### Key changes

* **Strict unsigned number parsing**

  * Parsing methods now *never* return negative numbers.
  * `readUint64`, `readUint64ToIntExact`, `readRealUint64`, and `readUint32` follow the terminology from `7zFormat.txt`.
  * Eliminates scattered negative-value checks that previously compensated for parsing issues.

* **Improved header integrity validation**

  * Before large allocations, the size is now validated against the **actual available data in the header** as well as the memory limit.
  * Prevents unnecessary or unsafe allocations when the archive is corrupted or truncated.

* **Correct numeric type usage**

  * Some fields represent 7z numbers as 64-bit values but are constrained internally to Java `int` limits.
  * These are now declared as `int` to signal real constraints in our implementation.

* **Consistent error handling**
  Parsing now throws only three well-defined exception types:

  | Condition                                                              | Exception                                    |
  | ---------------------------------------------------------------------- | -------------------------------------------- |
  | Declared structure exceeds `maxMemoryLimitKiB`                         | `MemoryLimitException`                       |
  | Missing data inside header (truncated or corrupt)                      | `ArchiveException("Corrupted 7z archive")`   |
  | Unsupported numeric values (too large for implementation) | `ArchiveException("Unsupported 7z archive")` |

  Note: `EOFException` is no longer used: a header with missing fields is not “EOF,” it is **corrupted**.

This PR lays groundwork for safer parsing and easier future maintenance by aligning number handling with the actual 7z specification and making header parsing behavior *predictable and robust*.

* fix: explain 3 bytes for Folder

* fix: comment memory check again

* fix: remove unused import

---------

Co-authored-by: Gary Gregory <garydgregory@users.noreply.github.com>

* Normalize some error messages

* Simplify `PackingUtils` using `IOUtils` (#737)

This change simplifies `PackingUtils` by leveraging `IOUtils` from Commons IO for common I/O operations, reducing boilerplate and improving readability.

* Fix grammar

* 7z: optimize header loading (#735)

* 7z: optimize header loading

This change improves the efficiency of 7z header parsing:

* Reads the **Signature Header** in a single ByteBuffer instead of multiple small reads, reducing overhead.
* Uses a `MappedByteBuffer` to load the **Next Header** when the archive is backed by a `FileChannel`, improving performance for large headers by avoiding unnecessary copies.

No new tests are added, as the existing test suite already exercises the affected header loading paths sufficiently.
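
A hedged sketch of the `FileChannel` fast path; offsets, sizes, and the fallback shape are assumptions, and only the mapping idea is the point:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.SeekableByteChannel;

import org.apache.commons.compress.archivers.ArchiveException;

final class NextHeaderLoadingSketch {

    static ByteBuffer loadNextHeader(final SeekableByteChannel channel, final long offset,
            final long size) throws IOException {
        if (channel instanceof FileChannel) {
            // Zero-copy: the OS pages the header in on demand instead of copying it to the heap.
            return ((FileChannel) channel).map(FileChannel.MapMode.READ_ONLY, offset, size);
        }
        // Fallback for non-file channels: read the header into a heap buffer.
        final ByteBuffer buffer = ByteBuffer.allocate(Math.toIntExact(size));
        channel.position(offset);
        while (buffer.hasRemaining() && channel.read(buffer) >= 0) {
            // loop until the buffer is full or EOF
        }
        if (buffer.hasRemaining()) {
            throw new ArchiveException("Corrupted 7z archive");
        }
        buffer.flip();
        return buffer;
    }
}
```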

* fix: rename `computeChecksum` to `crc32`

* fix: improve error messages

* Deprecate `IOUtils.readFully` and `IOUtils.skip` (#736)

* Deprecate `IOUtils.readFully` and `IOUtils.skip`

This change deprecates `readFully` and `skip` in `o.a.c.compress.utils.IOUtils` in favor of their Commons IO counterparts. Beyond code reuse, this offers two benefits:

* `readFully` had several overloads with inconsistent semantics. All read as many bytes as possible (like `read`), but only one threw on EOF (as `readFully` should).
* `skip` previously used a per-call byte buffer, unlike the Commons IO version. Since apache/commons-io#801 upstreamed the concurrency fix, this workaround is no longer needed.

**Note**: As `o.a.c.compress.utils.IOUtils` is now rarely used, it is always referenced by FQCN to avoid confusion.
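
A small migration sketch showing the intended replacements; the Commons IO methods are referenced by FQCN, matching the note above:

```java
import java.io.IOException;
import java.io.InputStream;

final class IoUtilsMigrationSketch {

    static void readHeader(final InputStream in, final byte[] header, final long toSkip) throws IOException {
        // Before (deprecated): org.apache.commons.compress.utils.IOUtils.readFully(in, header);
        // After: throws EOFException if the stream ends before the buffer is filled.
        org.apache.commons.io.IOUtils.readFully(in, header);

        // Before (deprecated): org.apache.commons.compress.utils.IOUtils.skip(in, toSkip);
        // After: Commons IO skips without allocating a per-call buffer.
        org.apache.commons.io.IOUtils.skip(in, toSkip);
    }
}
```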

* fix: unnecessary qualifier

* fix: remove qualified name

* Use `consume` instead of `skip` (#738)

Replace `IOUtils.skip(in, Long.MAX_VALUE)` with `IOUtils.consume(in)` for clarity and intent, removing the need for a magic constant.

* Use HTTPS in URL

* Test with `commons-io` 2.21.0 RC1

* fix: remove staging repository

---------

Co-authored-by: Piotr P. Karwasz <piotr@github.copernik.eu>
Co-authored-by: Piotr P. Karwasz <pkarwasz-github@apache.org>