Let archive files be embedded in any other archive file. #223

Merged

nishihatapalmer
Contributor

The existing code restricts many archive types such that they can only be read if they are directly inside a file. This pull request uses interfaces supported by those archive types, and adapts WindowReaders to them. This allows us to process these archive types no matter where they appear - in files, embedded in zips, tar.gz, or any nested combination.

Note that this is not a new approach - it's exactly what already exists to support TrueZip reading.

  • New classes to adapt WindowReaders to the various interfaces supported by different archive types.
  • Tests for the new classes (and a test for the existing TrueZip reading classes).

To support random access interfaces for various readers, we have to
wrap a WindowReader from the existing IdentificationRequest, and
provide the data to the archive class using a random access interface
it supports.

Two methods are provided to copy bytes from a WindowReader into either
a byte array (with offset and length) or a ByteBuffer.

The FatReader class lets us read FAT file systems from any WindowReader,
not just ones which are directly inside a file on the file system. We can
nest FAT archives inside zip files, for example.

The SevenZip classes let us instantiate and walk SevenZip archives
whether they are in a file on the file system or embedded inside another
archive type.
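
As an illustration of the two copy methods described above, here is a minimal sketch. It is not the code from this pull request: `ByteSource`, `fromArray` and `copyTo` are hypothetical names, and `ByteSource` is only a stand-in for a WindowReader so that the example is self-contained. The sketch treats the requested length as a maximum, so it copies fewer bytes when fewer are available instead of throwing.

```java
import java.nio.ByteBuffer;

final class WindowCopySketch {

    /** Hypothetical stand-in for a WindowReader: random access to a fixed run of bytes. */
    interface ByteSource {
        long length();
        byte byteAt(long position);
    }

    /** Wraps a byte array as a ByteSource (illustration only). */
    static ByteSource fromArray(final byte[] data) {
        return new ByteSource() {
            @Override public long length() { return data.length; }
            @Override public byte byteAt(long position) { return data[(int) position]; }
        };
    }

    /**
     * Copies at most length bytes from source, starting at position, into dest at offset.
     * Returns the number of bytes copied, or -1 if position is at or past the end of source.
     */
    static int copyTo(ByteSource source, long position, byte[] dest, int offset, int length) {
        if (position < 0 || offset < 0 || length < 0 || offset > dest.length) {
            throw new IndexOutOfBoundsException("Invalid position, offset or length");
        }
        long available = source.length() - position;
        if (available <= 0) {
            return -1;
        }
        // length is a maximum: clamp it to the space in dest and the bytes left in source.
        int toCopy = (int) Math.min(Math.min(length, dest.length - offset), available);
        for (int i = 0; i < toCopy; i++) {
            dest[offset + i] = source.byteAt(position + i);
        }
        return toCopy;
    }

    /** Same idea for a ByteBuffer: fills as much of the buffer's remaining space as possible. */
    static int copyTo(ByteSource source, long position, ByteBuffer dest) {
        if (position < 0) {
            throw new IndexOutOfBoundsException("Negative position");
        }
        long available = source.length() - position;
        if (available <= 0) {
            return -1;
        }
        int toCopy = (int) Math.min(dest.remaining(), available);
        for (int i = 0; i < toCopy; i++) {
            dest.put(source.byteAt(position + i));
        }
        return toCopy;
    }

    public static void main(String[] args) {
        ByteSource source = fromArray(new byte[] {10, 20, 30, 40, 50});
        byte[] dest = new byte[8];
        // Asks for 6 bytes from position 3, but only 2 remain in the source.
        System.out.println(copyTo(source, 3, dest, 2, 6));
    }
}
```

Returning a count (or -1 at the end of the data) follows the convention of `InputStream.read`, which is what most archive libraries expect from a read call.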

 * SevenZipReader - this class adapts the WindowReader to SeekableByteChannel.
 * SevenZipIteratorAdapter - iterates over SevenZip entries, using a
   wrapping class to encapsulate the entry metadata and the stream of
   bytes associated with it.  The previous approach tried to obtain the
   same stream of bytes from the main archive stream, but this forced
   the class to re-implement some logic contained inside the SevenZip
   archive class.
 * SevenZipArchiveHandler - uses the new classes.
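
To show the shape of an adapter to `java.nio.channels.SeekableByteChannel`, here is a minimal read-only sketch. It is not the SevenZipReader from this pull request: `ReadOnlyChannelSketch` is a hypothetical name, and it reads from a plain byte array as a stand-in for a WindowReader so that the example is self-contained.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.NonWritableChannelException;
import java.nio.channels.SeekableByteChannel;

final class ReadOnlyChannelSketch implements SeekableByteChannel {

    private final byte[] data;   // stand-in for the bytes a WindowReader would expose
    private long position;
    private boolean open = true;

    ReadOnlyChannelSketch(byte[] data) {
        this.data = data;
    }

    @Override
    public int read(ByteBuffer dst) throws IOException {
        ensureOpen();
        if (position >= data.length) {
            return -1; // end of data
        }
        // Copy as much as fits in the buffer, but never more than is left.
        int toCopy = (int) Math.min(dst.remaining(), data.length - position);
        dst.put(data, (int) position, toCopy);
        position += toCopy;
        return toCopy;
    }

    @Override
    public long position() throws IOException {
        ensureOpen();
        return position;
    }

    @Override
    public SeekableByteChannel position(long newPosition) throws IOException {
        ensureOpen();
        if (newPosition < 0) {
            throw new IllegalArgumentException("Negative position: " + newPosition);
        }
        // The interface allows a position past the end; reads from there return -1.
        position = newPosition;
        return this;
    }

    @Override
    public long size() throws IOException {
        ensureOpen();
        return data.length;
    }

    // The source is read-only, so both mutating operations are rejected.
    @Override
    public int write(ByteBuffer src) {
        throw new NonWritableChannelException();
    }

    @Override
    public SeekableByteChannel truncate(long size) {
        throw new NonWritableChannelException();
    }

    @Override
    public boolean isOpen() {
        return open;
    }

    @Override
    public void close() {
        open = false;
    }

    private void ensureOpen() throws ClosedChannelException {
        if (!open) {
            throw new ClosedChannelException();
        }
    }

    public static void main(String[] args) throws IOException {
        ReadOnlyChannelSketch channel = new ReadOnlyChannelSketch(new byte[] {1, 2, 3, 4, 5});
        ByteBuffer buffer = ByteBuffer.allocate(4);
        channel.position(3);
        System.out.println(channel.read(buffer)); // bytes read from position 3
        channel.close();
    }
}
```

An archive class that accepts a SeekableByteChannel can then be given such an adapter in place of a channel opened on a file, which is what removes the need for the archive to sit directly on the file system.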
The RarReader class adapts a WindowReader to the IReadOnlyAccess interface,
which lets us read rar files even if they are embedded inside another archive,
instead of only when they are contained directly in files in the file system.

 * The RarReader class implements various interfaces and sub-interfaces
   as private classes, which are required by the rar archive classes.
   Ultimately, we wrap the WindowReader in a class which implements
   IReadOnlyAccess, allowing the rar archive to process rar bytes
   no matter what their source.
 * Make the FatArchiveHandler use the new FatReader class.  This
   should have been committed with the FatReader class originally!
 * Rename ReaderReadonlyFile to TrueZipReader, as this fits in with the
   general pattern of calling the WindowReader adapter classes after
   the archive type they are written for.
 * Use the new tested ArchiveFileUtils methods to copy bytes.  There
   was a bug in the original, as it threw an error if length + offset
   was greater than available bytes - but length should be seen as a
   maximum, not an absolute instruction.
 * Fix bug in seek - where it would set position to after the file
   end.  Should test that position is >= length, not just >.
 * ISO is not yet covered.  The latest code in java-iso-tools supports
   a SeekableInput interface we could easily use to achieve the same
   things we've just done for the other archive types.
 * 2.0.2-SNAPSHOT has what we need, but it isn't released yet.
 * It doesn't look like the author is still releasing and supporting
   this code.  I have asked when a release might be available and had
   no response, and no code has been updated for a couple of years.
 * We might need to fork java-iso-tools to use the new interface.
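
The seek fix listed above (for the TrueZipReader) can be read as a bounds check like the one below. This is only a sketch of one interpretation, not the code from this pull request: `SeekBoundsSketch` is a hypothetical class, and throwing an `IOException` for an out-of-range position is an assumption.

```java
import java.io.IOException;

final class SeekBoundsSketch {

    private final long length;   // total number of bytes in the underlying data
    private long position;

    SeekBoundsSketch(long length) {
        this.length = length;
    }

    void seek(long newPosition) throws IOException {
        if (newPosition < 0) {
            throw new IOException("Seek position is negative: " + newPosition);
        }
        // Valid positions run from 0 to length - 1, so newPosition == length is
        // already past the last byte.  Testing with '>' alone would accept it.
        if (newPosition >= length) {
            throw new IOException("Seek position " + newPosition
                    + " is at or past the end of the data (length " + length + ")");
        }
        position = newPosition;
    }

    long position() {
        return position;
    }

    public static void main(String[] args) throws IOException {
        SeekBoundsSketch reader = new SeekBoundsSketch(10);
        reader.seek(9);                  // last valid position
        System.out.println(reader.position());
        try {
            reader.seek(10);             // one past the last byte
        } catch (IOException expected) {
            System.out.println("Rejected: " + expected.getMessage());
        }
    }
}
```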
@nishihatapalmer
Contributor Author

Seems OK... but the Travis build fails with something to do with JDK 10 being deprecated. Don't think this has anything to do with this pull request - am I missing something here?

@nishihatapalmer
Contributor Author

The one remaining archive type that is still limited to files as a source is ISO.

The last release of java-iso-tools only supports reading from files. The latest SNAPSHOT code in master does have an interface we could use, but this hasn't been released.

I'm not sure this code is being maintained anymore: there has been very little activity for the last 2 years, and no response to questions about a new release.

One option would be to fork java-iso-tools and release it, so we could use the SeekableInput interface.

@LauraDamianTNA
Contributor

Yes, oraclejdk10 is no longer supported. Thanks for the pull request, we'll merge soon.

@LauraDamianTNA LauraDamianTNA merged commit de3ada5 into digital-preservation:master Nov 14, 2018
@nishihatapalmer
Contributor Author

Thanks for merging the PR!

@jcharlet jcharlet added this to the 6.5 milestone Feb 12, 2020