Nested folders in OLE2 container #229

Closed
thorsted opened this issue Jun 19, 2019 · 9 comments

@thorsted

It appears the OLE2 code only looks at the root of the document and, unlike the ZIP engine, can't parse files in nested directories.

@DavidUnderdown

Do you think this would be useful behaviour? I think we're trying to make sure we identify the "top-level" format (since, e.g., a Word doc could have Excel tables embedded). I don't think we've intended to identify every component, whereas in a ZIP file it's typically the files inside that are really of interest.

@thorsted
Author

I am currently working on two formats that would require this behavior. One is an OmniPage OPD file, which has no files at the root, only folders. The other is a Microsoft Home Publishing file, which is mostly identical to an MSWorks 5 file but has a unique folder within the structure. In both cases I would need to identify a file or folder within the internal directories, as with a ZIP.
Samples.zip

@Dclipsham

This is definitely required behaviour. The SIARD 2.1 format identification wouldn't work without this functionality existing for ZIP-based containers, and this issue would bring OLE2 containers in line and facilitate more complex container signature patterns for OLE2 types. It wouldn't adversely affect existing signatures that are using files at root level.

nishihatapalmer added a commit to nishihatapalmer/droid that referenced this issue Nov 20, 2019
   * digital-preservation#229
   * Walks over all internal files inside an OLE2, not just the
     immediate children of the root.
   * Paths are separated by / just like ZIP.
@nishihatapalmer
Contributor

There's a PR for this (#321) which adds the ability to scan all sub-folders and files of an OLE2 container.

I do not have a good container signature using paths to test it with. It certainly iterates over all the sub-folders and files inside an OLE2, but I can't swear it all works until someone tries a real signature with it.

Paths are specified just like in ZIP container files, using / to separate folders. The root does not have a starting /.

For example:

File1
File2
Directory1
Directory1/File3
Directory1/File4
Directory2
Directory2/File5
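
As a rough illustration of the path convention above (not DROID's actual implementation), here is a minimal Python sketch that walks a nested folder structure and yields `/`-separated paths with no leading `/`. The `walk_paths` function and the dict-based tree are hypothetical stand-ins for a real OLE2 reader: a dict represents a folder and `None` represents a file.

```python
from typing import Dict, Iterator, Optional

# Hypothetical in-memory stand-in for an OLE2 directory tree:
# a dict value is a folder (storage), None is a file (stream).
Tree = Dict[str, Optional[dict]]


def walk_paths(tree: Tree, prefix: str = "") -> Iterator[str]:
    """Yield every entry in the tree, depth-first, as a /-separated path."""
    for name, child in tree.items():
        # Entries at the root get no leading "/"; nested entries are
        # joined to their parent path with "/".
        path = f"{prefix}/{name}" if prefix else name
        yield path
        if isinstance(child, dict):
            # Descend into the folder instead of stopping at the root.
            yield from walk_paths(child, path)


container: Tree = {
    "File1": None,
    "File2": None,
    "Directory1": {"File3": None, "File4": None},
    "Directory2": {"File5": None},
}

paths = list(walk_paths(container))
for p in paths:
    print(p)
```

Run against this sample tree, the sketch prints the same seven paths as the listing above, in the same order. A container signature would then match against those path strings rather than against root-level names only.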

@thorsted
Author

Excellent work. This is the format I was trying to write a container signature for when I came across the issue. The OmniPage 10 format has only folders, while versions 12 and 18 have a Data file inside a Version folder. https://github.com/thorsted/pronom-research-week-2019/tree/master/OmniPage

@nishihatapalmer
Contributor

I'm afraid I don't have the tools handy to explore inside OLE2 objects and build a test signature. Can you recommend a tool which can dump the internal streams out to files? I'm on Linux; I can use Windows at a pinch, but it's not set up for dev work.

@Dclipsham

Hi Matt,
Tyler has created a candidate signature for OmniPage and submitted it to the PRONOM Research Week repo: https://github.com/digital-preservation/pronom-research-week-2019/tree/master/OmniPage. You'll see a binary signature file providing minimal mappings for container ID, and a container signature file with subfile patterns.

@thorsted
Author

7-Zip will dump the contents.

jcharlet pushed a commit to nishihatapalmer/droid that referenced this issue Nov 26, 2019
   * digital-preservation#229
   * Walks over all internal files inside an OLE2, not just the
     immediate children of the root.
   * Paths are separated by / just like ZIP.
@jcharlet
Contributor

Hi @thorsted, this is the output I get when checking your OmniPage files with your binary and container signature files.

from #321 (comment)

image

As you can see, it works on OmniPage 9, 12 and 18, which are defined in your signature files, but more work needs to be done to recognize the other versions (10 and 15).
Are you happy with that? Can we close this issue?

@jcharlet added this to the 6.5 milestone Jan 2, 2020