Nested folders in OLE2 container #229
Comments
Do you think this would be useful behaviour? I think we're trying to make sure we identify the "top-level" format (since, e.g., a Word doc could have Excel tables embedded). I don't think we've intended to identify every component, whereas in a ZIP file it's typically the files inside that are really of interest. |
I currently have two formats I am working on which would require this behavior. One is an OmniPage OPD file, which has no files at the root, only folders. The other is a Microsoft Home Publishing file, which is mostly identical to an MSWorks 5 file but has a unique folder within the structure. In both cases I would need to identify a file or folder within the internal directories, as with ZIP. |
This is definitely required behaviour. The SIARD 2.1 format identification wouldn't work without this functionality existing for ZIP-based containers, and this issue would bring OLE2 containers in line and facilitate more complex container signature patterns for OLE2 types. It wouldn't adversely affect existing signatures that are using files at root level. |
* digital-preservation#229 * Walks over all internal files inside an OLE2, not just the immediate children of the root. * Paths are separated by / just like ZIP.
There's a PR for this ( #321 ) which adds the ability to scan all sub-folders and files of an OLE2 container. I do not have a good container signature using paths to test it with. It certainly iterates over all the sub-folders and files inside an OLE2, but I can't swear it all works until someone tries a real signature with it. Paths are specified just like in ZIP container files, using / to separate folders. The root does not have a starting /. For example:
|
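As a rough illustration of the path convention described above (this is not the PR's code and not a real container signature), here is a minimal Python sketch. It assumes a hypothetical in-memory model in which a dict stands for an OLE2 storage (folder) and bytes stands for a stream (file); the names `walk_entries`, `matches_signature` and the `sample` layout are made up for the example. Paths are joined with / and the root has no starting /.

```python
from typing import Iterator, List, Mapping, Tuple

# Hypothetical stand-in for an OLE2 container (an assumption for this sketch):
# a dict is a storage (folder), a bytes value is a stream (file).
sample = {
    "Version": {"Data": b"\x00"},   # nested stream, as described for OmniPage 12 and 18
    "Summary": b"\x00",             # invented root-level stream
}


def walk_entries(storage: Mapping, prefix: str = "") -> Iterator[Tuple[str, bool]]:
    """Yield (path, is_folder) for every entry at any depth.

    Paths use "/" as the separator and never start with "/".
    """
    for name, child in storage.items():
        path = f"{prefix}/{name}" if prefix else name
        if isinstance(child, dict):
            yield path, True
            # Recurse so entries below the root's immediate children are seen too.
            yield from walk_entries(child, path)
        else:
            yield path, False


def matches_signature(container: Mapping, required_paths: List[str]) -> bool:
    """Return True if every required path exists somewhere in the container."""
    present = {path for path, _ in walk_entries(container)}
    return all(p in present for p in required_paths)


if __name__ == "__main__":
    for path, is_folder in walk_entries(sample):
        print(("folder " if is_folder else "file   ") + path)
    print(matches_signature(sample, ["Version/Data"]))
```

In this sketch a signature that names `Version/Data` matches, while one that names only `Data` does not, because `Data` is not at the root.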
Excellent work. This is the format I was trying to write a container signature for when I came across the issue. The OmniPage 10 format only has folders, and versions 12 & 18 have a Data file inside a Version folder. https://github.com/thorsted/pronom-research-week-2019/tree/master/OmniPage |
I'm afraid I don't have the tools to explore inside OLE2 objects handy to try to build a test signature. Can you recommend a tool which can dump the internal streams out to files? I'm on Linux, and can use Windows at a pinch, but it's not set up for dev work. |
Hi Matt, |
7-Zip will dump the contents.
Hi @thorsted, this is the output I get when checking your OmniPage files with your binary and container signature files (from #321 (comment)). As you can see, it works on OmniPage 9, 12 and 18, which are defined in your signature files, but more work needs to be done to recognize the other versions (10 and 15). |
It appears the code for OLE2 only looks at the root of the document and, unlike the ZIP engine, can't parse files in nested directories.