Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

match signatures in OLE2 sub folders #321

Conversation

nishihatapalmer
Copy link
Contributor

@adamretter
Copy link
Contributor

@nishihatapalmer thanks this looks great! Would you be able to add some unit tests too please?

@nishihatapalmer
Copy link
Contributor Author

I honestly don't think I have time to write unit tests for this - I'm starting a new job on Monday, so my free time will disappear.

Well - the unit tests would actually be very simple - they could only test that a container signature with sub files or folders actually matches a file that has those properties. So the code is easy.

The hard part is developing (preferably more than one) container signatures that matches actual files (and doesn't match others), so that the engine can run over them to validate they work. I have no experience developing signatures, and nor do I have suitable files or tools to do this with.

If there's some clever way to mock objects to avoid having to make real signatures for real files, then I guess that might work (although we'll still need test files with the right properties). I'm not too up on how mocking works (and I'm not sure it solves this problem).

@nishihatapalmer
Copy link
Contributor Author

I can see on the issue there's a sample container file signature for an actual OLE2 file. I'll have a look at using this and see how I go - but it should test more than one signature preferably.

@nishihatapalmer
Copy link
Contributor Author

nishihatapalmer commented Nov 22, 2019

I can confirm that a manual test of this using the Omnipage container signatures and documents linked to in the associated issue doesn't work!.

All documents are identified as both Omnipage 12 and Omnipage 18, but by their file extension, not the container signature.

I think this has to do with putting an ending / on a file path. I think either:

Directory1
Directory1/

should be valid names for a directory, but the code assumes the first, and the signatures are providing the second form.

@nishihatapalmer
Copy link
Contributor Author

nishihatapalmer commented Nov 22, 2019

Well, the ending path separator issue is certainly there, but that's not what's causing this to fail.

Looks like the signature files provided don't recognise the OLE2 base format, so we never trigger the OLE2IdentifierEngine in the first place. I'll see if I can hack together a working set of signature files from them all.

@nishihatapalmer
Copy link
Contributor Author

My bad, I'd downloaded the HTML page instead of the actual files...

@nishihatapalmer
Copy link
Contributor Author

Well, there are bugs in this code still. The actual walking of the internal objects works, but it fails to get a document stream, as the call we're using on the POIFSfilesystem only gets streams from children of the root.

Have to modify our code so we can get streams from any part of the POIFSfilesystem.

@nishihatapalmer
Copy link
Contributor Author

Latest code fixes the streaming issue, and we get a container signature identification for the Omnipage files using internal paths inside OLE2.

Still need to resolve the path ending issue (I modified the signature files to strip out ending /) - but this may not be the right solution.

@nishihatapalmer
Copy link
Contributor Author

nishihatapalmer commented Nov 22, 2019

I'll try to find the time to do a bit more manual testing, and I'll have a think about the path ending / issue.

Not sure I'll find the time to automate all of this with unit tests, but at least we'll have manual confirmation that the code works.

@nishihatapalmer
Copy link
Contributor Author

Latest code matches container signatures including internal paths.

Also matches regardless of whether a directory has a trailing / or not (people do both in practice).

@nishihatapalmer
Copy link
Contributor Author

nishihatapalmer commented Nov 23, 2019

Added unit tests for OLE2.

Turns out all the existing unit tests were flagged as @ignore

Everything was throwing null pointer exceptions. This turns out to be a result of using uninitialised objects. It's fine normally in DROID as spring sets up all the internal dependencies, but if you constructed objects programmatically, lots broke.

I've made the small changes required to ensure that internal objects are not null when constructed programmatically. The existing tests (bar one) have been re-enabled, and I've added a simple test that looks for sub-folders and files inside OLE2 containers.

It would be good to add a test for a container signature with internal binary sigs, but the current code makes that quite hard. Once we get the new signature compiling capabilities (in another PR), this could be added quite easily later.

@jcharlet
Copy link
Contributor

so it depends on #305 which is almost ready

@jcharlet
Copy link
Contributor

jcharlet commented Nov 26, 2019

ci failed cause of bug on Travis. Fixed on master, will rebase master onto this branch to fix it.

   * digital-preservation#229
   * Walks over all internal files inside an OLE2, not just the
     immediate children of the root.
   * Paths are separated by / just like ZIP.
…ng /

   * Some people specify directories with a final /, and some don't.
   * This change ensures that paths can be matched whether they include
     a final / or not.
  * When a ContainerIdentifierInit is used to parse signatures,
    it initializes its members, but if one is contructed programmatically
    (e.g. for testing container identifiers) and a signature file is
    not parsed to build the identifier engine, the list was null.
  * This object is normally instantiated by the spring system.
  * If you construct a container identifier programmatically,
    lots of null pointer exceptions occur.  This just ensures
    that this internal object is initialised correctly, even
    if you aren't using spring to set things up.
  * Spring sets these things up for DROID normally, but if you
    need to construct them programmatically, they need to know what
    their identifier engine is.
   * These tests were flagged as ignore.
   * Everything was breaking with null pointer exceptions.
   * Previous commits fixes those NPEs (construct objects properly
     without needing spring to fix things up).
   * Add a test for OLE2 signatures with internal paths.
it was using an outdated version of the signatures, so changed the test
to verify it's dealing with a Microsoft related PUID
@jcharlet
Copy link
Contributor

tested locally on the sample file from #229 (comment)
but it doesn't seem to work. it should be expanding the opd file right @nishihatapalmer ?

image

image

@nishihatapalmer
Copy link
Contributor Author

nishihatapalmer commented Nov 26, 2019

@jcharlet No - it shouldn't expand - it's not an archive handler, it's a container file format (the internal "files" are used to identify the overall parent format, not listed under the parent.

However - it's not identifying an Omnipage 10 file - it just thinks it's an OLE2 file (which it is - but we want it to go further).

The signature I used identified the version 12 and 18 Omnipage formats. I didn't try it on an Omnipage 10 file. Does it work on 12 or 18 opds?

@nishihatapalmer
Copy link
Contributor Author

Did you put in the new signature files too? The default signature files have no support for Omnipage. There's a binary and container signature file stored along the sample opd files linked to from the issue.

@jcharlet
Copy link
Contributor

yes included the new signature

so what's the expected output?

image

image

@nishihatapalmer
Copy link
Contributor Author

The expected output is that the files are identified as Omnipage 12 or 18 files. This doesn't seem to be working for you.

One of the unit tests in Ole2RootFileTest (testIdentifyOLE2ContainersWithInternalPaths) tries to identify using internal paths (but crucially, doesn't run any binary signatures inside them). Does that unit test work for you?

@nishihatapalmer
Copy link
Contributor Author

You aren't using the Omnipage container file, only the binary file.

Use the Omnipage container file instead and it should work.

@nishihatapalmer
Copy link
Contributor Author

I mean - use the Omnipage binary file and the Omnipage container file. Both are needed to identify all the samples.

@jcharlet
Copy link
Contributor

image

a bit better, does work on 12 and 18!
Although omnipage10 is detected as version 12, and omnipage15 as 18. Trying to investigate why

@nishihatapalmer
Copy link
Contributor Author

nishihatapalmer commented Nov 26, 2019 via email

@jcharlet
Copy link
Contributor

Yes, ok makes complete sense. So that PR is complete. I just need to fix the CI, and will be able to merge after

@jcharlet jcharlet merged commit 0d405ae into digital-preservation:master Nov 26, 2019
@jcharlet jcharlet deleted the ScanNestedOLE2ContainerFiles branch November 26, 2019 17:07
@thorsted
Copy link

Thank you guys for the work on this. The signature appears to work as intended, from the image sent to me, There may be more work needed to define versions better, but I was unable to test with current functionality. I look forward to updating the signature before it is published in PRONOM.

@jcharlet
Copy link
Contributor

@thorsted if this can help for your test, I can send you DROID's packaged binary. After as you probably know, you just need to put the binary and container files in the right folder, select it in droid's settings (can show you how if needed).

@thorsted
Copy link

@jcharlet I would love to test out the updated binary.

@jcharlet
Copy link
Contributor

@thorsted https://we.tl/t-6fDDuxQ9FL here you go

@thorsted
Copy link

@jcharlet Tested and works as expected. Looking forward to adjusting other signatures to take advantage of this enhancement. Let me know when there is a snapshot ready for #322 as well. Need to submit OLM signature. Thanks!

@jcharlet jcharlet added this to the 6.5 milestone Jan 2, 2020
@jcharlet jcharlet changed the title ENHANCE: match signatures in OLE2 sub folders match signatures in OLE2 sub folders Feb 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants