Full implementation of standard signature syntax in container signatures #305

Merged
jcharlet merged 143 commits into master from droidsyntaxparser
Dec 6, 2019

Conversation

jcharlet
Contributor

jcharlet commented Nov 7, 2019

@nishihatapalmer

These classes can parse PRONOM and container signature syntax, producing an Abstract Syntax Tree (AST). A compiler for the AST produces ByteSequence, SubSequence and SideFragment objects that can be used in DROID to search for those signatures.

solves #237

  - currently splits subsequences as it goes - not sure if this is
    the right approach.  The parser should probably just represent
    the expression passed in as a single parse tree.  The compiler
    should worry about how to split this up.
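The separation suggested in that note (the parser builds one tree for the whole expression, the compiler decides where to split) might look roughly like the sketch below. It is only an illustration: SplitSketch and its methods are hypothetical names, not classes from this PR, the "parse tree" is simplified to a flat token list, and it assumes the * wildcard is what separates subsequences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the DROID code base.
class SplitSketch {

    /** Parser stage: represent the whole expression as one flat node list (a stand-in for a parse tree). */
    static List<String> parse(String expression) {
        List<String> nodes = new ArrayList<>();
        for (String token : expression.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                nodes.add(token);
            }
        }
        return nodes;
    }

    /** Compiler stage: split the parsed nodes into subsequences at each "*" wildcard. */
    static List<List<String>> splitIntoSubsequences(List<String> nodes) {
        List<List<String>> subsequences = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String node : nodes) {
            if (node.equals("*")) {
                // A wildcard closes the current subsequence (if it has any content).
                if (!current.isEmpty()) {
                    subsequences.add(current);
                }
                current = new ArrayList<>();
            } else {
                current.add(node);
            }
        }
        if (!current.isEmpty()) {
            subsequences.add(current);
        }
        return subsequences;
    }

    public static void main(String[] args) {
        List<String> tree = parse("01 02 * 03 04 05");
        System.out.println(splitIntoSubsequences(tree)); // prints [[01, 02], [03, 04, 05]]
    }
}
```

Keeping the split out of the parser means the same parse result could be compiled differently for different targets.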
the results of compilation.

These either return immutable objects, or make defensive copies.
  * There are four PRONOM signatures which currently cannot be parsed,
    as they use unofficial syntax.  Waiting for TNA to respond on what
    they want to do / support for the PRONOM language officially.
  * This is used to test that the PRONOM parser is capable of reading
    known good and real-world signatures.
 * Some of these are failing at present as they aren't yet supported by
   the parser.

   - string literals in ranges ['a'-'z']
   - arbitrary sets of bytes [2227]
   - the & bitwise operator [&01]
   - ranges use both hyphens and colons to separate them in different signatures.
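To make the four constructs in that list concrete, here is a minimal sketch of how each could be modelled as a single-byte matcher. The names ByteMatcher and ByteMatchers are hypothetical and not taken from DROID. Two readings are assumptions: that [2227] means the set of the two bytes 0x22 and 0x27, and that [&01] matches any byte in which all bits of the mask are set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: these types are illustrative and are not DROID's classes.
interface ByteMatcher {
    boolean matches(int unsignedByte);
}

class ByteMatchers {

    // A range operand is either a quoted character ('a') or two hex digits (61);
    // the separator may be a hyphen or a colon.
    private static final Pattern RANGE =
            Pattern.compile("('.'|[0-9A-Fa-f]{2})[-:]('.'|[0-9A-Fa-f]{2})");

    /** Inclusive range of byte values. */
    static ByteMatcher range(int low, int high) {
        return b -> b >= low && b <= high;
    }

    /** Arbitrary set of byte values; [2227] is read here as the bytes 0x22 and 0x27. */
    static ByteMatcher anyOf(Integer... values) {
        Set<Integer> set = new HashSet<>(Arrays.asList(values));
        return b -> set.contains(b);
    }

    /** Bitmask such as [&01]: assumed to match bytes with all of the mask's bits set. */
    static ByteMatcher allBits(int mask) {
        return b -> (b & mask) == mask;
    }

    /** Parses a range body such as 'a'-'z' or 61:7A, accepting either separator. */
    static ByteMatcher parseRange(String body) {
        Matcher m = RANGE.matcher(body);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a range: " + body);
        }
        return range(toByte(m.group(1)), toByte(m.group(2)));
    }

    private static int toByte(String operand) {
        return operand.startsWith("'")
                ? operand.charAt(1) & 0xFF          // quoted character literal
                : Integer.parseInt(operand, 16);    // two hex digits
    }
}
```

With this sketch, ByteMatchers.parseRange("'a'-'z'") and ByteMatchers.parseRange("61:7A") produce matchers that accept the same bytes.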
…sigs.

  * Haven't validated that it is correct yet, only that it can process
    all of them without error.
@jcharlet
Contributor Author

jcharlet commented Nov 7, 2019

@nishihatapalmer the demo for rebasing:
Your commits:
Screenshot from 2019-11-07 10-54-43

My simple rebase command, run from the feature branch, worked right away without any conflicts:
git rebase master
(roughly equivalent to git pull --rebase . master)
Screenshot from 2019-11-07 10-55-58

The result:
Screenshot from 2019-11-07 10-55-00

I think it's better for maintenance because when we merge back onto master, we don't have crazy branches all over the place. You can also merge with rebase on GitHub and have a clean history without even needing a merge commit.

@jcharlet
Contributor Author

jcharlet commented Nov 7, 2019

If you want, I'll let you do it and push to upstream to practice :). You will have to force push it to overwrite the history.

git checkout droidsyntaxparser
git rebase master
git push upstream droidsyntaxparser -f

I'll start fixing the checkstyle errors for Travis and add documentation, to better understand your work.

  * Use a strategy pattern to encapsulate different strategies on
    what kinds of elements can appear in anchor sequences.
  * PRONOM can only support bytes in anchors.
  * DROID can support anything in anchors, but it can lead to performance
    problems if sets or bitmasks are too big, so we limit the size.
  * If we can't find an anchoring sequence for DROID given the size limits,
    we remove all restrictions on size and look again.

The advantage of this approach is that we get strict compliance for PRONOM,
but have a two-fold strategy for DROID - one which will probably give
better performance than the PRONOM strategy, and a fallback position for
signatures for which no anchoring sequence could be found (there are
three signatures which fall into this category at present - the [&01]
signatures with no other bytes in them).
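The strategy arrangement described above could be sketched as follows. Everything here is an assumption made for illustration: the names, the size limit of 16, the reduction of each sequence element to the number of distinct byte values it matches, and the "longest run of allowed elements" rule for choosing an anchor. It is not the code from this PR.

```java
// Hypothetical sketch: names, the size limit and the "longest run" rule are assumptions.
class AnchorSketch {

    /** Strategy: may an element matching this many distinct byte values sit in an anchor? */
    interface AnchorStrategy {
        boolean allows(int matchingByteCount);
    }

    static final AnchorStrategy PRONOM = count -> count == 1;          // plain bytes only
    static final AnchorStrategy DROID_LIMITED = count -> count <= 16;  // 16 is an arbitrary illustrative limit
    static final AnchorStrategy DROID_UNLIMITED = count -> true;       // fallback: no size restriction

    /** Returns the length of the longest run of consecutive elements the strategy allows. */
    static int longestAnchor(int[] matchCounts, AnchorStrategy strategy) {
        int best = 0;
        int run = 0;
        for (int count : matchCounts) {
            run = strategy.allows(count) ? run + 1 : 0;
            best = Math.max(best, run);
        }
        return best;
    }

    /** Two-step DROID approach: try the size-limited strategy, then drop the limit if nothing was found. */
    static int longestDroidAnchor(int[] matchCounts) {
        int limited = longestAnchor(matchCounts, DROID_LIMITED);
        return limited > 0 ? limited : longestAnchor(matchCounts, DROID_UNLIMITED);
    }
}
```

For a sequence consisting only of a [&01]-style element (128 matching byte values under the reading assumed here), the PRONOM and size-limited strategies find no anchor, and the unrestricted fallback returns an anchor of length 1.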
@nishihatapalmer
Contributor

This doesn't touch on the differences between binary and container syntax. DROID itself doesn't care, but PRONOM won't understand container syntax.

We should probably link to a description of the syntax somewhere, and mention the fact that you can use all of the syntax in either binary or container signatures if you like (but PRONOM won't be able to parse them for binary signatures if you want to submit them to TNA). Of course, nothing stops TNA using sigtool to rewrite a container signature's syntax in a binary-compatible format.

@jcharlet
Contributor Author

Aaaaallright sorry @nishihatapalmer for the confusion, some things are indeed a bit clearer, and I better understood your previous comment. Thanks for taking the time to explain!

> We should probably link to a description of the syntax somewhere,

I mentioned in README.md "PRONOM Syntax also provides details on the regular expression syntax supported by DROID.", which I am going to improve with a relative link on GitHub to the .md file. Isn't that sufficient?

> and the fact that you can use all of the syntax in either binary or container signatures if you like (but PRONOM won't be able to parse them for binary signatures if you want to submit them to TNA). Of course, nothing stops TNA using sigtool to rewrite a container signatures syntax in binary compatible format.

What about adding the following paragraph, after what you offered to put in Signatures management and before the introduction to sigtool ("To work further on signatures, we provide Sigtool [...]")?

The full syntax can be used in either binary or container signatures. However, if you use the new syntax and want to submit those signatures to TNA for inclusion in the PRONOM registry, you will need to compile them for PRONOM using Sigtool.

@jcharlet jcharlet marked this pull request as ready for review November 26, 2019 17:26
@nishihatapalmer
Contributor

nishihatapalmer commented Nov 26, 2019

I'm going to be pedantic here, but it's kind of important.

Only binary signatures which are to be submitted to TNA need to be in binary format.

Container signatures submitted to TNA can - and should - use the full syntax, since PRONOM doesn't compile those, and it allows things which binary signatures don't support.

@nishihatapalmer
Contributor

nishihatapalmer commented Nov 26, 2019

And if you don't intend to submit local binary signatures to TNA, you can use the full container syntax in the Sequence attribute of a binary signature ByteSequence.

@jcharlet
Contributor Author

@nishihatapalmer could you please refine the README? I won't be as clear as you can be, so I'd rather let you do it if you have some time; that will save us from going back and forth here.

@nishihatapalmer
Contributor

nishihatapalmer commented Nov 27, 2019 via email

  * explanation of what kinds of signature exist.
  * explanation of sigtool capabilities
  * explanation of simpler XML format.
@nishihatapalmer
Contributor

nishihatapalmer commented Nov 28, 2019 via email

@jcharlet
Contributor Author

Thanks for those changes!

What about

  • renaming PRONOM Syntax to Signature Syntax.
  • moving the types and syntax sections into that Signature Syntax readme.
  • adding a paragraph below "Since version 6.5, DROID adds some new capabilities to support developing and testing signatures.":

Signature Syntax provides details on the types of signatures and regular expression syntax supported by DROID.

@nishihatapalmer
Contributor

nishihatapalmer commented Nov 28, 2019 via email

@nishihatapalmer
Contributor

nishihatapalmer commented Nov 28, 2019 via email

@nishihatapalmer
Contributor

nishihatapalmer commented Nov 28, 2019 via email

@jcharlet
Contributor Author

jcharlet commented Nov 28, 2019

fixed https://github.com/digital-preservation/droid/blob/droidsyntaxparser/README.md

I'm not sure the current link is actually working come to think of it...

On Thu, 28 Nov 2019, 15:03 Matt Palmer wrote:
> As long as we retain a link to the signature syntax MD from the readme
>
> On Thu, 28 Nov 2019, 14:59 Matt Palmer wrote:
>> Yup, those sound perfect.

@jcharlet
Contributor Author

done!

@nishihatapalmer
Contributor

nishihatapalmer commented Nov 28, 2019 via email

@nishihatapalmer
Contributor

nishihatapalmer commented Nov 28, 2019 via email

@jcharlet
Contributor Author

jcharlet commented Nov 28, 2019

> Looks great. By the way, why does the standard non-JRE version of DROID have a Unix suffix on the filename? It should be completely platform independent.

Hmm, it used to, because we had a Windows version with a JRE and a Unix version without one.
That shouldn't be the case anymore, since we recently put the Windows-specific files back into the bundle without a JRE and removed the Unix suffix.
I just checked master and droidsyntaxparser on the core repo, and it looks good to me: https://github.com/digital-preservation/droid/blob/master/droid-binary/assembly.xml

Maybe I missed something. Where did you see that?

@jcharlet
Contributor Author

> One more comment, the signature syntax file now talks about the sigtool "below". It's not below now as that's still in the original readme. Maybe change to a link back to the readme or the using sigtool documentation?

Sorry about that, fixed.

@jcharlet jcharlet merged commit df8c1ca into master Dec 6, 2019
@jcharlet jcharlet deleted the droidsyntaxparser branch December 6, 2019 10:58
@jcharlet jcharlet changed the title PR: full implementation of standard signature syntax in container signatures Full implementation of standard signature syntax in container signatures Feb 12, 2020