[TIKA-4309] Support MachO Universal as pkg #1947
b9ece83 to 59bc2d1
|
Hm, clearly there's a conflict with tika/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml Lines 427 to 438 in 30e110a and, by extension, with tika/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml Lines 417 to 425 in 30e110a. Also, I'm uncertain how to handle multi-arch executables, except for not returning archs at all. And I'm at a loss as to how to make the tests pass since the multi-package change. Need help |
|
cc @Gagravarr as author of the related change |
|
Apologies if you've already figured this out, but the way the above work is that if |
|
While I did figure that part out, I didn't figure out how to resolve the conflict with ExecutableParser, so I still need help there @tballison :) |
|
The magic for MachO is It looks like you've coded the magic for fat MachO as |
|
Should we treat a fat machO file like a container file and parse its individual components as separate files? I'm not very familiar with this file type, and I'm happy for a "no!" |
and
It's truly a container, and we can do that - a link to an example would be helpful :) In test files, it contains two separate "almost-mach-o" blobs |
Having not thought deeply about this, one option is to leave the mime file as is and add a magic for fat machO that's different from the other |
I thought about that as well: we can read the entire header and each arch header to confirm what we're looking at. Why I paused: even if I read them, it makes no sense, as we can return only one content type. |
|
Y, that makes sense. If we treat it as a container file, though, we could make up/find a mime type for fat machO ( If at all possible, we should try to use magic to distinguish fat machO from the other |
|
If we modify the original definition that we stole from pronom, we'd get both architectures? |
|
This offers an example of how to parse attachments as separate files: https://github.com/apache/tika/blob/main/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java#L386 |
|
It is tricky if you're new to Tika. I can try to help if you can create the skeleton for this file type of:
|
|
Doh, it looks like pronom had an entry for 32bit: https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1491&strPageToDisplay=signatures |
|
I don't see a fat machO in pronom, though. :( |
I've seen
The structure of fat Mach-O is quite vague (and this); only deep validation by code can help. So ideally I'd give ExecutableParser priority and, if it fails, try other magic matching
:) That I've noticed :)
I'd gladly do so, the container is quite simple. It's 0xCAFEBABE + a uint32_t holding the number of headers, and every header just contains cpu/arch/type flags |
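To make the container layout concrete, here is a minimal Java sketch of reading the 32-bit fat header. It is not Tika code: the class `FatMachOHeader` and its `Arch` type are hypothetical names, and the field layout (five big-endian 32-bit fields per entry: cputype, cpusubtype, offset, size, align) is taken from Apple's `<mach-o/fat.h>` rather than from this PR. The 64-bit variant (0xCAFEBABF) uses a different entry size and is not handled here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tika): reads a 32-bit fat Mach-O header.
final class FatMachOHeader {
    static final int FAT_MAGIC = 0xCAFEBABE;
    static final int ARCH_ENTRY_SIZE = 20; // five uint32 fields per fat_arch

    // One fat_arch entry; offset and size are unsigned on disk, so kept as long.
    static final class Arch {
        final int cpuType;
        final int cpuSubType;
        final long offset;
        final long size;
        final int align;

        Arch(int cpuType, int cpuSubType, long offset, long size, int align) {
            this.cpuType = cpuType;
            this.cpuSubType = cpuSubType;
            this.offset = offset;
            this.size = size;
            this.align = align;
        }
    }

    static List<Arch> parse(byte[] data) {
        // The fat header is always big-endian, regardless of the slices inside.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN);
        if (buf.remaining() < 8 || buf.getInt() != FAT_MAGIC) {
            throw new IllegalArgumentException("not a fat Mach-O header");
        }
        long count = Integer.toUnsignedLong(buf.getInt());
        // Reject counts whose entry table cannot fit in the supplied bytes.
        if (count * ARCH_ENTRY_SIZE > buf.remaining()) {
            throw new IllegalArgumentException("truncated fat_arch table");
        }
        List<Arch> archs = new ArrayList<>();
        for (long i = 0; i < count; i++) {
            int cpuType = buf.getInt();
            int cpuSubType = buf.getInt();
            long offset = Integer.toUnsignedLong(buf.getInt());
            long size = Integer.toUnsignedLong(buf.getInt());
            int align = buf.getInt();
            archs.add(new Arch(cpuType, cpuSubType, offset, size, align));
        }
        return archs;
    }
}
```

Each returned `Arch` gives the offset and size of one embedded slice, which is what a container-style parser would need in order to hand each blob on as a separate file.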
|
Adding more precision to the mimetypes makes sense to me. I'd def want @Gagravarr to weigh in. For some mime types, we use attributes to make the description more precise, e.g. |
This is really unpleasant to do currently in Tika. Can we do something like in the gabriel vasile link above in mimetypes? |
|
(source)
// Class matches a java class file.
func Class(in []byte) bool {
    return classOrMachO(in) && in[7] > 30
}
// MachO matches Mach-O binaries format
func MachO(in []byte) bool {
    return classOrMachO(in) && in[7] < 20
}
This approach relies on testing the 8th byte, or in other words for fat Mach-O the To test for Mach-O universal, we could look for 0xCAFEBABE or 0xCAFEBABF, get the offset of the first Mach-O from the first struct, and verify that it's a Mach-O. Does Tika's XML allow "read a uint, then read a second int at the location the first one gives"? |
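The two ideas in this comment (the 8th-byte heuristic and following the first entry's offset) can be sketched in plain Java as below. This is an illustration, not Tika's or the quoted library's implementation: `CafeBabeSniffer` and its method names are hypothetical, the thresholds 30 and 20 are copied from the quoted Go code, and the single-slice magic values (0xFEEDFACE, 0xFEEDFACF and their byte-swapped forms) come from Apple's `<mach-o/loader.h>`. Only the 32-bit fat layout (0xCAFEBABE) is covered.

```java
import java.nio.ByteBuffer;

// Hypothetical detector (not Tika code) for files starting with 0xCAFEBABE.
final class CafeBabeSniffer {
    private static boolean hasCafeBabe(byte[] in) {
        // ByteBuffer reads big-endian by default, matching the on-disk order.
        return in.length >= 8 && ByteBuffer.wrap(in).getInt(0) == 0xCAFEBABE;
    }

    // In a class file, bytes 6-7 hold the major version, which starts at 45.
    static boolean looksLikeJavaClass(byte[] in) {
        return hasCafeBabe(in) && (in[7] & 0xFF) > 30;
    }

    // In a fat Mach-O, bytes 4-7 hold the number of slices, normally small.
    static boolean looksLikeFatMachO(byte[] in) {
        return hasCafeBabe(in) && (in[7] & 0xFF) < 20;
    }

    // Stricter check: read the first fat_arch offset (bytes 16-19) and
    // confirm that a single-architecture Mach-O magic sits at that position.
    static boolean firstSliceIsMachO(byte[] in) {
        if (!looksLikeFatMachO(in) || in.length < 20) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(in);
        long offset = Integer.toUnsignedLong(buf.getInt(16));
        if (offset + 4 > in.length) {
            return false; // slice start lies outside the bytes we were given
        }
        int magic = buf.getInt((int) offset);
        return magic == 0xFEEDFACE || magic == 0xFEEDFACF
                || magic == 0xCEFAEDFE || magic == 0xCFFAEDFE;
    }
}
```

The second check is exactly the "dynamic offset" step: the position of the second read depends on a value obtained by the first, which is why it is easy in code and awkward in a static magic definition.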
this comment is priceless :) |
|
So, I guess we have two routes:
And in any case improve the non-fat Mach-O parsing by extending collection of MIME types. |
|
OMG, that is priceless! Thank you for finding that!
I think we're basically in agreement? This is what I see:
|
Yes, I've expressed myself poorly - you're correct.
Totally, we're on the same page. And I'd propose to do a PR per step. I'll do the step 1 with some extra stuff (I want to try parsing Mach-O type using XML) and submit a separate PR for that. How does that sound? |
Does the XML instruction set allow for a dynamic offset? Like, read a value and shift to that position? |
|
Dynamic offsets aren't currently implemented. Value ranges can be implemented as a regex (less than ideal, but works for some cases). Literal value ranges or greater than, less than etc would be a new feature. Maintainers are standing by...lol... Location ranges: yes definitely, as you probably noticed. See e.g. https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L539 |
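As a rough illustration of "value ranges as a regex", the sketch below expresses "0xCAFEBABE followed by a big-endian count between 1 and 19" as a regular expression over bytes decoded as ISO-8859-1. It uses only `java.util.regex` and makes no claim about how Tika's regex magic matching is implemented; `ByteRangeRegex` is a hypothetical name, and the upper bound of 19 is borrowed from the `in[7] < 20` heuristic quoted earlier in the thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical example (not Tika code): a byte-value range written as a regex.
final class ByteRangeRegex {
    // Magic bytes, three zero bytes, then a final count byte in 0x01..0x13.
    private static final Pattern FAT_HEADER = Pattern.compile(
            "\\xCA\\xFE\\xBA\\xBE\\x00\\x00\\x00[\\x01-\\x13]");

    static boolean matches(byte[] head) {
        // ISO-8859-1 maps every byte to the char with the same value,
        // so the regex can address raw byte values directly.
        String latin1 = new String(head, StandardCharsets.ISO_8859_1);
        return FAT_HEADER.matcher(latin1).lookingAt();
    }
}
```

This covers a fixed-position range check, but not the dynamic-offset case, since the regex cannot jump to a position read from the data.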
59bc2d1 to a31ebd7
|
@tballison I clearly ventured into some weeds with package support, the tests seem to pass on this (rewritten) PR, however I'm not sure if I'm testing what I should be testing. |
a31ebd7 to 4cb1812
|
I am out of the weeds :) Now all is good and it works |
4cb1812 to 1133d47
|
Prob won't be able to review until tomorrow. Thank you! |
|
@tballison as long as it does not get lost in the drawer :) |
|
Y, it got lost in the drawer. Sorry! Many, many thanks! |
No description provided.