Skip to content

[TIKA-4309] Support MachO Universal as pkg#1947

Merged
tballison merged 1 commit into
apache:mainfrom
alexey-pelykh:feature/TIKA-4309
Oct 16, 2024
Merged

[TIKA-4309] Support MachO Universal as pkg#1947
tballison merged 1 commit into
apache:mainfrom
alexey-pelykh:feature/TIKA-4309

Conversation

@alexey-pelykh

Copy link
Copy Markdown
Contributor

No description provided.

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

Hm, clearly there's a conflict with

<mime-type type="application/x-java-jnilib">
<_comment>Java Native Library for OSX</_comment>
<magic priority="50">
<match value="0xcafebabe" type="string" offset="0">
<match value="0xfeedface" type="string" offset="4096"/>
<match value="0xfeedfacf" type="string" offset="4096"/>
<match value="0xcefaedfe" type="string" offset="4096"/>
<match value="0xcffaedfe" type="string" offset="4096"/>
</match>
</magic>
<glob pattern="*.jnilib"/>
</mime-type>

and, by extent, with

<mime-type type="application/java-vm">
<_comment>Java Class File</_comment>
<alias type="application/x-java-vm"/>
<alias type="application/x-java"/>
<magic priority="40">
<match value="0xcafebabe" type="string" offset="0" />
</magic>
<glob pattern="*.class"/>
</mime-type>

Also, I'm uncertain how to handle multi-arch executables, except for not returning archs at all.

And I'm at loss how to make the tests pass since the multi-package change. Need help

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

cc @Gagravarr as author of the related change

@tballison tballison marked this pull request as draft September 20, 2024 13:21
@tballison

Copy link
Copy Markdown
Contributor

Apologies if you've already figured this out, but the way the above work is that if cafebabe matches on offset 0 and then there's a match at 4096, then tika identifies the file as x-java-jnilib. At a lower priority, if there's just a match on cafebabe, then the file is identified as java-vm.

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

While I did figure that part out, I didn't figure out how to resolve the conflict with ExecutableParser, so I still need help there @tballison :)

@tballison

Copy link
Copy Markdown
Contributor

The magic for MachO is cefaedfe (32 bit) or cffaedfe (64 bit) as the first four bytes?

It looks like you've coded the magic for fat MachO as cafebabe. Is there something more distinctive for fat MachO? This might be useful: https://github.com/gabriel-vasile/mimetype/pull/53/files#diff-db84ad344d2bf91ea9533dbaa5974e81940b053644665471594fb2eedd34e7daR19 ?

@tballison

Copy link
Copy Markdown
Contributor

Should we treat a fat machO file like a container file and parse its individual components as separate files? I'm not very familiar with this file type, and I'm happy for a "no!"

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

The magic for MachO is cefaedfe (32 bit) or cffaedfe (64 bit) as the first four bytes?

/* Constant for the magic field of the mach_header_64 (64-bit architectures) */
#define MH_MAGIC_64 0xfeedfacf /* the 64-bit mach magic number */
#define MH_CIGAM_64 0xcffaedfe /* NXSwapInt(MH_MAGIC_64) */

and

/* Constant for the magic field of the mach_header (32-bit architectures) */
#define	MH_MAGIC	0xfeedface	/* the mach magic number */
#define MH_CIGAM	0xcefaedfe	/* NXSwapInt(MH_MAGIC) */

Should we treat a fat machO file like a container file and parse its individual components as separate files? I'm not very familiar with this file type, and I'm happy for a "no!"

It's truly a container, and we can do that - a link to an example would be helpful :) In test files, it contains two separate "almost-mach-o" blobs

@tballison

Copy link
Copy Markdown
Contributor

While I did figure that part out, I didn't figure out how to resolve the conflict with ExecutableParser, so I still need help there @tballison :)

Having not thought deeply about this, one option is to leave the mime file as is and add a magic for fat machO that's different from the other cafebabes in the mime type file -- we need to confirm the whatever makes fat machO distinct
...

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

we need to confirm the whatever makes fat machO distinct

I thought about that as well, we can read entire header and each arch header to confirm what we're looking at. Why I paused - even if I read, it makes no sense as we can return only one content type.

@tballison

tballison commented Sep 20, 2024

Copy link
Copy Markdown
Contributor

Y, that makes sense. If we treat it as a container, file though, we could make up/find a mime type for fatmachO (application/x-fat-mach-o I just made that up...is there an existing mime type?) and then the attachments/embedded files would have their own correct mime types.

If at all possible, we should try to use magic to distinguish fat machO from the other cafebabes.

@tballison

Copy link
Copy Markdown
Contributor

If we modify the original definition that we stole from pronom, we'd get both architectures?

 <mime-type type="application/x-mach-o">
    <_comment>Mach-O</_comment>
    <tika:link>https://www.nationalarchives.gov.uk/PRONOM/fmt/693</tika:link>
    <magic priority="50">
      <match value="0xFEEDFACF" offset="0"/> <!-- 64 bit -->
      <match value="0xCFFAEDFE" offset="0"/><!-- 64 bit -->
      <match value="0xFEEDFACE" offset="0"/><!-- 32 bit -->
      <match value="0xCEFAEDFE" offset="0"/><!-- 32 bit -->
    </magic>
  </mime-type>

@tballison

Copy link
Copy Markdown
Contributor

@tballison

Copy link
Copy Markdown
Contributor

It is tricky if you're new to Tika. I can try to help if you can create the skeleton for this file type of:

  • parse the header
  • parse each "attachment"

@tballison

Copy link
Copy Markdown
Contributor

@tballison

Copy link
Copy Markdown
Contributor

I don't see a fat machO in pronom, though. :(

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

Y, that makes sense. If we treat it as a container, file though, we could make up/find a mime type for fatmachO (application/x-fat-mach-o I just made that up...is there an existing mime type?) and then the attachments/embedded files would have their own correct mime types.

I've seen application/x-mach-binary and application/x-mach-o only, however I'd be more than happy to come up with more definite ones, alike the ELF-ones. Mach-O also can be shared lib, executable, etc. So technically we should've been doing application/x-mach-o-executable application/x-mach-o-sharedlib and so on. For Fat I've seen application/x-mach-universal but following the proposed structure it would've been application/x-mach-o-universal

If at all possible, we should try to use magic to distinguish fat machO from the other cafebabes.

The structure of fat Mach-O is quite vague (and this), only deep validation by code can help. So ideally I'd use ExecutableParser as priority and if it fails - try other magic-matching cafebabe's

It is tricky if you're new to Tika.

:) That I've noticed :)

I can try to help if you can create the skeleton for this file type

I'd gladly do so, the container is quite simple. It's 0xCAFEBABE + uint32t of number of headers and every header just contains cpu/arch/type flags

@tballison

Copy link
Copy Markdown
Contributor

Adding more precision to the mimetypes makes sense to me. I'd def want @Gagravarr to weigh in. For some mime types, we use attributes to make the description more precise, e.g. text/plain; charset=UTF-8 but for others, we modify the actual mime type and avoid attributes.

@tballison

Copy link
Copy Markdown
Contributor

So ideally I'd use ExecutableParser as priority and if it fails - try other magic-matching cafebabe's....

This is really unpleasant to do currently in Tika. Can we do something like in the gabriel vasile link above in mimetypes?

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

(source))

// Class matches a java class file.
func Class(in []byte) bool {
	return classOrMachO(in) && in[7] > 30
}

// MachO matches Mach-O binaries format
func MachO(in []byte) bool {
	return classOrMachO(in) && in[7] < 20
}

this approach relies on testing the 8th byte, or in other words for far Mach-O

struct fat_header {
	uint32_t	magic;		/* FAT_MAGIC */
	uint32_t	nfat_arch;	/* number of structs that follow */
};

the nfat_arch and I can't find thus far any reasonable explanation why counter is limited by 20. For Java Class, that's a lsb byte of uint16_t major version - also makes no sense why 30 is the threshold. Seem like weird empiric values to me.

To test for Mach-O universal, we could look for 0xCAFEBABE or 0xCAFEBABF, get offset of the first Mach-O from the first struct, and verify that it's a Mach-O. Does Tika's XML allow "read uint and read second it at first ints location"?

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author
# Since Java bytecode and Mach-O universal binaries have the same magic number,
# the test must be performed in the same "magic" sequence to get both right.
# The long at offset 4 in a Mach-O universal binary tells the number of
# architectures; the short at offset 4 in a Java bytecode file is the JVM minor
# version and the short at offset 6 is the JVM major version.  Since there are only
# only 18 labeled Mach-O architectures at current, and the first released
# Java class format was version 43.0, we can safely choose any number
# between 18 and 39 to test the number of architectures against
# (and use as a hack). Let's not use 18, because the Mach-O people
# might add another one or two as time goes by...

this comment is priceless :)

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

So, I guess we have two routes:

  1. Update XML parser to recognize Mach-O universal instead of JNI stuff.
  2. Go with container route

And in any case improve the non-fat Mach-O parsing by extending collection of MIME types.

@tballison

Copy link
Copy Markdown
Contributor

OMG, that is priceless! Thank you for finding that!

  1. update the mime-types.xml (not, literally, our XMLParser), right?

I think we're basically in agreement? This is what I see:

  1. update the mime-types.xml for non-fat mach-o -- include 32-bit -- figure out if we mark that as a new mime type, a mime-type with different attributes or a separate metadata item
  2. update mime-types.xml for fat mach-o vs the other cafebabes -- create a new mime type for fat mach-o
  3. Parse the fat mach-o as a package file and treat the embedded mach-os like we do other attachments in Tika.

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

update the mime-types.xml (not, literally, our XMLParser), right?

Yes, I've expressed myself poorly - you're correct.

I think we're basically in agreement?

Totally, we're on the same page. And I'd propose to do a PR per step. I'll do the step 1 with some extra stuff (I want to try parsing Mach-O type using XML) and submit a separate PR for that. How does that sound?

@alexey-pelykh

alexey-pelykh commented Sep 21, 2024

Copy link
Copy Markdown
Contributor Author

update mime-types.xml for fat mach-o vs the other cafebabes -- create a new mime type for fat mach-o

Does the XML instruction set allows for a dynamic offset? Like, read value and shift to that position?
Maybe it supports value ranges or value sets? Or "match greater" or something like that?

@tballison

tballison commented Sep 21, 2024

Copy link
Copy Markdown
Contributor

Dynamic offsets aren't currently implemented.

Value ranges can be implemented as a regex (less than ideal, but works for some cases). Literal value ranges or greater than, less than etc would be a new feature. Maintainers are standing by...lol...

Location ranges: yes definitely, as you probably noticed. See e.g. https://github.com/apache/tika/blob/main/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L539

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

@tballison I clearly ventured into some weeds with package support, the tests seem to pass on this (rewritten) PR, however I'm not sure if I'm testing what I should be testing.

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

I am out of the weeds :) Now all is good and it works

@alexey-pelykh alexey-pelykh marked this pull request as ready for review September 24, 2024 06:38
@alexey-pelykh alexey-pelykh changed the title [TIKA-4309] Support MachO [TIKA-4309] Support MachO Universal as container Sep 24, 2024
@alexey-pelykh alexey-pelykh changed the title [TIKA-4309] Support MachO Universal as container [TIKA-4309] Support MachO Universal as pkg Sep 24, 2024
@tballison

Copy link
Copy Markdown
Contributor

Prob won't be able to review until tomorrow. Thank you!

@alexey-pelykh

Copy link
Copy Markdown
Contributor Author

@tballison as long as it does not get lost in the drawer :)

@tballison tballison merged commit e459597 into apache:main Oct 16, 2024
@tballison

Copy link
Copy Markdown
Contributor

Y, it got lost in the drawer. Sorry! Many, many thanks!

@alexey-pelykh alexey-pelykh deleted the feature/TIKA-4309 branch October 16, 2024 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants