How to handle two sets of bytes for matching improvements? #46

Closed
NebularNerd opened this issue Jan 31, 2024 · 4 comments

Comments

@NebularNerd
Contributor

Hi there,

I'm looking for a Python package to help identify weird and wonderful files inside various scripts. I had seen fleep, but that appears to be dead. Puremagic looks to offer the same functionality for what I need.

One job is for handling Amiga .iff files in an image conversion script. Having a quick look, it's nice to see .iff getting some love:

["464f524d", 0, "", "application/x-iff", "IFF file"],

But in Amiga land the IFF FORM header is used for many things (see Wikipedia: List_of_file_signatures).

Is there a way to improve mapping and confidence by adding additional matching strings such as ILBM, ACBM, etc.? I'm happy to help with a PR if it can be done.
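
For context, here is a minimal sketch of where the sub-type lives in an IFF container: the first 4 bytes are the `FORM` chunk ID, the next 4 are a big-endian length, and bytes 8 to 11 hold the form type (for example `ILBM`). The helper name `iff_form_type` is hypothetical and is not part of puremagic.

```python
import struct


def iff_form_type(data: bytes):
    """Return the 4-byte form type of an IFF FORM container, or None.

    Hypothetical helper for illustration only; not a puremagic function.
    """
    if len(data) < 12:
        return None
    # Layout: 4-byte chunk ID, 4-byte big-endian size, 4-byte form type.
    chunk_id, _size, form_type = struct.unpack(">4sI4s", data[:12])
    if chunk_id != b"FORM":
        return None
    return form_type


# Synthetic 12-byte header, not a real image file.
sample = b"FORM" + struct.pack(">I", 4) + b"ILBM"
print(iff_form_type(sample))  # b'ILBM'
```

This is why a second match at offset 8 would be enough to tell the FORM variants apart.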

@cdgriffith
Owner

What we could do is, instead of matching FORM at offset 0, match at the offset where the more specific identifier lives.

We don't currently have a way to do wildcards, so we can't be as accurate as we would be matching both FORM and ACBM.
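
A rough sketch of the single-offset idea, assuming a signature is stored as a hex string plus a byte offset as in the existing entries. The function `matches_at` is a hypothetical name used only for illustration; it is not how puremagic is implemented.

```python
def matches_at(data: bytes, signature_hex: str, offset: int) -> bool:
    """Check whether the hex signature appears in data at the given offset."""
    signature = bytes.fromhex(signature_hex)
    return data[offset:offset + len(signature)] == signature


# Synthetic header: FORM, a 4-byte length, then the ACBM form type.
header = b"FORM\x00\x00\x00\x04ACBM"
print(matches_at(header, "464f524d", 0))  # True: generic FORM match
print(matches_at(header, "4143424d", 8))  # True: ACBM at offset 8
```

The trade-off is that matching only at offset 8 ignores the FORM marker, so any file that happens to have those four bytes at that position would also match.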

Thanks for the info, I can work on that when I have time. If you know a source of sample files for that please share!

@NebularNerd
Contributor Author

Instead of, or in addition to, wildcards, another option could be a dual match. Taking our .iff example, we could do something like:

[["464f524d","494c424d"], [0,8], "", "application/x-iff", "IFF file"],

If your code sees a list instead of a string, it processes each hex signature using the offset at the same position in the second list. If both match, we get pretty much 100% confidence that the file is what we think it is. The logic is a little weirder than wildcarding, but it's another possible way.
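
The logic could be sketched roughly as follows, assuming the entry layout shown above (signatures first, offsets second). The function `entry_matches` is a hypothetical name, and this illustrates only the proposal, not puremagic's actual code.

```python
def entry_matches(data: bytes, entry) -> bool:
    """Match a signature entry whose first two fields are either
    (hex string, offset) or (list of hex strings, list of offsets)."""
    hex_sigs, offsets = entry[0], entry[1]
    if isinstance(hex_sigs, str):
        # Existing single-signature form: wrap it so one loop handles both.
        hex_sigs, offsets = [hex_sigs], [offsets]
    if len(hex_sigs) != len(offsets):
        raise ValueError("each signature needs exactly one offset")
    for hex_sig, offset in zip(hex_sigs, offsets):
        signature = bytes.fromhex(hex_sig)
        if data[offset:offset + len(signature)] != signature:
            return False
    return True


single = ["464f524d", 0, "", "application/x-iff", "IFF file"]
dual = [["464f524d", "494c424d"], [0, 8], "", "application/x-iff", "IFF file"]

# Synthetic headers, not real files.
ilbm = b"FORM\x00\x00\x00\x04ILBM"
other = b"FORM\x00\x00\x00\x048SVX"

print(entry_matches(ilbm, dual))     # True: FORM at 0 and ILBM at 8
print(entry_matches(other, dual))    # False: FORM matches, ILBM does not
print(entry_matches(other, single))  # True: FORM alone still matches
```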

Aminet is pretty much the internet's oldest resource for all things Amiga; we should be able to find pretty much everything there.

7zip will happily unpack most of the .lha and other formats you'll find there. If you get stuck on any let me know and I'm sure I can unearth samples from somewhere.

@cdgriffith
Owner

Thanks for the samples! Added a multi-part detect.

It should be working in 1.20: https://github.com/cdgriffith/puremagic/releases/tag/1.20

@NebularNerd
Contributor Author

Nice! I've just looked at the implementation and that's a great way to handle it, much tidier than mine. I'll test it out later on a script I have for converting images between formats.

For retro uses this will be handy, as a lot of older formats, such as file packers, use a two-part fingerprint.
