Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For Python 3.13: A drop-in replacement for imghdr.what() #76

Merged
merged 2 commits into from
May 22, 2024

Conversation

cclauss
Copy link
Contributor

@cclauss cclauss commented May 18, 2024

Closes #72

This might be more complete after #75 lands.

Current pytest results: 53 passed in 0.05s

  • test_what_from_file is 6 tests
  • test_what_from_file_none is 1 test
  • test_what_from_string is 38 tests
  • test_what_from_string_py311 is 2 tests
    • Fail on imghdr only on Python < 3.11
  • test_what_from_string_todo is 6 tests
    • Fail on puremagic

@NebularNerd Your reviews, please.

@cclauss cclauss force-pushed the puremagic.what branch 5 times, most recently from 8aae6d3 to 6fe0635 Compare May 18, 2024 20:33
@NebularNerd
Copy link
Contributor

NebularNerd commented May 19, 2024

I've just had a look at the code and really need to spend some more time on it. These may seem obvious questions but I'm lacking caffeine and sleep (long days this week) so they are just to help me understand a little better:

  1. I assume you are converting bytes to hex when pushing them over to PureMagic? Looking at the test strings, 424d will work but BM would not if not first converted to hex.
  2. Which strings/files/tests are failing?
  3. Are you trying to get PureMagic to only say 'it's this' if it exactly matches your test strings? This would definitely cause hiccups.

I can see some improvements for your test strings for PBM, PGM and PPM so I'll add those to my PR later, but that should not affect the basic matches PureMagic already has. UPDATE These improvements are now in my PR. We can detect a SPACE, TAB or Win/*nix newline, additionally if they are immediately followed by a # that will improve confidence as well.

@cclauss cclauss force-pushed the puremagic.what branch 2 times, most recently from f27b033 to 0c49122 Compare May 19, 2024 21:03
@cclauss cclauss changed the title For Python 3.13: A drop-in replacement for imghdr.what() For Python 3.13: A drop-in replacement for imghdr.what() May 20, 2024
@cclauss cclauss force-pushed the puremagic.what branch 2 times, most recently from 0f6bf82 to 5b231b7 Compare May 21, 2024 05:50
@NebularNerd
Copy link
Contributor

Hi @cclauss, you might need to remove the : from Closes: to make GitHub link this PR to your issue.

Watching the commits with interest 🙂

@cclauss
Copy link
Contributor Author

cclauss commented May 21, 2024

That Closes, Fixes, Resolves thing only works on the default branch.

@NebularNerd
Copy link
Contributor

Ah! Did not notice you had the PR against the develop branch, my bad. 🙂

@cclauss
Copy link
Contributor Author

cclauss commented May 21, 2024

There are certain cases where imghdr.what(file, h) and puremagic.what(file, h) disagree. My hunch is that puremagic is right and these incompatibilities are part of why the Python core team wants to remove imghdr.

Do you have suggestions for how we can always deliver identical results as imghdr.what(file, h)?

The current definition in this PR is:

def what(file: os.PathLike | str | None], h: str | bytes | None) -> str:

but I think we should make two changes to deliver optional bug-for-bug compatibility:

  1. h: str | bytes | None --> h: bytes | None because imghdr.what() does not accept h = "str"
  2. Add a strict_compat: bool = True option to deliver bug-for-bug results by default.
def what(file: os.PathLike | str | None], h: bytes | None, strict_compat: bool = True) -> str:
    if strict_compat and answer := imghdr_compatibility(h):
        return answer

Thoughts on naming? strict_compat, strict_imghdr, imghdr_compat, bug_for_bug, others?

The imghdr_compatibility() (name is also up for debate) function would deal with cases like this PR's test_what_from_string_todo() function.

if not strict_compat: then we would return puremagic's results unmodified.

@NebularNerd
Copy link
Contributor

What files/cases are causing issues? Without #75 they may not work or behave as expected at present. Feel free to fling a .zip over or link to an image and I'll take a look.

@cclauss
Copy link
Contributor Author

cclauss commented May 21, 2024

What files/cases are causing issues?

I was focused on h, not file so the cases are in the parameterization of the test_what_from_string_todo() function.

        ("jpeg", b"______JFIF"),
        ("jpeg", b"______Exif"),
        ("rgb", b"\001\332"),
        ("tiff", b"II"),
        ("tiff", b"MM"),
        ("xbm", b"#define "),

It would be wonderful to add more image files in test/resources/images/ as discussed on test/test_main.py line 17.
exr, pbm, pgm, ppm, ras, rgb, xbm

@NebularNerd
Copy link
Contributor

NebularNerd commented May 21, 2024

Quick breakdown based on #75, I'm assuming you are converting to hex:

  • ("jpeg", b"______JFIF") = Not tested for (yet), Not present at fixed location in every file, or in some cases not present at all. As mentioned in PR we need a REGEX to improve confidence IF JFIF/EXIF is present. This would be among the first improvements the 2.0 goals should address. Existing JPEG matching works well enough at present
  • ("jpeg", b"______Exif") = As above
  • ("rgb", b"\001\332") = Per my PR, the existing match is too specific (literally one specific file variant), when implemented new .json will match this short header for starters, then build confidence by adding variant info, see notes in PR for more details.
  • ("tiff", b"II") = Poor match for real files would give very low confidence, Existing match follows real published file specs of b'II*\x00' for Intel / b'MM\x00*' for Motorola, more detailed notes in my PR about this.
  • ("tiff", b"MM") = As above
  • ("xbm", b"#define ") = New format needs my PR merged to be matched.

The JFIF/EXIF thing is a would be nice to have rather than critical to matching, a JPG will always have the base marker of 0xffd8ff / b'\xff\xd8\xff' so any real file thrown at PureMagic will match. Adding these would boost confidence if they are present in the file.

puremagic/main.py Outdated Show resolved Hide resolved
test/test_main.py Outdated Show resolved Hide resolved
test/test_main.py Outdated Show resolved Hide resolved
test/test_main.py Outdated Show resolved Hide resolved
test/test_main.py Outdated Show resolved Hide resolved
test/test_main.py Outdated Show resolved Hide resolved
@cclauss cclauss requested a review from cdgriffith May 21, 2024 22:38
@cdgriffith cdgriffith merged commit 5608e1b into cdgriffith:develop May 22, 2024
8 checks passed
@cclauss cclauss deleted the puremagic.what branch May 22, 2024 04:19
@cdgriffith cdgriffith mentioned this pull request Jun 16, 2024
cdgriffith added a commit that referenced this pull request Jun 16, 2024
- Adding #72 #75 #76 #81 `.what()` to be a drop in replacement for `imghdr.what()` (thanks to Christian Clauss and Andy - NebularNerd)
- Adding #67 Test on Python 3.13 beta (thanks to Christian Clauss)
- Adding #77 from __future__ import annotations (thanks to Christian Clauss
- Fixing #66 Confidence sorting (thanks to Andy - NebularNerd)

---------

Co-authored-by: Andy <NebularNerd@users.noreply.github.com>
Co-authored-by: Christian Clauss <cclauss@me.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants