Faster than a libmagic wrapper? #95

Closed
mara004 opened this issue Jul 17, 2024 · 6 comments

Comments

@mara004

mara004 commented Jul 17, 2024

The Readme claims

Advantages over using a wrapper for 'file' or 'libmagic':

  • Faster

Do you have any actual evidence for that (a reproducible benchmark or similar)?
Typically, pure-Python re-implementations are slower than C library bindings, unless the pure-Python package uses significantly more efficient algorithms or the binding incurs a lot of object-transfer or FFI overhead.

@cdgriffith
Owner

Here's a quick test:

python-magic (libmagic wrapper)

import magic
print(magic.from_buffer("#!/usr/bin/env python"))
$ time python speed_test_pm.py
a /usr/bin/env python script, ASCII text executable, with no line terminators

real    0m0.108s
user    0m0.018s
sys     0m0.008s

puremagic

import puremagic
print(puremagic.from_string("#!/usr/bin/env python"))
$ time python speed_test_pure.py
.py

real    0m0.068s
user    0m0.015s
sys     0m0.000s

@mara004
Author

mara004 commented Aug 7, 2024

For one thing, a single invocation isn't exactly reliable. For another, the above always includes import-time tasks, where libmagic is at a disadvantage because it has to locate and load the DLL.

A more reliable benchmark would be needed to actually support the "Faster" claim.
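
A minimal sketch of what such a benchmark could look like, assuming the detection call can be passed around as a plain Python callable. The names `bench_calls` and `fake_detect` are hypothetical and only for illustration; in a real run `fake_detect` would be replaced by `puremagic.from_string` or `magic.from_buffer`, imported before timing starts so that import cost stays outside the measured region.

```python
import statistics
import timeit


def bench_calls(detect, samples, repeat=5, number=100):
    """Time `detect` over all samples, excluding import cost.

    Returns the best and median per-call time in seconds across `repeat` runs.
    """
    def run_once():
        for sample in samples:
            detect(sample)

    # timeit.repeat returns one total per run; each run calls run_once `number` times.
    totals = timeit.repeat(run_once, repeat=repeat, number=number)
    calls_per_run = number * len(samples)
    per_call = [total / calls_per_run for total in totals]
    return min(per_call), statistics.median(per_call)


def fake_detect(data):
    # Hypothetical stand-in for the library call under test.
    return ".py" if data.startswith("#!") else "unknown"


if __name__ == "__main__":
    best, median = bench_calls(fake_detect, ["#!/usr/bin/env python", "plain text"])
    print(f"best: {best:.3e} s/call, median: {median:.3e} s/call")
```

Reporting the minimum alongside the median is a common way to reduce the influence of background noise on a single machine. It does not by itself answer whether either library caches repeated inputs.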

@cdgriffith
Owner

The whole point is it's faster because it doesn't need to load in an external library? That's the point of the claim.

@mara004
Author

mara004 commented Aug 7, 2024

> The whole point is it's faster because it doesn't need to load in an external library? That's the point of the claim.

Well, that should be clarified in the Readme (e.g. "Faster to import" rather than just "Faster").
I took it to mean the from_*(...) calls would be claimed faster. 😅
If only importing is supposed to be faster, that will be true, but the primary concern is runtime, not startup time.
The 0.04s import-time difference may not be relevant to most users.
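
To look at the import-time part on its own, one option is to time the import in a fresh interpreter on every run, so that nothing is already cached in `sys.modules`. This is only a sketch: `time_import` is a hypothetical helper, and `json` is used as a placeholder module name so the snippet runs anywhere; `puremagic` or `magic` would be substituted when both are installed.

```python
import subprocess
import sys


def time_import(module_name, runs=5):
    """Return the fastest observed import time (seconds) for `module_name`.

    Each measurement runs in a fresh interpreter so earlier imports are not
    already cached in sys.modules.
    """
    snippet = (
        "import time; start = time.perf_counter(); "
        f"import {module_name}; "
        "print(time.perf_counter() - start)"
    )
    timings = []
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True, text=True, check=True,
        )
        timings.append(float(result.stdout.strip()))
    return min(timings)


if __name__ == "__main__":
    # "json" is only a placeholder module for illustration.
    print(f"json: {time_import('json'):.4f} s")
```

The first run may be slower than later ones because the operating system has not yet cached the files on disk, which is one reason to take several measurements.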

@cdgriffith
Owner

Yes, I can add that note in the Readme!

cdgriffith added a commit that referenced this issue Aug 8, 2024
- Adding #95 README clarification for Faster Import (thanks to mara004)
@cdgriffith cdgriffith mentioned this issue Aug 8, 2024
@cdgriffith
Owner

I did decide to go and test this further because it was bothering me: I knew this was faster in the past (~10 years ago).

Testing on develop branch for 1.27 using just my computer's downloads folder.

puremagic Test File
import time
from pathlib import Path
import tracemalloc

download_files = list(x for x in Path("Downloads").glob("*") if x.is_file())


tracemalloc.start()

import_time_start = time.perf_counter()
import puremagic
print("Import time:", time.perf_counter() - import_time_start)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nTesting {len(download_files)} files")
download_start_time = time.perf_counter()
unknown_results_types = set()
unknown_total = 0
for file in download_files:
    try:
        puremagic.from_file(file)
    except puremagic.PureError:
        unknown_results_types.add(file.suffix.lower() if file.suffix else file.stem)
        unknown_total += 1
    except Exception as e:
        print(f"Error: {file} - {e}")
print("\nDownload file time:", time.perf_counter() - download_start_time)
print(f"Unknown results types: {unknown_results_types}")
print(f"Unknown total: {unknown_total}")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nPeak memory usage: {peak / 10**6}MB")
tracemalloc.stop()

python-magic Test File
import time
from pathlib import Path
import tracemalloc

download_files = list(x for x in Path("Downloads").glob("*") if x.is_file())

tracemalloc.start()

import_time_start = time.perf_counter()
import magic
print("Import time:", time.perf_counter() - import_time_start)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nTesting {len(download_files)} files")
download_start_time = time.perf_counter()
unknown_results_types = set()
unknown_total = 0
for file in download_files:
    try:
        result = magic.from_file(file)
    except Exception as e:
        print(f"Error: {file} - {e}")
    else:
        if result in ("ASCII text", "data"):
            unknown_results_types.add(file.suffix.lower() if file.suffix else file.stem)
            unknown_total += 1
print("Download file time:", time.perf_counter() - download_start_time)
print(f"Unknown results types: {unknown_results_types}")
print(f"Unknown total: {unknown_total}")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nPeak memory usage: {peak / 10**6}MB")
tracemalloc.stop()

puremagic results

$ time python speed_test_pure.py

Import time: 0.030981435003923252
Current memory usage: 1.009179MB

Testing 1131 files

Download file time: 2.9169464290025644
Unknown results types: {'.ovpn', '.docx', '.img', 'README'}
Unknown total: 4
Current memory usage: 1.022658MB

Peak memory usage: 1.355427MB

real    0m3.892s
user    0m1.001s
sys     0m0.134s

python-magic results

$ time python speed_test_pm.py

Import time: 0.061987199005670846
Current memory usage: 1.83419MB

Testing 1131 files
Download file time: 4.262647944997298
Unknown results types: {'.docx', '.vba', '.y4m', '.txt', '.stl', '.pem', '.json', '.bvr', 'README', '.ovpn', '.log', '.p7b'}
Unknown total: 30
Current memory usage: 1.849338MB

Peak memory usage: 1.88779MB

real    0m5.383s
user    0m0.301s
sys     0m0.290s

In this instance puremagic was:

  • Faster
  • Less Memory Usage
  • More Accurate Matches - When discounting "ASCII text" and "data" as real results from libmagic (surprising me, honestly)

I also made sure that the overhead for checking unknown types was the same in both cases, and that removing it produced the same speed differences.

The only time I saw the python-magic wrapper come out faster was when doing 1000+ iterations over a small test string. I don't have 1000+ different strings to test with, so I don't know whether that's because it really is faster or because it just cached the results. That has me thinking I should maybe add an LRU cache with a configurable size.
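
As a rough sketch of the LRU cache idea, assuming the detection function takes a hashable input such as `bytes` or `str`: `make_cached_detector` and `fake_detect` are hypothetical names for illustration and not part of puremagic.

```python
import functools


def make_cached_detector(detect, maxsize=128):
    """Wrap `detect` in an LRU cache keyed on the (hashable) input."""
    return functools.lru_cache(maxsize=maxsize)(detect)


calls = []


def fake_detect(data):
    # Hypothetical stand-in for a detection call; records each real invocation.
    calls.append(data)
    return ".py" if data.startswith(b"#!") else "unknown"


cached_detect = make_cached_detector(fake_detect, maxsize=2)
cached_detect(b"#!/usr/bin/env python")
cached_detect(b"#!/usr/bin/env python")  # second call is served from the cache
print(len(calls), cached_detect.cache_info())
```

A cache like this only helps when identical inputs repeat, and the inputs must be hashable (so `bytes` works but `bytearray` does not). Caching by file path instead of content would risk stale results if a file changes on disk.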

Overall, this gave me lots to think about, and I'm happy with my findings. Thanks for the inspiration @mara004

Going to keep it as just "Faster" in the Readme, now with proof ™️
