Faster than a libmagic wrapper? #95

mara004 · 2024-07-17T21:59:18Z

The Readme claims

Advantages over using a wrapper for 'file' or 'libmagic':

Faster

Do you have any actual evidence for that (a reproducible benchmark or similar)?
Typically, pure-Python re-implementations are slower than C library bindings, unless the pure-Python package uses significantly more efficient algorithms or the binding involves a lot of object transfer or FFI overhead.

cdgriffith · 2024-08-07T22:45:52Z

Here's a quick test:

python-magic (libmagic wrapper)

import magic
print(magic.from_buffer("#!/usr/bin/env python"))

$ time python speed_test_pm.py
a /usr/bin/env python script, ASCII text executable, with no line terminators

real    0m0.108s
user    0m0.018s
sys     0m0.008s

puremagic

import puremagic
print(puremagic.from_string("#!/usr/bin/env python"))

$ time python speed_test_pure.py
.py

real    0m0.068s
user    0m0.015s
sys     0m0.000s

mara004 · 2024-08-07T23:16:10Z

For one thing, a single invocation isn't exactly reliable. For another, the above always includes import-time tasks, where libmagic is at a disadvantage because it has to locate and load the DLL.

A more reliable benchmark would be needed to actually support the "Faster" claim.
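One way to address both points is to import the libraries before any timing starts and to take the median of several repeated runs. The following is only a minimal sketch of that idea using the standard-library `timeit` module; the `bench` helper and its `repeat`/`number` defaults are assumptions made for illustration, and neither library is required for the script to run.

```python
import timeit
from statistics import median


def bench(func, arg, repeat=5, number=1000):
    """Return the median per-call time in seconds for func(arg)."""
    # timeit.repeat returns one total per run; each run makes `number` calls.
    totals = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return median(totals) / number


if __name__ == "__main__":
    sample = b"#!/usr/bin/env python"
    candidates = {}
    # Imports happen before any timing starts, so library loading is excluded.
    try:
        import puremagic
        candidates["puremagic.from_string"] = puremagic.from_string
    except ImportError:
        pass
    try:
        import magic
        candidates["magic.from_buffer"] = magic.from_buffer
    except ImportError:
        pass
    for name, func in candidates.items():
        print(f"{name}: {bench(func, sample) * 1e6:.1f} us per call")
```

Timing one short string over and over can still flatter whichever side caches or short-circuits, so a varied set of inputs would make the comparison more convincing.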

cdgriffith · 2024-08-07T23:26:29Z

The whole point is it's faster because it doesn't need to load in an external library? That's the point of the claim.

mara004 · 2024-08-07T23:43:30Z

> The whole point is it's faster because it doesn't need to load in an external library? That's the point of the claim.

Well, that should be clarified in the Readme (e.g. "Faster to import" rather than just "Faster").
I took it to mean that the from_*(...) calls were claimed to be faster. 😅
If only importing is supposed to be faster, that will be true, but the primary concern is runtime, not startup time.
The 0.04s import-time difference may not be relevant to most users.
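To look at startup cost on its own, the import can be timed in a fresh interpreter each time, so that nothing is already sitting in `sys.modules`. This is a rough sketch rather than a rigorous measurement; the `import_time` helper is hypothetical, and `json` is only a stand-in module that is always available.

```python
import subprocess
import sys
from statistics import median


def import_time(module, runs=5):
    """Median wall-clock seconds to import `module` in a fresh interpreter."""
    code = (
        "import time; t = time.perf_counter(); "
        f"import {module}; print(time.perf_counter() - t)"
    )
    samples = []
    for _ in range(runs):
        out = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, check=True,
        )
        # Use the last line in case the module prints something while importing.
        samples.append(float(out.stdout.strip().splitlines()[-1]))
    return median(samples)


if __name__ == "__main__":
    # Swap "json" for "puremagic" or "magic" where those packages are installed.
    print(f"json: {import_time('json') * 1000:.2f} ms")
```

Keeping this number separate from the per-call timings makes it clear which of the two a "Faster" claim refers to.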

cdgriffith · 2024-08-07T23:50:37Z

Yes, I can add that note in the Readme!

- Adding #95 README clarification for Faster Import (thanks to mara004)

cdgriffith · 2024-08-08T17:50:45Z

I decided to test this further because it was bothering me: I knew puremagic was faster in the past (~10 years ago).

Testing on the develop branch for 1.27, using just my computer's Downloads folder.

puremagic Test File

import time
from pathlib import Path
import tracemalloc

download_files = [x for x in Path("Downloads").glob("*") if x.is_file()]


tracemalloc.start()

import_time_start = time.perf_counter()
import puremagic
print("Import time:", time.perf_counter() - import_time_start)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nTesting {len(download_files)} files")
download_start_time = time.perf_counter()
unknown_results_types = set()
unknown_total = 0
for file in download_files:
    try:
        puremagic.from_file(file)
    except puremagic.PureError:
        unknown_results_types.add(file.suffix.lower() if file.suffix else file.stem)
        unknown_total += 1
    except Exception as e:
        print(f"Error: {file} - {e}")
print("\nDownload file time:", time.perf_counter() - download_start_time)
print(f"Unknown results types: {unknown_results_types}")
print(f"Unknown total: {unknown_total}")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nPeak memory usage: {peak / 10**6}MB")
tracemalloc.stop()

python-magic Test File

import time
from pathlib import Path
import tracemalloc

download_files = [x for x in Path("Downloads").glob("*") if x.is_file()]

tracemalloc.start()

import_time_start = time.perf_counter()
import magic
print("Import time:", time.perf_counter() - import_time_start)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nTesting {len(download_files)} files")
download_start_time = time.perf_counter()
unknown_results_types = set()
unknown_total = 0
for file in download_files:
    try:
        result = magic.from_file(file)
    except Exception as e:
        print(f"Error: {file} - {e}")
    else:
        if result in ("ASCII text", "data"):
            unknown_results_types.add(file.suffix.lower() if file.suffix else file.stem)
            unknown_total += 1
print("Download file time:", time.perf_counter() - download_start_time)
print(f"Unknown results types: {unknown_results_types}")
print(f"Unknown total: {unknown_total}")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nPeak memory usage: {peak / 10**6}MB")
tracemalloc.stop()

puremagic results

$ time python speed_test_pure.py

Import time: 0.030981435003923252
Current memory usage: 1.009179MB

Testing 1131 files

Download file time: 2.9169464290025644
Unknown results types: {'.ovpn', '.docx', '.img', 'README'}
Unknown total: 4
Current memory usage: 1.022658MB

Peak memory usage: 1.355427MB

real    0m3.892s
user    0m1.001s
sys     0m0.134s

python-magic results

$ time python speed_test_pm.py

Import time: 0.061987199005670846
Current memory usage: 1.83419MB

Testing 1131 files
Download file time: 4.262647944997298
Unknown results types: {'.docx', '.vba', '.y4m', '.txt', '.stl', '.pem', '.json', '.bvr', 'README', '.ovpn', '.log', '.p7b'}
Unknown total: 30
Current memory usage: 1.849338MB

Peak memory usage: 1.88779MB

real    0m5.383s
user    0m0.301s
sys     0m0.290s

In this instance puremagic was:

- Faster
- Using less memory
- More accurate in its matches, when discounting "ASCII text" and "data" as real results from libmagic (which surprised me, honestly)
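For a sense of scale, the reported figures can be turned into per-file times and ratios. The numbers below are copied from the two runs in this thread; the helper names are made up for illustration, and the result describes one machine and one folder only.

```python
# Figures copied verbatim from the two runs reported in this thread.
FILES = 1131
PUREMAGIC = {"import_s": 0.030981435003923252, "loop_s": 2.9169464290025644, "peak_mb": 1.355427}
PYTHON_MAGIC = {"import_s": 0.061987199005670846, "loop_s": 4.262647944997298, "peak_mb": 1.88779}


def per_file_ms(loop_seconds, files=FILES):
    """Average identification time per file, in milliseconds."""
    return loop_seconds / files * 1000


def ratio(key):
    """How many times larger the python-magic figure is than the puremagic one."""
    return PYTHON_MAGIC[key] / PUREMAGIC[key]


for label, run in (("puremagic", PUREMAGIC), ("python-magic", PYTHON_MAGIC)):
    print(f"{label}: {per_file_ms(run['loop_s']):.2f} ms per file")
for key in ("import_s", "loop_s", "peak_mb"):
    print(f"{key}: python-magic / puremagic = {ratio(key):.2f}")
```

On these figures the file loop works out to roughly 2.58 ms per file for puremagic against 3.77 ms for python-magic, a difference of about 1.46x.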

I also made sure that the overhead for checking unknown types was the same in both cases, and that removing it produced the same speed differences.

The only time I saw the python-magic wrapper come out faster was when doing 1000+ iterations over a small test string. I don't have 1000+ different strings to test with, so I don't know whether that's because it is genuinely faster or because it just cached the results. That makes me think I should maybe add an LRU cache with a configurable size.
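As a rough illustration of the LRU cache idea, the standard-library `functools.lru_cache` can wrap any function that takes hashable input such as `bytes`. Everything here is hypothetical: `make_cached` and `toy_identify` are stand-ins written for this sketch and are not part of puremagic or python-magic.

```python
from functools import lru_cache


def make_cached(identify, maxsize=128):
    """Wrap a bytes -> result function in an LRU cache of configurable size."""
    # Hypothetical helper, not part of puremagic. bytes objects are hashable,
    # so they can be used directly as cache keys.
    return lru_cache(maxsize=maxsize)(identify)


def toy_identify(data: bytes) -> str:
    """Stand-in for a real identifier; checks two well-known signatures."""
    if data.startswith(b"%PDF-"):
        return ".pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return ".png"
    return "unknown"


cached_identify = make_cached(toy_identify, maxsize=2)
for _ in range(1000):
    cached_identify(b"%PDF-1.7")

info = cached_identify.cache_info()
print(info.hits, info.misses)  # 999 1
```

The hit and miss counters also show why repeating one string 1000+ times says little about raw speed: only the first call does any real work. Note that `lru_cache` does not store raised exceptions, so an identifier that raises on unknown input would be re-run for that input every time.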

Overall, this gave me a lot to think about, and I'm happy with my findings. Thanks for the inspiration, @mara004!

Going to keep it as just "Faster" in the Readme, now with proof ™️

cdgriffith closed this as completed Aug 7, 2024

cdgriffith added a commit that referenced this issue Aug 8, 2024

- Adding new verbose output to command line with -v or --verbose

0165f93

- Adding #95 README clarification for Faster Import (thanks to mara004)

cdgriffith mentioned this issue Aug 8, 2024

Version 1.27 #98

Merged
