Faster than a libmagic wrapper? #95
Here's a quick test:

python-magic (libmagic wrapper):

```python
import magic

print(magic.from_buffer("#!/usr/bin/env python"))
```

puremagic:

```python
import puremagic

print(puremagic.from_string("#!/usr/bin/env python"))
```
For one thing, a single invocation isn't exactly reliable. For another, the above always includes import-time tasks, where libmagic is at a disadvantage because it has to locate and load the DLL. A more reliable benchmark would be needed to actually support the "Faster" claim.
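A sketch of what a more reliable micro-benchmark could look like, averaging many calls with `timeit` so that import-time cost and single-run noise are excluded. The `bench` helper and the stdlib stand-in are illustrative, not part of either library:

```python
import timeit
import mimetypes

def bench(fn, arg, number=1000):
    """Average wall-clock seconds per call, measured after import."""
    elapsed = timeit.timeit(lambda: fn(arg), number=number)
    return elapsed / number

# Stdlib stand-in; swap in puremagic.from_string or magic.from_buffer
# (assuming those packages are installed) to compare the two libraries.
per_call = bench(mimetypes.guess_type, "script.py")
print(f"{per_call * 1e6:.2f} µs/call")
```

Averaging over `number` calls on an already-imported module separates per-call speed from import overhead, which is exactly the distinction being argued about here.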
The whole point is that it's faster because it doesn't need to load an external library; that's the basis of the claim.
Well, that should be clarified in the Readme (e.g. "Faster to import" rather than just "Faster").
Yes, I can add that note in the Readme!
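One way to make the "faster to import" part concrete is to time the import in a fresh interpreter, so module caching in the current process doesn't skew the number. A sketch, with a hypothetical `import_time` helper; substitute whichever module you want to measure:

```python
import subprocess
import sys

def import_time(module: str) -> float:
    """Time `import <module>` in a fresh interpreter, in seconds."""
    code = (
        "import time; t0 = time.perf_counter(); "
        f"import {module}; "
        "print(time.perf_counter() - t0)"
    )
    out = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout)

# Stdlib example; substitute 'puremagic' or 'magic' to compare.
print(f"json: {import_time('json') * 1e3:.2f} ms")
```

Running each import in its own subprocess avoids the problem that a module already in `sys.modules` imports nearly instantly the second time.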
- Adding #95 README clarification for Faster Import (thanks to mara004)
I decided to go and test this further because it was bothering me; I knew this was faster in the past (~10 years ago). Testing on the develop branch for 1.27, using just my computer's Downloads folder.

puremagic test file:

```python
import time
from pathlib import Path
import tracemalloc

download_files = list(x for x in Path("Downloads").glob("*") if x.is_file())

tracemalloc.start()
import_time_start = time.perf_counter()
import puremagic
print("Import time:", time.perf_counter() - import_time_start)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nTesting {len(download_files)} files")
download_start_time = time.perf_counter()
unknown_results_types = set()
unknown_total = 0
for file in download_files:
    try:
        puremagic.from_file(file)
    except puremagic.PureError:
        unknown_results_types.add(file.suffix.lower() if file.suffix else file.stem)
        unknown_total += 1
    except Exception as e:
        print(f"Error: {file} - {e}")

print("\nDownload file time:", time.perf_counter() - download_start_time)
print(f"Unknown results types: {unknown_results_types}")
print(f"Unknown total: {unknown_total}")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")
print(f"\nPeak memory usage: {peak / 10**6}MB")
tracemalloc.stop()
```

python-magic test file:

```python
import time
from pathlib import Path
import tracemalloc

download_files = list(x for x in Path("Downloads").glob("*") if x.is_file())

tracemalloc.start()
import_time_start = time.perf_counter()
import magic
print("Import time:", time.perf_counter() - import_time_start)
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")

print(f"\nTesting {len(download_files)} files")
download_start_time = time.perf_counter()
unknown_results_types = set()
unknown_total = 0
for file in download_files:
    try:
        result = magic.from_file(file)
    except Exception as e:
        print(f"Error: {file} - {e}")
    else:
        if result in ("ASCII text", "data"):
            unknown_results_types.add(file.suffix.lower() if file.suffix else file.stem)
            unknown_total += 1

print("Download file time:", time.perf_counter() - download_start_time)
print(f"Unknown results types: {unknown_results_types}")
print(f"Unknown total: {unknown_total}")
current, peak = tracemalloc.get_traced_memory()
print(f"Current memory usage: {current / 10**6}MB")
print(f"\nPeak memory usage: {peak / 10**6}MB")
tracemalloc.stop()
```

puremagic results
python-magic results
The difference in this instance was:
I did also ensure that the overhead for checking unknown types was the same in both cases, and that removing it produced the same speed differences.

The only time I saw the python-magic wrapper come out faster was when doing 1000+ iterations over a small test string. I don't have 1000+ different strings to test with, so I don't know whether that's because it is genuinely faster or because it cached the results. Which makes me think I should maybe add an LRU cache with a configurable size.

Overall, this is giving me lots to think about, and I'm happy with my findings. Thanks for the inspiration @mara004. Going to keep it just as "Faster" in, now with proof ™️
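For the record, a rough sketch of what such a configurable LRU cache could look like, wrapping any bytes-to-type detector. The `make_cached_detector` helper and its names are hypothetical, not part of puremagic's current API:

```python
from functools import lru_cache

def make_cached_detector(detect, maxsize=128):
    """Wrap a bytes -> type-name detector with a configurable LRU cache.

    `detect` is any callable taking the file's leading bytes; repeated
    identical inputs (e.g. the same small test string) are answered from
    the cache instead of re-running signature matching.
    """
    @lru_cache(maxsize=maxsize)
    def cached(head: bytes):
        return cached.detect(head)
    cached.detect = detect
    return cached

# Toy detector for illustration; puremagic.from_string would slot in here.
detector = make_cached_detector(
    lambda head: "python" if head.startswith(b"#!") else "data"
)
print(detector(b"#!/usr/bin/env python"))  # computed on first call
print(detector(b"#!/usr/bin/env python"))  # served from the cache
```

Since `lru_cache` keys on the argument, this only helps when the same leading bytes recur, which matches the repeated-small-string benchmark pattern described above.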
The Readme claims
Do you have any actual evidence for that (a reproducible benchmark or similar)?
Typically, pure-Python re-implementations are slower than C library bindings, unless the pure-Python package uses significantly more efficient algorithms, or there is a lot of object transfer or FFI overhead involved in the binding.