Description
After upgrading to puremagic 2.x, UTF‑16 Little Endian text files (e.g., CSV) that begin with a BOM (FF FE) are now misidentified as .mp1 (MPEG Layer I audio).
This is a regression compared to puremagic 1.x.
Problem
A CSV file encoded as UTF‑16 LE starts with the bytes:
FF FE
In puremagic 1.x this file was identified as .ini (text/plain). The extension is wrong for a CSV, but the MIME type text/plain is acceptable for UTF‑16 text files.
In puremagic 2.0 the same file is now incorrectly identified as .mp1 (MPEG Layer I audio).
It seems that puremagic 2.0 treats the BOM as part of the file signature and attempts to match it against binary magic numbers, leading to a false positive.
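To illustrate why the BOM can pass for an audio header, here is a minimal sketch that reads the first two bytes as the start of a standard MPEG audio frame header (11 sync bits, 2 version bits, 2 layer bits, 1 protection bit). It is not puremagic's matching code, which I have not traced. The function name `decode_mpeg_audio_prefix` is made up for this example.

```python
from typing import Optional


def decode_mpeg_audio_prefix(data: bytes) -> Optional[dict]:
    """Interpret the first two bytes as the start of an MPEG audio frame header.

    Hypothetical helper for illustration only; not part of puremagic.
    Returns None when the 11-bit frame sync is absent.
    """
    if len(data) < 2:
        return None
    word = (data[0] << 8) | data[1]
    if word >> 5 != 0x7FF:  # the top 11 bits must all be set (frame sync)
        return None
    version_bits = (word >> 3) & 0b11
    layer_bits = (word >> 1) & 0b11
    versions = {0b00: "MPEG-2.5", 0b10: "MPEG-2", 0b11: "MPEG-1"}
    layers = {0b01: "Layer III", 0b10: "Layer II", 0b11: "Layer I"}
    return {
        "version": versions.get(version_bits, "reserved"),
        "layer": layers.get(layer_bits, "reserved"),
        "crc_protected": (word & 1) == 0,  # protection bit 0 means a CRC follows
    }


# Same kind of payload as in the report: UTF-16 LE BOM followed by CSV text.
bom_info = decode_mpeg_audio_prefix(b"\xff\xfe" + "a,b,c\n".encode("utf-16-le"))
print(bom_info)
```

Read this way, FF FE decodes as MPEG-1, Layer I, which would be consistent with the .mp1 result if the 2.0 signature table matches on these two bytes alone.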
Expected behavior
UTF‑16 text files should not be classified as audio files.
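One possible way to get this behavior, sketched here as a caller-side pre-check rather than a proposed patch: look for a Unicode BOM before any binary signature matching. The helper name `sniff_bom` is hypothetical; only the standard library `codecs` BOM constants are used.

```python
import codecs
from typing import Optional

# UTF-32 LE (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE),
# so the longer BOMs have to be tested first.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return an encoding label if data starts with a Unicode BOM, else None.

    Hypothetical helper for illustration; not part of puremagic.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return None


sample = b"\xff\xfe" + "a,b,c\n1,2,3\n".encode("utf-16-le")
print(sniff_bom(sample))
```

A caller could treat any non-None result as text/plain and fall back to puremagic.from_string(data, mime=True) only when no BOM is found. This is a heuristic: UTF‑16 LE text whose first character after the BOM is U+0000 would be reported as UTF-32 LE.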
Actual behavior
UTF‑16 LE CSV files are detected as .mp1.
Minimal reproducible example
import puremagic
data = b"\xff\xfe" + "a,b,c\n1,2,3\n".encode("utf-16-le")
print(puremagic.from_string(data))
print(puremagic.from_string(data, mime=True))
Output in puremagic 2.0:
mp1
audio/mpeg
Output in puremagic 1.x:
.ini
text/plain
Environment
puremagic version: 2.0.0
Python version: 3.13.11
OS: Windows 10