UTF‑16 LE BOM (FF FE) causes CSV files to be misidentified as .mp1 in puremagic 2.0 #134

@tomazfs

Description

After upgrading to puremagic 2.x, UTF‑16 Little Endian text files (e.g., CSV) that begin with a BOM (FF FE) are now misidentified as .mp1 (MPEG Layer I audio).

This is a regression compared to puremagic 1.x.

Problem

A CSV file encoded as UTF‑16 LE starts with the bytes:

FF FE

In puremagic 1.x this file was identified as .ini (text/plain). The .ini extension was wrong, but the MIME type text/plain was correct, which is acceptable for UTF‑16 text files.
In puremagic 2.0 the same file is identified as .mp1 (MPEG Layer I audio), so both the extension and the MIME type are now wrong.

It seems that puremagic 2.0 treats the BOM as part of the file signature and attempts to match it against binary magic numbers, leading to a false positive.
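
To illustrate why these two bytes can collide with an audio signature: in the MPEG audio frame header layout, the first 16 bits are an 11-bit sync word, a 2-bit version ID, a 2-bit layer field, and a protection bit. The sketch below decodes FF FE according to that layout. The helper name `decode_mpeg_header_prefix` is made up for this illustration, and nothing here reflects how puremagic actually stores or matches its signatures, which has not been checked.

```python
# Hypothetical helper for illustration only; not part of puremagic.
MPEG_VERSIONS = {0b00: "MPEG-2.5", 0b01: "reserved", 0b10: "MPEG-2", 0b11: "MPEG-1"}
MPEG_LAYERS = {0b00: "reserved", 0b01: "Layer III", 0b10: "Layer II", 0b11: "Layer I"}


def decode_mpeg_header_prefix(first_two: bytes) -> dict:
    """Interpret two bytes as the start of an MPEG audio frame header."""
    word = int.from_bytes(first_two[:2], "big")
    return {
        "sync_ok": (word >> 5) == 0x7FF,             # top 11 bits all set
        "version": MPEG_VERSIONS[(word >> 3) & 0b11],
        "layer": MPEG_LAYERS[(word >> 1) & 0b11],
        "crc_protected": (word & 0b1) == 0,          # protection bit 0 means a CRC follows
    }


print(decode_mpeg_header_prefix(b"\xff\xfe"))
```

Read this way, FF FE has a valid sync word with the version and layer fields set to MPEG-1 and Layer I, so a two-byte signature for .mp1 would be indistinguishable from a UTF‑16 LE BOM.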

Expected behavior

UTF‑16 text files should not be classified as audio files.

Actual behavior

UTF‑16 LE CSV files are detected as .mp1.

Minimal reproducible example

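import puremagic

# UTF-16 LE BOM (FF FE) followed by CSV text encoded as UTF-16 LE
data = b"\xff\xfe" + "a,b,c\n1,2,3\n".encode("utf-16-le")
print(puremagic.from_string(data))             # detected extension
print(puremagic.from_string(data, mime=True))  # detected MIME type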

Output in puremagic 2.0:

mp1
audio/mpeg

Output in puremagic 1.x:

.ini
text/plain
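
As a possible caller-side workaround until this is resolved, the input can be checked for a Unicode BOM before it reaches the detector. This is a minimal sketch under stated assumptions, not a proposed patch: `bom_encoding` and `identify_mime` are hypothetical names, and the detector is passed in as a plain callable so the snippet does not depend on puremagic itself.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def bom_encoding(data: bytes, sample_size: int = 4096):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, encoding in _BOMS:
        if not data.startswith(bom):
            continue
        sample = data[len(bom):len(bom) + sample_size]
        decoder = codecs.getincrementaldecoder(encoding)()
        try:
            # Incremental decoding tolerates a character cut off at the sample boundary.
            decoder.decode(sample)
        except UnicodeDecodeError:
            continue  # BOM-like prefix, but the rest is not valid in this encoding
        return encoding
    return None


def identify_mime(data: bytes, fallback):
    """Report text/plain for BOM-prefixed text, otherwise defer to `fallback`."""
    if bom_encoding(data) is not None:
        return "text/plain"
    return fallback(data)
```

With puremagic installed, the fallback could be `lambda d: puremagic.from_string(d, mime=True)`. The check is only a heuristic: almost any byte sequence decodes as UTF‑16, so a genuine MPEG Layer I stream that begins with FF FE would also be reported as text.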

Environment

puremagic version: 2.0.0
Python version: 3.13.11
OS: Windows 10
