AMULET.py crashes on empty --rfilter BED (pandas EmptyDataError) #30

@chlee-tabin

Summary

AMULET.py reads the --rfilter BED file via pd.read_csv(args.rfilter, sep="\t", header=None).values[:,0:3] (line 194). If the BED file is genuinely empty (a valid case for organisms without a published blacklist), pandas raises:

pandas.errors.EmptyDataError: No columns to parse from file

The crash occurs before any part of the AMULET algorithm runs, so the OverlapSummary.txt and Overlaps.txt files already produced upstream by FragmentFileOverlapCounter.py go unused.

Reproduction

Pass a zero-byte BED file as --rfilter:

touch /tmp/empty_blacklist.bed
bash AMULET.sh fragments.tsv.gz singlecell.csv chrs.txt /tmp/empty_blacklist.bed outdir scriptpath

Full traceback:

Traceback (most recent call last):
  File "AMULET.py", line 194, in <module>
    simplerepeats = po.getUnionPeaks([pd.read_csv(args.rfilter, sep="\t", header=None).values[:,0:3]])
  ...
  File ".../pandas/_libs/parsers.pyx", line 581, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
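
For reference, the same failure can be reproduced without AMULET at all. The snippet below is a minimal standalone sketch: read_bed_naive is a hypothetical helper name that only mirrors the read_csv call quoted in the traceback, and the temporary file stands in for the zero-byte BED.

```python
import os
import tempfile

import pandas as pd


def read_bed_naive(path):
    # Hypothetical helper: same read_csv call as in the traceback above.
    return pd.read_csv(path, sep="\t", header=None).values[:, 0:3]


# Create a zero-byte file, equivalent to `touch empty_blacklist.bed`.
with tempfile.NamedTemporaryFile(suffix=".bed", delete=False) as fh:
    empty_bed = fh.name

raised = False
message = ""
try:
    read_bed_naive(empty_bed)
except pd.errors.EmptyDataError as exc:
    # pandas refuses to parse a file that has no columns at all.
    raised = True
    message = str(exc)
finally:
    os.remove(empty_bed)

print(raised, message)
```

This suggests the problem is purely in how the file is read, not in anything specific to the AMULET pipeline.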

Workaround

A single dummy line on a fake chromosome unblocks pandas without affecting the algorithm, since a region on a nonexistent chromosome cannot intersect any real fragment:

echo -e "__no_blacklist__\t0\t1" > /tmp/repeats.bed

This is what I'm doing in production for duck (Anas platyrhynchos) snATAC multiome — no duck-specific blacklist exists.

Suggested fix

Guard the pd.read_csv call against empty files and skip the filtering step in that case. Roughly:

import os
simplerepeats = np.zeros((0, 3))
if args.rfilter and os.path.getsize(args.rfilter) > 0:
    try:
        simplerepeats = po.getUnionPeaks([
            pd.read_csv(args.rfilter, sep="\t", header=None).values[:, 0:3]
        ])
    except pd.errors.EmptyDataError:
        pass  # treat as no-filter

(Or even simpler: catch the EmptyDataError and fall through to the no-filter branch.)
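
A slightly fuller sketch of the catch-only variant, written as a standalone helper so it can be tested in isolation. load_filter_regions is a hypothetical name, not something that exists in AMULET.py, and I have not checked whether po.getUnionPeaks accepts a zero-row array, so the caller may still need to skip that call when no regions come back.

```python
import numpy as np
import pandas as pd


def load_filter_regions(path):
    """Return the first three BED columns as an (n, 3) array.

    Hypothetical helper: returns a (0, 3) array when no path is given
    or when the file contains no parseable rows.
    """
    empty = np.empty((0, 3), dtype=object)
    if not path:
        return empty
    try:
        table = pd.read_csv(path, sep="\t", header=None)
    except pd.errors.EmptyDataError:
        # Empty BED: treat as "no filter regions".
        return empty
    return table.values[:, 0:3]
```

With this, a caller could check regions.shape[0] == 0 and take the no-filter path, instead of relying on the dummy-line workaround.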

Why this matters

The README says the repeats filter is recommended but not strictly required. Empty BED is a legitimate use case for users working with non-mammalian / non-model-organism datasets without published ENCODE-style blacklists. Currently those users have to know about the dummy-line workaround, which is not documented anywhere.

Low priority, since a trivial workaround exists, but worth a 5-line fix for usability.
