Summary
AMULET.py reads the --rfilter BED file via pd.read_csv(args.rfilter, sep="\t", header=None).values[:,0:3] (line 194). If the BED file is genuinely empty (a valid case for organisms without a published blacklist), pandas raises:
pandas.errors.EmptyDataError: No columns to parse from file
This happens before any of the AMULET algorithm runs, so the time spent producing OverlapSummary.txt and Overlaps.txt upstream (in FragmentFileOverlapCounter.py) is wasted.
Reproduction
Pass a zero-byte BED file as --rfilter:
touch /tmp/empty_blacklist.bed
bash AMULET.sh fragments.tsv.gz singlecell.csv chrs.txt /tmp/empty_blacklist.bed outdir scriptpath
Full traceback:
Traceback (most recent call last):
  File "AMULET.py", line 194, in <module>
    simplerepeats = po.getUnionPeaks([pd.read_csv(args.rfilter, sep="\t", header=None).values[:,0:3]])
  ...
  File ".../pandas/_libs/parsers.pyx", line 581, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Workaround
A single dummy line on a fake chromosome unblocks pandas without affecting the algorithm, since a region on a nonexistent chromosome cannot intersect any real fragment:
echo -e "__no_blacklist__\t0\t1" > /tmp/repeats.bed
This is what I'm doing in production for duck (Anas platyrhynchos) snATAC multiome — no duck-specific blacklist exists.
Suggested fix
Guard the pd.read_csv call against empty files and skip the filtering step in that case. Roughly:
import os

# Default: an empty (0, 3) array means "no regions to filter".
simplerepeats = np.zeros((0, 3))
if args.rfilter and os.path.getsize(args.rfilter) > 0:
    try:
        simplerepeats = po.getUnionPeaks([
            pd.read_csv(args.rfilter, sep="\t", header=None).values[:, 0:3]
        ])
    except pd.errors.EmptyDataError:
        pass  # treat as no-filter
(Or even simpler: catch the EmptyDataError and fall through to the no-filter branch.)
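For reference, here is a standalone sketch of the catch-only variant that can be run outside AMULET. The helper name read_bed_or_empty is hypothetical and not part of AMULET. I have not checked whether po.getUnionPeaks accepts a zero-row array, so the sketch stops at the read step and the caller would need to skip the union/filter when the result is empty.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def read_bed_or_empty(path):
    """Return the first three BED columns, or a (0, 3) array for an empty file.

    Hypothetical helper for illustration; not part of AMULET.
    """
    try:
        return pd.read_csv(path, sep="\t", header=None).values[:, 0:3]
    except pd.errors.EmptyDataError:
        # Raised by pandas for a zero-byte file: treat as "no regions to filter".
        return np.zeros((0, 3), dtype=object)


with tempfile.TemporaryDirectory() as tmp:
    # Zero-byte BED file, equivalent to `touch empty.bed`.
    empty_bed = os.path.join(tmp, "empty.bed")
    open(empty_bed, "w").close()

    # Ordinary one-region BED file for comparison.
    one_line_bed = os.path.join(tmp, "one.bed")
    with open(one_line_bed, "w") as fh:
        fh.write("chr1\t100\t200\n")

    empty_regions = read_bed_or_empty(empty_bed)
    one_region = read_bed_or_empty(one_line_bed)

print(empty_regions.shape)
print(one_region.shape)
```

With a helper like this, the call site only needs a check on the number of rows returned to decide whether to run the filtering step.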
Why this matters
The README says the repeats filter is recommended but not strictly required. Empty BED is a legitimate use case for users working with non-mammalian / non-model-organism datasets without published ENCODE-style blacklists. Currently those users have to know about the dummy-line workaround, which is not documented anywhere.
Low priority — trivial workaround exists — but worth a 5-line fix for surface ergonomics.