v1.1.2 - Duplicate filename fix + promoter extraction speedup
What's new in v1.1.2
Bug fix — output overwrite freeze
Re-running promoter extraction with an existing run name caused the app
to enter a "not responding" state on Windows. The root cause was silent
overwriting of a large existing FASTA file: Python's open(..., "w")
truncates and rewrites the file synchronously, stalling the I/O scheduler
for multi-hundred-MB outputs (e.g. Arachis hypogaea promoters).
Fix: a unique_path() guard now auto-increments output names
(run_name → run_name(1) → run_name(2) …) before any write is attempted.
The run-name field in the UI reflects the actual name used. The same guard
applies to genome FASTA and GFF3 saves in the Download tab.
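The auto-increment behaviour described above can be sketched as follows. This is a minimal illustration, not the app's actual code: only the name unique_path() and the run_name → run_name(1) → run_name(2) pattern come from these notes, while the signature, the use of pathlib, and the placement of the counter before the file extension are assumptions.

```python
from pathlib import Path


def unique_path(path: Path) -> Path:
    """Return `path` if nothing exists there, else the first free
    `stem(n)suffix` variant in the same directory (assumed naming scheme)."""
    if not path.exists():
        return path
    counter = 1
    while True:
        # e.g. run_name.fasta -> run_name(1).fasta -> run_name(2).fasta
        candidate = path.with_name(f"{path.stem}({counter}){path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1
```

The existence check and the later write are separate steps, so a sketch like this only guards against accidental overwrites by a single user, not against two processes racing for the same name.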
Performance — chromosome-grouped promoter extraction
Previous behaviour: SeqIO.index() was called once per gene record.
For a genome with N genes distributed across C chromosomes, this
produced N independent random-seek file reads — approximately 70,000
disk seeks for A. hypogaea (gnm2, 20 chromosomes, ~69,000 annotated
genes). Wall time: ~57 s on a typical NVMe SSD.
New behaviour: Gene records are first grouped by sequence ID (chromosome).
Each chromosome sequence is then fetched from disk exactly once and
cached as a plain Python str object in memory. Promoter slicing and
reverse-complement operations are performed entirely in RAM against this
cached string. Reverse complement uses str.maketrans + slice reversal
rather than constructing a BioPython Seq object per gene, eliminating
per-record object allocation overhead.
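A minimal sketch of the grouped approach, under stated assumptions: the function names, the gene tuple layout (gene_id, seq_id, start, end, strand with 1-based inclusive GFF3 coordinates), the upstream parameter, and the fetch callable are all hypothetical and chosen for illustration. Only the grouping by sequence ID, the single fetch per chromosome, and the str.maketrans + slice reversal technique come from the description above.

```python
from collections import defaultdict

# Translation table built once and reused for every gene.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string with a slice."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_promoters(fetch, genes, upstream=1000):
    """Yield (gene_id, promoter) pairs, fetching each chromosome once.

    `fetch` is a hypothetical callable mapping a sequence ID to the full
    chromosome as a plain str. With Biopython it could be built as:
        index = SeqIO.index("genome.fna", "fasta")
        fetch = lambda seq_id: str(index[seq_id].seq)
    """
    # Group gene records by sequence ID (chromosome).
    by_chrom = defaultdict(list)
    for gene in genes:
        by_chrom[gene[1]].append(gene)

    for seq_id, chrom_genes in by_chrom.items():
        chrom = fetch(seq_id)  # the only disk access for this chromosome
        for gene_id, _, start, end, strand in chrom_genes:
            if strand == "+":
                # Bases immediately before the 1-based gene start.
                lo = max(0, start - 1 - upstream)
                promoter = chrom[lo:start - 1]
            else:
                # Bases immediately after the gene end, read on the minus strand.
                promoter = reverse_complement(chrom[end:end + upstream])
            yield gene_id, promoter
```

Because chrom is rebound on each pass through the outer loop, the previous chromosome string becomes eligible for garbage collection as soon as the next one is loaded.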
Complexity: disk reads reduced from O(N) to O(C).
For A. hypogaea: ~69,000 per-gene reads → 20 chromosome reads.
Timing: ~57 s measured before the change; expected ~10–20 s after
(roughly 3–6× speedup).
Memory overhead: one chromosome sequence string held in RAM at a time
(largest A. hypogaea chromosome ~160 Mb, i.e. roughly 160 MB as an ASCII
str; well within typical RAM limits).
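As a rough check on the memory figure, the snippet below estimates the in-memory size of a chromosome-length string. It is a sketch that assumes CPython, where an ASCII-only str uses about one byte per character; the helper name str_size_mb is hypothetical.

```python
import sys


def str_size_mb(n_bases: int) -> float:
    """Estimate the size in MB of an ASCII str holding n_bases characters,
    by measuring a 1,000,000-character sample on the running interpreter."""
    sample = "A" * 1_000_000
    bytes_per_base = sys.getsizeof(sample) / len(sample)
    return n_bases * bytes_per_base / 1e6


# Largest A. hypogaea chromosome, ~160 Mb (megabases).
estimate = str_size_mb(160_000_000)
```

On CPython the estimate comes out close to 160 MB, since the fixed per-object overhead is negligible at this scale.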