Release v1.0.8
Release Notes
Summary
This release keeps the build-datasets process reliable as external services change, particularly the AlphaFold DB PAE hosting policy and slower Ensembl/BioMart endpoints. It also speeds up dataset builds and adds tooling to prepare and incorporate in-house AlphaFold-predicted structures. Finally, it fixes container path mounting in the Nextflow pipeline and refactors the configuration.
Changes
(#83) Fixed and refactored the Nextflow scripts to enable path validation and Singularity bind mounting for data_dir and, when required, annotations_dir.
(#80) Batched Ensembl CDS retrieval (50 IDs/request) and added rate‑limit handling to speed up build-datasets.
(#80) Hardened BioMart metadata download with an archive-to-latest fallback, a Python fallback, and clearer logging.
(#80) Added --custom_pae_dir for supplying pre‑downloaded PAE; when PAE is missing, pCMAPs fall back to binary contact maps.
(#80) Enforced AF v4 for MANE builds, set non‑MANE default --af_version to v6, and propagated AF version correctly through fragment merging.
(#80) Improved robustness for empty/missing sequences and prevented Tri_context computation on NA.
(#80) Missing Ensembl CDS are now retried one‑by‑one (in parallel) after the batched pass; batch workers are capped at 8 to reduce timeouts.
(#80) Limited download segment concurrency to 8 for large downloads (MANE summary, BioMart, etc.) to reduce connection failures.
(#80) Custom MANE symbols are now propagated from samplesheets (with debug warnings when an ENSP is not found in the MANE summary).
(#80) Added summary logging for custom PDB copies (counts of copied/skipped files; SEQRES insertions logged at debug level).
(#80) Preprocessing behavior change: update_samplesheet_and_structures.py always adds symbol to final bundles but keeps samplesheet.csv clean; --include-metadata now only adds CGC/length.
(#80) prepare_samplesheet.py now runs standalone from tools/preprocessing/ (it adds the repo root to sys.path).
(#80) Refined PAE probe logic: the missing streak now resets only on "ok" (not on transient failures).
(#78) Added parser_alphafold_predictions.py to flatten nested AlphaFold prediction outputs into <ENSP_ID>.alphafold.pdb.
(#78) Fixed samplesheet.csv sorting in update_samplesheet_and_structures.py when --cgc-list-path is provided.
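For readers unfamiliar with the batched retrieval pattern described in the #80 entries, the following is a minimal sketch of the idea, not the project's actual code. It splits IDs into batches of 50, fetches them with at most 8 workers, then retries anything still missing one ID at a time. The names retrieve_cds, chunked, safe_fetch, and the injected fetch callable are hypothetical; the real implementation also handles rate limits, which this sketch omits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

BATCH_SIZE = 50   # IDs per request, as stated in the release notes
MAX_WORKERS = 8   # worker cap, as stated in the release notes

# Hypothetical fetcher type: takes a list of IDs, returns {id: sequence}
# for the IDs it could resolve. It may raise on a failed request.
FetchFn = Callable[[List[str]], Dict[str, str]]


def chunked(ids: List[str], size: int) -> List[List[str]]:
    """Split ids into consecutive lists of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def safe_fetch(fetch: FetchFn, ids: List[str]) -> Dict[str, str]:
    """Call fetch and treat any failure as 'nothing retrieved'."""
    try:
        return fetch(ids)
    except Exception:  # broad on purpose: this is only a sketch
        return {}


def retrieve_cds(
    ids: Iterable[str],
    fetch: FetchFn,
    batch_size: int = BATCH_SIZE,
    max_workers: int = MAX_WORKERS,
) -> Dict[str, str]:
    """Batched pass first, then a one-by-one retry for missing IDs."""
    unique_ids = list(dict.fromkeys(ids))  # de-duplicate, keep order
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Pass 1: one request per batch, run in parallel.
        batches = chunked(unique_ids, batch_size)
        for partial in pool.map(lambda b: safe_fetch(fetch, b), batches):
            results.update(partial)
        # Pass 2: retry each missing ID on its own, also in parallel.
        missing = [i for i in unique_ids if i not in results]
        for partial in pool.map(lambda i: safe_fetch(fetch, [i]), missing):
            results.update(partial)
    return results
```

Passing the fetcher in as an argument keeps the sketch independent of any particular Ensembl client and makes the retry behaviour easy to exercise with a stub.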
Full Changelog: v1.0.7...v1.0.8