Release v1.0.6

@St3451 St3451 released this 29 Jul 21:50

MANE-Only Dataset Preprocessing and Custom Structure Integration

This release introduces a major enhancement to the Oncodrive3D dataset building pipeline: MANE-only processing, support for custom structural predictions, and annotation of MANE Select transcript-associated structures fully decoupled from UniProt.

Key Features & Improvements

Refactored Dataset Builder (build-datasets)

  • Direct MANE Integration: The default data source for MANE Select transcript-associated structures is no longer the UniProt API but the MANE Select protein set obtained directly from NCBI.

  • Custom Structures Support:

    • This feature allows you to integrate in-house AlphaFold2-predicted structures, which can be generated via the nf-core/proteinfold pipeline, into the Oncodrive3D build using two new arguments:

      • --custom_mane_pdb_dir: directory containing custom PDB files.

      • --custom_mane_metadata: path to a samplesheet.csv containing the metadata for those structures.

    • The objective is to maximize structural coverage of the MANE Select transcriptome, compensating for proteins missing from the AlphaFold Database MANE release, which is still based on version 1.0 and lacks hundreds of proteins.

    • The samplesheet.csv (custom_mane_metadata) must include at minimum:

      • sequence: the Ensembl protein ID used as the structure identifier.

      • refseq: the amino acid sequence (in one-letter code).
        The sequence is injected into the PDB file when the structure lacks it, which is common for predicted structures generated via nf-core/proteinfold.
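
The samplesheet requirements above can be checked before running the build. The following is a minimal sketch, not Oncodrive3D's own code: it reads a samplesheet, verifies the two documented columns (sequence and refseq), and formats a one-letter sequence as PDB SEQRES records. The function names are hypothetical, and the use of SEQRES records as the injection target is an assumption; the release notes do not say how the sequence is written into the PDB file.

```python
import csv
import io

REQUIRED_COLUMNS = ("sequence", "refseq")  # column names taken from the notes above

# Standard 20 amino acids only; non-standard letters (e.g. U, X) are rejected
# here as a simplification of this sketch.
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def read_samplesheet(handle):
    """Return {structure ID: amino acid sequence} from a samplesheet CSV."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"samplesheet is missing columns: {missing}")
    records = {}
    for row in reader:
        seq_id = row["sequence"].strip()
        aa_seq = row["refseq"].strip().upper()
        unknown = set(aa_seq) - set(THREE_LETTER)
        if not seq_id or not aa_seq or unknown:
            raise ValueError(f"invalid row for {seq_id!r}: {sorted(unknown)}")
        records[seq_id] = aa_seq
    return records


def seqres_lines(aa_seq, chain="A"):
    """Format a one-letter sequence as PDB SEQRES records, 13 residues per line."""
    residues = [THREE_LETTER[aa] for aa in aa_seq]
    lines = []
    for serial, start in enumerate(range(0, len(residues), 13), start=1):
        chunk = " ".join(residues[start:start + 13])
        lines.append(f"SEQRES {serial:>3} {chain} {len(residues):>4}  {chunk}")
    return lines


# Example with a made-up identifier and a three-residue sequence.
example = io.StringIO("sequence,refseq\nENSP00000000001,MKT\n")
for seq_id, aa_seq in read_samplesheet(example).items():
    print(seq_id, seqres_lines(aa_seq))
```

Failing early on a missing column or an unexpected residue letter keeps malformed metadata from reaching the build step.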

New Utility: prepare_samplesheet.py

A standalone preprocessing script that:

  • Downloads and parses the full MANE.GRCh38.v1.4.ensembl_protein.faa.gz release from NCBI.

  • Cross-references these proteins with the AlphaFold MANE mapping file (mane_refseq_prot_to_alphafold.csv) to identify missing structures.

  • Generates:

    • A samplesheet.csv listing all MANE Select proteins missing from the AlphaFold Database release, including necessary metadata for structure prediction and downstream integration.

    • Individual FASTA files per Ensembl protein ID, ready for direct input to the nf-core/proteinfold pipeline.

These outputs allow users to predict and recover missing structures, enabling full coverage of the MANE proteome, independent of AlphaFold’s release schedule or UniProt mappings.
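
The cross-referencing step can be sketched roughly as follows. This is an illustration under stated assumptions, not the code of prepare_samplesheet.py: all function names are hypothetical, the protein ID is assumed to be the first token of each FASTA header, and the set of IDs already covered by AlphaFold is taken as given because the column layout of mane_refseq_prot_to_alphafold.csv is not described here. The sketch writes only the two columns documented above; the real samplesheet may carry more.

```python
import csv
import io
import textwrap


def parse_fasta(handle):
    """Return {protein ID: sequence}; the ID is the first token of each header."""
    records, current = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # assumption: ID comes first in the header
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {pid: "".join(parts) for pid, parts in records.items()}


def missing_structures(mane_proteins, covered_ids):
    """Keep only the proteins whose ID is absent from the covered set."""
    return {pid: seq for pid, seq in mane_proteins.items() if pid not in covered_ids}


def write_samplesheet(missing, handle):
    """Write one row per missing protein, sorted by ID."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["sequence", "refseq"])
    for pid in sorted(missing):
        writer.writerow([pid, missing[pid]])


def fasta_record(pid, seq, width=60):
    """Format a single-protein FASTA record with wrapped sequence lines."""
    return f">{pid}\n{textwrap.fill(seq, width)}\n"


# Example with made-up identifiers and sequences.
fasta = io.StringIO(">ENSP00000000001.1 example\nMKT\nAAA\n>ENSP00000000002.3\nMV\n")
proteins = parse_fasta(fasta)
missing = missing_structures(proteins, covered_ids={"ENSP00000000001.1"})
sheet = io.StringIO()
write_samplesheet(missing, sheet)
print(sheet.getvalue())
```

In a real run the input would be the decompressed MANE protein FASTA, and each fasta_record result would be written to its own file for the prediction pipeline.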


Full Changelog: v1.0.5...v1.0.6