-
Notifications
You must be signed in to change notification settings - Fork 1
historical assembly version backfill (phase 0) #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
8797ac1
b0c2c60
645b7ea
237742e
8d1666b
7101396
899074a
792f34e
91ab12b
6786644
8576cb3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,141 @@ | ||
| # Pull Request: Historical Assembly Version Backfill (Phase 0) | ||
|
|
||
| ## Summary | ||
|
|
||
| This PR implements a one-time backfill process to populate historical assembly versions for all existing assemblies in the genomehubs dataset. It enables version-aware milestone tracking by capturing superseded assembly versions that were previously not tracked. | ||
|
|
||
| ## Solution Overview | ||
|
|
||
| Implements a **one-time backfill script** that: | ||
| 1. Identifies assemblies with version > 1 (indicating historical versions exist) | ||
| 2. Discovers all historical versions via NCBI FTP | ||
| 3. Fetches metadata for each historical version using NCBI Datasets CLI | ||
| 4. Outputs historical versions to a separate TSV file with `versionStatus = "superseded"` | ||
| 5. Uses smart caching to avoid re-fetching data on reruns | ||
|
|
||
| ## Key Features | ||
|
|
||
| ### 1. FTP-Based Version Discovery | ||
| - Queries NCBI FTP to find all versions of each assembly | ||
| - More reliable than API for historical data | ||
| - Based on proven DToL prototype implementation | ||
|
|
||
| ### 2. Reuses existing Parser | ||
| - Calls `parse_ncbi_assemblies.process_assembly_report()` with `version_status="superseded"` | ||
| - Fetches sequence reports and computes metrics identically to current assemblies | ||
| - Uses same YAML config system | ||
| - Generates identical TSV schema (plus `versionStatus` field) | ||
|
|
||
| ### 3. Smart 2-Tier Caching | ||
| - **Version discovery cache** (7-day expiry): Stores which versions exist | ||
| - **Metadata cache** (30-day expiry): Stores assembly metadata from Datasets CLI | ||
| - Dramatically reduces runtime on reruns (from hours to minutes) | ||
|
|
||
| ### 4. Checkpoint System | ||
| - Saves progress every 100 assemblies | ||
| - Allows resuming after interruptions | ||
| - Critical for processing ~3,694 assemblies | ||
|
|
||
| ## Files Modified/Created | ||
|
|
||
| ### New Files | ||
| - `flows/parsers/backfill_historical_versions.py` - Main backfill script | ||
| - `configs/assembly_historical.yaml` - Output schema configuration | ||
| - `tests/test_backfill.py` - Automated test suite | ||
| - `tests/test_data/assembly_test_sample.jsonl` - Test dataset (3 assemblies) | ||
| - `tests/README_test_backfill.md` - Testing documentation | ||
|
|
||
| ### Modified Files | ||
| - `flows/parsers/parse_ncbi_assemblies.py` - Added `version_status` parameter | ||
| - `flows/lib/utils.py` - Added `parse_s3_file` stub function | ||
|
|
||
| ## Schema Changes | ||
|
|
||
| ### New Field in Output | ||
| - Same schema as current assemblies, plus `versionStatus` column | ||
| - `versionStatus` - Indicates if assembly is "current" or "superseded" | ||
|
|
||
| ### Expected Results | ||
| - **Input**: ~3,694 assemblies with version > 1 | ||
| - **Output**: ~8,500 historical version records | ||
| - **Runtime**: ~10-15 hours (first run), ~15-30 minutes (with cache) | ||
|
|
||
| ## Testing | ||
|
|
||
| ### Automated Tests (4/4 passing) | ||
| ```bash | ||
| python tests/test_backfill.py | ||
| ``` | ||
|
|
||
| Tests verify: | ||
| 1. ✅ Accession parsing | ||
| 2. ✅ Assembly identification | ||
| 3. ✅ FTP version discovery | ||
| 4. ✅ Cache functionality | ||
|
|
||
| ### Manual Test (3 assemblies) | ||
| ```bash | ||
| python flows/parsers/backfill_historical_versions.py \ | ||
| --input tests/test_data/assembly_test_sample.jsonl \ | ||
| --config configs/assembly_historical.yaml \ | ||
| --checkpoint tests/test_data/test_checkpoint.json | ||
| ``` | ||
|
|
||
| ## Usage | ||
|
|
||
| ### One-Time Backfill | ||
| ```bash | ||
| python flows/parsers/backfill_historical_versions.py \ | ||
| --input flows/parsers/eukaryota/ncbi_dataset/data/assembly_data_report.jsonl \ | ||
| --config configs/assembly_historical.yaml \ | ||
| --checkpoint backfill_checkpoint.json | ||
| ``` | ||
| On first run: | ||
| - Takes ~10-15 hours (fetches all historical versions) | ||
| - Creates tmp/backfill_cache/ directory | ||
| - Checkpoints every 100 assemblies | ||
| - Safe to Ctrl+C and resume | ||
|
|
||
| On subsequent runs (after input update): | ||
| - Takes ~15-30 minutes (uses cache for existing) | ||
| - Only fetches NEW assemblies | ||
| - Cache expires: version discovery (7 days), metadata (30 days) | ||
|
|
||
| Output: | ||
| - outputs/assembly_historical.tsv (all superseded versions) | ||
| - Each row has version_status="superseded" | ||
| - Includes all sequence reports and metrics | ||
|
|
||
| ## Important Notes | ||
|
|
||
| ### 1. Stub Function Warning | ||
| The `parse_s3_file()` function in `flows/lib/utils.py` is currently a stub: | ||
|
|
||
| ```python | ||
| def parse_s3_file(s3_path: str) -> dict: | ||
| return {} # Placeholder | ||
| ``` | ||
|
|
||
| **Action needed**: If Rich has a real implementation, replace this stub. The function is imported by `parse_ncbi_assemblies.py` but not used by the backfill script. | ||
|
|
||
| ### 2. One-Time Process | ||
| This backfill is designed to run **once** to populate historical data. Future updates should be handled by the incremental daily pipeline. | ||
|
|
||
| ### 3. Cache Directory | ||
| The cache directory `tmp/backfill_cache/` can be safely deleted after successful completion. It contains: | ||
| - Version discovery data (FTP queries) | ||
| - Assembly metadata (Datasets CLI responses) | ||
|
|
||
| ## Next Steps (After PR Merge) | ||
|
|
||
| 1. **Run full backfill** on production dataset (~10-15 hours) | ||
| 2. **Upload output** to appropriate S3 location | ||
| 3. **Implement Phase 1**: Incremental daily updates for new historical versions | ||
| 4. **Implement Phase 2**: Version-aware milestone tracking | ||
|
|
||
| ## Questions for Reviewer | ||
|
|
||
| 1. Is the `parse_s3_file()` stub acceptable, or do you have a real implementation to use? | ||
| 2. Should historical versions go to a different S3 path than current assemblies? | ||
| 3. Any preferences for checkpoint file location/naming? | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,202 @@ | ||
| # Configuration for historical assembly version backfill | ||
| # This config defines the schema for assembly_historical.tsv | ||
| # which contains all superseded assembly versions | ||
|
|
||
| needs: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should be in the file section and should refer to |
||
| - ncbi_datasets_eukaryota.types.yaml | ||
|
|
||
| attributes: | ||
| assembly_level: | ||
| header: assemblyLevel | ||
| path: assemblyInfo.assemblyLevel | ||
| assembly_span: | ||
| header: totalSequenceLength | ||
| path: assemblyStats.totalSequenceLength | ||
| assigned_percent: | ||
| header: assignedProportion | ||
| path: processedAssemblyStats.assignedProportion | ||
| assembly_status: | ||
| header: primaryValue | ||
| path: processedAssemblyInfo.primaryValue | ||
| translate: | ||
| "1": primary | ||
| assembly_type: | ||
| header: assemblyType | ||
| path: assemblyInfo.assemblyType | ||
| bioproject: | ||
| header: bioProjectAccession | ||
| path: assemblyInfo.bioprojectLineage.bioprojects.accession | ||
| separator: | ||
| - "," | ||
| biosample: | ||
| header: biosampleAccession | ||
| path: assemblyInfo.biosample.accession | ||
| separator: | ||
| - "," | ||
| chromosome_count: | ||
| header: totalNumberOfChromosomes | ||
| path: assemblyStats.totalNumberOfChromosomes | ||
| contig_count: | ||
| header: numberOfContigs | ||
| path: assemblyStats.numberOfContigs | ||
| contig_l50: | ||
| header: contigL50 | ||
| path: assemblyStats.contigL50 | ||
| contig_n50: | ||
| header: contigN50 | ||
| path: assemblyStats.contigN50 | ||
| ebp_standard_criteria: | ||
| header: ebpStandardCriteria | ||
| path: processedAssemblyStats.ebpStandardCriteria | ||
| separator: | ||
| - "," | ||
| ebp_standard_date: | ||
| header: ebpStandardDate | ||
| path: processedAssemblyStats.ebpStandardDate | ||
| gc_percent: | ||
| header: gcPercent | ||
| path: assemblyStats.gcPercent | ||
| gene_count: | ||
| header: geneCountTotal | ||
| path: annotationInfo.stats.geneCounts.total | ||
| gene_count.source.date: | ||
| header: annotationReleaseDate | ||
| path: annotationInfo.releaseDate | ||
| isolate: | ||
| header: isolate | ||
| path: assemblyInfo.biosample.attributes.name==isolate.value | ||
| last_updated: | ||
| header: releaseDate | ||
| path: assemblyInfo.releaseDate | ||
| mitochondrion_accession: | ||
| header: mitochondrionAccession | ||
| path: processedOrganelleInfo.mitochondrion.accession | ||
| separator: | ||
| - ; | ||
| mitochondrion_assembly_span: | ||
| header: mitochondrionAssemblySpan | ||
| path: processedOrganelleInfo.mitochondrion.assemblySpan | ||
| mitochondrion_gc_percent: | ||
| header: mitochondrionGcPercent | ||
| path: processedOrganelleInfo.mitochondrion.gcPercent | ||
| mitochondrion_scaffolds: | ||
| header: mitochondrionScaffolds | ||
| path: processedOrganelleInfo.mitochondrion.scaffolds | ||
| separator: | ||
| - ; | ||
| noncoding_gene_count: | ||
| header: geneCountNoncoding | ||
| path: annotationInfo.stats.geneCounts.nonCoding | ||
| noncoding_gene_count.source.date: | ||
| header: annotationReleaseDate | ||
| path: annotationInfo.releaseDate | ||
| plastid_accession: | ||
| header: plastidAccession | ||
| path: processedOrganelleInfo.plastid.accession | ||
| separator: | ||
| - ; | ||
| plastid_assembly_span: | ||
| header: plastidAssemblySpan | ||
| path: processedOrganelleInfo.plastid.assemblySpan | ||
| plastid_gc_percent: | ||
| header: plastidGcPercent | ||
| path: processedOrganelleInfo.plastid.gcPercent | ||
| plastid_scaffolds: | ||
| header: plastidScaffolds | ||
| path: processedOrganelleInfo.plastid.scaffolds | ||
| separator: | ||
| - ; | ||
| protein_count: | ||
| header: geneCountProteincoding | ||
| path: annotationInfo.stats.geneCounts.proteinCoding | ||
| protein_count.source.date: | ||
| header: annotationReleaseDate | ||
| path: annotationInfo.releaseDate | ||
| pseudogene_count: | ||
| header: geneCountPseudogene | ||
| path: annotationInfo.stats.geneCounts.pseudogene | ||
| pseudogene.source.date: | ||
| header: annotationReleaseDate | ||
| path: annotationInfo.releaseDate | ||
| refseq_category: | ||
| header: refseqCategory | ||
| path: assemblyInfo.refseqCategory | ||
| sample_sex: | ||
| header: sex | ||
| path: assemblyInfo.biosample.attributes.name==sex.value | ||
| scaffold_count: | ||
| header: numberOfScaffolds | ||
| path: assemblyStats.numberOfScaffolds | ||
| scaffold_l50: | ||
| header: scaffoldL50 | ||
| path: assemblyStats.scaffoldL50 | ||
| scaffold_n50: | ||
| header: scaffoldN50 | ||
| path: assemblyStats.scaffoldN50 | ||
| sequence_count: | ||
| header: numberOfComponentSequences | ||
| path: assemblyStats.numberOfComponentSequences | ||
| submitter: | ||
| header: submitter | ||
| path: assemblyInfo.submitter | ||
| ungapped_span: | ||
| header: totalUngappedLength | ||
| path: assemblyStats.totalUngappedLength | ||
| # NEW: Version-specific field to indicate this is a historical/superseded version | ||
| version_status: | ||
| header: versionStatus | ||
| path: processedAssemblyInfo.versionStatus | ||
|
|
||
| file: | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this should have a
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added |
||
| display_group: general | ||
| exclusions: | ||
| attributes: | ||
| - bioproject | ||
| - biosample | ||
| identifiers: | ||
| - assembly_id | ||
| taxonomy: | ||
| - taxon_id | ||
| format: tsv | ||
| header: true | ||
| name: assembly_historical.tsv | ||
| source: INSDC | ||
| source_url_stub: https://www.ncbi.nlm.nih.gov/assembly/ | ||
|
|
||
| identifiers: | ||
| assembly_id: | ||
| header: assemblyID | ||
| path: processedAssemblyInfo.assemblyID | ||
| assembly_name: | ||
| header: assemblyName | ||
| path: assemblyInfo.assemblyName | ||
| genbank_accession: | ||
| header: genbankAccession | ||
| path: processedAssemblyInfo.genbankAccession | ||
| refseq_accession: | ||
| header: refseqAccession | ||
| path: processedAssemblyInfo.refseqAccession | ||
| wgs_accession: | ||
| header: wgsProjectAccession | ||
| path: wgsInfo.wgsProjectAccession | ||
|
|
||
| metadata: | ||
| is_primary_value: | ||
| header: primaryValue | ||
| path: processedAssemblyInfo.primaryValue | ||
| source_slug: | ||
| header: genbankAccession | ||
| path: processedAssemblyInfo.genbankAccession | ||
|
|
||
| names: | ||
| common_name: | ||
| header: commonName | ||
| path: organism.commonName | ||
|
|
||
| taxonomy: | ||
| taxon: | ||
| header: organismName | ||
| path: organism.organismName | ||
| taxon_id: | ||
| header: taxId | ||
| path: organism.taxId | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parsers should run as modules, not scripts - changing the way it is called to
python -m flows.parsers.backfill_historical_versionswill ensure all relative imports work and that prefect will be able to call it without any issuesUh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated all documentation to use module syntax (
python -m flows.parsers.backfill_historical_versions) instead of script syntax. Fixed in commit 792f34e.Changes: