Skip to content
Open
141 changes: 141 additions & 0 deletions Phase_0_PR_SUMMARY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,141 @@
# Pull Request: Historical Assembly Version Backfill (Phase 0)

## Summary

This PR implements a one-time backfill process to populate historical assembly versions for all existing assemblies in the genomehubs dataset. It enables version-aware milestone tracking by capturing superseded assembly versions that were previously not tracked.

## Solution Overview

Implements a **one-time backfill script** that:
1. Identifies assemblies with version > 1 (indicating historical versions exist)
2. Discovers all historical versions via NCBI FTP
3. Fetches metadata for each historical version using NCBI Datasets CLI
4. Outputs historical versions to a separate TSV file with `versionStatus = "superseded"`
5. Uses smart caching to avoid re-fetching data on reruns

## Key Features

### 1. FTP-Based Version Discovery
- Queries NCBI FTP to find all versions of each assembly
- More reliable than API for historical data
- Based on proven DToL prototype implementation

### 2. Reuses existing Parser
- Calls `parse_ncbi_assemblies.process_assembly_report()` with `version_status="superseded"`
- Fetches sequence reports and computes metrics identically to current assemblies
- Uses same YAML config system
- Generates identical TSV schema (plus `versionStatus` field)

### 3. Smart 2-Tier Caching
- **Version discovery cache** (7-day expiry): Stores which versions exist
- **Metadata cache** (30-day expiry): Stores assembly metadata from Datasets CLI
- Dramatically reduces runtime on reruns (from hours to minutes)

### 4. Checkpoint System
- Saves progress every 100 assemblies
- Allows resuming after interruptions
- Critical for processing ~3,694 assemblies

## Files Modified/Created

### New Files
- `flows/parsers/backfill_historical_versions.py` - Main backfill script
- `configs/assembly_historical.yaml` - Output schema configuration
- `tests/test_backfill.py` - Automated test suite
- `tests/test_data/assembly_test_sample.jsonl` - Test dataset (3 assemblies)
- `tests/README_test_backfill.md` - Testing documentation

### Modified Files
- `flows/parsers/parse_ncbi_assemblies.py` - Added `version_status` parameter
- `flows/lib/utils.py` - Added `parse_s3_file` stub function

## Schema Changes

### New Field in Output
- Same schema as current assemblies, plus `versionStatus` column
- `versionStatus` - Indicates if assembly is "current" or "superseded"

### Expected Results
- **Input**: ~3,694 assemblies with version > 1
- **Output**: ~8,500 historical version records
- **Runtime**: ~10-15 hours (first run), ~15-30 minutes (with cache)

## Testing

### Automated Tests (4/4 passing)
```bash
python tests/test_backfill.py
```

Tests verify:
1. ✅ Accession parsing
2. ✅ Assembly identification
3. ✅ FTP version discovery
4. ✅ Cache functionality

### Manual Test (3 assemblies)
```bash
python flows/parsers/backfill_historical_versions.py \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parsers should run as modules, not scripts - changing the way it is called to python -m flows.parsers.backfill_historical_versions will ensure all relative imports work and that prefect will be able to call it without any issues

Copy link
Author

@fchen13 fchen13 Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated all documentation to use module syntax (python -m flows.parsers.backfill_historical_versions) instead of script syntax. Fixed in commit 792f34e.

Changes:

  • Updated docstring in the parser
  • Updated test README with correct usage

--input tests/test_data/assembly_test_sample.jsonl \
--config configs/assembly_historical.yaml \
--checkpoint tests/test_data/test_checkpoint.json
```

## Usage

### One-Time Backfill
```bash
python flows/parsers/backfill_historical_versions.py \
--input flows/parsers/eukaryota/ncbi_dataset/data/assembly_data_report.jsonl \
--config configs/assembly_historical.yaml \
--checkpoint backfill_checkpoint.json
```
On first run:
- Takes ~10-15 hours (fetches all historical versions)
- Creates tmp/backfill_cache/ directory
- Checkpoints every 100 assemblies
- Safe to Ctrl+C and resume

On subsequent runs (after input update):
- Takes ~15-30 minutes (uses cache for existing)
- Only fetches NEW assemblies
- Cache expires: version discovery (7 days), metadata (30 days)

Output:
- outputs/assembly_historical.tsv (all superseded versions)
- Each row has version_status="superseded"
- Includes all sequence reports and metrics

## Important Notes

### 1. Stub Function Warning
The `parse_s3_file()` function in `flows/lib/utils.py` is currently a stub:

```python
def parse_s3_file(s3_path: str) -> dict:
return {} # Placeholder
```

**Action needed**: If Rich has a real implementation, replace this stub. The function is imported by `parse_ncbi_assemblies.py` but not used by the backfill script.

### 2. One-Time Process
This backfill is designed to run **once** to populate historical data. Future updates should be handled by the incremental daily pipeline.

### 3. Cache Directory
The cache directory `tmp/backfill_cache/` can be safely deleted after successful completion. It contains:
- Version discovery data (FTP queries)
- Assembly metadata (Datasets CLI responses)

## Next Steps (After PR Merge)

1. **Run full backfill** on production dataset (~10-15 hours)
2. **Upload output** to appropriate S3 location
3. **Implement Phase 1**: Incremental daily updates for new historical versions
4. **Implement Phase 2**: Version-aware milestone tracking

## Questions for Reviewer

1. Is the `parse_s3_file()` stub acceptable, or do you have a real implementation to use?
2. Should historical versions go to a different S3 path than current assemblies?
3. Any preferences for checkpoint file location/naming?

202 changes: 202 additions & 0 deletions configs/assembly_historical.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,202 @@
# Configuration for historical assembly version backfill
# This config defines the schema for assembly_historical.tsv
# which contains all superseded assembly versions

needs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in the file section and should refer to ATTR_assembly.types.yaml directly.

- ncbi_datasets_eukaryota.types.yaml

attributes:
assembly_level:
header: assemblyLevel
path: assemblyInfo.assemblyLevel
assembly_span:
header: totalSequenceLength
path: assemblyStats.totalSequenceLength
assigned_percent:
header: assignedProportion
path: processedAssemblyStats.assignedProportion
assembly_status:
header: primaryValue
path: processedAssemblyInfo.primaryValue
translate:
"1": primary
assembly_type:
header: assemblyType
path: assemblyInfo.assemblyType
bioproject:
header: bioProjectAccession
path: assemblyInfo.bioprojectLineage.bioprojects.accession
separator:
- ","
biosample:
header: biosampleAccession
path: assemblyInfo.biosample.accession
separator:
- ","
chromosome_count:
header: totalNumberOfChromosomes
path: assemblyStats.totalNumberOfChromosomes
contig_count:
header: numberOfContigs
path: assemblyStats.numberOfContigs
contig_l50:
header: contigL50
path: assemblyStats.contigL50
contig_n50:
header: contigN50
path: assemblyStats.contigN50
ebp_standard_criteria:
header: ebpStandardCriteria
path: processedAssemblyStats.ebpStandardCriteria
separator:
- ","
ebp_standard_date:
header: ebpStandardDate
path: processedAssemblyStats.ebpStandardDate
gc_percent:
header: gcPercent
path: assemblyStats.gcPercent
gene_count:
header: geneCountTotal
path: annotationInfo.stats.geneCounts.total
gene_count.source.date:
header: annotationReleaseDate
path: annotationInfo.releaseDate
isolate:
header: isolate
path: assemblyInfo.biosample.attributes.name==isolate.value
last_updated:
header: releaseDate
path: assemblyInfo.releaseDate
mitochondrion_accession:
header: mitochondrionAccession
path: processedOrganelleInfo.mitochondrion.accession
separator:
- ;
mitochondrion_assembly_span:
header: mitochondrionAssemblySpan
path: processedOrganelleInfo.mitochondrion.assemblySpan
mitochondrion_gc_percent:
header: mitochondrionGcPercent
path: processedOrganelleInfo.mitochondrion.gcPercent
mitochondrion_scaffolds:
header: mitochondrionScaffolds
path: processedOrganelleInfo.mitochondrion.scaffolds
separator:
- ;
noncoding_gene_count:
header: geneCountNoncoding
path: annotationInfo.stats.geneCounts.nonCoding
noncoding_gene_count.source.date:
header: annotationReleaseDate
path: annotationInfo.releaseDate
plastid_accession:
header: plastidAccession
path: processedOrganelleInfo.plastid.accession
separator:
- ;
plastid_assembly_span:
header: plastidAssemblySpan
path: processedOrganelleInfo.plastid.assemblySpan
plastid_gc_percent:
header: plastidGcPercent
path: processedOrganelleInfo.plastid.gcPercent
plastid_scaffolds:
header: plastidScaffolds
path: processedOrganelleInfo.plastid.scaffolds
separator:
- ;
protein_count:
header: geneCountProteincoding
path: annotationInfo.stats.geneCounts.proteinCoding
protein_count.source.date:
header: annotationReleaseDate
path: annotationInfo.releaseDate
pseudogene_count:
header: geneCountPseudogene
path: annotationInfo.stats.geneCounts.pseudogene
pseudogene.source.date:
header: annotationReleaseDate
path: annotationInfo.releaseDate
refseq_category:
header: refseqCategory
path: assemblyInfo.refseqCategory
sample_sex:
header: sex
path: assemblyInfo.biosample.attributes.name==sex.value
scaffold_count:
header: numberOfScaffolds
path: assemblyStats.numberOfScaffolds
scaffold_l50:
header: scaffoldL50
path: assemblyStats.scaffoldL50
scaffold_n50:
header: scaffoldN50
path: assemblyStats.scaffoldN50
sequence_count:
header: numberOfComponentSequences
path: assemblyStats.numberOfComponentSequences
submitter:
header: submitter
path: assemblyInfo.submitter
ungapped_span:
header: totalUngappedLength
path: assemblyStats.totalUngappedLength
# NEW: Version-specific field to indicate this is a historical/superseded version
version_status:
header: versionStatus
path: processedAssemblyInfo.versionStatus

file:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should have a needs section to pull in full attribute definitions as per ncbi_datasets yaml

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added needs section referencing ncbi_datasets_eukaryota.types.yaml to inherit the base attribute definitions. Fixed in commit 91ab12b.

display_group: general
exclusions:
attributes:
- bioproject
- biosample
identifiers:
- assembly_id
taxonomy:
- taxon_id
format: tsv
header: true
name: assembly_historical.tsv
source: INSDC
source_url_stub: https://www.ncbi.nlm.nih.gov/assembly/

identifiers:
assembly_id:
header: assemblyID
path: processedAssemblyInfo.assemblyID
assembly_name:
header: assemblyName
path: assemblyInfo.assemblyName
genbank_accession:
header: genbankAccession
path: processedAssemblyInfo.genbankAccession
refseq_accession:
header: refseqAccession
path: processedAssemblyInfo.refseqAccession
wgs_accession:
header: wgsProjectAccession
path: wgsInfo.wgsProjectAccession

metadata:
is_primary_value:
header: primaryValue
path: processedAssemblyInfo.primaryValue
source_slug:
header: genbankAccession
path: processedAssemblyInfo.genbankAccession

names:
common_name:
header: commonName
path: organism.commonName

taxonomy:
taxon:
header: organismName
path: organism.organismName
taxon_id:
header: taxId
path: organism.taxId
Loading
Loading