fix: recover VCF INFO fields around malformed tokens #156

Merged
nvictus merged 6 commits into abdenlab:main from mwiewior:fix/vcf-info-double-semicolon
Feb 16, 2026

Conversation

@mwiewior
Contributor

Summary

  • Replace per-field info.get() calls with a single info.iter() pass that collects all successfully parsed INFO fields into a HashMap
  • Fields on both sides of malformed tokens (e.g. ;;) are now recovered instead of being silently nullified
  • Removes noisy eprintln! stderr messages that fired once per record per failing field
  • Also improves performance by scanning the INFO string once instead of N times per record

Context

Real-world VCFs such as Ensembl variation files contain records with double semicolons (;;) in the INFO column:

dbSNP_156;TSA=SNV;E_Freq;E_gnomAD;;MA=C;MAF=0.123;MAC=19;AA=G

The previous code called info.get(&header, &field_name) for each defined INFO field. Under the hood, noodles' get() scans the raw INFO string from the beginning each time and aborts at the first tokenization error. This caused:

  1. All fields after ;; (e.g. MA, MAF, MAC, AA) to return null
  2. All fields not present in the record (e.g. COSMIC_101, ClinVar_*, CLIN_*) to also error, since the scan would hit ;; before reaching the end
  3. Noisy stderr output: Error parsing INFO field: "..." for every affected (record, field) pair
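
The failure mode above can be reproduced with a small standard-library model. This is only a sketch: `lookup_from_start` is a hypothetical helper, not the noodles API, and it assumes that an empty token (the gap inside `;;`) is what triggers the tokenization error.

```rust
/// Simplified model of a per-field lookup that rescans the raw INFO string
/// from the beginning and aborts at the first malformed (empty) token.
/// Hypothetical helper for illustration; this is NOT how noodles is implemented.
fn lookup_from_start<'a>(info: &'a str, key: &str) -> Result<Option<&'a str>, String> {
    for token in info.split(';') {
        if token.is_empty() {
            // The scan gives up here, so nothing past `;;` is ever examined.
            return Err(format!("Error parsing INFO field: {:?}", info));
        }
        // `KEY=VALUE` tokens carry a value; bare tokens are flags.
        let (name, value) = match token.split_once('=') {
            Some((name, value)) => (name, value),
            None => (token, ""),
        };
        if name == key {
            return Ok(Some(value));
        }
    }
    // Only reachable when the whole string tokenizes cleanly.
    Ok(None)
}
```

With the Ensembl record shown earlier, a lookup for `TSA` succeeds because it sits before `;;`, while lookups for `MA` (after `;;`) and `COSMIC_101` (absent from the record) both return an error, matching cases 1 and 2.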

The fix uses info.iter() which advances past malformed tokens and continues parsing. All parseable fields are collected in a single pass, then looked up by name.
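
The single-pass idea can be sketched without noodles as follows. The names `collect_info_fields` and `parse_token` are hypothetical, and plain string splitting stands in for the noodles iterator. The return type loosely mirrors the `HashMap<&'a str, Option<InfoFieldValue<'a>>>` signature in this PR, with `&str` standing in for the parsed value.

```rust
use std::collections::HashMap;

/// Parse one INFO token into (name, optional value).
/// An empty token (the gap inside `;;`) is treated as malformed.
fn parse_token(token: &str) -> Result<(&str, Option<&str>), String> {
    if token.is_empty() {
        return Err("empty INFO token".to_string());
    }
    match token.split_once('=') {
        Some((name, value)) => Ok((name, Some(value))),
        None => Ok((token, None)), // flag field without a value
    }
}

/// Scan the INFO string once, dropping tokens that fail to parse and
/// keeping everything else, so fields on both sides of `;;` survive.
fn collect_info_fields(info: &str) -> HashMap<&str, Option<&str>> {
    info.split(';')
        .filter_map(|token| parse_token(token).ok())
        .collect()
}
```

Each defined field is then a single `fields.get(name)` lookup. A field that is absent from the record yields `None` from the map rather than an error.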

Test plan

  • Added test_push_vcf_record_with_double_semicolons_in_info — verifies fields before and after ;; are both recovered (not null)
  • All 106 existing tests pass
  • Clippy clean (-D warnings)
  • Manual test with Ensembl homo_sapiens-chr1.vcf.gz

🤖 Generated with Claude Code

Real-world VCFs such as Ensembl variation files contain double
semicolons (`;;`) in the INFO column. The previous approach called
`info.get()` per field, which scans from the beginning each time and
aborts at the first tokenization error, silently nullifying all fields
past the error and printing noisy stderr messages.

Replace per-field `info.get()` with a single `info.iter()` pass that
collects all successfully parsed fields into a HashMap. The noodles
iterator advances past malformed tokens, so fields on both sides of
`;;` are now recovered. This also improves performance by scanning
the INFO string once instead of N times per record.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
info: &'a dyn noodles::vcf::variant::record::Info,
header: &'a noodles::vcf::Header,
) -> HashMap<&'a str, Option<InfoFieldValue<'a>>> {
info.iter(header)
Member

It's unfortunate that this requires parsing all the fields, but it doesn't look like noodles provides a way to tokenize without parsing.

Member

@nvictus nvictus left a comment

Overall, this looks good to me. Thank you!

One issue is that we now swallow tokenization and parse errors as null values. I agree that the eprintlns were noisy, but just a mental note that it would be good to provide a way to emit something in the future, or perhaps use a sentinel value to distinguish "no value" from "failed to parse".
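
One possible shape for such a sentinel is a three-state result type. This is a hypothetical sketch and not part of this PR: `FieldState` and `parse_int_field` are illustrative names, and the example assumes an integer-typed field parsed straight from the raw string.

```rust
/// Three-state lookup result that keeps "no value" distinct from
/// "failed to parse". Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum FieldState<T> {
    /// The key does not occur in the record.
    Absent,
    /// The key occurs and its value parsed successfully.
    Parsed(T),
    /// The key occurs but its value could not be parsed.
    Failed(String),
}

/// Look up an integer-valued INFO field in a raw INFO string.
fn parse_int_field(info: &str, key: &str) -> FieldState<i32> {
    for token in info.split(';') {
        // Empty tokens and flags have no `=`, so they are skipped here.
        if let Some((name, raw)) = token.split_once('=') {
            if name == key {
                return match raw.parse::<i32>() {
                    Ok(value) => FieldState::Parsed(value),
                    Err(err) => FieldState::Failed(format!("{}={}: {}", name, raw, err)),
                };
            }
        }
    }
    FieldState::Absent
}
```

A caller could map both `Absent` and `Failed` to null in the output while counting or logging the `Failed` cases once per batch, which avoids per-record stderr noise without discarding the error entirely.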

@nvictus nvictus merged commit a3a1ecd into abdenlab:main Feb 16, 2026
8 checks passed