# Polars-Bio VCF FORMAT Fields Issue

This notebook demonstrates that `polars-bio`'s `scan_vcf()` function does **not extract FORMAT/genotype fields** (GT, DP, AD, GQ, VAF, PL, etc.) from VCF files.

Only the fixed VCF columns and the requested INFO fields are extracted; the per-sample genotype data from the FORMAT column is missing from the output.

**Expected FORMAT fields in VCF:**
- `GT` - Genotype (e.g., 0/0, 0/1, 1/1)
- `DP` - Read depth
- `AD` - Allelic depths
- `GQ` - Genotype quality
- `VAF` - Variant allele fraction
- `PL` - Phred-scaled genotype likelihoods

In [14]:
from pathlib import Path
import polars as pl
import polars_bio as pb

print(f'polars-bio version: {pb.__version__}')

polars-bio version: 0.19.0


In [15]:
base = (Path.cwd() if (Path.cwd() / "notebooks").exists() else Path.cwd().parent).absolute().resolve()
print(base)
data = base / "data"
input_dir = data / "input"  # avoid shadowing the built-in input()
tests = input_dir / "tests"


/home/antonkulaga/sources/just-dna-lite


## 1. Check VCF Header - What FORMAT fields are defined

In [16]:
vcf_path = tests / "antku_small.vcf"

# Read header to see what FORMAT fields are defined in the VCF
format_fields = []
with open(vcf_path, 'r') as f:
    for line in f:
        if line.startswith('##FORMAT='):
            # Extract field name from ##FORMAT=<ID=GT,...>
            start = line.find('ID=') + 3
            end = line.find(',', start)
            field_name = line[start:end]
            format_fields.append(field_name)
            print(line.strip())
        elif not line.startswith('#'):
            break

print(f"\nFORMAT fields defined in VCF header: {format_fields}")

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">

FORMAT fields defined in VCF header: ['GT', 'GQ', 'DP', 'MIN_DP', 'AD', 'VAF', 'PL']
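The `find`-based extraction above keeps only the field ID. Each `##FORMAT` line also declares a `Number` and a `Type`, which determine how a value should be parsed. Below is a minimal sketch that captures all three, assuming the keys appear in the order `ID,Number,Type` as they do in this file. `FORMAT_RE` and `parse_format_header` are illustrative names, not part of polars-bio.

```python
import re

# Assumes the keys appear as ID, Number, Type (the order used in this file).
FORMAT_RE = re.compile(r"##FORMAT=<ID=([^,]+),Number=([^,]+),Type=([^,>]+)")


def parse_format_header(lines):
    """Return {field_id: (number, type)} for every ##FORMAT line."""
    schema = {}
    for line in lines:
        match = FORMAT_RE.match(line)
        if match:
            field_id, number, type_ = match.groups()
            schema[field_id] = (number, type_)
    return schema


# Three of the header lines printed above.
header = [
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">',
    '##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">',
]
schema = parse_format_header(header)
print(schema)
```

`Number=1` marks a scalar, while `R`, `A` and `G` mark comma-separated lists whose length depends on the alleles or genotypes at the site.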


## 2. Show a raw VCF data line to see FORMAT values

In [17]:
# Show a few data lines to see the FORMAT column structure
print("Sample VCF data lines (showing FORMAT and sample columns):")
print("=" * 80)

with open(vcf_path, 'r') as f:
    lines_shown = 0
    for line in f:
        if line.startswith('#CHROM'):
            cols = line.strip().split('\t')
            print(f"Column header: {cols}")
            print()
        elif not line.startswith('#'):
            cols = line.strip().split('\t')
            print(f"CHROM: {cols[0]}")
            print(f"POS: {cols[1]}")
            print(f"ID: {cols[2]}")
            print(f"REF: {cols[3]}")
            print(f"ALT: {cols[4]}")
            print(f"QUAL: {cols[5]}")
            print(f"FILTER: {cols[6]}")
            print(f"INFO: {cols[7]}")
            print(f"FORMAT: {cols[8]}  <-- defines field order")
            print(f"SAMPLE: {cols[9]}  <-- actual values (GT:GQ:DP:AD:VAF:PL)")
            print()
            lines_shown += 1
            if lines_shown >= 3:
                break

Sample VCF data lines (showing FORMAT and sample columns):
Column header: ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT', 'default']

CHROM: 1
POS: 10009
ID: .
REF: A
ALT: AC
QUAL: 0
FILTER: RefCall
INFO: .
FORMAT: GT:GQ:DP:AD:VAF:PL  <-- defines field order
SAMPLE: 0/0:27:10:6,2:0.2:0,27,34  <-- actual values (GT:GQ:DP:AD:VAF:PL)

CHROM: 1
POS: 10015
ID: .
REF: A
ALT: G
QUAL: 0
FILTER: RefCall
INFO: .
FORMAT: GT:GQ:DP:AD:VAF:PL  <-- defines field order
SAMPLE: 0/0:35:17:12,4:0.235294:0,35,41  <-- actual values (GT:GQ:DP:AD:VAF:PL)

CHROM: 1
POS: 10021
ID: .
REF: A
ALT: G
QUAL: 0
FILTER: RefCall
INFO: .
FORMAT: GT:GQ:DP:AD:VAF:PL  <-- defines field order
SAMPLE: 0/0:39:22:16,5:0.227273:0,39,45  <-- actual values (GT:GQ:DP:AD:VAF:PL)
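Each sample column packs its values in the order given by the FORMAT column. The sketch below pairs the two for the three records printed above and recomputes VAF from AD and DP. In these records VAF equals the ALT read count divided by DP; that is an observation about this file, not a rule for every variant caller. `parse_sample` is a hypothetical helper.

```python
def parse_sample(format_str, sample_str):
    """Pair FORMAT keys with the values of one sample column (all strings)."""
    return dict(zip(format_str.split(":"), sample_str.split(":")))


# FORMAT and sample columns copied from the three records shown above.
records = [
    ("GT:GQ:DP:AD:VAF:PL", "0/0:27:10:6,2:0.2:0,27,34"),
    ("GT:GQ:DP:AD:VAF:PL", "0/0:35:17:12,4:0.235294:0,35,41"),
    ("GT:GQ:DP:AD:VAF:PL", "0/0:39:22:16,5:0.227273:0,39,45"),
]

recomputed = []
for fmt, sample in records:
    fields = parse_sample(fmt, sample)
    depth = int(fields["DP"])
    alt_reads = int(fields["AD"].split(",")[1])  # AD lists REF first, then ALT
    vaf = round(alt_reads / depth, 6)
    recomputed.append(vaf)
    print(fields["GT"], "reported VAF:", fields["VAF"], "recomputed:", vaf)
```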



## 3. Load VCF with polars-bio scan_vcf()

In [5]:
# Load with polars-bio - requesting INFO fields
lf = pb.scan_vcf(str(vcf_path), info_fields=["END"])

print("Schema from polars-bio scan_vcf():")
print(lf.collect_schema())

Schema from polars-bio scan_vcf():
Schema({'chrom': String, 'start': UInt32, 'end': UInt32, 'id': String, 'ref': String, 'alt': String, 'qual': Float64, 'filter': String, 'END': Int32})


In [6]:
# Show first few rows
print("\nFirst 5 rows from polars-bio:")
lf.head(5).collect()


First 5 rows from polars-bio:


5rows [00:00, 941.78rows/s]


| chrom | start | end | id | ref | alt | qual | filter | END |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | u32 | u32 | str | str | str | f64 | str | i32 |
| "1" | 10009 | 10009 | "" | "A" | "AC" | 0.0 | "RefCall" | null |
| "1" | 10015 | 10015 | "" | "A" | "G" | 0.0 | "RefCall" | null |
| "1" | 10021 | 10021 | "" | "A" | "G" | 0.0 | "RefCall" | null |
| "1" | 10027 | 10027 | "" | "A" | "G" | 0.0 | "RefCall" | null |
| "1" | 10033 | 10033 | "" | "A" | "G" | 0.0 | "RefCall" | null |


## 4. Check which columns are present vs missing

In [7]:
columns = lf.collect_schema().names()

print("Columns present in polars-bio output:")
print(columns)

print("\n" + "=" * 60)
print("ISSUE: FORMAT fields NOT present in polars-bio output:")
print("=" * 60)

present = {c.lower() for c in columns}  # case-insensitive lookup
for field in format_fields:
    if field.lower() not in present:
        print(f"  - {field} (MISSING)")

print("\n" + "=" * 60)
print("CONCLUSION:")
print("=" * 60)
print("polars-bio scan_vcf() extracts INFO fields but does NOT extract")
print("FORMAT/genotype fields (GT, DP, AD, GQ, etc.) from VCF files.")
print("\nThis is a known limitation - the FORMAT column contains per-sample")
print("genotype data which is critical for variant analysis but is not")
print("currently supported by polars-bio.")

Columns present in polars-bio output:
['chrom', 'start', 'end', 'id', 'ref', 'alt', 'qual', 'filter', 'END']

ISSUE: FORMAT fields NOT present in polars-bio output:
  - GT (MISSING)
  - GQ (MISSING)
  - DP (MISSING)
  - MIN_DP (MISSING)
  - AD (MISSING)
  - VAF (MISSING)
  - PL (MISSING)

CONCLUSION:
polars-bio scan_vcf() extracts INFO fields but does NOT extract
FORMAT/genotype fields (GT, DP, AD, GQ, etc.) from VCF files.

This is a known limitation - the FORMAT column contains per-sample
genotype data which is critical for variant analysis but is not
currently supported by polars-bio.


## Workaround: scan_vcf_with_formats() from just-dna-pipelines

We implemented a workaround that extends polars-bio's `scan_vcf()` with FORMAT fields.

The approach:
1. Read with polars-bio (gets chrom, start, end, id, ref, alt, qual, filter, and INFO fields).
2. Read the FORMAT and sample columns with regular Polars (fast, Rust-based string operations).
3. Join the two on row index, which relies on both readers returning records in file order.

In [12]:
from just_dna_pipelines.io import scan_vcf_with_formats

# Load VCF with FORMAT fields using our workaround
lf_with_formats = scan_vcf_with_formats(
    str(vcf_path),
    format_fields=["GT", "GQ", "DP", "AD", "VAF", "PL"]
)

print("Schema from scan_vcf_with_formats():")
print(lf_with_formats.collect_schema())

Schema from scan_vcf_with_formats():
Schema({'chrom': String, 'start': UInt32, 'end': UInt32, 'id': String, 'ref': String, 'alt': String, 'qual': Float64, 'filter': String, 'END': Int32, 'GT': String, 'GQ': Int64, 'DP': Int64, 'AD': List(Int64), 'VAF': List(Float64), 'PL': List(Int64), 'genotype': List(String)})
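The typed columns in this schema (integers for `GQ` and `DP`, lists for `AD`, `VAF` and `PL`) mirror the `Number` and `Type` declared in the VCF header. The following minimal sketch shows how raw sample strings could be converted under those declarations. It is an illustration, not the code inside `scan_vcf_with_formats`; `SCHEMA`, `convert_value` and `parse_typed` are hypothetical names, and `.` is treated as a missing value.

```python
# Number/Type pairs copied from the ##FORMAT header lines of the test file.
SCHEMA = {
    "GT": ("1", "String"),
    "GQ": ("1", "Integer"),
    "DP": ("1", "Integer"),
    "AD": ("R", "Integer"),
    "VAF": ("A", "Float"),
    "PL": ("G", "Integer"),
}

CASTERS = {"Integer": int, "Float": float}


def convert_value(raw, number, type_):
    """Convert one FORMAT value; '.' becomes None, non-scalars become lists."""
    cast = CASTERS.get(type_, str)
    if raw == ".":
        return None
    if number == "1":
        return cast(raw)
    return [None if item == "." else cast(item) for item in raw.split(",")]


def parse_typed(format_str, sample_str):
    """Pair FORMAT keys with sample values and convert each by its schema."""
    typed = {}
    for key, raw in zip(format_str.split(":"), sample_str.split(":")):
        number, type_ = SCHEMA[key]
        typed[key] = convert_value(raw, number, type_)
    return typed


row = parse_typed("GT:GQ:DP:AD:VAF:PL", "0/0:27:10:6,2:0.2:0,27,34")
print(row)
```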


In [13]:
# Show first few rows with FORMAT fields
print("First 5 rows from scan_vcf_with_formats (FORMAT fields now included):")
lf_with_formats.head(5).collect()

First 5 rows from scan_vcf_with_formats (FORMAT fields now included):


5rows [00:00, 1674.77rows/s]


| chrom | start | end | id | ref | alt | qual | filter | END | GT | GQ | DP | AD | VAF | PL | genotype |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | u32 | u32 | str | str | str | f64 | str | i32 | str | i64 | i64 | list[i64] | list[f64] | list[i64] | list[str] |
| "1" | 10009 | 10009 | "" | "A" | "AC" | 0.0 | "RefCall" | null | "0/0" | 27 | 10 | [6, 2] | [0.2] | [0, 27, 34] | ["A", "A"] |
| "1" | 10015 | 10015 | "" | "A" | "G" | 0.0 | "RefCall" | null | "0/0" | 35 | 17 | [12, 4] | [0.235294] | [0, 35, 41] | ["A", "A"] |
| "1" | 10021 | 10021 | "" | "A" | "G" | 0.0 | "RefCall" | null | "0/0" | 39 | 22 | [16, 5] | [0.227273] | [0, 39, 45] | ["A", "A"] |
| "1" | 10027 | 10027 | "" | "A" | "G" | 0.0 | "RefCall" | null | "0/0" | 41 | 24 | [15, 6] | [0.25] | [0, 42, 46] | ["A", "A"] |
| "1" | 10033 | 10033 | "" | "A" | "G" | 0.0 | "RefCall" | null | "0/0" | 38 | 25 | [15, 8] | [0.32] | [0, 39, 42] | ["A", "A"] |


In [None]:
# Verify FORMAT fields are now present
columns_with_formats = lf_with_formats.collect_schema().names()

print("Columns present with scan_vcf_with_formats():")
print(columns_with_formats)

print("\n" + "=" * 60)
print("FORMAT fields now PRESENT:")
print("=" * 60)

for field in ["GT", "GQ", "DP", "AD", "VAF", "PL"]:
    if field in columns_with_formats:
        print(f"  - {field} (AVAILABLE)")
    else:
        print(f"  - {field} (MISSING)")

print("\n" + "=" * 60)
print("WORKAROUND SUCCESS:")
print("=" * 60)
print("scan_vcf_with_formats() successfully extracts FORMAT fields")
print("by combining polars-bio (for INFO fields) with Polars CSV reader")
print("(for FORMAT field extraction) via horizontal join on row index.")

In [None]:
# Example: Filter by genotype (something we couldn't do with plain polars-bio)
print("Example: Filter heterozygous variants (GT = '0/1'):")
het_variants = lf_with_formats.filter(pl.col("GT") == "0/1").collect()
print(f"Found {len(het_variants)} heterozygous variants")
het_variants.head()

## 5. Can info_fields be used for FORMAT fields?

**No.** Passing FORMAT field names to `info_fields` makes polars-bio 0.19.0 panic: it looks the names up among the `##INFO=` header lines, where they are not defined.

In [8]:
# Expected to fail: info_fields only accepts INFO tags, not FORMAT tags.
# Rust panics reach Python through PyO3 as PanicException, which derives from
# BaseException rather than Exception, so `except Exception` would not catch it.
try:
    lf_crash = pb.scan_vcf(str(vcf_path), info_fields=['GT', 'DP', 'AD'])
    print(lf_crash.head(1).collect())
except BaseException as e:
    print(f'CRASHED: {type(e).__name__}')
    print(f'Error: {e}')
    print()
    print('polars-bio looks for GT/DP/AD in ##INFO= header lines (not ##FORMAT=),')
    print('warns "VCF tag not found in header", then crashes with a Rust panic.')


thread '<unnamed>' panicked at /root/.cargo/git/checkouts/datafusion-bio-formats-f4a7f32bff6627c2/d4e6e27/datafusion/bio-format-vcf/src/table_provider.rs:66:55:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace



## 6. Expected vs Actual Output Comparison

In [None]:
print("Expected columns (what should be available):")
expected = ["chrom", "start", "end", "id", "ref", "alt", "qual", "filter", 
            "GT", "GQ", "DP", "MIN_DP", "AD", "VAF", "PL"]  # FORMAT fields
print(expected)

print("\nActual columns (what polars-bio provides):")
print(columns)

print("\nMissing FORMAT fields:")
missing = [f for f in format_fields if f not in columns]
print(missing)

Expected columns (what should be available):
['chrom', 'start', 'end', 'id', 'ref', 'alt', 'qual', 'filter', 'GT', 'GQ', 'DP', 'MIN_DP', 'AD', 'VAF', 'PL']

Actual columns (what polars-bio provides):
['chrom', 'start', 'end', 'id', 'ref', 'alt', 'qual', 'filter', 'END']

Missing FORMAT fields:
['GT', 'GQ', 'DP', 'MIN_DP', 'AD', 'VAF', 'PL']


## 7. What genotype data looks like (manual parsing example)

In [None]:
# Manual parsing to show what data is actually in the VCF
# This is what polars-bio SHOULD be extracting

rows = []
with open(vcf_path, 'r') as f:
    for line in f:
        if line.startswith('#'):
            continue
        cols = line.strip().split('\t')
        format_keys = cols[8].split(':')  # e.g., GT:GQ:DP:AD:VAF:PL
        sample_values = cols[9].split(':')  # e.g., 0/0:27:10:6,2:0.2:0,27,34
        
        row = {
            'chrom': cols[0],
            'pos': int(cols[1]),
            'ref': cols[3],
            'alt': cols[4],
        }
        # Add FORMAT fields
        for key, val in zip(format_keys, sample_values):
            row[key] = val
        
        rows.append(row)
        if len(rows) >= 10:
            break

# Show what the data should look like with FORMAT fields
df_manual = pl.DataFrame(rows)
print("Manual parsing - showing FORMAT fields that should be in polars-bio output:")
df_manual

Manual parsing - showing FORMAT fields that should be in polars-bio output:


| chrom | pos | ref | alt | GT | GQ | DP | AD | VAF | PL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | i64 | str | str | str | str | str | str | str | str |
| "1" | 10009 | "A" | "AC" | "0/0" | "27" | "10" | "6,2" | "0.2" | "0,27,34" |
| "1" | 10015 | "A" | "G" | "0/0" | "35" | "17" | "12,4" | "0.235294" | "0,35,41" |
| "1" | 10021 | "A" | "G" | "0/0" | "39" | "22" | "16,5" | "0.227273" | "0,39,45" |
| "1" | 10027 | "A" | "G" | "0/0" | "41" | "24" | "15,6" | "0.25" | "0,42,46" |
| "1" | 10033 | "A" | "G" | "0/0" | "38" | "25" | "15,8" | "0.32" | "0,39,42" |
| "1" | 10039 | "A" | "G" | "0/0" | "38" | "35" | "23,9" | "0.257143" | "0,39,43" |
| "1" | 10045 | "A" | "G" | "0/0" | "37" | "40" | "29,9" | "0.225" | "0,38,42" |
| "1" | 10051 | "A" | "G" | "0/0" | "39" | "54" | "42,8" | "0.148148" | "0,40,44" |
| "1" | 10055 | "TA" | "T" | "0/0" | "34" | "64" | "59,4" | "0.0625" | "0,35,39" |
| "1" | 10146 | "AC" | "A" | "0/0" | "20" | "66" | "54,11" | "0.166667" | "0,23,22" |


## Summary for GitHub Issue

### Issue Title
`scan_vcf()` does not extract FORMAT/genotype fields (GT, DP, AD, GQ, etc.)

### Version
polars-bio 0.19.0

### Description
The `scan_vcf()` function extracts the fixed VCF columns and the requested INFO fields, but ignores the FORMAT column and the sample columns, which carry critical per-sample genotype data.

### Expected Behavior
FORMAT fields such as `GT`, `DP`, `AD`, `GQ`, `VAF` and `PL` should be available as columns in the output LazyFrame, in the same way the `info_fields` parameter exposes INFO fields.

### Actual Behavior
Only standard VCF columns (chrom, start, end, id, ref, alt, qual, filter) and INFO fields are extracted. FORMAT fields are completely missing.

Attempting to pass FORMAT field names (GT, DP, AD) to the `info_fields` parameter causes a crash:
- Warning: `VCF tag 'GT' not found in header; defaulting to Utf8`
- Rust panic: `called Option::unwrap() on a None value`

### Use Case
FORMAT fields are essential for variant analysis:
- `GT` (genotype) is needed to determine zygosity (homozygous vs heterozygous)
- `DP` (depth) is needed for quality filtering
- `AD` (allelic depth) is needed for allele-specific analysis
- `GQ` (genotype quality) is needed for confidence filtering
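To make the use case concrete, here is a small sketch of the per-genotype logic these fields enable. `zygosity` and `passes_quality` are hypothetical helpers, and the thresholds are placeholders rather than recommended values.

```python
import re


def zygosity(gt):
    """Classify a GT string such as '0/1' (unphased) or '1|1' (phased)."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"


def passes_quality(dp, gq, min_dp=10, min_gq=20):
    """Depth and genotype-quality filter; thresholds are placeholders."""
    return dp >= min_dp and gq >= min_gq


# GT, DP and GQ of the first record in the test file.
print(zygosity("0/0"), passes_quality(dp=10, gq=27))

# Additional genotype strings, made up for illustration.
for gt in ["0/1", "1|1", "./."]:
    print(gt, zygosity(gt))
```

None of this is possible on the current `scan_vcf()` output, because `GT`, `DP` and `GQ` never reach the LazyFrame.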

### Proposed Solution
Add a `format_fields` parameter similar to `info_fields`:
```python
pb.scan_vcf(path, info_fields=["END"], format_fields=["GT", "DP", "AD"])
```