# VCF to FASTA

A robust Python pipeline for applying VCF variants to FASTA reference sequences with comprehensive validation.

## Overview
This pipeline reads a reference FASTA file and a VCF file, applies all variants to the reference sequence, validates the changes, and outputs a new FASTA file with the mutations applied.
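
To make the "apply, then validate" idea concrete, here is a minimal in-memory sketch. It is not the pipeline's implementation; the log output further down suggests the pipeline relies on external tools such as bcftools. The `Variant` tuple and the `apply_variants` function are hypothetical names introduced only for illustration, and the sketch assumes a single sequence with non-overlapping variants.

```python
from typing import List, Tuple

# Hypothetical record type: (1-based position, REF allele, ALT allele),
# mirroring the POS, REF and ALT columns of a VCF record.
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Return a copy of `reference` with every variant applied.

    Raises ValueError when a REF allele does not match the reference,
    which is a simple form of validation. Assumes variants do not overlap.
    """
    seq = reference
    # Apply from the highest position down so that insertions and deletions
    # do not shift the coordinates of variants still waiting to be applied.
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1  # VCF positions are 1-based
        end = start + len(ref)
        observed = seq[start:end]
        if observed.upper() != ref.upper():
            raise ValueError(
                f"REF mismatch at position {pos}: expected {ref!r}, found {observed!r}"
            )
        seq = seq[:start] + alt + seq[end:]
    return seq


reference = "ACGTACGTAC"
variants = [
    (2, "C", "T"),    # single-base substitution
    (4, "TA", "T"),   # deletion of one base
    (8, "T", "TGG"),  # insertion of two bases
]
mutated = apply_variants(reference, variants)
print(mutated)  # ATGTCGTGGAC
```

Checking each REF allele against the reference before substituting catches a VCF that was called against a different reference, which is the kind of error a validation step is meant to surface.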

### Examples

Let's start by importing the library.

In [1]:
import sys

sys.path.append("/workspace")
from src.vcf import VCFtoFASTAPipeline

Now let's define our inputs and outputs.

In [2]:
fasta_path = "/workspace/datasets/haploid/fasta/04245.denovo.fasta"
vcf_path = "/workspace/datasets/haploid/vcf/STR40.high_confidence.vcf"
# Note: despite their names, output_dir and log_dir hold file paths, not directories.
output_dir = "/workspace/datasets/haploid/fasta/STR40.fasta"
log_dir = "/workspace/datasets/haploid/logs/STR40.vcftofasta.pipeline.log"

In [3]:
pipeline = VCFtoFASTAPipeline(
    fasta_path=fasta_path,
    vcf_path=vcf_path,
    output_path=output_dir,
    log_file=log_dir,
)

In [4]:
validation_passed = pipeline.run()

[2026-01-09 01:20:19] VCF to FASTA Pipeline
[2026-01-09 01:20:19] Input FASTA: /workspace/datasets/haploid/fasta/04245.denovo.fasta
[2026-01-09 01:20:19] Input VCF: /workspace/datasets/haploid/vcf/STR40.high_confidence.vcf
[2026-01-09 01:20:19] Output FASTA: /workspace/datasets/haploid/fasta/STR40.fasta
[2026-01-09 01:20:19] Log file: /workspace/datasets/haploid/logs/STR40.vcftofasta.pipeline.log

[2026-01-09 01:20:19] Checking dependencies...
[2026-01-09 01:20:19] ✓ bcftools found
[2026-01-09 01:20:19] ✓ samtools found
[2026-01-09 01:20:19] ✓ bgzip found
[2026-01-09 01:20:19] ✓ tabix found
[2026-01-09 01:20:19] FASTA index already exists: /workspace/datasets/haploid/fasta/04245.denovo.fasta.fai
[2026-01-09 01:20:19] Compressed VCF already exists: /workspace/datasets/haploid/vcf/STR40.high_confidence.vcf.gz
[2026-01-09 01:20:19] VCF index already exists: /workspace/datasets/haploid/vcf/STR40.high_confidence.vcf.gz.tbi
[2026-01-09 01:20:19] 
Number of variants in VCF: 17
[2026-01-09 01:

Great, it looks like we successfully assembled the genome from the reference FASTA and the VCF variant file. Let's do another example that processes multiple VCF files against the same reference.

In [5]:
fasta_path = "/workspace/datasets/haploid/fasta/04245.denovo.fasta"

strains = ["STR134", "STR267", "STR286"]

for strain in strains:
    vcf_path = f"/workspace/datasets/haploid/vcf/{strain}.high_confidence.vcf"
    output_dir = f"/workspace/datasets/haploid/fasta/{strain}.fasta"
    log_dir = f"/workspace/datasets/haploid/logs/{strain}.vcftofasta.pipeline.log"

    pipeline = VCFtoFASTAPipeline(
        fasta_path=fasta_path,
        vcf_path=vcf_path,
        output_path=output_dir,
        log_file=log_dir,
    )
    validation_passed = pipeline.run()

[2026-01-09 01:20:20] VCF to FASTA Pipeline
[2026-01-09 01:20:20] Input FASTA: /workspace/datasets/haploid/fasta/04245.denovo.fasta
[2026-01-09 01:20:20] Input VCF: /workspace/datasets/haploid/vcf/STR134.high_confidence.vcf
[2026-01-09 01:20:20] Output FASTA: /workspace/datasets/haploid/fasta/STR134.fasta
[2026-01-09 01:20:20] Log file: /workspace/datasets/haploid/logs/STR134.vcftofasta.pipeline.log

[2026-01-09 01:20:20] Checking dependencies...
[2026-01-09 01:20:20] ✓ bcftools found
[2026-01-09 01:20:20] ✓ samtools found
[2026-01-09 01:20:20] ✓ bgzip found
[2026-01-09 01:20:20] ✓ tabix found
[2026-01-09 01:20:20] FASTA index already exists: /workspace/datasets/haploid/fasta/04245.denovo.fasta.fai
[2026-01-09 01:20:20] Compressing VCF file...
[2026-01-09 01:20:20] Command: bgzip -c /workspace/datasets/haploid/vcf/STR134.high_confidence.vcf > /workspace/datasets/haploid/vcf/STR134.high_confidence.vcf.gz
[2026-01-09 01:20:20] VCF compressed successfully
[2026-01-09 01:20:20] Indexing VC