Parse a VCF Header to JSON

Motivation

My primary motivation for writing this was to simplify uploading VCF files to BigQuery. The GCP variant upload tool has features I don't like (such as switching to 0-based coordinates and renaming chromosome to "reference"), and as of 3/14/2022 it is not working at all.

As an intermediate step, I decided it would also be useful to output a "raw" version in which the entire header is parsed to JSON with no reformatting or BigQuery-specific assumptions. (See "To Do".)

Base VCF Fields

The schemas for the base VCF fields (CHROM, POS, ID, REF, ALT, and QUAL) are not specified in the VCF header, so this tool does not parse them. I wrote a basic schema for these fields that can be included in the JSON output with the --include_base or -b flag.
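
As a rough illustration of what a base-field schema could look like, here is a minimal Python sketch. The types, modes, and descriptions are assumptions based on the VCF column definitions, not the schema this tool actually emits, and with_base_fields is a hypothetical helper name.

```python
import json

# Hypothetical base-field schema. The types, modes, and descriptions here
# are assumptions for illustration, not the tool's actual output.
BASE_FIELDS = [
    {"name": "CHROM", "type": "STRING", "description": "Chromosome or contig name", "mode": "REQUIRED"},
    {"name": "POS", "type": "INTEGER", "description": "1-based position of the variant", "mode": "REQUIRED"},
    {"name": "ID", "type": "STRING", "description": "Variant identifier", "mode": "NULLABLE"},
    {"name": "REF", "type": "STRING", "description": "Reference allele", "mode": "REQUIRED"},
    {"name": "ALT", "type": "STRING", "description": "Alternate allele(s), comma-separated", "mode": "NULLABLE"},
    {"name": "QUAL", "type": "FLOAT", "description": "Phred-scaled quality score", "mode": "NULLABLE"},
]


def with_base_fields(header_fields):
    """Return a new schema list with the base fields placed first."""
    return BASE_FIELDS + list(header_fields)


if __name__ == "__main__":
    # Print the base fields alone, as they might appear in the JSON output.
    print(json.dumps(with_base_fields([]), indent=4))
```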

Usage:

./vcf_metadata_to_json \
    --in_vcf small.vcf \
    --out_schema small.json \
    --schema_type bq \
    --metadata_sections INFO FORMAT *FILTER \
    --include_base

The tool can also read its input from stdin and write its output to stdout.

Assuming vcf_metadata_to_json is on your PATH:

cat tiny.vcf | vcf_metadata_to_json
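
To give a sense of what parsing a header from a stream involves, here is a minimal Python sketch. It is not the tool's implementation: parse_header and both regular expressions are hypothetical, and the sketch assumes that structured header lines have the form ##KEY=<k=v,...> and that the header ends at the first line not starting with "##".

```python
import io
import json
import re

# Structured lines such as ##INFO=<ID=DP,Number=1,...>
STRUCTURED = re.compile(r"^##(\w+)=<(.*)>$")
# key=value pairs; a value may be double-quoted and contain commas
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_header(stream):
    """Collect "##" header lines from a text stream into a dict of lists."""
    header = {}
    for line in stream:
        line = line.rstrip("\r\n")
        if not line.startswith("##"):
            break  # the "#CHROM ..." column line or data: the header is over
        match = STRUCTURED.match(line)
        if match:
            section, body = match.groups()
            entry = {key: value.strip('"') for key, value in PAIR.findall(body)}
            header.setdefault(section, []).append(entry)
        else:
            # Unstructured line such as ##fileformat=VCFv4.2
            key, _, value = line[2:].partition("=")
            header.setdefault(key, []).append(value)
    return header


if __name__ == "__main__":
    sample = io.StringIO(
        "##fileformat=VCFv4.2\n"
        '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total depth, all samples">\n'
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    )
    print(json.dumps(parse_header(sample), indent=4))
```

In a pipeline, sys.stdin could be passed in place of the io.StringIO sample.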

BigQuery Schema

A BigQuery schema is a flat list of JSON objects, each in the following format:

{
    "name": "Name of the variable",
    "type": "Type of the variable in all-caps. Chiefly: [INTEGER|FLOAT|STRING|BOOL|DATE|STRUCT]",
    "description": "A description of the variable",
    "mode": "[NULLABLE|REQUIRED|REPEATED]"
}

See the BigQuery documentation for a detailed description of the schema, particularly type and mode.
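
As an illustration of the format above, the following Python sketch converts one parsed INFO or FORMAT header entry into a BigQuery field. The type mapping and the rule for choosing mode are assumptions made for this example, not necessarily what vcf_metadata_to_json does, and to_bq_field is a hypothetical name.

```python
import json

# Assumed mapping from VCF header types to BigQuery types.
VCF_TO_BQ_TYPE = {
    "Integer": "INTEGER",
    "Float": "FLOAT",
    "String": "STRING",
    "Character": "STRING",
    "Flag": "BOOL",
}


def to_bq_field(entry):
    """Convert a parsed header entry (a dict of strings) to a BigQuery field."""
    # Assumption: Number=0 (flags) and Number=1 hold a single value; any other
    # Number (A, R, G, ".", 2, ...) may hold several, so the field is REPEATED.
    mode = "NULLABLE" if entry.get("Number") in ("0", "1") else "REPEATED"
    return {
        "name": entry["ID"],
        "type": VCF_TO_BQ_TYPE.get(entry.get("Type"), "STRING"),
        "description": entry.get("Description", ""),
        "mode": mode,
    }


if __name__ == "__main__":
    depth = {"ID": "DP", "Number": "1", "Type": "Integer", "Description": "Total Depth"}
    print(json.dumps(to_bq_field(depth), indent=4))
```

Unknown types fall back to STRING here so that an unusual header never breaks schema generation; a stricter version could raise an error instead.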

To Do:

Facilitate other cloud formats (e.g. Azure, AWS) and SQL engines.
Options for "flattening" the schema, because the schema uploaded to BigQuery needs to be a simple list of key-value pairs (see above).
R utility to convert the schema to bq_field format in the bigrquery library
Option for exporting schema when uploading INFO, FORMAT, and FILTER sections as JSON object literals rather than an expanded set of columns

abalter/vcf_header_parser
