# Converting .vcf.gz to GRG

Many datasets are stored in `.vcf.gz` format by default. Large datasets are usually compressed with [BGZIP](https://www.htslib.org/doc/bgzip.html) so that they can be indexed for semi-random access. Two kinds of index file are commonly used with `BGZIP`-compressed VCFs: those created by [tabix](https://www.htslib.org/doc/tabix.html) and those created by [bcftools](https://samtools.github.io/bcftools/bcftools.html).
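Because plain `gzip` and `BGZIP` files share the `.gz` extension, it can be useful to check which one you have before trying to index it. The sketch below is one way to do that in Python, based on the BGZF block header layout (a gzip header with the extra field set and a `BC` subfield of length 2). The function name `is_bgzf` is hypothetical and is not part of `pygrgl` or htslib.

```python
import struct


def is_bgzf(path):
    """Return True if the file starts with a BGZF (BGZIP) block header.

    Hypothetical helper for illustration; not part of pygrgl or htslib.
    """
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12:
            return False
        # gzip magic bytes (0x1f 0x8b), DEFLATE method (8), FEXTRA flag (0x04).
        if header[0] != 0x1F or header[1] != 0x8B or header[2] != 8:
            return False
        if not header[3] & 0x04:
            return False
        # XLEN: little-endian length of the extra field that follows.
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = f.read(xlen)

    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return True
        pos += 4 + slen
    return False
```

A file written by ordinary `gzip` returns `False` here, and such a file would need to be recompressed with `bgzip` before `tabix` can index it.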

In this tutorial we'll show the (very simple) process of converting `.vcf.gz` data to the `GRG` format, which is much smaller (usually at least `25x` smaller) and much faster to compute on (by many orders of magnitude).

**What you'll need:**
* Python dependency `pygrgl`: `pip install pygrgl`
* Command-line tools `wget` and `tabix`: `sudo apt install wget tabix` (or your distribution's equivalent)

## Get Dataset

For our example, we'll just download a very small simulated dataset that is stored as `.vcf.gz`.

In [1]:
%%bash

# Download a small example dataset
wget https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz -O vcf_convert.example.vcf.gz

--2026-02-06 12:31:22--  https://github.com/aprilweilab/grg_pheno_sim/raw/refs/heads/main/demos/data/test-200-samples.vcf.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz [following]
--2026-02-06 12:31:23--  https://raw.githubusercontent.com/aprilweilab/grg_pheno_sim/refs/heads/main/demos/data/test-200-samples.vcf.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 494022 (482K) [application/octet-stream]
Saving to: ‘vcf_convert.example.vcf.gz’

     0K .......... .......... .......... .......... .......... 10% 1.87M 0

## Convert to GRG

Let's first attempt to convert to GRG without having an index for the file.

In [2]:
%%bash

# -j controls how many threads to use.
grg construct -j 1 vcf_convert.example.vcf.gz -o vcf_convert.example.grg || true

Will not count variants in VCF files (too slow)
Could not count number of variants in vcf_convert.example.vcf.gz. Using the default of 100 (use --parts to override).
Processing input file in 100 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
terminate called after throwing an instance of 'grgl::ApiMisuseFailure'
  0%|          | 0/100 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/construct.py", line 300, in star_build_grg
    return build_grg(*args)
  File "/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/construct.py", line 266, in build_grg
    shape_grg = build_shape(range_triple, args, auto_args, input_file, output_file)
  File "/home/ddehaas/GrgProject/public/grgl/pygrgl/clicmd/construct.py", line 253, in build_sh

The `grg` tool did not like that. Why? The reason is given in its warning: `WARNING: Conversion from VCF without a tabix index is very slow, and not recommended.`

For large datasets, converting a `.vcf.gz` to GRG without an index is very slow. This warning (and failure) is there to prevent you from doing so accidentally. However, we know that our dataset is really small, so we can safely use the `--force` flag to force GRG construction.

In [3]:
%%bash

# -j controls how many threads to use.
grg construct --force -j 1 vcf_convert.example.vcf.gz -o vcf_convert.example.grg

Will not count variants in VCF files (too slow)
Could not count number of variants in vcf_convert.example.vcf.gz. Using the default of 100 (use --parts to override).
Processing input file in 100 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 100/100 [00:06<00:00, 15.57it/s]
Merging...


=== GRG Statistics ===
Nodes: 12945
Edges: 117253
Samples: 400
Mutations: 10893
Ploidy: 2
Phased: true
Populations: 0
Range of mutations: 55829 - 9999127
Specified range: 0 - 100000001
Wrote simplified GRG with:
  Nodes: 12945
  Edges: 117253
Wrote GRG to vcf_convert.example.grg


That gave us a valid GRG by essentially "brute force" accessing the VCF file without an index. This works fine on a small file, but not on a large one. So instead, let's index this file with `tabix` and try again. **NOTE**: `grg` only supports `tabix`-style indexes, and not `bcftools`-style indexes, on `.vcf.gz` files.
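To see which kind of index, if any, sits next to a `.vcf.gz` file, a small check like the following may help. It is only a sketch and relies on naming conventions: `tabix` writes `<file>.tbi` by default, and `bcftools index` writes `<file>.csi` by default. The helper name `find_indexes` is hypothetical, and whether `grg` looks for exactly `<file>.tbi` is an assumption this tutorial does not confirm.

```python
from pathlib import Path


def find_indexes(vcf_path):
    """Report which conventionally named index files exist next to a .vcf.gz.

    Hypothetical helper: it only checks file names, not index contents.
    """
    vcf_path = str(vcf_path)
    return {
        # Default output name of `tabix <file>`.
        "tbi": Path(vcf_path + ".tbi").is_file(),
        # Default output name of `bcftools index <file>`.
        "csi": Path(vcf_path + ".csi").is_file(),
    }
```

For the example file, `find_indexes("vcf_convert.example.vcf.gz")` would be expected to report no `.tbi` index until `tabix` has been run on it.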

In [4]:
%%bash
tabix vcf_convert.example.vcf.gz

Now we can construct a GRG without having to use `--force`, and we don't get any warnings.
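If you want to script this workflow, the index-then-construct sequence could be wrapped roughly as follows. This is a minimal sketch, not part of `pygrgl`: the function name `plan_commands` and its parameters are hypothetical, and it uses only the `tabix` invocation and the `grg construct` flags (`-j`, `-o`) that appear in this tutorial. It returns the commands rather than running them, so the plan can be inspected first.

```python
def plan_commands(vcf_path, grg_path, has_tabix_index, threads=1):
    """Return the commands (as argument lists) needed to build a GRG.

    Hypothetical helper: it only assembles commands and does not run them.
    """
    commands = []
    if not has_tabix_index:
        # Index first so that `grg construct` does not need --force.
        commands.append(["tabix", str(vcf_path)])
    commands.append(
        ["grg", "construct", "-j", str(threads), str(vcf_path), "-o", str(grg_path)]
    )
    return commands
```

Each returned list can then be passed to `subprocess.run(cmd, check=True)`, which stops the script at the first command that fails.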

In [5]:
%%bash

# -j controls how many threads to use.
grg construct -j 1 vcf_convert.example.vcf.gz -o vcf_convert.example.grg

Will not count variants in VCF files (too slow)
Could not count number of variants in vcf_convert.example.vcf.gz. Using the default of 100 (use --parts to override).
Processing input file in 100 parts.
Auto-calculating number of trees per part.
Converting segments of input data to graphs
100%|██████████| 100/100 [00:03<00:00, 30.38it/s]
Merging...


=== GRG Statistics ===
Nodes: 12945
Edges: 117253
Samples: 400
Mutations: 10893
Ploidy: 2
Phased: true
Populations: 0
Range of mutations: 55829 - 9999127
Specified range: 0 - 100000001
Wrote simplified GRG with:
  Nodes: 12945
  Edges: 117253
Wrote GRG to vcf_convert.example.grg


## Related Topics

* Often, it is better to convert a `.vcf.gz` to [IGD](https://picovcf.readthedocs.io/en/latest/igd_overview.html) first and then convert to GRG. IGD files can be substantially faster to access than `.vcf.gz` files. See [Converting IGD to GRG](IGDToGRG.html).