# Exploring VCF Storage Solutions
# Notebook 2. List of common bcftools operations
2025-02-12 Daniel P. Brink

This notebook is a work in progress. It will be updated with new commands over time.

# 1. Introduction

The purpose of this notebook is to make an inventory of some common VCF queries and operations that the `bcftools` CLI can perform. As such, it serves more as a reference than as an exploratory notebook. Readers who are already familiar with `bcftools` might want to skip ahead to the next notebooks.

The examples will be based on the [official documentation for `bcftools`](https://samtools.github.io/bcftools/bcftools.html). See also
examples on [queries](https://samtools.github.io/bcftools/howtos/query.html) and [filtering](https://samtools.github.io/bcftools/howtos/filtering.html) operations.


# 2. Setup

Since we will mainly be using `bcftools` in this notebook, we will use the same conda environment used in notebook 1. It was installed with the following commands:

```bash
CONDA_SUBDIR=osx-64 conda create -n explore_vcf_storage_solutions
conda activate explore_vcf_storage_solutions
conda install mamba -y
mamba install jupyter -y
mamba install bcftools -y 
pip install humanfriendly
```


In [2]:
import sys
import os
import requests
import humanfriendly

#Check that Conda and the libraries are installed as expected:
print(f"Current Conda environment: {os.environ['CONDA_DEFAULT_ENV']}")
print(f"Current Python version: {sys.version}")
!bcftools --version

Current Conda environment: explore_vcf_storage_solutions
Current Python version: 3.13.1 | packaged by conda-forge | (main, Jan 13 2025, 09:48:16) [Clang 18.1.8 ]
bcftools 1.21
Using htslib 1.21
Copyright (C) 2024 Genome Research Ltd.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.


# 3. Commonly used bcftools operations

`bcftools` is a toolbox that can perform a wide range of operations from variant calling, annotation, and format conversion to queries, filtering, and subsetting. For the intents and purposes of the subsequent notebooks, we will mainly be using the following commands:

- `bcftools view`
- `bcftools merge`
- `bcftools query`


In [1]:
downsampled_vcf_gz = "./input_data_temp/1kG_p3_chr1_first_200_samples_c1.vcf.gz"
downsampled_bcf = "./input_data_temp/1kG_p3_chr1_first_200_samples_c1.bcf"

## 3.1. Commonly used flags

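- `--no-header`: suppress the header in the output
- `--header-only`: print only the header
- `--no-version`: do not append the `bcftools` version and command line to the header
- `-o`: write the output to a file instead of standard output
- `-Ob`: write the output as compressed BCF
- `-Oz`: write the output as compressed VCF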
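As a quick illustration of how these flags combine, here is a minimal sketch of a helper that assembles a `bcftools view` command as an argument list. The function name `build_view_command` and the output file name `subset.vcf.gz` are assumptions made for this example only; the helper just builds the command and does not run it.

```python
import shlex


def build_view_command(input_path, output_path=None, output_type=None,
                       no_header=False, header_only=False, no_version=False):
    """Assemble a `bcftools view` command line as a list of arguments."""
    cmd = ["bcftools", "view"]
    if no_header:
        cmd.append("--no-header")
    if header_only:
        cmd.append("--header-only")
    if no_version:
        cmd.append("--no-version")
    if output_type is not None:
        # -O accepts b (compressed BCF), u (uncompressed BCF),
        # z (compressed VCF) and v (uncompressed VCF); the letter is lowercase.
        if output_type not in ("b", "u", "z", "v"):
            raise ValueError(f"Unknown output type: {output_type!r}")
        cmd.append(f"-O{output_type}")
    if output_path is not None:
        cmd.extend(["-o", output_path])
    cmd.append(input_path)
    return cmd


# "subset.vcf.gz" is a hypothetical output file name.
cmd = build_view_command(
    "./input_data_temp/1kG_p3_chr1_first_200_samples_c1.bcf",
    output_path="subset.vcf.gz",
    output_type="z",
    no_version=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, or the printed string can be pasted into a `!` cell.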

## 3.2. Subset VCF on sample names
Subsetting a VCF based on sample name was already demonstrated in Notebook 1 when subsetting on the first 200 samples. But for quick reference, here is a command to subset on two samples named `HG00096` and `HG00097`. The `-s` flag takes a comma-separated list of sample names as input.

(To avoid printing a very large number of lines, we skip the header and print only the first ten variants. `2>/dev/null` is used to avoid a _broken pipe_ error in Jupyter.)

In [3]:
%%time
!bcftools view -s HG00096,HG00097 --no-header {downsampled_bcf} 2>/dev/null  | head -n 10

1	10177	rs367896724	A	AC	100	PASS	AC=2;AF=0.425319;AN=4;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE);VT=INDEL	GT	1|0	0|1
1	10352	rs555500075	T	TA	100	PASS	AC=2;AF=0.4375;AN=4;NS=2504;DP=88915;EAS_AF=0.4306;AMR_AF=0.4107;AFR_AF=0.4788;EUR_AF=0.4264;SAS_AF=0.4192;AA=|||unknown(NO_COVERAGE);VT=INDEL	GT	1|0	1|0
1	10616	rs376342519	CCGCCGTTGCAAAGGCGCGCCG	C	100	PASS	AC=4;AF=0.993011;AN=4;NS=2504;DP=2365;EAS_AF=0.9911;AMR_AF=0.9957;AFR_AF=0.9894;EUR_AF=0.994;SAS_AF=0.9969;VT=INDEL	GT	1|1	1|1
1	11008	rs575272151	C	G	100	PASS	AC=0;AF=0.0880591;AN=4;NS=2504;DP=2232;EAS_AF=0.0367;AMR_AF=0.0965;AFR_AF=0.1346;EUR_AF=0.0885;SAS_AF=0.0716;AA=.|||;VT=SNP	GT	0|0	0|0
1	11012	rs544419019	C	G	100	PASS	AC=0;AF=0.0880591;AN=4;NS=2504;DP=2090;EAS_AF=0.0367;AMR_AF=0.0965;AFR_AF=0.1346;EUR_AF=0.0885;SAS_AF=0.0716;AA=.|||;VT=SNP	GT	0|0	0|0
1	13110	rs540538026	G	A	100	PASS	AC=1;AF=0.0267572;AN=4;NS=2504;DP=23422;EAS_AF=0.002;AMR_AF=0.036;AFR_A
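Note that `AC` and `AN` in the output above (`AC=2;AN=4`) describe only the two selected samples, while `AF` keeps its original value. To make this behaviour concrete, here is a minimal pure-Python sketch of sample subsetting on a single record. It is not how `bcftools` is implemented; it assumes a biallelic site and a FORMAT column that contains only `GT`. The sample name `SAMPLE3` and the shortened INFO field are placeholders made up for the example.

```python
def subset_samples(header_line, record_line, keep):
    """Keep only the genotype columns of the samples in `keep`
    and recompute AC/AN from the remaining genotypes (biallelic, GT only)."""
    header = header_line.split("\t")
    fields = record_line.split("\t")

    # Sample columns start after the nine fixed VCF columns.
    column_of = {name: 9 + i for i, name in enumerate(header[9:])}
    genotypes = [fields[column_of[name]] for name in keep]

    # Count called alleles (AN) and ALT alleles (AC).
    alleles = [a for gt in genotypes
               for a in gt.replace("/", "|").split("|") if a != "."]
    an = len(alleles)
    ac = sum(a == "1" for a in alleles)

    info = []
    for entry in fields[7].split(";"):
        key = entry.split("=", 1)[0]
        if key == "AC":
            entry = f"AC={ac}"
        elif key == "AN":
            entry = f"AN={an}"
        info.append(entry)

    new_header = header[:9] + list(keep)
    new_fields = fields[:7] + [";".join(info), fields[8]] + genotypes
    return "\t".join(new_header), "\t".join(new_fields)


# Toy input: SAMPLE3 and the shortened INFO field are placeholders.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\tSAMPLE3"
record = "1\t10177\trs367896724\tA\tAC\t100\tPASS\tAC=3;AN=6;VT=INDEL\tGT\t1|0\t0|1\t0|1"

new_header, new_record = subset_samples(header, record, ["HG00096", "HG00097"])
print(new_header)
print(new_record)
```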

## 3.3. Extract specific columns from the VCF

Here: the CHROM, POS, and INFO columns, selected with the `-f` format string of `bcftools query`.

In [8]:
!bcftools query -f '%CHROM\t%POS\t%INFO\n' {downsampled_bcf} 2>/dev/null  | head -n 10

1	10177	AC=116;AF=0.425319;AN=400;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE);VT=INDEL
1	10352	AC=158;AF=0.4375;AN=400;NS=2504;DP=88915;EAS_AF=0.4306;AMR_AF=0.4107;AFR_AF=0.4788;EUR_AF=0.4264;SAS_AF=0.4192;AA=|||unknown(NO_COVERAGE);VT=INDEL
1	10616	AC=397;AF=0.993011;AN=400;NS=2504;DP=2365;EAS_AF=0.9911;AMR_AF=0.9957;AFR_AF=0.9894;EUR_AF=0.994;SAS_AF=0.9969;VT=INDEL
1	11008	AC=34;AF=0.0880591;AN=400;NS=2504;DP=2232;EAS_AF=0.0367;AMR_AF=0.0965;AFR_AF=0.1346;EUR_AF=0.0885;SAS_AF=0.0716;AA=.|||;VT=SNP
1	11012	AC=34;AF=0.0880591;AN=400;NS=2504;DP=2090;EAS_AF=0.0367;AMR_AF=0.0965;AFR_AF=0.1346;EUR_AF=0.0885;SAS_AF=0.0716;AA=.|||;VT=SNP
1	13110	AC=20;AF=0.0267572;AN=400;NS=2504;DP=23422;EAS_AF=0.002;AMR_AF=0.036;AFR_AF=0.0053;EUR_AF=0.0567;SAS_AF=0.044;AA=g|||;VT=SNP
1	13116	AC=70;AF=0.0970447;AN=400;NS=2504;DP=22340;EAS_AF=0.0248;AMR_AF=0.121;AFR_AF=0.0295;EUR_AF=0.1869;SAS_AF=0.1534;AA=t|||;VT=SNP
1	13118	AC=70;AF=0.09
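The tab-separated output of `bcftools query` is easy to post-process in Python. Below is a minimal sketch that splits one output line (copied from the output above) and turns the INFO string into a dictionary. The helper name `parse_info` is an assumption for this example, and all values are kept as strings.

```python
def parse_info(info):
    """Turn 'AC=116;AF=0.425319;VT=INDEL' into a dict.
    Flags without a value are mapped to True."""
    parsed = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        parsed[key] = value if sep else True
    return parsed


# First line of the query output shown above.
line = ("1\t10177\tAC=116;AF=0.425319;AN=400;NS=2504;DP=103152;"
        "EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;"
        "SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE);VT=INDEL")

chrom, pos, info = line.split("\t")
fields = parse_info(info)
print(chrom, int(pos), fields["AC"], fields["AF"], fields["VT"])
```

If only a single tag is needed, it is usually simpler to let `bcftools query` extract it directly, e.g. with `%INFO/AF` in the format string.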

## 3.4. Subset on a specific variant type
Here: INDELs. 

There are two ways: either based on the annotation in the INFO column:

In [None]:
!bcftools view --no-header -i 'INFO/VT="INDEL"' {downsampled_bcf} | wc -l

or by inferring the variant type from the values of the REF and ALT columns:

In [None]:
!bcftools view --no-header -v indels {downsampled_bcf} | wc -l

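Note that these two methods might give different results: the first relies on the `VT` annotation already present in the file, whereas the second lets `bcftools` infer the variant type from the alleles themselves.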
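To illustrate why an annotation-based and an allele-based selection can disagree, here is a simplified pure-Python sketch. It does not reproduce the actual `bcftools` logic: the annotation check is a plain string comparison on `VT`, and the allele check only compares the lengths of REF and ALT. The third record is hypothetical and was made up to show a site without a `VT` annotation.

```python
def is_indel_by_annotation(info):
    """True if the INFO field carries VT=INDEL (plain string comparison)."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "VT":
            return value == "INDEL"
    return False


def is_indel_by_alleles(ref, alt):
    """True if any ALT allele differs in length from REF.
    Missing ('.'), overlapping-deletion ('*') and symbolic ('<...>')
    alleles are ignored in this simplified check."""
    alt_alleles = [a for a in alt.split(",")
                   if a not in (".", "*") and not a.startswith("<")]
    return any(len(a) != len(ref) for a in alt_alleles)


records = [
    ("A", "AC", "AC=116;VT=INDEL"),  # from the example data
    ("C", "G", "AC=34;VT=SNP"),      # from the example data
    ("G", "GTT", "AC=1"),            # hypothetical: insertion without VT
]

by_annotation = sum(is_indel_by_annotation(info) for _, _, info in records)
by_alleles = sum(is_indel_by_alleles(ref, alt) for ref, alt, _ in records)
print(by_annotation, by_alleles)  # prints: 1 2
```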

## 3.5. Subset on a genomic range

This is done with the `-r` flag, which takes a region in the form `chromosome:start-end`. The chromosome (or scaffold) name must match the name used in the file; in the example data, chr1 is named `1`. To get the variants within the first 15000 bp of chr1, we can thus do:

In [15]:
!bcftools view --no-header -r 1:1-15000 {downsampled_bcf} 

1	10177	rs367896724	A	AC	100	PASS	AC=116;AF=0.425319;AN=400;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE);VT=INDEL	GT	1|0	0|1	0|1	1|0	0|0	1|0	1|0	1|0	1|0	0|0	0|0	0|0	0|0	0|0	0|0	0|0	0|1	1|0	0|0	0|0	1|0	0|0	0|0	0|0	0|1	1|0	0|1	0|1	0|1	0|1	1|0	0|0	1|0	1|0	0|0	0|1	0|0	0|0	1|0	0|1	1|0	0|0	1|0	1|0	0|0	1|0	0|1	0|1	0|0	0|0	1|0	1|0	0|0	0|0	0|1	0|0	0|0	1|0	1|1	1|0	0|1	0|0	0|0	1|1	0|1	0|0	0|1	0|1	0|0	1|0	1|0	1|0	0|1	0|0	1|0	1|0	1|0	0|0	1|0	0|0	0|1	0|1	1|0	0|1	1|1	0|0	0|1	0|0	1|0	0|0	0|0	1|0	0|0	0|0	0|0	1|0	1|0	0|0	0|1	0|0	1|0	0|0	1|0	0|1	1|0	0|1	0|1	0|1	1|0	1|0	0|0	0|0	0|0	0|0	0|0	1|0	0|1	0|0	0|0	0|0	0|1	1|0	1|0	1|0	1|0	1|0	0|0	0|1	0|1	0|0	0|0	0|0	0|0	1|0	0|1	0|0	0|0	0|0	0|1	0|1	1|0	0|0	0|0	0|0	1|0	0|0	1|0	0|0	0|1	0|1	0|0	0|0	0|1	0|0	1|0	0|0	0|1	1|0	0|1	0|0	1|0	1|0	0|0	0|1	1|1	0|0	1|1	0|1	0|0	1|0	1|0	0|1	0|0	0|1	0|0	0|0	0|0	0|0	0|0	0|0	0|1	0|0	0|0	0|0	0|1	0|1	1|0	0|1	0|0	0|0	0|1	1|0	0|0	0|0	1|0	0|0	1|0	0|0	0|0	0|0
=158;AF=0.437
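The region syntax can be sketched in a few lines of Python. The helpers `parse_region` and `in_region` below are assumptions made for this example; they treat the region as 1-based and inclusive and only test the POS of a record, whereas `bcftools` with `-r` also matches records that overlap the region. The positions are taken from the outputs shown earlier in this notebook.

```python
def parse_region(region):
    """Split '1:1-15000' into ('1', 1, 15000); coordinates are 1-based, inclusive."""
    chrom, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return chrom, int(start), int(end)


def in_region(chrom, pos, region):
    """True if a record at chrom:pos starts inside the region (POS check only)."""
    region_chrom, start, end = parse_region(region)
    return chrom == region_chrom and start <= pos <= end


# Variant positions on chromosome 1 seen in the outputs above.
positions = [10177, 10352, 10616, 11008, 11012, 13110, 13116, 13118]
kept = [pos for pos in positions if in_region("1", pos, "1:1-15000")]
print(len(kept), "of", len(positions), "positions fall within 1:1-15000")

# The chromosome name must match exactly: 'chr1' is not '1'.
print(in_region("chr1", 10177, "1:1-15000"))  # prints: False
```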

More examples to be added in future versions of this notebook.
