# BCFtools Training Tutorial
This repository contains a step-by-step guide to using **BCFtools** for manipulating and analyzing VCF (Variant Call Format) files. The tutorial covers reading VCF files, extracting chromosome names, renaming chromosomes, counting SNPs and indels, counting variants per chromosome, and splitting VCF files into SNPs and indels.
## Prerequisites
- **BCFtools**: Ensure BCFtools is installed (`sudo apt install bcftools` on Ubuntu/Debian or equivalent).
- **gzip**: For handling compressed VCF files.
- **wget**: For downloading files.
- **bash**: For scripting and command-line operations.
- A Unix-like environment (Linux, macOS, or WSL on Windows).
## Directory Setup
Create a working directory for the tutorial:
```bash
mkdir bcftoolstraining
cd bcftoolstrainingDownload a sample VCF file containing SNPs and indels:
wget https://github.com/vappiah/vcf-file-manipulation/raw/refs/heads/main/data/all.snps_indels.vcf.gzList the downloaded file:
ls
# Output: all.snps_indels.vcf.gzIndex the VCF file for faster querying using both .csi and .tbi formats:
bcftools index all.snps_indels.vcf.gz
bcftools index -t all.snps_indels.vcf.gzCheck the created index files:
ls
# Output: all.snps_indels.vcf.gz all.snps_indels.vcf.gz.csi all.snps_indels.vcf.gz.tbiList the sample names in the VCF file:
bcftools query -l all.snps_indels.vcf.gz
# Output: NA21112, HG00096, HG00101, ..., NA21127You can also query sample names from a remote VCF file (e.g., from the 1000 Genomes Project):
link=https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
bcftools query -l $link
# Output: HG00096, HG00097, HG00099, ..., NA21144Extract all chromosome names from the VCF file:
bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz > allchr.txtGet unique chromosome names:
bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz | uniq > chr.txtCount unique chromosomes:
cat chr.txt | wc -l
# Output: 24Count total chromosome entries (including duplicates):
cat allchr.txt | wc -l
# Output: 922022View the first and last few chromosomes:
head -n 10 chr.txt
# Output: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
tail -n 10 chr.txt
# Output: 2, 3, 4, 5, 6, 7, 8, 9, X, YCount variants for specific chromosomes (e.g., X and Y):
bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz | grep 'X' | wc -l
# Output: 17251
bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz | grep 'Y' | wc -l
# Output: 8Create a file mapping old chromosome names to new names:
bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz | uniq > chromosomes.txt
head -n 3 chromosomes.txt
# Output: 10, 11, 12Create a names.txt file with old and new chromosome names:
cat > names.txt << EOF
10 CHR10
11 CHR11
12 CHR12
EOFRename chromosomes in the VCF file:
bcftools annotate --rename-chrs names.txt all.snps_indels.vcf.gz -Oz -o renamed.vcf.gzVerify the renamed chromosomes:
bcftools query -f '%CHROM\n' renamed.vcf.gz | uniq > newnames.txt
head -n 3 newnames.txt
# Output: CHR10, CHR11, CHR12Count the number of SNPs:
bcftools view -v snps all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 884455Count the number of indels:
bcftools view -v indels all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 37417Count variants for specific chromosomes:
bcftools view -r 10 all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 23620
bcftools view -r 11 all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 17833
bcftools view -r 12 all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 16504Count variants for multiple chromosomes:
bcftools view -r 10,11,12 all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 57957Exclude specific chromosomes:
bcftools view -t ^10 all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 898402Create a script to count variants for all chromosomes:
touch variant_count.sh
chmod +x variant_count.shEdit variant_count.sh:
#!/bin/bash
chromlist=($(cat chromosomes.txt))
for chrom in ${chromlist[@]}
do
count=$(bcftools view -r $chrom all.snps_indels.vcf.gz | grep -v -c '^#')
echo "$chrom:$count"
doneRun the script:
./variant_count.sh
# Output:
# 10:23620
# 11:17833
# 12:16504
# ...
# X:17251
# Y:8Note: The original commands in the input had a typo (both outputs were named snps.vcf.gz and indels.vcf.gz for indels). Corrected commands are provided.
Split into SNPs:
bcftools view -v snps all.snps_indels.vcf.gz -Oz -o snps.vcf.gzSplit into indels:
bcftools view -v indels all.snps_indels.vcf.gz -Oz -o indels.vcf.gzVerify the output files:
ls
# Output: snps.vcf.gz, indels.vcf.gz, ...all.snps_indels.vcf.gz: Original VCF file.all.snps_indels.vcf.gz.csi,all.snps_indels.vcf.gz.tbi: Index files.allchr.txt: All chromosome entries.chr.txt: Unique chromosome names.chromosomes.txt: Unique chromosome names for scripting.names.txt: Mapping of old to new chromosome names.renamed.vcf.gz: VCF with renamed chromosomes.newnames.txt: Unique chromosome names from renamed VCF.snps.vcf.gz: VCF containing only SNPs.indels.vcf.gz: VCF containing only indels.variant_count.sh: Script to count variants per chromosome.
- Ensure the VCF file is indexed before running queries for efficiency.
- The
uniqcommand is used to remove duplicate chromosome names. - The
grep -v -c '^#'command counts non-header lines in VCF output, representing variants. - The remote VCF file from the 1000 Genomes Project can be used for larger-scale analysis but may require downloading or streaming.
- Correct file naming is critical when splitting VCF files to avoid overwriting.
- BCFtools Documentation
- 1000 Genomes Project
- Sample VCF file source: vappiah/vcf-file-manipulation