GitHub - arkhammknight/BCFTools

# BCFtools Training Tutorial

This repository contains a step-by-step guide to using **BCFtools** for manipulating and analyzing VCF (Variant Call Format) files. The tutorial covers reading VCF files, extracting chromosome names, renaming chromosomes, counting SNPs and indels, counting variants per chromosome, and splitting VCF files into SNPs and indels.

## Prerequisites

- **BCFtools**: Ensure BCFtools is installed (`sudo apt install bcftools` on Ubuntu/Debian or equivalent).
- **gzip**: For handling compressed VCF files.
- **wget**: For downloading files.
- **bash**: For scripting and command-line operations.
- A Unix-like environment (Linux, macOS, or WSL on Windows).
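
Before starting, it can help to confirm that the tools listed above are on the `PATH`. The following is a minimal sketch; `missing_tools` is an illustrative helper name chosen for this example, not part of BCFtools.

```bash
# Print the name of every listed command that cannot be found on PATH.
# missing_tools is a hypothetical helper name used only for this sketch.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing=$(missing_tools bcftools gzip wget bash)
if [ -z "$missing" ]; then
    echo "All prerequisites found."
else
    # Left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
fi
```

Any name that is reported needs to be installed before the steps below will run.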

## Directory Setup

Create a working directory for the tutorial:

```bash
mkdir bcftoolstraining
cd bcftoolstraining
```

## Step-by-Step Instructions

### 1. Downloading a Sample VCF File

Download a sample VCF file containing SNPs and indels:

```bash
wget https://github.com/vappiah/vcf-file-manipulation/raw/refs/heads/main/data/all.snps_indels.vcf.gz
```

List the downloaded file:

```bash
ls
# Output: all.snps_indels.vcf.gz
```

### 2. Indexing the VCF File

Index the VCF file for faster querying. The first command writes a `.csi` index (the default) and the second, with `-t`, writes a `.tbi` index:

```bash
bcftools index all.snps_indels.vcf.gz
bcftools index -t all.snps_indels.vcf.gz
```

Check the created index files:

```bash
ls
# Output: all.snps_indels.vcf.gz  all.snps_indels.vcf.gz.csi  all.snps_indels.vcf.gz.tbi
```
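
Because later region queries depend on the index, a script may want to check for it first. The sketch below only tests whether a `.csi` or `.tbi` file sits next to a VCF; `has_index` is a hypothetical helper name, and the demonstration uses empty placeholder files in a temporary directory rather than real data.

```bash
# Succeed if a .csi or .tbi index file exists next to the given VCF.
# has_index is a hypothetical helper name used only for this sketch.
has_index() {
    [ -e "$1.csi" ] || [ -e "$1.tbi" ]
}

# Demonstration with empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/indexed.vcf.gz" "$demo_dir/indexed.vcf.gz.csi"
touch "$demo_dir/plain.vcf.gz"

for vcf in "$demo_dir/indexed.vcf.gz" "$demo_dir/plain.vcf.gz"; do
    if has_index "$vcf"; then
        echo "$(basename "$vcf"): index found"
    else
        echo "$(basename "$vcf"): no index, run bcftools index first"
    fi
done

rm -rf "$demo_dir"
```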

### 3. Querying Sample Names

List the sample names in the VCF file:

```bash
bcftools query -l all.snps_indels.vcf.gz
# Output: NA21112, HG00096, HG00101, ..., NA21127
```

You can also query sample names from a remote VCF file (e.g., from the 1000 Genomes Project):

```bash
link=https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
bcftools query -l "$link"
# Output: HG00096, HG00097, HG00099, ..., NA21144
```

### 4. Extracting Chromosome Names

Extract the chromosome name of every record in the VCF file:

```bash
bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz > allchr.txt
```

Get the unique chromosome names. `uniq` merges adjacent duplicates only, which works here because the records of each chromosome are stored together:

```bash
bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz | uniq > chr.txt
```

Count unique chromosomes:

```bash
wc -l < chr.txt
# Output: 24
```

Count all chromosome entries, one per variant record:

```bash
wc -l < allchr.txt
# Output: 922022
```
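
The `uniq` step relies on duplicates being adjacent. The sketch below shows what happens when they are not, using a made-up, unsorted list of chromosome names (not taken from the tutorial file) and comparing `uniq` with `sort -u`.

```bash
# Hypothetical, unsorted chromosome column (not taken from the tutorial file).
chroms=$(printf '%s\n' 10 10 11 10 X X 11)

# uniq alone only merges neighbouring repeats, so 10 and 11 survive twice.
# tr strips the padding that some wc implementations add.
adjacent_only=$(printf '%s\n' "$chroms" | uniq | wc -l | tr -d ' ')

# Sorting first makes every repeat adjacent, giving the true distinct count.
distinct=$(printf '%s\n' "$chroms" | sort -u | wc -l | tr -d ' ')

echo "uniq only: $adjacent_only"
echo "sort -u:   $distinct"
```

On this input `uniq` leaves 5 lines while `sort -u` leaves 3, so `sort -u` is the safer choice whenever the input order is not guaranteed.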

View the first and last few chromosomes:

```bash
head -n 10 chr.txt
# Output: 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
tail -n 10 chr.txt
# Output: 2, 3, 4, 5, 6, 7, 8, 9, X, Y
```

Count variants for specific chromosomes (e.g., X and Y). The `-x` option makes `grep` match the whole line, so names that merely contain an X or a Y are not counted:

```bash
bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz | grep -c -x 'X'
# Output: 17251
bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz | grep -c -x 'Y'
# Output: 8
```
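
When counting one chromosome with `grep`, a plain pattern matches any name that contains it. The sketch below contrasts substring and whole-line matching on a made-up list; names such as `chrX_alt` and `HLA-X1` are hypothetical and do not come from the tutorial file.

```bash
# Hypothetical CHROM column that includes names containing the letter X.
chrom_column=$(printf '%s\n' X X chrX_alt Y X HLA-X1)

# Substring match: counts every line that contains an X anywhere.
substring_count=$(printf '%s\n' "$chrom_column" | grep -c 'X')

# Whole-line match: counts only lines that are exactly "X".
exact_count=$(printf '%s\n' "$chrom_column" | grep -c -x 'X')

echo "substring: $substring_count"
echo "exact:     $exact_count"
```

Here the substring match reports 5 and the whole-line match reports 3, which is the number of records actually on X.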

### 5. Renaming Chromosomes

List the unique chromosome names currently in the file:

```bash
bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz | uniq > chromosomes.txt
head -n 3 chromosomes.txt
# Output: 10, 11, 12
```

Create a `names.txt` file that maps each old chromosome name to its new name, one whitespace-separated pair per line. Chromosomes not listed keep their original names:

```bash
cat > names.txt << EOF
10 CHR10
11 CHR11
12 CHR12
EOF
```
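
The mapping above covers three chromosomes by hand. To cover every name, the pairs could be generated from the list of chromosome names instead. This is a sketch under assumptions: the `demo_` file names are hypothetical stand-ins, and the `CHR` prefix simply follows the example above.

```bash
# Hypothetical stand-in for chromosomes.txt, one name per line.
printf '%s\n' 10 11 12 X Y > demo_chromosomes.txt

# Emit "old new" pairs by prefixing each name with CHR.
awk '{ print $1, "CHR" $1 }' demo_chromosomes.txt > demo_names.txt

cat demo_names.txt
```

Pointing the same `awk` command at `chromosomes.txt` would produce a mapping for all chromosomes in the tutorial file.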

Rename chromosomes in the VCF file:

```bash
bcftools annotate --rename-chrs names.txt all.snps_indels.vcf.gz -Oz -o renamed.vcf.gz
```

Verify the renamed chromosomes:

```bash
bcftools query -f '%CHROM\n' renamed.vcf.gz | uniq > newnames.txt
head -n 3 newnames.txt
# Output: CHR10, CHR11, CHR12
```

### 6. Counting SNPs and Indels

Count the number of SNPs:

```bash
bcftools view -v snps all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 884455
```

Count the number of indels:

```bash
bcftools view -v indels all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 37417
```
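
To make the SNP/indel distinction concrete, the sketch below classifies records of a tiny, made-up VCF by comparing the lengths of the REF and ALT alleles. It is a simplified approximation for illustration only, not BCFtools' own logic: it ignores symbolic alleles and other special cases, and the helper names `make_demo_vcf` and `classify_variants` are hypothetical.

```bash
# Write a tiny, made-up VCF to stdout (tab-separated, eight columns).
make_demo_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '10\t100\t.\tA\tG\t.\t.\t.\n'
    printf '10\t200\t.\tAT\tA\t.\t.\t.\n'
    printf '11\t300\t.\tC\tCTT\t.\t.\t.\n'
    printf '11\t400\t.\tG\tA,T\t.\t.\t.\n'
    printf '12\t500\t.\tT\tC\t.\t.\t.\n'
}

# Classify each record: any ALT whose length differs from REF marks an indel;
# single-base REF with single-base ALTs is a SNP; everything else is "other".
classify_variants() {
    awk -F'\t' '
        /^#/ { next }
        {
            n = split($5, alts, ",")
            kind = "snp"
            for (i = 1; i <= n; i++)
                if (length(alts[i]) != length($4)) kind = "indel"
            if (kind == "snp" && length($4) != 1) kind = "other"
            count[kind]++
        }
        END {
            printf "snps:%d indels:%d other:%d\n", count["snp"], count["indel"], count["other"]
        }'
}

make_demo_vcf | classify_variants
```

For the five demo records this prints `snps:3 indels:2 other:0`. On real data, `bcftools view -v` remains the tool to rely on.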

### 7. Counting Variants per Chromosome

Count variants for specific chromosomes:

```bash
bcftools view -r 10 all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 23620
bcftools view -r 11 all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 17833
bcftools view -r 12 all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 16504
```

Count variants for multiple chromosomes:

```bash
bcftools view -r 10,11,12 all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 57957
```

Exclude specific chromosomes by prefixing the name with `^` (quoted so that the shell passes it through unchanged):

```bash
bcftools view -t '^10' all.snps_indels.vcf.gz | grep -v -c '^#'
# Output: 898402
```
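
The counts reported in this tutorial can be cross-checked with shell arithmetic. The numbers below are copied from the outputs shown in the tutorial; only the variable names are made up.

```bash
# Counts copied from the outputs reported in this tutorial.
total=922022
chr10=23620
chr11=17833
chr12=16504

# Chromosomes 10, 11 and 12 together should match the -r 10,11,12 result.
echo "10+11+12: $((chr10 + chr11 + chr12))"   # prints 10+11+12: 57957

# All records minus chromosome 10 should match the -t ^10 result.
echo "total-10: $((total - chr10))"           # prints total-10: 898402
```

Both results agree with the counts reported for the multi-chromosome query and the exclusion query.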

### 8. Automating Variant Counting per Chromosome

Create a script to count variants for all chromosomes:

```bash
touch variant_count.sh
chmod +x variant_count.sh
```

Add the following to `variant_count.sh`:

```bash
#!/bin/bash
# Print "chromosome:count" for every name listed in chromosomes.txt.
while read -r chrom
do
    # -r restricts the output to one chromosome and needs the index.
    count=$(bcftools view -r "$chrom" all.snps_indels.vcf.gz | grep -v -c '^#')
    echo "$chrom:$count"
done < chromosomes.txt
```

Run the script:

```bash
./variant_count.sh
# Output:
# 10:23620
# 11:17833
# 12:16504
# ...
# X:17251
# Y:8
```
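
The script calls `bcftools view` once per chromosome. As an alternative sketch, the per-chromosome counts can also be tallied in a single pass over a stream of chromosome names. `tally_chroms` is a hypothetical helper name, and the input here is a made-up list rather than real query output.

```bash
# Tally how often each chromosome name occurs in a stream of names.
# tally_chroms is a hypothetical helper name used only for this sketch.
tally_chroms() {
    awk '{ count[$1]++ } END { for (c in count) print c ":" count[c] }' | sort
}

# Hypothetical CHROM column standing in for real query output.
printf '%s\n' 10 10 10 11 11 X | tally_chroms
```

On the tutorial file, the same helper could be fed with `bcftools query -f '%CHROM\n' all.snps_indels.vcf.gz | tally_chroms`, which reads the file once instead of once per chromosome.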

### 9. Splitting VCF into SNPs and Indels

Note: Write the SNPs and the indels to separate output files (`snps.vcf.gz` and `indels.vcf.gz`). Using the same output name for both commands would overwrite the first result.

Split into SNPs:

```bash
bcftools view -v snps all.snps_indels.vcf.gz -Oz -o snps.vcf.gz
```

Split into indels:

```bash
bcftools view -v indels all.snps_indels.vcf.gz -Oz -o indels.vcf.gz
```

Verify the output files:

```bash
ls
# Output: snps.vcf.gz, indels.vcf.gz, ...
```

## Files Generated

- `all.snps_indels.vcf.gz`: Original VCF file.
- `all.snps_indels.vcf.gz.csi`, `all.snps_indels.vcf.gz.tbi`: Index files.
- `allchr.txt`: All chromosome entries.
- `chr.txt`: Unique chromosome names.
- `chromosomes.txt`: Unique chromosome names for scripting.
- `names.txt`: Mapping of old to new chromosome names.
- `renamed.vcf.gz`: VCF with renamed chromosomes.
- `newnames.txt`: Unique chromosome names from renamed VCF.
- `snps.vcf.gz`: VCF containing only SNPs.
- `indels.vcf.gz`: VCF containing only indels.
- `variant_count.sh`: Script to count variants per chromosome.

## Notes

- Region queries with `-r` need an index, so index the VCF file before running them.
- `uniq` removes only adjacent duplicates. That is enough here because an indexed VCF keeps the records of each chromosome together; for unsorted input, use `sort -u` instead.
- `grep -v -c '^#'` counts the non-header lines of the VCF output, which are the variant records.
- The remote VCF file from the 1000 Genomes Project can be used for larger-scale analysis, but it may need to be downloaded or streamed.
- Give the SNP and indel outputs different file names when splitting, so that one does not overwrite the other.

## Resources

- BCFtools Documentation
- 1000 Genomes Project
- Sample VCF file source: vappiah/vcf-file-manipulation
