# pVCF to PLINK 2.0

> This notebook shows how to interact with genomic data in bed/bim/fam format using PLINK 2.0. We will learn how to convert between PLINK 1.x and PLINK 2.x file formats, merge variants from different chromosomes into a single file, and filter them by variant completeness and minor allele frequency (MAF). Please note the extended runtime of this notebook, and that no subsequent analyses depend on its output files.

- runtime: 4hrs
- recommended instance: mem1_ssd1_v2_x16
- estimated cost: <£1.50

This notebook depends on:
* **PLINK install**


## List the exome sequences data directories in your project

Please note that, depending on your project's MTA, the list of files might differ.

In [1]:
ls /mnt/project/Bulk/'Exome sequences'/

'Exome OQFE CRAM files'
'Exome OQFE variant call files (VCFs)'
'Population level exome OQFE variants, BGEN format - final release'
'Population level exome OQFE variants, BGEN format - interim 450k release'
'Population level exome OQFE variants, PLINK format - final release'
'Population level exome OQFE variants, PLINK format - interim 450k release'
'Population level exome OQFE variants, pVCF format - final release'
'Population level exome OQFE variants, pVCF format - interim 450k release'


## List the population variant files in pVCF format

In [1]:
ls -lah /mnt/project/Bulk/'Exome sequences'/'Population level exome OQFE variants, pVCF format - final release'/*c1_b1_*gz

-r--r--r-- 1 root root 26G Oct 14 14:12 '/mnt/project/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - interim 450k release/ukb23148_c1_b1_v1.vcf.gz'


### Install and test the PLINK2 binary
#### We recommend installing PLINK using the links available here:
https://www.cog-genomics.org/plink/2.0/

#### Download the binary (for example the AVX2 Intel build, using `wget <URL>`), unzip it (`unzip <zip file>`), then make it executable (`chmod a+x <name>`).

#### If preferred, PLINK is also available in the following locations:
https://anaconda.org/bioconda/plink2; https://github.com/chrchang/plink-ng

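The download, unzip and `chmod` steps described above can be sketched as a single cell. This is only an illustration: `PLINK2_ZIP_URL` is a placeholder variable (not part of the original notebook) that you would set to a link copied from the PLINK 2.0 download page, and the sketch assumes the archive contains a binary named `plink2`, as used by the cells in this notebook.

```shell
# PLINK2_ZIP_URL is a placeholder: copy the link for your platform from
# https://www.cog-genomics.org/plink/2.0/ and export it before running.
PLINK2_ZIP_URL="${PLINK2_ZIP_URL:-}"

if [ -z "$PLINK2_ZIP_URL" ]; then
    install_status="skipped"
    echo "PLINK2_ZIP_URL is not set; nothing downloaded."
else
    wget -O plink2.zip "$PLINK2_ZIP_URL"   # download the zipped binary
    unzip -o plink2.zip                     # assumed to contain a binary named plink2
    chmod a+x plink2                        # make it executable
    ./plink2 --version                      # quick smoke test
    install_status="installed"
fi
```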
#### Once installed, continue with the below code chunks.


In [1]:
#test plink works
./plink2 --version

PLINK v2.00a6LM AVX2 Intel (3 Oct 2023)


### Next install and test BCFTOOLS
#### Following instructions here: http://samtools.github.io/bcftools/howtos/install.html, enter the following code (NB a large amount of text output will follow):

In [None]:
git clone --recurse-submodules https://github.com/samtools/htslib.git
git clone https://github.com/samtools/bcftools.git
cd bcftools
autoheader && autoconf && ./configure --disable-libgsl --enable-perl-filters
make
cd ..
export BCFTOOLS_PLUGINS=$(pwd)/bcftools/plugins
p=$(pwd)/bcftools
PATH=$PATH:$p #set the path to the utility
bcftools --version

## Get reference genome

In [None]:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai

In [16]:
#upload reference genome
dx mkdir ref_gen
dx upload GRCh38* --path ref_gen/

ID                    file-GKJPkb8J8Y94Q60z5zZ68FZX
Class                 file
Project               project-GJ4fY70J8Y9170Yj4GJP2778
Folder                /ref_gen
Name                  GRCh38_full_analysis_set_plus_decoy_hla.fa
State                 closing
Visibility            visible
Types                 -
Properties            -
Tags                  -
Outgoing links        -
Created               Thu Dec 15 10:54:13 2022
Created by            evoclive
 via the job          job-GKJJq4jJ8Y98y2Jx6B6JFgxY
Last modified         Thu Dec 15 10:54:23 2022
Media type            
archivalState         "live"
cloudAccount          "cloudaccount-dnanexus"
ID                    file-GKJPkj0J8Y9JVjF964FgFvpv
Class                 file
Project               project-GJ4fY70J8Y9170Yj4GJP2778
Folder                /ref_gen
Name                  GRCh38_full_analysis_set_plus_decoy_hla.fa.fai
State                 closing
Visibility            visible
Types                 -
Prop

In [2]:
REF=`ls *fa`
echo $REF

GRCh38_full_analysis_set_plus_decoy_hla.fa


## Find pVCF path(s)

In [13]:
dx find data --brief --name ukb23157_c1_b1_v1.vcf.gz | xargs dx download



In [6]:
VCF=`ls *vcf.gz`
echo $VCF

ukb23157_c1_b1_v1.vcf.gz


## Run bcftools normalization
This procedure left-aligns and normalizes indels, checks whether REF alleles match the reference, and splits multiallelic sites into multiple rows. More info here: https://samtools.github.io/bcftools/bcftools.html#norm

In [None]:
# -f: reference FASTA; -m -any: split multiallelic sites; -Oz: write bgzip-compressed VCF
time bcftools norm -f "$REF" -m -any -Oz -o "${VCF%.*.*}.norm.vcf.gz" "$VCF" #takes three hours

In [16]:
VCF=`ls *norm.vcf.gz`
echo $VCF

ukb23157_c1_b1_v1.norm.vcf.gz

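The output names used in this notebook are built with shell parameter expansion. The sketch below shows how `${VCF%.*.*}` yields the base name and how stripping the `.vcf.gz` suffix yields the prefix later passed to `--out`. The `EXAMPLE_*` variables are hypothetical names chosen so that running the cell does not overwrite `$VCF`. For these file names, the bash-specific form `${VCF/.vcf.gz/""}` used in the PLINK cells gives the same result as the POSIX form `${VCF%.vcf.gz}` shown here.

```shell
# Example input name, taken from the download cell above.
EXAMPLE_VCF="ukb23157_c1_b1_v1.vcf.gz"

# "%.*.*" removes the shortest suffix matching ".*.*", here ".vcf.gz".
EXAMPLE_BASE="${EXAMPLE_VCF%.*.*}"
EXAMPLE_NORM="${EXAMPLE_BASE}.norm.vcf.gz"

# Stripping the literal ".vcf.gz" suffix gives the PLINK output prefix.
EXAMPLE_PREFIX="${EXAMPLE_NORM%.vcf.gz}"

echo "$EXAMPLE_BASE"     # ukb23157_c1_b1_v1
echo "$EXAMPLE_NORM"     # ukb23157_c1_b1_v1.norm.vcf.gz
echo "$EXAMPLE_PREFIX"   # ukb23157_c1_b1_v1.norm
```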

## Make a PLINK bed file

In [17]:
./plink2 \
    --vcf $VCF \
    --vcf-idspace-to _ \
    --double-id \
    --allow-extra-chr 0 \
    --make-bed \
    --vcf-half-call m \
    --out "${VCF/.vcf.gz/""}"

PLINK v2.00a6LM AVX2 Intel (27 Sep 2023)       www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ukb23157_c1_b1_v1.norm.log.
Options in effect:
  --allow-extra-chr 0
  --double-id
  --make-bed
  --out ukb23157_c1_b1_v1.norm
  --vcf ukb23157_c1_b1_v1.norm.vcf.gz
  --vcf-half-call m
  --vcf-idspace-to _

Start time: Fri Sep 29 13:50:38 2023
140744 MiB RAM detected, ~133178 available; reserving 70372 MiB for main
workspace.
Using up to 72 threads (change this with --threads).
--vcf: 27598 variants scanned.
--vcf: ukb23157_c1_b1_v1.norm-temporary.pgen +
ukb23157_c1_b1_v1.norm-temporary.pvar.zst +
ukb23157_c1_b1_v1.norm-temporary.psam written.
469835 samples (0 females, 0 males, 469835 ambiguous; 469835 founders) loaded
from ukb23157_c1_b1_v1.norm-temporary.psam.
27598 variants loaded from ukb23157_c1_b1_v1.norm-temporary.pvar.zst.
Note: No phenotype data present.
Writing ukb23157_c1_b1_v1.norm.fam ... done.
Writing ukb

## Convert the pVCF to a PLINK 2.x formatted dataset (pgen/pvar/psam)
PLINK 2.x formatted files are faster to work with and significantly smaller than PLINK 1.x formatted files.
However, PLINK 1.x is the more popular format and has wider tool support.

In [18]:
time ./plink2 \
  --no-pheno \
  --vcf "$VCF" \
  --vcf-half-call 'haploid' \
  --make-pgen \
  --out "${VCF/.vcf.gz/""}"

PLINK v2.00a6LM AVX2 Intel (27 Sep 2023)       www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ukb23157_c1_b1_v1.norm.log.
Options in effect:
  --make-pgen
  --no-psam-pheno
  --out ukb23157_c1_b1_v1.norm
  --vcf ukb23157_c1_b1_v1.norm.vcf.gz
  --vcf-half-call haploid

Start time: Fri Sep 29 13:59:12 2023
140744 MiB RAM detected, ~133202 available; reserving 70372 MiB for main
workspace.
Using up to 72 threads (change this with --threads).
--vcf: 27598 variants scanned.
--vcf: ukb23157_c1_b1_v1.norm-temporary.pgen +
ukb23157_c1_b1_v1.norm-temporary.pvar.zst +
ukb23157_c1_b1_v1.norm-temporary.psam written.
469835 samples (0 females, 0 males, 469835 ambiguous; 469835 founders) loaded
from ukb23157_c1_b1_v1.norm-temporary.psam.
27598 variants loaded from ukb23157_c1_b1_v1.norm-temporary.pvar.zst.
Note: No phenotype data present.
Writing ukb23157_c1_b1_v1.norm.psam ... done.
Writing ukb23157_c1_b1_v1.norm.pvar ... 10

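Once the `--make-bed` and `--make-pgen` cells above have both finished, the size difference between the two formats can be checked directly. The following is a minimal sketch: `fileset_bytes` is a hypothetical helper (not a PLINK command) that sums the sizes of the files sharing one prefix, and it assumes both filesets were written with the prefix `ukb23157_c1_b1_v1.norm`, as in the logs above. Missing files are counted as 0 bytes.

```shell
# Sum the sizes (in bytes) of the files that make up one fileset.
# Usage: fileset_bytes PREFIX EXT [EXT...]
fileset_bytes() {
    prefix=$1
    shift
    total=0
    for ext in "$@"; do
        # Files that do not exist are skipped and contribute 0 bytes.
        if [ -f "${prefix}.${ext}" ]; then
            size=$(wc -c < "${prefix}.${ext}")
            total=$((total + size))
        fi
    done
    echo "$total"
}

SIZE_PREFIX="ukb23157_c1_b1_v1.norm"   # prefix assumed from the logs above
echo "PLINK 1.x (bed/bim/fam):    $(fileset_bytes "$SIZE_PREFIX" bed bim fam) bytes"
echo "PLINK 2.x (pgen/pvar/psam): $(fileset_bytes "$SIZE_PREFIX" pgen pvar psam) bytes"
```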
## Convert to BED/BIM/FAM (PLINK 1.x format)

`--max-alleles` excludes variants with more alleles than the indicated value. When a variant has exactly one ALT allele and that allele is a missing code, the filter treats the variant as having only one allele.
> see here: https://groups.google.com/g/plink2-users/c/rxMlVLIX-JA?pli=1 and https://github.com/meyer-lab-cshl/plinkQC/issues/10

In [19]:
./plink2 \
  --no-pheno \
  --vcf "$VCF" \
  --vcf-half-call 'haploid' \
  --max-alleles 2 \
  --make-bed \
  --out test_vcf_bed

PLINK v2.00a6LM AVX2 Intel (27 Sep 2023)       www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test_vcf_bed.log.
Options in effect:
  --make-bed
  --max-alleles 2
  --no-psam-pheno
  --out test_vcf_bed
  --vcf ukb23157_c1_b1_v1.norm.vcf.gz
  --vcf-half-call haploid

Start time: Fri Sep 29 14:10:03 2023
140744 MiB RAM detected, ~133216 available; reserving 70372 MiB for main
workspace.
Using up to 72 threads (change this with --threads).
--vcf: 27598 variants scanned.
--vcf: test_vcf_bed-temporary.pgen + test_vcf_bed-temporary.pvar.zst +
test_vcf_bed-temporary.psam written.
469835 samples (0 females, 0 males, 469835 ambiguous; 469835 founders) loaded
from test_vcf_bed-temporary.psam.
27598 variants loaded from test_vcf_bed-temporary.pvar.zst.
Note: No phenotype data present.
27598 variants remaining after main filters.
Writing test_vcf_bed.fam ... done.
Writing test_vcf_bed.bim ... done.
Writing test_vcf_bed.bed .

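In the log above, 27598 variants are loaded and 27598 remain after `--max-alleles 2`, which is consistent with `bcftools norm -m -any` having already split multiallelic sites. A quick way to check this yourself is to count rows whose ALT column still lists several alleles. The sketch below assumes a `.pvar` file with the usual `#CHROM POS ID REF ALT` column order and header lines starting with `#`. `count_multiallelic` is a hypothetical helper name, not a PLINK command.

```shell
# Count variants whose ALT column (column 5) lists more than one allele.
# Header lines start with "#" and are skipped.
count_multiallelic() {
    awk '!/^#/ && $5 ~ /,/ { n++ } END { print n + 0 }' "$1"
}

MA_PVAR="ukb23157_c1_b1_v1.norm.pvar"   # file name assumed from the logs above
if [ -f "$MA_PVAR" ]; then
    echo "Multiallelic rows in $MA_PVAR: $(count_multiallelic "$MA_PVAR")"
fi
```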
## Validate the output files

In [20]:
./plink2 \
  --pfile "${VCF/.vcf.gz/""}" \
  --validate

PLINK v2.00a6LM AVX2 Intel (27 Sep 2023)       www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --pfile ukb23157_c1_b1_v1.norm
  --validate

Start time: Fri Sep 29 14:17:18 2023
140744 MiB RAM detected, ~133169 available; reserving 70372 MiB for main
workspace.
Using up to 72 threads (change this with --threads).
469835 samples (0 females, 0 males, 469835 ambiguous; 469835 founders) loaded
from ukb23157_c1_b1_v1.norm.psam.
27598 variants loaded from ukb23157_c1_b1_v1.norm.pvar.
Validating ukb23157_c1_b1_v1.norm.pgen... done.
End time: Fri Sep 29 14:17:18 2023


In [21]:
./plink2 \
  --bfile test_vcf_bed \
  --validate

PLINK v2.00a6LM AVX2 Intel (27 Sep 2023)       www.cog-genomics.org/plink/2.0/
(C) 2005-2023 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to plink2.log.
Options in effect:
  --bfile test_vcf_bed
  --validate

Start time: Fri Sep 29 14:17:23 2023
140744 MiB RAM detected, ~133172 available; reserving 70372 MiB for main
workspace.
Using up to 72 threads (change this with --threads).
469835 samples (0 females, 0 males, 469835 ambiguous; 469835 founders) loaded
from test_vcf_bed.fam.
27598 variants loaded from test_vcf_bed.bim.
Validating test_vcf_bed.bed... done.
End time: Fri Sep 29 14:17:23 2023

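Both validation logs above report 27598 variants. As an additional sanity check, the variant counts of the two filesets can be compared from the text files alone. This is a minimal sketch: `count_variants` is a hypothetical helper that counts non-empty, non-header lines, and the file names are taken from the logs above.

```shell
# Count variant records: non-empty lines that do not start with "#".
count_variants() {
    awk 'NF && !/^#/ { n++ } END { print n + 0 }' "$1"
}

CHECK_BIM="test_vcf_bed.bim"
CHECK_PVAR="ukb23157_c1_b1_v1.norm.pvar"
if [ -f "$CHECK_BIM" ] && [ -f "$CHECK_PVAR" ]; then
    n_bim=$(count_variants "$CHECK_BIM")
    n_pvar=$(count_variants "$CHECK_PVAR")
    if [ "$n_bim" -eq "$n_pvar" ]; then
        echo "OK: both filesets hold $n_bim variants"
    else
        echo "Mismatch: $n_bim in .bim vs $n_pvar in .pvar"
    fi
fi
```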

In [None]:
# upload the normalized VCF to the project if required
dx upload ukb23157_c1_b1_v1.norm.vcf.gz --path bed_maf/