The human reference genome is intended to be a linear consensus sequence built from many individuals. This can hurt mapping accuracy: natural variation in an individual (due to ethnicity, etc.) lowers alignment scores and can suppress reads in certain regions. It also affects variant calling, because reads containing both known and novel variation can score low enough to be dropped by the mapping software.

To address this, we encode known variant sites into the reference genome as a graph and map reads to it with a graph aligner (vg). We use the Genome in a Bottle (GIAB) dataset for the analysis.

This was run on a machine with 128 GB of memory, 32 cores (m5.8xlarge), and a 2 TB disk.

In [49]:
# For analysis we use Genome in a Bottle (GIAB), so we take their known variants. This could just as easily be dbSNP, etc.
!wget https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/NISTv4.2.1/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
!wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

--2024-10-03 11:32:04--  https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/NISTv4.2.1/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz
Resolving ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.7, 2607:f220:41e:250::7, ...
Connecting to ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 125932193 (120M) [application/x-gzip]
Saving to: ‘HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz’


2024-10-03 11:32:20 (7.46 MB/s) - ‘HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz’ saved [125932193/125932193]

--2024-10-03 11:32:20--  https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 983659424 (938M) [applicati

In [41]:
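# We only use chromosome 1 to make things a bit faster for this example.
# Note: samtools faidx needs an uncompressed or bgzip-compressed FASTA; if hg38.fa.gz is plain gzip, recompress it with bgzip first.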
!samtools faidx hg38.fa.gz chr1 | bgzip -c - > chr1.fa.gz
!samtools faidx chr1.fa.gz

In [50]:
!gunzip -c HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | awk '/^#/ || /^chr1\t/' | bgzip -c > chr1.vcf.bgz
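
The awk filter above keeps header lines plus records on chr1. As a rough Python equivalent, the sketch below applies the same rule to a tiny in-memory VCF. The function name `filter_vcf_lines` and the demo records are illustrative assumptions, not part of the original workflow. Matching on `chr1` followed by a tab is what stops `chr10` and similar contigs from slipping through.

```python
import io


def filter_vcf_lines(lines, contig="chr1"):
    """Yield header lines and records whose CHROM column equals `contig`."""
    prefix = contig + "\t"  # the trailing tab stops chr1 from matching chr10, chr11, ...
    for line in lines:
        if line.startswith("#") or line.startswith(prefix):
            yield line


# Tiny made-up VCF standing in for the GIAB benchmark file.
demo = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t783006\t.\tA\tG\t50\tPASS\t.\n"
    "chr10\t12345\t.\tC\tT\t50\tPASS\t.\n"
    "chr2\t500\t.\tG\tA\t50\tPASS\t.\n"
)

kept = list(filter_vcf_lines(io.StringIO(demo)))
print("".join(kept), end="")
```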

In [85]:
# clean up any state left over from a previous run
!rm -rf .gen
!rm -f hg38.db

In [86]:
!gen init
!gen --db hg38.db import --name hg38 --fasta ./chr1.fa.gz --shallow

Gen repository initialized.
Created it


In [87]:
!gen --db hg38.db update --name hg38 --vcf chr1.vcf.bgz



In [5]:
# now, we want to align using vg
!wget https://github.com/vgteam/vg/releases/download/v1.60.0/vg
!chmod +x ./vg

--2024-10-02 10:47:00--  https://github.com/vgteam/vg/releases/download/v1.60.0/vg
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/24727800/76d354fd-0852-4676-b3af-82df87da06a0?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20241002%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20241002T144701Z&X-Amz-Expires=300&X-Amz-Signature=e89751dbe24c5a1e1c176bd81248d12522051a3498f0100e80fd5b72d6b87554&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dvg&response-content-type=application%2Foctet-stream [following]
--2024-10-02 10:47:00--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/24727800/76d354fd-0852-4676-b3af-82df87da06a0?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F202

In [8]:
!gen --db hg38.db export --name hg38 --gfa hg38.gfa

In [20]:
%%bash
# For alignment, we use one Genome in a Bottle sequencing run (NA12878, sample U5b, lanes L001 and L002).
# wget -O with several URLs concatenates all downloads into the one output file.
base=https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140313_D00360_0014_AH8GGVADXX/Project_RM8398/Sample_U5b
for read in R1 R2; do
  urls=()
  for lane in L001 L002; do
    for chunk in 001 002 003 004 005 006 007 008; do
      urls+=("$base/U5b_AGTTCC_${lane}_${read}_${chunk}.fastq.gz")
    done
  done
  wget -O "${read}.fq.gz" "${urls[@]}"
done

--2024-10-02 16:35:58--  https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140313_D00360_0014_AH8GGVADXX/Project_RM8398/Sample_U5b/U5b_AGTTCC_L001_R1_001.fastq.gz
Resolving ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.7, 2607:f220:41e:250::13, ...
Connecting to ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)|130.14.250.13|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 432595315 (413M) [application/x-gzip]
Saving to: ‘U5b_AGTTCC_L001_R1_001.fastq.gz’


2024-10-02 16:36:34 (11.7 MB/s) - ‘U5b_AGTTCC_L001_R1_001.fastq.gz’ saved [432595315/432595315]

--2024-10-02 16:36:34--  https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/140313_D00360_0014_AH8GGVADXX/Project_RM8398/Sample_U5b/U5b_AGTTCC_L001_R2_001.fastq.gz
Resolving ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)... 130.14.250.13, 130.14.250.7, 2607:f220:4
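
Passing several URLs to a single `wget -O` writes the downloads back to back into one file, so `R1.fq.gz` and `R2.fq.gz` are each a concatenation of gzip files. That is still a valid gzip stream. The sketch below illustrates this with two made-up FASTQ records; the read names and sequences are placeholders, not GIAB data.

```python
import gzip

# Two separately compressed FASTQ chunks (made-up records).
part1 = gzip.compress(b"@read1\nACGT\n+\nIIII\n")
part2 = gzip.compress(b"@read2\nTTGA\n+\nIIII\n")

# Concatenating the compressed bytes mirrors what wget -O does with several URLs.
combined = part1 + part2

# gzip.decompress handles multi-member gzip data, so both chunks come back in order.
lines = gzip.decompress(combined).decode().splitlines()
n_reads = len(lines) // 4  # a FASTQ record spans four lines
print(n_reads, lines[0], lines[4])
```

One consequence is that the R1 and R2 files must be concatenated in the same chunk order, otherwise the mates no longer line up.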

In [None]:
# Chop nodes to at most 32 bp (vg mod -X 32), then build the indexes needed by vg map
!./vg mod -X 32 hg38.gfa > hg38.mod.gfa
!./vg autoindex -p index -w map -g hg38.mod.gfa
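
`vg mod -X 32` chops nodes so that none is longer than 32 bp. As a rough illustration of what chopping means for a single node's sequence, the sketch below splits a string into consecutive pieces of bounded length. The helper `chop` is hypothetical and ignores everything vg does to rewire edges and paths between the new nodes.

```python
def chop(sequence, max_len=32):
    """Split one node's sequence into consecutive pieces of at most `max_len` bp."""
    return [sequence[i:i + max_len] for i in range(0, len(sequence), max_len)]


node_sequence = "ACGT" * 20  # an 80 bp toy node
pieces = chop(node_sequence)
print([len(p) for p in pieces])
```

Joining the pieces in order gives back the original sequence, which is why chopping changes node IDs but not the sequence the graph spells out.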

In [None]:
# Map the paired reads to the graph, then sort and index the alignments
!./vg map -x index.xg -g index.gcsa -f R1.fq.gz -f R2.fq.gz > align.gam
!./vg gamsort -p align.gam > align.sorted.gam
!./vg index -l align.sorted.gam

In [None]:
# this summarizes our reads per position and then calls variants
!./vg pack -e -x index.xg -g align.sorted.gam -o aln.pack
!./vg call index.xg -k aln.pack > align.vcf

In [2]:
# look at some variants from our data vs. input
!gunzip -c chr1.vcf.bgz | grep -P "chr1\t" | head -n 25

chr1	783006	.	A	G	50	PASS	platforms=4;platformnames=PacBio,Illumina,10X,CG;datasets=4;datasetnames=CCS15kb_20kb,HiSeqPE300x,10XChromiumLR,CGnormal;callsets=6;callsetnames=CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xGATK,10XLRGATK,CGnormal,HiSeqPE300xfreebayes;datasetsmissingcall=IonExome,SolidSE75bp;callable=CS_CCS15kb_20kbDV_callable,CS_CCS15kb_20kbGATK4_callable;filt=CS_CGnormal_filt;difficultregion=HG001.hg38.300x.bam.bilkentuniv.010920.dups,hg38.segdups_sorted_merged,lowmappabilityall	GT:PS:DP:ADALL:AD:GQ	1/1:.:652:16,234:0,82:312
chr1	783175	.	T	C	50	PASS	platforms=4;platformnames=PacBio,Illumina,10X,Solid;datasets=4;datasetnames=CCS15kb_20kb,HiSeqPE300x,10XChromiumLR,SolidSE75bp;callsets=6;callsetnames=CCS15kb_20kbDV,CCS15kb_20kbGATK4,HiSeqPE300xGATK,10XLRGATK,HiSeqPE300xfreebayes,SolidSE75GATKHC;datasetsmissingcall=CGnormal,IonExome;callable=CS_CCS15kb_20kbDV_callable,CS_CCS15kb_20kbGATK4_callable;difficultregion=HG001.hg38.300x.bam.bilkentuniv.010920.dups,hg38.segdups_sorted_me

In [6]:
# we can see support for variants based on reads aligned from this sequencing run/sample
!grep -P "chr1\t" align.vcf | head -n 25

1-chr1	783006	>126167>126169	A	G	220.146	PASS	AT=>126167>126168>126169,>126167>8632817>126169;DP=38	GT:DP:AD:GL:GQ:GP:XD:MAD	0/1:38:24,14:-26.597657,-5.060118,-49.472423:215:-1.098612:21.203342:14
1-chr1	783175	>126174>126176	T	C	24.9355	PASS	AT=>126174>126175>126176,>126174>8532755>126176;DP=27	GT:DP:AD:GL:GQ:GP:XD:MAD	0/1:27:23,4:-8.152426,-6.140191,-51.312071:20:-1.108288:17.000000:4
1-chr1	784860	>126228>126230	T	C	137.863	PASS	AT=>126228>126229>126230,>126228>8500678>126230;DP=6	GT:DP:AD:GL:GQ:GP:XD:MAD	1/1:6:0,6:-14.601243,-2.614278,-1.313245:13:-1.147402:5.849057:6
1-chr1	785417	>126247>126249	G	A	183.11	PASS	AT=>126247>126248>126249,>126247>74686>126249;DP=8	GT:DP:AD:GL:GQ:GP:XD:MAD	1/1:8:0,8:-19.265059,-3.282439,-1.437336:18:-1.112797:8.246754:8
1-chr1	797392	>126623>126625	G	A	16.7555	PASS	AT=>126623>126624>126625,>126623>55966>126625;DP=4	GT:DP:AD:GL:GQ:GP:XD:MAD	0/1:4:3,1:-3.256084,-2.086064,-7.427831:11:-1.164035:9.014085:1
1-chr1	798618	>126663>126665	C	T	251.478	PASS	AT=

Now, these are variants we encoded into our graph. They could be related to an individual's ethnicity, disease state, etc.
Following this, we usually want to call variants not included in our reference graph -- novel variants.
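
One simple way to separate known from novel calls is to compare sites between the input VCF and the called VCF. The sketch below keys each record on position, REF, and ALT, and ignores the contig name, since this example only covers chr1 and vg call reports the contig as `1-chr1`. The first two records in each set are taken from the outputs above, trimmed to five columns; the third called record is invented purely for illustration, and the helper `parse_sites` is hypothetical.

```python
def parse_sites(vcf_text):
    """Return the set of (pos, ref, alt) keys found in VCF-formatted text."""
    sites = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        _chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):  # treat each ALT allele separately
            sites.add((int(pos), ref, alt))
    return sites


known_vcf = (
    "chr1\t783006\t.\tA\tG\n"
    "chr1\t783175\t.\tT\tC\n"
)
called_vcf = (
    "1-chr1\t783006\t>126167>126169\tA\tG\n"
    "1-chr1\t783175\t>126174>126176\tT\tC\n"
    "1-chr1\t999999\t.\tC\tT\n"  # invented record standing in for a novel call
)

known = parse_sites(known_vcf)
called = parse_sites(called_vcf)
supported_known = called & known  # called sites that were already in the graph
novel = called - known            # called sites absent from the input VCF
print(sorted(supported_known), sorted(novel))
```

This exact-match comparison does not normalize indels or multi-base representations, so for real benchmarking a dedicated comparison tool is the safer choice.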

Variant calling in graphs works by identifying bubbles in the graph and establishing read support for each bubble. To
call novel variants, we use alignments with discrepancies relative to our reference graph to add new bubbles to the graph.
Then we pack and call on the augmented graph, just as in the workflow above.
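
To make the idea of bubbles and read support concrete, here is a toy sketch. It finds the simplest kind of bubble (a node whose branches all rejoin at one shared node) and counts how many read paths traverse each branch. The graph, the read paths, and both function names are made up for illustration; vg's actual site detection and genotyping are considerably more involved.

```python
# Toy graph as an adjacency list: node 1 branches to 2 (reference allele)
# and 3 (alternate allele), and both rejoin at node 4.
edges = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}


def simple_bubbles(edges):
    """Find (source, branches, sink) where every branch leads straight to one sink."""
    bubbles = []
    for source, branches in edges.items():
        if len(branches) < 2:
            continue
        sinks = {tuple(edges[b]) for b in branches}
        if len(sinks) == 1 and len(next(iter(sinks))) == 1:
            bubbles.append((source, sorted(branches), next(iter(sinks))[0]))
    return bubbles


def branch_support(bubble, read_paths):
    """Count reads whose path goes source -> branch -> sink, per branch."""
    source, branches, sink = bubble
    counts = {b: 0 for b in branches}
    for path in read_paths:
        for a, b, c in zip(path, path[1:], path[2:]):
            if a == source and c == sink and b in counts:
                counts[b] += 1
    return counts


# Each read is the list of node IDs its alignment walks through.
reads = [[1, 2, 4], [1, 2, 4, 5], [1, 3, 4], [2, 4, 5], [1, 3, 4, 5]]

bubbles = simple_bubbles(edges)
support = branch_support(bubbles[0], reads)
print(bubbles, support)
```

The read `[2, 4, 5]` does not span the whole bubble, so it is not counted for either branch. Augmenting the graph amounts to adding new branches like node 3 wherever reads disagree with every existing path.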

In [None]:
# augment needs a graph in .vg/.pg format, which autoindex does not currently produce: none of the
# autoindex outputs are accepted by augment. We also cannot reuse our input GFA, because autoindex alters
# the node IDs, so the node IDs in the alignment would not match it. Instead, convert the xg index to .pg.
!./vg convert index.xg > index.pg
!./vg augment index.pg align.sorted.gam -A aug.gam > augment.vg
!./vg index augment.vg -x augment.xg
!./vg pack -x augment.xg -g aug.gam -o aln.aug.pack
# TODO: check whether the script passes -A to vg call here
!./vg call augment.xg -k aln.aug.pack > align.aug.vcf

In [9]:
# this file will contain both our known variants (above) as well as novel variants.
!grep -P "chr1\t" align.aug.vcf | head -n 25

1-chr1	10008	>20252164>9511549	AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCT	CACCCTCCCATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC	30308.1	PASS	AT=>20252164>20252165>20252166>20252167>20252168>20252169>20252170>20252171>20252172>20252173>20252174>20252175>20252176>20252177>20252178>20252179>20252180>20252181>20252182>20252183>20252184>20252185>20252186>20252187>20252188>20252189>102012>28253009>28253010>28253011>28253012>28253013>28253014>28253015>28253016>28253017>28253018>28253019>28253020>28253021>28253022>28253023>28253024>28253025>28253026>28253027>28253028>28253029>28253030>28253031>28253032>28253033>28253034>28253035>28253036>28253037>28253038>28253039>102013>19174260>19174261>19174262>19174263>19174264>19174265>19174266>19174267>19174268>19174269>19174270>19174271>19174272>19174273>19174274>19174