<a href="https://colab.research.google.com/github/gilsonauerswald/Bioinformatic_Projects/blob/main/R_05_Advanced_analysis_of_VCF_files_Quality_control%2C_Filter_and_visualize_the_genomic_variants.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Basic analysis of VCF files in R: Clinical annotation of genomic variants**

**Example VCF file**

In [None]:
## fileformat=VCFv4.2
## FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
## FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
## FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled Genotype Likelihoods">
# CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    SAMP001    SAMP002
#20    1291018 rs11449 G    A    .    PASS    .    GT    0/0    0/1
#20    2300608 rs84825 C    T    .    PASS    .    GT:GP    0/1:.    0/1:0.03,0.97,0
#20    2301308 rs84823 T    G    .    PASS    .    GT:PL    ./.:.    1/1:10,5,0
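
To make the FORMAT column concrete, the sketch below splits one sample entry from the example into named subfields. The helper name `parse_sample_field` is hypothetical and not part of any package; it only illustrates how the colon-separated keys and values line up.

```r
# Hypothetical helper: pair the keys in FORMAT with the values of one sample column
parse_sample_field <- function(format, sample) {
  keys   <- strsplit(format, ":", fixed = TRUE)[[1]]
  values <- strsplit(sample, ":", fixed = TRUE)[[1]]
  # Trailing subfields may be omitted, so pad with the VCF missing value "."
  length(values) <- length(keys)
  values[is.na(values)] <- "."
  setNames(values, keys)
}

# SAMP002 at position 2300608 in the example above
fields <- parse_sample_field("GT:GP", "0/1:0.03,0.97,0")
fields[["GT"]]

# Genotype probabilities as numbers (order 0/0, 0/1, 1/1 for a biallelic diploid site)
gp <- as.numeric(strsplit(fields[["GP"]], ",", fixed = TRUE)[[1]])
gp
```

Here `fields[["GT"]]` is the genotype string and `gp` holds the three genotype probabilities for that sample.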

**Marker information**

In [None]:
CHROM  the chromosome.
POS    the genome coordinate of the first base in the variant.
          Within a chromosome, VCF records are sorted in order of increasing position.
ID     a semicolon-separated list of marker identifiers.
REF    the reference allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC").
ALT    the alternate allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC").
          If there is more than one alternate allele, the field should be a
          comma-separated list of alternate alleles.
QUAL   probability that the ALT allele is incorrectly specified, expressed on the phred scale
          (-10log10(probability)).
FILTER either "PASS" or a semicolon-separated list of failed quality control filters.
INFO   additional information as semicolon-separated key=value entries (no white space, tabs, or semicolons permitted within an entry).
FORMAT colon-separated list of data subfields reported for each sample.
          In the example, GT is the genotype, GP the genotype probabilities, and
          PL the phred-scaled genotype likelihoods.
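
The phred scale used by QUAL can be checked with a few lines of R. This is a minimal sketch; the function names `prob_to_phred` and `phred_to_prob` are made up for illustration.

```r
# Hypothetical helpers: convert between an error probability and its phred-scaled quality
prob_to_phred <- function(p) -10 * log10(p)
phred_to_prob <- function(q) 10^(-q / 10)

prob_to_phred(0.001)   # an error probability of 1 in 1000 corresponds to QUAL 30
phred_to_prob(20)      # QUAL 20 corresponds to an error probability of 0.01
```

A higher QUAL therefore means a lower probability that the ALT call is wrong; each increase of 10 divides the error probability by 10.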

**Processing VCF files in R**

In [None]:
install.packages("ggplot2")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
library(ggplot2)

In [None]:
install.packages("tidyverse")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
# Load the libraries
library('tidyverse')

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ lubridate 1.9.4     ✔ tibble    3.3.0
✔ purrr     1.2.0     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


**Load the Sample VCF file as a Dataframe**

In [None]:
# Load sample vcf into a dataframe object
sample_vcf_tp53 <- read.table('/content/cohort_high_quality.vcf', header = FALSE, comment.char = "#", sep = "\t")

“number of items read is not a multiple of the number of columns”


In [None]:
# Load sample vcf into a dataframe object
sample_vcf_tp53 <- read.table('https://raw.githubusercontent.com/pine-bio-support/Merge-VCF-files/main/aneuploid_samples_freebayes_tp53.vcf', header = FALSE, comment.char = "#", sep = "\t")

In [None]:
# Display first few lines of the vcf dataframe
head(sample_vcf_tp53)

,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15
,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,chr1,7651698,.,G,A,55.1479,.,AB=0;ABP=0;AC=2;AF=1;AN=2;AO=2;CIGAR=1X;DP=2;DPB=2;DPRA=0;EPP=3.0103;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=44;MQMR=0;NS=1;NUMALT=1;ODDS=7.37776;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=72;QR=0;RO=0;RPL=0;RPP=7.35324;RPPR=0;RPR=2;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=0;SRP=0;SRR=0;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"1/1:2:0,2:0:0:2:72:-6.71863,-0.60206,0",.,.,.,.,.
2,chr1,7652017,.,A,C,0.033793,.,AB=0.6;ABP=3.44459;AC=2;AF=0.333333;AN=6;AO=3;CIGAR=1X;DP=9;DPB=9;DPRA=0.625;EPP=3.73412;EPPR=4.45795;GTI=1;LEN=1;MEANALT=1;MQM=40.3333;MQMR=44;NS=3;NUMALT=1;ODDS=4.94386;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=55;QR=172;RO=6;RPL=1;RPP=3.73412;RPPR=8.80089;RPR=2;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=3;SRP=3.0103;SRR=3;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"0/1:3:1,2:1:14:2:41:-2.94229,0,-0.496493","0/0:4:4,0:4:122:0:0:0,-1.20412,-11.1074",.,.,"0/1:2:1,1:1:36:1:14:-0.797523,0,-2.93406",.
3,chr1,7652028,.,A,C,0.160379,.,AB=0.428571;ABP=3.32051;AC=2;AF=0.333333;AN=6;AO=3;CIGAR=1X;DP=12;DPB=12;DPRA=0.7;EPP=3.73412;EPPR=9.04217;GTI=1;LEN=1;MEANALT=1;MQM=40.3333;MQMR=43.6667;NS=3;NUMALT=1;ODDS=3.39292;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=73;QR=249;RO=9;RPL=2;RPP=3.73412;RPPR=3.25157;RPR=1;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=3;SRP=5.18177;SRR=6;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"0/1:5:3,2:3:77:2:50:-2.95813,0,-5.60572","0/0:5:5,0:5:136:0:0:0,-1.50515,-12.3349",.,.,"0/1:2:1,1:1:36:1:23:-1.69452,0,-2.93406",.
4,chr1,7652033,.,T,A,0.00146701,.,AB=0.4;ABP=3.44459;AC=1;AF=0.166667;AN=6;AO=2;CIGAR=1X;DP=11;DPB=11;DPRA=1.66667;EPP=7.35324;EPPR=9.04217;GTI=0;LEN=1;MEANALT=1;MQM=44;MQMR=42.4444;NS=3;NUMALT=1;ODDS=8.54865;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=28;QR=245;RO=9;RPL=2;RPP=7.35324;RPPR=3.25157;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=3;SRP=5.18177;SRR=6;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"0/0:5:5,0:5:101:0:0:0,-1.50515,-9.24844","0/1:5:3,2:3:108:2:28:-1.15406,0,-8.39599",.,.,"0/0:1:1,0:1:36:0:0:0,-0.30103,-3.53612",.
5,chr1,7652035,.,A,C,0.00146373,.,AB=0.4;ABP=3.44459;AC=1;AF=0.166667;AN=6;AO=2;CIGAR=1X;DP=11;DPB=11;DPRA=1.66667;EPP=7.35324;EPPR=9.04217;GTI=0;LEN=1;MEANALT=1;MQM=38.5;MQMR=43.6667;NS=3;NUMALT=1;ODDS=8.54865;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=28;QR=245;RO=9;RPL=2;RPP=7.35324;RPPR=3.25157;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=3;SRP=5.18177;SRR=6;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"0/1:5:3,2:3:64:2:28:-1.15157,0,-4.35605","0/0:5:5,0:5:145:0:0:0,-1.50515,-13.1898",.,.,"0/0:1:1,0:1:36:0:0:0,-0.30103,-3.53612",.
6,chr1,7652056,.,A,C,0.0707786,.,AB=0.5;ABP=3.0103;AC=1;AF=0.166667;AN=6;AO=3;CIGAR=1X;DP=10;DPB=10;DPRA=3;EPP=9.52472;EPPR=3.32051;GTI=0;LEN=1;MEANALT=1;MQM=40.3333;MQMR=42.4286;NS=3;NUMALT=1;ODDS=4.13005;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=67;QR=239;RO=7;RPL=3;RPP=9.52472;RPPR=3.32051;RPR=0;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=0;SRP=18.2106;SRR=7;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"0/1:6:3,3:3:99:3:67:-4.42412,0,-7.21488","0/0:3:3,0:3:104:0:0:0,-0.90309,-9.56264",.,.,"0/0:1:1,0:1:36:0:0:0,-0.30103,-3.53612",.


**Define the Columns**

In [None]:
# Define column names
names(sample_vcf_tp53) <- c('CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO')
head(sample_vcf_tp53)

,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,NA,NA,NA,NA,NA,NA,NA
,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,chr1,7651698,.,G,A,55.1479,.,AB=0;ABP=0;AC=2;AF=1;AN=2;AO=2;CIGAR=1X;DP=2;DPB=2;DPRA=0;EPP=3.0103;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=44;MQMR=0;NS=1;NUMALT=1;ODDS=7.37776;PAIRED=1;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=72;QR=0;RO=0;RPL=0;RPP=7.35324;RPPR=0;RPR=2;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=0;SRP=0;SRR=0;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"1/1:2:0,2:0:0:2:72:-6.71863,-0.60206,0",.,.,.,.,.
2,chr1,7652017,.,A,C,0.033793,.,AB=0.6;ABP=3.44459;AC=2;AF=0.333333;AN=6;AO=3;CIGAR=1X;DP=9;DPB=9;DPRA=0.625;EPP=3.73412;EPPR=4.45795;GTI=1;LEN=1;MEANALT=1;MQM=40.3333;MQMR=44;NS=3;NUMALT=1;ODDS=4.94386;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=55;QR=172;RO=6;RPL=1;RPP=3.73412;RPPR=8.80089;RPR=2;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=3;SRP=3.0103;SRR=3;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"0/1:3:1,2:1:14:2:41:-2.94229,0,-0.496493","0/0:4:4,0:4:122:0:0:0,-1.20412,-11.1074",.,.,"0/1:2:1,1:1:36:1:14:-0.797523,0,-2.93406",.
3,chr1,7652028,.,A,C,0.160379,.,AB=0.428571;ABP=3.32051;AC=2;AF=0.333333;AN=6;AO=3;CIGAR=1X;DP=12;DPB=12;DPRA=0.7;EPP=3.73412;EPPR=9.04217;GTI=1;LEN=1;MEANALT=1;MQM=40.3333;MQMR=43.6667;NS=3;NUMALT=1;ODDS=3.39292;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=73;QR=249;RO=9;RPL=2;RPP=3.73412;RPPR=3.25157;RPR=1;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=3;SRP=5.18177;SRR=6;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"0/1:5:3,2:3:77:2:50:-2.95813,0,-5.60572","0/0:5:5,0:5:136:0:0:0,-1.50515,-12.3349",.,.,"0/1:2:1,1:1:36:1:23:-1.69452,0,-2.93406",.
4,chr1,7652033,.,T,A,0.00146701,.,AB=0.4;ABP=3.44459;AC=1;AF=0.166667;AN=6;AO=2;CIGAR=1X;DP=11;DPB=11;DPRA=1.66667;EPP=7.35324;EPPR=9.04217;GTI=0;LEN=1;MEANALT=1;MQM=44;MQMR=42.4444;NS=3;NUMALT=1;ODDS=8.54865;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=28;QR=245;RO=9;RPL=2;RPP=7.35324;RPPR=3.25157;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=3;SRP=5.18177;SRR=6;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"0/0:5:5,0:5:101:0:0:0,-1.50515,-9.24844","0/1:5:3,2:3:108:2:28:-1.15406,0,-8.39599",.,.,"0/0:1:1,0:1:36:0:0:0,-0.30103,-3.53612",.
5,chr1,7652035,.,A,C,0.00146373,.,AB=0.4;ABP=3.44459;AC=1;AF=0.166667;AN=6;AO=2;CIGAR=1X;DP=11;DPB=11;DPRA=1.66667;EPP=7.35324;EPPR=9.04217;GTI=0;LEN=1;MEANALT=1;MQM=38.5;MQMR=43.6667;NS=3;NUMALT=1;ODDS=8.54865;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=28;QR=245;RO=9;RPL=2;RPP=7.35324;RPPR=3.25157;RPR=0;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=3;SRP=5.18177;SRR=6;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"0/1:5:3,2:3:64:2:28:-1.15157,0,-4.35605","0/0:5:5,0:5:145:0:0:0,-1.50515,-13.1898",.,.,"0/0:1:1,0:1:36:0:0:0,-0.30103,-3.53612",.
6,chr1,7652056,.,A,C,0.0707786,.,AB=0.5;ABP=3.0103;AC=1;AF=0.166667;AN=6;AO=3;CIGAR=1X;DP=10;DPB=10;DPRA=3;EPP=9.52472;EPPR=3.32051;GTI=0;LEN=1;MEANALT=1;MQM=40.3333;MQMR=42.4286;NS=3;NUMALT=1;ODDS=4.13005;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=67;QR=239;RO=7;RPL=3;RPP=9.52472;RPPR=3.32051;RPR=0;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=0;SRP=18.2106;SRR=7;TYPE=snp,GT:DP:AD:RO:QR:AO:QA:GL,"0/1:6:3,3:3:99:3:67:-4.42412,0,-7.21488","0/0:3:3,0:3:104:0:0:0,-0.90309,-9.56264",.,.,"0/0:1:1,0:1:36:0:0:0,-0.30103,-3.53612",.
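
Because only eight names are supplied, the FORMAT and per-sample columns end up named `NA`. One way to name every column is to reuse the `#CHROM` header line of the file itself. The sketch below uses a tiny inline VCF (based on the example at the top, with shortened FORMAT fields) so that it runs without a download; for a real file the header line could be obtained with `readLines()` instead.

```r
# A two-sample VCF held in a character vector (illustrative data)
vcf_text <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMP001\tSAMP002",
  "20\t1291018\trs11449\tG\tA\t.\tPASS\t.\tGT\t0/0\t0/1",
  "20\t2300608\trs84825\tC\tT\t.\tPASS\t.\tGT\t0/1\t0/1"
)

# Find the header line that starts with "#CHROM", drop the "#", and split on tabs
header_line <- grep("^#CHROM", vcf_text, value = TRUE)
col_names   <- strsplit(sub("^#", "", header_line), "\t", fixed = TRUE)[[1]]

# Read the records (lines starting with "#" are skipped) and name every column
vcf <- read.table(text = vcf_text, header = FALSE, comment.char = "#",
                  sep = "\t", col.names = col_names,
                  stringsAsFactors = FALSE)
names(vcf)
```

With all columns named, the genotype columns can be selected by sample name rather than by position.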


In [None]:
# Select only the first 7 columns and ignore the rest
sample_vcf_tp53 <- select(sample_vcf_tp53, c('CHROM','POS','ID','REF','ALT','QUAL','FILTER'))
head(sample_vcf_tp53)

,CHROM,POS,ID,REF,ALT,QUAL,FILTER
,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<chr>
1,chr1,7651698,.,G,A,55.1479,.
2,chr1,7652017,.,A,C,0.033793,.
3,chr1,7652028,.,A,C,0.160379,.
4,chr1,7652033,.,T,A,0.00146701,.
5,chr1,7652035,.,A,C,0.00146373,.
6,chr1,7652056,.,A,C,0.0707786,.


**Load the Reference VCF File**

In [None]:
#Load Reference VCF to a Dataframe
clinvar_vcf_tp53 <- read.table('https://raw.githubusercontent.com/pine-bio-support/Merge-VCF-files/main/clinVar_all_tp53_edt.vcf',
                              header = FALSE, comment.char = "#", sep = "\t")
head(clinvar_vcf_tp53)

,V1,V2,V3,V4,V5,V6,V7,V8
,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
1,chr17,7666228,133418,C,G,.,.,RS=144366923;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=no_assertion_provided;CLNSIG=not_provided
2,chr17,7667874,925574,G,C,.,.,"RS=1049800949;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign"
3,chr17,7667880,926695,G,A,.,.,"RS=2072740569;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign"
4,chr17,7667888,925707,G,A,.,.,"RS=2072740671;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign"
5,chr17,7667899,920680,A,C,.,.,"RS=886596112;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign"
6,chr17,7667901,927570,C,T,.,.,"RS=2072740794;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign"


**Define the Columns**

In [None]:
# Define column names
names(clinvar_vcf_tp53) <- c('CHROM','POS','ID','REF','ALT','QUAL','FILTER','INFO')
head(clinvar_vcf_tp53)

,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO
,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
1,chr17,7666228,133418,C,G,.,.,RS=144366923;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=no_assertion_provided;CLNSIG=not_provided
2,chr17,7667874,925574,G,C,.,.,"RS=1049800949;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign"
3,chr17,7667880,926695,G,A,.,.,"RS=2072740569;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign"
4,chr17,7667888,925707,G,A,.,.,"RS=2072740671;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign"
5,chr17,7667899,920680,A,C,.,.,"RS=886596112;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign"
6,chr17,7667901,927570,C,T,.,.,"RS=2072740794;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign"


**Extract Clinical Significance**

In [None]:
#Extract clinical significance
clinvar_vcf_tp53 <- clinvar_vcf_tp53 %>% separate(INFO, c('INFO', 'Significance'), sep=';CLNSIG=')
head(clinvar_vcf_tp53)

,CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,Significance
,<chr>,<int>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,chr17,7666228,133418,C,G,.,.,RS=144366923;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=no_assertion_provided,not_provided
2,chr17,7667874,925574,G,C,.,.,"RS=1049800949;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Likely_benign
3,chr17,7667880,926695,G,A,.,.,"RS=2072740569;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Likely_benign
4,chr17,7667888,925707,G,A,.,.,"RS=2072740671;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Likely_benign
5,chr17,7667899,920680,A,C,.,.,"RS=886596112;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Likely_benign
6,chr17,7667901,927570,C,T,.,.,"RS=2072740794;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Likely_benign
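
Splitting on `;CLNSIG=` works here because CLNSIG is the last entry of every INFO string. As a hedged alternative, the value can be pulled out with a regular expression that does not depend on its position. The INFO strings below are shortened from the output above; the second and third are invented to show the edge cases.

```r
library(stringr)

info <- c(
  "RS=144366923;CLNVC=single_nucleotide_variant;CLNSIG=not_provided",
  "CLNSIG=Benign;CLNVC=single_nucleotide_variant",  # CLNSIG first (invented ordering)
  "CLNVC=single_nucleotide_variant"                 # no CLNSIG entry at all (invented)
)

# Capture everything after "CLNSIG=" up to the next ";" or the end of the string
significance <- str_extract(info, "(?<=CLNSIG=)[^;]+")
significance
```

Entries without a CLNSIG key come back as `NA`, which makes missing annotations easy to spot.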


**Find Common Variants between Sample and Reference VCF files**

In [None]:
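#Find variants shared by the sample and reference dataframes, matched on chromosome, position and reference allele
#(ALT is not part of the key, so ALT.x and ALT.y can differ and should be compared afterwards)
sample_tp53_clnvar <- inner_join(sample_vcf_tp53, clinvar_vcf_tp53, by = c("CHROM", "POS", "REF"))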
head(sample_tp53_clnvar)

,CHROM,POS,ID.x,REF,ALT.x,QUAL.x,FILTER.x,ID.y,ALT.y,QUAL.y,FILTER.y,INFO,Significance
,<chr>,<int>,<chr>,<chr>,<chr>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
1,chr17,7667901,.,C,T,1.62128e-14,.,927570,T,.,.,"RS=2072740794;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Likely_benign
2,chr17,7668134,.,G,A,5946.43,.,1294447,A,.,.,"CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Benign
3,chr17,7669911,.,C,T,0.0,.,1269356,T,.,.,"MC=SO:0001627|intron_variant;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Benign
4,chr17,7673183,.,G,T,2.41323e-13,.,1246968,A,.,.,"MC=SO:0001627|intron_variant;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Benign
5,chr17,7674089,.,A,C,8086.02,.,1243959,C,.,.,"MC=SO:0001627|intron_variant;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Benign
6,chr17,7674109,.,G,A,7402.95,.,1243351,A,.,.,"MC=SO:0001627|intron_variant;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNREVSTAT=criteria_provided,_single_submitter",Benign
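
Row 4 of the output shows why the join key matters: the sample carries G>T at position 7673183, while the ClinVar record describes G>A. The sketch below contrasts the two join keys on small stand-in data frames built from rows shown above; `sample_df` and `clinvar_df` are illustrative names, not objects from this notebook.

```r
library(dplyr)

# Stand-ins for the sample and ClinVar data frames (two rows taken from the output above)
sample_df <- data.frame(
  CHROM = "chr17",
  POS   = c(7667901L, 7673183L),
  REF   = c("C", "G"),
  ALT   = c("T", "T"),
  stringsAsFactors = FALSE
)
clinvar_df <- data.frame(
  CHROM = "chr17",
  POS   = c(7667901L, 7673183L),
  REF   = c("C", "G"),
  ALT   = c("T", "A"),
  Significance = c("Likely_benign", "Benign"),
  stringsAsFactors = FALSE
)

# Position and reference allele only: both rows are kept, including the allele mismatch
by_position <- inner_join(sample_df, clinvar_df, by = c("CHROM", "POS", "REF"))

# Adding ALT to the key keeps only variants whose alternate allele also matches
by_allele <- inner_join(sample_df, clinvar_df, by = c("CHROM", "POS", "REF", "ALT"))

nrow(by_position)
nrow(by_allele)
```

Note that an exact match on ALT would miss multi-allelic records, where ALT is a comma-separated list; those would need to be split into one row per allele first.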


**Extract Relevant Variant Information**

In [None]:
#Select only relevant columns
sample_tp53_clnvar <- select(sample_tp53_clnvar, c("CHROM", "POS", "REF", "ALT.x", "QUAL.x", "ALT.y", "Significance"))
#Rename Columns
sample_tp53_clnvar <- rename(sample_tp53_clnvar, "ALT.sample"=ALT.x, "ALT.clnvar"=ALT.y, "QUAL"=QUAL.x)
head(sample_tp53_clnvar)

,CHROM,POS,REF,ALT.sample,QUAL,ALT.clnvar,Significance
,<chr>,<int>,<chr>,<chr>,<dbl>,<chr>,<chr>
1,chr17,7667901,C,T,1.62128e-14,T,Likely_benign
2,chr17,7668134,G,A,5946.43,A,Benign
3,chr17,7669911,C,T,0.0,T,Benign
4,chr17,7673183,G,T,2.41323e-13,A,Benign
5,chr17,7674089,A,C,8086.02,C,Benign
6,chr17,7674109,G,A,7402.95,A,Benign
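
Several of the matched variants have a QUAL close to zero, meaning the caller had little confidence in them. A simple quality filter could look like the sketch below. The threshold of 20 is an assumption chosen for illustration (it corresponds to an error probability of 0.01), not a value taken from this analysis, and `annotated` is a stand-in built from the rows shown above.

```r
library(dplyr)

# Stand-in for the annotated table, using the six rows shown above
annotated <- data.frame(
  POS  = c(7667901L, 7668134L, 7669911L, 7673183L, 7674089L, 7674109L),
  QUAL = c(1.62128e-14, 5946.43, 0.0, 2.41323e-13, 8086.02, 7402.95),
  Significance = c("Likely_benign", "Benign", "Benign", "Benign", "Benign", "Benign"),
  stringsAsFactors = FALSE
)

min_qual <- 20  # assumed threshold, adjust to the caller and the data

# Keep only variants called with at least the chosen quality
high_quality <- filter(annotated, QUAL >= min_qual)
high_quality$POS
```

On these six rows the filter keeps the three variants with QUAL in the thousands and drops the three with QUAL near zero.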


**Save the output file**

In [None]:
#Save the output
write.table(sample_tp53_clnvar, file = "sample_tp53_clnvar_annotated.txt", sep = '\t', quote = FALSE, row.names = FALSE)

**Identify Pathogenic Variants**

In [None]:
# Tabulate how many variants fall into each clinical significance category
table(sample_tp53_clnvar$Significance)


                      Benign         Benign/Likely_benign 
                          12                            2 
               Likely_benign                   Pathogenic 
                           9                            3 
Pathogenic/Likely_pathogenic       Uncertain_significance 
                           1                            5 

In [None]:
# Extract variants labelled exactly "Pathogenic" (the combined "Pathogenic/Likely_pathogenic" label is not matched)
sample_tp53_clnvar_pathogenic <- filter(sample_tp53_clnvar, Significance == "Pathogenic")
sample_tp53_clnvar_pathogenic

CHROM,POS,REF,ALT.sample,QUAL,ALT.clnvar,Significance
<chr>,<int>,<chr>,<chr>,<dbl>,<chr>,<chr>
chr17,7674893,C,A,4487.83,T,Pathogenic
chr17,7674945,G,A,1.11766,A,Pathogenic
chr17,7674957,G,A,1.01858e-13,A,Pathogenic
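
The exact comparison returns three rows, while the frequency table also lists one variant labelled `Pathogenic/Likely_pathogenic`. If combined labels should be included, one option is to split each label on `/` and test its parts, as in the sketch below. Which labels count as pathogenic (`wanted`) is an assumption made for illustration.

```r
# The six categories from the frequency table above
significance <- c("Benign", "Benign/Likely_benign", "Likely_benign",
                  "Pathogenic", "Pathogenic/Likely_pathogenic",
                  "Uncertain_significance")

# Assumed set of labels to treat as pathogenic
wanted <- c("Pathogenic", "Likely_pathogenic")

# Split combined labels on "/" and flag an entry if any part is in the wanted set
is_pathogenic <- vapply(
  strsplit(significance, "/", fixed = TRUE),
  function(labels) any(labels %in% wanted),
  logical(1)
)

significance[is_pathogenic]
```

Testing whole labels rather than searching for the substring "pathogenic" avoids accidental matches on longer terms that merely contain that word.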
