In [None]:
#create the output directories (-p avoids an error if they already exist)
mkdir -p ./results_HsMetric
mkdir -p ./results_varscan
mkdir -p ./results_variantAnn

Starting files you should have:

**resources_varCall**
- Sureselect.chr21.bed
- Sureselect.chr21.headerReorder.interval_list
- chr21.fa.gz (and associated indices and dicts)
- access-5k-mappable.hg19.bed 
- annovar (directory)
    - annotate_variation.pl  
    - convert2annovar.pl
    - humandb (directory)
        - hg19_refGeneMrna.fa
        - hg19_refGene.txt
    - table_annovar.pl
- refFlat.txt
- hg19 reference files (.dict, .fa.gz, .fa.gz.fai, .fa.gz.gzi)
- chr21 reference files (.dict, .fa.gz, .fa.gz.fai, .fa.gz.gzi)

**materials_varCall**
- FRFZ.chr21.hg19.bam
- FRFZ.chr21.hg19.mapped.bam (contains only mapped reads from FRFZ.chr21.hg19.bam) 
- CPTRES7.realigned.chr21.bam
- CPTRES4.realigned.chr21.bam
- GMTS_all.shuf.vcf.gz
- GMTS_all.shuf.vcf.gz.tbi

**results_mutect2**
- CPTRES7vs4.mutec2.vcf
- CPTRES7vs4.vcf.gz

### Generate coverage metrics

We have already generated the `*.interval_list` file, but the following command is an example of how it would be generated:

```/opt/conda/envs/variant_calling/bin/picard BedToIntervalList I=./resources_varCall/Sureselect.chr21.bed O=./resources_varCall/Sureselect.chr21.interval_list SD=./resources_varCall/hg19.fa.gz```
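
The coordinate change behind this conversion can be illustrated with plain awk. BED intervals are 0-based and half-open, while Picard interval lists are 1-based and closed, so the start moves up by one and the end stays the same. This is only a sketch of the body lines: it does not write the sequence-dictionary header that `BedToIntervalList` adds, the strand and name columns are filled with placeholder values, and both the toy records and the function name `bed_to_interval_body` are made up for illustration.

```shell
# Hypothetical helper: convert 3-column BED records on stdin
# into interval_list body lines (contig, start, end, strand, name).
# BED start is 0-based; interval_list start is 1-based. The end is unchanged.
bed_to_interval_body() {
    awk 'BEGIN { FS = OFS = "\t" }
         { print $1, $2 + 1, $3, "+", "." }'
}

# Toy records (not taken from Sureselect.chr21.bed)
converted=$(printf 'chr21\t100\t200\nchr21\t500\t650\n' | bed_to_interval_body)
printf '%s\n' "$converted"
```

For real work, the Picard command above remains the safer route because it also validates contigs against the sequence dictionary.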

In [1]:
#This step takes a minute or two to run
/opt/conda/envs/variant_calling/bin/picard CollectHsMetrics BAIT_INTERVALS=./resources_varCall/Sureselect.chr21.headerReorder.interval_list TARGET_INTERVALS=./resources_varCall/Sureselect.chr21.headerReorder.interval_list INPUT=./materials_varCall/FRFZ.chr21.hg19.mapped.bam OUTPUT=./results_HsMetric/FRFZ.chr21.hsmetrics.txt


INFO	2021-03-03 10:26:59	CollectHsMetrics	

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    CollectHsMetrics -BAIT_INTERVALS ./resources_varCall/Sureselect.chr21.headerReorder.interval_list -TARGET_INTERVALS ./resources_varCall/Sureselect.chr21.headerReorder.interval_list -INPUT ./materials_varCall/FRFZ.chr21.hg19.mapped.bam -OUTPUT ./results_HsMetric/FRFZ.chr21.hsmetrics.txt
**********


10:26:59.803 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/envs/variant_calling/share/picard-2.18.29-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Mar 03 10:26:59 PST 2021] CollectHsMetrics BAIT_INTERVALS=[./resources_varCall/Sureselect.chr21.headerReorder.interval_list] TA
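
The metrics file written by `CollectHsMetrics` stores its values as one wide tab-separated row under a header row, after some `#` comment lines and before a histogram section. A small awk helper can turn that row into one `NAME<TAB>VALUE` line per metric, which is easier to read. The sketch below is an assumption about how one might do this, not part of the course pipeline: the function name `picard_metrics_long` is hypothetical, and the toy file uses three real HsMetrics column names with placeholder values, not results from FRFZ.

```shell
# Hypothetical helper: print "NAME<TAB>VALUE" for the first metrics row of a Picard metrics file.
picard_metrics_long() {
    awk 'BEGIN { FS = "\t" }
         /^#/ || NF == 0 { next }                          # skip comment and blank lines
         !have_header { n = split($0, name, FS); have_header = 1; next }
         !printed { split($0, value, FS)                   # first data row only
                    for (i = 1; i <= n; i++) print name[i] "\t" value[i]
                    printed = 1 }' "$1"
}

# Toy metrics file with placeholder values (NOT real output)
metrics=$(mktemp)
printf '## METRICS CLASS\tpicard.analysis.directed.HsMetrics\nBAIT_SET\tMEAN_TARGET_COVERAGE\tPCT_TARGET_BASES_20X\nToySet\t100.0\t0.9\n\n## HISTOGRAM\tjava.lang.Integer\ncoverage_or_base_quality\thigh_quality_coverage_count\n0\t5\n' > "$metrics"

long=$(picard_metrics_long "$metrics")
printf '%s\n' "$long"
rm -f "$metrics"
```

On the real output this would be called as `picard_metrics_long ./results_HsMetric/FRFZ.chr21.hsmetrics.txt`.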

### Calling somatic variants (First Method)

Mutect2 

We will not be running these commands, but the following are examples of how we would generate ```CPTRES7vs4.vcf``` and ```CPTRES7vs4.mutec2.vcf```, the input files for the next few steps:

 ```/opt/conda/envs/r-bio/bin/gatk Mutect2 -R ./resources_varCall/hg19.fa.gz \
 -I ./materials_varCall/CPTRES7.realigned.chr21.bam \
 -I ./materials_varCall/CPTRES4.realigned.chr21.bam \
 -tumor CPTRES7:DS:CPTRES7 \
 -normal CPTRES4:DS:CPTRES4 \
 -O ./results_mutect2/CPTRES7vs4.vcf```

``` /opt/conda/envs/r-bio/bin/gatk FilterMutectCalls -R ./resources_varCall/hg19.fa.gz \
 -V ./results_mutect2/CPTRES7vs4.vcf \
 -O ./results_mutect2/CPTRES7vs4.mutec2.vcf```

#### VCF file manipulation

In [2]:
#compress (bgzip) and index each vcf file
for file in ./results_mutect2/CPTRES7vs4*vcf; do /opt/conda/envs/r-bio/bin/bgzip "$file"; /opt/conda/envs/r-bio/bin/tabix -p vcf "$file.gz"; done

In [3]:
#flag variants with coverage less than 20
/opt/conda/envs/r-bio/bin/bcftools filter -s "DP20" -O z -e 'INFO/DP<20' ./results_mutect2/CPTRES7vs4.mutec2.vcf.gz > ./results_mutect2/CPTRES7vs4.mutec2.filtered.vcf.gz

[W::vcf_parse_format] Extreme FORMAT/MPOS value encountered and set to missing at chr21:10794142
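
Because `-s "DP20"` is a soft filter, low-coverage records stay in the file and are only tagged in the FILTER column. One way to see how many records carry each tag is to tally the 7th column. The helper below is a sketch under that assumption: `count_filters` is a hypothetical name, and the three toy records are invented for illustration rather than taken from the Mutect2 output.

```shell
# Hypothetical helper: tally the FILTER column (7th field) of uncompressed VCF text on stdin.
count_filters() {
    awk 'BEGIN { FS = "\t" }
         !/^#/ { n[$7]++ }                       # skip header lines, count each FILTER value
         END { for (f in n) print f "\t" n[f] }' | sort
}

# Toy VCF (invented records, not real calls)
filter_counts=$(printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr21\t100\t.\tA\tG\t.\tPASS\tDP=35\nchr21\t200\t.\tC\tT\t.\tDP20\tDP=12\nchr21\t300\t.\tG\tA\t.\tPASS\tDP=50\n' | count_filters)
printf '%s\n' "$filter_counts"
```

On the real file, the same helper could be fed with `zcat ./results_mutect2/CPTRES7vs4.mutec2.filtered.vcf.gz | count_filters`, since bgzip output is gzip-compatible.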


In [4]:
#Get some statistics
/opt/conda/envs/r-bio/bin/bcftools stats ./results_mutect2/CPTRES7vs4.mutec2.filtered.vcf.gz

# This file was produced by bcftools stats (1.11+htslib-1.11) and can be plotted using plot-vcfstats.
# The command line was:	bcftools stats  ./results_mutect2/CPTRES7vs4.mutec2.filtered.vcf.gz
#
# Definition of sets:
# ID	[2]id	[3]tab-separated file names
ID	0	./results_mutect2/CPTRES7vs4.mutec2.filtered.vcf.gz
# SN, Summary numbers:
#   number of records   .. number of data rows in the VCF
#   number of no-ALTs   .. reference-only sites, ALT is either "." or identical to REF
#   number of SNPs      .. number of rows with a SNP
#   number of MNPs      .. number of rows with a MNP, such as CC>TT
#   number of indels    .. number of rows with an indel
#   number of others    .. number of rows with other type, for example a symbolic allele or
#                          a complex substitution, such as ACT>TCGA
#   number of multiallelic sites     .. number of rows with multiple alternate alleles
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SN

### Calling somatic variants (Second Method)

#### Generate pileup files

In [5]:
#create bam indices
/opt/conda/envs/r-bio/bin/samtools index ./materials_varCall/CPTRES7.realigned.chr21.bam ./materials_varCall/CPTRES7.realigned.chr21.bam.bai
/opt/conda/envs/r-bio/bin/samtools index ./materials_varCall/CPTRES4.realigned.chr21.bam ./materials_varCall/CPTRES4.realigned.chr21.bam.bai

In [6]:
# generate pileup files against the chr21 reference (-Q 0 disables the base-quality filter)
/opt/conda/envs/r-bio/bin/samtools mpileup -Q 0 -f ./resources_varCall/chr21.fa.gz ./materials_varCall/CPTRES7.realigned.chr21.bam > ./results_varscan/CPTRES7.chr21.pileup
/opt/conda/envs/r-bio/bin/samtools mpileup -Q 0 -f ./resources_varCall/chr21.fa.gz ./materials_varCall/CPTRES4.realigned.chr21.bam > ./results_varscan/CPTRES4.chr21.pileup


[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files


In [7]:
# call variants with VarScan: normal pileup first, then tumor (may need an interactive node with enough RAM)
/opt/conda/envs/r-bio/bin/varscan somatic ./results_varscan/CPTRES4.chr21.pileup ./results_varscan/CPTRES7.chr21.pileup CPTRES7vs4 --output-vcf --output-snp ./results_varscan/CPTRES7vs4.chr21.snp --output-indel ./results_varscan/CPTRES7vs4.chr21.indel

Normal Pileup: ./results_varscan/CPTRES4.chr21.pileup
Tumor Pileup: ./results_varscan/CPTRES7.chr21.pileup
NOTICE: While dual input files are still supported, using a single mpileup file (normal-tumor) with the --mpileup 1 setting is strongly recommended.
Min coverage:	8x for Normal, 6x for Tumor
Min reads2:	2
Min strands2:	1
Min var freq:	0.2
Min freq for hom:	0.75
Normal purity:	1.0
Tumor purity:	1.0
Min avg qual:	15
P-value thresh:	0.99
Somatic p-value:	0.05
1417403 positions in tumor
1416817 positions shared in normal
945223 had sufficient coverage for comparison
944186 were called Reference
0 were mixed SNP-indel calls and filtered
1020 were called Germline
2 were called LOH
15 were called Somatic
0 were called Unknown
0 were called Variant
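
As a quick sanity check on this summary, the call categories should add up to the number of positions that had sufficient coverage for comparison. The arithmetic below uses only the counts printed above; the variable names are chosen here for illustration.

```shell
# Counts copied from the VarScan summary above
reference=944186; mixed=0; germline=1020; loh=2; somatic=15; unknown=0; variant=0
compared=945223; tumor_positions=1417403

# The call categories should account for every compared position
total=$((reference + mixed + germline + loh + somatic + unknown + variant))
echo "sum of call categories: $total (compared positions: $compared)"

# Share of tumor positions that had enough coverage in both samples, in percent
pct=$(awk -v c="$compared" -v t="$tumor_positions" 'BEGIN { printf "%.1f", 100 * c / t }')
echo "compared / tumor positions: $pct%"
```

The categories sum to 945223, matching the reported number of compared positions, and about two thirds of the tumor positions passed the coverage thresholds.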


#### Calling CNV

We will not be running this command, but here is an example of CNV calling with ```cnvkit.py batch```:

```/opt/conda/envs/r-bio/bin/cnvkit.py batch ./materials/CPTRES7.realigned.chr21.bam --normal ./materials/CPTRES4.realigned.chr21.bam --targets ./resources/Sureselect.chr21.bed --annotate ./resources/refFlat.txt --fasta ./resources/hg19.fa.gz --access ./resources/access-5k-mappable.hg19.bed --output-reference myref.cnn --output-dir ./results --diagram --scatter```


#### VCF file manipulation

Let's merge the SNP and indel files.

In [1]:
#compress (bgzip) and index each vcf file
for file in ./results_varscan/CPTRES7vs4*vcf; do /opt/conda/envs/r-bio/bin/bgzip "$file"; /opt/conda/envs/r-bio/bin/tabix -p vcf "$file.gz"; done

In [2]:
#Concatenate the snp and indel files
/opt/conda/envs/r-bio/bin/bcftools concat -O z -o ./results_varscan/CPTRES7vs4.varscan.vcf.gz ./results_varscan/CPTRES7vs4.chr21.snp.vcf.gz ./results_varscan/CPTRES7vs4.chr21.indel.vcf.gz


Checking the headers and starting positions of 2 files
Concatenating ./results_varscan/CPTRES7vs4.chr21.snp.vcf.gz	0.008932 seconds
Concatenating ./results_varscan/CPTRES7vs4.chr21.indel.vcf.gz	0.000701 seconds
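
Without the `-a` option, `bcftools concat` writes the input files one after another in the order given, so a SNP file followed by an indel file from the same chromosome may not be position-sorted. A simple way to check is to confirm that positions never decrease within a contig. The helper below is a minimal sketch: `vcf_is_sorted` is a hypothetical name, the records are toy data, and the check does not detect a contig that reappears later in the file.

```shell
# Hypothetical helper: exit status 0 if VCF records on stdin are position-sorted within each contig.
vcf_is_sorted() {
    awk 'BEGIN { FS = "\t" }
         /^#/ { next }                                   # skip header lines
         $1 == chrom && $2 + 0 < pos { bad = 1; exit }   # position went backwards
         { chrom = $1; pos = $2 + 0 }
         END { exit bad + 0 }'
}

# Toy records: one sorted stream, one where an indel follows a later SNP
sorted_status=0
printf 'chr21\t100\t.\tA\tG\nchr21\t200\t.\tC\tT\n' | vcf_is_sorted || sorted_status=$?
unsorted_status=0
printf 'chr21\t300\t.\tA\tG\nchr21\t150\t.\tCT\tC\n' | vcf_is_sorted || unsorted_status=$?
echo "sorted input -> $sorted_status, unsorted input -> $unsorted_status"
```

If the check fails on a real file, `bcftools sort` on the concatenated output, or `bcftools concat -a` on the indexed inputs, are the usual remedies.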


In [3]:
#flag variants with coverage less than 20
/opt/conda/envs/r-bio/bin/bcftools filter -s "DP20" -O z -e 'INFO/DP<20' ./results_varscan/CPTRES7vs4.varscan.vcf.gz > ./results_varscan/CPTRES7vs4.varscan.filtered.vcf.gz

In [4]:
#keep variants that PASS and are germline (VarScan somatic status SS=1)
/opt/conda/envs/r-bio/bin/bcftools filter -O z -i 'FILTER=="PASS" & INFO/SS=="1"' ./results_varscan/CPTRES7vs4.varscan.filtered.vcf.gz > ./results_varscan/CPTRES7vs4.germ.vcf.gz
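
VarScan records the somatic status of each call in the `SS` INFO tag (0=Reference, 1=Germline, 2=Somatic, 3=LOH, 5=Unknown), which is why `INFO/SS=="1"` selects germline variants. To see how the calls are distributed before filtering, the codes can be tallied from the INFO column. This is a sketch on invented records: `count_somatic_status` is a hypothetical helper name and the toy INFO strings are placeholders.

```shell
# Hypothetical helper: tally VarScan somatic-status (SS) codes from the INFO column (8th field).
count_somatic_status() {
    awk 'BEGIN { FS = "\t"
                 label[0] = "Reference"; label[1] = "Germline"; label[2] = "Somatic"
                 label[3] = "LOH"; label[5] = "Unknown" }
         /^#/ { next }
         { n = split($8, kv, ";")
           for (i = 1; i <= n; i++)
               if (kv[i] ~ /^SS=/) { code = substr(kv[i], 4); count[code]++ } }   # "SS=" only, not "SSC="
         END { for (c in count) print c "\t" label[c] "\t" count[c] }' | sort -n
}

# Toy records (invented): two germline calls and one somatic call
ss_counts=$(printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr21\t100\t.\tA\tG\t.\tPASS\tDP=40;SS=1;SSC=3\nchr21\t200\t.\tC\tT\t.\tPASS\tDP=55;SS=1;SSC=1\nchr21\t300\t.\tG\tA\t.\tPASS\tDP=61;SS=2;SSC=25\n' | count_somatic_status)
printf '%s\n' "$ss_counts"
```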

In [5]:
#What is the transition to transversion ratio?
/opt/conda/envs/r-bio/bin/bcftools stats ./results_varscan/CPTRES7vs4.germ.vcf.gz

# This file was produced by bcftools stats (1.11+htslib-1.11) and can be plotted using plot-vcfstats.
# The command line was:	bcftools stats  ./results_varscan/CPTRES7vs4.germ.vcf.gz
#
# Definition of sets:
# ID	[2]id	[3]tab-separated file names
ID	0	./results_varscan/CPTRES7vs4.germ.vcf.gz
# SN, Summary numbers:
#   number of records   .. number of data rows in the VCF
#   number of no-ALTs   .. reference-only sites, ALT is either "." or identical to REF
#   number of SNPs      .. number of rows with a SNP
#   number of MNPs      .. number of rows with a MNP, such as CC>TT
#   number of indels    .. number of rows with an indel
#   number of others    .. number of rows with other type, for example a symbolic allele or
#                          a complex substitution, such as ACT>TCGA
#   number of multiallelic sites     .. number of rows with multiple alternate alleles
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
# 
#   Note that ro

DP	0	201	0	0.000000	1	0.104493
DP	0	202	0	0.000000	1	0.104493
DP	0	203	0	0.000000	1	0.104493
DP	0	204	0	0.000000	3	0.313480
DP	0	205	0	0.000000	2	0.208986
DP	0	206	0	0.000000	4	0.417973
DP	0	207	0	0.000000	2	0.208986
DP	0	208	0	0.000000	3	0.313480
DP	0	210	0	0.000000	2	0.208986
DP	0	211	0	0.000000	1	0.104493
DP	0	212	0	0.000000	1	0.104493
DP	0	214	0	0.000000	1	0.104493
DP	0	215	0	0.000000	1	0.104493
DP	0	217	0	0.000000	2	0.208986
DP	0	220	0	0.000000	2	0.208986
DP	0	223	0	0.000000	2	0.208986
DP	0	224	0	0.000000	2	0.208986
DP	0	226	0	0.000000	2	0.208986
DP	0	227	0	0.000000	1	0.104493
DP	0	229	0	0.000000	2	0.208986
DP	0	230	0	0.000000	2	0.208986
DP	0	232	0	0.000000	1	0.104493
DP	0	234	0	0.000000	1	0.104493
DP	0	235	0	0.000000	1	0.104493
DP	0	236	0	0.000000	3	0.313480
DP	0	237	0	0.000000	1	0.104493
DP	0	238	0	0.000000	1	0.104493
DP	0	243	0	0.000000	1	0.104493
DP	0	245	0	0.000000	1	0.104493
DP	0	246	0	0.000000	1	0.104493
DP	0	248	0	0.000000	1	0.104493
DP	0	249	0	0.000000	2	0.208986
DP	0	250
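
The full `bcftools stats` report is long, and the transition/transversion ratio sits on a single line that starts with `TSTV`. Rather than scrolling, that line can be pulled out directly. The helper name `tstv_from_stats` is hypothetical, and the toy report below uses placeholder numbers, not results from this dataset.

```shell
# Hypothetical helper: print ts, tv and ts/tv from bcftools-stats-style text on stdin.
# Columns of the TSTV line: [2]id [3]ts [4]tv [5]ts/tv ...
tstv_from_stats() {
    awk 'BEGIN { FS = "\t" } $1 == "TSTV" { print "ts=" $3, "tv=" $4, "ts/tv=" $5 }'
}

# Toy report lines with placeholder numbers (NOT real output)
tstv_line=$(printf '# TSTV\t[2]id\t[3]ts\t[4]tv\t[5]ts/tv\nTSTV\t0\t20\t10\t2.00\t20\t10\t2.00\nDP\t0\t30\t0\t0.000000\t1\t3.333333\n' | tstv_from_stats)
echo "$tstv_line"
```

With the real file this would be `bcftools stats ./results_varscan/CPTRES7vs4.germ.vcf.gz | tstv_from_stats`, or simply piping the report through `grep '^TSTV'`.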

### Variant Annotation

For this part we will use the results of cohort germline sequencing (GATK HaplotypeCaller). Note 1: genotypes have been shuffled to preserve privacy. Note 2: GATK was run on individual files, not on the cohort, so no gVCF was used; missing genotypes are assumed to be homozygous reference.

In [6]:
#Break multi-allelic sites into one record per alternate allele
/opt/conda/envs/variant_calling/bin/vcfbreakmulti ./materials_varCall/GMTS_all.shuf.vcf.gz | /opt/conda/envs/r-bio/bin/bgzip -c >  ./results_variantAnn/GMTS_all.shuf.BM.vcf.gz

In [7]:
#index the output
/opt/conda/envs/r-bio/bin/tabix -p vcf ./results_variantAnn/GMTS_all.shuf.BM.vcf.gz

In [8]:
#run table_annovar.pl on the GMTS variant file
./resources_varCall/annovar/table_annovar.pl --vcfinput --nastring . --protocol refGene --operation g --buildver hg19 --outfile ./results_variantAnn/GMTSann ./results_variantAnn/GMTS_all.shuf.BM.vcf.gz ./resources_varCall/annovar/humandb/


NOTICE: Running with system command <convert2annovar.pl -includeinfo -allsample -withfreq -format vcf4 ./results_variantAnn/GMTS_all.shuf.BM.vcf.gz > ./results_variantAnn/GMTSann.avinput>
NOTICE: Finished reading 8192 lines from VCF file
NOTICE: A total of 8055 locus in VCF file passed QC threshold, representing 5386 SNPs (3375 transitions and 2011 transversions) and 2669 indels/substitutions
NOTICE: Finished writing 204668 SNPs (128250 transitions and 76418 transversions) and 101422 indels/substitutions for 38 samples

NOTICE: Running with system command <./resources_varCall/annovar/table_annovar.pl --nastring . --protocol refGene --operation g --buildver hg19 -outfile ./results_variantAnn/GMTSann ./results_variantAnn/GMTSann.avinput ./resources_varCall/annovar/humandb/ -otherinfo>
-----------------------------------------------------------------
NOTICE: Processing operation=g protocol=refGene

NOTICE: Running with system command <annotate_variation.pl -geneanno -buildver hg19 -dbtyp
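
The NOTICE lines above already allow a quick Ts/Tv estimate for the GMTS call set, and they show that the per-sample totals are the per-locus totals multiplied by the 38 samples. The arithmetic below uses only numbers printed in that output; the variable names are chosen here for illustration.

```shell
# Counts copied from the convert2annovar NOTICE lines above
tstv_sites=$(awk 'BEGIN { printf "%.2f", 3375 / 2011 }')        # per-locus transitions / transversions
tstv_genotypes=$(awk 'BEGIN { printf "%.2f", 128250 / 76418 }') # same ratio over all written records
per_sample_snps=$((5386 * 38))                                  # 5386 SNP loci x 38 samples

echo "Ts/Tv over loci: $tstv_sites"
echo "Ts/Tv over written records: $tstv_genotypes"
echo "SNP loci x samples: $per_sample_snps"
```

Both ratios come out at about 1.68, and 5386 × 38 reproduces the 204668 SNPs reported as written.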

In [10]:
#export genotypes to a TSV file, one sample-variant pair per row
/opt/conda/envs/variant_calling/bin/vcf2tsv -g ./results_variantAnn/GMTS_all.shuf.BM.vcf.gz > ./results_variantAnn/GMTS.geno.txt