# Generating GRCh38 Medical Genes Benchmark

This notebook details the steps to generate the challenging medically relevant genes benchmark. All paths are relative to the top-level directory of the repository, and large dependency files that cannot be stored in a git repo are indicated with a URL in comments.


1) Look up coordinates in Ensembl GRCh38 Human Genes v100 for the gene symbols in the union of Mandelker et al. Supplementary Table 13, the COSMIC Cancer Gene Census, and the Steve Lincoln Medical Gene Lists -- GRCh38_lookup_MRG_symbol_coordinates_ENSEMBL.R

2) Find overlap of genes with HG002 v4.2.1, then add slop and find overlap with HG002 hifiasm v0.11 dip.bed

3) Find genes that had < 90% overlap with GRCh38 v4.2.1 and were fully covered with overlapping segdups and flanking sequence in HG002 hifiasm v0.11 GRCh38 dip.bed, find the union of the GRCh37 and GRCh38 MRG lists, then add genes that are unique to GRCh37 but still fully covered with overlapping segdups and flanking sequence in HG002 hifiasm v0.11 GRCh38 dip.bed -- find_coordinates_of_MRG_GRCh37_GRCh38_union.R

4) Use coordinates for benchmark, then remove
    - homopolymers and imperfect homopolymers > 20 bp
    - SVs with 50 bp flanking and overlapping tandem repeats
    - hifiasm error
    - GRCh38 GAPs
    - partially covered tandem repeats
    - MHC region
    
5) Generate stratification files for Complex Variants in Tandem Repeats
    - GRCh38_MRG_stratification_ComplexVar_in_TR.bed

# Downloading Data Dependencies

## hifiasm Variants and Diploid Regions

In [None]:
mkdir -p data/hifiasm_dipcall_output
wget -O data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.bed \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/SupplementaryFiles/HG002v11-align2-GRCh38/HG002v11-align2-GRCh38.dip.bed

wget -O data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf.gz \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/SupplementaryFiles/HG002v11-align2-GRCh38/HG002v11-align2-GRCh38.dip.vcf.gz

## Reference Genome

In [None]:
wget -O resources/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa.gz \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz

gunzip resources/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa.gz
samtools faidx resources/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
    

## Genomic Stratifications

In [None]:
mkdir -p resources/giab_stratifications
wget -O resources/giab_stratifications/GRCh38_segdups.bed.gz \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/SegmentalDuplications/GRCh38_segdups.bed.gz

wget -O resources/giab_stratifications/GRCh38_MHC.bed.gz \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/OtherDifficult/GRCh38_MHC.bed.gz

wget -O resources/giab_stratifications/GRCh38_AllTandemRepeats_gt100bp_slop5.bed.gz \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/LowComplexity/GRCh38_AllTandemRepeats_gt100bp_slop5.bed.gz

wget -O resources/giab_stratifications/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v2.0/GRCh38/LowComplexity/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz

## GIAB Benchmark Sets

### CMRG Draft Benchmarks

In [None]:
mkdir -p data/manually_created_files/cmrg_draft_benchmarks
## v0.02.03 small variant benchmark
wget -O data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.bed \
    https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_benchmark_v0.02/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.bed

wget -O data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.vcf.gz \
    https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_benchmark_v0.02/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.vcf.gz

wget -O data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.vcf.gz.tbi \
    https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_benchmark_v0.02/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.vcf.gz.tbi

## Download v0.01 SV benchmark
wget -O data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.bed \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_SV_benchmark_v0.01/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.bed
wget -O data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.vcf.gz \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_SV_benchmark_v0.01/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.vcf.gz
wget -O data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.vcf.gz.tbi \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_SV_benchmark_v0.01/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.vcf.gz.tbi

### V4.2.1 GRCh38 benchmark

In [None]:
mkdir -p data/v4.2.1_benchmark_regions
wget -O data/v4.2.1_benchmark_regions/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
    https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/NISTv4.2.1/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed
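
Before running the filtering steps, it can help to confirm that the downloads above finished and left non-empty files. The following is a minimal sketch, not part of the original workflow: `check_inputs` is a hypothetical helper name, and the list of paths is only a subset of the files fetched in the cells above.

```shell
# check_inputs: hypothetical helper, not part of the original notebook.
# Prints every path that is missing or empty and returns non-zero if any is.
check_inputs() {
    status=0
    for path in "$@"; do
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call with paths used in the download cells above; the message is
# only a hint and does not stop the notebook.
check_inputs \
    data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.bed \
    data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf.gz \
    resources/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
    resources/giab_stratifications/GRCh38_MHC.bed.gz \
    data/v4.2.1_benchmark_regions/HG002_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed \
    || echo "some dependencies still need to be downloaded" >&2
```

The check uses `[ -s ]`, so a zero-byte file left behind by an interrupted `wget -O` is reported in the same way as a missing file.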

# From the genes to be benchmarked, remove the following regions that we exclude from the diploid assembly-based variant calls:

- homopolymers and imperfect homopolymers > 20 bp
- SVs with 50 bp flanking and overlapping tandem repeats
- hifiasm error
- GRCh38 GAPs
- partially covered tandem repeats



## Remove homopolymers > 20bp

In [None]:
bedtools subtract \
    -a workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_bedtools_merge.bed \
    -b data/giab_stratifications/GRCh38/GRCh38_SimpleRepeat_homopolymer_gt20_slop5.bed.gz \
    > workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_GRCh38_SimpleRepeat_homopolymer_gt20_slop5.bed

## Remove imperfect homopolymers > 20bp

In [None]:
bedtools subtract \
    -a workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_GRCh38_SimpleRepeat_homopolymer_gt20_slop5.bed \
    -b data/giab_stratifications/GRCh38/GRCh38_SimpleRepeat_imperfecthomopolgt20_slop5.bed.gz \
    > workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_GRCh38_SimpleRepeat_imperfecthomopolgt20_slop5.bed

## SVs with 50bp flanking and overlapping tandem repeats

In [None]:
gunzip -c data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf.gz \
    | awk 'length($4)>49 || length($5)>49' \
    | awk '{FS=OFS="\t"} {print $1,$2-1,$2+length($4)}' \
    > workflow/smallvar_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp.bed

intersectBed -wa \
    -a resources/giab_stratifications/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz \
    -b workflow/smallvar_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp.bed \
    | multiIntersectBed -i stdin workflow/smallvar_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp.bed \
    |  awk '{FS=OFS="\t"} {print $1,$2-50,$3+50}' \
    | mergeBed -i stdin -d 1000 \
    > workflow/smallvar_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000.bed

bedtools subtract \
    -a workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_GRCh38_SimpleRepeat_imperfecthomopolgt20_slop5.bed \
    -b workflow/smallvar_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000.bed \
    > workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_SVsgt49bp_repeatexpanded_slop50_merge1000.bed
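
The first step of the cell above turns every VCF record whose REF or ALT string is at least 50 characters long into a BED interval. The sketch below reproduces that arithmetic on toy records so the coordinates are easy to inspect. It is an illustration under assumptions: `rep` and `vcf_to_sv_bed` are hypothetical helper names, and unlike the notebook command it sets the field separator in a `BEGIN` block and skips header lines explicitly.

```shell
# rep N C: print character C repeated N times (helper for building toy alleles)
rep() {
    awk -v n="$1" -v c="$2" 'BEGIN {for (i = 0; i < n; i++) printf "%s", c; print ""}'
}

# vcf_to_sv_bed: hypothetical helper; reads VCF text on stdin and prints
# chrom, POS-1, POS+length(REF) for records with REF or ALT longer than 49 bp,
# the same arithmetic as the notebook command.
vcf_to_sv_bed() {
    awk 'BEGIN {FS = OFS = "\t"}
         /^#/ {next}
         length($4) > 49 || length($5) > 49 {print $1, $2 - 1, $2 + length($4)}'
}

# Toy input: one SNV (dropped) and one 60 bp REF allele (kept)
toy_del=$(rep 60 A)
{
    printf '##fileformat=VCFv4.2\n'
    printf 'chr1\t100\t.\tA\tT\t30\t.\t.\tGT\t0|1\n'
    printf 'chr1\t200\t.\t%s\tA\t30\t.\t.\tGT\t1|0\n' "$toy_del"
} | vcf_to_sv_bed
```

For the toy deletion at position 200 this prints the interval `chr1 199 260`. An insertion has a short REF, so its interval stays narrow until the later repeat expansion and 50 bp slop widen it.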
    

## Remove hifiasm error on chr21

In [None]:
# GRCh38_hifiasm_error.bed was created through manual curation of clusters of errors identified during evaluation steps of benchmark development

bedtools subtract \
    -a workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_SVsgt49bp_repeatexpanded_slop50_merge1000.bed \
    -b data/manually_created_files/GRCh38_hifiasm_error.bed \
    > workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_hifiasm_error.bed
    

## Remove GRCh38 GAPs

In [None]:
# GRCh38_MRG_GAPs.bed was created through manual curation of clusters of errors identified during evaluation steps of benchmark development

bedtools subtract \
    -a workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_hifiasm_error.bed \
    -b data/manually_created_files/GRCh38_MRG_GAPs.bed \
    > workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_GRCh38_MRG_GAPs.bed
    

## Sort

In [None]:
cat workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_GRCh38_MRG_GAPs.bed \
    | sed 's/^chr//' \
    | sort -k1,1n -k2,2n \
    | sed 's/^/chr/' \
    > workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_GRCh38_MRG_GAPs_sorted.bed
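
The sort above strips the `chr` prefix so that chromosomes order numerically (chr2 before chr10), then restores the prefix. A small sketch on toy intervals, wrapping the same three commands in a hypothetical function named `sort_bed_by_chrom_number`:

```shell
# sort_bed_by_chrom_number: hypothetical wrapper around the notebook's
# sed | sort | sed pipeline; reads BED on stdin, writes sorted BED on stdout.
sort_bed_by_chrom_number() {
    sed 's/^chr//' | sort -k1,1n -k2,2n | sed 's/^/chr/'
}

# Toy input in arbitrary order
printf 'chr10\t5\t9\nchr2\t7\t8\nchr2\t1\t3\nchr1\t4\t6\n' \
    | sort_bed_by_chrom_number
```

Because the first key is numeric, non-numeric names such as X or Y are compared as 0, so this ordering is only meaningful for numbered chromosomes.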

## Remove partially covered tandem repeats


In [None]:
complementBed \
    -i workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_GRCh38_MRG_GAPs_sorted.bed \
    -g resources/human.b38.genome \
    | intersectBed -wa \
        -a resources/giab_stratifications/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz \
        -b stdin \
    | subtractBed \
        -a workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_GRCh38_MRG_GAPs_sorted.bed \
        -b stdin \
    > workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_partial_tandem_repeats.bed
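
The pipeline above takes the complement of the benchmark regions, finds every tandem repeat or homopolymer that touches that complement, and subtracts those repeats from the benchmark. In other words, a repeat is removed unless it lies entirely inside the benchmark. The sketch below illustrates only the selection step in plain awk on toy intervals. It is a swapped-in illustration, not the bedtools commands used in the notebook: `partially_covered_repeats` is a hypothetical name, and it assumes the benchmark intervals are merged (no overlapping or abutting intervals).

```shell
# partially_covered_repeats BENCH_BED REPEAT_BED
# Hypothetical helper: prints repeats that are NOT fully contained in a single
# benchmark interval. Assumes BENCH_BED is non-empty and already merged.
partially_covered_repeats() {
    awk 'BEGIN {FS = OFS = "\t"}
         NR == FNR {n++; c[n] = $1; s[n] = $2; e[n] = $3; next}
         {
             contained = 0
             for (i = 1; i <= n; i++)
                 if (c[i] == $1 && s[i] <= $2 && $3 <= e[i]) {contained = 1; break}
             if (!contained) print $1, $2, $3
         }' "$1" "$2"
}

# Toy data: one benchmark interval and three repeats
workdir=$(mktemp -d)
printf 'chr1\t100\t200\n' > "$workdir/bench.bed"
printf 'chr1\t120\t130\nchr1\t190\t210\nchr1\t300\t310\n' > "$workdir/repeats.bed"
partially_covered_repeats "$workdir/bench.bed" "$workdir/repeats.bed"
rm -rf "$workdir"
```

The repeat at 120-130 is fully covered and kept in the benchmark. The repeat at 190-210 straddles the boundary, so subtracting it trims the benchmark back to 100-190. The repeat at 300-310 is also listed, but subtracting it changes nothing because it never overlapped the benchmark.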

## Remove MHC as it is covered in MHC benchmark

In [None]:
bedtools subtract \
    -a workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_subtract_partial_tandem_repeats.bed \
    -b resources/giab_stratifications/GRCh38_MHC.bed.gz \
    > workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark.bed

cat workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark.bed \
    | awk '{sum+=$3-$2} END {print sum}'

bedtools intersect \
    -a data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf.gz \
    -b workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_bedtools_merge.bed \
    -header \
    > workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark.vcf

bgzip -f workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark.vcf

tabix -f workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark.vcf.gz

__NOTES:__

1. `workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark.vcf.gz` is `https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_benchmark_v0.02/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.vcf.gz` _after updates to the headers_

2. `workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark.bed` is `https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_benchmark_v0.02/GRCh38/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.bed`

## SV benchmark

## Prepare release benchmark files

In [None]:
mkdir -p workflow/SV_benchmark/GRCh38
# Find SVs within MRG benchmark gene coordinates
bedtools intersect \
    -a workflow/smallvar_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000.bed \
    -b workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_bedtools_merge.bed \
    > workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000_intersect_MRG_benchmark_coordinates.bed

# Subset to SVs only gt49bp 
gunzip -c data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf.gz \
    | awk '{FS="\t|,"} {if($1 ~ /^#/ || length($4)-length($5)>49 || length($5)-length($4)>49 || length($6)-length($4)>49) print}' \
    | intersectBed \
    -a workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000_intersect_MRG_benchmark_coordinates.bed \
    -b stdin -c \
    | awk '$4>0' \
    | cut -f1-3 \
    > workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000_intersect_MRG_benchmark_coordinates_onlygt49bp.bed 

# Find isolated SVs  

gunzip \
    -c data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf.gz \
    | awk '{FS="\t|,"} {if($1 ~ /^#/ || length($4)-length($5)>9 || length($5)-length($4)>9 || length($6)-length($4)>9) print}' \
    | intersectBed \
        -a workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000_intersect_MRG_benchmark_coordinates_onlygt49bp.bed \
        -b stdin -c \
        | awk '$4==1' \
        | cut -f1-3 \
        > workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000_intersect_MRG_benchmark_coordinates_onlygt49bp_isolated.bed 

# Find complex SVs  

gunzip -c data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf.gz \
    | awk '{FS="\t|,"} {if($1 ~ /^#/ || length($4)-length($5)>9 || length($5)-length($4)>9 || length($6)-length($4)>9) print}' \
    | intersectBed \
        -a workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000_intersect_MRG_benchmark_coordinates_onlygt49bp.bed \
        -b stdin -c \
    | awk '$4>1' \
    | cut -f1-3 \
    > workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000_intersect_MRG_benchmark_coordinates_onlygt49bp_complexSVs.bed 

# Remove complex SVs from MRG gene candidate coordinates and remove GAPs 
bedtools subtract \
    -a workflow/smallvar_benchmark/GRCh38/HG002_GRCh38_CMRG_coordinates_temp_bedtools_merge.bed \
    -b workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000_intersect_MRG_benchmark_coordinates_onlygt49bp_complexSVs.bed \
    | bedtools subtract \
        -a stdin \
        -b data/manually_created_files/GRCh38_MRG_GAPs.bed \
    > workflow/SV_benchmark/GRCh38/HG002_GRCh38_MRG_draft_SV_benchmark_temp.bed

# Subtract HG002v11-align2-GRCh38.dip_complexindelsgt9bpinRepeats.bed from the SV benchmark bed:
# Find tandem repeats and homopolymers that have multiple indels >9bp, since these can add up to >49bp and should be subtracted from the benchmark SV bed
gunzip -c data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf.gz \
    | awk '{FS="\t|,"} {if($1 ~ /^#/ || length($4)-length($5)>9 || length($5)-length($4)>9 || length($6)-length($4)>9) print}' \
    | intersectBed \
        -a resources/giab_stratifications/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz \
        -b stdin -c \
    | awk '$4>1' \
    | cut -f1-3 \
    | intersectBed -v \
        -a stdin \
        -b workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_SVsgt49bp_repeatexpanded_slop50_merge1000_intersect_MRG_benchmark_coordinates_onlygt49bp.bed \
    > workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_complexindelsgt9bpinRepeats.bed

bedtools subtract \
    -a workflow/SV_benchmark/GRCh38/HG002_GRCh38_MRG_draft_SV_benchmark_temp.bed \
    -b workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38.dip_complexindelsgt9bpinRepeats.bed \
    > workflow/SV_benchmark/GRCh38/HG002_GRCh38_MRG_draft_SV_benchmark.bed


cat workflow/SV_benchmark/GRCh38/HG002_GRCh38_MRG_draft_SV_benchmark.bed \
    | awk '{sum+=$3-$2} END {print sum}'
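
The size filters in the cell above compare REF and ALT lengths and use `FS="\t|,"` so that a second ALT allele lands in `$6`. The sketch below shows one alternative way to express the same idea on toy records, splitting the ALT column explicitly so every allele is checked. It is an assumption-laden illustration rather than the notebook's command: `rep` and `filter_indels_gt` are hypothetical names, and this version tests the absolute length difference for every ALT allele, which is slightly broader than the original expression.

```shell
# rep N C: print character C repeated N times (helper for building toy alleles)
rep() {
    awk -v n="$1" -v c="$2" 'BEGIN {for (i = 0; i < n; i++) printf "%s", c; print ""}'
}

# filter_indels_gt MIN: hypothetical helper; reads VCF text on stdin and keeps
# header lines plus records where any ALT allele differs from REF in length
# by more than MIN bp.
filter_indels_gt() {
    awk -v min="$1" 'BEGIN {FS = "\t"}
        /^#/ {print; next}
        {
            n = split($5, alts, ",")
            for (i = 1; i <= n; i++) {
                d = length(alts[i]) - length($4)
                if (d < 0) d = -d
                if (d > min) {print; break}
            }
        }'
}

# Toy input: SNV, 12 bp insertion, multi-allelic site, 10 bp deletion
{
    printf '#CHROM\tPOS\tID\tREF\tALT\n'
    printf 'chr1\t10\t.\tA\tT\n'
    printf 'chr1\t20\t.\tA\t%s\n' "$(rep 13 C)"
    printf 'chr1\t30\t.\tA\tAT,%s\n' "$(rep 16 G)"
    printf 'chr1\t40\t.\t%s\tA\n' "$(rep 11 T)"
} | filter_indels_gt 9 | cut -f1,2
```

With a threshold of 9 the SNV is dropped and the other three records are kept. With a threshold of 49 only the header would remain.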

In [None]:
# Decompose for truvari comparison
#vt decompose -s HG002v11-align2-GRCh37.dip.vcf -o HG002v11-align2-GRCh37.dip_decomposed.vcf
#python script to remove ambiguous (non-ACTGN) REF

gunzip -c data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf.gz \
    > data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf

## remove ambiguous (non-ACGTN) in REF field. Adjust path to where you keep this file
python scripts/fix_reference_allele.py \
    --input_vcf_file data/hifiasm_dipcall_output/HG002v11-align2-GRCh38.dip.vcf \
    --output_file workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig.dip.vcf

## zip for bcftools
bgzip -c workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig.dip.vcf \
    > workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig.dip.vcf.gz

tabix workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig.dip.vcf.gz

## split multiallelic to biallelic
#bcftools norm -m- (multiallelic split)

bcftools norm -m- \
    -Oz workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig.dip.vcf.gz \
    -o workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_m.vcf.gz

## left align and normalize indels. Adjust to path where your reference.fa is located. 
#bcftools norm -f  (normalization)

bcftools norm \
    -f resources/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa \
    -Oz -o workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mf.vcf.gz \
    workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_m.vcf.gz


## remove duplicate records
#bcftools norm -d  (remove duplicate records)

bcftools norm \
    -d none \
    -Oz workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mf.vcf.gz \
    -o workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mfd.vcf.gz
    
## remove MHC region. Adjust to path where MHC.bed is located. 

bedtools subtract \
    -header \
    -a workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mfd.vcf.gz \
    -b resources/giab_stratifications/GRCh38_MHC.bed.gz \
    | bgzip -c \
    >  workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mfd_noMHC.vcf.gz



In [None]:
## intersect w/ benchmark bed and subset to >39bp in REF or ALT fields. Adjust to path of benchmark.bed
#intersect w/ MRG target regions and subset >39 bp

bedtools intersect \
    -header \
    -a workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mfd_noMHC.vcf.gz \
    -b workflow/SV_benchmark/GRCh38/HG002_GRCh38_MRG_draft_SV_benchmark.bed \
    | awk '$1 ~ /^#/ || length($4)>39 || length($5)>39' \
    | bgzip -c \
    > workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mfd_noMHC_intersectBenchBED_gt39bp.vcf.gz

## index vcf, required by truvari
tabix -f -p vcf workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mfd_noMHC_intersectBenchBED_gt39bp.vcf.gz


# NOTE:  workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mfd_noMHC_intersectBenchBED_gt39bp.vcf.gz matches https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_SV_benchmark_v0.01/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.vcf.gz

## Find benchmark variants between 35 and 49 base pairs in size and exclude overlapping tandem repeats plus slop 50bp on either side. Remove these from the benchmark regions bed so that it includes only SVs that are greater than 49 base pairs
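gunzip -c workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mfd_noMHC_intersectBenchBED_gt39bp.vcf.gz \
    > workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mfd_noMHC_intersectBenchBED_gt39bp.vcf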
python scripts/SVs_between_35_50bp.py \
    --input workflow/SV_benchmark/GRCh38/HG002v11-align2-GRCh38_noambig_norm_mfd_noMHC_intersectBenchBED_gt39bp.vcf \
    --output workflow/SV_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.02_SVs_gt34_and_lt_50bp.vcf

vcf2bed < workflow/SV_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.02_SVs_gt34_and_lt_50bp.vcf \
    > workflow/SV_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.02_SVs_gt34_and_lt_50bp.bed

intersectBed -wa \
    -a resources/giab_stratifications/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz \
    -b workflow/SV_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.02_SVs_gt34_and_lt_50bp.bed \
    | multiIntersectBed \
        -i stdin workflow/SV_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.02_SVs_gt34_and_lt_50bp.bed \
        |  awk '{FS=OFS="\t"} {print $1,$2-50,$3+50}' \
        > workflow/SV_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.02_gt34_and_lt_50bp_repeatexpanded_slop50.bed

bedtools subtract \
    -a workflow/SV_benchmark/GRCh38/HG002_GRCh38_MRG_draft_SV_benchmark.bed \
    -b workflow/SV_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.02_gt34_and_lt_50bp_repeatexpanded_slop50.bed \
    >  workflow/SV_benchmark/GRCh38/HG002_GRCh38_MRG_draft_SV_benchmark_removed_SVs_gt35bp_and_lt50bp.bed

In [None]:
bedtools subtract \
    -a workflow/SV_benchmark/GRCh38/HG002_GRCh38_MRG_draft_SV_benchmark_removed_SVs_gt35bp_and_lt50bp.bed \
    -b data/manually_created_files/GRCh38_CD4_gaps_slop50.bed \
    >  workflow/SV_benchmark/GRCh38/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.bed

In [None]:
# Steps for excluding errors found in curation of the MRG benchmark, in order to create v1.0 bed files for small variants and SVs
# Updated 4/27/21 to exclude 50bp on either side of errors and unsure variants found in curation, because sometimes a deletion and overlapping SNVs outside a repeat weren't excluded from benchmark stats. Also, remove ".;" at beginning of INFO field introduced by svanalyzer svwiden. Produces v1.00.01 small variant and structural variant MRG benchmarks

mkdir -p workflow/release_benchmark_generation
##errors to exclude found in curation of small variants
# download tab-delimited file from the GRCh37andGRCh38 tab in the Google doc that contains curations for all evaluations as well as common FPs, FNs, and FP_FNs - https://docs.google.com/spreadsheets/d/1Pn7WP78JfWKCO2Df31n_4gzDOwtP69flgDyyjeBS6JE/edit?usp=sharing
# create bed with sites curated as unsure or incorrect in the benchmark in GRCh38 coordinates

cut -f3,7,9,12,20 data/manually_created_files/combined\ curation\ responses\ from\ benchmarking\ with\ sm\ variant\ v0.02.03\ -\ GRCh37andGRCh38.tsv \
    | grep 'sure\|o' \
    | grep -v ^ref \
    | awk '{FS=OFS="\t"} {print $2, $3-50, $3+length($4)+50}' \
    | sed 's/^chr//' \
    | sort -k1,1n -k2,2n -k3,3n \
    | sed 's/^/chr/' \
    | mergeBed -i stdin -d 100 \
    > workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure.bed

#expand bed coordinates to completely cover any overlapping homopolymers and tandem repeats
intersectBed -wa \
    -a resources/giab_stratifications/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz \
    -b workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure.bed \
    | multiIntersectBed \
        -i stdin workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure.bed \
    | mergeBed \
        -i stdin \
    > workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure_repeatexpanded.bed


cd data/manually_created_files/

grep ^Common combined\ curation\ responses\ from\ benchmarking\ with\ sm\ variant\ v0.02.03\ -\ GRCh37andGRCh38.tsv \
    | cut -f3,7,9,12,20 \
    | grep 'sure\|o' \
    | grep -v ^ref \
    | awk '{FS=OFS="\t"} {print $2, $3-50, $3+length($4)+50}' \
    | sed 's/^chr//' \
    | sort -k1,1n -k2,2n -k3,3n \
    | sed 's/^/chr/' \
    | mergeBed -i stdin -d 100 \
    > ../../workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure_Commononly.bed

grep -v ^Common combined\ curation\ responses\ from\ benchmarking\ with\ sm\ variant\ v0.02.03\ -\ GRCh37andGRCh38.tsv \
    | cut -f3,7,9,12,20 \
    | grep 'sure\|o' \
    | grep -v ^ref \
    | grep ^GRCh38 \
    | awk '{FS=OFS="\t"} {print $2, $3, $3+length($4)}' \
    | sed 's/^chr//' \
    | sort -k1,1n -k2,2n -k3,3n \
    | sed 's/^/chr/' \
    > ../../workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure_evaluationonly.bed
cd ../..
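
The curated-site BED files above are built by padding each site by 50 bp on either side and then merging intervals that lie within 100 bp of each other (`mergeBed -d 100`). The sketch below mimics that padding and merging in plain awk on toy sites so the arithmetic can be checked by hand. It is only an approximation for illustration: `pad_and_merge` is a hypothetical helper, the three-column input (chrom, position, REF) is a simplification of the curation table, and the input is assumed to be sorted by chromosome and position.

```shell
# pad_and_merge PAD GAP: hypothetical helper; reads "chrom<TAB>pos<TAB>ref" on
# stdin (sorted), pads each site by PAD bp, and merges intervals whose start is
# within GAP bp of the previous interval's end.
pad_and_merge() {
    awk -v pad="$1" -v gap="$2" 'BEGIN {FS = OFS = "\t"}
        {
            pad_start = $2 - pad
            pad_end = $2 + length($3) + pad
            if (NR > 1 && $1 == chrom && pad_start <= cur_end + gap) {
                if (pad_end > cur_end) cur_end = pad_end
                next
            }
            if (NR > 1) print chrom, cur_start, cur_end
            chrom = $1; cur_start = pad_start; cur_end = pad_end
        }
        END {if (NR > 0) print chrom, cur_start, cur_end}'
}

# Toy sites: two nearby sites on chr1, one distant site, one site on chr2
printf 'chr1\t1000\tA\nchr1\t1100\tAT\nchr1\t5000\tA\nchr2\t1000\tA\n' \
    | pad_and_merge 50 100
```

The first two toy sites collapse into a single interval, 950-1152, while the distant chr1 site and the chr2 site each stay separate.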

In [None]:
wc -l workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure_evaluationonly.bed
#      63


intersectBed \
    -wa -a resources/giab_stratifications/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz \
    -b workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure_Commononly.bed \
    | multiIntersectBed \
        -i stdin workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure_Commononly.bed \
    | mergeBed -i stdin \
    | intersectBed -v \
        -a workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure_evaluationonly.bed \
        -b stdin \
    | wc -l
# 4

In [None]:
##errors to exclude found in complex variants in tandem repeats
# From the Google doc that contains curations of FNs and FN_FPs after comparing HiCanu2.1 dipcall vcf to MRG small variant benchmark v0.02.03, download xlsx and export as a tab-delimited file in Excel, since the tab-delimited option wasn't working in Google Docs - https://docs.google.com/spreadsheets/d/10s0vx8RzK_FmyVYDzNIfeBjsWyCMhOcc2mqjjPP6Jsk/edit?usp=sharing
# create bed with sites curated as unsure or incorrect in the benchmark in GRCh38 coordinates
cut -f6,7,9,14 data/manually_created_files/HiCanu_2.1_HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03_intersected_subtract_FPs_repeatexpanded_slop50_manual_curation_sites.tsv_manual_curation_sites.txt \
    | grep 'sure\|o' \
    | grep -v position \
    | awk '{FS=OFS="\t"} {print $1, $2, $2+length($3)}' \
    | sed 's/^chr//' \
    | sort -k1,1n -k2,2n -k3,3n \
    | sed 's/^/chr/' \
    | mergeBed -i stdin -d 100 \
    > workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_complexrepeat_errorsorunsure.bed

#expand bed coordinates to completely cover any overlapping homopolymers and tandem repeats
intersectBed -wa \
    -a resources/giab_stratifications/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz \
    -b workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_complexrepeat_errorsorunsure.bed \
    | multiIntersectBed \
        -i stdin workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_complexrepeat_errorsorunsure.bed \
    | mergeBed -i stdin > workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_complexrepeat_errorsorunsure_repeatexpanded.bed

#used NCBI Remap with default parameters to find coordinates on GRCh37, producing GRCh37_curation_medicalgene_smallvar_complexrepeat_errorsorunsure_repeatexpanded.bed (renamed from remapped_GRCh38_curation_medicalgene_smallvar_complexrepeat_errorsorunsure_repeatexpanded.bed)
#excluded these from the comparison of HiCanu to the GRCh37 MRG benchmark and curated the remaining sites in HiCanu_2.1_HG002_GRCh37_difficult_medical_gene_smallvar_benchmark_v0.02.03_manual_curation_sites.tsv; all were correct in the benchmark, so no additional bed is needed


##subtract curated errors from small variant benchmark beds

#exclude errors/unsure sites found in SV curation, small variant curation, small variant complex TR curation, and FPs from complex TR comparison
subtractBed \
    -a data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.bed \
    -b data/manually_created_files/GRCh38_curation_medicalgene_SV_errorsorunsure_repeatexpanded.bed \
    | subtractBed \
        -a stdin \
        -b workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_errorsorunsure_repeatexpanded.bed \
    | subtractBed \
        -a stdin \
        -b workflow/release_benchmark_generation/GRCh38_curation_medicalgene_smallvar_complexrepeat_errorsorunsure_repeatexpanded.bed \
    | subtractBed \
        -a stdin \
        -b data/manually_created_files/HiCanu_2.1_HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03_intersected_FPs_repeatexpanded_slop50.bed \
    > workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v1.00.01.bed


In [None]:
cp data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.vcf.gz \
    workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v1.00.01.vcf.gz 
cp data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.vcf.gz.tbi \
    workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v1.00.01.vcf.gz.tbi 


##subtract curated errors from SV benchmark beds

#download GRCh38 v0.01 bed from https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_HG002_medical_genes_SV_benchmark_v0.01/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.bed
#exclude errors/unsure sites found in SV curation
subtractBed \
    -a data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.bed \
    -b data/manually_created_files/GRCh38_curation_medicalgene_SV_errorsorunsure_repeatexpanded.bed \
    > workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.02.00.bed

In [None]:
#Make sure the bed size doesn't decrease more than expected
awk '{sum+=$3-$2} END {print sum}' data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.bed 
#11698514

awk '{sum+=$3-$2} END {print sum}' workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v1.00.01.bed 
#11663057

#Make sure the number of variants in the bed doesn't decrease more than expected
intersectBed \
    -a data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.vcf.gz \
    -b data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.bed \
    | wc -l
#21567

intersectBed \
    -a data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v0.02.03.vcf.gz \
    -b workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_smallvar_benchmark_v1.00.01.bed \
    | wc -l
#21232

#Make sure the bed size doesn't decrease more than expected
awk '{sum+=$3-$2} END {print sum}' data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.bed 
#11962175
awk '{sum+=$3-$2} END {print sum}' workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.02.00.bed 
#11961541

#Make sure the number of variants in the bed doesn't decrease more than expected
intersectBed \
    -a data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.vcf.gz \
    -b data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.bed \
    | wc -l
#218
intersectBed \
    -a data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.vcf.gz \
    -b workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.02.00.bed \
    | wc -l
#217
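
The checks above compare total BED size and variant counts before and after the curated exclusions. To put those numbers on a common scale, the sketch below computes the percent decrease for each pair of values recorded in the comments of that cell. `bed_bp` and `pct_decrease` are hypothetical helper names. `bed_bp` wraps the same `sum+=$3-$2` expression used throughout this notebook.

```shell
# bed_bp FILE: total number of bases covered by a BED file (sum of end - start)
bed_bp() {
    awk 'BEGIN {FS = "\t"} {sum += $3 - $2} END {print sum + 0}' "$1"
}

# pct_decrease BEFORE AFTER: percent decrease, printed with two decimals
pct_decrease() {
    awk -v before="$1" -v after="$2" \
        'BEGIN {printf "%.2f\n", 100 * (before - after) / before}'
}

# Values recorded in the comments of the cell above
pct_decrease 11698514 11663057   # small variant BED size
pct_decrease 21567 21232         # small variant count
pct_decrease 11962175 11961541   # SV BED size
pct_decrease 218 217             # SV count
```

When rerunning the workflow, the literal values can be replaced with calls such as `pct_decrease "$(bed_bp old.bed)" "$(bed_bp new.bed)"`, where `old.bed` and `new.bed` are placeholders for the two BED files being compared.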

In [None]:
cp data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.vcf.gz \
    workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.02.00.vcf.gz 
cp data/manually_created_files/cmrg_draft_benchmarks/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.01.vcf.gz.tbi \
    workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.02.00.vcf.gz.tbi 

#remove ".;" at beginning of INFO field introduced by svanalyzer svwiden
gunzip -c workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.02.00.vcf.gz  \
    | sed 's/\.;REPTYPE/REPTYPE/' \
    | bgzip -c \
    > workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v1.00.01.vcf.gz

tabix -f workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v1.00.01.vcf.gz

cp workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v0.02.00.bed \
    workflow/release_benchmark_generation/HG002_GRCh38_difficult_medical_gene_SV_benchmark_v1.00.01.bed 
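
The `sed` expression above strips the `.;` that precedes `REPTYPE` in the INFO column. A quick way to see its effect is to run it on toy records. In this sketch `strip_svwiden_prefix` is a hypothetical wrapper name, and the INFO values (`REPTYPE=DUP`, `SVLEN=1`, `DP=3`) are made-up examples rather than values taken from the benchmark VCF.

```shell
# strip_svwiden_prefix: hypothetical wrapper around the notebook's sed command
strip_svwiden_prefix() {
    sed 's/\.;REPTYPE/REPTYPE/'
}

# Toy records: the first has the ".;" prefix, the second does not
{
    printf 'chr1\t100\t.\tA\tAT\t.\t.\t.;REPTYPE=DUP;SVLEN=1\n'
    printf 'chr1\t500\t.\tA\tT\t.\t.\tDP=3\n'
} | strip_svwiden_prefix
```

Records without the prefix pass through unchanged. The pattern is not anchored to a column, so it relies on `.;REPTYPE` not occurring elsewhere in a line.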

### Software versions

See `environment.yml` for dependencies.