# Prepare Stratification BEDs for GitHub and Validation

This notebook describes the process for preparing the new (2019) GRCh37 and GRCh38 stratification BEDs for deposit on GitHub. The files were transferred from the following directories on the NAS for processing and validation within the GRCh38_stratification_validation R project directory.

#### GRCh37 Stratifications
/Volumes/giab/analysis/benchmarking-tools/resources/stratification-bed-files

#### GRCh38 Stratifications
/Volumes/giab/analysis/GRCh38_stratification_bed_files

#### Ns files
The refN and PSA Y files used for subtraction were generated with refN_analysis_and_creation.sh.

## Prepare first batch of files

In [1]:
date
WKDIR=/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/
INPUT38=${WKDIR}GRCh38/
INPUT37=${WKDIR}GRCh37/
OUTPUT38=${WKDIR}GRCh38/sorted_merged_noNs/
OUTPUT37=${WKDIR}GRCh37/sorted_merged_noNs/
refNs=${WKDIR}refNs/

Mon Jan  6 15:53:06 EST 2020


#### Create header file for BEDs

In [2]:
echo "#This is part of a collection of bed files intended to stratify performance by genome context using GA4GH benchmarking tools" > ${WKDIR}header.txt
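
As a rough illustration of how this header ends up in each output file, the sketch below prepends a header line to a toy BED stream, compresses it, and confirms that the first decompressed line is the header. It uses `gzip` in place of `bgzip`, a temporary directory, and made-up records; all three are assumptions made only so the example runs without the project files.

```shell
# Hypothetical scratch area; the notebook itself writes to ${WKDIR}header.txt
tmp=$(mktemp -d)
echo "#This is part of a collection of bed files intended to stratify performance by genome context using GA4GH benchmarking tools" > "$tmp/header.txt"

# Toy BED records (made-up coordinates), header prepended with cat as in the pipeline
printf 'chr1\t100\t200\nchr2\t50\t75\n' |
    cat "$tmp/header.txt" - |
    gzip -c > "$tmp/toy.bed.gz"

# The first line of the compressed file should be the header, the rest the records
first_line=$(gzip -dc "$tmp/toy.bed.gz" | head -n 1)
record_count=$(gzip -dc "$tmp/toy.bed.gz" | grep -vc '^#')
echo "$first_line"
echo "records: $record_count"
```

The same check could be pointed at a real output file by replacing the toy path, since bgzip output can be read with ordinary gzip tools.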

### Cut to chrom 1:22 XY -> sort -> merge -> subtract refNs (gap) -> subtract pseudoautosomal Y regions
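
The chromosome handling in this pipeline (strip `chr`, recode X/Y as 23/24, drop non-primary contigs, numeric sort, restore names) can be sketched without bedtools. The records below are made up, and the final `chr` restore applies to GRCh38-style naming only; the merge and subtract steps are left out because they need bedtools and the refNs files.

```shell
# Toy records (made-up coordinates): X, Y, mitochondrial, and an unplaced contig
toy=$(printf 'chrX\t5\t10\nchr10\t1\t5\nchr2\t7\t9\nchrM\t1\t2\nchrUn_KI270302v1\t1\t2\nchr2\t1\t3\nchrY\t2\t4\n')

# 1. drop the chr prefix            2. recode X/Y so a numeric sort puts them last
# 3. remove M, Un, HS, alt contigs  4. sort on chrom, start, end
# 5. restore X/Y                    6. restore the chr prefix
normalized=$(
    printf '%s\n' "$toy" |
    sed 's/^chr//' |
    sed 's/^X/23/;s/^Y/24/' |
    grep -Ev '^M|^[0-9][0-9]_|^[0-9]_|^[0-9]\||^[0-9][0-9]\||^Un|^HS' |
    sort -k1,1n -k2,2n -k3,3n |
    sed 's/^23/X/;s/^24/Y/' |
    sed 's/^[a-zA-Z0-9_]/chr&/'
)
printf '%s\n' "$normalized"
# chr2   1  3
# chr2   7  9
# chr10  1  5
# chrX   5  10
# chrY   2  4
```

Recoding X and Y as numbers is what lets a single numeric sort key order them after chromosome 22.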

#### GRCh38

In [5]:
date
bedtools --version 

for f in ${INPUT38}*.gz
do
    OUTFILE=${OUTPUT38}/$( basename ${f})

    # Strip chr and recode X/Y as 23/24 so the numeric sort orders them last,
    # drop M, Un, HS and alt contigs, sort, merge, restore names,
    # then subtract gaps and the pseudoautosomal Y regions
    gzcat $f |
    sed 's/^chr//' |
    sed 's/^X/23/;s/^Y/24/' |
    grep -Ev '^M|^[0-9][0-9]_|^[0-9]_|^[0-9]\||^[0-9][0-9]\||^Un|^HS' |
    sort -k1,1n -k2,2n -k3,3n |
    mergeBed -i stdin |
    sed 's/^23/X/;s/^24/Y/' |
    sed 's/^[a-zA-Z0-9_]/chr&/' |
    subtractBed -a stdin -b ${refNs}gap_38_1thruY_sorted_merged.bed |
    subtractBed -a stdin -b ${refNs}PSA_Y_GRCh38.bed |
    cat ${WKDIR}header.txt - |
    bgzip > ${OUTFILE}
    echo "processed:  " $( basename ${f})
    md5 ${OUTFILE} >> GRCh38_stratificationBEDs_for_Github_md5s.txt

done

Mon Jan  6 16:44:24 EST 2020
bedtools v2.29.1
processed:   GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz
processed:   GRCh38_AllTandemRepeats_201to10000bp_slop5.bed.gz
processed:   GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz
processed:   GRCh38_AllTandemRepeats_gt10000bp_slop5.bed.gz
processed:   GRCh38_AllTandemRepeats_gt100bp_slop5.bed.gz
processed:   GRCh38_AllTandemRepeats_lt51bp_slop5.bed.gz
processed:   GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz
processed:   GRCh38_SimpleRepeat_diTR_11to50_slop5.bed.gz
processed:   GRCh38_SimpleRepeat_diTR_51to200_slop5.bed.gz
processed:   GRCh38_SimpleRepeat_diTR_gt200_slop5.bed.gz
processed:   GRCh38_SimpleRepeat_homopolymer_4to6_slop5.bed.gz
processed:   GRCh38_SimpleRepeat_homopolymer_7to11_slop5.bed.gz
processed:   GRCh38_SimpleRepeat_homopolymer_gt11_slop5.bed.gz
processed:   GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5.bed.gz
processed:   GRCh38_SimpleRepeat_quadTR_20to50_slop5.bed.gz
processed:   GRCh38_SimpleRep

#### GRCh37

In [6]:
date
bedtools --version 

for f in ${INPUT37}*.gz
do
    OUTFILE=${OUTPUT37}/$( basename ${f})

    gzcat $f |
    sed 's/^chr//' |
    sed 's/^X/23/;s/^Y/24/' |
    grep -Ev '^M|^[0-9][0-9]_|^[0-9]_|^[0-9]\||^[0-9][0-9]\||^Un|^HS' |
    sort -k1,1n -k2,2n -k3,3n |
    mergeBed -i stdin |
    sed 's/^23/X/;s/^24/Y/' |
    subtractBed -a stdin -b ${refNs}hg19.gap_1thruY_sorted_merged.bed |
    subtractBed -a stdin -b ${refNs}PSA_Y_hg19.bed |
    cat ${WKDIR}header.txt - |
    bgzip > ${OUTFILE}
    echo "processed:  " $( basename ${f})
    md5 ${OUTFILE} >> GRCh37_stratificationBEDs_for_Github_md5s.txt

done

Mon Jan  6 17:03:28 EST 2020
bedtools v2.29.1
processed:   AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz
processed:   AllTandemRepeats_201to10000bp_slop5.bed.gz
processed:   AllTandemRepeats_51to200bp_slop5.bed.gz
processed:   AllTandemRepeats_gt10000bp_slop5.bed.gz
processed:   AllTandemRepeats_gt100bp_slop5.bed.gz
processed:   AllTandemRepeats_lt51bp_slop5.bed.gz
processed:   AllTandemRepeatsandHomopolymers_slop5.bed.gz
processed:   BadPromoters_gb-2013-14-5-r51-s1.bed.gz
processed:   GRCh37_chainSelf_sorted_merged.bed.gz
processed:   GRCh37_chainSelf_sorted_merged_gt10kb.bed.gz
processed:   GRCh37_notinchainSelf_sorted_merged.bed.gz
processed:   GRCh37_notinchainSelf_sorted_merged_gt10kb.bed.gz
processed:   HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_RTGandPGphasetransfer_comphetindel10bp_slop50.bed.gz
processed:   HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf_RTGandPGphasetransfer_comphetsnp10bp_

#### Notes from running script for both genomes:

1) I had issues merging the genome-specific SVs after subtractBed because mergeBed saw the files as unsorted, so I reordered the process so that the merge completes before subtractBed. It turns out this might not be necessary: I reran and still had issues, and the issues are inconsistent. I have tried running in a different browser, running as a shell script, clearing the output and restarting the kernel, and running on a different machine. The "sort" error occurs inconsistently and not always on the same file. The malloc error has been a problem with GRCh38_SimpleRepeat_homopolymer_4to6_slop5.bed.gz; however, I have gotten it to complete successfully. The errors encountered are below:

Sort Error: "Error: Sorted input specified, but the file stdin has the following out of order record"

Memory Error: "sort(20634,0x1131065c0) malloc: *** error for object 0x7ff200000001: pointer being freed was not allocated
sort(20634,0x1131065c0) malloc: *** set a breakpoint in malloc_error_break to debug"

2) While troubleshooting, NDO and I tried replacing the UNIX sort with bedtools sortBed. While this appeared to work, the output file was not sorted by chromosome following mergeBed. I decided to continue with the UNIX sort, which does produce what appears to be a sorted, merged file.

3) For BEDs, Justin wants to make sure we sort on all three fields: sort -k1,1n -k2,2n -k3,3n
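
One way to confirm that an output really is sorted and merged, independent of bedtools, is a small awk pass over the records. The sketch below is only an illustration: `check_bed` is a hypothetical helper name, the records are made up, and it does not diagnose the intermittent sort or malloc errors described above.

```shell
# Hypothetical helper: succeeds if the BED on stdin is grouped by chromosome,
# sorted by start within each chromosome, and free of overlapping intervals.
check_bed() {
    awk -F'\t' '
        /^#/ { next }
        $1 != chrom {
            if ($1 in seen) { print "chromosome block repeated: " $0; bad = 1; exit }
            seen[$1] = 1; chrom = $1; prev_start = -1; prev_end = -1
        }
        ($2 + 0) < prev_start { print "out of order: " $0; bad = 1; exit }
        ($2 + 0) < prev_end   { print "overlaps previous: " $0; bad = 1; exit }
        { prev_start = $2 + 0; prev_end = $3 + 0 }
        END { exit bad + 0 }
    '
}

# Toy inputs (made-up coordinates)
printf 'chr1\t1\t5\nchr1\t7\t9\nchr2\t1\t3\n' | check_bed && echo "sorted and non-overlapping"
printf 'chr1\t7\t9\nchr1\t1\t5\n' | check_bed || echo "problem found"
```

For a real file the input would come from something like `gzcat file.bed.gz | check_bed`; header lines starting with `#` are skipped.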


#### Notes on BEDs:

1) After talking with NDO, I decided to work only with refseq_union_cds.bed.gz (for GRCh37). The other files, which had been sorted and trimmed down to 3 fields, appeared to have issues. The script in this notebook should accomplish what was being done with those files.

2) Previously I had not included HG002_SVs_Tier1plusTier2_v0.6.1.bed.gz (for GRCh37) in my validation. I asked JW to download it from DNAnexus for me because it was not on the NAS. I have now added this file for comparison with the remapped version in the validation and have put it on the NAS as well.

3) Filenames in the output directories will be cleaned up to remove any reference to "sort" and/or "merge". All files in the output directories ...sorted_merged_noNs/ have been sorted and merged. The use of "sort" and "merge" in the names was inconsistent, so we removed them for consistency. It will not hurt if someone sorts and merges these files again, and those who have not sorted/merged will benefit.

4) To clarify the difference: only a subset of the GRCh37 mappability BEDs were generated for GRCh38. Justin created lowmappabilityall_GRCh38equivalent.bed.gz as the union of the GRCh37 mappability BEDs that have matching GRCh38 BEDs, whereas lowmappabilityall.bed.gz is the union of all GRCh37 low-mappability BEDs. Only lowmappabilityall_GRCh38equivalent.bed.gz will be in this release; lowmappabilityall.bed.gz will be removed from the directory.
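
The filename cleanup in note 3 could be scripted along the following lines. This is a sketch only: `clean_name` is a hypothetical helper, and the pattern (a `_` or `.` followed by sort/sorted/merge/merged) is an assumption about how the names should be cleaned, not the rule actually applied. It prints the proposed names rather than renaming anything.

```shell
# Hypothetical helper: remove "_sorted", "_merged", ".sort", etc. from a filename
clean_name() {
    printf '%s\n' "$1" | sed -E 's/[._](sorted|sort|merged|merge)//g'
}

# Dry run on two filenames that appear in this notebook's output
for name in GRCh37_chainSelf_sorted_merged_gt10kb.bed.gz \
            human_g1k_v37_gemmap_l250_m2_e1_nonuniq.sort.bed.gz
do
    echo "$name -> $(clean_name "$name")"
done
```

Reviewing the dry-run output before calling `mv` guards against two inputs collapsing to the same cleaned name.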

## Prepare second batch of files

Justin decided to include some "all", "notin all", and genome-specific files that were not originally included. I'm running these as a separate code chunk because of the sort and memory errors I had previously. The same script is used but points to different directories that house the additional files.

In [1]:
date
WKDIR=/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/
INPUT38batch2=${WKDIR}GRCh38/newfiles_from_JZ_for_processing/
INPUT37batch2=${WKDIR}GRCh37/newfiles_from_JZ_for_processing/
OUTPUT38=${WKDIR}GRCh38/sorted_merged_noNs/
OUTPUT37=${WKDIR}GRCh37/sorted_merged_noNs/
refNs=${WKDIR}refNs/

Wed Jan 22 16:12:02 EST 2020


<font bold color='red'>NOTE: second batch output files (OUTPUTXX) in GRChXX/sorted_merged_noNs/ were moved to GRChXX/sorted_merged_noNs/second_batch_from_JZ. This was done to aid in the validation of these files with Second_Batch_ChromANDCoveragePlots.R script.  In hindsight I should have just made a directory of symbolic links rather than moving the files.</font>
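
The symbolic-link alternative mentioned in the note might look like the sketch below. The directory and file names are placeholders created in a temporary location, not the real project paths; the point is only that the originals stay where they are while a separate directory exposes the subset needed for validation.

```shell
# Placeholder for GRChXX/sorted_merged_noNs/ with three stand-in output files
src=$(mktemp -d)
touch "$src/a.bed.gz" "$src/b.bed.gz" "$src/c.bed.gz"

# Directory of links for the subset to validate (hypothetical name)
linkdir="$src/second_batch_links"
mkdir -p "$linkdir"

for name in a.bed.gz b.bed.gz
do
    # Absolute target so the link resolves from any working directory
    ln -s "$src/$name" "$linkdir/$name"
done

ls -l "$linkdir"
```

Removing the link directory afterwards leaves the processed files untouched.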

#### GRCh37

In [2]:
##Some input files were zipped and others were not.  This is to check if file is zipped and if not then zip.

date
for f in ${INPUT37batch2}/original_files/*
do
OUTFILE=${INPUT37batch2}/$( basename ${f})

if [[ $f == *.bed ]] 
then
    echo "bgzipped: "$( basename ${f})
    bgzip < $f > ${OUTFILE}.gz
else
     cp $f ${INPUT37batch2}/$( basename ${f})
     echo "already zipped: "$( basename ${f})
fi
done

Wed Jan 22 16:12:15 EST 2020
already zipped: HG001_genomespecific_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
bgzipped: HG002_GRCh37_1_22_v4.1_draft_benchmark_comphetindel10bp_slop50.bed
bgzipped: HG002_GRCh37_1_22_v4.1_draft_benchmark_comphetsnp10bp_slop50.bed
bgzipped: HG002_GRCh37_1_22_v4.1_draft_benchmark_complexindel10bp_slop50.bed
bgzipped: HG002_GRCh37_1_22_v4.1_draft_benchmark_othercomplexwithin10bp_slop50.bed
bgzipped: HG002_GRCh37_1_22_v4.1_draft_benchmark_snpswithin10bp_slop50.bed
already zipped: HG002_genomespecific_CNVsandSVs_v4.1.bed.gz
already zipped: HG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: HG002_genomespecific_complexandSVs_alldifficultregions_v4.1.bed.gz
already zipped: HG002_genomespecific_complexandSVs_v4.1.bed.gz
bgzipped: HG2_SKor_TrioONTCanu_intersect_HG2_SKor_TrioONTFlye_intersect_HG2_SKor_CCS15_gt10kb_GRCh37.bed
bgzipped: L1Hs_gt500.bed
bgzipped: MHC_GRCh37.bed
bgzipped: SP_SPonly_primary_assm_alignments_sorted_merge
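
The cell above decides whether a file is compressed from its `.bed` extension. A content-based check is an alternative; the sketch below uses `gzip -t`, which should also accept bgzip output since that format is gzip-compatible. `is_gzipped` is a hypothetical helper and the files are toy stand-ins.

```shell
# Hypothetical helper: succeeds if the file is a valid gzip stream
is_gzipped() {
    gzip -t "$1" 2>/dev/null
}

# Toy inputs: one plain BED and one compressed BED (made-up records)
tmp=$(mktemp -d)
printf 'chr1\t1\t5\n' > "$tmp/plain.bed"
printf 'chr1\t1\t5\n' | gzip -c > "$tmp/zipped.bed.gz"

for f in "$tmp"/*
do
    if is_gzipped "$f"
    then
        echo "already zipped: $(basename "$f")"
    else
        echo "needs zipping: $(basename "$f")"
    fi
done
```

This catches a compressed file saved with a `.bed` name, or a plain file with a `.gz` name, which an extension test would misclassify.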

In [4]:
date
bedtools --version 

for f in ${INPUT37batch2}*.gz
do
    OUTFILE=${OUTPUT37}/$( basename ${f})

    gzcat $f |
    sed 's/^chr//' |
    sed 's/^X/23/;s/^Y/24/' |
    grep -Ev '^M|^MT|^[0-9][0-9]_|^[0-9]_|^[0-9]\||^[0-9][0-9]\||^Un|^HS' |
    sort -k1,1n -k2,2n -k3,3n |
    mergeBed -i stdin |
    sed 's/^23/X/;s/^24/Y/' |
    subtractBed -a stdin -b ${refNs}hg19.gap_1thruY_sorted_merged.bed |
    subtractBed -a stdin -b ${refNs}PSA_Y_hg19.bed |
    cat ${WKDIR}header.txt - |
    bgzip > ${OUTFILE}
    echo "processed:  " $( basename ${f})
    md5 ${OUTFILE} >> GRCh37_stratificationBEDs_for_Github_md5s.txt

done

Wed Jan 22 16:26:38 EST 2020
bedtools v2.29.1
processed:   HG001_genomespecific_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
processed:   HG002_GRCh37_1_22_v4.1_draft_benchmark_comphetindel10bp_slop50.bed.gz
processed:   HG002_GRCh37_1_22_v4.1_draft_benchmark_comphetsnp10bp_slop50.bed.gz
processed:   HG002_GRCh37_1_22_v4.1_draft_benchmark_complexindel10bp_slop50.bed.gz
processed:   HG002_GRCh37_1_22_v4.1_draft_benchmark_othercomplexwithin10bp_slop50.bed.gz
processed:   HG002_GRCh37_1_22_v4.1_draft_benchmark_snpswithin10bp_slop50.bed.gz
processed:   HG002_genomespecific_CNVsandSVs_v4.1.bed.gz
processed:   HG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   HG002_genomespecific_complexandSVs_alldifficultregions_v4.1.bed.gz
processed:   HG002_genomespecific_complexandSVs_v4.1.bed.gz
processed:   HG2_SKor_TrioONTCanu_intersect_HG2_SKor_TrioONTFlye_intersect_HG2_SKor_CCS15_gt10kb_GRCh37.bed.gz
processed:   L1Hs_gt500.bed.gz
processed:   MHC_GRCh37.bed.gz
proces

#### GRCh38

In [5]:
##Some input files were zipped and others were not.  This is to check if file is zipped and if not then zip.

date
for f in ${INPUT38batch2}/original_files/*
do
OUTFILE=${INPUT38batch2}/$( basename ${f})

if [[ $f == *.bed ]] 
then
    echo "bgzipped: "$( basename ${f})
    bgzip < $f > ${OUTFILE}.gz
else
     cp $f ${INPUT38batch2}/$( basename ${f})
     echo "already zipped: "$( basename ${f})
fi
done

Wed Jan 22 16:40:25 EST 2020
bgzipped: GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set_REF_N_slop_15kb.bed
already zipped: GRCh38_HG002_genomespecific_CNVsandSVs_v4.1.bed.gz
already zipped: GRCh38_alldifficultregions.bed.gz
already zipped: GRCh38_alllowmapandsegdupregions.bed.gz
bgzipped: GRCh38_mrcanavar_intersect_ccs_1000_window_size_cnv_threshold_intersect_ont_1000_window_size_cnv_threshold.bed
already zipped: GRCh38_notinalldifficultregions.bed.gz
already zipped: GRCh38_notinalllowmapandsegdupregions.bed.gz
bgzipped: GRCh38_union_HG002_CCS_15kb_20kb_merged_ONT_1000_window_size_combined_elliptical_outlier_threshold.bed
already zipped: HG001_genomespecific_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
bgzipped: HG002_GRCh38_1_22_v4.1_draft_benchmark_comphetindel10bp_slop50.bed
bgzipped: HG002_GRCh38_1_22_v4.1_draft_benchmark_comphetsnp10bp_slop50.bed
bgzipped: HG002_GRCh38_1_22_v4.1_draft_benchmark_complexindel10bp_slop50.bed
bgzipped: HG002_GRCh38_1_22_v4.1_draft_benchmark_ot

In [7]:
date
bedtools --version 

for f in ${INPUT38batch2}*.gz
do
    OUTFILE=${OUTPUT38}/$( basename ${f})

    gzcat $f |
    sed 's/^chr//' |
    sed 's/^X/23/;s/^Y/24/' |
    grep -Ev '^M|^MT|^[0-9][0-9]_|^[0-9]_|^[0-9]\||^[0-9][0-9]\||^Un|^HS' |
    sort -k1,1n -k2,2n -k3,3n |
    mergeBed -i stdin |
    sed 's/^23/X/;s/^24/Y/' |
    sed 's/^[a-zA-Z0-9_]/chr&/' |
    subtractBed -a stdin -b ${refNs}gap_38_1thruY_sorted_merged.bed |
    subtractBed -a stdin -b ${refNs}PSA_Y_GRCh38.bed |
    cat ${WKDIR}header.txt - |
    bgzip > ${OUTFILE}
    echo "processed:  " $( basename ${f})
    md5 ${OUTFILE} >> GRCh38_stratificationBEDs_for_Github_md5s.txt

done

Wed Jan 22 17:05:49 EST 2020
bedtools v2.29.1
processed:   GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set_REF_N_slop_15kb.bed.gz
processed:   GRCh38_HG002_genomespecific_CNVsandSVs_v4.1.bed.gz
processed:   GRCh38_alldifficultregions.bed.gz
processed:   GRCh38_alllowmapandsegdupregions.bed.gz
processed:   GRCh38_mrcanavar_intersect_ccs_1000_window_size_cnv_threshold_intersect_ont_1000_window_size_cnv_threshold.bed.gz
processed:   GRCh38_notinalldifficultregions.bed.gz
processed:   GRCh38_notinalllowmapandsegdupregions.bed.gz
processed:   GRCh38_union_HG002_CCS_15kb_20kb_merged_ONT_1000_window_size_combined_elliptical_outlier_threshold.bed.gz
processed:   HG001_genomespecific_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
processed:   HG002_GRCh38_1_22_v4.1_draft_benchmark_comphetindel10bp_slop50.bed.gz
processed:   HG002_GRCh38_1_22_v4.1_draft_benchmark_comphetsnp10bp_slop50.bed.gz
processed:   HG002_GRCh38_1_22_v4.1_draft_benchmark_complexindel10bp_slop50.bed.gz
processed:   HG0

## Prepare third batch of files

These are files Justin had prepared for the previous (second) batch; however, I missed copying them over.

In [1]:
date
WKDIR=/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/
ckzip38=${WKDIR}GRCh38/newfiles_from_JZ_for_processing/original_files/added_012320/
ckzip37=${WKDIR}GRCh37/newfiles_from_JZ_for_processing/original_files/added_012320/
INPUT38batch3=${WKDIR}GRCh38/newfiles_from_JZ_for_processing/third_batch/
INPUT37batch3=${WKDIR}GRCh37/newfiles_from_JZ_for_processing/third_batch/
OUTPUT38=${WKDIR}GRCh38/sorted_merged_noNs/
OUTPUT37=${WKDIR}GRCh37/sorted_merged_noNs/
refNs=${WKDIR}refNs/

Fri Jan 24 13:11:33 EST 2020


#### GRCh37

In [2]:
##Some input files were zipped and others were not.  This is to check if file is zipped and if not then zip.

date
for f in ${ckzip37}/*
do
OUTFILE=${INPUT37batch3}/$( basename ${f})

if [[ $f == *.bed ]] 
then
    echo "bgzipped: "$( basename ${f})
    bgzip < $f > ${OUTFILE}.gz
else
     cp $f ${INPUT37batch3}/$( basename ${f})
     echo "already zipped: "$( basename ${f})
fi
done

Fri Jan 24 13:11:38 EST 2020
already zipped: HG001_genomespecific_complexandSVs_v3.3.2_PG_RTG.bed.gz
already zipped: HG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: HG002_genomespecific_complexandSVs_v3.3.2.bed.gz
already zipped: HG003_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: HG003_genomespecific_complexandSVs_v3.3.2.bed.gz
already zipped: HG004_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: HG004_genomespecific_complexandSVs_v3.3.2.bed.gz
already zipped: HG005_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: HG005_genomespecific_complexandSVs_v3.3.2.bed.gz
already zipped: notinHG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: notinHG003_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: notinHG004_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: notinHG005_ge

In [3]:
date
bedtools --version 

for f in ${INPUT37batch3}*.gz
do
    OUTFILE=${OUTPUT37}/$( basename ${f})

    gzcat $f |
    sed 's/^chr//' |
    sed 's/^X/23/;s/^Y/24/' |
    grep -Ev '^M|^MT|^[0-9][0-9]_|^[0-9]_|^[0-9]\||^[0-9][0-9]\||^Un|^HS' |
    sort -k1,1n -k2,2n -k3,3n |
    mergeBed -i stdin |
    sed 's/^23/X/;s/^24/Y/' |
    subtractBed -a stdin -b ${refNs}hg19.gap_1thruY_sorted_merged.bed |
    subtractBed -a stdin -b ${refNs}PSA_Y_hg19.bed |
    cat ${WKDIR}header.txt - |
    bgzip > ${OUTFILE}
    echo "processed:  " $( basename ${f})
    md5 ${OUTFILE} >> GRCh37_stratificationBEDs_for_Github_md5s.txt

done

Fri Jan 24 13:11:47 EST 2020
bedtools v2.29.1
processed:   HG001_genomespecific_complexandSVs_v3.3.2_PG_RTG.bed.gz
processed:   HG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   HG002_genomespecific_complexandSVs_v3.3.2.bed.gz
processed:   HG003_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   HG003_genomespecific_complexandSVs_v3.3.2.bed.gz
processed:   HG004_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   HG004_genomespecific_complexandSVs_v3.3.2.bed.gz
processed:   HG005_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   HG005_genomespecific_complexandSVs_v3.3.2.bed.gz
processed:   notinHG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   notinHG003_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   notinHG004_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   notinHG005_genomespecific_complexan

#### GRCh38

In [4]:
##Some input files were zipped and others were not.  This is to check if file is zipped and if not then zip.

date
for f in ${ckzip38}/*
do
OUTFILE=${INPUT38batch3}/$( basename ${f})

if [[ $f == *.bed ]] 
then
    echo "bgzipped: "$( basename ${f})
    bgzip < $f > ${OUTFILE}.gz
else
     cp $f ${INPUT38batch3}/$( basename ${f})
     echo "already zipped: "$( basename ${f})
fi
done

Fri Jan 24 13:24:38 EST 2020
already zipped: GRCh38_allOtherDifficultregions.bed.gz
already zipped: GRCh38_remapped_HG001_genomespecific_complexandSVs_v3.3.2_PG_RTG.bed.gz
already zipped: GRCh38_remapped_HG002_genomespecific_complexandSVs_v3.3.2.bed.gz
already zipped: GRCh38_remapped_HG003_genomespecific_complexandSVs_v3.3.2.bed.gz
already zipped: GRCh38_remapped_HG004_genomespecific_complexandSVs_v3.3.2.bed.gz
already zipped: GRCh38_remapped_HG005_genomespecific_complexandSVs_v3.3.2.bed.gz


In [5]:
date
bedtools --version 

for f in ${INPUT38batch3}*.gz
do
    OUTFILE=${OUTPUT38}/$( basename ${f})

    gzcat $f |
    sed 's/^chr//' |
    sed 's/^X/23/;s/^Y/24/' |
    grep -Ev '^M|^MT|^[0-9][0-9]_|^[0-9]_|^[0-9]\||^[0-9][0-9]\||^Un|^HS' |
    sort -k1,1n -k2,2n -k3,3n |
    mergeBed -i stdin |
    sed 's/^23/X/;s/^24/Y/' |
    sed 's/^[a-zA-Z0-9_]/chr&/' |
    subtractBed -a stdin -b ${refNs}gap_38_1thruY_sorted_merged.bed |
    subtractBed -a stdin -b ${refNs}PSA_Y_GRCh38.bed |
    cat ${WKDIR}header.txt - |
    bgzip > ${OUTFILE}
    echo "processed:  " $( basename ${f})
    md5 ${OUTFILE} >> GRCh38_stratificationBEDs_for_Github_md5s.txt

done

Fri Jan 24 13:24:48 EST 2020
bedtools v2.29.1
processed:   GRCh38_allOtherDifficultregions.bed.gz
processed:   GRCh38_remapped_HG001_genomespecific_complexandSVs_v3.3.2_PG_RTG.bed.gz
processed:   GRCh38_remapped_HG002_genomespecific_complexandSVs_v3.3.2.bed.gz
processed:   GRCh38_remapped_HG003_genomespecific_complexandSVs_v3.3.2.bed.gz
processed:   GRCh38_remapped_HG004_genomespecific_complexandSVs_v3.3.2.bed.gz
processed:   GRCh38_remapped_HG005_genomespecific_complexandSVs_v3.3.2.bed.gz


## 1/30/20 Reprocess HG002_GRCh38_1_22_v4.1_draft_benchmark_othercomplexwithin10bp_slop50.bed
After plotting coverage for the prior version (in the second batch) of HG002_GRCh38_1_22_v4.1_draft_benchmark_othercomplexwithin10bp_slop50.bed, we noticed that coverage for GRCh38 was significantly higher than for GRCh37. JZ found a problem with the preparation of the original file, so he remade it. The 1/30/20 file replaces any previous versions of this file.
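
A quick numeric companion to the coverage plots is the total number of bases each BED covers (sum of end minus start), which makes a large GRCh37/GRCh38 discrepancy easy to spot. The sketch below assumes merged, non-overlapping intervals so that the sum is not inflated; `bed_bases` is a hypothetical helper and the two files are toy stand-ins with made-up coordinates.

```shell
# Hypothetical helper: total bases covered by a gzipped BED, skipping # header lines
bed_bases() {
    gzip -dc "$1" |
    awk -F'\t' '!/^#/ { total += $3 - $2 } END { printf "%.0f\n", total }'
}

# Toy files standing in for the GRCh37 and GRCh38 versions of one stratification
tmp=$(mktemp -d)
printf '#header\nchr1\t100\t200\nchr1\t300\t350\n' | gzip -c > "$tmp/toy_GRCh37.bed.gz"
printf '#header\nchr1\t100\t200\nchr1\t300\t350\nchr2\t0\t1000\n' | gzip -c > "$tmp/toy_GRCh38.bed.gz"

bases37=$(bed_bases "$tmp/toy_GRCh37.bed.gz")
bases38=$(bed_bases "$tmp/toy_GRCh38.bed.gz")
echo "GRCh37 toy: $bases37 bp; GRCh38 toy: $bases38 bp"
```

`printf "%.0f"` is used instead of `print` so that genome-scale totals are not shown in exponent notation by some awk implementations.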

In [4]:
date
WKDIR=/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/
ckzip38=${WKDIR}GRCh38/newfiles_from_JZ_for_processing/original_files/added_013020/
INPUT38batch2=${WKDIR}GRCh38/newfiles_from_JZ_for_processing/
OUTPUT38=${WKDIR}GRCh38/sorted_merged_noNs/second_batch_from_JZ/
refNs=${WKDIR}refNs/

Fri Jan 31 10:03:20 EST 2020


In [5]:
date
for f in ${ckzip38}/*
do
OUTFILE=${INPUT38batch2}/$( basename ${f})

if [[ $f == *.bed ]] 
then
    echo "bgzipped: "$( basename ${f})
    bgzip < $f > ${OUTFILE}.gz
else
     cp $f ${INPUT38batch2}/$( basename ${f})
     echo "already zipped: "$( basename ${f})
fi
done

Fri Jan 31 10:03:57 EST 2020
bgzipped: HG002_GRCh38_1_22_v4.1_draft_benchmark_othercomplexwithin10bp_slop50.bed


In [6]:
date
bedtools --version 

# Single file, so no loop: INFILE is the file bgzipped in the previous cell
INFILE=${INPUT38batch2}HG002_GRCh38_1_22_v4.1_draft_benchmark_othercomplexwithin10bp_slop50.bed.gz
OUTFILE=${OUTPUT38}/$( basename ${INFILE})

gzcat ${INFILE} |
sed 's/^chr//' |
sed 's/^X/23/;s/^Y/24/' |
grep -Ev '^M|^MT|^[0-9][0-9]_|^[0-9]_|^[0-9]\||^[0-9][0-9]\||^Un|^HS' |
sort -k1,1n -k2,2n -k3,3n |
mergeBed -i stdin |
sed 's/^23/X/;s/^24/Y/' |
sed 's/^[a-zA-Z0-9_]/chr&/' |
subtractBed -a stdin -b ${refNs}gap_38_1thruY_sorted_merged.bed |
subtractBed -a stdin -b ${refNs}PSA_Y_GRCh38.bed |
cat ${WKDIR}header.txt - |
bgzip > ${OUTFILE}
# ${f} is not set in this cell; it still holds the last value from the previous
# cell's loop (the original .bed path), so the name is echoed without .gz
echo "processed:  " $( basename ${f})
md5 ${OUTFILE} >> GRCh38_stratificationBEDs_for_Github_md5s.txt


Fri Jan 31 10:04:50 EST 2020
bedtools v2.29.1
processed:   HG002_GRCh38_1_22_v4.1_draft_benchmark_othercomplexwithin10bp_slop50.bed


## 2/11/20 JZ revised union files to incorporate decoy (mm-2-merged.bed)
JZ realized he had not used the decoy file (mm-2-merged.bed) in his union files, so he re-ran Merge_difficult_regions.sh for the following files. Note: alllowmapandsegdupregions.bed.gz and notinalllowmapandsegdupregions.bed.gz did not change but were re-run because they are part of the script.
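
To confirm that a re-run file really "did not change", it can help to compare the decompressed records rather than the compressed bytes, because the compressed bytes can differ for reasons unrelated to the records (compression level, for example). The sketch below is one possible check; `same_content` is a hypothetical helper and the files are toy stand-ins.

```shell
# Hypothetical helper: succeeds if two gzipped files decompress to identical content
# (compares the CRC and byte count reported by cksum)
same_content() {
    [ "$(gzip -dc "$1" | cksum)" = "$(gzip -dc "$2" | cksum)" ]
}

# Toy files: same records at two compression levels, plus one with an extra record
tmp=$(mktemp -d)
printf 'chr1\t1\t5\nchr2\t3\t9\n' | gzip -1 -c > "$tmp/before.bed.gz"
printf 'chr1\t1\t5\nchr2\t3\t9\n' | gzip -9 -c > "$tmp/rerun.bed.gz"
printf 'chr1\t1\t5\nchr2\t3\t9\nchr3\t1\t2\n' | gzip -c > "$tmp/revised.bed.gz"

same_content "$tmp/before.bed.gz" "$tmp/rerun.bed.gz" && echo "rerun: unchanged"
same_content "$tmp/before.bed.gz" "$tmp/revised.bed.gz" || echo "revised: changed"
```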

In [6]:
date
WKDIR=/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/
ckzip37=${WKDIR}GRCh37/newfiles_from_JZ_for_processing/original_files/revised_021120/
INPUT37_021120=${WKDIR}GRCh37/newfiles_from_JZ_for_processing/revised_021120/
OUTPUT37_021120=${WKDIR}GRCh37/sorted_merged_noNs/revised_021120/
refNs=${WKDIR}refNs/

Tue Feb 11 15:34:38 EST 2020


In [8]:
##Some input files were zipped and others were not.  This is to check if file is zipped and if not then zip.

date
for f in ${ckzip37}/*
do
OUTFILE=${INPUT37_021120}/$( basename ${f})

if [[ $f == *.bed ]] 
then
    echo "bgzipped: "$( basename ${f})
    bgzip < $f > ${OUTFILE}.gz
else
     cp $f ${INPUT37_021120}/$( basename ${f})
     echo "already zipped: "$( basename ${f})
fi
done

Tue Feb 11 15:36:53 EST 2020
already zipped: HG001_genomespecific_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
already zipped: HG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: HG002_genomespecific_complexandSVs_alldifficultregions_v4.1.bed.gz
already zipped: HG003_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: HG004_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: HG005_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: allOtherDifficultregions.bed.gz
already zipped: alldifficultregions.bed.gz
already zipped: alllowmapandsegdupregions.bed.gz
bgzipped: mm-2-merged.bed
already zipped: notinHG001_genomespecific_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
already zipped: notinHG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
already zipped: notinHG002_genomespecific_complexandSVs_alldifficultregions_v4.1.bed.gz
already zipped: notinHG003_genomesp

In [9]:
date
bedtools --version 

for f in ${INPUT37_021120}*.gz
do
    OUTFILE=${OUTPUT37_021120}/$( basename ${f})

    gzcat $f |
    sed 's/^chr//' |
    sed 's/^X/23/;s/^Y/24/' |
    grep -Ev '^M|^MT|^[0-9][0-9]_|^[0-9]_|^[0-9]\||^[0-9][0-9]\||^Un|^HS' |
    sort -k1,1n -k2,2n -k3,3n |
    mergeBed -i stdin |
    sed 's/^23/X/;s/^24/Y/' |
    subtractBed -a stdin -b ${refNs}hg19.gap_1thruY_sorted_merged.bed |
    subtractBed -a stdin -b ${refNs}PSA_Y_hg19.bed |
    cat ${WKDIR}header.txt - |
    bgzip > ${OUTFILE}
    echo "processed:  " $( basename ${f})
    md5 ${OUTFILE} >> ${WKDIR}GRCh37_stratificationBEDs_for_Github_md5s.txt

done

Tue Feb 11 15:38:04 EST 2020
bedtools v2.29.1
processed:   HG001_genomespecific_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
processed:   HG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   HG002_genomespecific_complexandSVs_alldifficultregions_v4.1.bed.gz
processed:   HG003_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   HG004_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   HG005_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   allOtherDifficultregions.bed.gz
processed:   alldifficultregions.bed.gz
processed:   alllowmapandsegdupregions.bed.gz
processed:   mm-2-merged.bed.gz
processed:   notinHG001_genomespecific_RTG_PG_v3.3.2_SVs_alldifficultregions.bed.gz
processed:   notinHG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz
processed:   notinHG002_genomespecific_complexandSVs_alldifficultregions_v4.1.bed.gz
processed:   notinHG003_genomespecific_complexan

## 2/21/20 Revised mappability and functional region files
Nate revised some of the mappability and functional region files.  Previous versions of these files will be removed.

In [1]:
date
WKDIR=/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/
INPUT37_022120=${WKDIR}GRCh37/NDO_revised_022120/
OUTPUT37_022120=${WKDIR}GRCh37/sorted_merged_noNs/revised_022120/
refNs=${WKDIR}refNs/

Fri Feb 21 14:05:29 EST 2020


In [3]:
date
bedtools --version 

for f in ${INPUT37_022120}*.gz
do
    OUTFILE=${OUTPUT37_022120}/$( basename ${f})

    gzcat $f |
    sed 's/^chr//' |
    sed 's/^X/23/;s/^Y/24/' |
    grep -Ev '^M|^MT|^[0-9][0-9]_|^[0-9]_|^[0-9]\||^[0-9][0-9]\||^Un|^HS' |
    sort -k1,1n -k2,2n -k3,3n |
    mergeBed -i stdin |
    sed 's/^23/X/;s/^24/Y/' |
    subtractBed -a stdin -b ${refNs}hg19.gap_1thruY_sorted_merged.bed |
    subtractBed -a stdin -b ${refNs}PSA_Y_hg19.bed |
    cat ${WKDIR}header.txt - |
    bgzip > ${OUTFILE}
    echo "processed:  " $( basename ${f})
    md5 ${OUTFILE} >> ${WKDIR}GRCh37_stratificationBEDs_for_Github_md5s.txt

done

Fri Feb 21 14:06:55 EST 2020
bedtools v2.29.1
processed:   GRCh37_refseq_cds_merged.bed.gz
processed:   lowmappabilityall.bed.gz
processed:   notin_GRCh37_refseq_cds_merged.bed.gz
processed:   notinlowmappabilityall.bed.gz


### MD5s

In [9]:
date
WKDIR=/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/
#original files from NAS
#first batch- all were zipped to begin with
INPUT38=${WKDIR}GRCh38/
INPUT37=${WKDIR}GRCh37/
#second batch 
INPUT38batch2ORG=${WKDIR}GRCh38/newfiles_from_JZ_for_processing/original_files/
INPUT37batch2ORG=${WKDIR}GRCh37/newfiles_from_JZ_for_processing/original_files/
#third batch
ckzip38=${INPUT38batch2ORG}/added_012320/
ckzip38revised=${INPUT38batch2ORG}/added_013020/
ckzip37=${INPUT37batch2ORG}/added_012320/
ckzip37_revised_021120=${INPUT37batch2ORG}/revised_021120/
#022120 revised
INPUT37_022120=${INPUT37}/NDO_revised_022120/

#Script Output files
OUTPUT38=${WKDIR}GRCh38/sorted_merged_noNs/
OUTPUT38batch2=${WKDIR}GRCh38/sorted_merged_noNs/second_batch_from_JZ/
OUTPUT37=${WKDIR}GRCh37/sorted_merged_noNs/
OUTPUT37batch2=${WKDIR}GRCh37/sorted_merged_noNs/second_batch_from_JZ/
OUTPUT37_revised_021120=${WKDIR}GRCh37/sorted_merged_noNs/revised_021120/
OUTPUT37_revised_022120=${WKDIR}GRCh37/sorted_merged_noNs/revised_022120/

Fri Feb 21 14:27:54 EST 2020


#### GRCh37 original files
These are files that Justin created and that were transferred from the NAS.

In [10]:
date
echo "FIRST BATCH" 
for f in ${INPUT37}/*
do
md5 $f | tee -a GRCh37_stratificationBEDs_original_files_md5s.txt
done

echo "SECOND BATCH" 
for f in ${INPUT37batch2ORG}/*
do
md5 $f | tee -a GRCh37_stratificationBEDs_original_files_md5s.txt
done

echo "THIRD BATCH"
for f in ${ckzip37}/*
do
md5 $f | tee -a GRCh37_stratificationBEDs_original_files_md5s.txt
done

echo "revised 021120"
for f in ${ckzip37_revised_021120}/*
do
md5 $f | tee -a GRCh37_stratificationBEDs_original_files_md5s.txt
done

echo "revised 022120"
for f in ${INPUT37_022120}/*
do
md5 $f | tee -a GRCh37_stratificationBEDs_original_files_md5s.txt
done

Fri Feb 21 14:27:57 EST 2020
FIRST BATCH
md5: /Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//ALL_GRCh37_sorted_merged_renamed_BEDs_for_GitHub: Is a directory
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz) = 1349fd487bb1e3e7ba39474a4ead2097
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//AllTandemRepeats_201to10000bp_slop5.bed.gz) = 46ac79107aa5945bf92f44820b89695f
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//AllTandemRepeats_51to200bp_slop5.bed.gz) = 4b1f2b0896454c1f80f21c124049ee64
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//AllTandemRepeats_gt10000bp_slop5.bed.gz) = 056c8a67da06edb71b06f82b7a14b0d9
MD5 (/Users/jmcdani/Docume

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//HG005_GRCh37_highconf_CG-IllFB-IllGATKHC-Ion-SOLID_CHROM1-22_v.3.3.2_highconf_snpswithin10bp_slop50.bed.gz) = a6dc64306df97d296a71e512ca1d6ae4
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//HG005_HG006_HG007_FB_GATKHC_CG_MetaSV_allsvs_merged.bed.gz) = 44153248965c754897d23f4ea98001c3
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//NA12878_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_ALLCHROM_v3.2.2_highconf_geno2haplo_compoundhet_slop50.bed.gz) = 04cf05be4be7194afdfedc7e54e56960
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//NA12878_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_ALLCHROM_v3.2.2_highconf_varswithin50bp.bed.gz) = 244da99199bfdc45b8e35c9091a78219
md5: /Users/jmcdani/Documents/GiaB/Benc

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//human_g1k_v37_gemmap_l250_m2_e1_nonuniq.sort.bed.gz) = 2381f0e9daa84b9e2f379f5df85f9b71
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//human_g1k_v37_l100_gc15_slop50.bed.gz) = eafb341aa0801b307cbfa0326eb25611
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//human_g1k_v37_l100_gc15to20_slop50.bed.gz) = 950e60d442de4248474981089128751a
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//human_g1k_v37_l100_gc20to25_slop50.bed.gz) = c1eddec2a1f65fd06238bc4764ffb931
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37//human_g1k_v37_l100_gc25to30_slop50.bed.gz) = bda555febd0ddb3a30bd0b42c2fa25b7
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/newfiles_from_JZ_for_processing/original_files//allOtherDifficultregions.bed.gz) = 3f6230efc53bf4e240c1bf7362406226
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/newfiles_from_JZ_for_processing/original_files//alldifficultregions.bed.gz) = 5686cf44bad3143c0fb3335a75b5679a
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/newfiles_from_JZ_for_processing/original_files//alllowmapandsegdupregions.bed.gz) = 4c085421c509cd248fbf4bccc5fa48d3
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/newfiles_from_JZ_for_processing/original_files//example_of_no_ref_regions_input_file_b37_bedtools_slop_15kb_merged.bed) = 656e5c74035c4a8c8208e0f14e97448b
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratifica

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/newfiles_from_JZ_for_processing/original_files//revised_021120//HG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz) = 65617ec02382e721417937ad83944eaa
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/newfiles_from_JZ_for_processing/original_files//revised_021120//HG002_genomespecific_complexandSVs_alldifficultregions_v4.1.bed.gz) = cc441f6619be36a7a4bb5d505cb503d6
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/newfiles_from_JZ_for_processing/original_files//revised_021120//HG003_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz) = 7bf3cddb1a7843a2702235564b8918ad
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/newfiles_from_JZ_for_processing/original_files//revised_021

#### GRCh38 original files
These are the files Justin created and transferred from the NAS.

In [12]:
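date
# Print an MD5 line for every entry in each batch directory and append it to the manifest.
# Subdirectories matched by the glob are reported by md5 as "Is a directory".
echo "FIRST BATCH"
for f in "${INPUT38}"/*
do
    md5 "$f" | tee -a GRCh38_stratificationBEDs_original_files_md5s.txt
done

echo "SECOND BATCH"
for f in "${INPUT38batch2ORG}"/*
do
    md5 "$f" | tee -a GRCh38_stratificationBEDs_original_files_md5s.txt
done

echo "THIRD BATCH"
for f in "${ckzip38}"/*
do
    md5 "$f" | tee -a GRCh38_stratificationBEDs_original_files_md5s.txt
done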


Tue Feb 11 15:56:09 EST 2020
FIRST BATCH
md5: /Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//ALL_GRCh38_sorted_merged_renamed_BEDs_for_GitHub: Is a directory
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz) = d6d3eb30e09c3d2ed2f2689dd5f8d3ec
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_AllTandemRepeats_201to10000bp_slop5.bed.gz) = acff62266d425b01e1762bf6d53c9582
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz) = 01ad1bc3b61c74dbdbd93e854f76b23b
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_AllTandemRepeats_gt10000bp_slop5.bed.gz) = 86bd3b7ad00092d4bf03635aaef9573

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_notinchainSelf_sorted_merged.bed.gz) = e9255bb4de1fb26d03085ebb7db39830
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_notinchainSelf_sorted_merged_gt10kb.bed.gz) = bdbb1fe76618548154b29e49cc1b965e
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_notinlowmappabilityall.bed.gz) = c0d3ecdc7ca29474e67c901ab62cd288
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_refseq_cds_merged.bed.gz) = d638bb07665a36e6a602fecddba9b136
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_remapped_BadPromoters_gb-2013-14-5-r51-s1.bed.gz) = e236139a276f4dc640c4e575559ecb8e
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_remapped_PG2016-1.0_NA12878_b37_comphetindel10bp_slop50.bed.gz) = 568084b50d3397f0d4c35cd141a8b21a
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_remapped_PG2016-1.0_NA12878_b37_comphetsnp10bp_slop50.bed.gz) = 472c22edc983a722eb6fbd38b851c1ef
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_remapped_PG2016-1.0_NA12878_b37_complexindel10bp_slop50.bed.gz) = 2dc9bdc4167e61211bee1b7ba36a1b2b
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_remapped_PG2016-1.0_NA12878_b37_snpswithin10bp_slop50.bed.gz) = 41f3697d8f61864596b25cc662f99598
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38//GRCh38_remapped_PacBio_MetaSV_

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/newfiles_from_JZ_for_processing/original_files//MHC_GRCh38.bed) = 3ad5379af2e4dfb8b9ebc1a1ddb06f36
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/newfiles_from_JZ_for_processing/original_files//SVMergeInversions.GRCh38.120519.clustered_slop150_chr1_22.bed) = a4f6626b50cfec06c3bad20b3576e2bc
md5: /Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/newfiles_from_JZ_for_processing/original_files//added_012320: Is a directory
md5: /Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/newfiles_from_JZ_for_processing/original_files//added_013020: Is a directory
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/newfiles_from_JZ_for_processing/original_files//expanded
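
The loops above print `Is a directory` for every subdirectory the glob matches. A minimal sketch of one way to avoid that, assuming only regular files should be checksummed: `md5_regular_files` is a hypothetical helper name, not part of the original notebook, and the `md5sum` fallback is an assumption for systems without the macOS `md5` command.

```shell
# Hypothetical helper: print an MD5 line for each regular file in a directory,
# skipping subdirectories instead of letting the checksum tool report an error.
md5_regular_files() {
    local dir="$1"
    local f
    for f in "${dir%/}"/*; do
        [ -f "$f" ] || continue      # skip directories and unmatched globs
        if command -v md5 >/dev/null 2>&1; then
            md5 "$f"                 # macOS/BSD tool, as used in this notebook
        else
            md5sum "$f"              # GNU coreutils fallback (assumption)
        fi
    done
}

# Usage with a variable from this notebook:
# md5_regular_files "${INPUT38}" | tee -a GRCh38_stratificationBEDs_original_files_md5s.txt
```

Stripping the trailing slash with `${dir%/}` also avoids the `//` that appears in the paths printed above, since `INPUT38` already ends in `/`.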

#### GRCh37 output files
Files output from the "Prepare Stratification BEDs for GitHub" script.

In [11]:
date
# Print an MD5 line for every processed .gz file in each GRCh37 output directory.
echo "FIRST BATCH"
for f in "${OUTPUT37}"/*.gz
do
    md5 "$f"
done

echo "SECOND BATCH"
for f in "${OUTPUT37batch2}"/*.gz
do
    md5 "$f"
done

echo "revised 021120"
for f in "${OUTPUT37_revised_021120}"/*.gz
do
    md5 "$f"
done

echo "revised 022120"
for f in "${OUTPUT37_revised_022120}"/*.gz
do
    md5 "$f"
done

Fri Feb 21 14:42:37 EST 2020
FIRST BATCH
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz) = f2e791be098f3dda45729c85637b2805
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//AllTandemRepeats_201to10000bp_slop5.bed.gz) = 8ef62f7ba5e6426bb05ae3a886308724
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//AllTandemRepeats_51to200bp_slop5.bed.gz) = f70e28e5924f2c850b7b143b8aaa27b4
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//AllTandemRepeats_gt10000bp_slop5.bed.gz) = 17bfc25d321dfb450879a02a4a175664
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_no

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//HG005_GRCh37_highconf_CG-IllFB-IllGATKHC-Ion-SOLID_CHROM1-22_v.3.3.2_highconf_comphetsnp10bp_slop50.bed.gz) = e7d4f7c4d8216fb2a92fed25db147b9b
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//HG005_GRCh37_highconf_CG-IllFB-IllGATKHC-Ion-SOLID_CHROM1-22_v.3.3.2_highconf_complexindel10bp_slop50.bed.gz) = 5fa5b67a799a18740edecdea516813d7
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//HG005_GRCh37_highconf_CG-IllFB-IllGATKHC-Ion-SOLID_CHROM1-22_v.3.3.2_highconf_snpswithin10bp_slop50.bed.gz) = 4f647a4276a60eb2db1e503e2b67667c
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//HG005_HG006_HG007_FB_GATKHC_CG_MetaSV_allsvs_merged.b

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//human_g1k_v37_gemmap_l150_m2_e0_nonuniq.sort.bed.gz) = 0f539e38bc86b1890e4ea078059c07fe
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//human_g1k_v37_gemmap_l150_m2_e1_nonuniq.sort.bed.gz) = 2ce4b38a614541c83eff9d43760b19f9
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//human_g1k_v37_gemmap_l250_m0_e0_nonuniq.sort.bed.gz) = 3f89243deb75fe5b9a4cb62ad39cfaea
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//human_g1k_v37_gemmap_l250_m1_e0_nonuniq.sort.bed.gz) = 67cf2b18cdeb371557750e90dba50cda
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs//hum

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs/second_batch_from_JZ//HG005_genomespecific_complexandSVs_v3.3.2.bed.gz) = eb9aef6d5f149f5b5abd42e0a884f53f
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs/second_batch_from_JZ//HG2_SKor_TrioONTCanu_intersect_HG2_SKor_TrioONTFlye_intersect_HG2_SKor_CCS15_gt10kb_GRCh37.bed.gz) = 69c08ad1b1536d26f1f0424e1c593e75
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs/second_batch_from_JZ//L1Hs_gt500.bed.gz) = e90c1a556696491bb049f5e50c4d8fec
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs/second_batch_from_JZ//MHC_GRCh37.bed.gz) = e01f1ece88408c5d1450e7fa39871f37
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_val

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs/revised_022120//notin_GRCh37_refseq_cds_merged.bed.gz) = 752ab990845e7cd3a4c1d0b3a06c29c9
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh37/sorted_merged_noNs/revised_022120//notinlowmappabilityall.bed.gz) = 1af4c140e01a7aaa8985d11f5e18acf7


#### GRCh38 output files 
Files output from the "Prepare Stratification BEDs for GitHub" script.

In [10]:
date
# Print an MD5 line for every processed .gz file in each GRCh38 output directory.
echo "FIRST BATCH"
for f in "${OUTPUT38}"/*.gz
do
    md5 "$f"
done

echo "SECOND BATCH"
for f in "${OUTPUT38batch2}"/*.gz
do
    md5 "$f"
done

Fri Feb  7 10:48:23 EST 2020
FIRST BATCH
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz) = df2759867268f613c4b5d3043556b7aa
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_AllTandemRepeats_201to10000bp_slop5.bed.gz) = d031dd132534d047bedf167d59d66070
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz) = e47464b9736943531ad6bc5488aaea5a
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_AllTandemRepeats_gt10000bp_slop5.bed.gz) = e4b5375c628cc69731135346b57d851f
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratificat

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_notinAllTandemRepeatsandHomopolymers_slop5.bed.gz) = 47475f7d4f4acdb9a63735feb95aaf8e
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_notinchainSelf_sorted_merged.bed.gz) = 0464cb9c78cd449a1a11b704662b9f56
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_notinchainSelf_sorted_merged_gt10kb.bed.gz) = b335ffb609b71f6c8379e60d531590f8
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_notinlowmappabilityall.bed.gz) = ee6ac347f80650d9aebe3409f4a3ee58
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_refseq_cds_merged

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_remapped_HG005_HG006_HG007_FB_GATKHC_CG_MetaSV_allsvs_merged.bed.gz) = bf5fd59342d61d82b4925c0c2d658cc9
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_remapped_NA12878_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_ALLCHROM_v3.2.2_highconf_geno2haplo_compoundhet_slop50.bed.gz) = 1266b743b90c45d5f552f1fa67ed8f2b
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_remapped_NA12878_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_ALLCHROM_v3.2.2_highconf_varswithin50bp.bed.gz) = b138e122f2002e30d33caa293bf00c94
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs//GRCh38_remapped_PG2016-1.0_NA12878_b37_comphetindel10bp_sl

MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs/second_batch_from_JZ//HG002_GRCh38_1_22_v4.1_draft_benchmark_othercomplexwithin10bp_slop50.bed.gz) = 51089011a9363e4e19bfae7f85458c83
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs/second_batch_from_JZ//HG002_GRCh38_1_22_v4.1_draft_benchmark_snpswithin10bp_slop50.bed.gz) = fdd64ab550393dd8dc15a4aee06c7c4c
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs/second_batch_from_JZ//HG002_genomespecific_complexandSVs_alldifficultregions_v3.3.2.bed.gz) = b7fcd6bed50112bdf0f6351caeac398f
MD5 (/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh38/sorted_merged_noNs/second_batch_from_JZ//HG002_genomespecific_complexandSVs_alldifficultregions_v4.1.bed.gz) = 28c30a4d4
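
Because the processing loop writes each output under the basename of its input, the original and output listings can be cross-checked by file name. The following is a sketch of such a check, not something the original notebook ran; `missing_outputs` is a hypothetical helper name, and it assumes inputs and outputs are `.gz` files that share basenames.

```shell
# Hypothetical helper: list input .gz files that have no same-named file
# in the output directory (i.e. inputs that were never processed).
missing_outputs() {
    local in_dir="$1"
    local out_dir="$2"
    local f
    for f in "${in_dir%/}"/*.gz; do
        [ -f "$f" ] || continue      # skip directories and unmatched globs
        # Print the basename only when the output counterpart is absent.
        [ -f "${out_dir%/}/$(basename "$f")" ] || basename "$f"
    done
}

# Usage with variables from this notebook:
# missing_outputs "${INPUT38}" "${OUTPUT38}"
```

An empty result means every input `.gz` has a counterpart. It says nothing about whether the contents are correct, which is what the validation plots are for.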

## OUTPUT FILE LOCATIONS
#### files for validation plots
In the paths below, GRCh3X stands for either GRCh37 or GRCh38.

/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh3X/sorted_merged_noNs/

/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh3X/sorted_merged_noNs/second_batch_from_JZ/

/Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh3X/sorted_merged_noNs/second_batch_from_JZ/revised_021120

*Files that have undergone a revision have been deleted from their original output directory.  

#### files for release
All stratification files for release were subsequently renamed with the Stratification_BEDfile_rename.R script and placed in /Users/jmcdani/Documents/GiaB/Benchmarking/GRCh38_stratification_validation/data/stratifications/GRCh3X/ALL_GRCh3X_sorted_merged_renamed_BEDs_for_GitHub/
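
A manifest in the `MD5 (path) = digest` format shown in the outputs above can be re-checked later, for example after files are copied or uploaded. This is a minimal sketch under stated assumptions: `md5_hex` and `verify_md5_manifest` are hypothetical helper names, the manifest paths must still be valid on the machine running the check, and `md5sum` is assumed as the fallback where the macOS `md5` command is unavailable.

```shell
# Hypothetical helper: print only the MD5 hex digest of one file.
md5_hex() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"                       # macOS/BSD: -q prints the digest only
    else
        md5sum "$1" | cut -d ' ' -f 1     # GNU coreutils fallback (assumption)
    fi
}

# Hypothetical helper: re-check every "MD5 (path) = digest" line in a manifest.
# Returns non-zero if any file is missing or its digest differs.
verify_md5_manifest() {
    local manifest="$1"
    local status=0
    local line path digest
    while IFS= read -r line; do
        # Extract the path between "MD5 (" and ") = <hex digest>".
        path=$(printf '%s\n' "$line" | sed -n 's/^MD5 (\(.*\)) = [0-9a-f]*$/\1/p')
        [ -n "$path" ] || continue        # ignore lines in any other format
        digest=${line##* = }
        if [ -f "$path" ] && [ "$(md5_hex "$path")" = "$digest" ]; then
            echo "OK: $path"
        else
            echo "FAILED: $path"
            status=1
        fi
    done < "$manifest"
    return "$status"
}

# Usage with the manifest written by the processing loop:
# verify_md5_manifest GRCh38_stratificationBEDs_for_Github_md5s.txt
```

Lines that do not match the manifest format, such as `md5: ...: Is a directory` messages, are skipped rather than counted as failures.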