# Generate Genome Bed Files for Validation (new for v3.1 stratifications)

JMcDaniel
started 6/3/22

In [1]:
pwd

/Users/jmcdani/Documents/GiaB/Benchmarking/Stratifications/v3.1_genome-stratifications/validation/genome-bed-files


## TABLE OF CONTENTS
<hr style="border:2px solid black"> </hr>

**[Background](#bkgd)**    
**[GRCh38](#GRCh38)**  
**[GRCh37](#GRCh37)**  
**[CHM13](#CHM13)**  

## Background<a id="bkgd"></a>
Genome region files are used to calculate the chromosome coverage of stratifications in the validation script. Previously, intervals were merged using a distance of 100; JZ would like to use the default of 0 instead. Generating these files for each reference involves:
1. start with a sorted genome bed (chrom, 0, end)
2. remove gaps (Ns); this is only performed for GRCh37/38 since CHM13 has no gaps
3. remove PSA-Y; this is only performed for GRCh37/38 because this region is not included in those references, whereas it is included in CHM13
4. merge using the default distance (`-d 0`)

These files are new for the v3.1 stratifications.

The process for generating these files was adapted from the one used for the v3.0 stratification genome bed files, which is documented in `/Users/jmcdani/Documents/GiaB/Benchmarking/Stratifications/v3.0_genome-stratifications/validation/README_for_genome_bed_files`
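
To make the effect of the merge distance concrete, here is a minimal sketch on toy intervals. It uses a small awk stand-in for interval merging rather than `mergeBed` itself, so `merge_with_gap` and `toy_intervals` are hypothetical names, and the rule it applies (join two intervals when the next start is at most `d` bases past the current end) is my assumption about how the distance behaves.

```shell
# Toy awk stand-in for interval merging with a maximum gap d.
# This is NOT bedtools; it only illustrates how d changes the result.
# Input must be sorted by chromosome, then start.
merge_with_gap() {
    awk -v d="$1" 'BEGIN { OFS = "\t" }
        NR == 1 { chrom = $1; start = $2; end = $3; next }
        # Same chromosome and within d bases of the current block: extend it.
        $1 == chrom && $2 <= end + d { if ($3 > end) end = $3; next }
        # Otherwise emit the finished block and start a new one.
        { print chrom, start, end; chrom = $1; start = $2; end = $3 }
        END { if (NR > 0) print chrom, start, end }'
}

# Made-up intervals: one book-ended pair, one 50 bp gap, one 200 bp gap.
toy_intervals() {
    printf 'chr1\t0\t100\n'
    printf 'chr1\t100\t200\n'
    printf 'chr1\t250\t300\n'
    printf 'chr1\t500\t600\n'
}

merged_d0=$(toy_intervals | merge_with_gap 0)
merged_d100=$(toy_intervals | merge_with_gap 100)
printf 'd=0:\n%s\n\nd=100:\n%s\n' "$merged_d0" "$merged_d100"
```

With `d=0` only the book-ended pair is joined, leaving three intervals. With `d=100` the 50 bp gap is bridged as well, so bases that are not actually in the input get counted as covered.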

## GRCh38<a id="GRCh38"></a>

The genome bed file for GRCh38 is `v3.1_genome-stratifications/GRCh38-stratifications/ref-files/human.b38.genome.bed`; it was generated in `GRCh38_LowComplexity.ipynb` for v3.1. chrM was removed since it is not covered by any of our stratifications; the chromosomes covered are 1-22, X, and Y.

In [12]:
#read in genome.bed, remove chrM and sort
grep -v ^"chrM" ../../GRCh38-stratifications/ref-files/human.b38.genome.bed | 
sed 's/^chr//' | 
sed 's/^X/23/;s/^Y/24/' |
sort -k1,1n -k2,2n  | 
sed 's/^23/X/;s/^24/Y/' | 
sed 's/^[a-zA-Z0-9_]/chr&/' > GRCh38_1-22XY_sorted.genome.bed
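
The sed/sort chain above can be tried on a toy genome file to see the resulting chromosome order. This is a minimal sketch: `sort_chroms` and `toy_genome` are hypothetical helper names and the chromosome lengths are made up.

```shell
# Hypothetical helper wrapping the same sed/sort steps used in the cell above.
sort_chroms() {
    grep -v '^chrM' |                 # drop the mitochondrial contig
    sed 's/^chr//' |                  # strip the prefix so names can sort numerically
    sed 's/^X/23/;s/^Y/24/' |         # give X and Y temporary numeric names
    sort -k1,1n -k2,2n |              # numeric sort on chromosome, then start
    sed 's/^23/X/;s/^24/Y/' |         # restore X and Y
    sed 's/^[a-zA-Z0-9_]/chr&/'       # put the chr prefix back
}

# Made-up genome file in arbitrary order (chrom, start, end).
toy_genome() {
    printf 'chrY\t0\t50\n'
    printf 'chr10\t0\t300\n'
    printf 'chrM\t0\t16\n'
    printf 'chr2\t0\t400\n'
    printf 'chrX\t0\t150\n'
    printf 'chr1\t0\t500\n'
}

sorted_order=$(toy_genome | sort_chroms | cut -f1 | paste -s -d ' ' -)
echo "$sorted_order"
```

On this toy input the order comes out as `chr1 chr2 chr10 chrX chrY`. A plain lexical sort would instead place `chr10` before `chr2`, which is why X and Y are temporarily renamed to 23 and 24.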

In [13]:
# Remove N's in genome.bed
subtractBed \
-a GRCh38_1-22XY_sorted.genome.bed \
-b ../../post-processing/refNs/gap_38_1thruY_sorted_merged.bed \
> GRCh38_1-22XY_sorted_NoNs.genome.bed
rm GRCh38_1-22XY_sorted.genome.bed

In [14]:
# Remove PSA-Y in genome.bed and merge
subtractBed \
-a GRCh38_1-22XY_sorted_NoNs.genome.bed \
-b ../../post-processing/refNs/PSA_Y_GRCh38.bed |
mergeBed > GRCh38_1-22XY_sorted_NoNs_NoPSAY_merged.genome.bed
rm GRCh38_1-22XY_sorted_NoNs.genome.bed
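
As a quick sanity check on a merged genome bed, one can total the bases it covers and confirm that no adjacent intervals still overlap or touch. The sketch below uses plain awk on a made-up toy file; `bed_summary` is a hypothetical helper, not part of the validation script.

```shell
# Hypothetical helper: total covered bases, plus the number of adjacent
# interval pairs on the same chromosome that overlap or are book-ended.
# After merging with a distance of 0, that count is expected to be 0.
bed_summary() {
    awk 'BEGIN { OFS = "\t" }
        $1 == prev_chrom && $2 <= prev_end { unmerged++ }
        { total += $3 - $2; prev_chrom = $1; prev_end = $3 }
        END { print "total_bases", total + 0; print "unmerged_pairs", unmerged + 0 }' "$1"
}

# Made-up, already merged toy file (chrom, start, end).
toy_bed=$(mktemp)
printf 'chr1\t0\t100\nchr1\t150\t400\nchr2\t10\t60\n' > "$toy_bed"

summary=$(bed_summary "$toy_bed")
echo "$summary"
rm -f "$toy_bed"
```

The same function could be pointed at `GRCh38_1-22XY_sorted_NoNs_NoPSAY_merged.genome.bed`; it assumes the file is sorted by chromosome and start, as produced above.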

## GRCh37<a id="GRCh37"></a>

The genome bed file for GRCh37 is `v3.1_genome-stratifications/GRCh37-stratifications/ref-files/human.b37.genome.bed`; it was generated in `GRCh37_LowComplexity.ipynb` for v3.1. MT (the mitochondrial contig, named chrM in the other references) was removed since it is not covered by any of our stratifications; the chromosomes covered are 1-22, X, and Y.

In [24]:
#read in genome.bed, remove MT and sort
grep -v ^"MT" ../../GRCh37-stratifications/ref-files/human.b37.genome.bed | 
sed 's/^X/23/;s/^Y/24/' |
sort -k1,1n -k2,2n  | 
sed 's/^23/X/;s/^24/Y/' > GRCh37_1-22XY_sorted.genome.bed

In [25]:
# Remove N's in genome.bed
subtractBed \
-a GRCh37_1-22XY_sorted.genome.bed \
-b ../../post-processing/refNs/hg19.gap_1thruY_sorted_merged.bed \
> GRCh37_1-22XY_sorted_NoNs.genome.bed
rm GRCh37_1-22XY_sorted.genome.bed

In [26]:
# Remove PSA-Y in genome.bed and merge
subtractBed \
-a GRCh37_1-22XY_sorted_NoNs.genome.bed \
-b ../../post-processing/refNs/PSA_Y_hg19.bed |
mergeBed > GRCh37_1-22XY_sorted_NoNs_NoPSAY_merged.genome.bed
rm GRCh37_1-22XY_sorted_NoNs.genome.bed

## CHM13<a id="CHM13"></a>

The genome bed file for CHM13 is `v3.1_genome-stratifications/T2T-CHM13v2.0-stratifications/LowComplexity/intermediatefiles/CHM13v2.0.genome.bed`; it was generated in `T2T-CHM13v2.0-LowComplexity.ipynb` for v3.1. chrM was removed since it is not covered by any of our stratifications; the chromosomes covered are 1-22, X, and Y.

In [18]:
#read in genome.bed, remove chrM and sort
grep -v ^"chrM" ../../T2T-CHM13v2.0-stratifications/LowComplexity/intermediatefiles/CHM13v2.0.genome.bed | 
sed 's/^chr//' | 
sed 's/^X/23/;s/^Y/24/' |
sort -k1,1n -k2,2n  | 
sed 's/^23/X/;s/^24/Y/' | 
sed 's/^[a-zA-Z0-9_]/chr&/' > CHM13v2.0_1-22XY_sorted.genome.bed

Unlike the other references, CHM13 has no gaps, so none were removed. PSA-Y is included in CHM13, so it was not removed either.

In [27]:
md5 GRCh38_1-22XY_sorted_NoNs_NoPSAY_merged.genome.bed
md5 GRCh37_1-22XY_sorted_NoNs_NoPSAY_merged.genome.bed
md5 CHM13v2.0_1-22XY_sorted.genome.bed

MD5 (GRCh38_1-22XY_sorted_NoNs_NoPSAY_merged.genome.bed) = b52fc583925d39d709d6f37526a4642f
MD5 (GRCh37_1-22XY_sorted_NoNs_NoPSAY_merged.genome.bed) = 3b7c647de5301e12cbcefa1b82e4b39e
MD5 (CHM13v2.0_1-22XY_sorted.genome.bed) = 0d39d01e151c4030dc68c60a49d63297
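
The `md5` command used above is the macOS/BSD tool; on Linux the usual equivalent is `md5sum`. A minimal sketch of a wrapper that works with either is shown below. `md5_of` is a hypothetical helper name, and the demonstration uses an empty temporary file rather than the genome bed files.

```shell
# Hypothetical helper: print only the MD5 hex digest of a file,
# using whichever of md5sum (Linux) or md5 (macOS/BSD) is available.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{ print $1 }'
    elif command -v md5 >/dev/null 2>&1; then
        md5 -q "$1"
    else
        echo "no md5 tool found" >&2
        return 1
    fi
}

# Demonstration on an empty temporary file.
empty_file=$(mktemp)
empty_digest=$(md5_of "$empty_file")
echo "$empty_digest"
rm -f "$empty_file"
```

To re-verify a regenerated file against the checksums recorded above, the digest can be compared directly, for example `[ "$(md5_of GRCh38_1-22XY_sorted_NoNs_NoPSAY_merged.genome.bed)" = "b52fc583925d39d709d6f37526a4642f" ]`.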
