# T2T-CHM13v2 Union Stratifications (v3.1)
JMcDaniel started 4/28/22

## TABLE OF CONTENTS
<hr style="border:2px solid black"> </hr>
### GENERAL INFORMATION
- [Background](#background)  
- [File Descriptions](#files)  
- [Files for release](#release)  
- [Resources](#resources)  
- [Code and File Sharing](#share)  
- [Software Tools](#tools)  
- [Get Dependency Files](#depend)  
### STRATIFICATION PREP
- [All Difficult Regions](#alldiff)
- [All Tandem Repeats](#alltr)

## GENERAL INFORMATION
<hr style="border:2px solid black"> </hr>

In [2]:
# JM working directory
pwd

/Users/jmcdani/Documents/GiaB/Benchmarking/Stratifications/v3.1_genome-stratifications/T2T-CHM13v2.0-stratifications


## Background<a id="background"></a>
Prepare stratifications for use with T2T CHM13 v2.0 as a reference.
- chr1-22 and chrX from CHM13
- chrY from HG002 (NA24385)

These files can be used as a standard resource of BED files with GA4GH benchmarking tools such as [hap.py](https://github.com/Illumina/hap.py) to stratify true positive, false positive, and false negative variant calls by difficult region.
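
As a rough sketch of how these BED files might be handed to hap.py: hap.py accepts a `--stratification` TSV with one region name and one BED path per line. The TSV name, the temporary directory, and the VCF and output names in the commented command are placeholders for illustration, not files from this release.

```shell
# Hypothetical stratification list; the temp directory stands in for the
# stratification root (the directory that contains Union/).
workdir=$(mktemp -d)
strat_tsv="$workdir/CHM13v2.0_union_strats.tsv"   # placeholder name

# Two columns per line: region name, BED path
printf '%s\t%s\n' \
  alldifficultregions Union/CHM13v2.0_alldifficultregions.bed.gz \
  notinalldifficultregions Union/CHM13v2.0_notinalldifficultregions.bed.gz \
  > "$strat_tsv"
cat "$strat_tsv"

# Example invocation, not run here. truth.vcf.gz, query.vcf.gz, happy_out and
# the reference FASTA path are assumptions:
# hap.py truth.vcf.gz query.vcf.gz \
#   -r T2T_files/chm13v2.0.fa \
#   --stratification "$strat_tsv" \
#   -o happy_out
```

Keeping the TSV beside the `Union/` directory, or using absolute BED paths, avoids path-resolution surprises.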

Create a single stratification that is the union of the following difficult regions:
- LowComplexity/CHM13v2.0_AllTandemRepeatsandHomopolymers_slop5.bed.gz (which includes new satellites strat)
- rDNA/chm13v1.1.rdna_model.bed.gz
- SegDups/T2T-CHM13v2.SDs.lowid.bed.gz (release filename CHM13v2.0_SegDups.bed.gz)
- XY XTR

A single stratification that is the union of all Tandem Repeats was also generated as a new LowComplexity stratification.
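
To make "union" concrete, here is a minimal sketch on toy intervals. It uses plain `sort` and `awk` as a stand-in for the bedtools pipeline used in this notebook (it is not that pipeline), and the two toy BED files are invented for illustration.

```shell
# Toy inputs (hypothetical): two small BED files with overlapping intervals
workdir=$(mktemp -d)
printf 'chr1\t10\t20\nchr1\t30\t40\n' > "$workdir/a.bed"
printf 'chr1\t15\t35\nchr2\t5\t8\n'   > "$workdir/b.bed"

# Union: pool all intervals, sort by chromosome and start, then merge any
# interval that overlaps or touches the one before it.
union=$(cat "$workdir/a.bed" "$workdir/b.bed" \
  | LC_ALL=C sort -k1,1 -k2,2n \
  | awk 'BEGIN { OFS = "\t" }
      NR == 1              { chrom = $1; start = $2; stop = $3; next }
      $1 == chrom && $2 <= stop { if ($3 > stop) stop = $3; next }
                           { print chrom, start, stop
                             chrom = $1; start = $2; stop = $3 }
      END                  { if (NR > 0) print chrom, start, stop }')
printf '%s\n' "$union"
```

The three chr1 intervals collapse into a single chr1 10-40 record, and the chr2 interval passes through unchanged.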

## Files for release<a id="release"></a>
Union  
CHM13v2.0_alldifficultregions.bed.gz  
CHM13v2.0_notinalldifficultregions.bed.gz  

LowComplexity  
CHM13v2.0_allTandemRepeats.bed.gz  
CHM13v2.0_notinallTandemRepeats.bed.gz

## Resources<a id="resources"></a>
- [T2T annotations](https://docs.google.com/spreadsheets/d/13BXuEFB904aje6zWXyZ0znZnXvQiu1qxKADA2uV2JU4/edit#gid=1966247802)
- [Stratification JZ would like to see and info on generating them](https://docs.google.com/spreadsheets/d/1xSsmq48pBJVOXa2dP-845I4qkCzTCzyR_iXRs9RPdBE/edit#gid=0)
- [T2T CHM13 v2.0 assembly](https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chm13v2.0.fasta)

## Code and File Sharing<a id="share"></a>
- GIAB GitHub
- GIAB FTP

## Required tools and versions<a id="tools"></a>

In [1]:
# bedtools conda environment used by JMcDaniel
conda list

# packages in environment at /Users/jmcdani/opt/anaconda3/envs/bedtools:
#
# Name                    Version                   Build  Channel
appnope                   0.1.0            py27hb466136_0  
backports                 1.1                pyhd3eb1b0_0  
backports.shutil_get_terminal_size 1.0.0              pyhd3eb1b0_3  
backports_abc             0.5                        py_1  
bedtools                  2.30.0               haa7f73a_1    bioconda
biopython                 1.74             py27h1de35cc_0  
bleach                    3.1.0                    py27_0  
bzip2                     1.0.8                h1de35cc_0  
ca-certificates           2022.3.29            hecd8cb5_0  
certifi                   2020.6.20          pyhd3eb1b0_3  
configparser              4.0.2                    py27_0  
dbus                      1.13.18              h18a8e69_0  
decorator                 4.4.0                    py27_1  
defusedxml                0.7.1              pyhd3eb1b0_0  

## Get Dependency Files<a id="depend"></a>

CHM13v2.0.genome.bed is used for preparing the complement (notin) stratification and was generated in T2T-CHM13v2.0-LowComplexity.ipynb.  
File Location: LowComplexity/intermediatefiles/CHM13v2.0.genome.bed
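
The genome BED itself is produced in the LowComplexity notebook and is not shown here. One common way to build such a file, sketched below as an assumption rather than the method actually used, is to emit one record per contig from a FASTA index (`.fai`), spanning 0 to the contig length. The toy index is a stand-in for `T2T_files/chm13v2.0.fa.fai`.

```shell
# Toy .fai (hypothetical contigs); columns: name, length, offset, linebases, linewidth
workdir=$(mktemp -d)
printf 'chr1\t1000\t6\t60\t61\nchr2\t500\t1030\t60\t61\n' > "$workdir/toy.fa.fai"

# One BED record per contig: name, 0, length
awk 'BEGIN { OFS = "\t" } { print $1, 0, $2 }' "$workdir/toy.fa.fai" \
  > "$workdir/toy.genome.bed"
cat "$workdir/toy.genome.bed"
```

A real genome BED may exclude some contigs or regions, so the file in `LowComplexity/intermediatefiles/` remains the authoritative input.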

## STRATIFICATION FILE PREPARATION STEPS
<hr style="border:2px solid black"> </hr>

### All Difficult and its complement<a id="alldiff"></a>  
note:  
updated 5/2/22 to remove satellites since they are included with AllTRandHomopol. Also sorted and merged rDNA and SegDup input files before union.  
updated 5/13/22 to add XY XTR

In [4]:
date
# Union of the four difficult-region BEDs, sorted in reference (.fai) order and merged
multiIntersectBed -i \
LowComplexity/CHM13v2.0_AllTandemRepeatsandHomopolymers_slop5.bed.gz \
rDNA/CHM13v1.1_rDNA.bed.gz \
SegDups/CHM13v2.0_SegDups.bed.gz \
XY/CHM13v2.0_chrX_XTR.bed \
| sortBed -faidx T2T_files/chm13v2.0.fa.fai -i stdin \
| mergeBed -i stdin \
| bgzip -c > Union/CHM13v2.0_alldifficultregions.bed.gz

# Complement: whole genome minus the union
subtractBed \
-a LowComplexity/intermediatefiles/CHM13v2.0.genome.bed -b Union/CHM13v2.0_alldifficultregions.bed.gz \
| bgzip -c > Union/CHM13v2.0_notinalldifficultregions.bed.gz

# Total bases covered by the union
echo "sum of regions:"
gzcat Union/CHM13v2.0_alldifficultregions.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:51:22 EDT 2022
sum of regions:
497697292
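
A possible sanity check on a stratification and its complement, sketched on toy data: the bases in the union plus the bases in its "notin" file should equal the bases in the genome BED. The toy files and the `bed_bp` helper are hypothetical; in practice the three paths would be the genome BED and the two Union files written above.

```shell
# Toy stand-ins for the genome BED, the union, and its complement
workdir=$(mktemp -d)
printf 'chr1\t0\t1000\n' > "$workdir/genome.bed"
printf 'chr1\t100\t300\nchr1\t600\t650\n' | gzip -c > "$workdir/difficult.bed.gz"
printf 'chr1\t0\t100\nchr1\t300\t600\nchr1\t650\t1000\n' | gzip -c > "$workdir/notin.bed.gz"

# Hypothetical helper: total bases (end - start) in a plain or gzipped BED
bed_bp() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | awk -F'\t' '{ sum += $3 - $2 } END { print sum + 0 }'
}

genome_bp=$(bed_bp "$workdir/genome.bed")
difficult_bp=$(bed_bp "$workdir/difficult.bed.gz")
notin_bp=$(bed_bp "$workdir/notin.bed.gz")

if [ $((difficult_bp + notin_bp)) -eq "$genome_bp" ]; then
  echo "OK: $difficult_bp + $notin_bp = $genome_bp"
else
  echo "MISMATCH: $difficult_bp + $notin_bp != $genome_bp"
fi
```

Because bgzip output is gzip-compatible, `gzip -dc` reads the released `.bed.gz` files as well.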


### All TandemRepeats and its complement<a id="alltr"></a>  

NOTE: The allTR strats were erroneously generated in the Union ipynb; they should have been generated in the LowComplexity ipynb. The code has been copied over for future use and the files transferred to LowComplexity.

In [5]:
date
# Union of the four tandem-repeat size bins, sorted in reference (.fai) order and merged
multiIntersectBed -i \
LowComplexity/CHM13v2.0_AllTandemRepeats_lt51bp_slop5.bed.gz \
LowComplexity/CHM13v2.0_AllTandemRepeats_51to200bp_slop5.bed.gz \
LowComplexity/CHM13v2.0_AllTandemRepeats_201to10000bp_slop5.bed.gz \
LowComplexity/CHM13v2.0_AllTandemRepeats_gt10000bp_slop5.bed.gz \
| sortBed -faidx T2T_files/chm13v2.0.fa.fai -i stdin \
| mergeBed -i stdin \
| bgzip -c > Union/CHM13v2.0_allTandemRepeats.bed.gz

# Complement: whole genome minus all tandem repeats
subtractBed \
-a LowComplexity/intermediatefiles/CHM13v2.0.genome.bed -b Union/CHM13v2.0_allTandemRepeats.bed.gz \
| bgzip -c > Union/CHM13v2.0_notinallTandemRepeats.bed.gz

# Total bases covered by all tandem repeats
echo "sum of regions:"
gzcat Union/CHM13v2.0_allTandemRepeats.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:52:15 EDT 2022
sum of regions:
296888132
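
Since the base-pair sum is only meaningful if no interval is counted twice, one optional check is to confirm that the merged file has no overlapping or book-ended neighbours, which `mergeBed` joins by default. The sketch below uses toy files and a hypothetical `count_unmerged` helper; it assumes records for each chromosome are contiguous and sorted by start, as they are after `sortBed`.

```shell
workdir=$(mktemp -d)
# merged.bed.gz stands in for a file such as Union/CHM13v2.0_allTandemRepeats.bed.gz
printf 'chr1\t10\t40\nchr1\t50\t60\nchr2\t5\t8\n' | gzip -c > "$workdir/merged.bed.gz"
# unmerged.bed.gz contains a book-ended pair (end 40, next start 40)
printf 'chr1\t10\t40\nchr1\t40\t60\n' | gzip -c > "$workdir/unmerged.bed.gz"

# Hypothetical helper: count records that overlap or touch the previous record
count_unmerged() {
  gzip -dc "$1" | awk -F'\t' '
    $1 == prev_chrom && $2 <= prev_end { bad++ }
    { prev_chrom = $1; prev_end = $3 }
    END { print bad + 0 }'
}

merged_bad=$(count_unmerged "$workdir/merged.bed.gz")
unmerged_bad=$(count_unmerged "$workdir/unmerged.bed.gz")
echo "merged: $merged_bad, unmerged: $unmerged_bad"
```

A count of 0 on the released file would indicate that every base is counted once in the sum above.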
