# GRCh37 Union Stratifications (v3.1)
JMcDaniel started 5/13/22

## TABLE OF CONTENTS
<hr style="border:2px solid black"> </hr>  

### GENERAL INFORMATION
- [Background](#background)  
- [File Descriptions](#files)  
- [Files for release](#release)  
- [Resources](#resources)  
- [Code and File Sharing](#share)  
- [Software Tools](#tools)  
- [Get Dependency Files](#depend)  
### STRATIFICATION PREP
- [All Lowmap and Segdups](#alllow)
- [All Difficult Regions](#alldiff)
- [All Tandem Repeats](#alltr)

## GENERAL INFORMATION
<hr style="border:2px solid black"> </hr>

In [1]:
# JM working directory
pwd

/Users/jmcdani/Documents/GiaB/Benchmarking/Stratifications/v3.1_genome-stratifications/GRCh37-stratifications


## Background<a id="background"></a>
These files can be used as a standard resource of BED files for use with GA4GH benchmarking tools such as [hap.py](https://github.com/Illumina/hap.py) to stratify true positive, false positive, and false negative variant calls into difficult regions. 
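
As a rough illustration of how these BED files are typically handed to hap.py: its `--stratification` option takes a TSV with a label and a BED path on each line. The sketch below only writes such a TSV for the Union files. The labels, `strat_dir`, the TSV location, and every file name in the commented invocation are assumptions for illustration and are not part of this notebook.

```shell
# Sketch: write a stratification TSV (label <TAB> BED path) for the Union files.
# strat_dir and the labels are illustrative assumptions.
strat_dir="Union"
tsv="$(mktemp -d)/GRCh37_union_strats.tsv"
for name in alldifficultregions notinalldifficultregions \
            alllowmapandsegdupregions notinalllowmapandsegdupregions; do
  printf '%s\t%s\n' "GRCh37_${name}" "${strat_dir}/GRCh37_${name}.bed.gz"
done > "$tsv"
cat "$tsv"

# Hypothetical invocation; truth.vcf.gz, query.vcf.gz, confident.bed, ref.fa
# and out_prefix are placeholders:
# hap.py truth.vcf.gz query.vcf.gz -f confident.bed -r ref.fa -o out_prefix --stratification "$tsv"
```

Adjust the BED paths in the TSV to match where the stratification files actually live relative to your run.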

`alldifficultregions` is being updated in v3.1 stratifications to include the following new/updated files:
- AllTandemRepeatsandHomopolymers_slop5.bed.gz (now includes satellites)
- XY XTR regions
- XY ampliconic regions

`allTandemRepeats` is a new LowComplexity file for v3.1 stratification of all Tandem Repeats. 

## Files for release<a id="release"></a>
Union  
GRCh37_alldifficultregions.bed.gz  
GRCh37_notinalldifficultregions.bed.gz  
GRCh37_alllowmapandsegdupregions.bed.gz  
GRCh37_notinalllowmapandsegdupregions.bed.gz

LowComplexity  
GRCh37_allTandemRepeats.bed.gz  
GRCh37_notinallTandemRepeats.bed.gz

## Resources<a id="resources"></a>


## Code and File Sharing<a id="share"></a>
- GIAB GitHub
- GIAB FTP

## Required tools and versions<a id="tools"></a>

In [1]:
# bedtools conda environment used by JMcDaniel
conda list

# packages in environment at /Users/jmcdani/opt/anaconda3/envs/bedtools:
#
# Name                    Version                   Build  Channel
appnope                   0.1.0            py27hb466136_0  
backports                 1.1                pyhd3eb1b0_0  
backports.shutil_get_terminal_size 1.0.0              pyhd3eb1b0_3  
backports_abc             0.5                        py_1  
bedtools                  2.30.0               haa7f73a_1    bioconda
biopython                 1.74             py27h1de35cc_0  
bleach                    3.1.0                    py27_0  
bzip2                     1.0.8                h1de35cc_0  
ca-certificates           2022.3.29            hecd8cb5_0  
certifi                   2020.6.20          pyhd3eb1b0_3  
configparser              4.0.2                    py27_0  
dbus                      1.13.18              h18a8e69_0  
decorator                 4.4.0                    py27_1  
defusedxml                0.7.1              pyhd3eb1b0_0  

## Get Dependency Files<a id="depend"></a>

human.b37.genome.bed is used to prepare the complement (notin) stratifications; it was generated in GRCh37-LowComplexity.ipynb.  
File Location: LowComplexity/ref-files/human.b37.genome.bed

## STRATIFICATION FILE PREPARATION STEPS
<hr style="border:2px solid black"> </hr>

### All Low Mappability and Segdups and its complement<a id="alllow"></a>   
Merge all low-mappability and segdup regions.  
NOTE: updated on 6/28/22. PAR-X was removed from segdups, which required regenerating alldifficultregions with the revised segdups.

In [4]:
date
multiIntersectBed -i \
../v3.0-carry-over-stratifications/GRCh37/mappability/GRCh37_lowmappabilityall.bed.gz \
SegmentalDuplications/GRCh37_segdups.bed.gz \
| grep -v 'gl\|hap\|MT' \
| mergeBed -i stdin \
| bgzip -c > Union/GRCh37_alllowmapandsegdupregions.bed.gz

subtractBed \
-a ref-files/human.b37.genome.bed -b Union/GRCh37_alllowmapandsegdupregions.bed.gz \
| bgzip -c > Union/GRCh37_notinalllowmapandsegdupregions.bed.gz

Tue Jun 28 12:44:06 EDT 2022
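
To make the union step concrete, here is a minimal sketch of what pooling and merging does to overlapping intervals. It uses only `sort` and `awk` on made-up toy coordinates as a stand-in for `multiIntersectBed | mergeBed`; it is not the command used to build the release files, and the file names under `tmpdir` are hypothetical.

```shell
# Toy demonstration of the union step: pool intervals, sort, merge overlaps.
# Pure sort/awk stand-in for multiIntersectBed | mergeBed; coordinates are made up.
tmpdir=$(mktemp -d)
printf '1\t100\t200\n1\t500\t600\n' > "$tmpdir/lowmap.bed"
printf '1\t150\t300\n1\t600\t700\n2\t10\t20\n' > "$tmpdir/segdups.bed"

cat "$tmpdir/lowmap.bed" "$tmpdir/segdups.bed" \
| sort -k1,1 -k2,2n \
| awk -F'\t' 'BEGIN{OFS="\t"}
    # extend the open interval while the next one overlaps or abuts it
    $1==chr && $2+0<=cur_end { if ($3+0>cur_end) cur_end=$3+0; next }
    # otherwise flush the open interval and start a new one
    { if (chr!="") print chr, cur_start, cur_end; chr=$1; cur_start=$2+0; cur_end=$3+0 }
    END{ if (chr!="") print chr, cur_start, cur_end }' > "$tmpdir/union.bed"
cat "$tmpdir/union.bed"
```

Like `mergeBed` with default settings, the sketch joins book-ended intervals (here `500-600` and `600-700`) as well as overlapping ones.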


### All Difficult and its complement<a id="alldiff"></a>   
NOTE: updated on 6/28/22. PAR-X was removed from segdups, which required regenerating alldifficultregions with the revised segdups.

In [5]:
date 
multiIntersectBed -i \
../v3.0-carry-over-stratifications/GRCh37/mappability/GRCh37_lowmappabilityall.bed.gz \
../v3.0-carry-over-stratifications/GRCh37/GCcontent/GRCh37_gclt25orgt65_slop50.bed.gz \
LowComplexity/GRCh37_AllTandemRepeatsandHomopolymers_slop5.bed.gz \
SegmentalDuplications/GRCh37_segdups.bed.gz \
../v3.0-carry-over-stratifications/GRCh37/FunctionalTechnicallyDifficultRegions/GRCh37_BadPromoters.bed.gz \
../v3.0-carry-over-stratifications/GRCh37/OtherDifficult/GRCh37_allOtherDifficultregions.bed.gz \
XY/GRCh37_chrX_XTR.bed \
XY/GRCh37_chrY_XTR.bed \
XY/GRCh37_chrX_ampliconic.bed \
XY/GRCh37_chrY_ampliconic.bed \
| sortBed -faidx ref-files/hs37d5.fa.gz.fai -i stdin \
| mergeBed -i stdin \
| bgzip -c > Union/GRCh37_alldifficultregions.bed.gz

subtractBed \
-a ref-files/human.b37.genome.bed -b Union/GRCh37_alldifficultregions.bed.gz \
| bgzip -c > Union/GRCh37_notinalldifficultregions.bed.gz

echo "sum of regions:"
gzcat Union/GRCh37_alldifficultregions.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun 28 12:44:27 EDT 2022
sum of regions:
552733350
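
One quick sanity check on a stratification and its `notin` complement is that their base counts add up to the total of the genome BED. The sketch below shows the idea on toy files; `bp_total` is a hypothetical helper, and the toy coordinates stand in for `human.b37.genome.bed` and the Union outputs. The identity only holds when every stratification interval lies inside the genome BED, so contigs absent from that file would cause a mismatch.

```shell
# bp_total: sum of interval lengths (end - start) for a plain BED on stdin.
bp_total() { awk -F'\t' '{ sum += $3 - $2 } END { printf "%d\n", sum }'; }

# Toy stand-ins for the genome BED, a stratification, and its complement.
tmpdir=$(mktemp -d)
printf '1\t0\t1000\n2\t0\t800\n' > "$tmpdir/genome.bed"
printf '1\t100\t300\n2\t10\t20\n' > "$tmpdir/strat.bed"
printf '1\t0\t100\n1\t300\t1000\n2\t0\t10\n2\t20\t800\n' > "$tmpdir/notin.bed"

genome_bp=$(bp_total < "$tmpdir/genome.bed")
strat_bp=$(bp_total < "$tmpdir/strat.bed")
notin_bp=$(bp_total < "$tmpdir/notin.bed")

# A region set and its complement should partition the genome BED.
if [ $((strat_bp + notin_bp)) -eq "$genome_bp" ]; then
  echo "OK: $strat_bp + $notin_bp = $genome_bp"
else
  echo "MISMATCH: $strat_bp + $notin_bp != $genome_bp"
fi
```

For the real bgzipped files, the same helper can be fed with `gzip -dc file.bed.gz | bp_total`.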


### All TandemRepeats and its complement<a id="alltr"></a>  

NOTE: the allTR strats were erroneously generated in the Union ipynb but should have been generated in the LowComplexity ipynb. The code has been copied over to LowComplexity for future use, and the files have been transferred to LowComplexity.

In [6]:
date
multiIntersectBed -i \
LowComplexity/GRCh37_AllTandemRepeats_lt51bp_slop5.bed.gz \
LowComplexity/GRCh37_AllTandemRepeats_51to200bp_slop5.bed.gz \
LowComplexity/GRCh37_AllTandemRepeats_201to10000bp_slop5.bed.gz \
LowComplexity/GRCh37_AllTandemRepeats_gt10000bp_slop5.bed.gz \
| sortBed -faidx ref-files/hs37d5.fa.gz.fai -i stdin \
| mergeBed -i stdin \
| bgzip -c > Union/GRCh37_allTandemRepeats.bed.gz

subtractBed \
-a ref-files/human.b37.genome.bed -b Union/GRCh37_allTandemRepeats.bed.gz \
| bgzip -c > Union/GRCh37_notinallTandemRepeats.bed.gz

echo "sum of regions:"
gzcat Union/GRCh37_allTandemRepeats.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun 28 12:46:56 EDT 2022
sum of regions:
91043560
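
The "sum of regions" cells call `gzcat`, which is not installed on every Linux system. A minimal portable sketch of the same total, using `gzip -dc` (bgzip output is gzip-compatible), is shown below on toy files; the file names and `totals.tsv` are hypothetical.

```shell
# Sketch: report total bp per compressed BED using gzip -dc instead of gzcat.
# Toy gzipped files stand in for the Union/LowComplexity outputs.
tmpdir=$(mktemp -d)
printf '1\t100\t300\n2\t10\t20\n' | gzip -c > "$tmpdir/toy_a.bed.gz"
printf '1\t0\t50\n' | gzip -c > "$tmpdir/toy_b.bed.gz"

for f in "$tmpdir"/toy_*.bed.gz; do
  # sum end - start over all intervals in the file
  bp=$(gzip -dc "$f" | awk -F'\t' '{ sum += $3 - $2 } END { printf "%d", sum }')
  printf '%s\t%s\n' "$(basename "$f")" "$bp"
done > "$tmpdir/totals.tsv"
cat "$tmpdir/totals.tsv"
```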
