# T2T-CHM13v2.0 LowComplexity Stratifications (v3.1)
JMcDaniel started 2022-03-04

## TABLE OF CONTENTS
<hr style="border:2px solid black"> </hr>

### General Information 
- [Background](#background)
- [File Descriptions](#files)
- [Files for release](#release)
- [Resources](#resources)
- [Code and File Sharing](#share)
- [Required Tools and Versions](#tools)
- [Dependency Files](#depend)
- [JM local working directory](#wd)

### Stratification Preparation Steps
#### Individual Files
**1. [homopolymers](#homopolymers)**  
**2. [di-nucleotide repeats](#dinuc)**  
**3. [tri-nucleotide repeats](#trinuc)**  
**4. [quad-nucleotide repeats](#quadnuc)**  
**5. [Create CHM13v2.0.genome file](#genomefile)**  
**6. [find range of repeats](#findreprange)**  
**7. [Find imperfect homopolymers](#imphomo)**  
**8. [SimpleRepeats & LowComplexity in CHM13 from RepeatMasker](#repmask)**  
**9. [SimpleRepeats in CHM13v2.0 from TRF](#trf)**  
**10. [Satellites](#satellites)**  
#### Intersected Files
**11. [Merge all homopolymers and find complement](#mergehopol)**  
**12. [Merge exact repeats and T2T repeats bed files and subtract homopolymers](#mergeexact)**  
**13. [Merge all homopolymers and TRs and find complement](#mergehomTR)**  
**14. [All Tandem Repeat](#alltr)**  



#### JM local working directory<a id="wd"></a>  
After the stratifications were completed, the `T2T-CHM13v2.0-stratifications` directory was moved to `/Users/jmcdani/Documents/GiaB/Benchmarking/Stratifications/v3.1_genome-stratifications`. All paths in this ipynb are relative to the old directory listed below.

In [2]:
pwd

/Users/jmcdani/Documents/GiaB/Benchmarking/assembly-benchmark-development/T2T-CHM13v2.0-stratifications


## GENERAL INFORMATION
<hr style="border:2px solid black"> </hr>

## Background<a id="background"></a>
This ipynb was created to combine all code necessary for generating LowComplexity stratifications and is new for v3.1. For consistency, the same ipynb structure and corresponding code were applied to all references (GRCh37, GRCh38 and CHM13v2.0) for the v3.1 LowComplexity stratifications.

Prepare stratifications for use with T2T CHM13 v2.0 as a reference.
- chr1-22 and chrX from CHM13
- chrY from HG002 (NA24385)

These files can be used as a standard resource of BED files for use with GA4GH benchmarking tools such as [hap.py](https://github.com/Illumina/hap.py) to stratify true positive, false positive, and false negative variant calls into regions with different types and sizes of low complexity sequence (e.g., homopolymers, STRs, VNTRs, other locally repetitive sequences). 
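
As a rough sketch of how these BED files might be handed to hap.py: hap.py accepts a stratification list as a TSV (region name, then BED path) through its `--stratification` option. The snippet below only builds such a TSV for two of the released files. The `LowComplexity/` prefix, the region names derived from the file names, and the VCF/output names in the commented command are assumptions for illustration, not values taken from this notebook.

```shell
# Build a two-column stratification list (name <TAB> BED path) for hap.py.
# File names come from the release list; the directory prefix is an assumption.
files="CHM13v2.0_SimpleRepeat_homopolymer_4to6_slop5.bed.gz
CHM13v2.0_AllTandemRepeats_lt51bp_slop5.bed.gz"

strat_tsv=$(printf '%s\n' "$files" | awk -v OFS='\t' '{
    name = $0
    sub(/\.bed\.gz$/, "", name)      # use the file name without extension as the region name
    print name, "LowComplexity/" $0
}')
printf '%s\n' "$strat_tsv"

# Hypothetical invocation (truth.vcf.gz, query.vcf.gz and out_prefix are placeholders):
# printf '%s\n' "$strat_tsv" > strat.tsv
# hap.py truth.vcf.gz query.vcf.gz -r T2T_files/chm13v2.0.fa -o out_prefix --stratification strat.tsv
```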

Code used to generate the "LowComplexity" stratifications was adapted from scripts written by JZook for the GRCh37/38 stratifications: `FindSimpleRepeats_GRCh38_v2.sh`, `GRCh38_SimpleRepeat_homopolymer_gt20.sh`, `GRCh38_SimpleRepeat_imperfecthomoplgt20_slop5.sh`, located in the GIAB GitHub for the v3.0 LowComplexity stratifications.  

**Satellites stratification**  
Satellites are a new addition for the v3.1 stratifications. They are derived from the T2T Centromeric and Pericentromeric Satellite Annotations (cenSat) for CHM13v2.0. The broad definition of peri/centromeric regions on each chromosome includes the satellite-rich regions and 5 Mb of sequence on the p-arm and q-arm. Satellite DNA consists of very large arrays of tandemly repeating, non-coding DNA. It is the main component of functional centromeres and forms the main structural constituent of heterochromatin.

## File Descriptions<a id="files"></a>
- `CHM13v2.0_SimpleRepeat*_slop5.bed.gz`\
perfect repeats of different unit sizes (i.e., homopolymers, and dinucleotide, trinucleotide, and quadnucleotide STRs) and different total repeat lengths (i.e., <=50bp, 51-200bp, or >200bp)
- `CHM13v2.0_SimpleRepeat_imperfecthomopolgt*_slop5.bed.gz`\
perfect homopolymers >*bp as well as imperfect homopolymers >*bp, where a single base is repeated except for 1bp interruptions by a different base
- `CHM13v2.0_AllTandemRepeats_*_slop5.bed.gz`\
union of SimpleRepeat dinucleotide, trinucleotide, and quadnucleotide STRs as well as T2T RepeatMasker_LowComplexity, RepeatMasker_SimpleRepeats,TRF_SimpleRepeats and satellites. 
- `CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz`\
union of all perfect homopolymers >6bp and imperfect homopolymers >10bp
- `CHM13v2.0_AllTandemRepeatsandHomopolymers_slop5.bed.gz`\
union of AllTandemRepeats_* with AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz
- `CHM13v2.0_satellites_slop5.bed.gz`\
satellite tandem repeats from the T2T CHM13v2.0 censat track (centromeric and pericentromeric regions).
- `CHM13v2.0_notin*_slop5.bed.gz`\
are non-overlapping complements of the stratification regions (i.e., genome after excluding the regions).  
- `CHM13v2.0_allTandemRepeats.bed.gz`  
union of all tandem repeats  
- `notin`  
complement regions are non-overlapping genomic regions that remain after excluding stratification regions.


## Files for release<a id="release"></a>
CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz  
CHM13v2.0_AllTandemRepeats_201to10000bp_slop5.bed.gz  
CHM13v2.0_AllTandemRepeats_51to200bp_slop5.bed.gz  
CHM13v2.0_AllTandemRepeats_gt10000bp_slop5.bed.gz  
CHM13v2.0_AllTandemRepeats_gt100bp_slop5.bed.gz  
CHM13v2.0_AllTandemRepeats_lt51bp_slop5.bed.gz  
CHM13v2.0_AllTandemRepeatsandHomopolymers_slop5.bed.gz  
CHM13v2.0_allTandemRepeats.bed.gz  
CHM13v2.0_notinallTandemRepeats.bed.gz  
CHM13v2.0_SimpleRepeat_diTR_11to50_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_diTR_51to200_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_diTR_gt200_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_homopolymer_4to6_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_homopolymer_7to11_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_homopolymer_gt11_slop5.bed.gz   
CHM13v2.0_SimpleRepeat_homopolymer_gt20_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_imperfecthomopolgt20_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_quadTR_20to50_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_quadTR_51to200_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_quadTR_gt200_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_triTR_15to50_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_triTR_51to200_slop5.bed.gz  
CHM13v2.0_SimpleRepeat_triTR_gt200_slop5.bed.gz  
CHM13v2.0_notinAllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz  
CHM13v2.0_notinAllTandemRepeatsandHomopolymers_slop5.bed.gz  
CHM13v2.0_satellites_slop5.bed.gz  
CHM13v2.0_notinsatellites_slop.bed.gz

## Resources<a id="resources"></a>
- [T2T annotations](https://docs.google.com/spreadsheets/d/13BXuEFB904aje6zWXyZ0znZnXvQiu1qxKADA2uV2JU4/edit#gid=1966247802)
- [Stratification JZ would like to see and info on generating them](https://docs.google.com/spreadsheets/d/1xSsmq48pBJVOXa2dP-845I4qkCzTCzyR_iXRs9RPdBE/edit#gid=0)
- [GitHub v3.0 LowComplexity stratifications scripts](https://github.com/genome-in-a-bottle/genome-stratifications/tree/master/GRCh38/LowComplexity)
- [T2T CHM13 v2.0 assembly](https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/chm13v2.0.fasta)
- [JZ Slack message about satellites](https://t2t-consortium.slack.com/archives/D015NGSRTUK/p1648647830778769)

## Code and File Sharing<a id="share"></a>
- GIAB GitHub
- GIAB FTP

## Required tools and versions<a id="tools"></a>

In [1]:
#needed to run findSimpleRegions_quad.py from inVitae and it requires python 2.7.15.  I have this python version in my samtools_env conda environment.
python --version

Python 2.7.15 :: Anaconda, Inc.


In [10]:
# samtools_env conda environment used by JMcDaniel for steps 1-5
conda list

# packages in environment at /Users/jmcdani/opt/anaconda3/envs/samtools_env:
#
# Name                    Version                   Build  Channel
appnope                   0.1.0            py27hb466136_0    anaconda
backports                 1.1                pyhd3eb1b0_0  
backports.shutil_get_terminal_size 1.0.0              pyhd3eb1b0_3  
backports_abc             0.5                        py_1  
bamtools                  2.5.1                h5c9b4e4_4    bioconda
bcftools                  1.8                  h4da6232_3    bioconda
bedtools                  2.27.1               h5c9b4e4_3    bioconda
biopython                 1.74             py27h1de35cc_0  
blas                      1.0                         mkl  
bleach                    3.3.1              pyhd3eb1b0_0  
bzip2                     1.0.6                h1de35cc_5  
ca-certificates           2021.10.26           hecd8cb5_2  
certifi                   2020.6.20          pyhd3eb1b0_3  
configparser             

In [1]:
# bedtools conda environment used by JMcDaniel for steps 6-13
conda list

# packages in environment at /Users/jmcdani/opt/anaconda3/envs/bedtools:
#
# Name                    Version                   Build  Channel
appnope                   0.1.0            py27hb466136_0  
attrs                     21.4.0             pyhd3eb1b0_0  
backcall                  0.2.0                      py_0    anaconda
backports                 1.1                pyhd3eb1b0_0  
backports.shutil_get_terminal_size 1.0.0              pyhd3eb1b0_3  
backports_abc             0.5                        py_1  
bedtools                  2.30.0               haa7f73a_1    bioconda
biopython                 1.74             py27h1de35cc_0  
bleach                    3.3.1              pyhd3eb1b0_0  
bzip2                     1.0.8                h1de35cc_0  
ca-certificates           2022.2.1             hecd8cb5_0  
certifi                   2020.6.20          pyhd3eb1b0_3  
configparser              4.0.2                    py27_0  
dbus                      1.13.18              h1

## Get Dependency Files<a id="depend"></a>

**Download T2T CHM13 v2.0 assembly**

File was downloaded outside ipynb on 4/1/22 using  
`wget https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz -P T2T_files/`  
This file was then unzipped and indexed. 

**Download T2T CHM13 Repeatmasker file**

In [10]:
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RepeatMasker_4.1.2p1.bed -P T2T_files/

--2022-02-24 11:30:59--  https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13v2.0_RepeatMasker_4.1.2p1.bed
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.92.165.248
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.92.165.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 356948938 (340M) [application/vnd.realvnc.bed]
Saving to: ‘T2T_files/chm13v2.0_RepeatMasker_4.1.2p1.bed’


2022-02-24 11:32:01 (5.55 MB/s) - ‘T2T_files/chm13v2.0_RepeatMasker_4.1.2p1.bed’ saved [356948938/356948938]



**Download T2T TRF SimpleRepeats**  
Transferred via globus web interface `T2T-CHM13v2_trf.bed` from https://app.globus.org/file-manager?origin_id=9db1f0a6-a05a-11ea-8f06-0a21f750d19b&origin_path=%2Fteam-segdups%2FAssembly_analysis%2FMasked%2F/T2T_CHM13v2_trf.bed to `T2T_files/`

**Download censat (satellites) track files from browser including chrY**  
`t2t_censat_CHM13v2.0_trackv2.0.bed` and the `.html` file describing it were provided as Slack downloads in the Slack message about satellites referenced in the "Resources" section:
https://t2t-consortium.slack.com/files/ULT7E06GL/F039A96RY84/t2t_censat_chm13v2.0_trackv2.0.bed
https://t2t-consortium.slack.com/files/ULT7E06GL/F039A96RY84/t2t_censat_chm13v2.0_trackv2.0.html

## STRATIFICATION FILE PREPARATION STEPS
<hr style="border:2px solid black"> </hr>

## 1. homopolymers<a id="homopolymers"></a>


In [4]:
#findSimpleRegions_quad.py requires python 2.7.15. Run in JM conda env 'samtools_env' where this version of python is located.
python LowComplexity/scripts/findSimpleRegions_quad.py -p 3 -d 100000 -t 100000 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p3.bed
python LowComplexity/scripts/findSimpleRegions_quad.py -p 6 -d 100000 -t 100000 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p6.bed 
python LowComplexity/scripts/findSimpleRegions_quad.py -p 11 -d 100000 -t 100000 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p11.bed
python LowComplexity/scripts/findSimpleRegions_quad.py -p 20 -d 100000 -t 100000 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p20.bed
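
To make the `-p` threshold concrete, here is a toy awk sketch that reports perfect homopolymer runs longer than a minimum length as 0-based, half-open BED intervals. It is not `findSimpleRegions_quad.py`. The sequence, the contig name `toy`, and the output layout (including the `unit=` column) are assumptions for illustration. Reading `-p 3` as "runs longer than 3bp" is inferred from the `4to6` file naming used later in this notebook.

```shell
# Toy illustration only: perfect homopolymer runs longer than min_len.
seq="ACGTTTTTACCCCCCCCG"   # hypothetical sequence
min_len=3                   # report runs > 3 bp (assumed meaning of "-p 3")

homopolymers=$(printf '%s\n' "$seq" | awk -v min="$min_len" -v OFS='\t' '
{
    n = length($0)
    start = 0                                   # 0-based start of the current run
    for (i = 2; i <= n + 1; i++) {
        # a run ends when the base changes or the sequence ends
        if (i > n || substr($0, i, 1) != substr($0, i - 1, 1)) {
            run = (i - 1) - start
            if (run > min) print "toy", start, i - 1, "unit=" substr($0, i - 1, 1)
            start = i - 1
        }
    }
}')
printf '%s\n' "$homopolymers"
```

With this input the 5bp T run and the 8bp C run are reported, while single bases are skipped.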

## 2. di-nucleotide repeats<a id="dinuc"></a>

In [1]:
#findSimpleRegions_quad.py requires python 2.7.15. Run in JM conda env 'samtools_env' where this version of python is located.
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 11 -t 100000 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d11.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 51 -t 100000 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d51.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 201 -t 100000 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d201.bed & 

[1] 37693
[2] 37694
[3] 37695


## 3. tri-nucleotide repeats<a id="trinuc"></a>

In [6]:
#findSimpleRegions_quad.py requires python 2.7.15. Run in JM conda env 'samtools_env' where this version of python is located.
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 15 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_t15.bed
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 51 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_t51.bed
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 201 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_t201.bed 

[1]   Done                    python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 11 -t 100000 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d11.bed
[3]+  Done                    python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 201 -t 100000 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d201.bed
[2]+  Done                    python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 51 -t 100000 -q 100000 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d51.bed


## 4. quad-nucleotide repeats<a id="quadnuc"></a>

In [7]:
#findSimpleRegions_quad.py requires python 2.7.15. Run in JM conda env 'samtools_env' where this version of python is located.
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 100000 -q 20 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_q20.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 100000 -q 51 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_q51.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 100000 -q 201 T2T_files/chm13v2.0.fa LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_q201.bed & 

[1] 98529
[2] 98530
[3] 98531


## 5. Create CHM13v2.0.genome file for use with bedtools slopbed (-g .genome) and subtractbed (genome.bed)<a id="genomefile"></a> 
This generates files with the chromosome sizes for CHM13v2.0.

In [11]:
# run in JM conda env 'samtools_env'
samtools faidx T2T_files/chm13v2.0.fa
# prints the 1st and 2nd columns of the index file (name, length), tab-separated
awk -v OFS='\t' '{print $1,$2}' T2T_files/chm13v2.0.fa.fai > LowComplexity/intermediatefiles/CHM13v2.0.genome

# prints the 1st column, a literal 0, and the 2nd column, tab-separated. This results in a 3-column BED file spanning each chromosome.
awk -v OFS='\t' '{print $1,"0",$2}' T2T_files/chm13v2.0.fa.fai > LowComplexity/intermediatefiles/CHM13v2.0.genome.bed
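
For reference, a small sketch of what the two awk commands produce, using a made-up two-contig `.fai` (contig names and sizes are hypothetical). The `.genome` form is name and length. The `.genome.bed` form is a 3-column BED covering each contig from 0 to its length.

```shell
# Hypothetical .fai content: name, length, offset, linebases, linewidth
fai=$(printf 'chrA\t1000\t6\t60\t61\nchrB\t500\t1029\t60\t61\n')

# 2-column chromosome sizes file (for slopBed -g)
genome=$(printf '%s\n' "$fai" | awk -v OFS='\t' '{print $1, $2}')

# 3-column BED spanning each chromosome (for subtractBed / complements)
genome_bed=$(printf '%s\n' "$fai" | awk -v OFS='\t' '{print $1, 0, $2}')

printf '%s\n' "$genome"
printf '%s\n' "$genome_bed"
```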

## 6. find LowComplexity and SimpleRepeat ranges <a id="findreprange"></a>
activated bedtools conda env from here on

In [2]:
# HOMOPOLYMERS
subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p3.bed -b LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p6.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_4to6.bed.gz
subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p6.bed -b LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p11.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_7to11.bed.gz
sed 's/^chr//' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p11.bed | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_gt11.bed.gz
sed 's/^chr//' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p20.bed | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_gt20.bed.gz

In [3]:
# DI-NUCLEOTIDES
subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d11.bed -b LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d51.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_11to50.bed.gz
subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d51.bed -b LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d201.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_51to200.bed.gz
sed 's/^chr//' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d201.bed | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_gt200.bed.gz

In [4]:
# TRI-NUCLEOTIDES
subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_t15.bed -b LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_t51.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_15to50.bed.gz
subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_t51.bed -b LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_t201.bed | sed 's/^chr//'  | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_51to200.bed.gz
sed 's/^chr//' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_t201.bed | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_gt200.bed.gz

In [5]:
# QUAD-NUCLEOTIDES
subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_q20.bed -b LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_q51.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_20to50.bed.gz
subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_q51.bed -b LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_q201.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_51to200.bed.gz
sed 's/^chr//' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_q201.bed | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_gt200.bed.gz
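
The `sed | grep | sed | sort | sed` chain used in the cells above keeps only chr1-22, X and Y and sorts them in karyotypic order by temporarily renaming X and Y to 23 and 24. A minimal sketch on a made-up BED (all intervals and the `chr1_random`/`chrM` contigs are hypothetical):

```shell
# Hypothetical unsorted input, including contigs that should be dropped
bed=$(printf 'chrX\t5\t9\nchr10\t1\t4\nchr2\t7\t8\nchr2\t3\t6\nchrY\t2\t3\nchr1_random\t0\t5\nchrM\t0\t9\n')

sorted=$(printf '%s\n' "$bed" |
    sed 's/^chr//' |                      # strip prefix so names sort numerically
    grep "^[0-9XY]" | grep -v '_' |       # keep autosomes, X, Y; drop alt/random contigs
    sed 's/^X/23/;s/^Y/24/' |             # X, Y -> 23, 24 for numeric sort
    sort -k1,1n -k2,2n -k3,3n |
    sed 's/^23/X/;s/^24/Y/;s/^/chr/')     # restore names and prefix
printf '%s\n' "$sorted"
```

chr2 sorts before chr10, and chrX and chrY come last, which a plain lexicographic sort would not give.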

In [6]:
# Add 5bp slop on either side of repeats to ensure insertions at the edge of the repeat and any adjacent repetitive structures are captured

slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_4to6.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_4to6_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_7to11.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_7to11_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_gt11.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_gt11_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_gt20.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_gt20_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_11to50.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_11to50_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_51to200.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_51to200_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_gt200.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_gt200_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_15to50.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_15to50_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_51to200.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_51to200_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_gt200.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_gt200_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_20to50.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_20to50_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_51to200.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_51to200_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_gt200.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | bgzip -c  > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_gt200_slop5_withUNITS.bed.gz
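
The effect of `slopBed -b 5 -g <genome>` can be sketched with awk: each interval grows by 5bp on both sides and is clamped to the chromosome bounds given in the `.genome` file. This is an approximation for illustration, not a replacement for bedtools. The contig `chrA`, its size, and the intervals are hypothetical.

```shell
tmp=$(mktemp -d)
printf 'chrA\t100\n' > "$tmp/toy.genome"
printf 'chrA\t2\t10\tunit=A\nchrA\t50\t60\tunit=T\nchrA\t97\t100\tunit=C\n' > "$tmp/toy.bed"

slopped=$(awk -v OFS='\t' -v pad=5 '
    NR == FNR { size[$1] = $2; next }           # first file: chromosome sizes
    {
        s = $2 - pad; if (s < 0) s = 0          # clamp at chromosome start
        e = $3 + pad; if (e > size[$1]) e = size[$1]   # clamp at chromosome end
        $2 = s; $3 = e
        print                                   # extra columns (e.g. unit=) are kept
    }' "$tmp/toy.genome" "$tmp/toy.bed")
rm -rf "$tmp"
printf '%s\n' "$slopped"
```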

In [7]:
# Make 3 column versions so that hap.py doesn't perform more granular stratification

gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_4to6_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_homopolymer_4to6_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_7to11_slop5_withUNITS.bed.gz | cut -f1-3  | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_homopolymer_7to11_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_gt11_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_homopolymer_gt11_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_homopolymer_gt20_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_homopolymer_gt20_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_11to50_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_diTR_11to50_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_51to200_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_diTR_51to200_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_diTR_gt200_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_diTR_gt200_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_15to50_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_triTR_15to50_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_51to200_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_triTR_51to200_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_triTR_gt200_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_triTR_gt200_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_20to50_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_quadTR_20to50_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_51to200_slop5_withUNITS.bed.gz| cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_quadTR_51to200_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_quadTR_gt200_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/CHM13v2.0_SimpleRepeat_quadTR_gt200_slop5.bed.gz

## 7. Find imperfect homopolymers >10bp & >20bp<a id="imphomo"></a>
Imperfect homopolymers are found by merging perfect homopolymers >=4bp of the same base that are separated by 1bp. 5bp of padding is then added on both sides to include errors around the edges.

In [8]:
# IMPERFECT HOMOPOLYMERS >10
# SimpleRepeat_homopolymer_gt11.bed.gz previously generated would be the corresponding "PERFECT HOMOPOLYMERS" file

grep 'unit=C' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>10' > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_C.bed
grep 'unit=G' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>10' > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_G.bed
grep 'unit=A' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>10' > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_A.bed
grep 'unit=T' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>10' > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_T.bed

multiIntersectBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_C.bed \
	LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_G.bed \
	LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_A.bed \
	LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_T.bed | 
	sed 's/^chr//' |
	cut -f1-3 | grep "^[0-9XY]" | grep -v '_' |
	sed 's/^/chr/' |
	slopBed -i stdin -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome |
	sed 's/^chr//' |
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
	bgzip -c > LowComplexity/CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_slop5.bed.gz

In [9]:
# IMPERFECT HOMOPOLYMERS >20

grep 'unit=C' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>20' > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt20_C.bed
grep 'unit=G' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>20' > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt20_G.bed
grep 'unit=A' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>20' > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt20_A.bed
grep 'unit=T' LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>20' > LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt20_T.bed

multiIntersectBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt20_C.bed \
	LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt20_G.bed \
	LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt20_A.bed \
	LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_imperfecthomopolgt20_T.bed |
	sed 's/^chr//' |
	cut -f1-3 | grep "^[0-9XY]" | grep -v '_' |
	sed 's/^/chr/' |
	slopBed -i stdin -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome |
	sed 's/^chr//' |
	sed 's/^X/23/;s/^Y/24/' |
	sort -k1,1n -k2,2n -k3,3n |
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' |
	mergeBed -i stdin |
	bgzip -c > LowComplexity/CHM13v2.0_SimpleRepeat_imperfecthomopolgt20_slop5.bed.gz
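
A toy sketch of the merge-then-filter logic used for imperfect homopolymers: sorted perfect runs of one base are joined when at most 1bp separates them (as with `mergeBed -d 1`), and only merged intervals longer than 10bp are kept (as with `awk '$3-$2>10'`). The awk emulation and the intervals on `chrA` are assumptions for illustration. The actual cells use bedtools.

```shell
# Hypothetical, position-sorted perfect runs (>3 bp) of a single base
runs=$(printf 'chrA\t10\t16\nchrA\t17\t22\nchrA\t30\t34\nchrA\t40\t52\n')

imperfect=$(printf '%s\n' "$runs" | awk -v OFS='\t' -v d=1 -v min=10 '
function emit() { if (cur_end - cur_start > min) print cur_chrom, cur_start, cur_end }
{
    if (NR > 1 && $1 == cur_chrom && $2 + 0 <= cur_end + d) {
        if ($3 + 0 > cur_end) cur_end = $3 + 0      # extend the current merged interval
    } else {
        if (NR > 1) emit()                          # close the previous interval
        cur_chrom = $1; cur_start = $2 + 0; cur_end = $3 + 0
    }
}
END { if (NR > 0) emit() }')
printf '%s\n' "$imperfect"
```

Here the 6bp and 5bp runs separated by one interrupting base merge into a 12bp imperfect homopolymer, the isolated 4bp run is dropped, and the 12bp perfect run passes on its own. This is why the output file contains both perfect and imperfect homopolymers above the length cutoff.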

## 8. Get SimpleRepeats and LowComplexity from CHM13v2.0 RepeatMasker file <a id="repmask"></a>
note: format of CHM13 RepeatMasker bed (`chm13v2.0_RepeatMasker_4.1.2p1.bed`) is different from UCSC GRCh3X rmsk. Code below was updated to use the correct fields in the CHM13 formatted rmsk bed.

In [10]:
# SIMPLE REPEATS
grep Simple_repeat T2T_files/chm13v2.0_RepeatMasker_4.1.2p1.bed |
	awk '{ print $1 "\t" $2 "\t" $3 ; }' |
	sed 's/^chr//' |
	grep "^[0-9XY]" | grep -v '_' |
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Simple_repeat.bed.gz
    
gunzip -c LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Simple_repeat.bed.gz |
	awk '$3-$2<51' | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Simple_repeat_lt51.bed.gz
    
gunzip -c LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Simple_repeat.bed.gz |
	awk '$3-$2>50 && $3-$2<201' | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Simple_repeat_51to200.bed.gz
    
gunzip -c LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Simple_repeat.bed.gz |
	awk '$3-$2>200' | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Simple_repeat_gt200.bed.gz
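
The three awk length filters (`<51`, `>50 && <201`, `>200`) partition the merged intervals into non-overlapping size bins. A small sketch with made-up intervals at the bin boundaries, labeling each interval with the bin it would fall into:

```shell
# Hypothetical merged intervals with lengths 50, 51, 200 and 201
bed=$(printf 'chrA\t0\t50\nchrA\t100\t151\nchrA\t300\t500\nchrA\t600\t801\n')

binned=$(printf '%s\n' "$bed" | awk -v OFS='\t' '{
    len = $3 - $2
    if (len < 51)       bin = "lt51"       # same test as $3-$2<51
    else if (len < 201) bin = "51to200"    # same test as $3-$2>50 && $3-$2<201
    else                bin = "gt200"      # same test as $3-$2>200
    print $1, $2, $3, len, bin
}')
printf '%s\n' "$binned"
```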

In [11]:
# LOW COMPLEXITY

grep Low_complexity T2T_files/chm13v2.0_RepeatMasker_4.1.2p1.bed |
	awk '{ print $1 "\t" $2 "\t" $3 ; }' |
	sed 's/^chr//' |
	grep "^[0-9XY]" | grep -v '_' |
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
    bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Low_complexity.bed.gz
    
gunzip -c LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Low_complexity.bed.gz |
	awk '$3-$2<51' | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Low_complexity_lt51.bed.gz

gunzip -c LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Low_complexity.bed.gz |
	awk '$3-$2>50 && $3-$2<201' | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Low_complexity_51to200.bed.gz

gunzip -c LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Low_complexity.bed.gz |
	awk '$3-$2>200' | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Low_complexity_gt200.bed.gz
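
The three `awk` length filters (`<51`, `>50 && <201`, `>200`) are meant to split a merged BED into non-overlapping size bins, so every interval should land in exactly one output file. A small sketch on made-up intervals with lengths at the bin edges (50, 51, 200 and 201 bp) illustrates this; the variable names are hypothetical.

```shell
# Toy merged BED: interval lengths are 50, 51, 200 and 201 bp (made-up coordinates).
bed=$(printf 'chr1\t0\t50\nchr1\t100\t151\nchr1\t300\t500\nchr1\t600\t801\n')

# Same filters as in the cells, applied to the toy data.
lt51=$(printf '%s\n' "$bed" | awk '$3-$2<51')
mid=$(printf '%s\n' "$bed" | awk '$3-$2>50 && $3-$2<201')
gt200=$(printf '%s\n' "$bed" | awk '$3-$2>200')

# Count intervals per bin; the counts should add up to the input count.
n_lt51=$(printf '%s\n' "$lt51" | awk 'END{print NR}')
n_mid=$(printf '%s\n' "$mid" | awk 'END{print NR}')
n_gt200=$(printf '%s\n' "$gt200" | awk 'END{print NR}')
echo "lt51=$n_lt51 51to200=$n_mid gt200=$n_gt200"
# lt51=1 51to200=2 gt200=1
```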


## 9. Get SimpleRepeats from CHM13v2.0 TRF <a id="trf"></a>
Note: the CHM13 simple repeat file (`T2T-CHM13v2_trf.bed`) is formatted differently from the UCSC GRCh37/GRCh38 simpleRepeat (TRF) file. The code below was updated to use the correct fields of the CHM13-formatted simpleRepeat (from TRF) BED.

In [12]:
cut -f1-3 T2T_files/T2T-CHM13v2_trf.bed |
	sed 's/^chr//' |
	grep "^[0-9XY]" | grep -v '_' |
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_trf_simpleRepeat.bed.gz

gunzip -c LowComplexity/intermediatefiles/CHM13v2.0_trf_simpleRepeat.bed.gz |
	awk '$3-$2<51' | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_trf_simpleRepeat_lt51.bed.gz

gunzip -c LowComplexity/intermediatefiles/CHM13v2.0_trf_simpleRepeat.bed.gz |
	awk '$3-$2>50 && $3-$2<201' | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_trf_simpleRepeat_51to200.bed.gz

gunzip -c LowComplexity/intermediatefiles/CHM13v2.0_trf_simpleRepeat.bed.gz |
	awk '$3-$2>200' | 
	bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_trf_simpleRepeat_gt200.bed.gz


## 10. Satellites (censat)<a id="satellites"></a>
Remove regions in the centromere that are not classified as satellite repeats (often they are segmental duplications or other challenging regions). In the code below, `grep -v 'ct_'` drops every line of the censat track that contains `ct_`.

In [28]:
date
grep -v 'ct_' T2T_files/t2t_censat_chm13v2.0_trackv2.0.bed \
| sortBed -faidx T2T_files/chm13v2.0.fa.fai -i stdin \
| mergeBed -i stdin \
| bgzip -c > LowComplexity/intermediatefiles/CHM13v2.0_satellites.bed.gz

slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_satellites.bed.gz -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome \
| mergeBed -i stdin \
| bgzip -c  > LowComplexity/CHM13v2.0_satellites_slop5.bed.gz

echo "sum of regions:"
gzcat LowComplexity/CHM13v2.0_satellites_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:48:44 EDT 2022
sum of regions:
240530553
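
To make the `slopBed -b 5` plus `mergeBed` step concrete, here is a rough stand-in written in plain `awk` rather than bedtools. It is only a sketch on made-up data: it pads each interval by 5 bp on both sides, clamps to the chromosome bounds taken from a toy genome file, and merges intervals that then overlap or touch. The 100 bp `chr1` and all coordinates are assumptions for illustration.

```shell
# Toy genome file (chromosome name and length), standing in for CHM13v2.0.genome.
genome=$(mktemp)
printf 'chr1\t100\n' > "$genome"

# Pad by b bp on each side, clamp to [0, chromosome length], then merge.
# Input must already be sorted by chromosome and start.
slop5_merged=$(printf 'chr1\t2\t10\nchr1\t18\t30\nchr1\t90\t98\n' |
  awk -v OFS='\t' -v b=5 '
    NR==FNR { len[$1]=$2; next }                 # first file: chromosome lengths
    {
      s=$2-b; if (s<0) s=0                       # padded start, clamped at 0
      e=$3+b; if (e>len[$1]) e=len[$1]           # padded end, clamped at chromosome end
      if (chrom==$1 && s<=cur_e) {               # overlaps or touches current block
        if (e>cur_e) cur_e=e
      } else {
        if (chrom!="") print chrom, cur_s, cur_e
        chrom=$1; cur_s=s; cur_e=e
      }
    }
    END { if (chrom!="") print chrom, cur_s, cur_e }' "$genome" -)
rm -f "$genome"

printf '%s\n' "$slop5_merged"
# chr1  0   35
# chr1  85  100

# Same "sum of regions" calculation as in the cell.
total=$(printf '%s\n' "$slop5_merged" | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}')
echo "$total"
# 50
```

The first two toy intervals are 8 bp apart before padding and fuse into one block afterwards, which is why slop is followed by a second merge in the cell.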


In [29]:
date
#notin
subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0.genome.bed -b LowComplexity/CHM13v2.0_satellites_slop5.bed.gz | bgzip -c > LowComplexity/CHM13v2.0_notinsatellites_slop5.bed.gz

echo "sum of regions:"
gzcat LowComplexity/CHM13v2.0_notinsatellites_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:48:47 EDT 2022
sum of regions:
2876761517
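
A quick sanity check for a stratification and its `notin` complement: because the complement is produced by subtracting the stratification from `CHM13v2.0.genome.bed`, the two "sum of regions" values should add up to the total span of that genome BED. The sketch below only performs the addition on the two sums reported above; comparing the result against the genome BED is left as a manual step, since that file is built in an earlier section.

```shell
# Sums copied from the cell outputs above.
in_sat=240530553
notin_sat=2876761517

# Stratification plus complement should equal the span of the genome BED.
genome_span=$((in_sat + notin_sat))
echo "$genome_span"
# 3117292070
```

The same check can be applied to any of the other stratification/complement pairs in this notebook.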


## 11. Merge all homopolymers and find complement<a id="mergehopol"></a>

Note: `subtractBed` produced a segmentation fault with bedtools v2.27. Updates to bedtools/samtools have been made to try to handle these faults; rerunning with v2.30 worked.

In [17]:
slopBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_p6.bed -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome |
	mergeBed -i stdin |
	multiIntersectBed -i stdin LowComplexity/CHM13v2.0_SimpleRepeat_imperfecthomopolgt10_slop5.bed.gz |
	sed 's/^chr//' |
	cut -f1-3 | grep "^[0-9XY]" | grep -v '_' |
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
	bgzip -c > LowComplexity/CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz

subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0.genome.bed -b LowComplexity/CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | bgzip -c > LowComplexity/CHM13v2.0_notinAllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz

## 12. Multiintersect exact repeats and rmsk/trf repeats bed files and subtract homopolymers<a id="mergeexact"></a>

In [30]:
#Intermediate AllTandemRepeats before any size range selection
date
multiIntersectBed -i LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_d11.bed \
	LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_t15.bed \
	LowComplexity/intermediatefiles/CHM13v2.0_SimpleRepeat_q20.bed \
	LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Simple_repeat.bed.gz \
	LowComplexity/intermediatefiles/CHM13v2.0_rmsk_Low_complexity.bed.gz \
	LowComplexity/intermediatefiles/CHM13v2.0_trf_simpleRepeat.bed.gz \
	LowComplexity/intermediatefiles/CHM13v2.0_satellites.bed.gz | 
	sed 's/^chr//' | 
	cut -f1-3 | grep "^[0-9XY]" | grep -v '_' | 
	sed 's/^/chr/' | 
	slopBed -i stdin -b 5 -g LowComplexity/intermediatefiles/CHM13v2.0.genome | 
	sed 's/^chr//' | 
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin > LowComplexity/intermediatefiles/CHM13v2.0_AllTandemRepeats_intermediate.bed
    
echo "sum of regions:"
cat LowComplexity/intermediatefiles/CHM13v2.0_AllTandemRepeats_intermediate.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:49:03 EDT 2022
sum of regions:
309153634


In [31]:
#AllTandemRepeats_lt51bp
date
awk '$3-$2<61' LowComplexity/intermediatefiles/CHM13v2.0_AllTandemRepeats_intermediate.bed | 
	subtractBed -a stdin -b LowComplexity/CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	bgzip -c > LowComplexity/CHM13v2.0_AllTandemRepeats_lt51bp_slop5.bed.gz
    
echo "sum of regions:"
gzcat LowComplexity/CHM13v2.0_AllTandemRepeats_lt51bp_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:49:26 EDT 2022
sum of regions:
26612507
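
The `awk` cut-offs in this section (61, 211, 10011, and 111 for the gt100bp file) sit 10 bp above the sizes in the file names. This appears to compensate for the 5 bp of slop already added to each side of every interval in the intermediate file, so that bins refer to the unpadded repeat length. A small sketch of that bookkeeping, using a hypothetical helper `bin_for_padded_length` and made-up repeat lengths:

```shell
# Hypothetical helper: name the size bin for a padded interval length,
# using the same cut-offs as the awk filters in this section.
bin_for_padded_length() {
  awk -v len="$1" 'BEGIN {
    if (len < 61) print "lt51bp"
    else if (len < 211) print "51to200bp"
    else if (len < 10011) print "201to10000bp"
    else print "gt10000bp"
  }'
}

# Unpadded repeat length -> length after 5 bp slop on both sides -> bin.
for raw in 50 51 200 201 10000 10001; do
  padded=$((raw + 2 * 5))
  printf '%s\t%s\t%s\n' "$raw" "$padded" "$(bin_for_padded_length "$padded")"
done
# 50     60     lt51bp
# 51     61     51to200bp
# 200    210    51to200bp
# 201    211    201to10000bp
# 10000  10010  201to10000bp
# 10001  10011  gt10000bp
```

This correspondence is exact only for isolated repeats away from chromosome ends; intervals that were merged after slop, or clamped at a chromosome boundary, are binned by their merged or clamped length.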


In [32]:
#AllTandemRepeats_51to200bp
date
awk '$3-$2>60 && $3-$2<211' LowComplexity/intermediatefiles/CHM13v2.0_AllTandemRepeats_intermediate.bed | 
	subtractBed -a stdin -b LowComplexity/CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	bgzip -c > LowComplexity/CHM13v2.0_AllTandemRepeats_51to200bp_slop5.bed.gz
    
echo "sum of regions:"
gzcat LowComplexity/CHM13v2.0_AllTandemRepeats_51to200bp_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:49:35 EDT 2022
sum of regions:
18670809


In [33]:
#AllTandemRepeats_201to10000bp
date
awk '$3-$2>210 && $3-$2<10011' LowComplexity/intermediatefiles/CHM13v2.0_AllTandemRepeats_intermediate.bed | 
	subtractBed -a stdin -b LowComplexity/CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	bgzip -c > LowComplexity/CHM13v2.0_AllTandemRepeats_201to10000bp_slop5.bed.gz
    
echo "sum of regions:"
gzcat LowComplexity/CHM13v2.0_AllTandemRepeats_201to10000bp_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:49:41 EDT 2022
sum of regions:
14334051


In [34]:
#AllTandemRepeats_gt10000bp
date
awk '$3-$2>10010' LowComplexity/intermediatefiles/CHM13v2.0_AllTandemRepeats_intermediate.bed | 
	subtractBed -a stdin -b LowComplexity/CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	bgzip -c > LowComplexity/CHM13v2.0_AllTandemRepeats_gt10000bp_slop5.bed.gz
    
echo "sum of regions:"
gzcat LowComplexity/CHM13v2.0_AllTandemRepeats_gt10000bp_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:49:46 EDT 2022
sum of regions:
237270765


In [35]:
#AllTandemRepeats_gt100bp
date
awk '$3-$2>110' LowComplexity/intermediatefiles/CHM13v2.0_AllTandemRepeats_intermediate.bed | 
	subtractBed -a stdin -b LowComplexity/CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	bgzip -c > LowComplexity/CHM13v2.0_AllTandemRepeats_gt100bp_slop5.bed.gz

echo "sum of regions:"
gzcat LowComplexity/CHM13v2.0_AllTandemRepeats_gt100bp_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:49:52 EDT 2022
sum of regions:
258228311


## 13. Merge all homopolymers and TRs and find complement<a id="mergehomTR"></a>

In [36]:
date
multiIntersectBed -i LowComplexity/CHM13v2.0_AllTandemRepeats_lt51bp_slop5.bed.gz \
	LowComplexity/CHM13v2.0_AllTandemRepeats_51to200bp_slop5.bed.gz \
	LowComplexity/CHM13v2.0_AllTandemRepeats_201to10000bp_slop5.bed.gz \
	LowComplexity/CHM13v2.0_AllTandemRepeats_gt10000bp_slop5.bed.gz \
	LowComplexity/CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	sed 's/^chr//' | 
	cut -f1-3 | grep "^[0-9XY]" | grep -v '_' | 
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
	bgzip -c > LowComplexity/CHM13v2.0_AllTandemRepeatsandHomopolymers_slop5.bed.gz

subtractBed -a LowComplexity/intermediatefiles/CHM13v2.0.genome.bed \
-b LowComplexity/CHM13v2.0_AllTandemRepeatsandHomopolymers_slop5.bed.gz | 
bgzip -c > LowComplexity/CHM13v2.0_notinAllTandemRepeatsandHomopolymers_slop5.bed.gz

echo "sum of regions:"
gzcat LowComplexity/CHM13v2.0_AllTandemRepeatsandHomopolymers_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 12:50:01 EDT 2022
sum of regions:
382764544
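
The `notin` file is the complement of the merged stratification within the genome BED. As an illustration of what `subtractBed -a genome.bed -b regions.bed.gz` yields, the sketch below computes the gaps between sorted, merged intervals in plain `awk` for a single made-up 120 bp chromosome. It is a toy stand-in, not the bedtools implementation, and the names `chrom` and `chrom_len` are assumptions.

```shell
# Sorted, merged toy regions on one chromosome (made-up coordinates).
regions=$(printf 'chr1\t0\t35\nchr1\t85\t100\n')

# Emit the gaps between regions, from position 0 to the chromosome length.
complement=$(printf '%s\n' "$regions" |
  awk -v OFS='\t' -v chrom=chr1 -v chrom_len=120 '
    BEGIN { prev = 0 }
    { if ($2 > prev) print chrom, prev, $2; prev = $3 }   # gap before this region
    END { if (prev < chrom_len) print chrom, prev, chrom_len }')

printf '%s\n' "$complement"
# chr1  35   85
# chr1  100  120

# Region span plus complement span should equal the chromosome length.
in_sum=$(printf '%s\n' "$regions" | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}')
out_sum=$(printf '%s\n' "$complement" | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}')
echo $((in_sum + out_sum))
# 120
```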


## 14. All TandemRepeats and its complement<a id="alltr"></a>

NOTE: the allTR stratifications were erroneously generated in the Union ipynb but should have been generated in the LowComplexity ipynb. The code has been copied here for future use and the files were transferred to LowComplexity. The output paths in the copied code still point to `Union/`.

date
multiIntersectBed -i \
LowComplexity/CHM13v2.0_AllTandemRepeats_lt51bp_slop5.bed.gz \
LowComplexity/CHM13v2.0_AllTandemRepeats_51to200bp_slop5.bed.gz \
LowComplexity/CHM13v2.0_AllTandemRepeats_201to10000bp_slop5.bed.gz \
LowComplexity/CHM13v2.0_AllTandemRepeats_gt10000bp_slop5.bed.gz \
| sortBed -faidx T2T_files/chm13v2.0.fa.fai -i stdin \
| mergeBed -i stdin \
| bgzip -c > Union/CHM13v2.0_allTandemRepeats.bed.gz

subtractBed \
-a LowComplexity/intermediatefiles/CHM13v2.0.genome.bed -b Union/CHM13v2.0_allTandemRepeats.bed.gz \
| bgzip -c > Union/CHM13v2.0_notinallTandemRepeats.bed.gz

echo "sum of regions:"
gzcat Union/CHM13v2.0_allTandemRepeats.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'