# GRCh38 LowComplexity Stratifications (v3.1)
JMcDaniel started 5/12/22

## TABLE OF CONTENTS
<hr style="border:2px solid black"> </hr>  

### GENERAL INFORMATION  
- [Background](#background)  
- [File Descriptions](#files)  
- [Files for release](#release)  
- [Resources](#resources)  
- [Code and File Sharing](#share)  
- [Software Tools](#tools)  
- [Get Dependency Files](#depend)  

### STRATIFICATION PREPARATION  
#### Individual Files
**1. [homopolymers](#homopolymers)**  
**2. [di-nucleotide repeats](#dinuc)**  
**3. [tri-nucleotide repeats](#trinuc)**  
**4. [quad-nucleotide repeats](#quadnuc)**  
**5. [Create GRCh38.genome file](#genomefile)**  
**6. [find range of repeats](#findreprange)**  
**7. [Find imperfect homopolymers](#imphomo)**  
**8. [SimpleRepeats & LowComplexity from UCSC RepeatMasker](#repmask)**  
**9. [SimpleRepeats in UCSC simpleRepeats (from TRF)](#trf)**  
**10. [Satellites](#satellites)**  
#### Intersected Files
**11. [Merge all homopolymers and find complement](#mergehopol)**  
**12. [Merge exact repeats and UCSC (rmsk/trf) repeats and subtract homopolymers](#mergeexact)**  
**13. [Merge all homopolymers and TRs and find complement](#mergehomTR)**  
**14. [All Tandem Repeats](#allTR)**



## GENERAL INFORMATION
<hr style="border:2px solid black"> </hr>

In [39]:
#JMcDaniel working directory
pwd

/Users/jmcdani/Documents/GiaB/Benchmarking/Stratifications/v3.1_genome-stratifications/GRCh38-stratifications


## Background<a id="background"></a>
These files can be used as a standard resource of BED files for use with GA4GH benchmarking tools such as [hap.py](https://github.com/Illumina/hap.py) to stratify true positive, false positive, and false negative variant calls into regions with different types and sizes of low complexity sequence (e.g., homopolymers, STRs, VNTRs, other locally repetitive sequences).

Code used to generate the "LowComplexity" stratifications was adapted from scripts written by JZook for the GRCh37/38 stratifications: `FindSimpleRepeats_GRCh38_v2.sh`, `GRCh38_SimpleRepeat_homopolymer_gt20.sh`, and `GRCh38_SimpleRepeat_imperfecthomoplgt20_slop5.sh`, located in the GIAB GitHub for the v3.0 LowComplexity stratifications.  

The satellite file is a new addition for v3.1. Although it is an individual stratification, it also feeds into the intersected files (steps #12 and #13). This ipynb was created to combine all code necessary for generating LowComplexity stratifications and is new for v3.1. For consistency, the same ipynb structure and corresponding code were applied to all references (GRCh37, GRCh38 and CHM13v2.0) for the v3.1 LowComplexity stratifications.

## File Descriptions<a id="files"></a>
- `GRCh3X_SimpleRepeat*_slop5.bed.gz`\
perfect repeats of different unit sizes (i.e., homopolymers, and dinucleotide, trinucleotide, and quadnucleotide STRs) and different total repeat lengths (i.e., <=50bp, 51-200bp, or >200bp)
- `GRCh3X_SimpleRepeat_imperfecthomopolgt*_slop5.bed.gz`\
perfect homopolymers >*bp as well as imperfect homopolymers where a single base was repeated >10bp except for a 1bp interruption by a different base
- `GRCh3X_AllTandemRepeats_*_slop5.bed.gz`\
union of SimpleRepeat dinucleotide, trinucleotide, and quadnucleotide STRs as well as UCSC Genome Browser RepeatMasker_LowComplexity, RepeatMasker_SimpleRepeats, RepeatMasker_Satellite, and TRF_SimpleRepeat.
- `GRCh3X_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz`\
union of all perfect homopolymers >6bp and imperfect homopolymers >10bp
- `GRCh3X_AllTandemRepeatsandHomopolymers_slop5.bed.gz`\
union of AllTandemRepeats_* with AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz
- `GRCh3X_satellites_slop5.bed.gz`\
Centromeric and Pericentromeric Satellite Annotations
- `GRCh3X_notin*_slop5.bed.gz`\
are non-overlapping complements of the stratification regions (i.e., genome after excluding the regions).
- `GRCh3X_allTandemRepeats.bed.gz`  
union of all tandem repeats  
- `notin`  
complement regions are non-overlapping genomic regions that remain after excluding stratification regions.  

## Files for release<a id="release"></a>
GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz  
GRCh38_AllTandemRepeats_201to10000bp_slop5.bed.gz  
GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz  
GRCh38_AllTandemRepeats_gt10000bp_slop5.bed.gz  
GRCh38_AllTandemRepeats_gt100bp_slop5.bed.gz  
GRCh38_AllTandemRepeats_lt51bp_slop5.bed.gz  
GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz   
GRCh38_allTandemRepeats.bed.gz  
GRCh38_notinallTandemRepeats.bed.gz  
GRCh38_SimpleRepeat_diTR_11to50_slop5.bed.gz  
GRCh38_SimpleRepeat_diTR_51to200_slop5.bed.gz  
GRCh38_SimpleRepeat_diTR_gt200_slop5.bed.gz  
GRCh38_SimpleRepeat_homopolymer_4to6_slop5.bed.gz  
GRCh38_SimpleRepeat_homopolymer_7to11_slop5.bed.gz  
GRCh38_SimpleRepeat_homopolymer_gt11_slop5.bed.gz   
GRCh38_SimpleRepeat_homopolymer_gt20_slop5.bed.gz  
GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5.bed.gz  
GRCh38_SimpleRepeat_imperfecthomopolgt20_slop5.bed.gz  
GRCh38_SimpleRepeat_quadTR_20to50_slop5.bed.gz  
GRCh38_SimpleRepeat_quadTR_51to200_slop5.bed.gz  
GRCh38_SimpleRepeat_quadTR_gt200_slop5.bed.gz  
GRCh38_SimpleRepeat_triTR_15to50_slop5.bed.gz  
GRCh38_SimpleRepeat_triTR_51to200_slop5.bed.gz  
GRCh38_SimpleRepeat_triTR_gt200_slop5.bed.gz  
GRCh38_notinAllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz  
GRCh38_notinAllTandemRepeatsandHomopolymers_slop5.bed.gz  
GRCh38_satellites_slop5.bed.gz  
GRCh38_notinsatellites_slop5.bed.gz
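
The release files can be handed to hap.py as a stratification list. The following is a minimal sketch, assuming hap.py's `--stratification` option reads a two-column TSV (region name, BED path). The `LowComplexity` directory, the three files chosen, and the VCF/reference names in the final comment are illustrative assumptions; how relative paths in the TSV are resolved should be checked against the hap.py documentation.

```shell
# Build a two-column TSV (name <TAB> path) from a few release file names.
# strat_dir is a hypothetical location for the released BED files.
strat_dir="LowComplexity"
strat_tsv="$(mktemp)"
for f in \
    GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz \
    GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz \
    GRCh38_satellites_slop5.bed.gz
do
    name="${f%.bed.gz}"                       # strip the extension to get a label
    printf '%s\t%s/%s\n' "$name" "$strat_dir" "$f"
done > "$strat_tsv"
cat "$strat_tsv"
# Possible use (not run here, file names are placeholders):
# hap.py truth.vcf.gz query.vcf.gz -r ref.fa -o out --stratification "$strat_tsv"
```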

## Resources<a id="resources"></a>
- GIAB GitHub
- GIAB FTP

## Code and File Sharing<a id="share"></a>
- add GIAB GitHub
- data.nist.gov

## Required tools and versions<a id="tools"></a>

In [12]:
# samtools_env conda environment used by JMcDaniel for steps 1-5
conda list

# packages in environment at /Users/jmcdani/opt/anaconda3/envs/samtools_env:
#
# Name                    Version                   Build  Channel
appnope                   0.1.0            py27hb466136_0    anaconda
backports                 1.1                pyhd3eb1b0_0  
backports.shutil_get_terminal_size 1.0.0              pyhd3eb1b0_3  
backports_abc             0.5                        py_1  
bamtools                  2.5.1                h5c9b4e4_4    bioconda
bcftools                  1.8                  h4da6232_3    bioconda
bedtools                  2.27.1               h5c9b4e4_3    bioconda
biopython                 1.74             py27h1de35cc_0  
blas                      1.0                         mkl  
bleach                    3.3.1              pyhd3eb1b0_0  
bzip2                     1.0.6                h1de35cc_5  
ca-certificates           2021.10.26           hecd8cb5_2  
certifi                   2020.6.20          pyhd3eb1b0_3  
configparser             

In [1]:
#JMcDaniel conda bedtools env for step 6-13
date
conda list

Thu May 12 14:55:54 EDT 2022
# packages in environment at /Users/jmcdani/opt/anaconda3/envs/bedtools:
#
# Name                    Version                   Build  Channel
appnope                   0.1.0            py27hb466136_0  
backports                 1.1                pyhd3eb1b0_0  
backports.shutil_get_terminal_size 1.0.0              pyhd3eb1b0_3  
backports_abc             0.5                        py_1  
bedtools                  2.30.0               haa7f73a_1    bioconda
biopython                 1.74             py27h1de35cc_0  
bleach                    3.1.0                    py27_0  
bzip2                     1.0.8                h1de35cc_0  
ca-certificates           2022.3.29            hecd8cb5_0  
certifi                   2020.6.20          pyhd3eb1b0_3  
configparser              4.0.2                    py27_0  
dbus                      1.13.18              h18a8e69_0  
decorator                 4.4.0                    py27_1  
defusedxml                0.7.

## Get Dependency Files<a id="depend"></a>

**GRCh38 UCSC RepeatMasker File**

In [1]:
#Repeat masker file from UCSC.
rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz LowComplexity/intermediatefiles/

receiving file list ... 
1 file to consider
rmsk.txt.gz
   154291848 100%    1.84MB/s    0:01:19 (xfer#1, to-check=0/1)

sent 38 bytes  received 154356311 bytes  1870986.05 bytes/sec
total size is 154291848  speedup is 1.00


**GRCh38 UCSC TRF simpleRepeat File**

In [3]:
date
rsync -avzP rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/simpleRepeat.txt.gz LowComplexity/intermediatefiles/

Wed May 11 13:58:58 EDT 2022
receiving file list ... 
1 file to consider

sent 16 bytes  received 82 bytes  65.33 bytes/sec
total size is 29493439  speedup is 300953.46


**GRCh38 Reference**

In [11]:
# GRCh38 ref download
date
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz -P ref-files/GRCh38

Thu May  5 13:10:20 EDT 2022
--2022-05-05 13:10:20--  https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
Resolving ftp.ncbi.nlm.nih.gov... 2607:f220:41f:250::229, 2607:f220:41e:250::11, 130.14.250.12, ...
Connecting to ftp.ncbi.nlm.nih.gov|2607:f220:41f:250::229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 872949833 (833M) [application/x-gzip]
Saving to: 'ref-files/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz'


2022-05-05 13:11:49 (9.48 MB/s) - 'ref-files/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz' saved [872949833/872949833]



**Index Reference File**

In [21]:
# Index reference. The download is plain gzip (not bgzip), which samtools faidx cannot index, so it is decompressed first.
date
gunzip -k ref-files/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz 
samtools faidx ref-files/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna
grep -Ev '^chr[0-9XYM]_|^chr[0-9][0-9]_|^chrUn_|^chrEBV' ref-files/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.fai > ref-files/GRCh38/GRCh38.fa.fai
rm ref-files/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

Thu May  5 14:06:07 EDT 2022
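
To see what the `grep -Ev` filter in the cell above keeps, here is a small sketch on a made-up `.fai` (contig names are typical of the analysis set, lengths and offsets are invented). It shows that the `_random`, `chrUn_` and `chrEBV` entries are dropped while `chrM` is retained in the filtered index.

```shell
# Toy .fai (name, length, offset, linebases, linewidth); all values are made up.
fai="$(mktemp)"
printf '%s\t%s\t0\t60\t61\n' \
    chr1 1000 \
    chr1_KI270706v1_random 500 \
    chr22 800 \
    chrX 900 \
    chrM 16 \
    chrUn_KI270302v1 300 \
    chrEBV 170 > "$fai"
# Same pattern as the notebook cell: drop random/unplaced contigs and EBV.
kept="$(grep -Ev '^chr[0-9XYM]_|^chr[0-9][0-9]_|^chrUn_|^chrEBV' "$fai" |
  cut -f1 | paste -s -d ' ' -)"
echo "$kept"
```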


## Stratification Preparation<a id="prep"></a>
<hr style="border:2px solid black"> </hr>

## 1. homopolymers<a id="homopolymers"></a>

In [2]:
#findSimpleRegions_quad.py requires python 2.7.15. Ran this step in JM conda env 'samtools_env' where this version of python is located.
gunzip -k ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz 
python LowComplexity/scripts/findSimpleRegions_quad.py -p 3 -d 100000 -t 100000 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 6 -d 100000 -t 100000 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p6.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 11 -d 100000 -t 100000 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p11.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 20 -d 100000 -t 100000 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p20.bed

[1] 96221
[2] 96222
[3] 96223
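
`findSimpleRegions_quad.py` is not reproduced in this notebook. Judging only from the file names used later (the `p3` output feeds the `4to6` stratification), `-p 3` appears to report homopolymer runs longer than 3 bp, with the `100000` thresholds effectively switching off the other unit sizes. The sketch below is an illustrative stand-in written in awk, not the script's algorithm. The function name `find_homopolymers` and the toy sequence are assumptions, and the `unit=` column only mimics the tag that later cells grep for.

```shell
# Report homopolymer runs longer than min_len as 0-based, half-open BED rows.
# Illustrative stand-in only, NOT findSimpleRegions_quad.py.
find_homopolymers() {
    # $1 = chromosome name, $2 = sequence, $3 = min_len (runs must exceed it)
    awk -v chrom="$1" -v seq="$2" -v min_len="$3" 'BEGIN {
        OFS = "\t"
        n = length(seq)
        start = 0                                 # 0-based start of current run
        for (i = 1; i <= n; i++) {
            base = substr(seq, i, 1)
            nxt = (i < n) ? substr(seq, i + 1, 1) : ""
            if (base != nxt) {                    # run ends at 1-based position i
                if (i - start > min_len)
                    print chrom, start, i, "unit=" base
                start = i                         # next run starts here (0-based)
            }
        }
    }'
}
runs="$(find_homopolymers chrT ACGTAAAAAGGCCCC 3)"
printf '%s\n' "$runs"
```

On the toy sequence, the 5 bp A run and the 4 bp C run are reported, while the 2 bp G run is not.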


## 2. di-nucleotide repeats<a id="dinuc"></a>

In [3]:
#findSimpleRegions_quad.py requires python 2.7.15. Run in JM conda env 'samtools_env' where this version of python is located.
date
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 11 -t 100000 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_d11.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 51 -t 100000 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_d51.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 201 -t 100000 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_d201.bed & 

[1]   Done                    python LowComplexity/scripts/findSimpleRegions_quad.py -p 3 -d 100000 -t 100000 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed
[2]-  Done                    python LowComplexity/scripts/findSimpleRegions_quad.py -p 6 -d 100000 -t 100000 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p6.bed
[3]+  Done                    python LowComplexity/scripts/findSimpleRegions_quad.py -p 11 -d 100000 -t 100000 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p11.bed
Wed May 11 16:06:10 EDT 2022
[1] 16677
[2] 16678
[3] 16679


## 3. tri-nucleotide repeats<a id="trinuc"></a>

In [4]:
#findSimpleRegions_quad.py requires python 2.7.15. Run in JM conda env 'samtools_env' where this version of python is located.
date
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 15 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_t15.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 51 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_t51.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 201 -q 100000 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_t201.bed & 

Wed May 11 16:08:18 EDT 2022
[4] 17067
[5] 17068
[6] 17069


## 4. quad-nucleotide repeats<a id="quadnuc"></a>

In [8]:
#findSimpleRegions_quad.py requires python 2.7.15. Run in JM conda env 'samtools_env' where this version of python is located.
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 100000 -q 20 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_q20.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 100000 -q 51 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_q51.bed &
python LowComplexity/scripts/findSimpleRegions_quad.py -p 100000 -d 100000 -t 100000 -q 201 ref-files/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_q201.bed & 

[1] 38315
[2] 38316
[3] 38317


## 5. Create GRCh38.genome file for use with bedtools slopbed (-g .genome) and subtractbed (genome.bed)<a id="genomefile"></a> 
This generates a file with chromosome sizes for GRCh38.

In [14]:
# prepare .genome files with chromosome lengths
date
cut -f 1,2 ref-files/GRCh38.fa.fai | grep -Ev '^chr[0-9XYM]_|^chr[0-9][0-9]_|^chrUn_' > ref-files/human.b38.genome

Thu May 12 09:05:20 EDT 2022


In [16]:
# Print columns 1 and 2 of the index file, tab-separated, with a 0 inserted as the second field. This results in a 3-column BED file.
awk -v OFS='\t' '{print $1, "0", $2}' ref-files/GRCh38.fa.fai | grep -Ev '^[0-9XYM]_|^[0-9][0-9XYM]_|^chrUn_|^GL|^NC|^hs' > ref-files/human.b38.genome.bed
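
As a quick illustration of the two file shapes produced in this step, here is a sketch on a made-up two-chromosome index (lengths are invented). The `.genome` form is name and length, as passed to `slopBed -g`. The `.genome.bed` form is a whole-chromosome BED interval, as used for `subtractBed -a`.

```shell
# Toy .fai with two chromosomes; lengths and offsets are made up.
fai="$(mktemp)"
printf '%s\t%s\t0\t60\t61\n' chr1 1000 chrX 900 > "$fai"
genome="$(cut -f1,2 "$fai")"                                  # name <TAB> length
genome_bed="$(awk -v OFS='\t' '{print $1, 0, $2}' "$fai")"     # name <TAB> 0 <TAB> length
printf '%s\n' "$genome" "$genome_bed"
```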

## 6. find LowComplexity and SimpleRepeat ranges <a id="findreprange"></a>
The bedtools conda environment is activated from here on.

In [2]:
# HOMOPOLYMERS
date
subtractBed -a LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed -b LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p6.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_4to6.bed.gz
subtractBed -a LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p6.bed -b LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p11.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_7to11.bed.gz
sed 's/^chr//' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p11.bed | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_gt11.bed.gz
sed 's/^chr//' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p20.bed | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_gt20.bed.gz

Thu May 12 14:56:22 EDT 2022
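
The `sed`/`grep`/`sort` chain used in these cells orders chromosomes numerically with X and Y last: X and Y are temporarily renamed to 23 and 24, the rows are sorted numerically, and the names are restored. A small sketch of that chain on made-up rows (coordinates are invented) also shows that chrM and contigs containing `_` are dropped.

```shell
# Unsorted toy BED rows, including chrM and a random contig that get filtered out.
sorted="$(printf 'chr%s\t%s\t%s\n' \
    X 5 9  10 1 4  2 7 8  M 0 3  2 1 3  Y 2 6  1_KI270706v1_random 0 5 |
  sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' |
  sed 's/^X/23/;s/^Y/24/' |
  sort -k1,1n -k2,2n -k3,3n |
  sed 's/^23/X/;s/^24/Y/;s/^/chr/')"
printf '%s\n' "$sorted"
chrom_order="$(printf '%s\n' "$sorted" | cut -f1 | paste -s -d ' ' -)"
echo "$chrom_order"
```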


In [3]:
# DI-NUCLEOTIDES
date
subtractBed -a LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_d11.bed -b LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_d51.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_11to50.bed.gz
subtractBed -a LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_d51.bed -b LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_d201.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_51to200.bed.gz
sed 's/^chr//' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_d201.bed | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_gt200.bed.gz

Thu May 12 15:01:46 EDT 2022


In [4]:
# TRI-NUCLEOTIDES
date
subtractBed -a LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_t15.bed -b LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_t51.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_15to50.bed.gz
subtractBed -a LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_t51.bed -b LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_t201.bed | sed 's/^chr//'  | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_51to200.bed.gz
sed 's/^chr//' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_t201.bed | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_gt200.bed.gz

Thu May 12 15:01:56 EDT 2022


In [5]:
# QUAD-NUCLEOTIDES
date
subtractBed -a LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_q20.bed -b LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_q51.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_20to50.bed.gz
subtractBed -a LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_q51.bed -b LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_q201.bed | sed 's/^chr//' | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_51to200.bed.gz
sed 's/^chr//' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_q201.bed | grep "^[0-9XY]" | grep -v '_' | sed 's/^X/23/;s/^Y/24/' | sort -k1,1n -k2,2n -k3,3n | sed 's/^23/X/;s/^24/Y/;s/^/chr/' | bgzip -c > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_gt200.bed.gz

Thu May 12 15:01:59 EDT 2022


In [6]:
# Add 5bp slop on either side of repeats to ensure insertions at the edge of the repeat and any adjacent repetitive structures are captured
date
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_4to6.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_4to6_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_7to11.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_7to11_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_gt11.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_gt11_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_gt20.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_gt20_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_11to50.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_11to50_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_51to200.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_51to200_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_gt200.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_gt200_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_15to50.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_15to50_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_51to200.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_51to200_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_gt200.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_gt200_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_20to50.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_20to50_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_51to200.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_51to200_slop5_withUNITS.bed.gz
slopBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_gt200.bed.gz -b 5 -g ref-files/human.b38.genome | bgzip -c  > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_gt200_slop5_withUNITS.bed.gz

Thu May 12 15:02:06 EDT 2022
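
The effect of `slopBed -b 5` can be sketched without bedtools: each interval grows by 5 bp on both sides and is clamped to the chromosome bounds given in the `.genome` file. The awk below is an emulation for illustration only, run on a made-up 100 bp chromosome. It is not how bedtools is implemented.

```shell
# Toy genome file: one chromosome of length 100 (made up).
genome="$(mktemp)"
printf 'chr1\t100\n' > "$genome"
slopped="$(printf 'chr1\t2\t10\nchr1\t50\t60\nchr1\t97\t100\n' |
  awk -v OFS='\t' -v pad=5 '
    NR == FNR { len[$1] = $2; next }               # first file: chromosome lengths
    {
        s = $2 - pad; if (s < 0) s = 0             # clamp at chromosome start
        e = $3 + pad; if (e > len[$1]) e = len[$1] # clamp at chromosome end
        print $1, s, e
    }' "$genome" -)"
printf '%s\n' "$slopped"
```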


In [7]:
# Make 3 column versions so that hap.py doesn't perform more granular stratification
date
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_4to6_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_homopolymer_4to6_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_7to11_slop5_withUNITS.bed.gz | cut -f1-3  | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_homopolymer_7to11_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_gt11_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_homopolymer_gt11_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_homopolymer_gt20_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_homopolymer_gt20_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_11to50_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_diTR_11to50_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_51to200_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_diTR_51to200_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_diTR_gt200_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_diTR_gt200_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_15to50_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_triTR_15to50_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_51to200_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_triTR_51to200_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_triTR_gt200_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_triTR_gt200_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_20to50_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_quadTR_20to50_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_51to200_slop5_withUNITS.bed.gz| cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_quadTR_51to200_slop5.bed.gz
gzcat LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_quadTR_gt200_slop5_withUNITS.bed.gz | cut -f1-3 | bgzip -c  > LowComplexity/GRCh38_SimpleRepeat_quadTR_gt200_slop5.bed.gz

Thu May 12 15:03:37 EDT 2022


## 7. Find imperfect homopolymers >10bp and >20bp<a id="imphomo"></a>
Imperfect homopolymers are found by merging homopolymers >=4bp of the same base that are separated by 1bp. 5bp of padding is then added on both sides to include errors around the edges.

In [9]:
# IMPERFECT HOMOPOLYMERS >10
# SimpleRepeat_homopolymer_gt11.bed.gz previously generated would be the corresponding "PERFECT HOMOPOLYMERS" file
date
grep 'unit=C' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>10' > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt10_C.bed
grep 'unit=G' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>10' > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt10_G.bed
grep 'unit=A' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>10' > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt10_A.bed
grep 'unit=T' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>10' > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt10_T.bed

multiIntersectBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt10_C.bed \
	LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt10_G.bed \
	LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt10_A.bed \
	LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt10_T.bed | 
	sed 's/^chr//' |
	cut -f1-3 | grep "^[0-9XY]" | grep -v '_' |
	sed 's/^/chr/' |
	slopBed -i stdin -b 5 -g ref-files/human.b38.genome |
	sed 's/^chr//' |
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
	bgzip -c > LowComplexity/GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5.bed.gz

Thu May 12 15:07:16 EDT 2022
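
The merge-then-filter logic can be illustrated on toy intervals. The awk below emulates `mergeBed -d 1` (join intervals at most 1 bp apart) followed by `awk '$3-$2>10'`. It is a sketch for illustration, with made-up coordinates, and assumes the input is already sorted and limited to a single base as in the `grep 'unit=A'` step.

```shell
# Two A runs of 4 bp and 6 bp separated by a single base merge into an 11 bp
# imperfect homopolymer; the other runs are too far apart and too short.
imperfect="$(printf 'chr1\t0\t4\tunit=A\nchr1\t5\t11\tunit=A\nchr1\t20\t24\tunit=A\nchr1\t30\t36\tunit=A\n' |
  awk -v OFS='\t' -v d=1 '
    function emit() { if (have && e - s > 10) print c, s, e }   # keep merged length >10
    have && $1 == c && $2 - e <= d { if ($3 > e) e = $3; next } # gap <= d: extend
    { emit(); c = $1; s = $2; e = $3; have = 1 }                # start a new interval
    END { emit() }')"
printf '%s\n' "$imperfect"
```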


In [10]:
# IMPERFECT HOMOPOLYMERS >20
date
grep 'unit=C' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>20' > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt20_C.bed
grep 'unit=G' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>20' > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt20_G.bed
grep 'unit=A' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>20' > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt20_A.bed
grep 'unit=T' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p3.bed | mergeBed -i stdin -d 1 | awk '$3-$2>20' > LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt20_T.bed

multiIntersectBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt20_C.bed \
	LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt20_G.bed \
	LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt20_A.bed \
	LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_imperfecthomopolgt20_T.bed |
	sed 's/^chr//' |
	cut -f1-3 | grep "^[0-9XY]" | grep -v '_' |
	sed 's/^/chr/' |
	slopBed -i stdin -b 5 -g ref-files/human.b38.genome |
	sed 's/^chr//' |
	sed 's/^X/23/;s/^Y/24/' |
	sort -k1,1n -k2,2n -k3,3n |
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' |
	mergeBed -i stdin |
	bgzip -c > LowComplexity/GRCh38_SimpleRepeat_imperfecthomopolgt20_slop5.bed.gz

Thu May 12 15:10:20 EDT 2022


## 8. Get SimpleRepeats and LowComplexity from UCSC GRCh38 RepeatMasker file <a id="repmask"></a>

In [11]:
# SIMPLE REPEATS
date
zgrep Simple_repeat LowComplexity/intermediatefiles/rmsk.txt.gz |
	awk '{ print $6 "\t" $7 "\t" $8 ; }' |
	sed 's/^chr//' |
	grep "^[0-9XY]" | grep -v '_' |
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
	bgzip -c > LowComplexity/intermediatefiles/GRCh38_rmsk_Simple_repeat.bed.gz
    
gunzip -c LowComplexity/intermediatefiles/GRCh38_rmsk_Simple_repeat.bed.gz |
	awk '$3-$2<51' | 
	bgzip -c > LowComplexity/intermediatefiles/GRCh38_rmsk_Simple_repeat_lt51.bed.gz
    
gunzip -c LowComplexity/intermediatefiles/GRCh38_rmsk_Simple_repeat.bed.gz |
	awk '$3-$2>50 && $3-$2<201' | 
	bgzip -c > LowComplexity/intermediatefiles/GRCh38_rmsk_Simple_repeat_51to200.bed.gz
    
gunzip -c LowComplexity/intermediatefiles/GRCh38_rmsk_Simple_repeat.bed.gz |
	awk '$3-$2>200' | 
	bgzip -c > LowComplexity/intermediatefiles/GRCh38_rmsk_Simple_repeat_gt200.bed.gz

Thu May 12 15:13:01 EDT 2022
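
The three `awk` length filters above split merged regions into `lt51`, `51to200` and `gt200` bins with boundaries at 50/51 and 200/201 bp. A sketch that labels made-up intervals sitting exactly on those boundaries makes the cut-offs explicit.

```shell
# Interval lengths here are 50, 51, 200 and 201 bp (coordinates are made up).
bins="$(printf 'chr1\t0\t50\nchr1\t100\t151\nchr1\t300\t500\nchr1\t600\t801\n' |
  awk -v OFS='\t' '{
      len = $3 - $2
      if (len < 51)       bin = "lt51"       # same test as $3-$2<51
      else if (len < 201) bin = "51to200"    # same test as $3-$2>50 && $3-$2<201
      else                bin = "gt200"      # same test as $3-$2>200
      print $1, $2, $3, len, bin
  }')"
printf '%s\n' "$bins"
```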


In [12]:
# LOW COMPLEXITY
date
zgrep Low_complexity LowComplexity/intermediatefiles/rmsk.txt.gz |
	awk '{ print $6 "\t" $7 "\t" $8 ; }' |
	sed 's/^chr//' |
	grep "^[0-9XY]" | grep -v '_' |
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
    bgzip -c > LowComplexity/intermediatefiles/GRCh38_rmsk_Low_complexity.bed.gz
    
gunzip -c LowComplexity/intermediatefiles/GRCh38_rmsk_Low_complexity.bed.gz |
	awk '$3-$2<51' | 
	bgzip -c > LowComplexity/intermediatefiles/GRCh38_rmsk_Low_complexity_lt51.bed.gz

gunzip -c LowComplexity/intermediatefiles/GRCh38_rmsk_Low_complexity.bed.gz |
	awk '$3-$2>50 && $3-$2<201' | 
	bgzip -c > LowComplexity/intermediatefiles/GRCh38_rmsk_Low_complexity_51to200.bed.gz

gunzip -c LowComplexity/intermediatefiles/GRCh38_rmsk_Low_complexity.bed.gz |
	awk '$3-$2>200' | 
	bgzip -c > LowComplexity/intermediatefiles/GRCh38_rmsk_Low_complexity_gt200.bed.gz

Thu May 12 15:13:23 EDT 2022


## 9. Get SimpleRepeats from UCSC GRCh38 simpleRepeat file (from TRF) <a id="trf"></a>

In [14]:
date
gzcat LowComplexity/intermediatefiles/simpleRepeat.txt.gz |
	cut -f2-4  |
	sed 's/^chr//' |
	grep "^[0-9XY]" | grep -v '_' |
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
    bgzip -c > LowComplexity/intermediatefiles/GRCh38_trf_simpleRepeat.bed.gz

gunzip -c LowComplexity/intermediatefiles/GRCh38_trf_simpleRepeat.bed.gz |
	awk '$3-$2<51' | 
	bgzip -c > LowComplexity/intermediatefiles/GRCh38_trf_simpleRepeat_lt51.bed.gz

gunzip -c LowComplexity/intermediatefiles/GRCh38_trf_simpleRepeat.bed.gz |
	awk '$3-$2>50 && $3-$2<201' | 
	bgzip -c > LowComplexity/intermediatefiles/GRCh38_trf_simpleRepeat_51to200.bed.gz

gunzip -c LowComplexity/intermediatefiles/GRCh38_trf_simpleRepeat.bed.gz |
	awk '$3-$2>200' | 
	bgzip -c > LowComplexity/intermediatefiles/GRCh38_trf_simpleRepeat_gt200.bed.gz

Thu May 12 15:15:14 EDT 2022
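
The same three length bins (`lt51`, `51to200`, `gt200`) are applied to the rmsk Simple_repeat, rmsk Low_complexity, and TRF files. The sketch below is only an illustration of those awk conditions, not code from the original notebook; the helper name `bin_by_length` and the toy intervals are hypothetical. It labels each interval with its bin, which shows that every interval falls into exactly one of the three.

```shell
# Hypothetical helper: label each BED interval with the size bin used above.
# Bin edges follow the awk filters in this notebook: <51, 51 to 200, >200 bp.
bin_by_length() {
    awk 'BEGIN { OFS = "\t" }
         {
             len = $3 - $2
             if (len < 51)       bin = "lt51"
             else if (len < 201) bin = "51to200"
             else                bin = "gt200"
             print $1, $2, $3, bin
         }'
}

# Toy intervals (made up) with lengths 50, 51, 200 and 201 bp.
printf 'chr1\t0\t50\nchr1\t100\t151\nchr1\t300\t500\nchr1\t600\t801\n' | bin_by_length
```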


## 10. Satellites<a id="satellites"></a>

Pull satellite regions from rmsk.txt, convert them to BED format, then sort and merge.
- select records with "Satellite" in the repClass field
- keep only the chr, start, and stop fields for a 3-column BED
- rmsk includes contigs; these are removed

The sum of satellite regions is ordered CHM13v2 > GRCh38 > GRCh37. The difference between GRCh37 and GRCh38 is likely because in GRCh37 the centromeric satellites were represented as "gaps", whereas in GRCh38 they are annotated as "satellites". The CHM13v2 assembly was able to further annotate these regions.
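
As a rough illustration of the field selection described above, the sketch below filters toy rmsk-style rows on the repClass column directly. In the UCSC rmsk table, columns 6 to 8 are genoName, genoStart, and genoEnd, and column 12 is repClass. The rows, the helper name `rmsk_rows`, and the simplified "drop any name containing `_`" contig filter are assumptions for illustration; the notebook cell itself uses `grep "Satellite"` and a `grep -Ev` pattern.

```shell
# Toy rmsk-style rows (17 tab-separated columns; all values are made up).
rmsk_rows() {
    printf '1\t100\t0\t0\t0\tchr1\t1000\t1200\t-5\t+\tALR/Alpha\tSatellite\tcentr\t1\t200\t0\t1\n'
    printf '1\t100\t0\t0\t0\tchr1\t5000\t5050\t-5\t+\t(CA)n\tSimple_repeat\tSimple_repeat\t1\t50\t0\t2\n'
    printf '1\t100\t0\t0\t0\tchr1_KI270706v1_random\t10\t90\t-5\t+\tHSATII\tSatellite\tSatellite\t1\t80\t0\t3\n'
}

# Keep rows whose repClass (column 12) is exactly "Satellite", drop contigs
# (sequence names containing "_"), and print a 3-column BED.
rmsk_rows |
    awk -F'\t' 'BEGIN { OFS = "\t" } $12 == "Satellite" && $6 !~ /_/ { print $6, $7, $8 }'
```

Matching on the column rather than the whole line avoids picking up rows where the word appears only in another field, at the cost of depending on the table layout.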

In [11]:
date
gzcat LowComplexity/intermediatefiles/rmsk.txt.gz \
| grep "Satellite" \
| awk '{ print $6 "\t" $7 "\t" $8 ; }' \
| grep -Ev '^chr[0-9XYM]_|^chr[0-9][0-9]_|^chrUn_' \
| sortBed -faidx ref-files/GRCh38.fa.fai -i stdin \
| mergeBed -i stdin \
| bgzip -c > LowComplexity/intermediatefiles/GRCh38_satellites.bed.gz

slopBed -i LowComplexity/intermediatefiles/GRCh38_satellites.bed.gz -b 5 -g ref-files/human.b38.genome \
| mergeBed -i stdin \
| bgzip -c  > LowComplexity/GRCh38_satellites_slop5.bed.gz

gzcat LowComplexity/GRCh38_satellites_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 13:39:59 EDT 2022
75657144
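
The total above is the number of bases covered by the merged BED file. Because BED intervals are zero-based and half-open, each interval contributes `end - start` bases. A minimal sketch of the same awk summation as a reusable function (the name `bed_sum` and the toy input are hypothetical):

```shell
# Hypothetical helper: sum the covered bases in a BED stream on stdin.
# Assumes intervals are already merged, so no base is counted twice.
bed_sum() {
    awk -F'\t' 'BEGIN { SUM = 0 } { SUM += $3 - $2 } END { print SUM }'
}

# Two toy intervals of 10 bp and 5 bp.
printf 'chr1\t0\t10\nchr1\t20\t25\n' | bed_sum   # prints 15
```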


**notin satellites**

In [12]:
date
subtractBed -a ref-files/human.b38.genome.bed -b LowComplexity/GRCh38_satellites_slop5.bed.gz | bgzip -c > LowComplexity/GRCh38_notinsatellites_slop5.bed.gz

Tue Jun  7 13:40:06 EDT 2022


## 11. Merge all homopolymers and find complement<a id="mergehopol"></a>

In [21]:
date
grep -Ev '_|^chrEBV' LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_p6.bed | 
	slopBed -i stdin -b 5 -g ref-files/human.b38.genome |
	mergeBed -i stdin |
	multiIntersectBed -i stdin LowComplexity/GRCh38_SimpleRepeat_imperfecthomopolgt10_slop5.bed.gz |
	sed 's/^chr//' |
	cut -f1-3 | grep "^[0-9XY]" |
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
	bgzip -c > LowComplexity/GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz

subtractBed -a ref-files/human.b38.genome.bed -b LowComplexity/GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | bgzip -c > LowComplexity/GRCh38_notinAllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz

Thu May 12 16:19:02 EDT 2022


## 12. Multiintersect exact repeats and UCSC rmsk/trf repeat bed files and subtract homopolymers<a id="mergeexact"></a>

In [13]:
#Intermediate AllTandemRepeats before any size range selection
date 
multiIntersectBed -i LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_d11.bed \
	LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_t15.bed \
	LowComplexity/intermediatefiles/GRCh38_SimpleRepeat_q20.bed \
	LowComplexity/intermediatefiles/GRCh38_rmsk_Simple_repeat.bed.gz \
	LowComplexity/intermediatefiles/GRCh38_rmsk_Low_complexity.bed.gz \
	LowComplexity/intermediatefiles/GRCh38_trf_simpleRepeat.bed.gz \
	LowComplexity/intermediatefiles/GRCh38_satellites.bed.gz | 
	sed 's/^chr//' | 
	cut -f1-3 | grep "^[0-9XY]" | grep -v '_' | 
	sed 's/^/chr/' | 
	slopBed -i stdin -b 5 -g ref-files/human.b38.genome | 
	sed 's/^chr//' | 
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin > LowComplexity/intermediatefiles/GRCh38_AllTandemRepeats_intermediate.bed
    
echo "sum of regions:"
cat LowComplexity/intermediatefiles/GRCh38_AllTandemRepeats_intermediate.bed | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 13:40:25 EDT 2022
sum of regions:
181602570


In [14]:
#lt51bp
date
awk '$3-$2<61' LowComplexity/intermediatefiles/GRCh38_AllTandemRepeats_intermediate.bed | 
	subtractBed -a stdin -b LowComplexity/GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	bgzip -c > LowComplexity/GRCh38_AllTandemRepeats_lt51bp_slop5.bed.gz
    
echo "sum of regions:"
gzcat LowComplexity/GRCh38_AllTandemRepeats_lt51bp_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 13:40:46 EDT 2022
sum of regions:
29039810
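
The length filters in these size-range cells are offset by 10 bp from the bin names (for example `<61` for the lt51bp bin). The notebook does not state the reason; a likely reading is that the intermediate file already carries 5 bp of slop on each side, so the thresholds are shifted by twice the slop. The sketch below only reproduces that arithmetic under this assumption; the function name `padded_threshold` is hypothetical.

```shell
# Assumption: every interval in the intermediate file was padded by 5 bp on
# both sides, so padded length = unpadded length + 2 * slop.
slop=5

padded_threshold() {
    # $1 = threshold on the unpadded repeat length
    echo $(( $1 + 2 * slop ))
}

padded_threshold 51      # upper bound for the lt51bp bin        -> 61
padded_threshold 201     # upper bound for the 51to200bp bin     -> 211
padded_threshold 10001   # upper bound for the 201to10000bp bin  -> 10011
```

Since `mergeBed` runs after `slopBed`, neighbouring repeats that merge can produce intervals longer than any single padded repeat, so this is per-interval arithmetic rather than an exact statement about the underlying repeat lengths.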


In [15]:
#51to200bp
date
awk '$3-$2>60 && $3-$2<211' LowComplexity/intermediatefiles/GRCh38_AllTandemRepeats_intermediate.bed | 
	subtractBed -a stdin -b LowComplexity/GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	bgzip -c > LowComplexity/GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz
    
echo "sum of regions:"
gzcat LowComplexity/GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 13:40:52 EDT 2022
sum of regions:
25934792


In [16]:
#201to10000bp
date
awk '$3-$2>210 && $3-$2<10011' LowComplexity/intermediatefiles/GRCh38_AllTandemRepeats_intermediate.bed | 
	subtractBed -a stdin -b LowComplexity/GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	bgzip -c > LowComplexity/GRCh38_AllTandemRepeats_201to10000bp_slop5.bed.gz
    
echo "sum of regions:"
gzcat LowComplexity/GRCh38_AllTandemRepeats_201to10000bp_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 13:41:00 EDT 2022
sum of regions:
40752164


In [17]:
#gt10000bp
date
awk '$3-$2>10010' LowComplexity/intermediatefiles/GRCh38_AllTandemRepeats_intermediate.bed | 
	subtractBed -a stdin -b LowComplexity/GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	bgzip -c > LowComplexity/GRCh38_AllTandemRepeats_gt10000bp_slop5.bed.gz
    
echo "sum of regions:"
gzcat LowComplexity/GRCh38_AllTandemRepeats_gt10000bp_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 13:41:04 EDT 2022
sum of regions:
73385988
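
The four size bins (lt51bp, 51to200bp, 201to10000bp, gt10000bp) use non-overlapping length filters that together cover every interval in the intermediate file, so their totals can be checked against the intermediate total. The following sanity check is a sketch, not part of the original notebook; it uses only the sums printed by the cells above, and the variable names are hypothetical.

```shell
# Totals (bp) reported by the cells above.
intermediate=181602570
lt51=29039810
b51to200=25934792
b201to10000=40752164
gt10000=73385988

binned=$(( lt51 + b51to200 + b201to10000 + gt10000 ))
echo "sum of the four bins: $binned"                                       # 169112754
echo "removed by homopolymer subtraction: $(( intermediate - binned ))"    # 12489816
```

The difference corresponds to the bases that `subtractBed` removed because they overlap the AllHomopolymers file.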


In [18]:
#gt100bp
date
awk '$3-$2>110' LowComplexity/intermediatefiles/GRCh38_AllTandemRepeats_intermediate.bed | 
	subtractBed -a stdin -b LowComplexity/GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	bgzip -c > LowComplexity/GRCh38_AllTandemRepeats_gt100bp_slop5.bed.gz

echo "sum of regions:"
gzcat LowComplexity/GRCh38_AllTandemRepeats_gt100bp_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 13:41:11 EDT 2022
sum of regions:
124329651


## 13. Merge all homopolymers and TRs and find complement<a id="mergehomTR"></a>

In [19]:
date
multiIntersectBed -i LowComplexity/GRCh38_AllTandemRepeats_lt51bp_slop5.bed.gz \
	LowComplexity/GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz \
	LowComplexity/GRCh38_AllTandemRepeats_201to10000bp_slop5.bed.gz \
	LowComplexity/GRCh38_AllTandemRepeats_gt10000bp_slop5.bed.gz \
	LowComplexity/GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz | 
	sed 's/^chr//' | 
	cut -f1-3 | grep "^[0-9XY]" | grep -v '_' | 
	sed 's/^X/23/;s/^Y/24/' | 
	sort -k1,1n -k2,2n -k3,3n | 
	sed 's/^23/X/;s/^24/Y/;s/^/chr/' | 
	mergeBed -i stdin | 
	bgzip -c > LowComplexity/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz

subtractBed -a ref-files/human.b38.genome.bed \
-b LowComplexity/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz | 
bgzip -c > LowComplexity/GRCh38_notinAllTandemRepeatsandHomopolymers_slop5.bed.gz

echo "sum of regions:"
gzcat LowComplexity/GRCh38_AllTandemRepeatsandHomopolymers_slop5.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'

Tue Jun  7 13:41:19 EDT 2022
sum of regions:
253090200


## 14. All TandemRepeats and its complement<a id="allTR"></a>

NOTE: The allTR stratifications were erroneously generated in the Union ipynb but should have been generated in the LowComplexity ipynb. The code has been copied here for future use, and the files were transferred to LowComplexity. The commands below still write to `Union/`, as originally run.

date
multiIntersectBed -i \
LowComplexity/GRCh38_AllTandemRepeats_lt51bp_slop5.bed.gz \
LowComplexity/GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz \
LowComplexity/GRCh38_AllTandemRepeats_201to10000bp_slop5.bed.gz \
LowComplexity/GRCh38_AllTandemRepeats_gt10000bp_slop5.bed.gz \
| sortBed -faidx ref-files/GRCh38.fa.fai -i stdin \
| mergeBed -i stdin \
| bgzip -c > Union/GRCh38_allTandemRepeats.bed.gz

subtractBed \
-a ref-files/human.b38.genome.bed -b Union/GRCh38_allTandemRepeats.bed.gz \
| bgzip -c > Union/GRCh38_notinallTandemRepeats.bed.gz

echo "sum of regions:"
gzcat Union/GRCh38_allTandemRepeats.bed.gz | awk -F'\t' 'BEGIN{SUM=0}{ SUM+=$3-$2 }END{print SUM}'