# GRCh37 Segmental Duplications - (v3.1)
started 6/27/22

In [1]:
pwd

/Users/jmcdani/Documents/GiaB/Benchmarking/Stratifications/v3.1_genome-stratifications/GRCh37-stratifications


## Background
After validating the v3.1 stratifications by benchmarking, we found that PAR-X regions were annotated in the segdup and selfchain files. To correct this, we take each affected v3.0 stratification file and subtract the PAR-X region file.

## Updates specific to GRCh37
The PAR-X region is improperly annotated in the GRCh37 segdups and selfchain stratifications. This notebook (ipynb) updates the following stratifications by removing PAR-X:
- GRCh37_chainSelf_gt10kb.bed.gz
- GRCh37_notinchainSelf_gt10kb.bed.gz
- GRCh37_chainSelf.bed.gz
- GRCh37_notinchainSelf.bed.gz
- GRCh37_segdups.bed.gz
- GRCh37_notinsegdups.bed.gz
- GRCh37_segdups_gt10kb.bed.gz
- GRCh37_notinsegdups_gt10kb.bed.gz

All other GRCh37 "SegmentalDuplication" stratifications were generated for v3.0 in `GRCh37_new_chainSelf_and_Segdups.ipynb` and carried over for v3.1:
- GRCh37_gt5segdups_gt10kb_gt99percidentity.bed.gz

## Required tools and versions

In [2]:
# bedtools conda environment used by J.McDaniel
conda list

# packages in environment at /Users/jmcdani/opt/anaconda3/envs/bedtools:
#
# Name                    Version                   Build  Channel
appnope                   0.1.0            py27hb466136_0  
backports                 1.1                pyhd3eb1b0_0  
backports.shutil_get_terminal_size 1.0.0              pyhd3eb1b0_3  
backports_abc             0.5                        py_1  
bedtools                  2.30.0               haa7f73a_1    bioconda
biopython                 1.74             py27h1de35cc_0  
bleach                    3.1.0                    py27_0  
bzip2                     1.0.8                h1de35cc_0  
ca-certificates           2022.3.29            hecd8cb5_0  
certifi                   2020.6.20          pyhd3eb1b0_3  
configparser              4.0.2                    py27_0  
dbus                      1.13.18              h18a8e69_0  
decorator                 4.4.0                    py27_1  
defusedxml                0.7.1              pyhd3eb1b0_0  

## Subtract PAR-X region from self-chain region stratifications
The PAR-X regions from Heng Li's BED file, distributed with [dipcall (hs37d5.PAR.bed)](https://github.com/lh3/dipcall/tree/master/data), are subtracted from the stratifications carried over from v3.0.  
NOTE: `grep -v '#'` removes the header from the carried-over v3.0 stratifications so that it is not duplicated in post-processing.
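
As a small illustration of the header-stripping step, the sketch below filters a toy BED stream that begins with a `#` header line. The helper name `strip_header`, the header text, and the coordinates are invented for illustration and are not part of the original workflow. It anchors the pattern as `'^#'`, a slightly stricter variant of the `grep -v '#'` used in the cells below, which drops any line containing `#` anywhere (equivalent when only the header contains one).

```shell
# Hypothetical helper: drop header/comment lines from a BED stream on stdin.
strip_header() {
    grep -v '^#'
}

# Toy input (invented): one header line followed by two BED records.
toy_bed=$(printf '# header added in post-processing\nX\t100\t200\nX\t300\t400\n')

# Only the two BED records should remain after filtering.
stripped=$(printf '%s\n' "$toy_bed" | strip_header)
printf '%s\n' "$stripped"
```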

In [6]:
# chainSelf
date
gzcat ../v3.0-carry-over-stratifications/GRCh37/SegmentalDuplications/GRCh37_chainSelf.bed.gz \
| grep -v '#' \
| subtractBed -a stdin -b ref-files/hs37d5.PAR.bed \
| sortBed -faidx ref-files/hs37d5.fa.gz.fai -i stdin \
| mergeBed -i stdin -d 100 \
| bgzip -c > SegmentalDuplications/GRCh37_chainSelf.bed.gz

Tue Jun 28 12:01:21 EDT 2022


In [2]:
# notinchainSelf
date
subtractBed -a ref-files/human.b37.genome.bed -b SegmentalDuplications/GRCh37_chainSelf.bed.gz \
| sortBed -faidx ref-files/hs37d5.fa.gz.fai -i stdin \
| mergeBed -i stdin -d 100 \
| bgzip -c > SegmentalDuplications/GRCh37_notinchainSelf.bed.gz

Wed Jun 29 11:10:28 EDT 2022
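
One way to sanity-check a stratification against its `notin` complement is to compare how many bases each covers. The sketch below is an assumption-laden helper, not part of the original notebook: `bed_bases` is a hypothetical name, and it assumes the input intervals are already merged so that none overlap. Because both files are passed through `mergeBed -d 100`, the two totals may not sum exactly to the genome length.

```shell
# Hypothetical helper: total bases covered by a BED stream on stdin.
# Assumes intervals do not overlap (e.g. already merged); skips '#' header lines.
bed_bases() {
    awk -F'\t' '!/^#/ { total += $3 - $2 } END { print total + 0 }'
}

# Toy records (coordinates invented): 100 + 50 bases.
toy_total=$(printf 'X\t0\t100\nX\t500\t550\n' | bed_bases)
echo "$toy_total"   # prints 150
```

On the real outputs this could be applied as, for example, `gzcat SegmentalDuplications/GRCh37_chainSelf.bed.gz | bed_bases`.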


In [8]:
# chainSelf_gt10kb
date
gzcat ../v3.0-carry-over-stratifications/GRCh37/SegmentalDuplications/GRCh37_chainSelf_gt10kb.bed.gz \
| grep -v '#' \
| subtractBed -a stdin -b ref-files/hs37d5.PAR.bed \
| sortBed -faidx ref-files/hs37d5.fa.gz.fai -i stdin \
| mergeBed -i stdin -d 100 \
| bgzip -c > SegmentalDuplications/GRCh37_chainSelf_gt10kb.bed.gz

Tue Jun 28 12:01:25 EDT 2022


In [3]:
# notinchainSelf_gt10kb
date
subtractBed -a ref-files/human.b37.genome.bed -b SegmentalDuplications/GRCh37_chainSelf_gt10kb.bed.gz \
| sortBed -faidx ref-files/hs37d5.fa.gz.fai -i stdin \
| mergeBed -i stdin -d 100 \
| bgzip -c > SegmentalDuplications/GRCh37_notinchainSelf_gt10kb.bed.gz

Wed Jun 29 11:11:36 EDT 2022


In [10]:
# segdups
date
gzcat ../v3.0-carry-over-stratifications/GRCh37/SegmentalDuplications/GRCh37_segdups.bed.gz \
| grep -v '#' \
| subtractBed -a stdin -b ref-files/hs37d5.PAR.bed \
| sortBed -faidx ref-files/hs37d5.fa.gz.fai -i stdin \
| mergeBed -i stdin -d 100 \
| bgzip -c > SegmentalDuplications/GRCh37_segdups.bed.gz

Tue Jun 28 12:01:28 EDT 2022


In [4]:
# notinsegdups
date
subtractBed -a ref-files/human.b37.genome.bed -b SegmentalDuplications/GRCh37_segdups.bed.gz \
| sortBed -faidx ref-files/hs37d5.fa.gz.fai -i stdin \
| mergeBed -i stdin -d 100 \
| bgzip -c > SegmentalDuplications/GRCh37_notinsegdups.bed.gz

Wed Jun 29 11:12:22 EDT 2022


In [12]:
# segdups_gt10kb
date
gzcat ../v3.0-carry-over-stratifications/GRCh37/SegmentalDuplications/GRCh37_segdups_gt10kb.bed.gz \
| grep -v '#' \
| subtractBed -a stdin -b ref-files/hs37d5.PAR.bed \
| sortBed -faidx ref-files/hs37d5.fa.gz.fai -i stdin \
| mergeBed -i stdin -d 100 \
| bgzip -c > SegmentalDuplications/GRCh37_segdups_gt10kb.bed.gz

Tue Jun 28 12:01:33 EDT 2022


In [5]:
# notinsegdups_gt10kb
date
subtractBed -a ref-files/human.b37.genome.bed -b SegmentalDuplications/GRCh37_segdups_gt10kb.bed.gz \
| sortBed -faidx ref-files/hs37d5.fa.gz.fai -i stdin \
| mergeBed -i stdin -d 100 \
| bgzip -c > SegmentalDuplications/GRCh37_notinsegdups_gt10kb.bed.gz

Wed Jun 29 11:13:09 EDT 2022
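
To confirm that the subtraction removed PAR-X, the updated files can be checked for any remaining overlap with the PAR regions. The sketch below is a bedtools-free illustration on toy data; `count_overlaps` is a hypothetical helper, the coordinates are invented, and it assumes uncompressed, tab-delimited BED files with a non-empty second file. With bedtools installed, something like `intersectBed -u -a <stratification> -b ref-files/hs37d5.PAR.bed | wc -l` should give the same kind of count.

```shell
# Hypothetical helper: count records in BED file $1 that overlap any record in BED file $2.
# BED intervals are 0-based and half-open, so [s1,e1) and [s2,e2) overlap when s1 < e2 and s2 < e1.
count_overlaps() {
    awk -F'\t' '
        /^#/ { next }                          # skip header lines in either file
        NR == FNR {                            # first file on the command line: regions to test against
            n++; bc[n] = $1; bs[n] = $2 + 0; be[n] = $3 + 0; next
        }
        {                                      # second file: the stratification records
            for (i = 1; i <= n; i++) {
                if ($1 == bc[i] && ($2 + 0) < be[i] && bs[i] < ($3 + 0)) { hits++; break }
            }
        }
        END { print hits + 0 }
    ' "$2" "$1"
}

# Toy data (invented coordinates): only the first record overlaps the toy region.
tmp=$(mktemp -d)
printf 'X\t100\t200\nX\t900\t1000\nY\t100\t200\n' > "$tmp/strat.bed"
printf 'X\t150\t300\n' > "$tmp/par.bed"
overlap_count=$(count_overlaps "$tmp/strat.bed" "$tmp/par.bed")
echo "$overlap_count"   # prints 1
rm -rf "$tmp"
```

For the stratifications written above, the expected count against `ref-files/hs37d5.PAR.bed` would be 0 after decompressing each file.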
