# GRCh37 XY v3.1 stratifications
JMcDaniel started 5/9/22

## TABLE OF CONTENTS
<hr style="border:2px solid black"> </hr>

- [Background](#background)  
- [File Descriptions](#files)  
- [Files for release](#release)  
- [Resources](#resources)  
- [Code and File Sharing](#share)  
- [Software Tools](#tools)  
- [Get Dependency Files](#depend)  
- [Stratification Prep](#prep)  
    - [Prep input genomic region files](#inprep)  
    - [XTR](#xtr)  
    - [Ampliconic](#amp)  
    - [PAR](#par)  
    - [non-PAR](#non)  
    - [All Autosomes](#auto)


## Background<a id="background"></a>


XY feature regions from Melissa Wilson and Heng Li will be parsed for the following stratifications:

**PAR - The following PARs were used  
PAR-X [hs37d5.PAR.bed from Heng Li](https://github.com/lh3/dipcall/tree/master/data)  
PAR-Y [PSA_Y_hg19.bed regions from UCSC](https://genome.ucsc.edu/cgi-bin/hgGateway)**  
Regions that are similar between X and Y. These regions do recombine with one another. We usually mask the chrY PAR and force all reads to chrX.

**Ampliconic - from Melissa Wilson (ASU)     
[chrX_genomic_features_hg19.bed](https://github.com/SexChrLab/SexChrCoordinates/blob/main/hg19/chrX_genomic_features_hg19.bed)  
[chrY_genomic_features_hg19.bed](https://github.com/SexChrLab/SexChrCoordinates/blob/main/hg19/chrY_genomic_features_hg19.bed)**    
segmentally duplicated regions within and possibly between X and Y

**XTR - from Melissa Wilson (ASU)  
[chrX_genomic_features_hg19.bed](https://github.com/SexChrLab/SexChrCoordinates/blob/main/hg19/chrX_genomic_features_hg19.bed)    
[chrY_genomic_features_hg19.bed](https://github.com/SexChrLab/SexChrCoordinates/blob/main/hg19/chrY_genomic_features_hg19.bed)**     
Regions that are quite similar between X and Y. These regions DO NOT recombine like the PARs. They are easier to map; however, with 97% similarity they can still be an issue for short-read technologies.

**AllAutosomes**  
everything but XY, so 1-22

**nonPAR**  
chrX minus PAR-X, and chrY minus PAR-Y

## Files for release<a id="release"></a>
GRCh37_AllAutosomes.bed  
GRCh37_chrX_PAR.bed  
GRCh37_chrX_XTR.bed  
GRCh37_chrX_ampliconic.bed  
GRCh37_chrX_nonPAR.bed  
GRCh37_chrY_PAR.bed  
GRCh37_chrY_XTR.bed  
GRCh37_chrY_ampliconic.bed  
GRCh37_chrY_nonPAR.bed

## Resources<a id="resources"></a>
[Melissa Wilson GRCh37 and GRCh38 XY regions](https://github.com/SexChrLab/SexChrCoordinates)

## Code and File Sharing<a id="share"></a>
- GIAB GitHub
- GIAB FTP

## Required tools and versions<a id="tools"></a>

In [1]:
# JMcDaniel bedtools conda environment 
conda list

# packages in environment at /Users/jmcdani/opt/anaconda3/envs/bedtools:
#
# Name                    Version                   Build  Channel
appnope                   0.1.0            py27hb466136_0  
backports                 1.1                pyhd3eb1b0_0  
backports.shutil_get_terminal_size 1.0.0              pyhd3eb1b0_3  
backports_abc             0.5                        py_1  
bedtools                  2.30.0               haa7f73a_1    bioconda
biopython                 1.74             py27h1de35cc_0  
bleach                    3.1.0                    py27_0  
bzip2                     1.0.8                h1de35cc_0  
ca-certificates           2022.3.29            hecd8cb5_0  
certifi                   2020.6.20          pyhd3eb1b0_3  
configparser              4.0.2                    py27_0  
dbus                      1.13.18              h18a8e69_0  
decorator                 4.4.0                    py27_1  
defusedxml                0.7.1              pyhd3eb1b0_0  

## Get Dependency Files<a id="depend"></a>

Retrieve the XY regions from the Melissa Wilson GitHub repository.  
The following files were downloaded from the GitHub site linked in Resources:
- https://github.com/SexChrLab/SexChrCoordinates/blob/main/hg19/chrX_genomic_features_hg19.bed
- https://github.com/SexChrLab/SexChrCoordinates/blob/main/hg19/chrY_genomic_features_hg19.bed
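
The download step could be scripted so that it is repeatable. The following is a minimal sketch, not the procedure actually used here: it assumes `curl` is available and that GitHub serves the raw file from `raw.githubusercontent.com` for a given `blob` link. `to_raw_url` and `RUN_DOWNLOAD` are hypothetical names introduced for illustration, and the output directory mirrors the `XY/intermediatefiles` path used later in this notebook.

```shell
# Sketch: turn the GitHub "blob" links above into raw-file URLs and fetch them.
# to_raw_url and RUN_DOWNLOAD are illustrative names, not part of the original workflow.
set -eu

to_raw_url() {
    # github.com/<org>/<repo>/blob/<branch>/<path>
    #   -> raw.githubusercontent.com/<org>/<repo>/<branch>/<path>
    printf '%s\n' "$1" \
    | sed -e 's#^https://github.com/#https://raw.githubusercontent.com/#' -e 's#/blob/#/#'
}

outdir="XY/intermediatefiles"
for chrom in X Y; do
    blob="https://github.com/SexChrLab/SexChrCoordinates/blob/main/hg19/chr${chrom}_genomic_features_hg19.bed"
    raw=$(to_raw_url "$blob")
    if [ "${RUN_DOWNLOAD:-0}" = "1" ]; then
        mkdir -p "$outdir"
        curl -fsSL -o "$outdir/$(basename "$raw")" "$raw"
    else
        # Dry run by default: only show what would be fetched.
        echo "would fetch: $raw"
    fi
done
```

Running with `RUN_DOWNLOAD=1` performs the actual download; the default dry run only prints the URLs so they can be checked first.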

## Stratification Preparation<a id="prep"></a>

## Prep input genomic region files<a id="inprep"></a>

#### View files from Melissa

In [1]:
cat XY/intermediatefiles/chrX_genomic_features_hg19.bed

chrX	60001	2699520	PAR1
chrX 48202745 48292983 Ampliconic_region_1
chrX 48976199 49062381 Ampliconic_region_2
chrX 51395467 51492862 Ampliconic_region_3
chrX 51775560 51966529 Ampliconic_region_4
chrX 52518132 53027386 Ampliconic_region_5
chrX 55464117 55574172 Ampliconic_region_6
chrX 62335733 62495350 Ampliconic_region_7
chrX 70894117 71055682 Ampliconic_region_8
chrX 71941159 72325075 Ampliconic_region_9
chrX	88193855	93193855	XTR
chrX 100818723 100903977 Ampliconic_region_10
chrX 101435778 101774391 Ampliconic_region_11
chrX 103195105 103362341 Ampliconic_region_12
chrX	154931044	155260560	PAR2


In [2]:
cat XY/intermediatefiles/chrY_genomic_features_hg19.bed

chrY	0	2749806	PAR1
chrY	2918085	6103152	XTR
chrY	6103152	6400947	Ampliconic
chrY	6400947	6616754	XTR
chrY	7442522	10135224	Ampliconic
chrY	16096353	16170613	Ampliconic
chrY	17986973	18017094	Ampliconic
chrY	18271675	18537845	Ampliconic
chrY	19568146	21032220	Ampliconic
chrY	23467839	28561582	Ampliconic
chrY	59133470	59373566	PAR2


**Remove the chr prefix from chromosome names for GRCh37**  
NOTE: The rows with Ampliconic regions in chrX_genomic_features_hg19.bed were not tab-delimited. Tabs were added manually in chrX_genomic_features_GRCh37.bed; chrY appeared to be fine.

In [10]:
# Strip the "chr" prefix at the start of each line, then print the result
sed 's/^chrX/X/' XY/intermediatefiles/chrX_genomic_features_hg19.bed > XY/intermediatefiles/chrX_genomic_features_GRCh37.bed
echo "chrX_genomic_features_GRCh37.bed"
cat XY/intermediatefiles/chrX_genomic_features_GRCh37.bed

sed 's/^chrY/Y/' XY/intermediatefiles/chrY_genomic_features_hg19.bed  > XY/intermediatefiles/chrY_genomic_features_GRCh37.bed
echo "chrY_genomic_features_GRCh37.bed"
cat XY/intermediatefiles/chrY_genomic_features_GRCh37.bed

chrX_genomic_features_GRCh37.bed
X	60001	2699520	PAR1
X 48202745 48292983 Ampliconic_region_1
X 48976199 49062381 Ampliconic_region_2
X 51395467 51492862 Ampliconic_region_3
X 51775560 51966529 Ampliconic_region_4
X 52518132 53027386 Ampliconic_region_5
X 55464117 55574172 Ampliconic_region_6
X 62335733 62495350 Ampliconic_region_7
X 70894117 71055682 Ampliconic_region_8
X 71941159 72325075 Ampliconic_region_9
X	88193855	93193855	XTR
X 100818723 100903977 Ampliconic_region_10
X 101435778 101774391 Ampliconic_region_11
X 103195105 103362341 Ampliconic_region_12
X	154931044	155260560	PAR2
chrY_genomic_features_GRCh37.bed
Y	0	2749806	PAR1
Y	2918085	6103152	XTR
Y	6103152	6400947	Ampliconic
Y	6400947	6616754	XTR
Y	7442522	10135224	Ampliconic
Y	16096353	16170613	Ampliconic
Y	17986973	18017094	Ampliconic
Y	18271675	18537845	Ampliconic
Y	19568146	21032220	Ampliconic
Y	23467839	28561582	Ampliconic
Y	59133470	59373566	PAR2
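
The manual tab fix noted above could instead be done in the same pass that strips the prefix. This is a minimal sketch of that idea, not what was run for this release. It relies on `awk` splitting on any whitespace by default and rejoining fields with tabs. The demo input copies three rows from the chrX listing above into a temporary directory, so the real files are not touched.

```shell
# Sketch: strip the "chr" prefix and force tab delimiters in one pass,
# so space-delimited rows do not need to be edited by hand.
# The demo input below copies three rows from the chrX listing above.
set -eu

tmp=$(mktemp -d)
in_bed="$tmp/chrX_genomic_features_hg19.bed"      # demo copy, not the real file
out_bed="$tmp/chrX_genomic_features_GRCh37.bed"

printf 'chrX\t60001\t2699520\tPAR1\n' > "$in_bed"
printf 'chrX 48202745 48292983 Ampliconic_region_1\n' >> "$in_bed"
printf 'chrX\t88193855\t93193855\tXTR\n' >> "$in_bed"

# Default field splitting accepts spaces or tabs. Assigning $1 to itself
# makes awk rebuild the record with OFS (a tab) between every field.
awk -v OFS='\t' '{ sub(/^chr/, "", $1); $1 = $1; print }' "$in_bed" > "$out_bed"

cat "$out_bed"
```

With every row tab-delimited, the later `cut -f 1-3` calls return three columns for the Ampliconic rows as well.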


## XTR<a id="xtr"></a>
From Melissa Wilson

In [1]:
# chrX
date
grep "XTR" XY/intermediatefiles/chrX_genomic_features_GRCh37.bed \
| cut -f 1-3 \
| sortBed -i stdin -faidx ref-files/hs37d5.fa.gz.fai > XY/GRCh37_chrX_XTR.bed

Mon May  9 10:35:45 EDT 2022


In [2]:
# chrY
date
grep "XTR" XY/intermediatefiles/chrY_genomic_features_GRCh37.bed \
| cut -f 1-3 \
| sortBed -i stdin -faidx ref-files/hs37d5.fa.gz.fai > XY/GRCh37_chrY_XTR.bed

Mon May  9 10:35:46 EDT 2022


## Ampliconic<a id="amp"></a>
From Melissa Wilson

In [3]:
# chrX (see note regarding chrX_genomic_features_GRCh37.bed in above prep of this file)
date
grep -e "Ampliconic" XY/intermediatefiles/chrX_genomic_features_GRCh37.bed \
| cut -f 1-3 \
| sortBed -i stdin -faidx ref-files/hs37d5.fa.gz.fai > XY/GRCh37_chrX_ampliconic.bed

Mon May  9 10:35:50 EDT 2022


In [4]:
# chrY
date
grep "Ampliconic" XY/intermediatefiles/chrY_genomic_features_GRCh37.bed \
| cut -f 1-3 \
| sortBed -i stdin -faidx ref-files/hs37d5.fa.gz.fai > XY/GRCh37_chrY_ampliconic.bed

Mon May  9 10:35:52 EDT 2022
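
The `grep` calls above match the feature name anywhere on the line. A slightly stricter alternative is to test the name column directly. The sketch below is illustrative only: the demo rows are copied from the chrY listing earlier in this notebook, and plain `sort` stands in for `sortBed -faidx`, which is a reasonable swap only because each file holds a single chromosome.

```shell
# Sketch: select features by the name column instead of grepping the whole line.
# Demo rows are copied from the chrY listing earlier in this notebook.
set -eu

tmp=$(mktemp -d)
features="$tmp/chrY_genomic_features_GRCh37.bed"   # demo copy
printf 'Y\t2918085\t6103152\tXTR\n'         >  "$features"
printf 'Y\t7442522\t10135224\tAmpliconic\n' >> "$features"
printf 'Y\t6103152\t6400947\tAmpliconic\n'  >> "$features"
printf 'Y\t6400947\t6616754\tXTR\n'         >> "$features"

# Keep rows whose 4th column starts with "Ampliconic" (this also covers the
# Ampliconic_region_N names used on chrX), then emit a 3-column BED.
# Plain sort stands in for sortBed here; that is only equivalent when the
# file holds a single chromosome.
awk -F'\t' -v OFS='\t' '$4 ~ /^Ampliconic/ { print $1, $2, $3 }' "$features" \
| sort -k1,1 -k2,2n > "$tmp/ampliconic.bed"

cat "$tmp/ampliconic.bed"
```

This approach requires tab-delimited input, so it would skip the space-delimited chrX rows unless they are normalized first.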


## PAR<a id="par"></a>
chrX from Heng Li, chrY from UCSC  
NOTE: updated 6/28/22 to use Heng and UCSC PARs

In [1]:
# chrX
date
cp ref-files/hs37d5.PAR.bed XY/GRCh37_chrX_PAR.bed

Tue Jun 28 15:44:41 EDT 2022


In [2]:
# chrY
date
cp ref-files/PSA_Y_hg19.bed XY/GRCh37_chrY_PAR.bed

Tue Jun 28 15:46:16 EDT 2022
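
Because both PAR files are copied verbatim from external sources, a quick format check before they are used downstream may be worthwhile. For example, if the UCSC-derived file still carried `chrY` names, `subtractBed` would find no overlap with the `Y` contig from hs37d5. The sketch below is an assumption-laden illustration: `check_bed` is a hypothetical helper, and the demo rows are placeholders rather than the contents of hs37d5.PAR.bed or PSA_Y_hg19.bed.

```shell
# Sketch: basic sanity checks for a 3-column BED copied from an external source.
# check_bed is an illustrative helper; the demo rows are placeholders, not the
# contents of hs37d5.PAR.bed or PSA_Y_hg19.bed.
set -eu

check_bed() {
    # Report lines that lack three tab-separated fields, have non-integer
    # coordinates, have start >= end, or still carry a "chr" prefix.
    # Exit status is non-zero if any line is reported.
    awk -F'\t' '
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 || $1 ~ /^chr/ {
            printf "line %d looks wrong: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

tmp=$(mktemp -d)
printf 'X\t100\t200\n'  > "$tmp/good.bed"
printf 'chrX 100 200\n' > "$tmp/bad.bed"

check_bed "$tmp/good.bed" && echo "good.bed: ok"
check_bed "$tmp/bad.bed"  || echo "bad.bed: needs fixing"
```

In practice the same function could be pointed at `XY/GRCh37_chrX_PAR.bed` and `XY/GRCh37_chrY_PAR.bed` right after the `cp` steps.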


## non-PAR<a id="non"></a>
NOTE: updated 6/28/22 to keep only nonPAR; nonXTR and nonAmpliconic were removed

In [7]:
# Prepare chrX and chrY BED files from the .fai (contig name, 0, contig length)
# Match the contig name exactly instead of grepping for any line containing "X" or "Y"
date
awk -v OFS='\t' '$1 == "X" {print $1,"0",$2}' ref-files/hs37d5.fa.gz.fai > XY/intermediatefiles/GRCh37_chrX.bed
awk -v OFS='\t' '$1 == "Y" {print $1,"0",$2}' ref-files/hs37d5.fa.gz.fai > XY/intermediatefiles/GRCh37_chrY.bed

Mon May  9 10:36:01 EDT 2022


In [3]:
# subtract PAR-X regions from chrX bed
# chrX
date
subtractBed -a XY/intermediatefiles/GRCh37_chrX.bed -b XY/GRCh37_chrX_PAR.bed > XY/GRCh37_chrX_nonPAR.bed

Tue Jun 28 15:50:37 EDT 2022


In [4]:
# subtract PAR-Y regions from chrY bed
# chrY
date
subtractBed -a XY/intermediatefiles/GRCh37_chrY.bed -b XY/GRCh37_chrY_PAR.bed > XY/GRCh37_chrY_nonPAR.bed

Tue Jun 28 15:50:38 EDT 2022
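
One way to confirm the subtraction is to check that PAR and nonPAR together cover the whole chromosome. The sketch below is illustrative only. It reuses the chrY PAR1 and PAR2 rows from the feature listing earlier in this notebook (the released PAR-Y file may differ) and assumes a chrY length equal to the PAR2 end coordinate. `bed_bp` is a hypothetical helper name.

```shell
# Sketch: confirm that PAR and nonPAR together tile the whole chromosome.
# Demo coordinates reuse the chrY PAR1/PAR2 rows from the feature listing
# earlier in this notebook; the released PAR-Y file may differ.
set -eu

tmp=$(mktemp -d)
# Assumed chromosome length: the PAR2 end coordinate from the listing.
printf 'Y\t0\t59373566\n'                       > "$tmp/chrY.bed"
printf 'Y\t0\t2749806\nY\t59133470\t59373566\n' > "$tmp/chrY_PAR.bed"
# The interval that subtracting the two PAR rows from the full contig leaves.
printf 'Y\t2749806\t59133470\n'                 > "$tmp/chrY_nonPAR.bed"

bed_bp() {
    # Total bases covered by a BED file (assumes non-overlapping intervals).
    awk -F'\t' '{ s += $3 - $2 } END { printf "%d\n", s }' "$1"
}

chrom_bp=$(bed_bp "$tmp/chrY.bed")
par_bp=$(bed_bp "$tmp/chrY_PAR.bed")
nonpar_bp=$(bed_bp "$tmp/chrY_nonPAR.bed")

echo "chrom=$chrom_bp PAR=$par_bp nonPAR=$nonpar_bp"
[ $((par_bp + nonpar_bp)) -eq "$chrom_bp" ] && echo "PAR + nonPAR covers the chromosome"
```

If the two totals do not add up to the contig length, the likely causes are mismatched chromosome names or overlapping PAR intervals.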


## All Autosomes<a id="auto"></a>

In [10]:
date
awk -v OFS='\t' '{print $1,"0",$2}' ref-files/hs37d5.fa.gz.fai \
| grep -v "X\|Y\|M\|^GL\|^NC\|^hs" > XY/GRCh37_AllAutosomes.bed

Mon May  9 10:36:07 EDT 2022
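
The exclusion pattern above works for the contig names in hs37d5. An allow-list on the contig name is an alternative that would not depend on knowing every non-autosomal name in advance. The following is a sketch under stated assumptions: the `.fai` rows are a toy stand-in (names follow hs37d5 conventions, lengths and offsets are made up), and the output is written to a temporary directory.

```shell
# Sketch: pick autosomes with an allow-list on the contig name rather than
# excluding everything else. The .fai rows below are a toy stand-in
# (names follow hs37d5 conventions, lengths are made up).
set -eu

tmp=$(mktemp -d)
fai="$tmp/demo.fa.fai"
for name in 1 2 10 22 X Y MT GL000207.1 NC_007605 hs37d5; do
    printf '%s\t1000\t0\t60\t61\n' "$name"
done > "$fai"

# Keep only contigs named 1 through 22 and write them as whole-contig BED rows.
awk -F'\t' -v OFS='\t' '$1 ~ /^([1-9]|1[0-9]|2[0-2])$/ { print $1, 0, $2 }' "$fai" \
> "$tmp/AllAutosomes.bed"

cat "$tmp/AllAutosomes.bed"
```

On the real index this should select the same 22 contigs as the `grep -v` version, so comparing the two outputs with `diff` would be a cheap cross-check.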
