# GRCh38 XY v3.1 stratifications
JMcDaniel started 5/5/22

## TABLE OF CONTENTS
<hr style="border:2px solid black">

- [Background](#background)  
- [File Descriptions](#files)  
- [Files for release](#release)  
- [Resources](#resources)  
- [Code and File Sharing](#share)  
- [Software Tools](#tools)  
- [Get Dependency Files](#depend)  
- [Stratification Prep](#prep)  
    - [XTR](#xtr)  
    - [Ampliconic](#amp)  
    - [PAR](#par)  
    - [non-PAR](#non)  
    - [All Autosomes](#auto)


## Background<a id="background"></a>


XY feature regions from Melissa Wilson, Heng Li, and UCSC are parsed into the following stratifications:

**PAR - The following PARs were used  
PAR-X [hs38.PAR.bed from Heng Li](https://github.com/lh3/dipcall/tree/master/data)  
PAR-Y [PSA_Y_GRCh38.bed regions from UCSC](https://genome.ucsc.edu/cgi-bin/hgGateway)**  
Regions that are similar between X and Y. These regions do recombine with one another. We usually mask the chrY PAR and force all reads to map to chrX.

**Ampliconic - from Melissa Wilson (ASU)     
[chrX_genomic_features_GRCh38.bed](https://github.com/SexChrLab/SexChrCoordinates/tree/main/GRCh38)  
[chrY_genomic_features_GRCh38.bed](https://github.com/SexChrLab/SexChrCoordinates/tree/main/GRCh38)**    
Segmentally duplicated regions within, and possibly between, X and Y.

**XTR - from Melissa Wilson (ASU)  
[chrX_genomic_features_GRCh38.bed](https://github.com/SexChrLab/SexChrCoordinates/tree/main/GRCh38)    
[chrY_genomic_features_GRCh38.bed](https://github.com/SexChrLab/SexChrCoordinates/tree/main/GRCh38)**   
Regions that are quite similar between X and Y. These regions do NOT recombine like the PARs. They are easier to map; however, at 97% similarity they can still be an issue for short-read technologies.

**AllAutosomes**  
Everything except X and Y, i.e. chr1-chr22.

**nonPAR**  
chrX minus PAR-X, and chrY minus PAR-Y.

## Files for release<a id="release"></a>
GRCh38_AllAutosomes.bed  
GRCh38_chrX_PAR.bed  
GRCh38_chrX_XTR.bed  
GRCh38_chrX_ampliconic.bed  
GRCh38_chrX_nonPAR.bed  
GRCh38_chrY_PAR.bed  
GRCh38_chrY_XTR.bed  
GRCh38_chrY_ampliconic.bed  
GRCh38_chrY_nonPAR.bed

## Resources<a id="resources"></a>
[Melissa Wilson GRCh37 and GRCh38 XY regions](https://github.com/SexChrLab/SexChrCoordinates)

## Code and File Sharing<a id="share"></a>
- GIAB GitHub
- GIAB FTP

## Required tools and versions<a id="tools"></a>

In [1]:
# JMcDaniel bedtools conda environment 
conda list

# packages in environment at /Users/jmcdani/opt/anaconda3/envs/bedtools:
#
# Name                    Version                   Build  Channel
appnope                   0.1.0            py27hb466136_0  
backports                 1.1                pyhd3eb1b0_0  
backports.shutil_get_terminal_size 1.0.0              pyhd3eb1b0_3  
backports_abc             0.5                        py_1  
bedtools                  2.30.0               haa7f73a_1    bioconda
biopython                 1.74             py27h1de35cc_0  
bleach                    3.1.0                    py27_0  
bzip2                     1.0.8                h1de35cc_0  
ca-certificates           2022.3.29            hecd8cb5_0  
certifi                   2020.6.20          pyhd3eb1b0_3  
configparser              4.0.2                    py27_0  
dbus                      1.13.18              h18a8e69_0  
decorator                 4.4.0                    py27_1  
defusedxml                0.7.1              pyhd3eb1b0_0  

## Get Dependency Files<a id="depend"></a>

The XY regions were retrieved from Melissa Wilson's GitHub repository (linked in Resources). The following files were downloaded:
- https://github.com/SexChrLab/SexChrCoordinates/blob/main/GRCh38/chrX_genomic_features_GRCh38.bed
- https://github.com/SexChrLab/SexChrCoordinates/blob/main/GRCh38/chrY_genomic_features_GRCh38.bed
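
The two links above are GitHub `blob` pages, which return HTML rather than the BED file itself when fetched from the command line. Below is a minimal sketch of a scripted download, assuming the standard `raw.githubusercontent.com` form of the same paths and `XY/intermediatefiles/` as the destination (taken from the `cat` commands later in this notebook). `blob_to_raw` is a hypothetical helper, and the loop only prints the `curl` commands unless `FETCH=1` is set.

```shell
# Hypothetical helper: rewrite a GitHub "blob" page URL to its raw-file URL.
blob_to_raw() {
    printf '%s\n' "$1" \
    | sed -e 's#^https://github.com/#https://raw.githubusercontent.com/#' \
          -e 's#/blob/#/#'
}

base=https://github.com/SexChrLab/SexChrCoordinates/blob/main/GRCh38
for f in chrX_genomic_features_GRCh38.bed chrY_genomic_features_GRCh38.bed; do
    raw=$(blob_to_raw "$base/$f")
    if [ "${FETCH:-0}" = 1 ]; then
        # Assumed destination directory, matching the paths used later on.
        mkdir -p XY/intermediatefiles
        curl -fsSL "$raw" -o "XY/intermediatefiles/$f"
    else
        # Dry run: show the command that would be executed.
        echo "curl -fsSL $raw -o XY/intermediatefiles/$f"
    fi
done
```

Run with `FETCH=1` in the environment to actually download; the default dry run makes it easy to confirm the URLs first.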

## Stratification Preparation<a id="prep"></a>

#### View files from Melissa

In [2]:
cat XY/intermediatefiles/chrX_genomic_features_GRCh38.bed

chrX	10001	2781479	PAR1
chrX	48343310	48434604	Ampliconic_region_1
chrX	49157254	49162986	Ampliconic_region_2
chrX	52032464	52223402	Ampliconic_region_4
chrX	52511000	52511492	Ampliconic_region_5_1
chrX	52489200	52501394	Ampliconic_region_5_2
chrX	52520509	52520537	Ampliconic_region_5_3
chrX	139870917	139871081	Ampliconic_region_5_4
chrX	55437684	55547739	Ampliconic_region_6
chrX	63185573	63193371	Ampliconic_region_7_1
chrX	71674267	71835832	Ampliconic_region_8
chrX	72721314	73105236	Ampliconic_region_9
chrX	88369343	93042942	XTR
chrX	101563740	101648990	Ampliconic_region_10
chrX	102180805	102519463	Ampliconic_region_11
chrX	103940531	104117650	Ampliconic_region_12
chrX	155701383	156030895	PAR2


In [4]:
cat XY/intermediatefiles/chrY_genomic_features_GRCh38.bed

chrY	10001	2781479	PAR1
chrY	3050044	6235111	XTR
chrY	6235111	6532906	Ampliconic
chrY	6532906	6748713	XTR
chrY	7574481	10266944	Ampliconic
chrY	13984473	14058733	Ampliconic
chrY	15875093	15905214	Ampliconic
chrY	16159795	16425965	Ampliconic
chrY	17456266	18870334	Ampliconic
chrY	21305953	26415435	Ampliconic
chrY	56987321	57217415	PAR2
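
The chrX feature file above is not in coordinate order (for example, `Ampliconic_region_5_4` at 139870917 is listed before `Ampliconic_region_6` at 55437684), which is why the extraction steps that follow pipe through `sortBed`. Below is a minimal sketch of a sortedness check using only `sort -c`. The three rows are copied from the chrX listing above, and the temporary file name is arbitrary.

```shell
tmp=$(mktemp)
# Three rows copied from chrX_genomic_features_GRCh38.bed, in file order.
printf 'chrX\t52520509\t52520537\tAmpliconic_region_5_3\n'   >  "$tmp"
printf 'chrX\t139870917\t139871081\tAmpliconic_region_5_4\n' >> "$tmp"
printf 'chrX\t55437684\t55547739\tAmpliconic_region_6\n'     >> "$tmp"

# sort -c only checks order: it exits non-zero if any line is out of place.
if sort -c -k1,1 -k2,2n "$tmp" 2>/dev/null; then
    sorted=yes
else
    sorted=no
fi
echo "coordinate-sorted: $sorted"
rm -f "$tmp"
```

This check orders chromosome names lexically, whereas `sortBed -faidx` follows the order in the `.fai`; the two agree for a single-chromosome file like this one.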


## XTR<a id="xtr"></a>
from Melissa Wilson

In [1]:
# chrX
grep "XTR" XY/intermediatefiles/chrX_genomic_features_GRCh38.bed \
| cut -f 1-3 \
| sortBed -i stdin -faidx ref-files/GRCh38.fa.fai > XY/GRCh38_chrX_XTR.bed

In [2]:
# chrY
grep "XTR" XY/intermediatefiles/chrY_genomic_features_GRCh38.bed \
| cut -f 1-3 \
| sortBed -i stdin -faidx ref-files/GRCh38.fa.fai > XY/GRCh38_chrY_XTR.bed

## Ampliconic<a id="amp"></a>
from Melissa Wilson

In [3]:
# chrX
grep "Ampliconic" XY/intermediatefiles/chrX_genomic_features_GRCh38.bed \
| cut -f 1-3 \
| sortBed -i stdin -faidx ref-files/GRCh38.fa.fai > XY/GRCh38_chrX_ampliconic.bed

In [4]:
# chrY
grep "Ampliconic" XY/intermediatefiles/chrY_genomic_features_GRCh38.bed \
| cut -f 1-3 \
| sortBed -i stdin -faidx ref-files/GRCh38.fa.fai > XY/GRCh38_chrY_ampliconic.bed
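
`grep` in the cells above matches the label anywhere on the line. Below is an alternative sketch that keys on the name column instead, using `awk` and `sort` rather than `sortBed` (equivalent here only because each feature file holds a single chromosome). The four input rows are copied from the chrX listing above, and `extract_features` is a hypothetical helper.

```shell
# Hypothetical helper: keep rows whose 4th column starts with the given
# label, drop the name column, and sort by chromosome then start coordinate.
extract_features() {
    awk -v OFS='\t' -v label="$1" 'index($4, label) == 1 {print $1, $2, $3}' "$2" \
    | sort -k1,1 -k2,2n
}

tmp=$(mktemp)
# Rows copied from chrX_genomic_features_GRCh38.bed (5_1 precedes 5_2 there).
printf 'chrX\t10001\t2781479\tPAR1\n'                      >  "$tmp"
printf 'chrX\t52511000\t52511492\tAmpliconic_region_5_1\n' >> "$tmp"
printf 'chrX\t52489200\t52501394\tAmpliconic_region_5_2\n' >> "$tmp"
printf 'chrX\t88369343\t93042942\tXTR\n'                   >> "$tmp"

amp=$(extract_features Ampliconic "$tmp")
xtr=$(extract_features XTR "$tmp")
printf '%s\n' "$amp"
rm -f "$tmp"
```

Matching on the column rather than the whole line avoids picking up a row if a label ever appears inside another field.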

## PAR<a id="par"></a>
chrX from Heng Li, chrY from UCSC  
NOTE: updated 6/28/22 to use the Heng Li (chrX) and UCSC (chrY) PARs

In [1]:
# chrX
date
cp ref-files/hs38.PAR.bed XY/GRCh38_chrX_PAR.bed

Tue Jun 28 15:56:22 EDT 2022


In [2]:
# chrY
date
cp ref-files/PSA_Y_GRCh38.bed XY/GRCh38_chrY_PAR.bed

Tue Jun 28 15:57:28 EDT 2022


## non-PAR<a id="non"></a>  
NOTE: updated 6/28/22 to produce only nonPAR; nonXTR and nonAmpliconic were removed

In [7]:
# Prepare chrX and chrY bedfiles from .fai
awk -v OFS='\t' '{print $1, "0", $2}' ref-files/GRCh38.fa.fai | grep "chrX" > XY/intermediatefiles/GRCh38_chrX.bed
awk -v OFS='\t' '{print $1, "0", $2}' ref-files/GRCh38.fa.fai | grep "chrY" > XY/intermediatefiles/GRCh38_chrY.bed

In [4]:
# subtract PAR-X regions from chrX bed
# chrX
date
subtractBed -a XY/intermediatefiles/GRCh38_chrX.bed -b XY/GRCh38_chrX_PAR.bed > XY/GRCh38_chrX_nonPAR.bed

Tue Jun 28 15:59:01 EDT 2022


In [5]:
# subtract PAR-Y regions from chrY bed
# chrY
date
subtractBed -a XY/intermediatefiles/GRCh38_chrY.bed -b XY/GRCh38_chrY_PAR.bed > XY/GRCh38_chrY_nonPAR.bed

Tue Jun 28 16:00:19 EDT 2022
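
One way to sanity-check the subtraction is that the nonPAR and PAR intervals should together account for every base of the chromosome. Below is a minimal sketch of that check, assuming the PAR intervals do not overlap one another. The coordinates are made-up toy values (not the real GRCh38 PARs), and `bed_bp` is a hypothetical helper.

```shell
# Hypothetical helper: total bases covered by a BED file (end - start per row).
bed_bp() { awk '{n += $3 - $2} END {print n + 0}' "$1"; }

dir=$(mktemp -d)
# Toy 1000 bp chromosome with one PAR near each end (illustrative values only).
printf 'chrT\t0\t1000\n'                 > "$dir/chrom.bed"
printf 'chrT\t10\t100\nchrT\t900\t990\n' > "$dir/par.bed"
# What subtracting par.bed from chrom.bed should leave behind.
printf 'chrT\t0\t10\nchrT\t100\t900\nchrT\t990\t1000\n' > "$dir/nonpar.bed"

chrom_bp=$(bed_bp "$dir/chrom.bed")
par_bp=$(bed_bp "$dir/par.bed")
nonpar_bp=$(bed_bp "$dir/nonpar.bed")
if [ $((par_bp + nonpar_bp)) -eq "$chrom_bp" ]; then
    echo "OK: $par_bp PAR + $nonpar_bp nonPAR = $chrom_bp bp"
else
    echo "MISMATCH: $par_bp + $nonpar_bp != $chrom_bp"
fi
rm -rf "$dir"
```

With the real files, the three inputs would be `XY/intermediatefiles/GRCh38_chrX.bed`, `XY/GRCh38_chrX_PAR.bed` and `XY/GRCh38_chrX_nonPAR.bed` (and the chrY equivalents).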


## All Autosomes<a id="auto"></a>

In [10]:
awk -v OFS='\t' '{print $1, "0", $2}' ref-files/GRCh38.fa.fai \
| grep -v "chrX\|chrY\|chrM" > XY/GRCh38_AllAutosomes.bed
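
`grep -v` removes chrX, chrY and chrM but keeps every other sequence listed in the `.fai`, so the result is chr1-chr22 only if the index contains no unplaced, alt, or decoy contigs. Below is a stricter sketch that keeps only names matching chr1 to chr22 exactly. The toy `.fai` rows use made-up lengths and a made-up contig name, and `fai_autosomes_to_bed` is a hypothetical helper.

```shell
# Hypothetical helper: emit BED rows for chr1..chr22 only, in .fai order.
fai_autosomes_to_bed() {
    awk -v OFS='\t' '$1 ~ /^chr([1-9]|1[0-9]|2[0-2])$/ {print $1, "0", $2}' "$1"
}

tmp=$(mktemp)
# Toy .fai: name, length, then offset and line-width columns (made-up values).
printf 'chr1\t500\t6\t60\t61\n'         >  "$tmp"
printf 'chr22\t300\t520\t60\t61\n'      >> "$tmp"
printf 'chrX\t400\t830\t60\t61\n'       >> "$tmp"
printf 'chrM\t50\t1240\t60\t61\n'       >> "$tmp"
printf 'chrUn_toy1\t20\t1300\t60\t61\n' >> "$tmp"

autosomes=$(fai_autosomes_to_bed "$tmp")
printf '%s\n' "$autosomes"
rm -f "$tmp"
```

The same pattern with `$1 == "chrX"` or `$1 == "chrY"` gives an exact-name version of the chrX and chrY BED preparation in the non-PAR section.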