# v3.1 Stratification Post-Processing
<hr style="border:2px solid gray"> </hr>
JMcDaniel  
2022-06-29

This notebook can be used, together with the `stratification_post-processing.sh` script, as a template for post-processing future stratification versions.  

## TABLE OF CONTENTS
<hr style="border:2px solid black"> </hr>

**[PURPOSE](#purpose)**    

**[FILE PROCESSING](#process)**  
1. [Standardize files](#std)  
2. [Rename files](#rename)  
3. [Consolidate candidate release files](#consolidate)  
4. [Generate md5s of candidate files](#md5)  

In [50]:
pwd

/Users/jmcdani/Documents/GiaB/Benchmarking/Stratifications/v3.1_genome-stratifications/post-processing


[Google Drive for v3.1 Stratifications](https://drive.google.com/drive/folders/1Z0_az2KWiDcvcpf3pu9qzfSC1hHuCchD?usp=sharing)

# Purpose<a id="purpose"></a>

This notebook describes the process for preparing the new (2022 v3.1) GRCh37, GRCh38 and CHM13 stratification BEDs for release. The steps below were performed for stratifications that are new or revised in the v3.1 release. Files carried over from v3.0 are used directly and are not run through post-processing again.  

Stratification BED Files will be modified for consistency as follows:

**1. standardize output using shell script**  
`stratification_post-processing.sh`  

Post-processing for GRCh3X (GRCh37 and GRCh38) includes:  
- sort by chromosome and position  
- subset to only chromosomes 1-22, X & Y (removing contig chroms)  
- merge overlapping regions  
- subtract refNs (gaps)  
- remove pseudoautosomal Y regions (not removed for the XY stratifications)  
- add header to BED giving context for BED  
- compress using bgzip   

Post-processing for CHM13 includes:  
- sort by chromosome and position  
- subset to only chromosomes 1-22, X & Y (removing contig chroms)  
- merge overlapping regions  
- unlike GRCh3X, no regions are removed  
- add header to BED giving context for BED  
- compress using bgzip  
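
The sort, subset and merge steps shared by both lists can be illustrated without bedtools. The following is a minimal sketch, not the contents of `stratification_post-processing.sh`; the function name `standardize_bed` and the awk-based merge are assumptions made for illustration, standing in for whatever bedtools calls the script uses. Gap subtraction, PAR removal, header insertion and bgzip compression are left out.

```shell
# Hypothetical helper: sort, subset to chr1-22/X/Y, and merge a BED file.
# Stand-in for the bedtools-based steps; reads $1, writes BED3 to stdout.
standardize_bed() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                       # skip header/comment lines
         {
             c = $1; sub(/^chr/, "", c)      # accept "chr1" or "1" style names
             n = c + 0
             if (c == "X") rank = 23
             else if (c == "Y") rank = 24
             else if (c ~ /^[0-9]+$/ && n >= 1 && n <= 22) rank = n
             else next                       # drop contigs, chrM, etc.
             print rank, $1, $2, $3
         }' "$1" |
    sort -t "$(printf '\t')" -k1,1n -k3,3n -k4,4n |
    awk 'BEGIN { FS = OFS = "\t" }
         # extend the current interval while the next one overlaps or abuts it
         NR > 1 && $2 == chrom && $3 + 0 <= stop + 0 {
             if ($4 + 0 > stop + 0) stop = $4
             next
         }
         NR > 1 { print chrom, start, stop }
         { chrom = $2; start = $3; stop = $4 }
         END { if (NR > 0) print chrom, start, stop }'
}
```

Abutting intervals are merged along with overlapping ones, which matches the default behaviour of `bedtools merge`. The numeric rank column exists only to get 1-22, X, Y ordering from a plain `sort`.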
    
**2. rename files using standard naming convention**  
`<reference>_<stratification-type>.bed.gz`
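
A quick way to flag files that do not follow the convention is sketched below. The function name `follows_convention` is hypothetical, and the list of accepted reference prefixes is an assumption taken from the file names that appear in this notebook.

```shell
# Hypothetical check: does a file name follow <reference>_<stratification-type>.bed.gz?
# Accepted reference prefixes are assumed from the file names used in this notebook.
follows_convention() {
    case "$(basename "$1")" in
        GRCh37_?*.bed.gz|GRCh38_?*.bed.gz|CHM13v2.0_?*.bed.gz) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: one conforming name and one that still carries an old reference version
for name in GRCh38_chainSelf.bed.gz CHM13v1.1_rDNA.bed.gz; do
    if follows_convention "$name"; then
        echo "ok:     $name"
    else
        echo "rename: $name"
    fi
done
```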

**3. output md5s**

This work was performed on JMcDaniel's computer in directory: /Users/jmcdani/Documents/GiaB/Benchmarking/Stratifications/v3.1_genome-stratifications

# File Processing<a id="process"></a>

### Create header file

In [7]:
echo "#This is part of a collection of bed files intended to stratify performance by genome context using GA4GH benchmarking tools" > /Users/jmcdani/Documents/GiaB/Benchmarking/Stratifications/v3.1_genome-stratifications/post-processing/header.txt
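
How the header might be prepended to a BED body is sketched below. This is an assumption about the mechanics, not a copy of the post-processing script; `add_header` is a hypothetical name, and dropping pre-existing comment lines is a choice made here so the header is not duplicated.

```shell
# Hypothetical helper: write the header file ($1) followed by the BED body ($2)
# to stdout, dropping any comment lines already present in the body.
add_header() {
    cat "$1"
    awk '!/^#/' "$2"
}
```

The result could then be piped to a compressor, for example `add_header header.txt body.bed | bgzip > body.bed.gz`, where `body.bed` is a placeholder file name.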

## 1. Standardize output using shell script to process all files in directories<a id="std"></a>
`stratification_post-processing.sh`

## GRCh38

### XY

In [1]:
#Get list of md5s for original files prior to any post processing
date
for f in ../GRCh38-stratifications/XY/*.bed*
do
date >> ../GRCh38-stratifications/XY/GRCh38-XY-v3.1-original_file_md5s.txt
md5 $f >> ../GRCh38-stratifications/XY/GRCh38-XY-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:05:53 EDT 2022
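
The loop above relies on the macOS `md5` command. A more portable variant is sketched here for anyone re-running this template on Linux; `md5_of` and `record_md5s` are hypothetical names, and unlike the loop above this version writes the date once per run rather than once per file.

```shell
# Hypothetical wrapper: print only the MD5 digest of a file, using whichever tool exists.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{ print $1 }'
    else
        md5 -q "$1"        # macOS/BSD: -q prints only the digest
    fi
}

# Hypothetical helper: append a date stamp and one "<digest>  <path>" line per BED.
# $1: directory of BEDs, $2: output list. Quoting keeps paths with spaces intact.
record_md5s() {
    date >> "$2"
    for f in "$1"/*.bed*; do
        [ -e "$f" ] || continue      # skip the literal pattern when nothing matches
        printf '%s  %s\n' "$(md5_of "$f")" "$f" >> "$2"
    done
}
```

Example call, using the paths from the cell above: `record_md5s ../GRCh38-stratifications/XY GRCh38-XY-v3.1-original_file_md5s.txt`. Writing the list outside the input directory avoids it being picked up by later globs.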


In [2]:
sh ../stratification_post-processing.sh -i ../GRCh38-stratifications/XY/ -n refNs/ -r GRCh38

Wed Jun 29 08:06:58 EDT 2022
bedtools v2.30.0
ref is: GRCh38 
XY
post-processed/GRCh38/XY
Processing for GRCh38
processing XY: ../GRCh38-stratifications/XY/GRCh38_AllAutosomes.bed
processed XY:   GRCh38_AllAutosomes.bed
processing XY: ../GRCh38-stratifications/XY/GRCh38_chrX_PAR.bed
processed XY:   GRCh38_chrX_PAR.bed
processing XY: ../GRCh38-stratifications/XY/GRCh38_chrX_XTR.bed
processed XY:   GRCh38_chrX_XTR.bed
processing XY: ../GRCh38-stratifications/XY/GRCh38_chrX_ampliconic.bed
processed XY:   GRCh38_chrX_ampliconic.bed
processing XY: ../GRCh38-stratifications/XY/GRCh38_chrX_nonPAR.bed
processed XY:   GRCh38_chrX_nonPAR.bed
processing XY: ../GRCh38-stratifications/XY/GRCh38_chrY_PAR.bed
processed XY:   GRCh38_chrY_PAR.bed
processing XY: ../GRCh38-stratifications/XY/GRCh38_chrY_XTR.bed
processed XY:   GRCh38_chrY_XTR.bed
processing XY: ../GRCh38-stratifications/XY/GRCh38_chrY_ampliconic.bed
processed XY:   GRCh38_chrY_ampliconic.bed
processing XY: ../GRCh38-stratifications/XY/GR

### Union

In [5]:
#Get list of md5s for original files prior to any post processing
date
for f in ../GRCh38-stratifications/Union/*.bed*
do
date >> ../GRCh38-stratifications/Union/GRCh38-Union-v3.1-original_file_md5s.txt
md5 $f >> ../GRCh38-stratifications/Union/GRCh38-Union-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:08:53 EDT 2022


In [6]:
sh ../stratification_post-processing.sh -i ../GRCh38-stratifications/Union/ -n refNs/ -r GRCh38

Wed Jun 29 08:08:58 EDT 2022
bedtools v2.30.0
ref is: GRCh38 
Union
post-processed/GRCh38/Union
Processing for GRCh38
processing: ../GRCh38-stratifications/Union/GRCh38_allTandemRepeats.bed.gz
processed:   GRCh38_allTandemRepeats.bed.gz
processing: ../GRCh38-stratifications/Union/GRCh38_alldifficultregions.bed.gz
processed:   GRCh38_alldifficultregions.bed.gz
processing: ../GRCh38-stratifications/Union/GRCh38_notinallTandemRepeats.bed.gz
processed:   GRCh38_notinallTandemRepeats.bed.gz
processing: ../GRCh38-stratifications/Union/GRCh38_notinalldifficultregions.bed.gz
processed:   GRCh38_notinalldifficultregions.bed.gz
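
Because some of the recorded output in this notebook is cut off, a check that every input produced an output can be useful. The sketch below is hypothetical: `missing_outputs` is not part of the post-processing script, and it assumes outputs keep the input base name, gaining `.gz` if the input was uncompressed.

```shell
# Hypothetical check: list inputs in $1 that have no counterpart in output dir $2.
# Assumes outputs keep the input base name, with ".gz" added for uncompressed inputs.
missing_outputs() {
    for f in "$1"/*.bed "$1"/*.bed.gz; do
        [ -e "$f" ] || continue          # skip the literal pattern when nothing matches
        name=$(basename "$f")
        [ -e "$2/$name" ] || [ -e "$2/${name%.gz}.gz" ] || echo "$name"
    done
}
```

Example call: `missing_outputs ../GRCh38-stratifications/Union post-processed/GRCh38/Union`. No output means every input has a matching file.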


### LowComplexity

In [7]:
#Get list of md5s for original files prior to any post processing
date
for f in ../GRCh38-stratifications/LowComplexity/*.bed*
do
date >> ../GRCh38-stratifications/LowComplexity/GRCh38-LowComplexity-v3.1-original_file_md5s.txt
md5 $f >> ../GRCh38-stratifications/LowComplexity/GRCh38-LowComplexity-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:09:57 EDT 2022


In [8]:
sh ../stratification_post-processing.sh -i ../GRCh38-stratifications/LowComplexity/ -n refNs/ -r GRCh38

Wed Jun 29 08:10:29 EDT 2022
bedtools v2.30.0
ref is: GRCh38 
LowComplexity
post-processed/GRCh38/LowComplexity
Processing for GRCh38
processing: ../GRCh38-stratifications/LowComplexity/GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz
processed:   GRCh38_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz
processing: ../GRCh38-stratifications/LowComplexity/GRCh38_AllTandemRepeats_201to10000bp_slop5.bed.gz
processed:   GRCh38_AllTandemRepeats_201to10000bp_slop5.bed.gz
processing: ../GRCh38-stratifications/LowComplexity/GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz
processed:   GRCh38_AllTandemRepeats_51to200bp_slop5.bed.gz
processing: ../GRCh38-stratifications/LowComplexity/GRCh38_AllTandemRepeats_gt10000bp_slop5.bed.gz
processed:   GRCh38_AllTandemRepeats_gt10000bp_slop5.bed.gz
processing: ../GRCh38-stratifications/LowComplexity/GRCh38_AllTandemRepeats_gt100bp_slop5.bed.gz
processed:   GRCh38_AllTandemRepeats_gt100bp_slop5.bed.gz
processing: ../GRCh38-stratifications/LowComple

### SegmentalDuplications

In [45]:
#Get list of md5s for original files prior to any post processing
date
for f in ../GRCh38-stratifications/SegmentalDuplications/*.bed*
do
date >> ../GRCh38-stratifications/SegmentalDuplications/GRCh38-SegmentalDuplications-v3.1-original_file_md5s.txt
md5 $f >> ../GRCh38-stratifications/SegmentalDuplications/GRCh38-SegmentalDuplications-v3.1-original_file_md5s.txt
done

Wed Jun 29 11:15:38 EDT 2022


In [46]:
sh ../stratification_post-processing.sh -i ../GRCh38-stratifications/SegmentalDuplications/ -n refNs/ -r GRCh38

Wed Jun 29 11:17:11 EDT 2022
bedtools v2.30.0
ref is: GRCh38 
SegmentalDuplications
post-processed/GRCh38/SegmentalDuplications
Processing for GRCh38
processing: ../GRCh38-stratifications/SegmentalDuplications/GRCh38_chainSelf.bed.gz
processed:   GRCh38_chainSelf.bed.gz
processing: ../GRCh38-stratifications/SegmentalDuplications/GRCh38_chainSelf_gt10kb.bed.gz
processed:   GRCh38_chainSelf_gt10kb.bed.gz
processing: ../GRCh38-stratifications/SegmentalDuplications/GRCh38_notinchainSelf.bed.gz
processed:   GRCh38_notinchainSelf.bed.gz
processing: ../GRCh38-stratifications/SegmentalDuplications/GRCh38_notinchainSelf_gt10kb.bed.gz
processed:   GRCh38_notinchainSelf_gt10kb.bed.gz


## GRCh37

### XY

In [12]:
#Get list of md5s for original files prior to any post processing
date
for f in ../GRCh37-stratifications/XY/*.bed*
do
date >> ../GRCh37-stratifications/XY/GRCh37-XY-v3.1-original_file_md5s.txt
md5 $f >> ../GRCh37-stratifications/XY/GRCh37-XY-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:20:00 EDT 2022


In [13]:
sh ../stratification_post-processing.sh -i ../GRCh37-stratifications/XY/ -n refNs/ -r GRCh37

Wed Jun 29 08:20:09 EDT 2022
bedtools v2.30.0
ref is: GRCh37 
XY
post-processed/GRCh37/XY
Processing for GRCh37
processing XY: ../GRCh37-stratifications/XY/GRCh37_AllAutosomes.bed
processed XY:   GRCh37_AllAutosomes.bed
processing XY: ../GRCh37-stratifications/XY/GRCh37_chrX_PAR.bed
processed XY:   GRCh37_chrX_PAR.bed
processing XY: ../GRCh37-stratifications/XY/GRCh37_chrX_XTR.bed
processed XY:   GRCh37_chrX_XTR.bed
processing XY: ../GRCh37-stratifications/XY/GRCh37_chrX_ampliconic.bed
processed XY:   GRCh37_chrX_ampliconic.bed
processing XY: ../GRCh37-stratifications/XY/GRCh37_chrX_nonPAR.bed
processed XY:   GRCh37_chrX_nonPAR.bed
processing XY: ../GRCh37-stratifications/XY/GRCh37_chrY_PAR.bed
processed XY:   GRCh37_chrY_PAR.bed
processing XY: ../GRCh37-stratifications/XY/GRCh37_chrY_XTR.bed
processed XY:   GRCh37_chrY_XTR.bed
processing XY: ../GRCh37-stratifications/XY/GRCh37_chrY_ampliconic.bed
processed XY:   GRCh37_chrY_ampliconic.bed
processing XY: ../GRCh37-stratifications/XY/GR

### Union

In [14]:
#Get list of md5s for original files prior to any post processing
date
for f in ../GRCh37-stratifications/Union/*.bed*
do
date >> ../GRCh37-stratifications/Union/GRCh37-Union-v3.1-original_file_md5s.txt
md5 $f >> ../GRCh37-stratifications/Union/GRCh37-Union-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:23:56 EDT 2022


In [15]:
sh ../stratification_post-processing.sh -i ../GRCh37-stratifications/Union/ -n refNs/ -r GRCh37

Wed Jun 29 08:24:02 EDT 2022
bedtools v2.30.0
ref is: GRCh37 
Union
post-processed/GRCh37/Union
Processing for GRCh37
processing: ../GRCh37-stratifications/Union/GRCh37_allTandemRepeats.bed.gz
processed:   GRCh37_allTandemRepeats.bed.gz
processing: ../GRCh37-stratifications/Union/GRCh37_alldifficultregions.bed.gz
processed:   GRCh37_alldifficultregions.bed.gz
processing: ../GRCh37-stratifications/Union/GRCh37_alllowmapandsegdupregions.bed.gz
processed:   GRCh37_alllowmapandsegdupregions.bed.gz
processing: ../GRCh37-stratifications/Union/GRCh37_notinallTandemRepeats.bed.gz
processed:   GRCh37_notinallTandemRepeats.bed.gz
processing: ../GRCh37-stratifications/Union/GRCh37_notinalldifficultregions.bed.gz
processed:   GRCh37_notinalldifficultregions.bed.gz
processing: ../GRCh37-stratifications/Union/GRCh37_notinalllowmapandsegdupregions.bed.gz
processed:   GRCh37_notinalllowmapandsegdupregions.bed.gz


### LowComplexity

In [16]:
#Get list of md5s for original files prior to any post processing
date
for f in ../GRCh37-stratifications/LowComplexity/*.bed*
do
date >> ../GRCh37-stratifications/LowComplexity/GRCh37-LowComplexity-v3.1-original_file_md5s.txt
md5 $f >> ../GRCh37-stratifications/LowComplexity/GRCh37-LowComplexity-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:25:21 EDT 2022


In [17]:
sh ../stratification_post-processing.sh -i ../GRCh37-stratifications/LowComplexity/ -n refNs/ -r GRCh37

Wed Jun 29 08:25:28 EDT 2022
bedtools v2.30.0
ref is: GRCh37 
LowComplexity
post-processed/GRCh37/LowComplexity
Processing for GRCh37
processing: ../GRCh37-stratifications/LowComplexity/GRCh37_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz
processed:   GRCh37_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz
processing: ../GRCh37-stratifications/LowComplexity/GRCh37_AllTandemRepeats_201to10000bp_slop5.bed.gz
processed:   GRCh37_AllTandemRepeats_201to10000bp_slop5.bed.gz
processing: ../GRCh37-stratifications/LowComplexity/GRCh37_AllTandemRepeats_51to200bp_slop5.bed.gz
processed:   GRCh37_AllTandemRepeats_51to200bp_slop5.bed.gz
processing: ../GRCh37-stratifications/LowComplexity/GRCh37_AllTandemRepeats_gt10000bp_slop5.bed.gz
processed:   GRCh37_AllTandemRepeats_gt10000bp_slop5.bed.gz
processing: ../GRCh37-stratifications/LowComplexity/GRCh37_AllTandemRepeats_gt100bp_slop5.bed.gz
processed:   GRCh37_AllTandemRepeats_gt100bp_slop5.bed.gz
processing: ../GRCh37-stratifications/LowComple

### SegmentalDuplications

In [49]:
#Get list of md5s for original files prior to any post processing
date
for f in ../GRCh37-stratifications/SegmentalDuplications/*.bed*
do
date >> ../GRCh37-stratifications/SegmentalDuplications/GRCh37-SegmentalDuplications-v3.1-original_file_md5s.txt
md5 $f >> ../GRCh37-stratifications/SegmentalDuplications/GRCh37-SegmentalDuplications-v3.1-original_file_md5s.txt
done

Wed Jun 29 11:23:20 EDT 2022


In [48]:
sh ../stratification_post-processing.sh -i ../GRCh37-stratifications/SegmentalDuplications/ -n refNs/ -r GRCh37

Wed Jun 29 11:18:24 EDT 2022
bedtools v2.30.0
ref is: GRCh37 
SegmentalDuplications
post-processed/GRCh37/SegmentalDuplications
Processing for GRCh37
processing: ../GRCh37-stratifications/SegmentalDuplications/GRCh37_chainSelf.bed.gz
processed:   GRCh37_chainSelf.bed.gz
processing: ../GRCh37-stratifications/SegmentalDuplications/GRCh37_chainSelf_gt10kb.bed.gz
processed:   GRCh37_chainSelf_gt10kb.bed.gz
processing: ../GRCh37-stratifications/SegmentalDuplications/GRCh37_notinchainSelf.bed.gz
processed:   GRCh37_notinchainSelf.bed.gz
processing: ../GRCh37-stratifications/SegmentalDuplications/GRCh37_notinchainSelf_gt10kb.bed.gz
processed:   GRCh37_notinchainSelf_gt10kb.bed.gz
processing: ../GRCh37-stratifications/SegmentalDuplications/GRCh37_notinsegdups.bed.gz
processed:   GRCh37_notinsegdups.bed.gz
processing: ../GRCh37-stratifications/SegmentalDuplications/GRCh37_notinsegdups_gt10kb.bed.gz
processed:   GRCh37_notinsegdups_gt10kb.bed.gz
processing: ../GRCh37-stratifications/SegmentalDup

## CHM13v2.0

### XY

In [26]:
#Get list of md5s for original files prior to any post processing
date
for f in ../T2T-CHM13v2.0-stratifications/XY/*.bed*
do
date >> ../T2T-CHM13v2.0-stratifications/XY/CHM13v2.0-XY-v3.1-original_file_md5s.txt
md5 $f >> ../T2T-CHM13v2.0-stratifications/XY/CHM13v2.0-XY-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:36:57 EDT 2022


In [22]:
sh ../stratification_post-processing.sh -i ../T2T-CHM13v2.0-stratifications/XY/ -n refNs/ -r CHM13

Wed Jun 29 08:33:28 EDT 2022
bedtools v2.30.0
ref is: CHM13 
XY
post-processed/CHM13/XY
Processing for CHM13
processing XY: ../T2T-CHM13v2.0-stratifications/XY/CHM13v2.0_AllAutosomes.bed
processed XY:   CHM13v2.0_AllAutosomes.bed
processing XY: ../T2T-CHM13v2.0-stratifications/XY/CHM13v2.0_chrX_PAR.bed
processed XY:   CHM13v2.0_chrX_PAR.bed
processing XY: ../T2T-CHM13v2.0-stratifications/XY/CHM13v2.0_chrX_XTR.bed
processed XY:   CHM13v2.0_chrX_XTR.bed
processing XY: ../T2T-CHM13v2.0-stratifications/XY/CHM13v2.0_chrX_nonPAR.bed
processed XY:   CHM13v2.0_chrX_nonPAR.bed
processing XY: ../T2T-CHM13v2.0-stratifications/XY/CHM13v2.0_chrY_PAR.bed
processed XY:   CHM13v2.0_chrY_PAR.bed
processing XY: ../T2T-CHM13v2.0-stratifications/XY/CHM13v2.0_chrY_XTR.bed
processed XY:   CHM13v2.0_chrY_XTR.bed
processing XY: ../T2T-CHM13v2.0-stratifications/XY/CHM13v2.0_chrY_nonPAR.bed
processed XY:   CHM13v2.0_chrY_nonPAR.bed


### Union

In [25]:
#Get list of md5s for original files prior to any post processing
date
for f in ../T2T-CHM13v2.0-stratifications/Union/*.bed*
do
date >> ../T2T-CHM13v2.0-stratifications/Union/CHM13v2.0-Union-v3.1-original_file_md5s.txt
md5 $f >> ../T2T-CHM13v2.0-stratifications/Union/CHM13v2.0-Union-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:36:24 EDT 2022


In [24]:
sh ../stratification_post-processing.sh -i ../T2T-CHM13v2.0-stratifications/Union/ -n refNs/ -r CHM13

Wed Jun 29 08:34:14 EDT 2022
bedtools v2.30.0
ref is: CHM13 
Union
post-processed/CHM13/Union
Processing for CHM13
processing: ../T2T-CHM13v2.0-stratifications/Union/CHM13v2.0_allTandemRepeats.bed.gz
processed:   CHM13v2.0_allTandemRepeats.bed.gz
processing: ../T2T-CHM13v2.0-stratifications/Union/CHM13v2.0_alldifficultregions.bed.gz
processed:   CHM13v2.0_alldifficultregions.bed.gz
processing: ../T2T-CHM13v2.0-stratifications/Union/CHM13v2.0_notinallTandemRepeats.bed.gz
processed:   CHM13v2.0_notinallTandemRepeats.bed.gz
processing: ../T2T-CHM13v2.0-stratifications/Union/CHM13v2.0_notinalldifficultregions.bed.gz
processed:   CHM13v2.0_notinalldifficultregions.bed.gz


### LowComplexity

In [27]:
#Get list of md5s for original files prior to any post processing
date
for f in ../T2T-CHM13v2.0-stratifications/LowComplexity/*.bed*
do
date >> ../T2T-CHM13v2.0-stratifications/LowComplexity/CHM13v2.0-LowComplexity-v3.1-original_file_md5s.txt
md5 $f >> ../T2T-CHM13v2.0-stratifications/LowComplexity/CHM13v2.0-LowComplexity-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:38:46 EDT 2022


In [28]:
sh ../stratification_post-processing.sh -i ../T2T-CHM13v2.0-stratifications/LowComplexity/ -n refNs/ -r CHM13

Wed Jun 29 08:39:02 EDT 2022
bedtools v2.30.0
ref is: CHM13 
LowComplexity
post-processed/CHM13/LowComplexity
Processing for CHM13
processing: ../T2T-CHM13v2.0-stratifications/LowComplexity/CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz
processed:   CHM13v2.0_AllHomopolymers_gt6bp_imperfectgt10bp_slop5.bed.gz
processing: ../T2T-CHM13v2.0-stratifications/LowComplexity/CHM13v2.0_AllTandemRepeats_201to10000bp_slop5.bed.gz
processed:   CHM13v2.0_AllTandemRepeats_201to10000bp_slop5.bed.gz
processing: ../T2T-CHM13v2.0-stratifications/LowComplexity/CHM13v2.0_AllTandemRepeats_51to200bp_slop5.bed.gz
processed:   CHM13v2.0_AllTandemRepeats_51to200bp_slop5.bed.gz
processing: ../T2T-CHM13v2.0-stratifications/LowComplexity/CHM13v2.0_AllTandemRepeats_gt10000bp_slop5.bed.gz
processed:   CHM13v2.0_AllTandemRepeats_gt10000bp_slop5.bed.gz
processing: ../T2T-CHM13v2.0-stratifications/LowComplexity/CHM13v2.0_AllTandemRepeats_gt100bp_slop5.bed.gz
processed:   CHM13v2.0_AllTandemRepeats_gt100b

### OtherDifficult
note: on 6/29/22 the directories were renamed from rDNA to OtherDifficult; the commands and output below still show the original rDNA paths

In [29]:
#Get list of md5s for original files prior to any post processing
date
for f in ../T2T-CHM13v2.0-stratifications/rDNA/*.bed*
do
date >> ../T2T-CHM13v2.0-stratifications/rDNA/CHM13v2.0-OtherDifficult-v3.1-original_file_md5s.txt
md5 $f >> ../T2T-CHM13v2.0-stratifications/rDNA/CHM13v2.0-OtherDifficult-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:50:46 EDT 2022


In [30]:
sh ../stratification_post-processing.sh -i ../T2T-CHM13v2.0-stratifications/rDNA/ -n refNs/ -r CHM13

Wed Jun 29 08:50:56 EDT 2022
bedtools v2.30.0
ref is: CHM13 
rDNA
post-processed/CHM13/rDNA
Processing for CHM13
processing: ../T2T-CHM13v2.0-stratifications/rDNA/CHM13v1.1_rDNA.bed.gz
processed:   CHM13v1.1_rDNA.bed.gz


### SegmentalDuplications
note: 6/29/22 renamed directories from SegDups to SegmentalDuplications

In [35]:
#Get list of md5s for original files prior to any post processing
date
for f in ../T2T-CHM13v2.0-stratifications/SegmentalDuplications/*.bed*
do
date >> ../T2T-CHM13v2.0-stratifications/SegmentalDuplications/CHM13v2.0-SegmentalDuplications-v3.1-original_file_md5s.txt
md5 $f >> ../T2T-CHM13v2.0-stratifications/SegmentalDuplications/CHM13v2.0-SegmentalDuplications-v3.1-original_file_md5s.txt
done

Wed Jun 29 08:58:32 EDT 2022


In [36]:
sh ../stratification_post-processing.sh -i ../T2T-CHM13v2.0-stratifications/SegmentalDuplications/ -n refNs/ -r CHM13

Wed Jun 29 08:58:37 EDT 2022
bedtools v2.30.0
ref is: CHM13 
SegmentalDuplications
post-processed/CHM13/SegmentalDuplications
Processing for CHM13
processing: ../T2T-CHM13v2.0-stratifications/SegmentalDuplications/CHM13v2.0_SegDups.bed.gz
processed:   CHM13v2.0_SegDups.bed.gz
processing: ../T2T-CHM13v2.0-stratifications/SegmentalDuplications/CHM13v2.0_notinSegDups.bed.gz
processed:   CHM13v2.0_notinSegDups.bed.gz


---

## 2. Rename post-processed BEDs using standard convention<a id="rename"></a>

### GRCh38
All files appropriately named in generation script, no need for re-naming.

### GRCh37
All files appropriately named in generation script, no need for re-naming.

### CHM13v2.0
All files were appropriately named by the generation script, with the exception of the file renamed below. 

In [37]:
date
#Files needing rename
mv post-processed/CHM13/OtherDifficult/CHM13v1.1_rDNA.bed.gz post-processed/CHM13/OtherDifficult/CHM13v2.0_rDNA.bed.gz

Wed Jun 29 09:00:25 EDT 2022
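
If more than one file carried an outdated reference prefix, the single `mv` above could be generalized. The following is a sketch only; `rename_prefix` is a hypothetical helper and was not used for this release.

```shell
# Hypothetical generalization of the mv above: swap a reference prefix for every
# matching file in a directory. Arguments: directory, old prefix, new prefix.
rename_prefix() {
    for f in "$1"/"$2"_*.bed.gz; do
        [ -e "$f" ] || continue          # skip the literal pattern when nothing matches
        name=$(basename "$f")
        # -n refuses to overwrite an existing file with the new name
        mv -n "$f" "$1/$3${name#"$2"}"
    done
}
```

Example call: `rename_prefix post-processed/CHM13/OtherDifficult CHM13v1.1 CHM13v2.0`.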


---

## 3. Consolidate files for release<a id="consolidate"></a>
The directory "v3.0-carry-over-stratifications" contains the v3.0 stratifications that are carried over unchanged into v3.1. These files have already been through post-processing. This step combines "v3.0-carry-over-stratifications" with the new "post-processed" v3.1 stratifications into a candidate stratification release directory, which is used for validation of all v3.1 stratifications and for the ultimate release.

In [40]:
#copy v3.0 carry-over files to create the candidate release directory
date
cp -rp ../v3.0-carry-over-stratifications ../validation/v3.1-candidate-release-files

Wed Jun 29 09:32:41 EDT 2022


In [41]:
#CHM13
date
cp -rp post-processed/CHM13 ../validation/v3.1-candidate-release-files

Wed Jun 29 09:32:55 EDT 2022


In [50]:
#GRCh38
date
cp -rp post-processed/GRCh38/XY ../validation/v3.1-candidate-release-files/GRCh38
cp -rp post-processed/GRCh38/Union/* ../validation/v3.1-candidate-release-files/GRCh38/Union
cp -rp post-processed/GRCh38/LowComplexity ../validation/v3.1-candidate-release-files/GRCh38
cp -rp post-processed/GRCh38/SegmentalDuplications/* ../validation/v3.1-candidate-release-files/GRCh38/SegmentalDuplications

Wed Jun 29 11:26:19 EDT 2022


In [51]:
#GRCh37
date
cp -rp post-processed/GRCh37/XY ../validation/v3.1-candidate-release-files/GRCh37
cp -rp post-processed/GRCh37/Union/* ../validation/v3.1-candidate-release-files/GRCh37/Union
cp -rp post-processed/GRCh37/LowComplexity ../validation/v3.1-candidate-release-files/GRCh37
cp -rp post-processed/GRCh37/SegmentalDuplications/* ../validation/v3.1-candidate-release-files/GRCh37/SegmentalDuplications

Wed Jun 29 11:26:22 EDT 2022
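
After copying, it may be worth confirming that each post-processed file arrived intact. The sketch below is hypothetical and was not part of the original run; `verify_copy` compares files byte for byte at the same relative path.

```shell
# Hypothetical check: confirm every file under source dir $1 exists with identical
# bytes at the same relative path under destination dir $2. Prints any mismatches.
verify_copy() {
    (cd "$1" && find . -type f) | while IFS= read -r rel; do
        cmp -s "$1/$rel" "$2/$rel" || echo "differs or missing: $rel"
    done
}
```

Example call: `verify_copy post-processed/CHM13 ../validation/v3.1-candidate-release-files/CHM13`. No output means all files match.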


---

## 4. Generate md5s for release files<a id="md5"></a>

In [2]:
date
date >> ../validation/v3.1-candidate-release-files/md5_checksums.txt
find ../validation/v3.1-candidate-release-files -type f -name "*.bed.gz" -exec md5 -r '{}' \; -maxdepth 3 >> ../validation/v3.1-candidate-release-files/md5_checksums.txt

Mon Jul 11 14:52:38 EDT 2022
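
The checksum list can later be used to confirm that release files are unchanged. The verifier below is a sketch under assumptions: it expects `<md5> <path>` lines as written by `md5 -r`, skips lines such as the date stamp, and uses the hypothetical helpers `md5_of` and `verify_md5s`.

```shell
# Hypothetical wrapper: print only the MD5 digest of a file, using whichever tool exists.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | awk '{ print $1 }'
    else
        md5 -q "$1"        # macOS/BSD: -q prints only the digest
    fi
}

# Hypothetical verifier for a "<md5> <path>" list such as md5_checksums.txt.
# Returns non-zero if any listed file is missing or has a different digest.
verify_md5s() {
    bad=0
    while read -r sum path; do
        # skip lines whose first field is not a 32-character lowercase hex digest
        case "$sum" in
            ""|*[!0-9a-f]*) continue ;;
        esac
        [ "${#sum}" -eq 32 ] || continue
        [ "$(md5_of "$path")" = "$sum" ] || { echo "MISMATCH: $path"; bad=1; }
    done < "$1"
    return "$bad"
}
```

The paths in the list are relative, so the verifier has to be run from the same directory in which the `find` command above was executed.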
