README.txt
hg19_self_chain_split.bed.gz
hg19_self_chain_split.sort.bed.gz
hg19_self_chain_split_both.bed.gz
hg19_self_chain_split_both_gt10k.bed.gz
hg19_self_chain_split_withalts_gt10k.bed.gz
make_self_chain.py
mm-2-merged-complement.bed.gz
mm-2-merged.bed.gz
notinsegdupall.bed.gz
segdupall.bed.gz
selfchain_sort.sh

README.txt

Segmental duplication and decoy files from the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team and the Genome in a Bottle Consortium

These files are intended as a standard resource of bed files for stratifying true positive, false positive, and false negative variant calls by whether they fall in segmental duplications or in regions with homology to decoy sequences.
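As a rough illustration of this kind of stratification, the sketch below loads a 3-column bed file into merged, sorted intervals and checks whether a variant position falls inside any of them. The function names (load_bed, in_regions) are hypothetical and not part of this repository, and the sketch assumes variant positions are 1-based (as in VCF) while bed intervals are 0-based and half-open.

```python
from bisect import bisect_right

def load_bed(lines):
    """Build {chrom: merged, sorted list of (start, end)} from 3-column bed lines."""
    raw = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw.setdefault(chrom, []).append((int(start), int(end)))
    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or touching: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged

def in_regions(regions, chrom, pos):
    """True if the 1-based position pos lies in a (0-based, half-open) interval."""
    intervals = regions.get(chrom, [])
    zero_based = pos - 1
    # Index of the last interval whose start is <= zero_based.
    i = bisect_right(intervals, (zero_based, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= zero_based < intervals[i][1]
```

The gzipped files here could be read with gzip.open(path, "rt") and passed to load_bed. Each TP, FP, or FN call can then be counted separately for positions inside and outside the regions.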

SEGMENTAL DUPLICATIONS FILES:
These files were created by Kevin Jacobs using the script make_self_chain.py.

hg19_self_chain_split.sort.bed.gz is sorted in numerical chromosome order and contains only chromosomes 1-22, X, Y, and MT.
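The ordering described above can be sketched as follows. This is only an illustration and not the contents of selfchain_sort.sh. It assumes chromosome names carry no "chr" prefix and that records are (chrom, start, end) tuples. CHROM_ORDER and sort_records are hypothetical names.

```python
# Rank chromosomes as 1..22, then X, Y, MT (assumed naming without "chr").
CHROM_ORDER = {
    name: rank
    for rank, name in enumerate([str(n) for n in range(1, 23)] + ["X", "Y", "MT"])
}

def sort_records(records):
    """Drop other contigs, then sort by chromosome rank, start, and end."""
    kept = [r for r in records if r[0] in CHROM_ORDER]
    return sorted(kept, key=lambda r: (CHROM_ORDER[r[0]], r[1], r[2]))
```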

Because the above file is in bedpe format, which lists pairs of sites with mapping homology, we have also created a bed file with all sites in 3-column bed format (hg19_self_chain_split_both.bed.gz), as well as a file with only regions >10kb in size (hg19_self_chain_split_both_gt10k.bed.gz).
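A minimal sketch of this conversion is shown below. It assumes the first six tab-separated bedpe columns are chrom1, start1, end1, chrom2, start2, end2, and it applies the size threshold to each end separately. Whether the actual files applied the >10kb cutoff per end, per pair, or after merging is not stated here, so treat that choice as an assumption. bedpe_to_bed is a hypothetical name.

```python
def bedpe_to_bed(lines, min_size=0):
    """Yield (chrom, start, end) for each end of a bedpe pair longer than min_size bp."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        for chrom, start, end in (fields[0:3], fields[3:6]):
            start, end = int(start), int(end)
            if end - start > min_size:
                yield chrom, start, end
```

Calling bedpe_to_bed(lines) gives every site from both ends, and bedpe_to_bed(lines, 10000) keeps only sites longer than 10kb.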

An additional bed file (hg19_self_chain_split_withalts_gt10k.bed.gz) was created with regions >10kb that have mapping homology to any other region of the genome, including ALT loci, which were not included in the files above.

Code for processing the raw file is in selfchain_sort.sh.


DECOY BED FILES:
The decoy bed file mm-2-merged.bed was created to identify regions of GRCh37 that have homology to the decoy sequence hs37d5.  mm-2-merged-complement.bed is the complement of mm-2-merged.bed. These bed files were generated by Heng Li as follows:

1) Download hs37d5ss.fa.gz from 1000g ftp:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/phase2_reference_assembly_sequence/

2) Map them to the Broad hs37 reference with minimap-0.2-r123 <https://github.com/lh3/minimap>:

minimap -w1 -t16 hs37.fa hs37d5ss.fa.gz > hs37d5ss.paf

3) Convert PAF to BED (sort-alt is a customized sort for sorting chr names):

cut -f6,8-11 hs37d5ss.paf | sort-alt -k1,1N -k2,2n -k3,3n | bgzip > mm-0.bed.gz

4) mm-0.bed.gz is 46.2MB in file size, so it was filtered to produce a smaller version:

zcat mm-0.bed.gz|awk '$5>=200&&($4>=5000||($4>=1000&&$4/$5>=.1)||($4>=100&&$4/$5>=.2)||$4/$5>=.3)'|bgzip > mm-2.bed.gz
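
To make the filter in step 4 easier to read, here is a Python restatement of the awk condition. In the PAF format, the fields kept by cut -f6,8-11 are target name, target start, target end, number of matching bases, and alignment block length, so $4 is the match count and $5 is the block length. The README does not describe how mm-2.bed.gz became mm-2-merged.bed or its complement. The merge_intervals and complement functions below are therefore only an assumed sketch of those steps, and all function names are hypothetical.

```python
def passes_filter(matches, block_len):
    """Same test as the awk expression, with $4 = matches and $5 = block_len."""
    if block_len < 200:
        return False
    match_fraction = matches / block_len
    return (matches >= 5000
            or (matches >= 1000 and match_fraction >= 0.1)
            or (matches >= 100 and match_fraction >= 0.2)
            or match_fraction >= 0.3)

def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def complement(merged, chrom_length):
    """Return the gaps between merged intervals across [0, chrom_length)."""
    gaps, cursor = [], 0
    for start, end in merged:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps
```

The filter keeps only alignments of at least 200 bases, and shorter or weaker matches must have a higher fraction of matching bases to pass. The complement step needs chromosome lengths for the reference, which would have to be supplied separately.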