# How to generate the `genemap.txt` file

The file uses UCSC refGene for gene definitions and the Rutgers Map for genetic distances, with linear interpolation for positions that cannot be found in the map.

## Gene range file

Downloaded [`refGene.txt.gz` from UCSC](http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz); the version used here is dated March 01, 2020. The gene range file was generated using [this workflow](https://gaow.github.io/cnv-gene-mapping/dsc/20190627_Clean_RefGene.html) written by Min Qiao when she was at UChicago. Please refer to the link for the tricky parts of converting `refGene.txt.gz` to gene ranges.

The output is a 4-column file:

```
chr start end gene_name
```
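
For orientation, the collapse from transcript rows to gene ranges can be sketched in Python as below. This is not the linked workflow and it ignores the tricky cases that workflow handles. It assumes the standard UCSC `refGene` column order (`chrom` in column 3, `txStart` and `txEnd` in columns 5 and 6, gene symbol `name2` in column 13), keeps only chromosomes 1-22 and X, and takes the widest span per gene and chromosome. The function name `gene_ranges` is made up for this illustration.

```python
from collections import OrderedDict

# Assumed UCSC refGene layout (0-based column indices)
CHROM, TX_START, TX_END, NAME2 = 2, 4, 5, 12
# Assumption: keep autosomes and X only, matching the chromosomes used for the Rutgers Map
KEEP = {'chr{}'.format(i) for i in range(1, 23)} | {'chrX'}

def gene_ranges(lines):
    """Collapse refGene rows into one (chr, start, end, gene_name) per gene and chromosome."""
    ranges = OrderedDict()
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        chrom = fields[CHROM]
        if chrom not in KEEP:
            # Skip chrY, chrM, alternate haplotypes and unplaced contigs
            continue
        key = (chrom[3:], fields[NAME2])  # drop the "chr" prefix
        start, end = int(fields[TX_START]), int(fields[TX_END])
        if key in ranges:
            # Widen the range to cover every transcript of this gene
            old_start, old_end = ranges[key]
            start, end = min(start, old_start), max(end, old_end)
        ranges[key] = (start, end)
    return [(c, s, e, g) for (c, g), (s, e) in ranges.items()]
```

With the downloaded file, passing `gzip.open('refGene.txt.gz', 'rt')` to `gene_ranges` would yield the rows to write out in the 4-column layout above. Coordinates are passed through unchanged.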

## Genetic distance map file

Rutgers genetic maps were downloaded from [here](http://compgen.rutgers.edu/downloads/rutgers_map_v3.zip). The preprocessing scripts below were mostly written by Hang Dai when he was at Baylor. I put some of them in a workflow script to better organize them.

# Copied from Hang Dai's preprocessing scripts in 2014
# to SoS workflow, with minor data formatting adjustments

# add_chr_to_original_file
[preprocess_1]
depends: executable('bgzip')
parameter: chrom = list()
if len(chrom) == 0: chrom = list(range(1,23)) + ['X']
input: for_each = 'chrom'
output: f'RUMap_chr{_chrom}.txt.gz'
bash: expand = '${ }'
    awk -F'\t' -v chromosome="${_chrom}" 'BEGIN {OFS="\t"} {if (NR==1) {print "#chr",$1,$2,$3,$6,$7,$8,$9} else {if ($2=="SNP") {print chromosome,$1,$2,$3,$6,$7,$8,$9}}}' RUMapv3_B137_chr${_chrom if _chrom != 'X' else 23}.txt | sort -k5 -g | bgzip -c > ${_output}

# make_tabix_index_file.sh
[preprocess_2]
output: f'{_input}.tbi'
bash: expand = '${ }'
    tabix -s 1 -b 5 -e 5 -c '#' ${_input}

# chr_min_max_dict
[preprocess_3]
input: group_by='all'
python: expand = '${ }'
    import subprocess

    def position_of(command):
        # Run a shell pipeline that prints one line and return its column 5
        line = subprocess.check_output(command, universal_newlines=True, shell=True)
        return line.split('\t')[4]

    # Each file is sorted by column 5, so the first data row (after the header)
    # and the last row hold the minimum and maximum positions
    chr_min_max_dict = {}
    for item in [${_input:nr,}]:
        min_pos = position_of('zcat {} | head -2 | tail -1'.format(item))
        max_pos = position_of('zcat {} | tail -1'.format(item))
        chr_min_max_dict[item] = [min_pos, max_pos]
    print(chr_min_max_dict)
    print(len(chr_min_max_dict))

[liftover_download: provides = ['hg19ToHg38.over.chain.gz', 'liftOver']]
download:
	https://hgdownload.soe.ucsc.edu/gbdb/hg19/liftOver/hg19ToHg38.over.chain.gz
	http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
bash:
	chmod +x liftOver

[liftover_genemap]
depends: 'hg19ToHg38.over.chain.gz', 'liftOver'
parameter: genemap = 'genemap.hg19.txt'
input: genemap
output: f'{_input:nn}.hg38.txt'
bash: expand = '${ }'
	awk '{print "chr"$1,$2,$3,$4}' ${_input} > ${_output:nn}.hg19.bed
	./liftOver ${_output:nn}.hg19.bed hg19ToHg38.over.chain.gz ${_output:nn}.hg38.bed ${_output:nn}.unlifted.bed
python: expand = '${ }'
	# Index both files by name (column 4); duplicated names keep the last record
	with open(${_input:r}) as f:
		genemap = dict([(x.split()[3], x.strip().split()) for x in f])
	with open('${_output:nn}.hg38.bed') as f:
		new_coord = dict([(x.split()[3], x.strip().split()) for x in f])
	total = len(genemap)
	unmapped = 0
	for k in list(genemap.keys()):
		if k in new_coord:
			# Replace hg19 coordinates with hg38 ones, dropping the "chr" prefix
			genemap[k][0] = new_coord[k][0][3:]
			genemap[k][1] = new_coord[k][1]
			genemap[k][2] = new_coord[k][2]
		else:
			del genemap[k]
			unmapped += 1
	print(f'{unmapped} of {total} units failed to be mapped to hg38.')
	with open(${_output:r}, 'w') as f:
		f.write('\n'.join(['\t'.join(x) for x in genemap.values()]) + '\n')
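
The `liftover_genemap` step only reports how many entries were lost. To see which ones and why, the `.unlifted.bed` file written by `liftOver` can be summarized with a few lines of Python. This is a sketch under one assumption about that file's layout: each failed record is preceded by a `#` comment line giving the reason (for example `#Deleted in new`). The function name `summarize_unlifted` is made up for this illustration.

```python
from collections import Counter

def summarize_unlifted(lines):
    """Return (reason counts, names) from the lines of a liftOver unmapped file."""
    reasons, names = Counter(), []
    reason = 'unknown'
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('#'):
            # Assumed layout: a comment line states why the next record failed
            reason = line.lstrip('#').strip()
        else:
            reasons[reason] += 1
            names.append(line.split()[3])  # 4th BED column is the name
            reason = 'unknown'
    return reasons, names
```

With the default `--genemap genemap.hg19.txt`, the file to read would be `genemap.unlifted.bed`, for example `summarize_unlifted(open('genemap.unlifted.bed'))`.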

To use the workflow, download and decompress the Rutgers Map data, then run:

```
sos run genemap.ipynb preprocess
python genetic_pos_searcher.py genemap.txt
mv CM_genemap.txt genemap.hg19.txt
sos run genemap.ipynb liftover_genemap --genemap genemap.hg19.txt
```
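
The second command is where genetic positions are looked up. `genetic_pos_searcher.py` is not shown on this page, so the following is only a sketch of the linear interpolation mentioned at the top, not that script's code. It assumes the markers of one chromosome are available as two parallel lists (physical positions sorted in ascending order, and genetic positions in cM). Clamping positions outside the map to the nearest end marker is also an assumption; how the actual script treats positions outside the range recorded by `preprocess_3` may differ. The function name `interpolate_cm` is hypothetical.

```python
from bisect import bisect_left

def interpolate_cm(pos, marker_pos, marker_cm):
    """Genetic position (cM) at physical position `pos` on one chromosome.

    marker_pos: physical positions of map markers, sorted ascending
    marker_cm:  genetic positions (cM) of the same markers
    """
    if not marker_pos or len(marker_pos) != len(marker_cm):
        raise ValueError('marker lists must be non-empty and of equal length')
    i = bisect_left(marker_pos, pos)
    if i < len(marker_pos) and marker_pos[i] == pos:
        return marker_cm[i]   # position is itself a marker
    if i == 0:
        return marker_cm[0]   # left of the first marker: clamp (assumption)
    if i == len(marker_pos):
        return marker_cm[-1]  # right of the last marker: clamp (assumption)
    # Interpolate linearly between the two flanking markers
    x0, x1 = marker_pos[i - 1], marker_pos[i]
    y0, y1 = marker_cm[i - 1], marker_cm[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)
```

For example, with markers at 100 and 200 bp mapped to 1.0 and 2.0 cM, a gene boundary at 150 bp would be placed halfway between them on the genetic map.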