# How to generate the `genemap.txt` file

The file is built from UCSC refGene for gene definitions and the Rutgers Map for genetic distances, with linear interpolation to fill in genetic distances that cannot be found in the database.
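As a rough illustration of the interpolation idea, the sketch below estimates a genetic position (cM) for a base-pair position that falls between two map markers. The function name `interpolate_cm`, the toy marker values, and the choice to clamp positions outside the map to the nearest marker are all assumptions made for this example; the actual behavior of `genetic_pos_searcher.py` is not shown in this document and may differ.

```python
from bisect import bisect_left

def interpolate_cm(pos, map_bp, map_cm):
    """Estimate the genetic position (cM) at base-pair position `pos`.

    `map_bp` must be sorted in ascending order and `map_cm` holds the
    genetic position of each marker. Positions outside the map are
    clamped to the first or last marker (an assumption of this sketch).
    """
    if pos <= map_bp[0]:
        return map_cm[0]
    if pos >= map_bp[-1]:
        return map_cm[-1]
    i = bisect_left(map_bp, pos)
    if map_bp[i] == pos:
        # Exact hit: no interpolation needed
        return map_cm[i]
    left_bp, right_bp = map_bp[i - 1], map_bp[i]
    fraction = (pos - left_bp) / (right_bp - left_bp)
    return map_cm[i - 1] + fraction * (map_cm[i] - map_cm[i - 1])

# Toy markers (hypothetical values, not taken from the Rutgers Map)
marker_bp = [1000, 2000, 4000]
marker_cm = [0.0, 1.0, 2.0]
estimates = [interpolate_cm(p, marker_bp, marker_cm) for p in (1500, 3000)]
print(estimates)
```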

## Gene range file

Downloaded [refGene.txt.gz from UCSC](http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz); the version used here is dated March 01, 2020.

The code below was written by Min Qiao when she was at UChicago.

In [4]:
import pandas as pd, numpy as np
import os
from collections import Counter
from more_itertools import unique_everseen
cwd = os.path.expanduser("~/tmp/13-Mar-2020")
os.chdir(cwd)

In [5]:
status = os.system('wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz')

0

In [7]:
ref_gene = pd.read_table("refGene.txt.gz", compression="gzip", sep = "\t", header = None, usecols = (1,2,4,5,12), names = ["tx_name", "chrom", "start", "end", "gene_name"])
ref_gene.shape

(78288, 5)

In [8]:
ref_gene = ref_gene[ref_gene["chrom"].isin(["chr"+str(x+1) for x in range(22)] + ["chrX"])]
ref_gene.shape

(74551, 5)

In [9]:
ref_gene["CHR"] = ref_gene["chrom"].apply(lambda x: int(x.split("chr")[1]) if x.split("chr")[1] != "X" else "X")

In [10]:
ref_gene = ref_gene.drop_duplicates(subset = ("CHR", "start", "end"))
ref_gene = ref_gene.sort_values(by = ["CHR", "start", "end"])[["CHR", "start", "end", "gene_name"]]
ref_gene.shape

(43442, 4)

In [11]:
ref_gene.head()

| | CHR | start | end | gene_name |
|---|---|---|---|---|
| 0 | 1 | 11868 | 14362 | LOC102725121 |
| 1 | 1 | 11873 | 14409 | DDX11L1 |
| 41329 | 1 | 14361 | 29370 | WASH7P |
| 3 | 1 | 17368 | 17436 | MIR6859-1 |
| 7 | 1 | 30365 | 30503 | MIR1302-2 |


There is a problem here: we cannot simply `groupby` this data by chromosome and then by gene name, because some genes share the same name but lie in different, non-overlapping positions.
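One possible way to handle this, sketched here under assumptions, is to merge records only when they share a chromosome and gene name and their intervals overlap, so that a name appearing at distant loci yields several separate ranges. The function `merge_gene_ranges` and the toy records are hypothetical; this is not necessarily how the original notebook resolved the issue.

```python
def merge_gene_ranges(records):
    """Merge overlapping intervals that share chromosome and gene name.

    `records` is an iterable of (chrom, start, end, gene_name) tuples.
    Records with the same name at non-overlapping positions are kept
    as separate ranges instead of being collapsed into one.
    """
    # str() on chrom lets integer chromosomes and "X" sort together
    ordered = sorted(records, key=lambda r: (str(r[0]), r[3], r[1], r[2]))
    merged = []
    for chrom, start, end, name in ordered:
        if merged:
            p_chrom, p_start, p_end, p_name = merged[-1]
            if (p_chrom, p_name) == (chrom, name) and start <= p_end:
                # Overlaps the previous range of the same gene: extend it
                merged[-1] = (p_chrom, p_start, max(p_end, end), p_name)
                continue
        merged.append((chrom, start, end, name))
    return merged

# Toy records (hypothetical gene names and coordinates)
toy = [
    (1, 100, 200, "GENE_A"),
    (1, 150, 300, "GENE_A"),
    (1, 900, 950, "GENE_A"),  # same name, distant locus
    (1, 120, 180, "GENE_B"),
]
ranges = merge_gene_ranges(toy)
for row in ranges:
    print(row)
```

With this approach `GENE_A` produces two ranges, which a plain `groupby` on chromosome and name would have fused into a single interval spanning both loci.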

## Genetic distance map file

The Rutgers Map (v3a) was downloaded from [here](http://compgen.rutgers.edu/downloads/rutgers_map_v3a.zip).

These preprocessing scripts were mostly written by Hang Dai when he was at Baylor.

In [None]:
# Copied from Hang Dai's preprocessing scripts in 2014
# to SoS workflow, with minor data formatting adjustments

# add_chr_to_original_file
[preprocess_1]
depends: executable('bgzip')
parameter: chrom = list()
if len(chrom) == 0: chrom = list(range(1,23)) + ['X']
input: for_each = 'chrom'
output: f'RUMap_chr{_chrom}.txt.gz'
bash: expand = '${ }'
	awk -F'\t' -v chromosome="${_chrom}" 'BEGIN {OFS="\t"} {if (NR==1) {print "#chr",$1,$2,$3,$6,$7,$8,$9} else {if ($2=="SNP") {print chromosome,$1,$2,$3,$6,$7,$8,$9}}}' RUMapv3_B137_chr${_chrom if _chrom != 'X' else 23}.txt | sort -k5 -g | bgzip -c > ${_output}

# make_tabix_index_file.sh
[preprocess_2]
output: f'{_input}.tbi'
bash: expand = '${ }'
	tabix  -s1 -b5 -e5 -c# ${_input}

# chr_min_max_dict
[preprocess_3]
input: group_by='all'
python: expand = '${ }'
import subprocess
chr_min_max_dict={}
for item in [${_input:nr,}]:
	print(item)
	command='zcat {} | head -2 | tail -1'.format(item)
	p=subprocess.Popen(command, universal_newlines=True, shell=True, stdout=subprocess.PIPE)
	out=p.stdout.read().split('\t')  #a list
	min_pos=out[4]
	command='zcat {} | tail -1'.format(item)
	p=subprocess.Popen(command, universal_newlines=True, shell=True, stdout=subprocess.PIPE)
	out=p.stdout.read().split('\t')  #a list
	max_pos=out[4]
	chr_min_max_dict[item]=[min_pos, max_pos]
print(chr_min_max_dict)
print(len(chr_min_max_dict))

[liftover_download: provides = ['hg19ToHg38.over.chain.gz', 'liftOver']]
download:
	https://hgdownload.soe.ucsc.edu/gbdb/hg19/liftOver/hg19ToHg38.over.chain.gz
	http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
bash:
	chmod +x liftOver

[liftover_genemap]
depends: 'hg19ToHg38.over.chain.gz', 'liftOver'
parameter: genemap = ''
input: genemap
output: f'{_input:nn}.hg38.txt'
bash: expand = '${ }'
	awk '{print "chr"$1,$2,$3,$4}' ${_input} > ${_output:nn}.hg19.bed
	./liftOver ${_output:nn}.hg19.bed hg19ToHg38.over.chain.gz ${_output:nn}.hg38.bed ${_output:nn}.unlifted.bed
python: expand = '${ }'
	genemap = dict([(x.split()[3], x.strip().split()) for x in open(${_input:r}).readlines()])
	new_coord = dict([(x.split()[3], x.strip().split()) for x in open('${_output:nn}.hg38.bed').readlines()])
	total = len(genemap)
	unmapped = 0
	for k in list(genemap.keys()):
		if k in new_coord:
			genemap[k][0] = new_coord[k][0][3:]
			genemap[k][1] = new_coord[k][1]
			genemap[k][2] = new_coord[k][2]
		else:
			del genemap[k]
			unmapped += 1
	print(f'{unmapped} units failed to be mapped to hg38.')
	with open(${_output:r}, 'w') as f:
		f.write('\n'.join(['\t'.join(x) for x in genemap.values()]))

To use it, after downloading and decompressing the Rutgers Map data, run:

```
sos run genemap.ipynb preprocess
python genetic_pos_searcher.py genemap.txt
mv CM_genemap.txt genemap.hg19.txt
sos run genemap.ipynb liftover_genemap --genemap genemap.hg19.txt
```
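
The coordinate update performed by the `liftover_genemap` step can be illustrated with plain Python on in-memory lines. This is a minimal sketch mirroring the lookup by the fourth column (gene name); the function name `apply_lifted_coordinates` and the toy rows, including the trailing genetic-distance columns, are made up for the example. Because rows are keyed by name in a dictionary, entries that share a name collapse into one, which is worth keeping in mind given the duplicate-name issue noted earlier.

```python
def apply_lifted_coordinates(genemap_lines, lifted_bed_lines):
    """Replace hg19 coordinates with lifted ones, matching on column 4.

    Returns (updated_rows, unmapped_count). Rows whose name is absent
    from the lifted BED output are dropped and counted as unmapped.
    """
    genemap = {ln.split()[3]: ln.split() for ln in genemap_lines if ln.strip()}
    lifted = {ln.split()[3]: ln.split() for ln in lifted_bed_lines if ln.strip()}
    updated, unmapped = [], 0
    for name, fields in genemap.items():
        if name in lifted:
            chrom, start, end = lifted[name][:3]
            # Strip the "chr" prefix that was added for liftOver
            fields[0] = chrom[3:] if chrom.startswith("chr") else chrom
            fields[1], fields[2] = start, end
            updated.append(fields)
        else:
            unmapped += 1
    return updated, unmapped

# Toy input (hypothetical names, coordinates, and distance columns)
hg19_rows = [
    "1 100 200 GENE_A 0.1 0.2 0.3",
    "X 500 600 GENE_B 1.1 1.2 1.3",
]
hg38_bed = ["chr1 150 250 GENE_A"]
rows, n_unmapped = apply_lifted_coordinates(hg19_rows, hg38_bed)
print(rows, n_unmapped)
```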