# How to generate the `genemap.txt` file

This procedure uses UCSC refGene for gene definitions and the Rutgers Map for genetic distances, with linear interpolation for positions that cannot be found in the database.
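As a rough illustration of the linear interpolation step, here is a minimal sketch using `numpy.interp`. The marker positions, the cM values, and the helper name `interpolate_cm` are all hypothetical and only show the idea; this is not the code actually used to build `genemap.txt`.

```python
import numpy as np

# Hypothetical map markers: physical position (bp) and genetic position (cM).
# These numbers are made up for illustration, not taken from the Rutgers Map.
marker_bp = np.array([100_000, 200_000, 400_000])
marker_cm = np.array([0.10, 0.30, 0.50])

def interpolate_cm(query_bp):
    """Linearly interpolate the genetic position (cM) at a physical position (bp)."""
    # np.interp expects marker_bp in increasing order. Queries outside the
    # marker range are clamped to the first or last cM value.
    return np.interp(query_bp, marker_bp, marker_cm)

# A position halfway between the first two markers gets a cM value
# halfway between theirs.
print(interpolate_cm(150_000))
```

Note that clamping at the ends is simply the default behavior of `np.interp`; whether positions beyond the outermost markers should instead be extrapolated is a separate design decision.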

## Gene range file

Downloaded [refGene.txt.gz from UCSC](http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz); the version used here is dated March 01, 2020.

The code below was written by Min Qiao when she was at UChicago.

In [4]:
import pandas as pd, numpy as np
import os
from collections import Counter
from more_itertools import unique_everseen
cwd = os.path.expanduser("~/tmp/13-Mar-2020")
os.chdir(cwd)

In [5]:
status = os.system('wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz')

0

In [7]:
ref_gene = pd.read_table("refGene.txt.gz", compression="gzip", sep = "\t", header = None, usecols = (1,2,4,5,12), names = ["tx_name", "chrom", "start", "end", "gene_name"])
ref_gene.shape

(78288, 5)

In [8]:
# Keep autosomes and chrX only; copy so the later column assignment does not act on a view
ref_gene = ref_gene[ref_gene["chrom"].isin(["chr"+str(x+1) for x in range(22)] + ["chrX"])].copy()
ref_gene.shape

(74551, 5)

In [9]:
ref_gene["CHR"] = ref_gene["chrom"].apply(lambda x: int(x.split("chr")[1]) if x.split("chr")[1] != "X" else "X")

In [10]:
ref_gene = ref_gene.drop_duplicates(subset = ("CHR", "start", "end"))
ref_gene = ref_gene.sort_values(by = ["CHR", "start", "end"])[["CHR", "start", "end", "gene_name"]]
ref_gene.shape

(43442, 4)

In [11]:
ref_gene.head()

| | CHR | start | end | gene_name |
|---|---|---|---|---|
| 0 | 1 | 11868 | 14362 | LOC102725121 |
| 1 | 1 | 11873 | 14409 | DDX11L1 |
| 41329 | 1 | 14361 | 29370 | WASH7P |
| 3 | 1 | 17368 | 17436 | MIR6859-1 |
| 7 | 1 | 30365 | 30503 | MIR1302-2 |


There is a problem here: we cannot simply `groupby` this data by chromosome and then gene name, because some genes share the same name but lie at different, non-overlapping positions. For example:
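One possible way to list such genes is sketched below. It assumes a data frame with the same columns as `ref_gene` (`CHR`, `start`, `end`, `gene_name`); the function name `genes_with_disjoint_ranges` and the toy rows are hypothetical and are not taken from refGene.

```python
import pandas as pd

def genes_with_disjoint_ranges(df):
    """Return (CHR, gene_name) pairs that have at least two non-overlapping ranges."""
    hits = []
    # sort=False keeps groups in order of first appearance and avoids
    # comparing mixed int/str chromosome labels.
    for (chrom, name), grp in df.groupby(["CHR", "gene_name"], sort=False):
        grp = grp.sort_values(["start", "end"])
        # Largest end coordinate among all earlier ranges of this gene.
        prev_max_end = grp["end"].cummax().shift()
        # A gap exists when a range starts after every earlier range has ended.
        if (grp["start"] > prev_max_end).any():
            hits.append((chrom, name))
    return hits

# Hypothetical toy data: GENE_A has two overlapping ranges,
# GENE_B has two ranges separated by a gap.
toy = pd.DataFrame({
    "CHR": [1, 1, 1, 1],
    "start": [100, 150, 100, 900],
    "end": [200, 300, 200, 950],
    "gene_name": ["GENE_A", "GENE_A", "GENE_B", "GENE_B"],
})
print(genes_with_disjoint_ranges(toy))
```

In this toy case only `GENE_B` is reported, since its two ranges do not overlap, while the overlapping ranges of `GENE_A` could safely be merged into one.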

## Genetic distance map file

Downloaded [rutgers_map_v3a.zip from Rutgers](http://compgen.rutgers.edu/downloads/rutgers_map_v3a.zip).

The script `genetic_pos_searcher.py` was written by Hang Dai when he was at Baylor.