# Constructing a GarNet file

The goal here is to construct a file in which all of the gene and transcription factor names exist in the same namespace. We use mygene.info's API to map all the gene names to a common namespace. It isn't clear that mygene.info has the most "canonical" namespace, but at present its names correlate best with GeneCards and seem to be more consistent than any other namespace I know of.
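
As a rough sketch of the mapping step, the snippet below collapses a list of lookup results into a dictionary from each original name to a common symbol, then rewrites a list of names with it. It assumes each result is a dict with a `query` key and either a `symbol` key or `notfound: True`, which is how I understand the results of a batch lookup against mygene.info to look. The function names `build_symbol_map` and `normalize_names`, the `_score` tie-break, and the sample records are all made up for illustration, and no network call is made.

```python
def build_symbol_map(results):
    """Map each queried name to a symbol, keeping the best-scoring hit per query."""
    best = {}  # query -> (score, symbol)
    for hit in results:
        # Skip queries that matched nothing or came back without a symbol.
        if hit.get("notfound") or "symbol" not in hit:
            continue
        score = hit.get("_score", 0.0)
        query = hit["query"]
        if query not in best or score > best[query][0]:
            best[query] = (score, hit["symbol"])
    return {query: symbol for query, (score, symbol) in best.items()}


def normalize_names(names, symbol_map):
    """Replace each name with its mapped symbol; unmapped names are kept as-is."""
    return [symbol_map.get(name, name) for name in names]


# Hypothetical records shaped like lookup results (not real API output).
example_results = [
    {"query": "p53", "symbol": "TP53", "_score": 17.2},
    {"query": "p53", "symbol": "TP53BP1", "_score": 3.1},
    {"query": "MYC", "symbol": "MYC", "_score": 88.0},
    {"query": "not_a_gene", "notfound": True},
]

symbol_map = build_symbol_map(example_results)
normalized = normalize_names(["p53", "MYC", "not_a_gene"], symbol_map)
```

Keeping unmapped names unchanged, rather than dropping them, is one possible choice; it makes it easy to see afterwards which names never made it into the common namespace.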

In [1]:
%pylab inline
import sys
import os
import pickle
import sqlite3
import numpy as np
import pandas as pd

known_genes_file = '../data/ucsc_hg19_knownGenes.tsv'
kgXref_file = '../data/ucsc_hg19_kgXref.tsv'


Populating the interactive namespace from numpy and matplotlib


In [2]:
sys.path.append(os.path.abspath('../src'))
from garnet import (group_by_chromosome, IntervalTree_from_reference, save_as_pickled_object)


# Building and exporting IntervalTrees: constructing the "GarNet File"

In [3]:
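# Window sizes (in base pairs) and the TSS flag passed to IntervalTree_from_reference.
options = {"upstream_window": 10000, "downstream_window": 10000, "tss": False}

# Load the normalized gene reference and split it by chromosome.
reference = pd.read_csv('../example_data/hg19/reference.normalized.tsv', sep='\t')
reference = group_by_chromosome(reference)
# Build one IntervalTree per chromosome.
reference = {chrom: IntervalTree_from_reference(genes, options) for chrom, genes in reference.items()}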
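
To make the three options more concrete, here is a minimal sketch of how a search window around a single gene could be computed from them. The semantics are my assumption, not taken from garnet's source: "upstream" is read relative to the strand, `tss=False` extends the window across the whole gene plus both flanks, and `tss=True` restricts it to the flanks around the transcription start site. The helper `gene_window` and its arguments are hypothetical.

```python
def gene_window(start, end, strand, options):
    """Return (low, high) coordinates of the search window for one gene.

    Assumed semantics (not taken from garnet's source):
    - "upstream" means lower coordinates on '+' and higher coordinates on '-'.
    - tss=False: the window covers the whole gene plus both flanks.
    - tss=True: the window covers only the flanks around the start site.
    """
    up = options["upstream_window"]
    down = options["downstream_window"]
    if strand == "+":
        tss = start
        low = tss - up
        high = (tss if options["tss"] else end) + down
    else:
        tss = end
        low = (tss if options["tss"] else start) - down
        high = tss + up
    # Clamp at zero so windows near the chromosome start stay valid.
    return max(low, 0), high


whole_gene = {"upstream_window": 10000, "downstream_window": 10000, "tss": False}
tss_only = {"upstream_window": 2000, "downstream_window": 500, "tss": True}

plus_window = gene_window(50000, 60000, "+", whole_gene)
minus_tss_window = gene_window(50000, 60000, "-", tss_only)
```

Under these assumptions, each gene's window would be inserted into the tree for its chromosome, so that a later lookup by position returns every gene whose window overlaps it.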

In [4]:
save_as_pickled_object(reference, '../example_data/hg19/', 'hg19_genes_only.garnet.pickle')
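
For reading the file back later, a minimal sketch of a pickle round trip follows. It assumes the file written above is an ordinary pickle stream that `pickle.load` can read, which I have not verified against `save_as_pickled_object`. The dictionary and file name here are stand-ins so the example runs without the real data.

```python
import os
import pickle
import tempfile

# Stand-in for the real {chromosome: IntervalTree} dictionary.
stand_in = {
    "chr1": [(40000, 70000, "GENE_A")],
    "chr2": [(1000, 25000, "GENE_B")],
}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "stand_in.garnet.pickle")
    # Write the object, then read it back from the same file.
    with open(path, "wb") as handle:
        pickle.dump(stand_in, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)
```

Unpickling the real file also needs the tree class to be importable in the loading session, since pickle stores a reference to the class rather than its code.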