#### Illumina 450k Probe Design & Translation

In [71]:
# List the Illumina manifest column names alongside an example row (see following cell).

import csv

anno_file = "../../data/HumanMethylation450_15017482_v1-2.csv"

N = 10
with open(anno_file) as anno:
    head = [next(anno) for _ in range(N)]

# Index 7 is the header row and index 8 is the first probe row.
# csv.reader strips the trailing newline and handles any quoted fields.
headers, row = csv.reader(head[7:9])

for i, (h, v) in enumerate(zip(headers, row)):  # 32 columns
    # print('{idx}\t{h}: \t {v}'.format(idx=i, h=h, v=v))
    pass

``` pre
Illumina 450k row entries we're interested in (referenced against Build 37).

1.  Name                            IlmnID.
2.  AddressA_ID                     Address ID for the probe used for both A and B alleles.
3.  AlleleA_ProbeSeq                Sequence of probe identified in AddressA_ID column.
9.  Forward_Sequence                Plus (+) strand (HapMap) sequence (5'-3') flanking the CG.
11. CHR                             Chromosome containing the CpG (Build 37).
12. MAPINFO                         Chromosomal coordinates of the CpG (Build 37).
13. SourceSeq                       Genomic sequence prior to bisulfite conversion.
16. Strand                          F Strand = (+) and R = (-).
21. UCSC_RefGene_Name               Target gene name[s] from UCSC, w/ splice variants.
22. UCSC_RefGene_Accession          UCSC accession numbers of target transcripts, in the same order as the gene names.
23. UCSC_RefGene_Group              Gene region of CpG position from UCSC.
24. UCSC_CpG_Islands_Name           Chromosomal coordinates of the CpG Island, from UCSC.
25. Relation_to_UCSC_CpG_Island     The location of the CpG relative to the CpG island.

Example:
1.  Name:                           cg00035864
2.  AddressA_ID:                    31729416
3.  AlleleA_ProbeSeq:               AAAACACTAACAATCTTATCCACATAAACCCTTAAATTTATCTCAAATTC
9.  Forward_Sequence:               AATCCAAAGATGATGGAGGTAAGTGCCCG...[CG]...TCTCTGGATTG
11. CHR:                            Y
12. MAPINFO:                        8553009
13. SourceSeq:                      AGACACTAGCAGTCTTGTCCACATAGACCCTTGAATTTATCTCAAATTCG
16. Strand:                         F
21. UCSC_RefGene_Name:              TTTY18
22. UCSC_RefGene_Accession:         NR_001550
23. UCSC_RefGene_Group:             TSS1500
24. UCSC_CpG_Islands_Name:          '' (sometimes empty)
25. Relation_to_UCSC_CpG_Island:    '' (sometimes empty)
```
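The column selection described above can be sketched with the standard-library `csv` module. This is a minimal illustration, not the notebook's own method: `read_probe_rows`, `COLUMNS`, and the toy input are hypothetical names introduced here, and the only detail taken from the notebook is that the header row sits at line index 7. The toy row reuses values from the example listing.

```python
import csv
import io

# A subset of the columns of interest, named as in the manifest header row.
COLUMNS = ("Name", "CHR", "MAPINFO", "Strand", "UCSC_RefGene_Name")

def read_probe_rows(handle, header_line=7, columns=COLUMNS):
    """Yield one dict per probe row, keeping only `columns`.

    `header_line` is the 0-based index of the header row
    (7 in the cell that reads the manifest).
    """
    for _ in range(header_line):
        next(handle)  # skip the preamble lines before the header
    for record in csv.DictReader(handle):
        yield {c: record[c] for c in columns}

# Toy stand-in for the manifest: 7 preamble lines, a header, one probe row.
toy = "\n".join(
    ["preamble"] * 7
    + ["Name,CHR,MAPINFO,Strand,UCSC_RefGene_Name",
       "cg00035864,Y,8553009,F,TTTY18"]
)
rows = list(read_probe_rows(io.StringIO(toy)))
```

Reading lazily with a generator keeps memory use flat, which may matter for a manifest with several hundred thousand rows.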

In [72]:
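# Create a mapping of UCSC CpG names --> row indices of the Illumina probes
# annotated with that name. (Takes ~2 minutes to complete)

import pandas as pd
from collections import defaultdict

anno_file = pd.read_excel("../../data/ilmn.450k.annotated.hg19.xlsx")
UCSC_CpG_names = anno_file.iloc[:, 3]  # fourth column (index 3)

UCSC_lookup = defaultdict(list)

for i, name in enumerate(UCSC_CpG_names):
    UCSC_lookup[name].append(i)

# Keep only names shared by more than one probe. Converting to a plain dict
# also stops lookups of unknown names from silently creating empty entries.
UCSC_lookup = {k: v for k, v in UCSC_lookup.items() if len(v) > 1}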

In [65]:
len(UCSC_lookup['CpG: 21']) # 8059 probes correspond to UCSC's 'CpG: 21'

8059
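The lookup built above stores row positions. If probe IDs are wanted as the values instead, one possible variant groups (probe ID, name) pairs directly. This is a sketch under stated assumptions: `build_lookup`, `min_probes`, and the toy pairs are hypothetical and do not come from the annotation file.

```python
from collections import defaultdict

def build_lookup(pairs, min_probes=1):
    """Map each UCSC CpG name to the list of probe IDs annotated with it.

    `pairs` is an iterable of (probe_id, cpg_name) tuples. Names with fewer
    than `min_probes` probes are dropped.
    """
    lookup = defaultdict(list)
    for probe_id, cpg_name in pairs:
        lookup[cpg_name].append(probe_id)
    # Return a plain dict so that unknown names raise KeyError.
    return {k: v for k, v in lookup.items() if len(v) >= min_probes}

# Toy pairs; the probe IDs are made up for illustration.
toy_pairs = [
    ("cg0000001", "CpG: 21"),
    ("cg0000002", "CpG: 21"),
    ("cg0000003", "CpG: 48"),
]

# min_probes=2 mirrors the `len(v) > 1` filter used in the cell above.
probe_lookup = build_lookup(toy_pairs, min_probes=2)
```

With the toy input, only `'CpG: 21'` survives the filter because `'CpG: 48'` has a single probe.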

In [66]:
# Functions to save/load dictionary objects.

import pickle  # on Python 3 this already uses the C implementation (_pickle) when available

def save_obj(obj, name):
    with open('../../data/pickle/' + name + '.pkl', 'wb') as f:
        pickle.dump(obj, f)

def load_obj(name):
    with open('../../data/pickle/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)

In [67]:
# save lookup object...
save_obj(UCSC_lookup, 'UCSC_lookup') # file is about 2 MB

In [68]:
# re-load lookup object...
UCSC_lookup = {} # re-initialize...
UCSC_lookup = load_obj('UCSC_lookup')

len(UCSC_lookup['CpG: 21']) # and again... 

8059
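The save/re-load check above can also be exercised without touching the real data directory. The sketch below is an illustration only: `save_obj_to` and `load_obj_from` are hypothetical variants that take the target directory as an argument, and the toy dictionary stands in for `UCSC_lookup`.

```python
import pickle
import tempfile
from pathlib import Path

def save_obj_to(obj, directory, name):
    """Pickle `obj` to <directory>/<name>.pkl, creating the directory if needed."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.pkl"
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return path

def load_obj_from(directory, name):
    """Load the object stored in <directory>/<name>.pkl."""
    with open(Path(directory) / f"{name}.pkl", "rb") as f:
        return pickle.load(f)

# Toy lookup with made-up row indices.
toy_lookup = {"CpG: 21": [0, 5, 9], "CpG: 48": [2, 3]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "pickle"
    save_obj_to(toy_lookup, target, "UCSC_lookup")
    restored = load_obj_from(target, "UCSC_lookup")

# Comparing whole objects is a stricter check than comparing one list length.
round_trip_ok = restored == toy_lookup
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.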