#### Illumina 450k Probe Design & Translation

``` pre
Illumina 450k manifest columns we're interested in (coordinates referenced against Build 37).

1.  Name                            IlmnID.
2.  AddressA_ID                     Address ID for the probe used for both A and B alleles.
3.  AlleleA_ProbeSeq                Sequence of probe identified in AddressA_ID column.
9.  Forward_Sequence                Plus (+) strand (HapMap) sequence (5'-3') flanking the CG.
11. CHR                             Chromosome containing the CpG (Build 37).
12. MAPINFO                         Chromosomal coordinates of the CpG (Build 37).
13. SourceSeq                       Genomic sequence prior to bisulfite conversion.
16. Strand                          F = forward (+) strand; R = reverse (-) strand.
21. UCSC_RefGene_Name               Target gene name(s) from UCSC, with splice variants.
22. UCSC_RefGene_Accession          UCSC accession numbers of the target transcripts, in the same order as the gene names.
23. UCSC_RefGene_Group              Gene region of CpG position from UCSC.
24. UCSC_CpG_Islands_Name           Chromosomal coordinates of the CpG Island, from UCSC.
25. Relation_to_UCSC_CpG_Island     The location of the CpG relative to the CpG island.

Example:
1.  Name:                           cg00035864
2.  AddressA_ID:                    31729416
3.  AlleleA_ProbeSeq:               AAAACACTAACAATCTTATCCACATAAACCCTTAAATTTATCTCAAATTC
9.  Forward_Sequence:               AATCCAAAGATGATGGAGGTAAGTGCCCG...[CG]...TCTCTGGATTG
11. CHR:                            Y
12. MAPINFO:                        8553009
13. SourceSeq:                      AGACACTAGCAGTCTTGTCCACATAGACCCTTGAATTTATCTCAAATTCG
16. Strand:                         F
21. UCSC_RefGene_Name:              TTTY18
22. UCSC_RefGene_Accession:         NR_001550
23. UCSC_RefGene_Group:             TSS1500
24. UCSC_CpG_Islands_Name:          '' (empty)
25. Relation_to_UCSC_CpG_Island:    '' (empty)
```
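
As a quick check on how the probe relates to the genomic sequence, the example row above can be compared directly. This is a minimal sketch based only on this one row: the helper `g_to_a` is a hypothetical name introduced for illustration, and the pattern it tests (every G read as A, with a one-base offset) is an observation on cg00035864, not a statement of Illumina's general design rules.

```python
# Values copied from the example row above (cg00035864).
source_seq = "AGACACTAGCAGTCTTGTCCACATAGACCCTTGAATTTATCTCAAATTCG"
probe_seq = "AAAACACTAACAATCTTATCCACATAAACCCTTAAATTTATCTCAAATTC"


def g_to_a(seq):
    """Replace every G with A (illustrative helper, not an Illumina API)."""
    return seq.replace("G", "A")


# Drop the last base of SourceSeq, convert G to A, and compare the result
# with the probe sequence minus its first base.
converted = g_to_a(source_seq[:-1])
print(converted == probe_seq[1:])
```

For this row the comparison holds, so AlleleA_ProbeSeq is the G-to-A converted SourceSeq shifted by one base. Other rows, and probes on the R strand, may follow a different pattern and should be checked separately.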

In [21]:
# Creates a mapping from UCSC CpG names to lists of Illumina probe row indices.
# WARNING: This rescans the whole column once per name, so it takes a minute or two to complete.

import pandas as pd

small_anno_file = pd.read_excel("../../data/ilmn.450k.annotated.hg19.xlsx")
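UCSC_CpG_names = small_anno_file.iloc[:, 3]  # fourth column of the annotation file
unique = set(UCSC_CpG_names)

lookup = {}
for name in unique:
    # Skip missing values (NaN is a float) and names longer than 8 characters.
    if not isinstance(name, str) or len(name) > 8:
        continue
    # Series.items() replaces iteritems(), which was removed in pandas 2.0.
    indices = [i for i, x in UCSC_CpG_names.items() if x == name]
    lookup[name] = indices  # row indices of the Illumina probes, keyed by CpG name ('CpG: 21', etc.)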
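
The loop above rescans the whole column once per name, which is why it is slow. A single pass that appends each position to its name's list builds the same kind of dictionary. The sketch below uses only the standard library and a small made-up list in place of the real column; `build_lookup` and `toy_names` are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def build_lookup(names, max_len=8):
    """Group positions by name in one pass over the sequence.

    Non-string entries (e.g. missing values) and names longer than
    max_len characters are skipped, mirroring the filter in the cell above.
    """
    lookup = defaultdict(list)
    for i, name in enumerate(names):
        if not isinstance(name, str) or len(name) > max_len:
            continue
        lookup[name].append(i)
    return dict(lookup)


# Toy stand-in for the UCSC CpG name column (not real annotation data).
toy_names = ["CpG: 21", "CpG: 48", "CpG: 21", None, "CpG: 1024", "CpG: 48", "CpG: 21"]
toy_lookup = build_lookup(toy_names)
print(toy_lookup)
```

With the real data this could be called as `build_lookup(UCSC_CpG_names.tolist())`. Note that it records positions, which match the Series index labels only when the DataFrame has its default integer index.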

In [22]:
len(lookup['CpG: 21']) # there are 8059 Illumina probes in UCSC's 'CpG: 21'. 

8059

In [28]:
import pickle  # the public module; it uses the C implementation automatically

def save_obj(obj, name):
    with open('../../data/pickle/' + name + '.pkl', 'wb') as f:
        pickle.dump(obj, f)

def load_obj(name):
    with open('../../data/pickle/' + name + '.pkl', 'rb') as f:
        return pickle.load(f)

In [29]:
save_obj(lookup, 'CpG_names_to_probes') # file is about 2 MB
#lookup = load_obj('CpG_names_to_probes')

In [30]:
lookup = {} # empty it...
lookup = load_obj('CpG_names_to_probes')

len(lookup['CpG: 21']) # and again... 

8059

In [None]:
# Code used to work out the Illumina column names and an example row (see top of page).
import csv

anno_file = "../../data/HumanMethylation450_15017482_v1-2.csv"

N = 10
with open(anno_file, newline="") as anno:
    head = [next(anno) for _ in range(N)]

# csv.reader copes with quoted fields and the trailing newline, unlike a plain split(",").
headers = next(csv.reader([head[7]]))  # header row
row = next(csv.reader([head[8]]))      # first data row

for i, (h, v) in enumerate(zip(headers, row)): # 32 columns
    print('{idx}\t{h}: \t {v}'.format(idx=i, h=h, v=v))
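
The cell above hard-codes the header row position (`head[7]`). A slightly more robust variant is to scan for the header row by its first field instead. The sketch below assumes the header row starts with `IlmnID`; `find_header_and_first_row` is a hypothetical helper, and the preamble lines in `toy_manifest` are invented placeholders rather than the real file contents.

```python
import csv
import io


def find_header_and_first_row(lines, first_column="IlmnID"):
    """Return (headers, first_data_row) from an iterable of CSV lines.

    Rows are skipped until one whose first field equals first_column.
    """
    reader = csv.reader(lines)
    for fields in reader:
        if fields and fields[0] == first_column:
            return fields, next(reader)
    raise ValueError("no header row starting with " + first_column)


# Toy stand-in for the manifest: placeholder preamble, header row, one data row.
toy_manifest = io.StringIO(
    "preamble line 1\n"
    "preamble line 2\n"
    "IlmnID,Name,AddressA_ID\n"
    "cg00035864,cg00035864,31729416\n"
)
headers, row = find_header_and_first_row(toy_manifest)
for i, (h, v) in enumerate(zip(headers, row)):
    print(f"{i}\t{h}:\t{v}")
```

An open file handle can be passed in place of `toy_manifest`, since `csv.reader` accepts any iterable of lines.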