#### Illumina 450k Probe Design & Translation

In [27]:
# Code to list Illumina column names along with an example row (see following cell).

anno_file = "../../data/HumanMethylation450_15017482_v1-2.csv"

N = 10
with open(anno_file) as anno:
    head = [next(anno) for _ in range(N)]

# Line 7 (0-based) holds the column names; line 8 is the first probe row.
headers = head[7].rstrip("\n").split(",")
row = head[8].rstrip("\n").split(",")

for i, (h, v) in enumerate(zip(headers, row)):  # 32 columns
    #print('{idx}\t{h}: \t {v}'.format(idx=i, h=h, v=v))
    pass

``` pre
Illumina 450k row entries we're interested in (referenced against Build 37).

1.  Name                            IlmnID.
2.  AddressA_ID                     Address ID for the probe used for both A and B alleles.
3.  AlleleA_ProbeSeq                Sequence of probe identified in AddressA_ID column.
9.  Forward_Sequence                Plus (+) strand (HapMap) sequence (5'-3') flanking the CG.
11. CHR                             Chromosome containing the CpG (Build 37).
12. MAPINFO                         Chromosomal coordinates of the CpG (Build 37).
13. SourceSeq                       Genomic sequence prior to bisulfite conversion.
16. Strand                          F Strand = (+) and R = (-).
21. UCSC_RefGene_Name               Target gene name[s] from UCSC, w/ splice variants.
22. UCSC_RefGene_Accession          UCSC accession numbers of target transcripts, in the same order as UCSC_RefGene_Name.
23. UCSC_RefGene_Group              Gene region of CpG position from UCSC.
24. UCSC_CpG_Islands_Name           Chromosomal coordinates of the CpG Island, from UCSC.
25. Relation_to_UCSC_CpG_Island     The location of the CpG relative to the CpG island.

Example:
1.  Name:                           cg00035864
2.  AddressA_ID:                    31729416
3.  AlleleA_ProbeSeq:               AAAACACTAACAATCTTATCCACATAAACCCTTAAATTTATCTCAAATTC
9.  Forward_Sequence:               AATCCAAAGATGATGGAGGTAAGTGCCCG...[CG]...TCTCTGGATTG
11. CHR:                            Y
12. MAPINFO:                        8553009
13. SourceSeq:                      AGACACTAGCAGTCTTGTCCACATAGACCCTTGAATTTATCTCAAATTCG
16. Strand:                         F
21. UCSC_RefGene_Name:              TTTY18
22. UCSC_RefGene_Accession:         NR_001550
23. UCSC_RefGene_Group:             TSS1500
24. UCSC_CpG_Islands_Name:          '' (sometimes empty)
25. Relation_to_UCSC_CpG_Island:    '' (sometimes empty)
```
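Splitting each line on "," works for the example row, but it would mis-split any quoted field that contains a comma. As a hedged alternative, the sketch below reads the same columns with the standard-library `csv` module. The names `read_manifest_rows`, `COLUMNS`, and `header_row` are illustrative, not part of the notebook, and the sample text is synthetic, built from the example row above rather than read from the real manifest. Any non-probe lines after the probe table are not filtered out.

```python
import csv
import io

# A subset of the columns listed above (illustrative choice).
COLUMNS = ["Name", "CHR", "MAPINFO", "Strand", "UCSC_RefGene_Name"]

def read_manifest_rows(handle, header_row=7, columns=COLUMNS):
    """Yield one dict per data row, keeping only `columns`.

    `header_row` is the 0-based index of the line holding the column
    names (7 in the cell above; other files may differ).
    """
    for _ in range(header_row):
        next(handle)  # skip the preamble lines before the header
    for record in csv.DictReader(handle):
        yield {c: record[c] for c in columns}

# Synthetic stand-in for the manifest: 7 preamble lines, a header, one row.
sample = "\n".join(
    ["preamble line %d" % i for i in range(7)]
    + [
        "IlmnID,Name,CHR,MAPINFO,Strand,UCSC_RefGene_Name",
        "cg00035864,cg00035864,Y,8553009,F,TTTY18",
    ]
) + "\n"

rows = list(read_manifest_rows(io.StringIO(sample)))
print(rows[0])
```

With a real file, the handle would come from `open(anno_file, newline="")`, as the `csv` documentation recommends.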

In [28]:
# Create a mapping of UCSC CpG names --> Illumina probe ids. (Takes >2 hours to finish the full file)

from collections import defaultdict

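def horribly_slow(anno_file): # why is this so slow?
    # NOTE: despite the name, `anno_file` is a DataFrame here, not a path.
    UCSC_CpG_keys =  list(anno_file.iloc[:,3])
    UCSC_CpG_vals =  list(anno_file.iloc[:,2])  # unused
    UCSC_lookup = {}

    UCSC_dict = defaultdict(list)

    # Group row positions by UCSC CpG name.
    for i,name in enumerate(list(UCSC_CpG_keys)):
        UCSC_dict[name].append(i)
    UCSC_dict = {k:v for k,v in UCSC_dict.items()}

    # Slow part: this visits every row's key (duplicates included), and each
    # visit rebuilds the whole list with one full-row .iloc lookup per member.
    for key in UCSC_CpG_keys:
        try:
            UCSC_lookup[key] = [anno_file.iloc[x]['Name'] for x in UCSC_dict[key]]
        except KeyError:
            pass

    return UCSC_lookup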

In [29]:
import pandas as pd

anno_file = pd.read_excel("../../data/ilmn.450k.annotated.hg19.small.xlsx")

UCSC_lookup = horribly_slow(anno_file)

In [30]:
UCSC_lookup['CpG: 21']

['cg00050873', 'cg00061679', 'cg00543493']
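One plausible reason `horribly_slow` takes so long is that its second loop runs once per row rather than once per distinct key, and does a separate `.iloc` row lookup for every member each time. The sketch below builds the same kind of mapping in a single pass over two parallel sequences. The name `build_lookup` is hypothetical, and the demo values are synthetic ('CpG: 5' and 'cg_hypothetical' are made up). This is an untimed illustration, not a measured fix.

```python
from collections import defaultdict

def build_lookup(keys, names):
    """Map each key to the list of names that share it, in input order."""
    lookup = defaultdict(list)
    for key, name in zip(keys, names):
        lookup[key].append(name)  # one append per row, no DataFrame access
    return dict(lookup)

# Synthetic demo; with the DataFrame above the call would be
# build_lookup(anno_file.iloc[:, 3], anno_file["Name"]).
demo_keys = ["CpG: 21", "CpG: 5", "CpG: 21", "CpG: 21"]
demo_names = ["cg00050873", "cg_hypothetical", "cg00061679", "cg00543493"]

demo_lookup = build_lookup(demo_keys, demo_names)
print(demo_lookup["CpG: 21"])
```

Passing the columns as plain sequences keeps the loop free of per-row pandas indexing, which is the only design choice the sketch relies on.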