Script that searches for SNP IDs in Illumina microarray MAP files.

## Import packages

In [1]:
import pandas as pd 

## Import data

In [12]:
mapfile = "data.map"
mapdf = pd.read_csv(mapfile,delimiter="\t", header=None)
print(f"Detected SNPs: {len(mapdf)}")
print(mapdf.head())
mapdfconverted = mapdf.convert_dtypes() # Columns 0 (chromosome) and 1 (SNP ID) are read as "object" dtype
mapdfconverted.dtypes # convert_dtypes() infers the pandas string dtype for them

Detected SNPs: 964193
    0                    1          2          3
0   1   exm-IND1-200449980  200.60350  202183358
1   1    exm-IND1-85310248  109.37180   85537661
2  10  exm-IND10-102817747  121.16370  102827758
3  10   exm-IND10-18329639   42.88673   18289634
4  10   exm-IND10-27476467   51.56835   27436462


0    string[python]
1    string[python]
2           Float64
3             Int64
dtype: object
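
The four unnamed columns are consistent with the standard PLINK MAP layout (chromosome, variant ID, genetic distance in cM, base-pair position). As a minimal sketch, assuming the file follows that layout, the columns can be named at load time so later cells need not rely on positional labels. The names `chrom`, `snp_id`, `cm` and `bp` are illustrative choices, and the inline sample stands in for `data.map`.

```python
import io
import pandas as pd

# First rows of the MAP file shown above, tab-separated, used as a stand-in for data.map
sample = (
    "1\texm-IND1-200449980\t200.60350\t202183358\n"
    "1\texm-IND1-85310248\t109.37180\t85537661\n"
    "10\texm-IND10-102817747\t121.16370\t102827758\n"
)

# Column names are illustrative labels; the file itself has no header row
map_columns = ["chrom", "snp_id", "cm", "bp"]
named_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=map_columns,
    dtype={"chrom": "string", "snp_id": "string"},  # keep chromosome labels as text
)
print(named_df.dtypes)
```

With named columns, `named_df["snp_id"]` would play the role of `mapdfconverted[1]` in the cells that follow.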

In [13]:
mapdfconverted[1]

0          exm-IND1-200449980
1           exm-IND1-85310248
2         exm-IND10-102817747
3          exm-IND10-18329639
4          exm-IND10-27476467
                 ...         
964188              VGXS34742
964189              VGXS34743
964190              VGXS34744
964191              VGXS34761
964192              VGXS35706
Name: 1, Length: 964193, dtype: string

I was interested in finding all the alleles for CYP2C9 and CYP2C19 in case they showed up as hits in subsequent GWAS analyses. <br>
The IDs for these alleles were obtained from https://www.pharmgkb.org/ and exported into a .csv file. <br>
I then extracted the rsID row into separate .txt files to use as a simplified list for indexing.
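
The extraction step described above could be sketched as follows. The layout of the exported .csv is an assumption here: the example supposes a row whose first cell is `rsID`, followed by one ID per column, and uses a made-up three-line export in place of the real file. The output file is written to a temporary directory for the sake of the example.

```python
import csv
import io
import os
import tempfile

# Hypothetical stand-in for the exported .csv; the real export's layout may differ
export_text = (
    "Gene,CYP2C9,,\n"
    "rsID,rs1799853,rs1057910,\n"
    "*1,C,A,\n"
)
rows = list(csv.reader(io.StringIO(export_text)))

# Find the row labelled "rsID" and keep only cells that look like rsIDs
rsid_row = next(row for row in rows if row and row[0] == "rsID")
rsids = [cell for cell in rsid_row[1:] if cell.startswith("rs")]

# Write one ID per line so that .read().split() can load the list later
out_path = os.path.join(tempfile.mkdtemp(), "cyp2c9_alleles.txt")
with open(out_path, "w") as handle:
    handle.write("\n".join(rsids) + "\n")

with open(out_path) as handle:
    reloaded = handle.read().split()
print(reloaded)
```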

In [4]:
# Import allele lists from the txt files (whitespace-separated rsIDs)
with open('cyp2c9_alleles.txt') as f:
    cyp2c9_alleles = f.read().split()
with open('cyp2c19_alleles.txt') as f:
    cyp2c19_alleles = f.read().split()
print(f"CYP2C9 rsID: {cyp2c9_alleles}\nCYP2C19 rsID:{cyp2c19_alleles}")

CYP2C9 rsID: ['rs114071557', 'rs67807361', 'rs142240658', 'rs1364419386', 'rs2031308986', 'rs564813580', 'rs371055887', 'rs72558187', 'rs762239445', 'rs1304490498', 'rs774607211', 'rs767576260', 'rs12414460', 'rs375805362', 'rs72558189', 'rs1375956433', 'rs200965026', 'rs199523631', 'rs200183364', 'rs1799853', 'rs141489852', 'rs754487195', 'rs1289704600', 'rs17847037', 'rs7900194', 'rs72558190', 'rs774550549', 'rs1326630788', 'rs2031531005', 'rs370100007', 'rs772782449', 'rs2256871', 'rs9332130', 'rs9332131', 'rs182132442', 'rs72558192', 'rs988617574', 'rs1237225311', 'rs57505750', 'rs28371685', 'rs367826293', 'rs1274535931', 'rs750820937', 'rs1297714792', 'rs749060448', 'rs1057910', 'rs56165452', 'rs28371686', 'rs1250577724', 'rs578144976', 'rs542577750', 'rs764211126', 'rs72558193', 'rs1254213342', 'rs1441296358', 'rs776908257', 'rs769942899', 'rs202201137', 'rs767284820', 'rs781583846', 'rs9332239', 'rs868182778']
CYP2C19 rsID:['rs12248560', 'rs28399504', 'rs367543002', 'rs367543003

## Run matching function

In [36]:
# Join each allele list into a regex alternation ("rsA|rsB|...").
# str.contains() searches anywhere in the string, so prefixed IDs such as
# 'exm-rs1799853' are found without a leading '^.*'.
cyp2c9_alleles2 = "|".join(cyp2c9_alleles)
cyp2c19_alleles2 = "|".join(cyp2c19_alleles)

cyp2c9_match = mapdfconverted[1].str.contains(cyp2c9_alleles2, regex=True)
cyp2c19_match = mapdfconverted[1].str.contains(cyp2c19_alleles2, regex=True)
matched = []

def GetMatchedID(pdseries):
    """Print the 0-based row position of each match and append it to `matched`."""
    for i, e in enumerate(pdseries.array):
        if e:
            print(i, e)
            matched.append(i)

GetMatchedID(cyp2c9_match)
# 2 hits found: rows 2539 and 626530 (SNP #2540 and #626531 when counting from 1)
print("\n")
GetMatchedID(cyp2c19_match)
# 5 hits found: rows 529877, 627322, 681866, 735870, 762620
# (SNP #529878, #627323, #681867, #735871, #762621 when counting from 1)

matched_df = mapdfconverted.iloc[matched]
matched_df

2539 True
626530 True


529877 True
627322 True
681866 True
735870 True
762620 True


         0              1         2         3
2539    10  exm-rs1799853  115.3083  96702047
626530  10     rs28371685  115.3165  96740981
529877  10     rs17879685  115.2889  96609752
627322  10     rs28399504  115.2706  96522463
681866  10      rs4244285  115.2746  96541616
735870  10      rs4986893  115.2744  96540410
762620  10      rs6413438  115.2746  96541615
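
The matching cell above relies on a substring search, so an array ID that merely contains a listed rsID (for example a longer rsID sharing the same leading digits) would also be flagged. A stricter alternative, offered here as a sketch and not as the method used in this notebook, is to extract a trailing `rs<digits>` token and compare whole IDs. The inputs are illustrative; `rs17998531` is a made-up ID chosen to show the difference.

```python
import pandas as pd

# Small illustrative inputs; "rs17998531" is a made-up ID that merely contains rs1799853
snp_ids = pd.Series(
    ["exm-rs1799853", "rs17998531", "rs28371685", "exm-IND1-200449980"],
    dtype="string",
)
alleles = ["rs1799853", "rs28371685"]

# Substring search, as in the cell above: the longer ID is flagged as well
loose = snp_ids.str.contains("|".join(alleles), regex=True)

# Stricter alternative: pull out a trailing rsID and compare whole IDs
trailing_rsid = snp_ids.str.extract(r"(rs\d+)$", expand=False)
strict = trailing_rsid.isin(alleles)

# 0-based row positions of the strict matches, usable with .iloc
strict_hits = [i for i, hit in enumerate(strict) if hit]
print(loose.tolist())
print(strict.tolist())
print(strict_hits)
```

Comparing whole IDs with `isin` also avoids building one very long regular expression, at the cost of assuming that the rsID always sits at the end of the array ID.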


## Export DF

In [2]:
save = input("Save to file? (y/n) ").strip().lower()
if save == 'y':
    # Tab-separated, no header or index: same layout as the MAP file, despite the .csv name
    exportname = 'matched_alleles.csv'
    matched_df.to_csv(exportname, sep='\t', index=False, header=False)
    print(f"Exported results to {exportname}")
else:
    print("Results not saved.")
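
For non-interactive runs, the export step could be wrapped in a small function and verified by reading the file back. This is only a sketch: `export_matches`, `demo_df` and the temporary output path are hypothetical names introduced for illustration, and the two rows are copied from the matched results shown earlier.

```python
import os
import tempfile
import pandas as pd

def export_matches(df, path, save=True):
    """Write df as a headerless, tab-separated file; return True if written."""
    if not save:
        return False
    df.to_csv(path, sep="\t", index=False, header=False)
    return True

# Two of the matched rows shown earlier, used as a stand-in for matched_df
demo_df = pd.DataFrame(
    [["10", "exm-rs1799853", 115.3083, 96702047],
     ["10", "rs28371685", 115.3165, 96740981]]
)
demo_path = os.path.join(tempfile.mkdtemp(), "matched_alleles.map")
written = export_matches(demo_df, demo_path)

# Read the file back to confirm the four-column layout survived
roundtrip = pd.read_csv(demo_path, sep="\t", header=None)
print(written, roundtrip.shape)
```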

To demonstrate that the code above works, a dummy DataFrame is created from a list of randomly selected variant IDs. <br> Some IDs are prefixed with random strings; others are unchanged. <br> 
Three of them (rs1250577724, rs1297714792 and rs868182778) come from the CYP2C9 list, so 3 alleles should match the CYP2C9 pattern.

In [48]:
test = ['exm-rs1250577724','IND1-rs1297714792', 'exm-rs1000113','rs17803441','rs17803457','rs17803505',
'random-rs17803584','xxxxasxa-rs1780361',' ran-dom10-rs868182778','ind10101-rs58973490']
testdf = pd.DataFrame(test)
print(testdf)
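# Note: this line checks the CYP2C19 pattern, which flags only row 9 here; the three IDs
# taken from the CYP2C9 list (rows 0, 1 and 8) match when cyp2c9_alleles2 is used instead
testdf[0].str.contains('^.*'+cyp2c19_alleles2,regex=True)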

                        0
0        exm-rs1250577724
1       IND1-rs1297714792
2           exm-rs1000113
3              rs17803441
4              rs17803457
5              rs17803505
6       random-rs17803584
7      xxxxasxa-rs1780361
8   ran-dom10-rs868182778
9     ind10101-rs58973490


0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
9     True
Name: 0, dtype: bool
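
The expectation of three matches can be checked against the CYP2C9 IDs directly. The sketch below is self-contained: `cyp2c9_subset` is a short excerpt of the CYP2C9 list printed earlier (just enough to cover the three planted IDs), so it stands in for the full `cyp2c9_alleles` list.

```python
import re
import pandas as pd

test_ids = ['exm-rs1250577724', 'IND1-rs1297714792', 'exm-rs1000113', 'rs17803441',
            'rs17803457', 'rs17803505', 'random-rs17803584', 'xxxxasxa-rs1780361',
            ' ran-dom10-rs868182778', 'ind10101-rs58973490']

# Excerpt of the CYP2C9 list printed earlier; the full list would be used in practice
cyp2c9_subset = ['rs1250577724', 'rs1297714792', 'rs868182778', 'rs1799853']

# Same substring-style matching as in the notebook, with IDs escaped for safety
pattern = "|".join(re.escape(a) for a in cyp2c9_subset)
hits = pd.Series(test_ids, dtype="string").str.contains(pattern, regex=True)

# 0-based positions of the matching test IDs
hit_rows = [i for i, hit in enumerate(hits) if hit]
print(hit_rows)
```

Rows 0, 1 and 8 are the three IDs taken from the CYP2C9 list, so those are the positions expected in `hit_rows`.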