# Choosing markers from real bead chip data

Choose a real bead chip: Illumina Infinium OmniExpress2.5 - 8 Kit (https://support.illumina.com/downloads/infinium-omni2-5-8-v1-4-support-files.html). This array provides a high-density set of genotyped markers.

In [1]:
import os

os.chdir('/home/camille/PoolImpHuman/data/omniexpress8')

### Exploratory data analysis (EDA) for the chosen chip

In [3]:
print('Number of markers (+1 for the header line) by counting IDs')
%sx cut -f1 InfiniumOmni2-5-8v1-4_A1_Physical-and-Genetic-Coordinates.txt | wc -l

Number of markers (+1 for the header line) by counting IDs


['2382210']

In [4]:
print('Number of markers on chromosome 20 only (the header line is filtered out)')
%sx cut -f2 InfiniumOmni2-5-8v1-4_A1_Physical-and-Genetic-Coordinates.txt | grep 20 | wc -l

Number of markers on chromosome 20 only (the header line is filtered out)


['56627']
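
The `grep 20` filter above is only safe because `cut -f2` isolates the chromosome column first; applied to whole lines, a substring match would also pick up IDs or positions containing `20`. A minimal sketch of an exact-match count, run on a small made-up table (the column names `Name`, `Chr`, `MapInfo` and every row are illustrative assumptions, not values from the Illumina file):

```python
import csv
import io
from collections import Counter

# Toy stand-in for the coordinates file; all values are made up.
toy_table = (
    "Name\tChr\tMapInfo\n"
    "rs0001\t20\t63244\n"
    "rs0002\t2\t2020\n"
    "rs2000\t1\t120\n"
    "rs0003\t20\t63799\n"
)


def count_by_chromosome(handle):
    """Count markers per chromosome with an exact match on column 2."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line: it is not a marker
    return Counter(row[1] for row in reader)


counts = count_by_chromosome(io.StringIO(toy_table))

# Naive whole-line substring match, for comparison only.
naive = sum("20" in line for line in toy_table.splitlines()[1:])

print("exact match on chromosome 20:", counts["20"])
print("substring match on whole lines:", naive)
```

On this toy input the exact match keeps 2 markers while the substring match keeps all 4, which is why the filtering cells below match on the chromosome field rather than on the raw line.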

In [6]:
print('Extract ID and CHROM columns for markers on chromosome 20')
%sx cut -f1,2 InfiniumOmni2-5-8v1-4_A1_Physical-and-Genetic-Coordinates.txt | awk '/20$/' | wc -l

Extract ID and CHROM columns for markers on chromosome 20


['56627']

In [5]:
print('Extract ID CHROM POS columns for markers on chromosome 20')
%sx cut -f1,2,3 InfiniumOmni2-5-8v1-4_A1_Physical-and-Genetic-Coordinates.txt | awk '/\t20\t/' | wc -l

Extract ID CHROM POS columns for markers on chromosome 20


['56627']

In [7]:
print('Extract POS column only for markers on chromosome 20')
%sx cut -f1,2,3 InfiniumOmni2-5-8v1-4_A1_Physical-and-Genetic-Coordinates.txt | awk '/\t20\t/' | cut -f3 | wc -l

Extract POS column only for markers on chromosome 20


['56627']

In [8]:
print('Create a coordinate file that bcftools can process (marker positions must include at least the CHROM and POS columns)')
%sx cut -f1,2,3 InfiniumOmni2-5-8v1-4_A1_Physical-and-Genetic-Coordinates.txt | awk '/\t20\t/' | cut -f2,3 > InfiniumOmniExpress-chr20-CHROM-POS.txt

Create a coordinate file that bcftools can process (marker positions must include at least the CHROM and POS columns)


[]
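
The regions file passed to `bcftools -R` is tab-delimited, with `CHROM` and `POS` as its two columns. The same extraction can be sketched in Python; this variant additionally sorts and de-duplicates positions, which the shell pipeline above does not do, so it is an alternative under stated assumptions rather than a reproduction of it. The function name `to_regions` and the toy rows are hypothetical.

```python
import csv
import io

# Toy stand-in for the coordinates file; all values are made up.
toy_table = (
    "Name\tChr\tMapInfo\n"
    "rs1\t20\t300\n"
    "rs2\t20\t100\n"
    "rs3\t2\t2020\n"
    "rs4\t20\t100\n"
)


def to_regions(handle, chrom):
    """Return CHROM<TAB>POS lines for one chromosome, sorted and de-duplicated."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    positions = sorted({int(row[2]) for row in reader if row[1] == chrom})
    return "".join(f"{chrom}\t{pos}\n" for pos in positions)


regions_text = to_regions(io.StringIO(toy_table), "20")
print(regions_text, end="")
```

Writing `regions_text` to a file would give something usable in place of `InfiniumOmniExpress-chr20-CHROM-POS.txt`, with the caveat that markers sharing a position collapse into one line.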

In [9]:
print('verify markers uniqueness')
%sx uniq -u InfiniumOmniExpress-chr20-CHROM-POS.txt | wc -l

verify markers uniqueness


['56627']
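
Note that `uniq -u` only compares adjacent lines, so the count above proves uniqueness only if duplicated coordinates would sit next to each other (for example, in a position-sorted file). A minimal sketch of the difference, on made-up positions; the helper names `uniq_u_count` and `singleton_count` are hypothetical:

```python
from collections import Counter
from itertools import groupby

# Made-up (CHROM, POS) pairs with one adjacent and one non-adjacent duplicate.
rows = [("20", 63244), ("20", 63799), ("20", 63244), ("20", 68749), ("20", 68749)]


def uniq_u_count(items):
    """Mimic `uniq -u | wc -l`: count runs of adjacent equal items of length 1."""
    return sum(1 for _, run in groupby(items) if len(list(run)) == 1)


def singleton_count(items):
    """Count items that occur exactly once anywhere in the input."""
    return sum(1 for n in Counter(items).values() if n == 1)


adjacent_only = uniq_u_count(rows)        # misses the non-adjacent duplicate
after_sorting = uniq_u_count(sorted(rows))
order_free = singleton_count(rows)

print(adjacent_only, after_sorting, order_free)
```

On unsorted input the adjacent-only count overstates uniqueness; after sorting it agrees with the order-independent count.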

In [10]:
print('Number of Illumina OmniExpress markers present in the 1000GP data: intersection with 1000GP filtered SNPs')
%sx bcftools view -H -R ~/PoolImpHuman/data/omniexpress8/InfiniumOmniExpress-chr20-CHROM-POS.txt ~/PoolImpHuman/data/main/ALL.chr20.snps.gt.vcf.gz | wc -l

Number of Illumina OmniExpress markers present in the 1000GP data: intersection with 1000GP filtered SNPs


['52697']

In [10]:
print('Number of markers in the 1000GP filtered SNPs data')
%sx bcftools query -f '%ID\n' ~/PoolImpHuman/data/main/ALL.chr20.snps.gt.vcf.gz | wc -l

Number of markers in the 1000GP filtered SNPs data


['1739315']

In [11]:
print('Extract the ID, CHROM and POS columns into a file that Pandas can read')
%sx bcftools query -f '%ID\t%CHROM\t%POS\n' ~/PoolImpHuman/data/main/ALL.chr20.snps.gt.vcf.gz > ~/PoolImpHuman/data/omniexpress8/ALL.chr20.snps.gt.ID-CHROM-POSmarkers.txt

Extract the ID, CHROM and POS columns into a file that Pandas can read


[]

In [17]:
'''
Compare markers present in both 1KGP chr20 data set and OmniExpress Beadchip of Illumina for chr20
'''

import pandas as pd

m1kgp = pd.read_csv('ALL.chr20.snps.gt.ID-CHROM-POSmarkers.txt', sep='\t', header=None, names=['ID', 'CHROM', 'POS'])
millumina = pd.read_csv('InfiniumOmniExpress-chr20-ID-CHROM-POS.txt', sep='\t', header=None, names=['ID', 'CHROM', 'POS'])

# minter = m1kgp.join(millumina, how="inner", lsuffix='1kgp', rsuffix='illu')
minter = m1kgp.merge(millumina, on='POS', how="inner", suffixes=('1kgp', 'illu'))
print(minter.head())

print('\nNumber of common markers = {}'.format(len(minter)))
# intersection on ID --> 17778 SNPs only
# there are duplicate IDs

      ID1kgp  CHROM1kgp    POS     IDillu  CHROMillu
0  rs6139074         20  63244  rs6139074         20
1  rs1418258         20  63799  rs1418258         20
2  rs6086616         20  68749  rs6086616         20
3  rs6039403         20  69094  rs6039403         20
4  rs6040395         20  71093  rs6040395         20

Number of common markers = 17791
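
The gap between joining on ID and joining on position can be illustrated on toy frames. This is a minimal sketch, assuming the same three-column layout as above; the frame names `m1kgp_toy` and `millumina_toy` and all marker IDs are made up. It joins on both `CHROM` and `POS` (a slightly stricter key than `POS` alone) and separates the number of joined rows from the number of distinct positions.

```python
import pandas as pd

# Made-up markers: one position (200) appears twice in the first frame,
# and one shared position carries different IDs in the two frames.
m1kgp_toy = pd.DataFrame({
    "ID": ["rsA", "rsB", "rsB2", "rsC"],
    "CHROM": [20, 20, 20, 20],
    "POS": [100, 200, 200, 300],
})
millumina_toy = pd.DataFrame({
    "ID": ["rsA", "kgp200", "rsD"],
    "CHROM": [20, 20, 20],
    "POS": [100, 200, 400],
})

# Join on genomic coordinates: IDs may differ between the two sources.
inter_pos = m1kgp_toy.merge(
    millumina_toy, on=["CHROM", "POS"], how="inner", suffixes=("1kgp", "illu")
)
# Join on ID only: misses markers that are named differently.
inter_id = m1kgp_toy.merge(
    millumina_toy, on="ID", how="inner", suffixes=("1kgp", "illu")
)

n_rows = len(inter_pos)
n_unique_pos = len(inter_pos.drop_duplicates(subset=["CHROM", "POS"]))

print(inter_pos)
print("rows:", n_rows, "unique positions:", n_unique_pos, "ID matches:", len(inter_id))
```

Here the position join yields more rows than distinct positions because a duplicated position on one side multiplies the matches, while the ID join finds only the marker that has the same name in both sources.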


In [1]:
print('Number of UNIQUE positions present in both the 1000GP data and the Illumina OmniExpress bead chip')
%sx bcftools query -f '%POS\n' -R ~/PoolImpHuman/data/omniexpress8/InfiniumOmniExpress-chr20-CHROM-POS.txt ~/PoolImpHuman/data/main/ALL.chr20.snps.gt.vcf.gz | uniq -u | wc -l

Number of UNIQUE positions present in both the 1000GP data and the Illumina OmniExpress bead chip


['52697']