# Choosing markers from real bead chip data

Choose a real bead chip: the Illumina Infinium OmniExpress-24 (https://support.illumina.com/array/array_kits/humanomniexpress-24-beadchip-kit/downloads.html). The array should offer many sample slots, since we aim to genotype large populations.

"The Infinium OmniExpress-24 BeadChip array is a powerful tool for genome-wide association studies (GWAS), providing high sample throughput and coverage of common variants. Using the proven iScan System, this 24-sample BeadChip offers exceptional throughput of thousands of samples per week.
Optimized tag SNP content from all three HapMap phases has been strategically selected to capture the greatest amount of common variation and drive the discovery of novel associations with traits and diseases."

Extract the marker coordinates of this bead chip, which are provided on the manufacturer's website (https://support.illumina.com/content/dam/illumina-support/documents/downloads/productfiles/humanomniexpress-24/v1-3/infinium-omniexpress-24-v1-3-a1-physical-genetic-coordinates.zip).

Subset the markers of the 1000 Genomes Project (1000GP) data based on this marker list.

Apply the pooling simulation to this data, then imputation (run in parallel): impute for the whole K20 and plot the result for the bead chip only vs. impute and plot the results from the bead chip.

In [1]:
import os

os.chdir('/home/camille/PoolImpHuman/data/omniexpress24')

### EDA for the chip chosen

In [2]:
print('Number of markers (+1 for the header line) by counting IDs')
%sx cut -f1 InfiniumOmniExpress-24v1-3_A1.csv_Physical-and-Genetic-Coordinates.txt | wc -l

Number of markers (+1 for the header line) by counting IDs


['714239']

In [3]:
print('Number of markers on chromosome 20 only (the header line is not matched)')
%sx cut -f2 InfiniumOmniExpress-24v1-3_A1.csv_Physical-and-Genetic-Coordinates.txt | grep -x 20 | wc -l

Number of markers on chromosome 20 only (the header line is not matched)


['18131']
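
The shell pipelines in this section select chromosome 20 by pattern matching. As a more defensive alternative, the same selection can be sketched in Python with an exact comparison on the chromosome field. The column names (`Name`, `Chr`, `MapInfo`) and the rows below are illustrative assumptions, not values read from the Illumina file.

```python
import csv
import io

# Toy excerpt: column names and rows are assumptions made for illustration.
SAMPLE = (
    "Name\tChr\tMapInfo\n"
    "rsA\t2\t2000020\n"
    "rsB\t20\t63244\n"
    "rsC\t20\t63799\n"
    "rsD\t12\t20\n"
)


def markers_on_chrom(handle, chrom):
    """Return the rows whose Chr field equals `chrom` exactly."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["Chr"] == chrom]


chr20 = markers_on_chrom(io.StringIO(SAMPLE), "20")
print(len(chr20))
```

Rows such as `rsA` (chromosome 2, position containing "20") and `rsD` (position equal to 20) are not selected, because only the chromosome field is compared.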

In [4]:
print('Extract ID and CHROM columns for markers on chromosome 20')
%sx cut -f1,2 InfiniumOmniExpress-24v1-3_A1.csv_Physical-and-Genetic-Coordinates.txt | awk '/20$/' | wc -l

Extract ID and CHROM columns for markers on chromosome 20


['18131']

In [5]:
print('Extract ID CHROM POS columns for markers on chromosome 20')
%sx cut -f1,2,3 InfiniumOmniExpress-24v1-3_A1.csv_Physical-and-Genetic-Coordinates.txt | awk '/\t20\t/' | wc -l

Extract ID CHROM POS columns for markers on chromosome 20


['18131']

In [6]:
print('Extract POS column only for markers on chromosome 20')
%sx cut -f1,2,3 InfiniumOmniExpress-24v1-3_A1.csv_Physical-and-Genetic-Coordinates.txt | awk '/\t20\t/' | cut -f3 | wc -l

Extract POS column only for markers on chromosome 20


['18131']

In [7]:
print('Create a coordinate file readable by bcftools (marker positions must include at least the CHROM and POS columns)')
%sx cut -f1,2,3 InfiniumOmniExpress-24v1-3_A1.csv_Physical-and-Genetic-Coordinates.txt | awk '/\t20\t/' | cut -f2,3 > InfiniumOmniExpress-chr20-CHROM-POS.txt

Create a coordinate file readable by bcftools (marker positions must include at least the CHROM and POS columns)


[]
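
For reference, the two-column layout that `bcftools -R` reads (tab-separated CHROM and POS, 1-based) can also be produced from Python. This is a minimal sketch: `write_regions` is a hypothetical helper and the positions are toy values, not the chip data.

```python
import csv
import io


def write_regions(rows, out):
    """Write de-duplicated CHROM<TAB>POS lines, sorted by chromosome label then position."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for chrom, pos in sorted({(c, int(p)) for c, p in rows}):
        writer.writerow([chrom, pos])


# Toy input with one repeated position and unsorted order.
rows = [("20", "63799"), ("20", "63244"), ("20", "63799")]
buf = io.StringIO()
write_regions(rows, buf)
regions_text = buf.getvalue()
print(regions_text, end="")
```

Sorting and de-duplicating at write time means later uniqueness checks on the file do not depend on the order of the source table.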

In [8]:
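print('Verify marker uniqueness')
# uniq -u only compares adjacent lines, so this check relies on any duplicates being adjacent (e.g. a position-sorted file)
%sx uniq -u InfiniumOmniExpress-chr20-CHROM-POS.txt | wc -l

Verify marker uniqueness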


['18131']
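
Because `uniq -u` only drops lines that repeat next to each other, a duplicate that is not adjacent would go unnoticed. The sketch below contrasts that behaviour with an order-independent check on toy data; `uniq_u` and `duplicated` are hypothetical helper names.

```python
from collections import Counter
from itertools import groupby


def uniq_u(items):
    """Mimic `uniq -u`: keep items that are not repeated in an adjacent run."""
    return [key for key, run in groupby(items) if len(list(run)) == 1]


def duplicated(items):
    """Return the items that occur more than once, wherever they appear."""
    counts = Counter(items)
    return sorted(item for item, n in counts.items() if n > 1)


# Toy (CHROM, POS) pairs: the repeated pair is NOT adjacent.
positions = [("20", 100), ("20", 200), ("20", 100)]

print(len(uniq_u(positions)))
print(duplicated(positions))
```

On this toy input the adjacent-only check keeps all three lines, while the counter-based check reports the repeated pair.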

In [9]:
print('Number of Illumina OmniExpress markers present in the 1000GP data: intersection with 1000GP filtered SNPs')
%sx cd ~/1000Genomes/data/gt && bcftools view -H -R ~/Documents/PhD/1000Genomes/InfiniumOmniExpress-chr20-CHROM-POS.txt ALL.chr20.snps.gt.vcf.gz | wc -l

Number of Illumina OmniExpress markers present in the 1000GP data: intersection with 1000GP filtered SNPs


['17791']

In [10]:
print('Number of markers in the 1000GP filtered SNPs data')
%sx cd ~/1000Genomes/data/gt && bcftools query -f '%ID\n' ALL.chr20.snps.gt.vcf.gz | wc -l

Number of markers in the 1000GP filtered SNPs data


['1739315']
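
The counts reported so far can be put in proportion with a few lines of arithmetic. The variable names are only illustrative; the three numbers are the cell outputs above.

```python
# Counts reported by the cells above.
chip_chr20 = 18131      # OmniExpress markers on chromosome 20
chip_in_1kgp = 17791    # chip positions found in the 1000GP chr20 SNP file
kgp_chr20 = 1739315     # all SNPs in the 1000GP chr20 file

missing = chip_chr20 - chip_in_1kgp
chip_recovered = chip_in_1kgp / chip_chr20
kept_fraction = chip_in_1kgp / kgp_chr20

print(f"chip markers missing from 1000GP: {missing}")
print(f"share of chip markers recovered: {chip_recovered:.2%}")
print(f"share of 1000GP SNPs kept: {kept_fraction:.2%}")
```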

In [11]:
print('Extract the ID, CHROM and POS fields to a file that pandas can read')
%sx cd ~/1000Genomes/data/gt && bcftools query -f '%ID\t%CHROM\t%POS\n' ALL.chr20.snps.gt.vcf.gz > ~/Documents/PhD/1000Genomes/ALL.chr20.snps.gt.ID-CHROM-POSmarkers.txt

Extract the ID, CHROM and POS fields to a file that pandas can read


[]

In [17]:
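# Compare the markers present in both the 1KGP chr20 data set and
# the Illumina OmniExpress bead chip for chr20.

import pandas as pd

# Both inputs are headerless, tab-separated ID/CHROM/POS tables.
# NOTE: InfiniumOmniExpress-chr20-ID-CHROM-POS.txt is not written by any cell shown above;
# it is assumed to hold the ID, CHROM and POS columns of the chr20 chip markers.
m1kgp = pd.read_csv('ALL.chr20.snps.gt.ID-CHROM-POSmarkers.txt', sep='\t', header=None, names=['ID', 'CHROM', 'POS'])
millumina = pd.read_csv('InfiniumOmniExpress-chr20-ID-CHROM-POS.txt', sep='\t', header=None, names=['ID', 'CHROM', 'POS'])

# minter = m1kgp.join(millumina, how="inner", lsuffix='1kgp', rsuffix='illu')
# Inner join on position: keeps the markers whose POS occurs in both tables.
minter = m1kgp.merge(millumina, on='POS', how="inner", suffixes=('1kgp', 'illu'))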
print(minter.head())

print('\nNumber of common markers = {}'.format(len(minter)))
# intersection on ID --> 17778 SNPs only
# there are duplicate IDs

      ID1kgp  CHROM1kgp    POS     IDillu  CHROMillu
0  rs6139074         20  63244  rs6139074         20
1  rs1418258         20  63799  rs1418258         20
2  rs6086616         20  68749  rs6086616         20
3  rs6039403         20  69094  rs6039403         20
4  rs6040395         20  71093  rs6040395         20

Number of common markers = 17791
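
The comment in the cell above notes that intersecting on ID gives fewer markers (17778) than intersecting on POS (17791). The toy sketch below shows one way the two keys can disagree, for example when the same position carries different identifiers in the two sources. All records are made up, `common` is a hypothetical helper, and this is not a statement about the actual cause in this data set.

```python
# Toy (ID, POS) records: every value is invented for illustration.
kgp = [("rs1", 100), ("rs2", 200), ("rs3", 300), ("rs3", 310)]
chip = [("rs1", 100), ("rs2b", 200), ("rs3", 300), ("rs9", 900)]


def common(a, b, key):
    """Return the sorted key values present in both record lists."""
    return sorted({key(r) for r in a} & {key(r) for r in b})


by_pos = common(kgp, chip, key=lambda r: r[1])
by_id = common(kgp, chip, key=lambda r: r[0])

print(len(by_pos), len(by_id))
```

Here position 200 is shared but labelled `rs2` in one list and `rs2b` in the other, so it is counted when matching on POS and lost when matching on ID.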


In [14]:
print('Number of UNIQUE positions present in both the 1000GP data and the Illumina OmniExpress bead chip')
%sx cd ~/1000Genomes/data/gt && bcftools query -f '%POS\n' -R ~/Documents/PhD/1000Genomes/InfiniumOmniExpress-chr20-CHROM-POS.txt ALL.chr20.snps.gt.vcf.gz | uniq -u | wc -l

Number of UNIQUE positions present in both the 1000GP data and the Illumina OmniExpress bead chip


['17791']