# Pooling-imputation performance from real bead chip data with high density markers

Performs pooling simulation and imputation on chromosome 20 data from the 1000 Genomes Project (1000GP).
Markers are the intersection between the markers of the real bead chip Illumina Infinium OmniExpress2.5 - 8 Kit
 (https://support.illumina.com/array/array_kits/humanomniexpress2.5-8-beadchip-kit/downloads.html) 
 and the chr20 1000GP data. The samples are randomly assigned to either the reference panel or the study population.
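
The marker selection and sample assignment described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the positions, the sample identifiers, the study population size and the seed are all hypothetical.

```python
import random

def intersect_markers(chip_positions, panel_positions):
    """Return the sorted (chrom, pos) pairs present in both marker sets."""
    return sorted(set(chip_positions) & set(panel_positions))

def split_samples(sample_ids, n_study, seed=0):
    """Randomly split samples into (reference panel, study population)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    study = sorted(shuffled[:n_study])
    reference = sorted(shuffled[n_study:])
    return reference, study

# Hypothetical positions and sample names, for illustration only
chip = [("20", 61098), ("20", 61270), ("20", 63244)]
panel = [("20", 61098), ("20", 63244), ("20", 65900)]
all_samples = ["S1", "S2", "S3", "S4", "S5"]

shared = intersect_markers(chip, panel)
reference, study = split_samples(all_samples, n_study=2, seed=42)
```

Only markers found in both inputs are kept, and every sample ends up in exactly one of the two groups.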

Apply pooling simulation to this data, then impute with Beagle: 
**pool and impute bead chip markers only, compute metrics and plot statistics.** 
Phasing and imputation are run per sample to try to remove the biased genetic structure in the study population,
which seems to strongly affect clustered genotypes.

In [3]:
import os

# Create the working directory if it does not exist yet
os.makedirs('/home/camille/PoolImpHuman/data/20200722', exist_ok=True)
os.chdir('/home/camille/PoolImpHuman/data/20200722')

In [2]:
print('Configure directory')
%sx ln -s ~/1000Genomes/scripts/VCFPooling/python/omniexpress_20200722.ipynb ./
%sx ln -s ../omniexpress8/InfiniumOmniExpress-chr20-CHROM-POS.txt ./

["ln: failed to create symbolic link './InfiniumOmniExpress-chr20-CHROM-POS.txt': File exists"]
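
The `File exists` message above appears because `ln -s` refuses to overwrite an existing link when the cell is rerun. One possible way to make this step repeatable from Python is sketched below; `ensure_symlink` is a hypothetical helper, not part of the original notebook, and the demonstration uses a temporary directory rather than the real data paths.

```python
import os
import tempfile

def ensure_symlink(target, link_path):
    """Create link_path pointing at target; return True if a link was (re)created."""
    if os.path.islink(link_path):
        if os.readlink(link_path) == target:
            return False  # link already points at the right target
        os.remove(link_path)  # stale link: replace it
    elif os.path.exists(link_path):
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target, link_path)
    return True

# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "markers.txt")
    open(target, "w").close()
    link = os.path.join(tmp, "link.txt")
    first = ensure_symlink(target, link)   # creates the link
    second = ensure_symlink(target, link)  # second run does nothing
    resolved = os.path.realpath(link) == os.path.realpath(target)
```

Rerunning the cell then leaves an existing, correct link untouched instead of reporting an error.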

### Prepare experimental VCF file

The IMP.chr20.pooled.snps and REF files are identical to those of 20200709 (imputation with default parameters).
The IMP.imputed file comes from per-sample processing on UPPMAX.

In [3]:
print('Impute missing genotypes in the pooled file')
# Imputation itself was run per sample on UPPMAX; this cell only rebuilds the index
%sx bcftools index -f IMP.chr20.pooled.imputed.vcf.gz

Impute missing genotypes in the pooled file
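
Reindexing matters because htslib warns when an index file is older than its data file. A small check like the one below could flag that situation before running downstream tools. It is a sketch under assumptions: `index_is_stale` is a hypothetical helper, it assumes the index sits next to the data file with a `.csi` suffix (the default written by `bcftools index`), and the demonstration uses empty placeholder files with artificial timestamps.

```python
import os
import tempfile

def index_is_stale(data_path, index_path=None):
    """Return True if the index is missing or older than the data file."""
    if index_path is None:
        index_path = data_path + ".csi"  # assumed index naming
    if not os.path.exists(index_path):
        return True
    return os.path.getmtime(index_path) < os.path.getmtime(data_path)

# Demonstration with placeholder files and artificial modification times
with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "demo.vcf.gz")
    index = data + ".csi"
    for path in (data, index):
        open(path, "w").close()
    os.utime(index, (1000, 1000))  # index older than data
    os.utime(data, (2000, 2000))
    stale_before = index_is_stale(data)
    os.utime(index, (3000, 3000))  # index refreshed after the data
    stale_after = index_is_stale(data)
```

When the check returns `True`, rerunning `bcftools index -f` on the file refreshes the index.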


In [2]:
print('Plotting results with bcftools stats')
%sx deactivate
# bcftools stats needs python 2.7
%sx ln -s /home/camille/PoolImpHuman/data/20200709/study.population
%sx bcftools stats --af-bins 0.01,0.02,0.04,0.08,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.98 --collapse snps -S study.population IMP.chr20.pooled.imputed.vcf.gz IMP.chr20.snps.gt.vcf.gz > filestats.vchk
%sx plot-vcfstats -p bcftoolstats -s filestats.vchk

Plotting results with bcftools stats


['Parsing bcftools stats output: filestats.vchk',
 '\texpected: # PSC\t[2]id\t[3]sample\t[4]nRefHom\t[5]nNonRefHom\t[6]nHets\t[7]nTransitions\t[8]nTransversions\t[9]nIndels\t[10]average depth\t[11]nSingletons',
 '\tfound:    # PSC\t[2]id\t[3]sample\t[4]nRefHom\t[5]nNonRefHom\t[6]nHets\t[7]nTransitions\t[8]nTransversions\t[9]nIndels\t[10]average depth\t[11]nSingletons\t[12]nHapRef\t[13]nHapAlt\t[14]nMissing',
 'Plotting graphs: python plot.py',
 'Invalid limit will be ignored.',
 '  ax1.set_xlim(0,1.01)',
 'Invalid limit will be ignored.',
 '  ax1.set_xlim(0,1.01)',
 'Creating PDF: pdflatex summary.tex >plot-vcfstats.log 2>&1',
 'Finished: bcftoolstats/summary.pdf']
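
Besides the PDF summary, the text output in `filestats.vchk` can be read directly. As the header lines printed above suggest, comment lines start with `#` and data lines are tab-separated with a section tag (such as `PSC`) in the first field. The parser below is a minimal sketch built on that assumption; `parse_vchk` is a hypothetical helper and the example row contains made-up values.

```python
from collections import defaultdict

def parse_vchk(text):
    """Group tab-separated data lines by their leading section tag."""
    sections = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        sections[fields[0]].append(fields[1:])
    return dict(sections)

# Made-up example following the PSC header layout shown in the output above
example = (
    "# PSC\t[2]id\t[3]sample\t[4]nRefHom\t[5]nNonRefHom\t[6]nHets\n"
    "PSC\t0\tHG01063\t10\t2\t3\n"
)
rows = parse_vchk(example)
psc_sample = rows["PSC"][0][1]
```

With a real file, the text would come from reading `filestats.vchk`, and the available section tags should be checked against the installed bcftools version.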

### Compute results with customized metrics

In [4]:
os.chdir('/home/camille/PoolImpHuman/data/20200722')
%sx python3 -u /home/camille/1000Genomes/src/VCFPooling/poolSNPs/imputation_quality.py ./ IMP.chr20.snps.gt.vcf.gz IMP.chr20.pooled.imputed.vcf.gz /home/camille/1000Genomes/src/VCFPooling/bin/gt_to_gl.sh "chrom:pos"  

['/home/camille/PoolImpHuman',
 '[W::hts_idx_load2] The index file is older than the data file: ./IMP.chr20.pooled.imputed.vcf.gz.csi',
 '[W::hts_idx_load2] The index file is older than the data file: ./IMP.chr20.pooled.imputed.vcf.gz.csi',
 '[W::hts_idx_load2] The index file is older than the data file: ./IMP.chr20.pooled.imputed.vcf.gz.csi',
 '[W::hts_idx_load2] The index file is older than the data file: ./IMP.chr20.pooled.imputed.vcf.gz.csi',
 '[W::hts_idx_load2] The index file is older than the data file: ./IMP.chr20.pooled.imputed.vcf.gz.csi',
 '[W::hts_idx_load2] The index file is older than the data file: ./IMP.chr20.pooled.imputed.vcf.gz.csi',
 '[W::hts_idx_load2] The index file is older than the data file: ./IMP.chr20.pooled.imputed.vcf.gz.csi',
 '[W::hts_idx_load2] The index file is older than the data file: ./IMP.chr20.pooled.imputed.vcf.gz.csi',
 '  np.log(np.asarray(y, dtype=float))',
 '[W::hts_idx_load2] The index file is older than the data file: ./IMP.chr20.pooled.impu

In [6]:
# Verify that the imputation quality plot was created
assert os.path.exists('imputation_quality_gtgl.png')
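
For orientation, one common way to compare true and imputed genotypes is plain concordance, sketched below. This is not necessarily what `imputation_quality.py` computes; the function name and the encoding (ALT allele counts 0, 1, 2, with -1 for missing) are assumptions made for the example.

```python
def genotype_concordance(true_gt, imputed_gt):
    """Fraction of non-missing genotype pairs that are identical."""
    if len(true_gt) != len(imputed_gt):
        raise ValueError("genotype lists must have the same length")
    # Keep only pairs where both genotypes are called (-1 means missing)
    pairs = [(t, i) for t, i in zip(true_gt, imputed_gt) if t >= 0 and i >= 0]
    if not pairs:
        return float("nan")
    matches = sum(1 for t, i in pairs if t == i)
    return matches / len(pairs)

# Made-up genotypes for five markers of one sample
true_gt = [0, 1, 2, 1, 0]
imputed_gt = [0, 1, 1, 1, -1]
concordance = genotype_concordance(true_gt, imputed_gt)  # 3 matches out of 4 called pairs
```

Missing calls are excluded from the denominator here; counting them as errors instead is an equally defensible choice and gives a lower value.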

### Single individual results for HG01063
Same individual as the one used to test Phaser (20200801)

In [None]:
# Create idv files (true and imputed genotypes)
%sx bcftools view -Oz -o sHG01063.IMP.chr20.snps.gt.vcf.gz -s HG01063 IMP.chr20.snps.gt.vcf.gz
%sx bcftools view -Oz -o sHG01063.IMP.chr20.pooled.imputed.vcf.gz -s HG01063 IMP.chr20.pooled.imputed.vcf.gz
%sx bcftools index -f sHG01063.IMP.chr20.snps.gt.vcf.gz
%sx bcftools index -f sHG01063.IMP.chr20.pooled.imputed.vcf.gz

# if bcftools is configured for python 2.7 usage 
print('Plotting results with bcftools stats')
%sx deactivate
# bcftools stats needs python 2.7
%sx bcftools stats --af-bins 0.01,0.02,0.04,0.08,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,0.98 --collapse snps -s HG01063 sHG01063.IMP.chr20.pooled.imputed.vcf.gz sHG01063.IMP.chr20.snps.gt.vcf.gz > filestats.vchk
%sx plot-vcfstats -p bcftoolstats -s filestats.vchk
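
If the same extraction has to be repeated for other individuals, the `bcftools view` and `bcftools index` calls above could be generated programmatically. The sketch below only builds the argument lists and does not run anything; `sample_subset_commands` is a hypothetical helper that reuses the `s<sample>.` output prefix from the cell above.

```python
def sample_subset_commands(sample, vcf_names):
    """Build bcftools view/index argument lists for one sample and several VCF files."""
    commands = []
    for name in vcf_names:
        out = f"s{sample}.{name}"  # same naming scheme as in the cell above
        commands.append(["bcftools", "view", "-Oz", "-o", out, "-s", sample, name])
        commands.append(["bcftools", "index", "-f", out])
    return commands

commands = sample_subset_commands(
    "HG01063",
    ["IMP.chr20.snps.gt.vcf.gz", "IMP.chr20.pooled.imputed.vcf.gz"],
)
```

Each list can then be passed to `subprocess.run(cmd, check=True)` in an environment where bcftools is installed.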