# Genomics Queries on 1000 Genomes dataset

### Set instance and token

In [1]:
suppressMessages(library(tidyverse))
library(integrationCurator) # Genestack client library

Sys.setenv(PRED_SPOT_HOST = 'occam.genestack.com',
           PRED_SPOT_TOKEN = '<token>',
           PRED_SPOT_VERSION = 'default-released')

“package ‘dplyr’ was built under R version 3.6.3”

### Get samples

In [2]:
start = Sys.time()
samples <- OmicsQueriesApi_search_samples(
    study_filter='genestack:accession=GSF535886',
    sample_filter='"Species Or Strain"="British" OR "Species Or Strain"="Finnish"'
)$content$data[['metadata']]
cat(sprintf('Time to get %s samples: %s seconds\n\n', nrow(samples), round(as.numeric(difftime(Sys.time(), start, units='secs')))))

samples = samples[,c('genestack:accession', 'Sample Source ID', 'Species Or Strain')]
head(samples)

Time to get 182 samples: 0 seconds



genestack:accession,Sample Source ID,Species Or Strain
GSF535900,HG00111,British
GSF535899,HG00110,British
GSF535902,HG00113,British
GSF535901,HG00112,British
GSF535896,HG00106,British
GSF535895,HG00104,British


### Get variants
The query below retrieves, for the selected samples, SNPs within a given genomic region that are annotated with an allele frequency greater than 0.1% (`info_AF=(0.001:1)`).

It is possible to filter the variants by:
- Genomic region, e.g. Intervals=4:142142000-142143000
- Reference / Alteration alleles, e.g. Reference=T
- Variant ID, e.g. VariationId=rs79011024
- Numerical or categorical INFO fields, e.g. info_AF=(0.001:1), info_SNPSOURCE=LOWCOV
- Minimum number of alternative alleles in the sample genotypes, e.g. AllelesNumber=1 will filter for 0|1, 1|0, 1|1 genotypes
- Variant type, e.g. Type=SNP

See the example query parameters [here](https://swagger.occam.genestack.com/integrationCurator/#/Omics%20queries/searchVariantData), under *vxQuery*.
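
The filters listed above are passed as one string of space-separated `Key=value` terms. As a minimal sketch, assuming that format holds for every filter, the string can be assembled from named arguments. `build_vx_query` is a hypothetical helper written for this notebook, not a function of `integrationCurator`.

```r
# Hypothetical helper (not part of integrationCurator): joins named filters
# into a space-separated "Key=value" string suitable for vx_query.
build_vx_query <- function(...) {
  filters <- list(...)
  # Every filter must be passed as a named argument, e.g. Type = 'SNP'
  stopifnot(length(filters) > 0,
            !is.null(names(filters)),
            all(names(filters) != ''))
  paste(sprintf('%s=%s', names(filters), unlist(filters)), collapse = ' ')
}

vx_query <- build_vx_query(
  Intervals = '4:142142000-142143000',
  Type      = 'SNP',
  info_AF   = '(0.001:1)'
)
print(vx_query)
```

The resulting string can be supplied as the `vx_query` argument of `OmicsQueriesApi_search_variant_data`.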

In [3]:
start = Sys.time()
variants = OmicsQueriesApi_search_variant_data(
    study_filter='genestack:accession=GSF535886',
    sample_filter='"Species Or Strain"="British" OR "Species Or Strain"="Finnish"',
    vx_query='Intervals=4:142142000-142143000 Type=SNP info_AF=(0.001:1)',
    page_limit=20000
)$content$data
cat(sprintf('Time to get %s variants: %s seconds\n\n', nrow(variants), round(as.numeric(difftime(Sys.time(), start, units='secs')))))

variants = cbind(
    'SampleID'=variants$relationships$sample,
    VariantID=variants$variationId,
    Chr=variants$contig,
    Start=variants$start,
    Ref=variants$reference,
    Alt=variants$alteration,
    do.call(cbind, variants$genotype),
    do.call(cbind, variants$info)
) %>% as_tibble %>% mutate_all(function(x) map_chr(x, ~paste(., collapse=', ')))

head(variants)

Time to get 1267 variants: 1 seconds



SampleID,VariantID,Chr,Start,Ref,Alt,sampleNames,GL,GT,DS,...,AF,ERATE,AFR_AF,AN,VT,LDAF,THETA,ASN_AF,AMR_AF,EUR_AF
GSF535888,rs79011024,4,142142126,T,C,HG00096,"-0.18,-0.48,-2.47",0|0,0.0,...,0.0018,0.0003,0.01,2184,SNP,0.0022,0.0008,,,
GSF535889,rs79011024,4,142142126,T,C,HG00097,"-0.04,-1.06,-5.00",0|0,0.0,...,0.0018,0.0003,0.01,2184,SNP,0.0022,0.0008,,,
GSF535890,rs79011024,4,142142126,T,C,HG00099,"-0.04,-1.09,-5.00",0|0,0.0,...,0.0018,0.0003,0.01,2184,SNP,0.0022,0.0008,,,
GSF535891,rs79011024,4,142142126,T,C,HG00100,"-0.03,-1.21,-5.00",0|0,0.0,...,0.0018,0.0003,0.01,2184,SNP,0.0022,0.0008,,,
GSF535892,rs79011024,4,142142126,T,C,HG00101,"-0.00,-3.52,-5.00",0|0,0.0,...,0.0018,0.0003,0.01,2184,SNP,0.0022,0.0008,,,
GSF535893,rs79011024,4,142142126,T,C,HG00102,"-0.01,-1.65,-5.00",0|0,0.0,...,0.0018,0.0003,0.01,2184,SNP,0.0022,0.0008,,,


### Count genotypes by sample groups and variant IDs

In [4]:
x = inner_join(samples, variants, by=c("genestack:accession"="SampleID"))
x %>% group_by(VariantID, `Species Or Strain`, GT, AF) %>% tally()

VariantID,Species Or Strain,GT,AF,n
rs10023024,British,0|0,0.01,88
rs10023024,Finnish,0|0,0.01,93
rs13147597,British,0|0,0.0014,88
rs13147597,Finnish,0|0,0.0014,93
rs146362755,British,0|0,0.01,86
rs146362755,British,0|1,0.01,1
rs146362755,British,1|0,0.01,1
rs146362755,Finnish,0|0,0.01,93
rs148317497,British,0|0,0.0037,88
rs148317497,Finnish,0|0,0.0037,93
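
The genotype tallies above can be turned into a per-group alternative allele frequency. The sketch below does this for rs146362755 using the counts printed in the output. It assumes diploid genotype calls, and the names `tally`, `count_alt` and `freq` are illustrative only.

```r
# Genotype tallies for rs146362755, copied from the output above
tally <- data.frame(
  group = c('British', 'British', 'British', 'Finnish'),
  GT    = c('0|0', '0|1', '1|0', '0|0'),
  n     = c(86, 1, 1, 93),
  stringsAsFactors = FALSE
)

# Number of non-reference alleles in a genotype string such as "0|1" or "0/1"
count_alt <- function(gt) {
  sapply(strsplit(gt, '[|/]'), function(a) sum(a != '0' & a != '.'))
}

tally$alt_alleles   <- count_alt(tally$GT) * tally$n
tally$total_alleles <- 2 * tally$n  # assumption: every call is diploid

# Sum allele counts per group, then divide to get the frequency
freq <- aggregate(cbind(alt_alleles, total_alleles) ~ group, data = tally, FUN = sum)
freq$alt_freq <- freq$alt_alleles / freq$total_alleles
print(freq)
```

In this sample set the British group carries 2 alternative alleles out of 176 and the Finnish group carries none; the annotated `AF` of 0.01 comes from the variant's INFO field rather than from these selected samples.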


### Get samples with certain variants
The query below finds samples with at least one alternative allele for variant rs17007017, i.e. samples with 0|1, 1|0 or 1|1 genotypes.

In [5]:
start = Sys.time()
samples = OmicsQueriesApi_search_samples(
    study_filter='genestack:accession=GSF535886',
    sample_filter='"Species Or Strain"="British" OR "Species Or Strain"="Finnish"',
    vx_query='VariationId=rs17007017 AllelesNumber=1',
    page_limit=20000
)$content$data[['metadata']]
cat(sprintf('Time to get %s samples: %s seconds\n\n', nrow(samples), round(as.numeric(difftime(Sys.time(), start, units='secs')))))

head(samples[,c('genestack:accession', 'Sample Source ID', 'Species Or Strain')])
samples %>% group_by(`Species Or Strain`) %>% tally()

Time to get 72 samples: 0 seconds



genestack:accession,Sample Source ID,Species Or Strain
GSF536014,HG00312,Finnish
GSF536004,HG00278,Finnish
GSF536003,HG00277,Finnish
GSF536000,HG00274,Finnish
GSF535999,HG00273,Finnish
GSF536001,HG00275,Finnish


Species Or Strain,n
British,40
Finnish,32
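
To put the carrier counts above in context, they can be expressed as a fraction of each group. This is a rough sketch: the group sizes (88 British, 93 Finnish) are an assumption taken from the genotype tally earlier in this notebook, not from a query for rs17007017 itself.

```r
# Carrier counts for rs17007017, copied from the output above
carriers <- c(British = 40, Finnish = 32)

# Assumption: group sizes as seen in the earlier genotype tally
group_n <- c(British = 88, Finnish = 93)

# Fraction of samples in each group with at least one alternative allele
carrier_fraction <- round(carriers / group_n[names(carriers)], 3)
print(carrier_fraction)
```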
