Data generator

Dawid Wysakowicz edited this page Jan 16, 2016 · 6 revisions

To generate samples with random variants based on values from the dbNSFP database, you can use the provided Spark job.

How to build

In the folder ${project_root}/samplesgenerator, run the command sbt assembly

The resulting jar will then be located at ${project_root}/samplesgenerator/target/scala-2.10/gate-generate-samples-assembly-1.0.jar

Prerequisites

Variants and frequencies

A table with variants and their frequencies must be present in Hive (we used the dbNSFP database to generate one), with the following columns:

+--------------+------------+
|   col_name   | data_type  |
+--------------+------------+
| reference    | string     |
| alternative  | string     |
| hg19_chr     | string     |
| hg19_pos     | int        |
| exac_ac      | string     |
| exac_af      | double     |
| exac_adj_ac  | string     |
| exac_adj_af  | double     |
| exac_afr_ac  | string     |
| exac_afr_af  | double     |
| exac_amr_ac  | string     |
| exac_amr_af  | double     |
| exac_eas_ac  | string     |
| exac_eas_af  | double     |
| exac_fin_ac  | string     |
| exac_fin_af  | double     |
| exac_nfe_ac  | string     |
| exac_nfe_af  | double     |
| exac_sas_ac  | string     |
| exac_sas_af  | double     |
| mean         | double     |
+--------------+------------+

Countries population

You must provide a file listing the countries to choose from, together with their populations. Example format:

4,AF,Afghanistan,Asia,31627506
8,AL,Albania,Europe,2894475
10,AQ,Antarctica,other,0
12,DZ,Algeria,Africa,38934334
16,AS,American Samoa,Oceania,55434
20,AD,Andorra,Europe,72786
24,AO,Angola,Africa,24227524
28,AG,Antigua and Barbuda,Americas,90900
31,AZ,Azerbaijan,Asia,9537823
32,AR,Argentina,Americas,42980026
36,AU,Australia,Oceania,23490736
40,AT,Austria,Europe,8534492

The columns are, in order: country id, country code, country name, region name, population.
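As an illustration of the format above, here is a minimal Python sketch that reads such a file into typed records. It is not part of the Spark job; the names `Country` and `parse_countries` are hypothetical, and the sketch assumes the file has no header row and no quoted commas beyond what the standard `csv` module handles.

```python
import csv
import io
from typing import NamedTuple


class Country(NamedTuple):
    # Field order follows the documented column order.
    country_id: int
    code: str
    name: str
    region: str
    population: int


def parse_countries(lines):
    """Parse rows of: country id, country code, country name, region name, population."""
    return [
        Country(int(row[0]), row[1], row[2], row[3], int(row[4]))
        for row in csv.reader(lines)
        if row  # skip blank lines
    ]


# A few rows copied from the example above.
sample = """4,AF,Afghanistan,Asia,31627506
8,AL,Albania,Europe,2894475
10,AQ,Antarctica,other,0
"""
countries = parse_countries(io.StringIO(sample))
```

The same function accepts an open file handle, for example `parse_countries(open(path, newline=""))`.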

Usage

Usage: spark-submit <spark-options> pl.edu.pw.elka.GenerateDataJob -a [AnnotationTable] 
                    -d <CountryDictionaryPath> -s <SamplesNumber> -o <OutputPath>

        -a, --annotations-table  <arg>   Name of table in hive that contains dbNSFP
                                         annotations. (default = ANNOTATIONS)
        -d, --dict-path  <arg>           Path to dictionary of countries with their
                                         population for generating samples.
        -o, --output-path  <arg>         Where to store the ocr files with generated
                                         variants.
        -s, --samples-number  <arg>      Number of samples to generate.
        
        --help                           Show help message

Algorithm

Sample origin

First, a country of origin is drawn for each sample. The countries are divided into regions for which separate allele frequencies are present in the ExAC database: Africa, Americas, Europe, Finnish, South Asian, and East Asian. Each region is equally likely to be chosen. After a region is chosen, a country within it is drawn with probability weighted by the countries' populations.
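The two-stage draw described above can be sketched in Python as follows. This is an illustrative sketch, not the job's actual code: the function name `draw_country` is hypothetical, the region labels are taken as they appear in the input (how the job maps them to ExAC populations is not shown here), and dropping countries with zero population is an assumption made so that every remaining region has something to draw.

```python
import random
from collections import defaultdict


def draw_country(countries, rng):
    """Draw (region, country) from an iterable of (name, region, population) tuples."""
    by_region = defaultdict(list)
    for name, region, population in countries:
        if population > 0:  # assumption: zero-population entries are never drawn
            by_region[region].append((name, population))
    # Stage 1: every region has the same probability.
    region = rng.choice(sorted(by_region))
    # Stage 2: within the region, weight countries by population.
    names, weights = zip(*by_region[region])
    return region, rng.choices(names, weights=weights, k=1)[0]


countries = [
    ("Albania", "Europe", 2894475),
    ("Austria", "Europe", 8534492),
    ("Algeria", "Africa", 38934334),
    ("Angola", "Africa", 24227524),
    ("Antarctica", "other", 0),
]
rng = random.Random(42)  # fixed seed so the sketch is reproducible
draws = [draw_country(countries, rng) for _ in range(10000)]
```

With this scheme a small region is sampled as often as a large one, so a country's overall chance depends on its share of its own region's population rather than of the world's.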

Variants

For each variant from the dbNSFP database, a genotype is drawn based on the appropriate allele frequency (af). The probabilities of the genotypes are as follows:

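+-----------+----------------------+
| Genotype  | Probability          |
+-----------+----------------------+
| 0/0       | 1 - 2 * af + af ^ 2  |
| 0/1       | 2 * af * (1 - af)    |
| 1/1       | af ^ 2               |
+-----------+----------------------+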
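The genotype probabilities above are the Hardy-Weinberg proportions for an alternative allele with frequency af, and they sum to 1 for any af between 0 and 1. A minimal Python sketch of the draw, with the hypothetical helper names `genotype_probabilities` and `draw_genotype`, might look like this; it illustrates the formula rather than reproducing the Spark job's code.

```python
import random


def genotype_probabilities(af):
    """Return the probability of each genotype for alternative-allele frequency af."""
    return {
        "0/0": 1 - 2 * af + af ** 2,  # equal to (1 - af) ** 2
        "0/1": 2 * af * (1 - af),
        "1/1": af ** 2,
    }


def draw_genotype(af, rng):
    """Draw one genotype according to the probabilities above."""
    probs = genotype_probabilities(af)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
probs = genotype_probabilities(0.1)
genotype = draw_genotype(0.1, rng)
```

For af = 0.1, the three probabilities are approximately 0.81, 0.18, and 0.01, so most samples are homozygous reference at a rare variant.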

Then, for each variant whose genotype is not homozygous reference, the allele depth and total depth are drawn so that they have the same mean as in ExAC.