# Simulate genotype data

Following the work of [P. Casale](https://www.ebi.ac.uk/sites/ebi.ac.uk/files/shared/documents/phdtheses/Casale-Thesis.pdf), we simulated cohorts of different sample sizes from the genotype data of the 1000 Genomes Project (Phase 3). Each synthetic individual is generated in two steps:

1. We randomly select A ancestors from the 1000 Genomes samples.
2. We build the new genotype as a mosaic of blocks of 1,000 variants, copying each block from the corresponding block of one of the A ancestors (selected at random).
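
The two steps can be sketched on toy data as follows. This is a minimal illustration rather than the notebook's own implementation: the function name `simulate_mosaic`, the in-memory `toy` dictionary and the block size of 2 are assumptions made for the example. Ancestors are drawn with replacement, which matches the independent `randint` draws used later in this notebook.

```python
import random

def simulate_mosaic(genotypes, ancestors, blocksize, rng):
    """Build one synthetic genotype as a mosaic of ancestor blocks.

    genotypes: dict mapping sample ID -> list of per-variant genotypes
    ancestors: list of the A ancestor IDs chosen for this individual
    """
    n_variants = len(genotypes[ancestors[0]])
    mosaic = []
    for start in range(0, n_variants, blocksize):
        # Each block is copied whole from one ancestor picked at random
        donor = rng.choice(ancestors)
        mosaic.extend(genotypes[donor][start:start + blocksize])
    return mosaic

# Toy data (hypothetical): 3 samples, 6 variants, alternate-allele counts
toy = {
    "s1": [0, 0, 0, 0, 0, 0],
    "s2": [1, 1, 1, 1, 1, 1],
    "s3": [2, 2, 2, 2, 2, 2],
}
rng = random.Random(123)

# Step 1: select A = 2 ancestors (independent draws, so repeats are possible)
A = 2
ancestors = [rng.choice(sorted(toy)) for _ in range(A)]

# Step 2: copy blocks of 2 variants from randomly chosen ancestors
new_gt = simulate_mosaic(toy, ancestors, blocksize=2, rng=rng)
print(ancestors, new_gt)
```

Because every block comes from a single ancestor, the genotypes inside a block are always consistent with one real haplotype background. Only the block boundaries introduce new combinations.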

## Genotype data for runtime experiments

For runtime experiments, we generated cohorts of 500, 1K, 2K, 5K, 10K, 20K, 50K, 100K, 200K and 500K individuals. We first assigned each individual to one of the 26 populations in the 1000 Genomes Project and then sampled A = 10 ancestors from that population. This yields cohorts of unrelated individuals that preserve the population structure of the original data.

## Genotype data for calibration experiments

For calibration experiments, we generated three cohorts with 1,000 individuals and different genetic structures, considering only European populations (CEU, FIN, GBR, IBS and TSI).

- Population structure (*simPopStructure*).
    We simulated population structure by (i) assigning each newly synthesised individual to one of the five European populations (CEU, FIN, GBR, IBS and TSI) and (ii) considering A = 10 ancestors from that population.

- Unrelated individuals (*simUnrelated*).
    For each synthesised individual, we considered A = 10 ancestors, which were randomly sampled from the whole set of Europeans.

- Related individuals (*simRelated*).
    For each synthesised individual, we considered A = 2 ancestors, again sampled from the whole set of Europeans.
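
The three cohorts differ only in the number of ancestors and in whether ancestors are confined to a single population. The sketch below makes that explicit. The values of `A` come from the descriptions above, while the `SCENARIOS` dictionary, the `sample_ancestors` helper and the toy sample IDs are hypothetical names introduced for illustration.

```python
import random

# Parameters taken from the cohort descriptions; the dictionary layout is illustrative
SCENARIOS = {
    "simPopStructure": {"A": 10, "popstr": True},
    "simUnrelated":    {"A": 10, "popstr": False},
    "simRelated":      {"A": 2,  "popstr": False},
}

def sample_ancestors(samples_by_pop, n, A, popstr, rng):
    """Return {individual index: list of A ancestor IDs}, sampled with replacement."""
    pops = sorted(samples_by_pop)
    everyone = [s for pop in pops for s in samples_by_pop[pop]]
    ancestors = {}
    for i in range(n):
        # With population structure, all ancestors come from one randomly chosen population
        pool = samples_by_pop[rng.choice(pops)] if popstr else everyone
        ancestors[i] = [rng.choice(pool) for _ in range(A)]
    return ancestors

# Hypothetical sample IDs, two per European population, for illustration only
toy_pops = {p: [f"{p}_{k}" for k in range(2)] for p in ["CEU", "FIN", "GBR", "IBS", "TSI"]}
rng = random.Random(0)
cohorts = {name: sample_ancestors(toy_pops, n=3, rng=rng, **params)
           for name, params in SCENARIOS.items()}
```

With only A = 2 ancestors drawn from the same pool, two synthetic individuals are far more likely to share an ancestor, and therefore long identical blocks, than with A = 10. That is what induces relatedness in *simRelated*.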

In [1]:
import pandas as pd
from random import randint, seed
import gzip

In [2]:
# Read metadata (read_csv already returns a DataFrame)
df = pd.read_csv('../data/metadata.tsv', sep='\t')

# Remove NaN columns
df = df.dropna(axis = 1)

df.head()

Unnamed: 0,sample,pop,super_pop,gender
0,HG00096,GBR,EUR,male
1,HG00097,GBR,EUR,female
2,HG00099,GBR,EUR,female
3,HG00100,GBR,EUR,female
4,HG00101,GBR,EUR,male


### *simUnrelated*

In [49]:
# Define simulation parameters
A = 10
n = 1000
blocksize = 1000
eur = True
popstr = False

ancestors = dict()

seed(123)
# Restrict the candidate ancestors to Europeans when requested
pool = df[df['super_pop'] == 'EUR'] if eur else df
pops = list(pool['pop'].unique())
all_inds = list(pool['sample'])
for i in range(n):
    if popstr:
        # Draw one population, then take all ancestors from it
        pop = pops[randint(0, len(pops)-1)]
        inds = list(pool.loc[pool['pop'] == pop, 'sample'])
    else:
        inds = all_inds
    # Ancestors are drawn independently, so the same sample can appear twice
    ancestors[i] = [inds[randint(0, len(inds)-1)] for _ in range(A)]
ancestors

{0: ['HG00125',
  'HG00326',
  'HG00146',
  'NA12878',
  'HG01536',
  'HG00325',
  'HG00171',
  'NA20538',
  'NA20770',
  'NA20756'],
 1: ['HG00117',
  'HG01512',
  'HG01779',
  'HG02223',
  'HG00367',
  'HG00372',
  'NA20581',
  'HG00125',
  'HG00240',
  'HG00188'],
 2: ['HG00369',
  'HG02223',
  'HG00367',
  'NA12399',
  'HG00309',
  'HG00243',
  'HG00096',
  'NA20772',
  'HG01620',
  'NA20502'],
 3: ['HG00146',
  'NA20758',
  'NA07051',
  'HG01510',
  'HG00136',
  'HG00100',
  'HG00356',
  'NA12762',
  'HG01630',
  'HG00158'],
 4: ['NA20772',
  'HG00120',
  'HG00150',
  'NA12154',
  'HG00231',
  'HG00182',
  'NA20510',
  'NA20773',
  'HG00108',
  'HG00339'],
 5: ['NA20775',
  'HG01617',
  'HG02235',
  'HG01685',
  'HG00324',
  'HG01679',
  'NA20536',
  'HG00116',
  'NA12878',
  'HG00349'],
 6: ['HG00373',
  'HG01768',
  'NA20516',
  'HG01695',
  'HG00268',
  'NA20528',
  'NA11829',
  'NA11933',
  'HG01771',
  'HG02230'],
 7: ['NA20520',
  'NA20812',
  'HG00356',
  'HG00103',
  'HG01

In [24]:
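def formatGT(genotypes):
    # Convert phased VCF genotypes to alternate-allele counts ("0", "1" or "2").
    # The list is modified in place; any other value is left unchanged.
    for i, gt in enumerate(genotypes):
        if gt == '0|0':
            genotypes[i] = "0"
        elif gt == '1|0' or gt == '0|1':
            genotypes[i] = "1"
        elif gt == '1|1':
            genotypes[i] = "2"
    return genotypes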
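
`formatGT` overwrites its input list. Where the original genotype strings are still needed, a lookup table that returns a new list is one possible alternative. The names `GT_TO_DOSAGE` and `format_gt_copy` are hypothetical and not part of the notebook; the mapping itself is the one encoded in `formatGT`.

```python
# Lookup table from phased biallelic genotype strings to alternate-allele counts
GT_TO_DOSAGE = {"0|0": "0", "0|1": "1", "1|0": "1", "1|1": "2"}

def format_gt_copy(genotypes):
    """Return a new list of allele counts; unrecognised values pass through unchanged."""
    return [GT_TO_DOSAGE.get(gt, gt) for gt in genotypes]

calls = ["0|0", "1|0", "0|1", "1|1", ".|."]
dosages = format_gt_copy(calls)
print(dosages)  # ['0', '1', '1', '2', '.|.']
```

As in `formatGT`, a value outside the four biallelic cases (here the missing call `.|.`) is passed through untouched rather than raising an error.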
    

In [57]:
%%timeit -n 1 -r 1
# import cProfile, pstats
# pr = cProfile.Profile()
# pr.enable()

with gzip.open('../data/genotypes.vcf.gz', 'rt') as fh:
    for line in fh:
        if line.startswith('##'):
            # Skip meta-information lines
            continue
        elif line.startswith('#CHROM'):
            # Read the header; sample IDs follow the nine fixed VCF columns
            header = line.strip().split("\t")
            count = 0
            # print ('\t'.join(header[0:9]) + '\t' + '\t'.join(['sim' + str(i) for i in range(n)]), '\n')
        else:
            # At the start of every block of variants, pick a new ancestor for each individual
            if count % blocksize == 0:
                ancestors_block_idx = [header.index(ancestors[i][randint(0, A-1)]) for i in range(n)]
            line = line.strip().split('\t')
            # Obtain genotypes corresponding to the selected ancestors
            gt = [line[idx] for idx in ancestors_block_idx]
            # print('\t'.join(line[0:9]) + '\t' + '\t'.join(gt) + '\n')
            # Use CHROM_POS as the variant ID when the ID field is missing
            if line[2] == ".":
                snp = line[0] + "_" + line[1]
            else:
                snp = line[2]
            # The row is only built, not written: this cell just times the loop
            row = snp + ", " + ", ".join(formatGT(gt))

            count += 1
            # Stop after the first 100,000 variants
            if count == 100000:
                break

# pr.disable()
# pr.print_stats()

26.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
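
At the start of each block, the loop calls `header.index` once per simulated individual, and each call scans the header linearly. How much of the time above this accounts for was not measured here, but the cost grows with the cohort size. One possible alternative, sketched under the assumption that sample IDs in the header are unique, is to build a name-to-column dictionary once. The helper names `build_column_index` and `pick_block_columns` and the toy header are hypothetical.

```python
import random

def build_column_index(header):
    """Map each VCF column name to its position, so lookups avoid list.index."""
    return {name: idx for idx, name in enumerate(header)}

def pick_block_columns(ancestors, col_of, rng):
    """For each simulated individual, pick one ancestor and return its column index."""
    return [col_of[rng.choice(ancestors[i])] for i in range(len(ancestors))]

# Toy header: the nine fixed VCF columns followed by hypothetical sample IDs
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
          "sA", "sB", "sC"]
ancestors = {0: ["sA", "sB"], 1: ["sC", "sC"]}

col_of = build_column_index(header)   # built once, after reading the header
cols = pick_block_columns(ancestors, col_of, random.Random(1))  # once per block
```

Each dictionary lookup takes constant time on average, so the per-block cost depends on the number of simulated individuals but no longer on the number of samples in the VCF.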
