
In [0]:
import numpy as np
from sklearn.preprocessing import normalize
np.random.seed(2)

# Simulate a set of variants and their associated phenotypes

In [0]:
NUM_SAMPLES=5000
NUM_VARIANTS=10000
NUM_PHENOTYPES=100

In [0]:
def genotype_simulator(num_samples, num_variants=1000):
    """Return a (num_samples, num_variants) float32 matrix of random genotypes in {0, 1, 2}."""
    return np.random.randint(3, size=(num_samples, num_variants)).astype(np.float32)

In [0]:
def phonetype_simulator(num_phenotypes, genotype):
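    """
    Simulate phenotypes as a normalized linear function of genotype plus noise and a per-phenotype bias.
    num_phenotypes = the number of phenotypes per sample
    genotype = an NxM matrix where N = number of individuals and M = number of variants.
               Values: 0 (no variant), 1 (heterozygous), 2 (homozygous)
    Returns an N x num_phenotypes float32 matrix.
    """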
    coeffs = np.random.normal(size=(genotype.shape[1], num_phenotypes))
    coeffs -= coeffs.min()
    phenotypes = genotype @ coeffs
    phenotypes = normalize(phenotypes, axis=0, norm='max')
    noise = np.random.uniform(0,0.3, size=(genotype.shape[0], num_phenotypes))
    phenotypes += noise
    phenotypes = normalize(phenotypes, axis=0, norm='max')
    bias = np.random.uniform(-1.0,1.0,size=num_phenotypes)
    phenotypes += bias
    phenotypes = normalize(phenotypes, axis=0, norm='max')
    return phenotypes.astype(np.float32)

In [0]:
genomes = genotype_simulator(NUM_SAMPLES, NUM_VARIANTS)
print(genomes.shape)

(5000, 10000)


In [0]:
phenotypes = phonetype_simulator(NUM_PHENOTYPES, genomes)
print(phenotypes.shape)

(5000, 100)


# Exercise
1. Using PyTorch, write a multivariate linear regression model for predicting phenotype from genotype. A good place to start is [here](https://towardsdatascience.com/linear-regression-with-pytorch-eb6dedead817)
2. How does performance on GPU scale with the number of samples and the number of phenotypes?
3. How does performance on GPU differ from performance on CPU at different matrix sizes?
4. Can this problem be solved using a deep learning (DL) model?
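
As a possible starting point for exercises 1 to 3, the sketch below fits a single linear layer mapping genotype to phenotype with full-batch gradient descent, and times the fit on a chosen device. It is one way to approach the problem under stated assumptions, not a reference solution: the function names `fit_linear_regression` and `time_fit`, the Adam optimizer, the learning rate, and the epoch counts are all illustrative choices that do not come from the notebook.

```python
import time

import numpy as np
import torch


def fit_linear_regression(genotype, phenotype, epochs=50, lr=0.01, device="cpu"):
    """Fit phenotype ~ genotype @ W + b (hypothetical helper, not from the notebook).

    Returns the trained model and the list of per-epoch MSE losses.
    """
    x = torch.as_tensor(genotype, dtype=torch.float32, device=device)
    y = torch.as_tensor(phenotype, dtype=torch.float32, device=device)
    # One linear layer with as many outputs as phenotypes = multivariate linear regression.
    model = torch.nn.Linear(x.shape[1], y.shape[1]).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimizer choice
    loss_fn = torch.nn.MSELoss()
    history = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        # .item() copies the loss to the host, which also waits for the device to finish.
        history.append(loss.item())
    return model, history


def time_fit(num_samples, num_variants, num_phenotypes, device="cpu", epochs=5):
    """Time a short fit on random data of the given size (hypothetical helper)."""
    rng = np.random.default_rng(0)
    genotype = rng.integers(3, size=(num_samples, num_variants)).astype(np.float32)
    phenotype = rng.random((num_samples, num_phenotypes), dtype=np.float32)
    start = time.perf_counter()
    fit_linear_regression(genotype, phenotype, epochs=epochs, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # make sure all queued GPU work is included
    return time.perf_counter() - start


# Small smoke run; use the GPU only if one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"{device}: {time_fit(500, 200, 10, device=device):.4f} s")
```

To explore exercises 2 and 3, `time_fit` could be called in a loop over a range of `num_samples` and `num_phenotypes` values, once with `device="cpu"` and once with `device="cuda"`. The first CUDA call usually includes one-off initialization cost, so a warm-up call before timing is advisable.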