# Feature selection in high-dimensional genetic data

# Notebook 1: Classical GWAS

## Introduction

The goal of this practical session is to manipulate high-dimensional, low sample-size data that is typical of many genetic applications.

Here we will work with GWAS data from _Arabidopsis thaliana_, which is a plant model organism (https://upload.wikimedia.org/wikipedia/commons/6/6f/Arabidopsis_thaliana.jpg).

The genotypes are described by **Single Nucleotide Polymorphisms (SNPs)**. Our goal is to use this data to identify regions of the genome that can be linked with various growth and flowering traits (**phenotypes**).

In [None]:
%pylab inline 
# imports numpy as np and matplotlib.pyplot as plt

In [None]:
plt.rc('font', **{'size': 14}) # font size for text on plots

## Data description

* `data/athaliana_small.X.txt` is the design matrix, with as many rows as samples and as many columns as SNPs.
* the SNPs are given (in order) in `data/athaliana_small.snps.txt`. 
* the samples are given (in order) in `data/athaliana.samples.txt`.

* the transformed phenotypes are given in `data/athaliana.4W.pheno` and `data/athaliana.2W.pheno`. The first column is the sample's ID, and the second the phenotype.

* `data/athaliana.candidates.txt` contains a list of _A. thaliana_ genes known or strongly suspected to be associated with flowering times.

* the feature network is in `data/athaliana_small.W.txt`. It has been saved as 3 arrays, corresponding to the row, col, and data attributes of a [scipy.sparse coo_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html).
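
To illustrate the row/col/data representation mentioned above, here is a minimal sketch that builds a small `coo_matrix` from three toy arrays. The values are made up for illustration and do not come from `data/athaliana_small.W.txt`, whose exact on-disk layout is not assumed here.

```python
import numpy as np
from scipy import sparse

# Toy triplets (made up): entry k means W[row[k], col[k]] = data[k]
row = np.array([0, 1, 1, 2])
col = np.array([1, 0, 2, 1])
data = np.array([1.0, 1.0, 1.0, 1.0])

# Build a 3 x 3 sparse matrix from the three arrays
W_toy = sparse.coo_matrix((data, (row, col)), shape=(3, 3))

# Dense view, only sensible for small matrices
print(W_toy.toarray())
```

Only the non-zero entries are stored, which is what makes this format practical for a network over many SNPs.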

## Loading the data
We will start by working without the feature network, on the 2W phenotype.

In [None]:
# Load the SNP names (the with block closes the file automatically)
with open('data/athaliana_small.snps.txt') as f:
    snp_names = f.readline().split()
print(len(snp_names))

In [None]:
# Load the design matrix -- this can take time!
X = np.loadtxt('data/athaliana_small.X.txt',  # file name
               dtype=int)  # values are integers

__Q: How many samples are there in the data? How many SNPs are there?__

In [None]:
# Answer


In [None]:
p = X.shape[1]

#### Loading the samples

In [None]:
samples = list(np.loadtxt('data/athaliana.samples.txt', # file name
                         dtype=int)) # values are integers
print(len(samples))

#### Loading the 2W phenotype 

The 2W phenotype is the number of days required for the flower stalk to reach 5 cm, when plants have been grown at 23°C with 16 hours of daylight, after being vernalized for 2 weeks at 5°C with 8 hours of daylight.

In [None]:
import pandas as pd

In [None]:
df_2W = pd.read_csv('data/athaliana.2W.pheno', # file name
                 header=None, # columns have no header
                 sep=r'\s+', # columns are separated by white space (delim_whitespace is deprecated in recent pandas)
                 index_col=0) # read the first column as index

In [None]:
# Create vector of sample IDs
samples_with_phenotype_2W = list(df_2W.index)
print(len(samples_with_phenotype_2W), "samples have a 2W phenotype")

# Create vector of phenotypes
y_2W = df_2W[1].to_numpy()

The 2W phenotype is not available for all samples. We need to restrict X to the samples that have a 2W phenotype, with rows in the same order as the phenotype vector.

In [None]:
X_2W = X[np.array([samples.index(sample_id) \
                   for sample_id in samples_with_phenotype_2W]), :]
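
The row selection above calls `list.index` once per phenotyped sample, which scans the list each time. As a hedged alternative, here is a sketch on toy data (all names ending in `_toy` and the sample IDs are made up) that builds a dictionary from sample ID to row index once and then reorders the rows.

```python
import numpy as np

samples_toy = [101, 102, 103, 104]    # row order of the toy design matrix
with_pheno_toy = [103, 101]           # order of the toy phenotype file
X_small_toy = np.arange(8).reshape(4, 2)  # row i belongs to samples_toy[i]

# Map each sample ID to its row index (built once)
row_of = {sample_id: i for i, sample_id in enumerate(samples_toy)}

# Select and reorder rows to match the phenotype order
rows = [row_of[sample_id] for sample_id in with_pheno_toy]
X_sub_toy = X_small_toy[rows, :]
print(X_sub_toy)
```

The result has one row per phenotyped sample, in the phenotype file's order, which is the property the regression below relies on.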

__Q: How many samples do we have now? And how many SNPs? Does this make the task of biomarker detection simpler or harder?__

In [None]:
# You can delete X now to free space
del X

#### Loading the list of candidate genes

Candidate genes are genes that are known (or strongly suspected) to be associated with flowering traits in _A. thaliana_. They will serve as (imperfect) ground truth for our experiments.

In [None]:
# The with block closes the file automatically
with open('data/athaliana.candidates.txt') as f:
    candidate_genes = f.readline().split()

#### Loading the SNPs-to-gene mapping

Remember our features are Single-Nucleotide Polymorphisms. In order to compare selected SNPs to candidate genes, we need to map SNPs to genes in or near which they are located.

In [None]:
genes_by_snp = {} # key: SNP, value = [genes in/near which this SNP is]
with open('data/athaliana.snps_by_gene.txt') as f:
    for line in f:
        ls = line.split()
        gene_id = ls[0]
        for snp_id in ls[1:]:
            if snp_id not in genes_by_snp:
                genes_by_snp[snp_id] = []
            genes_by_snp[snp_id].append(gene_id) 

## Splitting the data into a train and test set

In machine learning, we always split the data into a *train* set, which serves to fit the model, and a *test* set, which serves to measure the model's performance.

__Q: Why? What happens if we do both the training and testing on the same data?__

We will set aside a test set, containing 20% of our samples, on which to evaluate the quality of our predictive models.

__Q: What problem occurs if the test set is too large a proportion of the data? What problem occurs if it is too small?__

In [None]:
from sklearn import model_selection

In [None]:
X_2W_tr, X_2W_te, y_2W_tr, y_2W_te = \
    model_selection.train_test_split(X_2W, y_2W, test_size=0.2, random_state=17)
print(X_2W_tr.shape, X_2W_te.shape)

## Data exploration
### Visualizing the phenotype

In [None]:
h = plt.hist(y_2W_tr, bins=30)

### Visualizing the genotype's correlation structure

In [None]:
#import seaborn as sn
sigma = pd.DataFrame(X_2W_tr).corr()

In [None]:
fig = plt.figure()
plt.imshow(sigma.iloc[0:1000, 0:1000])
plt.colorbar()
plt.title("Correlation between the first 1000 SNPs")

In [None]:
fig = plt.figure()
plt.imshow(sigma.iloc[0:100, 0:100])
plt.colorbar()
plt.title("Correlation between the first 100 SNPs")

__Q: What observation can you make about the phenotype and genotype?__

## T-tests

Let us start by running a statistical test for association of each SNP feature with the phenotype.

In [None]:
import statsmodels.api as sm

### T-test on a single SNP
We will perform a linear regression on a single SNP and test whether this SNP has an effect on the phenotype.

In [None]:
est = sm.regression.linear_model.OLS(y_2W_tr, sm.add_constant(X_2W_tr[:, 0])).fit()
print(est.summary())
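
To make explicit what the t-test on the slope computes, here is a sketch on synthetic data (not the _A. thaliana_ data). It assumes a 0/1 feature and a made-up effect size, computes the slope, its standard error, and the two-sided p-value by hand, and compares the result with `scipy.stats.linregress`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = rng.integers(0, 2, size=n).astype(float)  # synthetic 0/1 feature
y = 0.5 * x + rng.normal(size=n)              # synthetic phenotype

# Least-squares fit of y = b0 + b1 * x
x_c = x - x.mean()
b1 = (x_c @ (y - y.mean())) / (x_c @ x_c)
b0 = y.mean() - b1 * x.mean()

# Residual variance with n - 2 degrees of freedom
resid = y - (b0 + b1 * x)
sigma2 = (resid @ resid) / (n - 2)
se_b1 = np.sqrt(sigma2 / (x_c @ x_c))

# t statistic for H0: b1 = 0, and its two-sided p-value
t_stat = b1 / se_b1
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

# Same test as computed by scipy
p_scipy = stats.linregress(x, y).pvalue
print(p_manual, p_scipy)
```

The two p-values should agree up to floating-point error, since both are the same t-test with n - 2 degrees of freedom.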

__Q: In the previous table, where is the p-value of the T-test? What can you conclude about the effect of the first SNP on the phenotype?__

### T-test on all SNPs

In [None]:
pvalues = []
for snp_idx in range(p):
    # only look at the column corresponding to that SNP
    X_snp = X_2W_tr[:, snp_idx]
    # run a linear regression (with bias) between the phenotype and this SNP
    X_snp = sm.add_constant(X_snp)
    est = sm.regression.linear_model.OLS(y_2W_tr, X_snp)
    est2 = est.fit()
    # get the p-value from the model 
    pvalues.append(est2.pvalues[1])
pvalues = np.array(pvalues)
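
The loop above fits one regression per SNP, which can be slow. For a single feature plus an intercept, the slope t-test is equivalent to the test on the Pearson correlation, so all p-values can be computed at once. The sketch below uses synthetic data; `univariate_pvalues` is a hypothetical helper, not part of the notebook, and it assumes no column is constant (a constant column would give a NaN).

```python
import numpy as np
from scipy import stats

def univariate_pvalues(X, y):
    """Two-sided p-value of the slope t-test for each column of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation between each column of X and y
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc @ yc))
    # t statistic with n - 2 degrees of freedom
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return 2 * stats.t.sf(np.abs(t_stat), df=n - 2)

rng = np.random.default_rng(1)
X_synth = rng.integers(0, 2, size=(40, 5))        # synthetic 0/1 features
y_synth = X_synth[:, 0] + rng.normal(size=40)     # only feature 0 has an effect

p_vec = univariate_pvalues(X_synth, y_synth)
# Reference: one regression per column
p_loop = np.array([stats.linregress(X_synth[:, j], y_synth).pvalue
                   for j in range(X_synth.shape[1])])
print(np.allclose(p_vec, p_loop))
```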

### Manhattan plot

The common way to visualize such results is a Manhattan plot: we plot all SNPs on the x-axis, and on the y-axis the negative log base 10 of the p-value. The lower the p-value, the higher the corresponding point.

We will also add a horizontal line that corresponds to the _threshold for significance_. Because we are testing multiple hypotheses, we need to lower our threshold accordingly. We will use __Bonferroni correction__ and divide the significance threshold (say, alpha=0.05) by the number of tests, that is, the number of SNPs p.
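
As a worked example of the arithmetic, the sketch below uses a hypothetical number of tests (10,000, not the actual value of p in this dataset) and a few made-up p-values.

```python
import numpy as np

alpha = 0.05
n_tests = 10_000  # hypothetical number of SNPs, not the real value of p

# Bonferroni: divide the significance level by the number of tests
thresh_toy = alpha / n_tests          # about 5e-6
height = -np.log10(thresh_toy)        # height of the line on the Manhattan plot

# Made-up p-values: only those below the corrected threshold are significant
pvals_toy = np.array([1e-7, 1e-3, 0.2])
significant = pvals_toy < thresh_toy
print(thresh_toy, height, significant)
```

With these numbers, a p-value of 1e-3 would pass an uncorrected threshold of 0.05 but not the corrected one.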

In [None]:
plt.scatter(range(p), # x = SNP position
            -np.log10(pvalues)) # y = -log10 p-value 

# significance threshold according to Bonferroni correction
t = -np.log10(0.05 / p)
plt.plot([0, p], [t, t])

# plot labels
plt.xlabel("feature")
plt.ylabel("-log10 p-value")
plt.xlim([0, p])

We will now see whether any of the SNPs have a p-value lower than the Bonferroni-corrected threshold, and if so, whether those SNPs are in or near a candidate gene.

In [None]:
thresh = 0.05 / p # significance threshold set using the Bonferroni correction

for snp_idx in np.where(pvalues < thresh)[0]:
    print(("%.2e" % pvalues[snp_idx]), snp_names[snp_idx])
    # use .get so that SNPs not mapped to any gene do not raise a KeyError
    for gene_id in genes_by_snp.get(snp_names[snp_idx], []):
        if gene_id in candidate_genes:
            print("\t in/near candidate gene %s" % gene_id)

__Q: Are any SNPs significantly associated with the phenotype? Are they biologically meaningful?__
You can use https://www.arabidopsis.org/index.jsp to obtain more information about a gene from its name.