# Feature selection in high-dimensional genetic data

# Notebook 4: Multitask approaches

## Introduction

We will now repeat the previous analysis for the 4W phenotype. It is very similar to the 2W phenotype, except that the seeds have been vernalized for 4 weeks instead of 2.

Then, because it is reasonable to expect the genomic regions driving both phenotypes to be (almost) the same, we will use multi-task versions of the Lasso and elastic net to analyze both phenotypes simultaneously.
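To make the idea of "analyzing both phenotypes simultaneously" concrete, here is a minimal sketch of the penalty that scikit-learn's multi-task Lasso uses: the mixed L2,1 norm, i.e. the sum over features of the Euclidean norm of that feature's weights across tasks. The weight matrices below are made-up toy values, not results from our data.

```python
import numpy as np

def l21_norm(W):
    """Mixed L2,1 norm of W with shape (n_tasks, n_features):
    Euclidean norm across tasks for each feature, summed over features."""
    return np.sqrt((W ** 2).sum(axis=0)).sum()

def l1_norm(W):
    """Plain L1 norm: sum of absolute values of all weights."""
    return np.abs(W).sum()

# Toy weights for 2 tasks and 4 features (hypothetical values).
# Both tasks use the same two features:
W_shared = np.array([[3.0, 0.0, 4.0, 0.0],
                     [4.0, 0.0, 3.0, 0.0]])
# Each task uses its own two features:
W_disjoint = np.array([[3.0, 4.0, 0.0, 0.0],
                       [0.0, 0.0, 4.0, 3.0]])

print("L1  :", l1_norm(W_shared), l1_norm(W_disjoint))
print("L2,1:", l21_norm(W_shared), l21_norm(W_disjoint))
```

Both matrices have the same L1 norm, but the L2,1 norm is smaller when the tasks share their features. This is why the multi-task penalty favors solutions in which a SNP is either selected for both phenotypes or for neither.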

Check out the documentation: [sklearn.linear_model.MultiTaskLasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.MultiTaskLasso.html#sklearn.linear_model.MultiTaskLasso) and the accompanying [example](http://scikit-learn.org/stable/auto_examples/linear_model/plot_multi_task_lasso_support.html).

__Q: Is our setting the same "multi-task" setting as the one described in the scikit-learn documentation? What is the difference?__

Let us start with reloading the data.

In [None]:
%pylab inline 
# imports matplotlib as plt and numpy as np

In [None]:
plt.rc('font', **{'size': 14}) # font size for text on plots

In [None]:
import pandas as pd

In [None]:
# Loading the SNP names
with open('data/athaliana_small.snps.txt') as f:
    snp_names = f.readline().split()  # the with-block closes the file for us
print(len(snp_names))

In [None]:
# Loading the design matrix -- this can take time!
X = np.loadtxt('data/athaliana_small.X.txt',  # file names
               dtype = 'int') # values are integers
p = X.shape[1]

In [None]:
# Loading the sample names
samples = list(np.loadtxt('data/athaliana.samples.txt', # file names
                         dtype=int)) # values are integers

In [None]:
# Loading the list of candidate genes
with open('data/athaliana.candidates.txt') as f:
    candidate_genes = f.readline().split()  # the with-block closes the file for us

In [None]:
# Loading the SNPs-to-gene mapping
genes_by_snp = {} # key: SNP, value = [genes in/near which this SNP is]
with open('data/athaliana.snps_by_gene.txt') as f:
    for line in f:
        ls = line.split()
        gene_id = ls[0]
        for snp_id in ls[1:]:
            if snp_id not in genes_by_snp:
                genes_by_snp[snp_id] = []
            genes_by_snp[snp_id].append(gene_id) 

## Loading the 4W and 2W phenotypes

### Loading the 2W phenotype
This is the same as in previous notebooks.

In [None]:
df_2W = pd.read_csv('data/athaliana.2W.pheno', # file name
                 header=None, # columns have no header
                 sep=r'\s+', # columns are separated by white space (delim_whitespace is deprecated)
                 index_col=0) # read the first column as index

# Create vector of sample IDs
samples_with_phenotype_2W = list(df_2W.index)
print(len(samples_with_phenotype_2W), "samples have a 2W phenotype")

### Loading the 4W phenotype

The 4W phenotype is very similar to the 2W phenotype; the only difference is that seeds have been vernalized for 4 weeks instead of 2.

In [None]:
df_4W = pd.read_csv('data/athaliana.4W.pheno', # file name
                 header=None, # columns have no header
                 sep=r'\s+', # columns are separated by white space (delim_whitespace is deprecated)
                 index_col=0) # read the first column as index

# Create vector of sample IDs
samples_with_phenotype_4W = list(df_4W.index)
print(len(samples_with_phenotype_4W), "samples have a 4W phenotype")

### New design matrix

We will now restrict ourselves to samples with _both_ 2W and 4W phenotypes.
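Before doing this on the real data, here is a small sketch of the bookkeeping involved, on made-up sample IDs. It assumes, as in this notebook, that a list gives the row order of the design matrix. Two choices in it are ours and not required: sorting the intersection (set iteration order is an implementation detail, so sorting makes the row order explicit) and using a dictionary lookup instead of repeated `list.index` calls.

```python
import numpy as np

# Hypothetical sample IDs; `toy_samples` gives the row order of the toy design matrix.
toy_samples = [101, 205, 307, 412, 518]
toy_X = np.arange(10).reshape(5, 2)   # one row per sample, two SNPs

toy_ids_2W = [205, 412, 101, 999]     # samples with a 2W phenotype
toy_ids_4W = [518, 101, 412]          # samples with a 4W phenotype

# Samples that have both phenotypes, in a fixed (sorted) order.
toy_both = sorted(set(toy_ids_2W) & set(toy_ids_4W))

# Map each sample ID to its row in the design matrix once.
row_of = {sample_id: row for row, sample_id in enumerate(toy_samples)}
rows = np.array([row_of[sample_id] for sample_id in toy_both])

toy_X_both = toy_X[rows, :]
print(toy_both)
print(toy_X_both)
```

What matters is that the same ordered list of sample IDs is used to index both the design matrix and the phenotypes, so that row `i` of `X` and row `i` of `y` always refer to the same sample.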

In [None]:
samples_with_phenotype_both = list(set(samples_with_phenotype_2W).intersection(samples_with_phenotype_4W))
print(len(samples_with_phenotype_both), "samples have both phenotypes")

We restrict the design matrix to the samples that have both a 2W and a 4W phenotype:

In [None]:
X_both = X[np.array([samples.index(sample_id) for sample_id in samples_with_phenotype_both]), :]
del X # You can delete X now if you want, to free space

We restrict the phenotypes to the same samples, in the same order:

In [None]:
# 2W phenotypes, ordered according to samples_with_phenotype_both
df_2W_both = df_2W.loc[samples_with_phenotype_both]

# 4W phenotypes, ordered according to samples_with_phenotype_both
df_4W_both = df_4W.loc[samples_with_phenotype_both]

# multitask phenotype matrix:
y_both = np.hstack((df_2W_both, df_4W_both))

## Preliminary analysis

Is it reasonable to expect the 2W and 4W phenotypes to share many explanatory SNPs?

### Correlation between the phenotypes

In [None]:
from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated private path

In [None]:
print("The correlation between the two phenotypes is %.3f" % pearsonr(y_both[:, 0], y_both[:, 1])[0])

In [None]:
plt.scatter(y_both[:, 0], y_both[:, 1])
plt.xlabel("2W")
plt.ylabel("4W")

__Q: What do you make of this? Does it make sense to study both phenotypes together?__

### Manhattan plots

We will now plot the Manhattan plots for both phenotypes.
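As a reminder of what goes into a Manhattan plot, here is a minimal sketch on simulated data: one simple linear regression per SNP, and a Bonferroni-corrected significance threshold of 0.05 divided by the number of SNPs. The data, the effect size and the use of `scipy.stats.linregress` are assumptions made for this illustration; its p-value tests whether the slope is zero, which is the same hypothesis as in an OLS fit with an intercept.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_toy, p_toy = 100, 20

# Simulated binary genotypes and a phenotype driven by SNP 3 only.
X_toy = rng.integers(0, 2, size=(n_toy, p_toy))
y_toy = 5.0 * X_toy[:, 3] + rng.normal(size=n_toy)

# One univariate regression per SNP; keep the p-value of the slope.
pvals_toy = np.array([stats.linregress(X_toy[:, j], y_toy).pvalue
                      for j in range(p_toy)])

# Bonferroni correction: divide the significance level by the number of tests.
threshold = 0.05 / p_toy
significant = np.nonzero(pvals_toy < threshold)[0]

print("Most significant SNP:", np.argmin(pvals_toy))
print("SNPs below the Bonferroni threshold:", significant)
```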

In [None]:
import statsmodels.api as sm

In [None]:
## Compute p-values for both 2W and 4W
pvalues_2W = []
pvalues_4W = []
for snp_idx in range(p):
    X_snp = sm.add_constant(X_both[:, snp_idx])
    ## 2W
    est_2W = sm.regression.linear_model.OLS(y_both[:, 0], X_snp).fit()
    pvalues_2W.append(est_2W.pvalues[1])
    ## 4W
    est_4W = sm.regression.linear_model.OLS(y_both[:, 1], X_snp).fit()
    pvalues_4W.append(est_4W.pvalues[1])
pvalues_2W = np.array(pvalues_2W)
pvalues_4W = np.array(pvalues_4W)

We can overlay the two Manhattan plots (and flip the second one):

In [None]:
plt.figure(figsize=(15, 4))

# 2W above the axis: -log10(p), with the Bonferroni threshold as a horizontal line
plt.scatter(range(p), -np.log10(pvalues_2W), alpha=0.6, s=5, label="2W")
t = -np.log10(0.05 / p)
plt.plot([0, p], [t, t], c="black")

# 4W flipped below the axis: +log10(p), with the mirrored threshold
plt.scatter(range(p), np.log10(pvalues_4W), alpha=0.6, s=5, label="4W")
plt.plot([0, p], [-t, -t], c="black")

plt.xlabel("feature")
plt.ylabel("-log10 p-value (2W up, 4W down)")
plt.xlim([0, p])
plt.legend()

__Q: What do you make of this? Does it make sense to study both phenotypes together?__

Another possible visualization is to use the two p-values of each SNP as xy-coordinates:

In [None]:
plt.figure(figsize=(5, 5))
plt.scatter(-np.log10(pvalues_2W), -np.log10(pvalues_4W))
plt.xlabel('-log10 p-value (2W)'); plt.ylabel('-log10 p-value (4W)')

Most SNPs are in the lower-left corner: their -log10 p-values are small, meaning their p-values are large for both phenotypes. But there are a number of SNPs with large -log10 p-values for both 2W and 4W, which shows that the two sets of p-values are correlated. Moreover, there are no SNPs with a large p-value for one phenotype and a small p-value for the other.

__Q: What conclusion can you draw? Do the 2W and 4W phenotypes seem to be linked to the same genome loci?__

## Train-test split

In [None]:
from sklearn import model_selection

In [None]:
X_both_tr, X_both_te, y_both_tr, y_both_te = \
    model_selection.train_test_split(X_both, y_both, test_size=0.1, random_state=17)

In [None]:
print(y_both.shape, y_both_tr.shape, y_both_te.shape)
print(X_both.shape, X_both_tr.shape, X_both_te.shape)

## Lasso on the 2W phenotype

We have fewer samples than for our previous analysis of the 2W phenotype, because we've restricted ourselves to samples for which both the 2W and 4W phenotypes are available. This will affect our ability to train a Lasso model for this phenotype.

Here we re-run the same experiment as in Notebook 2, but restricted to the 117 samples that have both a 2W and a 4W phenotype.

In [None]:
from sklearn import linear_model

In [None]:
# # You can use the Lasso path to determine the most appropriate range of values for alpha
# alphas_lasso_2W, coefs_lasso_2W, _ = linear_model.lasso_path(X_both_tr, y_both_tr[:, 0], eps=1e-2, n_alphas=30, fit_intercept=True)
# alphas_lasso_2W

In [None]:
lasso = linear_model.Lasso(fit_intercept=True, max_iter=6000)

Define cross-validation grid search and learn lasso with cross-validation.

In [None]:
alphas = np.logspace(-2., 1., num=20)
model_l1_2W = model_selection.GridSearchCV(lasso, param_grid = {'alpha': alphas}, 
                                        scoring='explained_variance', verbose=1)
model_l1_2W.fit(X_both_tr, y_both_tr[:, 0])

In [None]:
model_l1_2W.best_params_

In [None]:
plt.figure(figsize = (6, 4))
plt.scatter(range(p), # x = SNP position
            model_l1_2W.best_estimator_.coef_)  # y = regression weights

plt.xlabel("SNPs")
plt.ylabel("Regression weights")
plt.title("Lasso on the 2W phenotype")
plt.xlim([0, p])

In [None]:
selected_snps_2W = np.nonzero(model_l1_2W.best_estimator_.coef_)[0]
print("%d SNPs selected" % selected_snps_2W.shape)

candidate_genes_hit = set([])
num_snps_in_candidate_genes = 0
for snp_idx in selected_snps_2W:
    for gene_id in genes_by_snp[snp_names[snp_idx]]:
        if gene_id in candidate_genes:
            candidate_genes_hit.add(gene_id)
            num_snps_in_candidate_genes += 1
            break

print("\t %d of the selected SNPs are in or near %d candidate genes" % (num_snps_in_candidate_genes, 
                                                                     len(candidate_genes_hit)))

In [None]:
from sklearn import metrics

In [None]:
y_2W_l1_pred = model_l1_2W.best_estimator_.predict(X_both_te)

print("Percentage of variance explained (using %d SNPs): %.2f" % \
     (np.nonzero(model_l1_2W.best_estimator_.coef_)[0].shape[0], 
      metrics.explained_variance_score(y_both_te[:, 0], y_2W_l1_pred)))

In [None]:
plt.figure(figsize = (4, 4))
plt.scatter(y_both_te[:, 0], y_2W_l1_pred)

plt.xlabel("true phenotype")
plt.ylabel("prediction")
plt.title("2W")
plt.xlim([np.min(y_both_te[:, 0])-5, np.max(y_both_te[:, 0])+5])
plt.ylim([np.min(y_both_te[:, 0])-5, np.max(y_both_te[:, 0])+5])

plt.plot([np.min(y_both_te[:, 0])-5, np.max(y_both_te[:, 0])+5], 
         [np.min(y_both_te[:, 0])-5, np.max(y_both_te[:, 0])+5], c="black")

__Q: How do these results compare to those obtained in Notebook 2? Why?__
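When comparing scores across notebooks, it helps to recall what the reported metric measures. The explained variance score is one minus the variance of the residuals divided by the variance of the true values, so it ignores any constant offset in the predictions, unlike R². The sketch below recomputes both with plain numpy on made-up numbers.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """1 - Var(y_true - y_pred) / Var(y_true)."""
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

def r2(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical predictions that are off by a constant of 2.
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_toy = y_true_toy + 2.0

print("explained variance:", explained_variance(y_true_toy, y_pred_toy))
print("R^2:", r2(y_true_toy, y_pred_toy))
```

A systematically shifted prediction still gets a perfect explained variance score here, while R² penalizes it. Both scores are fractions rather than percentages.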

## Lasso on the 4W phenotype

Let us see how well a Lasso model performs on the 4W phenotype.

In [None]:
# # You can use the Lasso path to determine the most appropriate range of values for alpha
# alphas_lasso_4W, coefs_lasso_4W, _ = linear_model.lasso_path(X_both_tr, y_both_tr[:, 1], eps=1e-2, n_alphas=30, fit_intercept=True)
# alphas_lasso_4W

In [None]:
lasso = linear_model.Lasso(fit_intercept=True, max_iter=6000)

In [None]:
alphas = np.logspace(-2., 1., num=20)
model_l1_4W = model_selection.GridSearchCV(lasso, param_grid = {'alpha': alphas}, 
                                        scoring='explained_variance', verbose=1)
model_l1_4W.fit(X_both_tr, y_both_tr[:, 1])

In [None]:
model_l1_4W.best_params_

In [None]:
plt.figure(figsize = (6, 4))
plt.scatter(range(p), # x = SNP position
            model_l1_4W.best_estimator_.coef_)  # y = regression weights

plt.xlabel("SNPs")
plt.ylabel("Regression weights")
plt.title("Lasso on the 4W phenotype")
plt.xlim([0, p])

In [None]:
selected_snps_4W = np.nonzero(model_l1_4W.best_estimator_.coef_)[0]
print("%d SNPs selected" % selected_snps_4W.shape)

candidate_genes_hit = set([])
num_snps_in_candidate_genes = 0
for snp_idx in selected_snps_4W:
    for gene_id in genes_by_snp[snp_names[snp_idx]]:
        if gene_id in candidate_genes:
            candidate_genes_hit.add(gene_id)
            num_snps_in_candidate_genes += 1
            break

print("\t %d of the selected SNPs are in or near %d candidate genes" % (num_snps_in_candidate_genes, 
                                                                     len(candidate_genes_hit)))

In [None]:
y_4W_l1_pred = model_l1_4W.best_estimator_.predict(X_both_te)

print("Percentage of variance explained (using %d SNPs): %.2f" % \
     (np.nonzero(model_l1_4W.best_estimator_.coef_)[0].shape[0], 
      metrics.explained_variance_score(y_both_te[:, 1], y_4W_l1_pred)))

In [None]:
plt.figure(figsize = (4, 4))
plt.scatter(y_both_te[:, 1], y_4W_l1_pred)

plt.xlabel("true phenotype")
plt.ylabel("prediction")
plt.title("4W")
plt.xlim([np.min(y_both_te[:, 1])-5, np.max(y_both_te[:, 1])+5])
plt.ylim([np.min(y_both_te[:, 1])-5, np.max(y_both_te[:, 1])+5])

plt.plot([np.min(y_both_te[:, 1])-5, np.max(y_both_te[:, 1])+5], 
         [np.min(y_both_te[:, 1])-5, np.max(y_both_te[:, 1])+5], c="black")

__Q: How do these results compare to those obtained on the 2W phenotype?__
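One way to compare the two single-task models is to look at how much their selected SNP sets overlap. The helper below is a sketch of our own (the name `support_overlap` is hypothetical), demonstrated on made-up index arrays; it returns the shared indices and the Jaccard index, i.e. the size of the intersection divided by the size of the union.

```python
import numpy as np

def support_overlap(idx_a, idx_b):
    """Return the indices selected by both models and their Jaccard index."""
    shared = np.intersect1d(idx_a, idx_b)
    union = np.union1d(idx_a, idx_b)
    jaccard = len(shared) / len(union) if len(union) > 0 else 0.0
    return shared, jaccard

# Made-up selections standing in for the two Lasso supports.
toy_selected_2W = np.array([2, 5, 9, 14])
toy_selected_4W = np.array([5, 9, 20])

shared_toy, jaccard_toy = support_overlap(toy_selected_2W, toy_selected_4W)
print("shared SNP indices:", shared_toy)
print("Jaccard index: %.2f" % jaccard_toy)
```

In this notebook, the same helper could be called as `support_overlap(selected_snps_2W, selected_snps_4W)`.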

## Multitask lasso

We can now cross-validate a multitask Lasso on the training data.

In [None]:
mt_l1 = linear_model.MultiTaskLasso(fit_intercept=True, max_iter=6000)
alphas = np.logspace(-3., 1, num=10)
model_mt_l1 = model_selection.GridSearchCV(mt_l1,
                                          param_grid = {'alpha': alphas}, verbose=2)

model_mt_l1.fit(X_both_tr, y_both_tr)

In [None]:
model_mt_l1.best_params_

In [None]:
plt.figure(figsize = (8, 5))
#plt.spy(model_mt_l1.best_estimator_.coef_)
plt.scatter(range(p), model_mt_l1.best_estimator_.coef_[0, :], alpha=0.7, label="2W")
plt.scatter(range(p), model_mt_l1.best_estimator_.coef_[1, :], alpha=0.7, label="4W")

plt.xlabel("features")
plt.ylabel("MTLasso regression weights")
plt.xlim([0, p])
plt.legend()

In [None]:
selected_snps_mt_l1_2W = np.nonzero(model_mt_l1.best_estimator_.coef_[0, :])[0]
print("%d SNPs selected for 2W," % selected_snps_mt_l1_2W.shape)

candidate_genes_hit = set([])
num_snps_in_candidate_genes = 0
for snp_idx in selected_snps_mt_l1_2W:
    for gene_id in genes_by_snp[snp_names[snp_idx]]:
        if gene_id in candidate_genes:
            candidate_genes_hit.add(gene_id)
            num_snps_in_candidate_genes += 1
            break

print("\t of which %d are in/near %d candidate genes" % (num_snps_in_candidate_genes, 
                                                          len(candidate_genes_hit)))

In [None]:
selected_snps_mt_l1_4W = np.nonzero(model_mt_l1.best_estimator_.coef_[1, :])[0]
print("%d SNPs selected for 4W," % selected_snps_mt_l1_4W.shape)

candidate_genes_hit = set([])
num_snps_in_candidate_genes = 0
for snp_idx in selected_snps_mt_l1_4W:
    for gene_id in genes_by_snp[snp_names[snp_idx]]:
        if gene_id in candidate_genes:
            candidate_genes_hit.add(gene_id)
            num_snps_in_candidate_genes += 1
            break

print("\t of which %d are in/near %d candidate genes" % (num_snps_in_candidate_genes, 
                                                          len(candidate_genes_hit)))

In [None]:
y_l1_mt_pred = model_mt_l1.best_estimator_.predict(X_both_te)

print("Percentage of variance explained for 2W (using %d SNPs): %.2f" % \
     (np.nonzero(model_mt_l1.best_estimator_.coef_[0, :])[0].shape[0],  # row 0 = 2W
      metrics.explained_variance_score(y_both_te[:, 0], y_l1_mt_pred[:, 0])))

print("Percentage of variance explained for 4W (using %d SNPs): %.2f" % \
     (np.nonzero(model_mt_l1.best_estimator_.coef_[1, :])[0].shape[0],  # row 1 = 4W
      metrics.explained_variance_score(y_both_te[:, 1], y_l1_mt_pred[:, 1])))

In [None]:
plt.figure(figsize = (4, 4))
plt.scatter(y_both_te[:, 0], y_l1_mt_pred[:, 0], alpha=0.7, label="2W")
plt.scatter(y_both_te[:, 1], y_l1_mt_pred[:, 1], alpha=0.7, label="4W")

plt.xlabel("true phenotype")
plt.ylabel("prediction")
plt.title("Multitask")
plt.xlim([np.min(y_both_te)-5, np.max(y_both_te)+5])
plt.ylim([np.min(y_both_te)-5, np.max(y_both_te)+5])

plt.plot([np.min(y_both_te)-5, np.max(y_both_te)+5], 
         [np.min(y_both_te)-5, np.max(y_both_te)+5], c="black")

Note that, here, the multitask approach does not compensate for the predictive ability we lost by having fewer samples.
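A property worth keeping in mind when reading the SNP counts above is that the multi-task Lasso penalizes each feature's weights jointly across tasks, so a feature is either selected for all tasks or for none. The sketch below illustrates this on simulated data; the sample size, the weights and `alpha=0.3` are arbitrary choices for the illustration, not values tuned for our phenotypes.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)
n_toy, p_toy = 80, 30

# Simulated design matrix and two tasks sharing the same three causal features.
X_toy = rng.normal(size=(n_toy, p_toy))
W_true = np.zeros((2, p_toy))
W_true[0, :3] = [3.0, -2.0, 2.5]
W_true[1, :3] = [2.0, -3.0, 2.0]
Y_toy = X_toy @ W_true.T + 0.5 * rng.normal(size=(n_toy, 2))

mt_toy = linear_model.MultiTaskLasso(alpha=0.3, max_iter=6000).fit(X_toy, Y_toy)

# coef_ has shape (n_tasks, n_features): one row of weights per task.
support_task0 = np.nonzero(mt_toy.coef_[0, :])[0]
support_task1 = np.nonzero(mt_toy.coef_[1, :])[0]

print("features selected for task 0:", support_task0)
print("features selected for task 1:", support_task1)
```

The two supports coincide, while the weights themselves differ between tasks. The multitask model therefore answers "which SNPs matter for both phenotypes?" rather than "which SNPs matter for each one separately?".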

## Multitask elastic-net

__Q: Do the same as before, but with (multi-task) elastic net instead!__ 
See the [user guide](https://scikit-learn.org/stable/modules/linear_model.html#multi-task-elastic-net) and [API](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.MultiTaskElasticNet.html#sklearn.linear_model.MultiTaskElasticNet).
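As a possible starting point, here is a sketch of a grid search over both `alpha` and `l1_ratio` for `MultiTaskElasticNet`, run on simulated data rather than on our phenotypes. The grid values, the 3-fold cross-validation and the simulated weights are assumptions made to keep the example small; you will need to adapt them to `X_both_tr` and `y_both_tr`.

```python
import numpy as np
from sklearn import linear_model, model_selection

rng = np.random.default_rng(1)
n_toy, p_toy = 60, 25

# Simulated data: two tasks sharing four causal features.
X_toy = rng.normal(size=(n_toy, p_toy))
W_true = np.zeros((2, p_toy))
W_true[:, :4] = [[2.0, -1.5, 1.0, 2.5],
                 [1.5, -2.0, 1.5, 2.0]]
Y_toy = X_toy @ W_true.T + 0.5 * rng.normal(size=(n_toy, 2))

# l1_ratio balances the mixed L2,1 penalty against the squared L2 penalty.
param_grid = {'alpha': np.logspace(-2., 0., num=5),
              'l1_ratio': [0.2, 0.5, 0.8]}

mt_enet = linear_model.MultiTaskElasticNet(fit_intercept=True, max_iter=6000)
model_mt_enet_toy = model_selection.GridSearchCV(mt_enet, param_grid=param_grid, cv=3)
model_mt_enet_toy.fit(X_toy, Y_toy)

print(model_mt_enet_toy.best_params_)
print("number of selected features:",
      np.count_nonzero(model_mt_enet_toy.best_estimator_.coef_[0, :]))
```

With no `scoring` argument, the grid search uses the estimator's default score, which for regressors is R² averaged over the tasks.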