# Feature selection in high-dimensional genetic data

# Notebook 2: Linear regression, Lasso and Elastic Net

## Introduction

We continue working with our _Arabidopsis thaliana_ data. We will now use various linear models to predict the phenotype from the genotype.

We start by reloading the same libraries and data as in Notebook 1, using the same code.

In [None]:
%pylab inline
# imports matplotlib.pyplot as plt and numpy as np

In [None]:
plt.rc('font', **{'size': 14}) # font size for text on plots

In [None]:
import pandas as pd

In [None]:
# Loading the SNP names
with open('data/athaliana_small.snps.txt') as f:  # the file is closed automatically
    snp_names = f.readline().split()
print(len(snp_names))

In [None]:
# Loading the design matrix -- this can take time!
X = np.loadtxt('data/athaliana_small.X.txt',  # file names
               dtype = 'int') # values are integers
p = X.shape[1]

In [None]:
# Loading the samples
samples = list(np.loadtxt('data/athaliana.samples.txt', # file names
                         dtype=int)) # values are integers

In [None]:
# Loading the phenotypes
df = pd.read_csv('data/athaliana.2W.pheno', # file name
                 header=None, # columns have no header
                 sep=r'\s+') # columns are separated by white space

In [None]:
# Loading the phenotypes
df_2W = pd.read_csv('data/athaliana.2W.pheno', # file name
                    header=None, # columns have no header
                    sep=r'\s+', # columns are separated by white space
                    index_col=0) # read the first column as index

# Create vector of sample IDs
samples_with_phenotype_2W = list(df_2W.index)
print(len(samples_with_phenotype_2W), "samples have a 2W phenotype")

# Create vector of phenotypes
y_2W = df_2W[1].to_numpy()

# Restricting the design matrix to those samples who have a 2W phenotype
X_2W = X[np.array([samples.index(sample_id) \
                   for sample_id in samples_with_phenotype_2W]), :]

# Delete X to free space
del X

In [None]:
# Loading the list of candidate genes
with open('data/athaliana.candidates.txt') as f:  # the file is closed automatically
    candidate_genes = f.readline().split()

In [None]:
# Loading the SNPs-to-gene mapping
genes_by_snp = {} # key: SNP, value = [genes in/near which this SNP is]
with open('data/athaliana.snps_by_gene.txt') as f:
    for line in f:
        ls = line.split()
        gene_id = ls[0]
        for snp_id in ls[1:]:
            if snp_id not in genes_by_snp:
                genes_by_snp[snp_id] = []
            genes_by_snp[snp_id].append(gene_id) 

In [None]:
# Splitting the data into a train and test set
from sklearn import model_selection

X_2W_tr, X_2W_te, y_2W_tr, y_2W_te = \
    model_selection.train_test_split(X_2W, y_2W, test_size=0.2, 
                                     random_state=17) # use the same random_state as in Notebook 1 to obtain the same split
print(X_2W_tr.shape, X_2W_te.shape)

## Linear regression 

### Fitting a linear regression to the training set

In [None]:
from sklearn import linear_model

In [None]:
model_lr = linear_model.LinearRegression(fit_intercept = True)
model_lr.fit(X_2W_tr, y_2W_tr)

We can now visualize the regression weights we have learned:

In [None]:
plt.figure(figsize = (12, 5))
plt.scatter(range(p), # x = SNP position
            model_lr.coef_, # y = regression weights
            s = 10)  # point size

plt.xlabel("SNP")
plt.ylabel("regression weight")
plt.xlim([0, p])

__Q: What do you observe? How can you interpret these results? Do any of the SNPs strike you as having a strong influence on the phenotype?__

The following SNPs are the ones with the ten highest weights (in absolute value). They are all near candidate genes.

In [None]:
# Rank SNPs by the absolute value of their weight, so that large
# negative weights are reported as well as large positive ones
abs_weights = np.abs(model_lr.coef_)
top_snp_indices = np.argsort(abs_weights)[-10:]  # ten largest, in increasing order

for snp_idx in top_snp_indices:
    print(abs_weights[snp_idx], snp_names[snp_idx])
    for gene_id in genes_by_snp[snp_names[snp_idx]]:
        if gene_id in candidate_genes:
            print("\t in/near candidate gene %s" % gene_id)

### Predictive power

In this section, we measure the predictive power of the linear regression model on the test set.

__Q: What is the definition of the variance explained? You may use the [scikit learn documentation](https://sklearn.org/modules/classes.html#sklearn-metrics-metrics). What values can this metric take, and to which cases do the extreme values correspond?__

In [None]:
from sklearn import metrics

In [None]:
y_2W_lr_pred = model_lr.predict(X_2W_te)

print("Fraction of variance explained (using all SNPs): %.2f" % \
    metrics.explained_variance_score(y_2W_te, y_2W_lr_pred))

In [None]:
plt.figure(figsize = (5, 5))
plt.scatter(y_2W_te, y_2W_lr_pred)

plt.xlabel("true phenotype")
plt.ylabel("prediction")
plt.xlim([np.min(y_2W_te) - 5, np.max(y_2W_te) + 5])
plt.ylim([np.min(y_2W_te) - 5, np.max(y_2W_te) + 5])

# Diagonal y = x: points on this line are perfectly predicted
plt.plot([np.min(y_2W_te)-5, np.max(y_2W_te)+5], 
         [np.min(y_2W_te)-5, np.max(y_2W_te)+5], c="black")

## Lasso
Under the hypothesis that not all SNPs are involved in the phenotype, we will now attempt to learn a _sparse_ model, using a Lasso.
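
To make the idea of a sparse model concrete, here is a minimal sketch on synthetic data (not the _Arabidopsis thaliana_ data). The toy dimensions, the number of causal features, the noise level and the value `alpha=0.2` are all assumptions chosen for illustration. It compares how many non-zero weights ordinary least squares and a lasso assign when only a handful of features truly matter.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (an assumption for illustration, not the A. thaliana data):
# 50 samples, 200 features, only the first 5 features influence y.
rng = np.random.default_rng(0)
n_samples, n_features, n_causal = 50, 200, 5
X_toy = rng.normal(size=(n_samples, n_features))
w_true = np.zeros(n_features)
w_true[:n_causal] = 2.0
y_toy = X_toy @ w_true + 0.5 * rng.normal(size=n_samples)

# Ordinary least squares: typically gives every feature a non-zero weight
ols_toy = linear_model.LinearRegression().fit(X_toy, y_toy)
# Lasso: the l1 penalty can set many weights exactly to zero
lasso_toy = linear_model.Lasso(alpha=0.2).fit(X_toy, y_toy)

n_nonzero_ols = np.count_nonzero(ols_toy.coef_)
n_nonzero_lasso = np.count_nonzero(lasso_toy.coef_)
print("OLS:   %d non-zero weights" % n_nonzero_ols)
print("Lasso: %d non-zero weights" % n_nonzero_lasso)
```

On real genotype data the right value of `alpha` is not known in advance, which is why it is worth choosing it by cross-validation rather than fixing it by hand as in this toy example.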

### Fitting a lasso model

Define a lasso model

In [None]:
lasso = linear_model.Lasso(fit_intercept=True, max_iter=6000)

Define a grid of values for the regularization parameter and fit the lasso using a cross-validated grid search.

In [None]:
alphas = np.logspace(-3., 2., num=20)
model_l1 = model_selection.GridSearchCV(lasso, param_grid = {'alpha': alphas}, 
                                        scoring='explained_variance', verbose=2)
model_l1.fit(X_2W_tr, y_2W_tr)

The best value of the regularization parameter is given by:

In [None]:
model_l1.best_params_

### Interpretation

Let us now visualize the regression coefficients:

In [None]:
plt.figure(figsize = (6, 4))
plt.scatter(range(p), # x = SNP position
            model_l1.best_estimator_.coef_)  # y = regression weights

plt.xlabel("SNP")
plt.ylabel("lasso regression weight")
plt.xlim([0, p])

We can now check how many of these SNPs have non-zero coefficients.

In [None]:
selected_snps = np.nonzero(model_l1.best_estimator_.coef_)[0]
print("%d SNPs selected" % selected_snps.shape[0])

__Q: How many SNPs are selected? How do you interpret this?__

We can now check whether those SNPs are in or near candidate genes, that is to say, genes that are known or strongly suspected to be involved in flowering time:

In [None]:
candidate_genes_hit = set([])
num_snps_in_candidate_genes = 0
for snp_idx in selected_snps:
    for gene_id in genes_by_snp[snp_names[snp_idx]]:
        if gene_id in candidate_genes:
            candidate_genes_hit.add(gene_id)
            num_snps_in_candidate_genes += 1
            break

print("%d of the selected SNPs are in or near %d candidate genes" % (num_snps_in_candidate_genes, 
                                                                     len(candidate_genes_hit)))

All selected SNPs are in or near candidate genes. The lasso selected biologically relevant SNPs!

### Predictive power 

In [None]:
y_2W_l1_pred = model_l1.best_estimator_.predict(X_2W_te)

print("Fraction of variance explained (using %d SNPs): %.2f" % \
     (np.nonzero(model_l1.best_estimator_.coef_)[0].shape[0], 
      metrics.explained_variance_score(y_2W_te, y_2W_l1_pred)))

__Q: How does the lasso compare with the OLS (linear regression) in terms of variance explained? What is the advantage of the lasso model for generating biological hypotheses?__

Comparing true and predicted phenotypes

In [None]:
plt.figure(figsize = (5, 5))
plt.scatter(y_2W_te, y_2W_l1_pred)

plt.xlabel("true phenotype")
plt.ylabel("prediction")
plt.xlim([np.min(y_2W_te) - 5, np.max(y_2W_te) + 5])
plt.ylim([np.min(y_2W_te) - 5, np.max(y_2W_te) + 5])

# Diagonal y = x: points on this line are perfectly predicted
plt.plot([np.min(y_2W_te)-5, np.max(y_2W_te)+5], 
         [np.min(y_2W_te)-5, np.max(y_2W_te)+5], c="black")

## Elastic net

One solution to make the lasso more stable is to use a combination of the l1 and l2 regularizations.

We are now minimizing the loss plus a linear combination of the l1-norm and the squared l2-norm of the regression weights. This still imposes sparsity, but encourages correlated features to be selected together, whereas the lasso tends to pick only one feature, more or less arbitrarily, from a group of correlated features.
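
The behaviour on correlated features can be checked on a small synthetic example. The sketch below is an illustration under stated assumptions, not part of the analysis: the toy data, the extreme case of two exactly duplicated features, and the values `alpha=0.1` and `l1_ratio=0.5` are all chosen for the demonstration. One would expect the lasso to put almost all of the weight on one copy, and the elastic net to split it roughly evenly between the two.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (an assumption for illustration): features 0 and 1 are
# exact copies of each other, features 2-4 are pure noise.
rng = np.random.default_rng(0)
n_samples = 200
z = rng.normal(size=n_samples)
X_toy = np.column_stack([z, z, rng.normal(size=(n_samples, 3))])
y_toy = 3.0 * z + 0.5 * rng.normal(size=n_samples)

# Same regularization strength for both models; only the penalty differs
lasso_toy = linear_model.Lasso(alpha=0.1).fit(X_toy, y_toy)
enet_toy = linear_model.ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_toy, y_toy)

# Weights given to the two duplicated features by each model
lasso_pair = lasso_toy.coef_[:2]
enet_pair = enet_toy.coef_[:2]
print("lasso weights on the duplicated pair:      ", lasso_pair)
print("elastic net weights on the duplicated pair:", enet_pair)
```

Real SNPs in linkage disequilibrium are strongly correlated rather than identical, so the contrast between the two models is usually less clear-cut than in this extreme case.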

The elastic net is implemented in scikit-learn's [linear_model.ElasticNet](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet).

### Regularization path

To better understand the difference between Lasso and Elastic net, we can start by visualizing the regularization path of a few variables for both models. To avoid looking at almost 10,000 paths (as many as SNPs), we'll only look at the paths for the features selected by the Lasso in the previous section (indexed by `selected_snps`).

The regularization path of a variable shows how the regression coefficient of this variable evolves as a function of the regularization parameter.
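
As a minimal sketch of what a regularization path contains, the example below computes a lasso path on synthetic data and counts the non-zero coefficients along it. The toy data and all names ending in `_toy` are assumptions made for illustration; only `linear_model.lasso_path` and its `eps` and `n_alphas` arguments come from the notebook.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (an assumption for illustration): 100 samples, 20 features,
# only the first 3 features influence y.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 20))
y_toy = X_toy[:, :3] @ np.array([3.0, -2.0, 1.0]) + 0.5 * rng.normal(size=100)

# Center the data so that no intercept is needed
X_toy = X_toy - X_toy.mean(axis=0)
y_toy = y_toy - y_toy.mean()

# alphas_toy is sorted from the largest to the smallest alpha;
# coefs_toy has one row per feature and one column per alpha.
alphas_toy, coefs_toy, _ = linear_model.lasso_path(X_toy, y_toy, eps=1e-2, n_alphas=30)

# Number of (numerically) non-zero coefficients at each point of the path
n_nonzero_toy = np.count_nonzero(np.abs(coefs_toy) > 1e-10, axis=0)
for alpha, k in zip(alphas_toy[::5], n_nonzero_toy[::5]):
    print("alpha = %.4f -> %d non-zero coefficients" % (alpha, k))
```

Each row of `coefs_toy` is the path of one variable: plotting a row against `-np.log10(alphas_toy)` shows at which regularization strength that variable enters the model.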

It can be computed with [linear_model.lasso_path](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.lasso_path.html) for the Lasso and [linear_model.enet_path](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.enet_path.html) for the Elastic net. For the Elastic net, we fix `l1_ratio` to 0.8.

In [None]:
from sklearn import linear_model

In [None]:
alphas_lasso, coefs_lasso, _ = linear_model.lasso_path(X_2W_tr[:, :], y_2W_tr, eps=1e-2, n_alphas=30, fit_intercept=True)

In [None]:
alphas_enet, coefs_enet, _ = linear_model.enet_path(X_2W_tr[:, :], y_2W_tr, eps=1e-2, n_alphas=30,  
                                                    l1_ratio=0.8, fit_intercept=True)

In [None]:
from itertools import cycle
import matplotlib.colors as mcolors

In [None]:
colors = cycle(list(mcolors.TABLEAU_COLORS.keys()))
plt.figure(figsize = (10, 6))
neg_log_alphas_lasso = -np.log10(alphas_lasso)
neg_log_alphas_enet = -np.log10(alphas_enet)

for coef_l, coef_e, c in zip(coefs_lasso[selected_snps, :], coefs_enet[selected_snps, :], colors):
    l1 = plt.plot(neg_log_alphas_lasso, coef_l, c = c)
    l2 = plt.plot(neg_log_alphas_enet, coef_e, linestyle = '--', c = c)

plt.xlabel('-log10(alpha)')
plt.ylabel('coefficients')
plt.title('Lasso and Elastic net regularization paths')
plt.legend((l1[-1], l2[-1]), ('Lasso', 'Elastic-Net'), loc='lower left')
plt.axis('tight')

plt.show()

__Q: Compared to the lasso, what is the effect of the elastic-net on the coefficients?__

### Fitting an elastic-net

In [None]:
# Parameters grid
alphas = np.logspace(-0.01, 10., num=15)
ratios = np.linspace(0.7, 1., num=4)

__Q: Define the elastic net model (call it `model_l1l2`) using the functions `ElasticNet` and `GridSearchCV`.__

In [None]:
model_l1l2.best_params_

### Interpretation

In [None]:
plt.figure(figsize = (6, 4))
plt.scatter(range(p), # x = SNP position
            model_l1l2.best_estimator_.coef_)  # y = regression weights

plt.xlabel("SNP")
plt.ylabel("elastic net regression weight")
plt.xlim([0, p])

In [None]:
selected_snps_enet = np.nonzero(model_l1l2.best_estimator_.coef_)[0]
print("%d SNPs selected," % selected_snps_enet.shape[0])

candidate_genes_hit = set([])
num_snps_in_candidate_genes = 0
for snp_idx in selected_snps_enet:
    for gene_id in genes_by_snp[snp_names[snp_idx]]:
        if gene_id in candidate_genes:
            candidate_genes_hit.add(gene_id)
            num_snps_in_candidate_genes += 1
            break

print("of which %d are in or near %d candidate genes" % (num_snps_in_candidate_genes, 
                                                          len(candidate_genes_hit)))

__Q: How can you interpret these results? How many SNPs contribute to explaining the phenotype?__

### Predictive power 

In [None]:
from sklearn import metrics

In [None]:
y_2W_l1l2_pred = model_l1l2.best_estimator_.predict(X_2W_te)

print("Fraction of variance explained (using %d SNPs): %.2f" % \
      (selected_snps_enet.shape[0], 
      metrics.explained_variance_score(y_2W_te, y_2W_l1l2_pred)))

In [None]:
plt.figure(figsize = (4, 4))
plt.scatter(y_2W_te, y_2W_l1_pred, alpha=0.7, label="lasso")
plt.scatter(y_2W_te, y_2W_l1l2_pred, alpha=0.7, label="enet")

plt.xlabel("true phenotype")
plt.ylabel("prediction")
plt.xlim([np.min(y_2W_te) - 5, np.max(y_2W_te) + 5])
plt.ylim([np.min(y_2W_te) - 5, np.max(y_2W_te) + 5])

# Diagonal y = x: points on this line are perfectly predicted
plt.plot([np.min(y_2W_te)-5, np.max(y_2W_te)+5], 
         [np.min(y_2W_te)-5, np.max(y_2W_te)+5], c="black")

plt.legend()