# Regularized Multinomial Logistic Regression for tagSNP Selection

Brian Ward  
brian@brianpward.net  
https://github.com/etnite

This notebook is one attempt at a method for identifying a minimal set of predictors (SNPs) which fully describe a non-ordinal categorical response (haplotypes) for a given region of the genome. My idea is to fit a (possibly multinomial) logistic regression with an L1 (lasso) penalty, which should (hopefully) shrink the coefficients for many SNPs to exactly 0.

There are many different methods for finding "tag SNPs" which can capture the variance explained by multiple surrounding SNPs. This problem is non-trivial and potentially NP-hard. See https://doi.org/10.1109/CSB.2005.22 for one example and some background on different methods. Note that much of this work was performed around the time of the first human HapMap project (ca. 2005), and hence there are many links to software on websites which no longer exist.

Many tag SNP identification algorithms are designed to find sets of tag SNPs across chromosomes/genomes, without any phenotypic data. I think that my particular use-case here is actually somewhat simpler, since I:

1. Have access to a "phenotype" (haplotype groups determined by a simple distance matrix)
2. Only need to focus on one region, obviating the need to find haploblock boundaries

At least that's what I tell myself - maybe this won't work at all. See https://chrisalbon.com/machine_learning/logistic_regression/logistic_regression_with_l1_regularization/ (scikit-learn) and http://alimanfoo.github.io/2017/06/14/read-vcf.html (scikit-allel) for some sources that have influenced my thinking, for better or worse.
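
To make the L1 idea from the opening paragraph concrete, here is a minimal sketch on synthetic data (not the notebook's VCF). The arrays `X_demo` and `y_demo` and the helper `count_selected` are hypothetical names introduced only for illustration. The response is built from the first two of 30 random SNP-like columns, and the sketch counts how many columns keep a non-zero coefficient under a strong versus a weak penalty.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a dosage matrix: 200 samples x 30 SNPs coded 0 or 2
X_demo = rng.choice([0, 2], size=(200, 30)).astype(float)

# Three haplotype-like classes that depend only on the first two SNPs
y_demo = (X_demo[:, 0] > 0).astype(int) + (X_demo[:, 1] > 0).astype(int)


def count_selected(C):
    """Fit an L1-penalized model and count SNPs with any non-zero coefficient."""
    model = LogisticRegression(penalty="l1", C=C, solver="saga",
                               max_iter=5000, random_state=0)
    model.fit(X_demo, y_demo)
    # coef_ has shape (n_classes, n_features); a SNP counts as selected
    # if its coefficient is non-zero for at least one class
    return int(np.sum(np.any(model.coef_ != 0, axis=0)))


# In scikit-learn, smaller C means a stronger penalty
n_strong = count_selected(0.01)
n_weak = count_selected(10.0)
print(n_strong, n_weak)
```

The strong-penalty fit should retain no more SNPs than the weak-penalty fit, which is the behaviour the rest of this notebook relies on.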

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold
from sklearn import preprocessing
import allel
import pandas as pd

## User-Defined Constants

Constants are as follows:

* Path to VCF (genotypic data) file
* Region of interest formatted chr:start_pos-end_pos
* .csv file of haplotypes - must contain at least two cols. - one for sample, one for the haplotype groupings
* response - string identifying the response variable in the haplotypes file
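* pen - L1 regularization parameter, passed as `C` to `LogisticRegression`. Note that `C` is the inverse of the regularization strength, so smaller values give a stronger penalty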
* n_reps - Number of times to repeat k-fold validation
* val_k - Number of folds per repetition of k-fold validation
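
Since a malformed region string is an easy mistake to make, a small check can be run before reading the VCF. This is only a sketch: `parse_region` is a hypothetical helper (not part of scikit-allel) that assumes the `chr:start_pos-end_pos` format described above.

```python
import re


def parse_region(region):
    """Split 'chr:start_pos-end_pos' into (chrom, start, end), checking the format."""
    match = re.fullmatch(r"([^:]+):(\d+)-(\d+)", region)
    if match is None:
        raise ValueError(f"Region is not in chr:start_pos-end_pos format: {region!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"Region start {start} is greater than end {end}")
    return chrom, start, end


chrom, start, end = parse_region("4B:526943215-598847646")
print(chrom, start, end)
```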

In [2]:
vcf_file = '/mnt/DATAPART2/brian/repos/manuscripts/manu_2020_Yr/input_data/geno/GAWN_KM_Yr_postimp_filt.vcf.gz'
reg = '4B:526943215-598847646'
haps_file = '/mnt/DATAPART2/brian/repos/manuscripts/manu_2020_Yr/temp/combined_graph_out_w_priors/haplotype_groupings.csv'
response = "4BL"
pen = 0.001
n_reps = 1
val_k = 5

## Read input

Here we read in the region of interest from the VCF file. We then convert the GT calls from the VCF into a genotype array (this is an intermediate step, and I'm currently not sure if it is required or not).

In [3]:
vcf = allel.read_vcf(vcf_file, region = reg)
preds = vcf['variants/ID']
gt = allel.GenotypeArray(vcf['calldata/GT'])
gt

Unnamed: 0,0,1,2,3,4,...,980,981,982,983,984,Unnamed: 12
0,0/0,1/1,0/0,1/1,0/0,...,1/1,1/1,0/0,0/0,0/0,
1,0/0,1/1,0/0,0/0,0/0,...,1/1,0/0,0/0,0/0,0/0,
2,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
...,...,...,...,...,...,...,...,...,...,...,...,...
424,1/1,0/0,1/1,0/0,1/1,...,0/0,0/0,1/1,1/1,1/1,
425,1/1,0/0,1/1,0/0,1/1,...,0/0,0/0,1/1,1/1,1/1,
426,1/1,0/0,1/1,0/0,1/1,...,0/0,0/0,1/1,1/1,1/1,


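Now we convert the genotype array into alternate allele dosage format (`to_n_alt()` counts alternate alleles, which are not necessarily the minor alleles), transpose it so that samples are in rows, convert it to a dataframe, set SNP IDs as column names, and add a column for the sample name.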

In [4]:
dos = gt.to_n_alt()
dos = dos.transpose()

dos_df = pd.DataFrame(dos)
dos_df.columns = vcf["variants/ID"]
dos_df["sample"] = vcf["samples"]
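
For intuition, the dosage conversion can be mimicked with plain NumPy. This is a sketch of my understanding of the default behaviour of `to_n_alt()`, in which missing calls are counted as 0; the toy array `gt_demo` is made up for illustration.

```python
import numpy as np

# Toy GT calls with shape (variants, samples, ploidy); -1 marks a missing allele
gt_demo = np.array([
    [[0, 0], [1, 1], [0, 1]],
    [[1, 1], [-1, -1], [0, 0]],
])

# Count alternate (non-reference) alleles per call; missing alleles contribute 0
dos_demo = (gt_demo > 0).sum(axis=2)

# Transpose so that rows are samples and columns are variants
dos_demo_t = dos_demo.T
print(dos_demo_t)
```

Under this default a missing call is indistinguishable from a homozygous reference call, so the conversion is safest on genotypes without missing data.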

Now import the haplotypes file, merge it with the dos_df dataframe, and set the response vector (y) and predictors array (X). We then standardize each predictor.

In [5]:
haps = pd.read_csv(haps_file)
haps = haps[["sample", response]]

merged = pd.merge(haps, dos_df, how = "inner", on = "sample")
y = merged[response]
X = merged[vcf["variants/ID"]]

X_std = preprocessing.scale(X)
X_std

array([[-1.14019606, -0.71251567, -0.33202381, ...,  1.00469512,
         1.01349415,  1.36287906],
       [ 0.92052526, -0.71251567, -0.33202381, ..., -1.01701007,
        -1.01555409, -0.74952987],
       [ 0.92052526, -0.71251567, -0.33202381, ..., -1.01701007,
        -1.01555409, -0.74952987],
       ...,
       [-1.14019606, -0.71251567, -0.33202381, ...,  1.00469512,
         1.01349415,  1.36287906],
       [ 0.92052526, -0.71251567, -0.33202381, ...,  1.00469512,
         1.01349415, -0.74952987],
       [ 0.92052526,  1.45361993, -0.33202381, ..., -1.01701007,
        -1.01555409, -0.74952987]])
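
As a sketch of what the standardization does, the same column-wise z-scoring can be written in NumPy. The assumptions here are that `preprocessing.scale` uses the population standard deviation (ddof = 0) and leaves zero-variance columns centered but unscaled; `X_toy` is a made-up matrix whose third column is monomorphic.

```python
import numpy as np

# Toy dosage matrix: 4 samples x 3 SNPs; the third SNP is monomorphic
X_toy = np.array([
    [0.0, 2.0, 0.0],
    [2.0, 2.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
])

col_mean = X_toy.mean(axis=0)
col_sd = X_toy.std(axis=0)  # population standard deviation (ddof = 0)

# Avoid dividing by zero for monomorphic SNPs: treat their scale as 1
col_sd_safe = np.where(col_sd == 0, 1.0, col_sd)

X_toy_std = (X_toy - col_mean) / col_sd_safe
print(X_toy_std)
```

Monomorphic SNPs end up as all-zero columns, so they carry no information for the regression and could be dropped beforehand.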

## Fit Logistic Regression

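Note - `max_iter` may need to be increased if the solver fails to converge. Also note that the model in this cell is fit on the raw dosage matrix `X`, not on the standardized `X_std` computed above.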

In [13]:
logit = LogisticRegression(penalty = "l1", C = pen, solver = 'saga', multi_class = 'auto',
                           max_iter = 5000, tol = 1e-5)
logit.fit(X, y)

LogisticRegression(C=0.005, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=5000.0,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=None, solver='saga', tol=1e-05, verbose=0,
                   warm_start=False)
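
Convergence failures show up as a `ConvergenceWarning` from scikit-learn, so they can be checked programmatically rather than by eye. The sketch below uses synthetic data and a hypothetical helper, `fit_and_check`; the deliberately tiny `max_iter` is only there to trigger the warning.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression


def fit_and_check(X, y, max_iter):
    """Fit an L1-penalized model and report whether the solver hit max_iter."""
    model = LogisticRegression(penalty="l1", C=1.0, solver="saga",
                               max_iter=max_iter, tol=1e-5, random_state=0)
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X, y)
    hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
    return model, hit_limit


# Synthetic data, not the notebook's genotypes
rng = np.random.default_rng(2)
X_conv = rng.choice([0, 2], size=(120, 8)).astype(float)
y_conv = (X_conv[:, 0] > 0).astype(int) + (X_conv[:, 1] > 0).astype(int)

_, hit_limit_short = fit_and_check(X_conv, y_conv, max_iter=1)
_, hit_limit_long = fit_and_check(X_conv, y_conv, max_iter=5000)
print(hit_limit_short, hit_limit_long)
```

If the flag is still set at a large `max_iter`, loosening `tol` or standardizing the predictors are the usual next things to try.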

Note that in the case of a multinomial logistic regression, each predictor will have n coefficients, where n is the number of categories of the response variable. (Other implementations of multinomial logistic regression will report n-1 coefficients per predictor). So, that makes interpretation a bit difficult. Here we are just finding the predictors which have non-zero coefficients after regularization.

In [14]:
# Sum absolute values so that coefficients of opposite sign across classes cannot cancel out
coef_sums = np.sum(np.abs(logit.coef_), axis = 0)
preds[coef_sums > 0]

array(['S4B_559751425', 'S4B_560876006'], dtype=object)
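
One subtlety when collapsing the per-class coefficients into a single value per SNP: a signed sum across classes can cancel to zero or go negative even when individual coefficients are non-zero. The toy matrix below is hypothetical (it only mimics the `(n_classes, n_features)` shape of `coef_`) and compares the two selection rules.

```python
import numpy as np

# Hypothetical coefficient matrix with 3 classes (rows) and 4 SNPs (columns)
coef_demo = np.array([
    [0.5, 0.0, 0.0, -0.2],
    [-0.5, 0.0, 0.3, -0.1],
    [0.0, 0.0, 0.0, 0.0],
])
snp_ids = np.array(["snp_a", "snp_b", "snp_c", "snp_d"])

# Rule 1: signed sum > 0 misses snp_a (cancels out) and snp_d (negative sum)
signed_pick = snp_ids[coef_demo.sum(axis=0) > 0]

# Rule 2: keep any SNP with a non-zero coefficient in at least one class
nonzero_pick = snp_ids[np.any(coef_demo != 0, axis=0)]

print(signed_pick, nonzero_pick)
```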

## K-Fold Cross-validation

Now we perform n repeats of k-fold cross-validation, and find the mean accuracy of the prediction (proportion of times the predicted response category matches the observed category).

In [15]:
score_arr = np.zeros((n_reps, val_k), dtype = float)
for i in range(n_reps):
    k_fold = KFold(n_splits = val_k, random_state = i, shuffle = True)
    score_arr[i,:] = cross_val_score(logit, X, y, cv = k_fold, scoring = 'accuracy')

In [16]:
np.mean(score_arr)

0.6700507614213198
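
A mean accuracy is easier to interpret next to a majority-class baseline, since with unbalanced haplotype groups a model can score well by always predicting the largest group. The sketch below uses synthetic data, with scikit-learn's `RepeatedStratifiedKFold` standing in for the manual loop above and `DummyClassifier` as the baseline. This is one possible setup, not the procedure used for the result above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data: three classes determined by the first two of ten SNPs
rng = np.random.default_rng(1)
X_cv = rng.choice([0, 2], size=(150, 10)).astype(float)
y_cv = (X_cv[:, 0] > 0).astype(int) + (X_cv[:, 1] > 0).astype(int)

# 3 repeats of stratified 5-fold CV; each class keeps its proportion in every fold
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

model = LogisticRegression(penalty="l1", C=1.0, solver="saga", max_iter=5000)
model_scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")

# Baseline that always predicts the most frequent class in the training fold
baseline = DummyClassifier(strategy="most_frequent")
baseline_scores = cross_val_score(baseline, X_cv, y_cv, cv=cv, scoring="accuracy")

print(model_scores.mean(), baseline_scores.mean())
```

The gap between the two means is the part of the accuracy that actually comes from the SNPs.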