In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import allel
import pandas as pd

This notebook is one attempt at a method for identifying a minimal set of predictors (SNPs) that fully describes a non-ordinal categorical response (haplotype groups) for a given region of the genome. My idea is to fit a (possibly multinomial) logistic regression with L1 regularization, which should (hopefully) shrink the coefficients of many SNPs to exactly 0, leaving only the informative SNPs with nonzero coefficients.
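The idea can be sketched on synthetic data before touching the real genotypes. Everything in this block is an assumption made for illustration (the toy dosage matrix, the rule that two SNPs define the haplotype groups, and the value of C); it is not the data or the tuning used later in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical toy data: 300 samples x 40 SNPs, homozygous dosages coded 0 or 2.
n_samples, n_snps = 300, 40
X_toy = rng.choice([0, 2], size=(n_samples, n_snps))

# Assumption for illustration: four haplotype groups fully determined by SNPs 0 and 1.
y_toy = X_toy[:, 0] // 2 + X_toy[:, 1]

# In scikit-learn, C is the inverse of the regularization strength,
# so a smaller C means a stronger L1 penalty and more zero coefficients.
model = LogisticRegression(penalty="l1", C=0.05, solver="saga", max_iter=5000)
model.fit(X_toy, y_toy)

# A SNP counts as selected if any class has a nonzero coefficient for it.
selected = np.flatnonzero(np.any(model.coef_ != 0, axis=0))
print(f"{selected.size} of {n_snps} SNPs kept:", selected)
```

If the approach behaves as hoped, the two informative SNPs should be among those kept while most of the noise SNPs are dropped. How many noise SNPs survive depends on C.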

There are many different methods for finding "tag SNPs" which can capture the variance explained by multiple surrounding SNPs. This problem is non-trivial and potentially NP-hard. See https://doi.org/10.1109/CSB.2005.22 for one example and some background on different methods. Note that much of this work was performed around the time of the first human HapMap project (ca. 2005), and hence there are many links to software on websites which no longer exist.

Many tag SNP identification algorithms are designed to find sets of tag SNPs across chromosomes/genomes, without any phenotypic data. I think that my particular use-case here is actually somewhat simpler, since I:

1. Have access to a "phenotype" (haplotype groups determined by a simple distance matrix)
2. Only need to focus on one region, obviating the need to find haploblock boundaries

At least that's what I tell myself - maybe this won't work at all. See https://chrisalbon.com/machine_learning/logistic_regression/logistic_regression_with_l1_regularization/ and http://alimanfoo.github.io/2017/06/14/read-vcf.html for some sources on scikit-learn and scikit-allel functionality that have influenced my thinking, for better or worse.

In [40]:
## User-defined constants
vcf_file = '/mnt/DATAPART2/brian/repos/manuscripts/manu_2020_Yr/input_data/geno/GAWN_KM_Yr_postimp_filt.vcf.gz'
reg = '4B:526943215-598847646'

haps_file = '/mnt/DATAPART2/brian/repos/manuscripts/manu_2020_Yr/temp/combined_graph_out_w_priors/haplotype_groupings.csv'

# Passed as C to LogisticRegression, i.e. the inverse of the regularization strength
pen = 0.01

## Read input

Here we read in the region of interest from the VCF file. We then wrap the GT calls from the VCF in a scikit-allel GenotypeArray. This intermediate step may not be strictly required, but GenotypeArray provides the to_n_alt() method used in the next step.

In [16]:
vcf = allel.read_vcf(vcf_file, region = reg)
gt = allel.GenotypeArray(vcf['calldata/GT'])
gt


Unnamed: 0,0,1,2,3,4,...,980,981,982,983,984,Unnamed: 12
0,0/0,1/1,0/0,1/1,0/0,...,1/1,1/1,0/0,0/0,0/0,
1,0/0,1/1,0/0,0/0,0/0,...,1/1,0/0,0/0,0/0,0/0,
2,0/0,0/0,0/0,0/0,0/0,...,0/0,0/0,0/0,0/0,0/0,
...,...,...,...,...,...,...,...,...,...,...,...,...
424,1/1,0/0,1/1,0/0,1/1,...,0/0,0/0,1/1,1/1,1/1,
425,1/1,0/0,1/1,0/0,1/1,...,0/0,0/0,1/1,1/1,1/1,
426,1/1,0/0,1/1,0/0,1/1,...,0/0,0/0,1/1,1/1,1/1,


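Now we convert the genotype array into alternate allele dosage format (to_n_alt() counts alternate alleles per call, which is not necessarily the minor allele), transpose it so that samples are rows, convert it to a dataframe, set SNP IDs as column names, and add a column for the sample name.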

In [29]:
dos = gt.to_n_alt()
dos = dos.transpose()

dos_df = pd.DataFrame(dos)
dos_df.columns = vcf["variants/ID"]
dos_df["sample"] = vcf["samples"]
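As a rough illustration of what the dosage conversion does, here is a plain NumPy sketch on a made-up genotype array. The toy calls are hypothetical, and the treatment of missing calls (coded -1 and counted as dosage 0) reflects my understanding of the default behavior of to_n_alt(), so treat that part as an assumption.

```python
import numpy as np

# Toy genotype calls with shape (n_variants, n_samples, ploidy).
# 0 = reference allele, 1 = alternate allele, -1 = missing.
gt_toy = np.array([
    [[0, 0], [0, 1], [1, 1]],
    [[1, 1], [-1, -1], [0, 0]],
])

# Count alternate alleles per call by summing over the ploidy axis.
n_alt = (gt_toy > 0).sum(axis=2)

# Transpose so that rows are samples and columns are variants,
# which is the layout expected for a predictor matrix.
dosage = n_alt.T
print(dosage)
```

Note that a missing call ends up indistinguishable from a homozygous reference call here, which may or may not matter for imputed data.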

Now import the haplotypes file, merge it with the dos_df dataframe on the sample name, and set the response (y) and predictors (X).

In [39]:
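haps = pd.read_csv(haps_file)
# Keep only the sample name and the haplotype grouping for the region of interest
haps = haps[["sample", "4BL"]]

# The inner merge keeps only samples present in both tables
merged = pd.merge(haps, dos_df, how = "inner", on = "sample")
y = merged["4BL"]                # response: haplotype group
X = merged[vcf["variants/ID"]]   # predictors: SNP dosages
X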

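| | S4B_526943216 | S4B_526967202 | S4B_526967217 | S4B_526967245 | S4B_527670639 | S4B_527816960 | S4B_527979736 | S4B_528176349 | S4B_528176373 | S4B_528520372 | ... | S4B_597861627 | S4B_597985902 | S4B_598049229 | S4B_598220152 | S4B_598220225 | S4B_598404019 | S4B_598612556 | S4B_598679527 | S4B_598794343 | S4B_598847646 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 980 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 |
| 981 | 2 | 2 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | ... | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 982 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 |
| 983 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 2 | 0 | 2 | 2 | 2 | 0 |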
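Because an inner merge silently drops any sample that appears in only one of the two tables, it may be worth checking which samples fail to match. The following is a minimal sketch on hypothetical toy frames (the sample names, SNP names, and haplotype labels are invented), using the indicator option of pd.merge.

```python
import pandas as pd

# Hypothetical stand-ins for the haplotype table and the dosage table.
haps_toy = pd.DataFrame({"sample": ["s1", "s2", "s3"],
                         "4BL": ["hapA", "hapB", "hapA"]})
dos_toy = pd.DataFrame({"sample": ["s1", "s2", "s4"],
                        "snp_1": [0, 2, 2],
                        "snp_2": [2, 0, 0]})

# An outer merge with indicator=True adds a "_merge" column that records
# whether each row came from the left table, the right table, or both.
check = pd.merge(haps_toy, dos_toy, how="outer", on="sample", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "sample"].tolist()
print("Samples missing from one table:", unmatched)

# The inner merge used for modelling keeps only the matched samples.
merged_toy = pd.merge(haps_toy, dos_toy, how="inner", on="sample")
y_toy = merged_toy["4BL"]
X_toy = merged_toy[["snp_1", "snp_2"]]
```

Taking y and X from the same merged frame, as done above, guarantees that their rows stay aligned.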


## L1 Coefficients



In [45]:
#score_arr = np.zeros((n_reps, val_k), dtype = float)

'''
lasso = Lasso(alpha = best_params['alpha'],
              selection = 'random',
              max_iter = 1e6,
              tol = 0.001)
'''
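# C is the inverse of the regularization strength: smaller values penalize more strongly.
# Note that this overrides the value of pen set in the user-defined constants.
pen = 0.1
logit = LogisticRegression(penalty = "l1", C = pen, solver = 'saga', multi_class = 'auto', max_iter = 5000)
logit.fit(X, y)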

#for i in range(0, n_reps):
#    k_fold = KFold(n_splits = val_k, random_state = i, shuffle = True)
#    score_arr[i,:] = cross_val_score(logit, X, y, cv = k_fold, scoring = met_prime)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=5000,
                   multi_class='auto', n_jobs=None, penalty='l1',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
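The commented-out loop in the cell above hints at repeated k-fold cross-validation. A minimal runnable sketch of that pattern follows, on synthetic data. The values of n_reps and val_k, the use of "accuracy" in place of met_prime, and the toy data are all assumptions, since none of them are defined in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)

# Hypothetical toy data standing in for X and y: 200 samples x 20 SNPs
# (dosages 0 or 2), with four classes defined by the first two SNPs.
X_toy = rng.choice([0, 2], size=(200, 20))
y_toy = X_toy[:, 0] // 2 + X_toy[:, 1]

# Assumed values; n_reps, val_k and the metric are not defined in this notebook.
n_reps, val_k, metric = 3, 5, "accuracy"

model = LogisticRegression(penalty="l1", C=0.1, solver="saga", max_iter=5000)
score_arr = np.zeros((n_reps, val_k), dtype=float)

for i in range(n_reps):
    # A different seed per repetition gives a different shuffled split each time.
    k_fold = KFold(n_splits=val_k, shuffle=True, random_state=i)
    score_arr[i, :] = cross_val_score(model, X_toy, y_toy, cv=k_fold, scoring=metric)

# Mean score per repetition
print(score_arr.mean(axis=1))
```

Running the same loop over several values of C would be one way to choose the penalty, trading prediction accuracy against the number of SNPs retained.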

In [50]:
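# coef_ has one row per haplotype class and one column per SNP;
# the comparison gives a boolean mask of the coefficients not shrunk to zero
test = logit.coef_
test != 0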

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False,  True, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
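A boolean mask like the one above can be collapsed into a list of candidate tag SNPs. The sketch below uses a small made-up coefficient matrix and invented SNP names. With the real model, logit.coef_ and the SNP IDs from the VCF would take their place.

```python
import numpy as np

# Hypothetical coefficient matrix: 3 classes (rows) x 4 SNPs (columns).
coef_toy = np.array([
    [0.0,  0.0, 1.2, 0.0],
    [0.0, -0.7, 0.0, 0.0],
    [0.0,  0.0, 0.0, 0.0],
])
snp_ids_toy = np.array(["snp_a", "snp_b", "snp_c", "snp_d"])

# Keep a SNP if it has a nonzero coefficient for at least one class.
keep = np.any(coef_toy != 0, axis=0)
selected_snps = snp_ids_toy[keep].tolist()

# Number of SNPs with a nonzero coefficient within each class.
n_per_class = (coef_toy != 0).sum(axis=1)

print(selected_snps)
print(n_per_class)
```

A class whose row is entirely zero is predicted from the intercept alone, which may indicate that the penalty is too strong to separate that haplotype group.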