In [None]:
import sys
sys.path.append('../src') 

# Importing libraries
import pandas as pd
import numpy as np

# Libraries for machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Importing the project's ETL helpers (found via the '../src' path added above)
import etl

import warnings
warnings.filterwarnings('ignore')

## Disease Risk Classification Model

The purpose of this notebook is to explore and tune a variety of models to determine which works best for disease risk classification. According to the article "Comparing different supervised machine learning algorithms for disease prediction" by Shahadat Uddin et al. (https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-019-1004-8), Support Vector Machines, Naive Bayes, and Random Forest are among the most common machine learning algorithms applied to disease prediction. We will explore these along with other algorithms, such as Logistic Regression and k-Nearest Neighbors, in this notebook.

Let's get straight into it and simulate a data set of 10,000 individuals.

In [None]:
simulated_gwas_fp = '../testdata/gwas/gwas_simulate.tsv' 
maf_fp = '../references/snp_mafs.txt'
simulated_data = etl.simulate_data(simulated_gwas_fp, maf_fp, 10000)
simulated_data.head()

We will not be using the simulated data above, as is, for building the model. The class label (disease risk category) was assigned to each individual from the weighted sum of exactly the SNPs in this data set, so a machine learning model given all of these SNPs could recover the labeling rule too easily. Essentially, the above data set is a simulated 'ground truth'. In real machine learning problems you rarely observe every variable that determines the label, so the model should only see part of this information.

As a solution, we will make classifications based on a subset of the SNPs in the above data set. To do this, we will use a separate GWAS to tell us which SNPs are most important in predicting disease. Only SNPs that appear in both the above data set and the new GWAS will be kept, and the model will be trained on this subset.
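To make the reasoning above concrete, here is a minimal sketch on toy data. Everything in it is an assumption for illustration: the allele frequencies, the effect sizes, and the median-split labeling rule are made up and are not what `etl.simulate_data` does. It only shows that a score built from every SNP reproduces the labels perfectly, while a score built from a subset cannot.

```python
import numpy as np

rng = np.random.default_rng(0)

n_individuals, n_snps = 2000, 20
mafs = rng.uniform(0.05, 0.5, size=n_snps)    # hypothetical minor allele frequencies
weights = rng.normal(0.0, 1.0, size=n_snps)   # hypothetical per-SNP effect sizes

# Genotypes: number of minor alleles (0, 1, or 2) per individual and SNP
genotypes = rng.binomial(2, mafs, size=(n_individuals, n_snps))

# Toy "ground truth": high risk if the full weighted sum exceeds its median
full_score = genotypes @ weights
labels = (full_score > np.median(full_score)).astype(int)


def threshold_accuracy(score, labels):
    """Accuracy of predicting the label by thresholding a score at its median."""
    predictions = (score > np.median(score)).astype(int)
    return (predictions == labels).mean()


# With every SNP available, the labeling rule is recovered exactly
acc_full = threshold_accuracy(full_score, labels)

# With only half of the SNPs, part of the signal is unobserved
subset_idx = np.arange(n_snps // 2)
subset_score = genotypes[:, subset_idx] @ weights[subset_idx]
acc_subset = threshold_accuracy(subset_score, labels)

print(f"all SNPs: {acc_full:.3f}, subset of SNPs: {acc_subset:.3f}")
```

The gap between the two accuracies is what makes the subset a more realistic modeling problem.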

Let's load in the other GWAS data and filter the above data.

In [None]:
model_gwas_fp = '../testdata/gwas/gwas_model.tsv' 
model_data = pd.read_csv(model_gwas_fp, sep='\t')
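# Keep only the SNPs present in both the simulated data and the model GWAS;
# sort them so the column order is reproducible across runs
subset = set(simulated_data.columns).intersection(model_data['SNPS'].unique())
new_columns = sorted(subset) + ['Class']

data = simulated_data[new_columns]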

Now let's split the above data into a training set and a test set.

In [None]:
X = data.drop('Class', axis=1)
y = data['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)
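If the risk categories turn out to be imbalanced, a plain random split can leave the training and test sets with slightly different class proportions. The sketch below, on made-up data, shows one way to check this and to keep the proportions aligned using the `stratify` argument of `train_test_split`. The column names, class names, and class frequencies are hypothetical and are not taken from the simulated data above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)

# Toy genotype table with a hypothetical, imbalanced three-level class label
toy = pd.DataFrame(
    rng.integers(0, 3, size=(400, 5)),
    columns=[f"snp_{i}" for i in range(5)],
)
toy["Class"] = rng.choice(["low", "medium", "high"], size=400, p=[0.6, 0.3, 0.1])

X_toy = toy.drop("Class", axis=1)
y_toy = toy["Class"]

# stratify=y_toy keeps the class proportions (nearly) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=10, stratify=y_toy
)

# Compare the class proportions side by side
balance = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "test": y_te.value_counts(normalize=True),
})
print(balance)
```

Whether stratification is needed here depends on how the classes are distributed in `data['Class']`, which can be inspected with `value_counts()` in the same way.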

Let's start simple and work with Logistic Regression.

In [None]:
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Mean accuracy on the held-out test set
accuracy_lr = lr.score(X_test, y_test)
print(f'The accuracy for the logistic regression model is {accuracy_lr}')
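An accuracy figure is easier to interpret next to a trivial baseline. As a rough sketch on toy data, the block below compares logistic regression with scikit-learn's `DummyClassifier`, which always predicts the most frequent training class. The features, weights, noise level, and 30% positive rate are all assumptions made for illustration and say nothing about the results on the real subset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(10)

# Toy genotypes (0, 1, or 2) and hypothetical fixed effect sizes
X_toy = rng.integers(0, 3, size=(1000, 8))
toy_weights = np.array([1.0, -1.0, 0.8, -0.8, 0.6, -0.6, 0.4, -0.4])
score = X_toy @ toy_weights + rng.normal(0.0, 0.5, size=1000)

# Label roughly the top 30% of scores as the positive (high-risk) class
y_toy = (score > np.quantile(score, 0.7)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=10
)

# Baseline: always predict the majority class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

acc_baseline = baseline.score(X_te, y_te)
acc_model = model.score(X_te, y_te)
print(f"baseline: {acc_baseline:.3f}, logistic regression: {acc_model:.3f}")
```

The same comparison could be run on `X_train`, `y_train`, `X_test`, and `y_test` from this notebook to see how much the model actually adds over guessing the majority class.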