# phenotype prediction case study

The humble fruit fly, *Drosophila melanogaster*, is one of the most important and well-studied model organisms in biological and biomedical research.

Early research using the fruit fly helped to establish the basic 'rules' of genetics and inheritance, and produced foundational knowledge about how mutations occur.

The fruit fly has been used extensively to learn about neural biology, including neurodevelopment and the origins of neurological disorders important for human health, such as Alzheimer's and Parkinson's disease.

The close historical association between fruit fly and human populations led to the use of the fruit fly as a model for studying early human migrations, including understanding how humans may have adapted to their local environments as they migrated out of Africa to colonize the globe.

The fruit fly was one of the first animals used to extensively study the links between genetic variation and differences in phenotypes at a whole-genome scale. In 2012, a public data bank of ~200 fully genome-sequenced 'reference' fly lines was made available for use in a wide variety of genome-phenotype association experiments, with the results of all experiments made freely available to the public through the Drosophila Genetic Reference Panel (DGRP), now at the [dgrp2 website](http://dgrp.gnets.ncsu.edu/).

In this exercise, we will develop a neural-network model for predicting a fly's 'longevity' (normalized lifespan) from >17,000 genomic mutations (SNPs) dispersed along the fruit fly's two main autosomal chromosomes.

## data download and organization

The genomic SNP data and associated longevity phenotypes are available for 182 fly lines from the DGRP. We have made a copy of these data available in comma-separated value (.csv) format for this course.

Comma-separated value format is a very simple text file format, in which all the SNPs from a single fly line are stored as a "row" of data, along with the line's identifier and the associated longevity value. Each data field (column) is separated by a comma delimiter (",").

Fortunately for us, we don't have to write the code to parse the .csv data file. We'll use the "pandas" Python library to parse the data file for us.

We'll need to import the "pandas" library, and then use the

    pandas.read_csv

function to read the data into a "dataframe" object.

We'll print the first few lines of the dataframe object using the

    dataframe.head()

method call, so we can see what the data file looks like without having to read through all 182 rows and >17,000 columns of data.
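To try these two calls before downloading the full file, here is a minimal sketch that parses a tiny CSV string laid out like the course data. The three SNP columns and the two rows are made up for illustration; only the column naming pattern (SID, SNP..., LS) follows the real file.

```python
import io
import pandas

# Hypothetical miniature of the data file: one row per fly line,
# an identifier (SID), a few SNP columns, and the longevity value (LS).
csv_text = """SID,SNP0,SNP1,SNP2,LS
S0,0,1,1,46.83
S1,0,0,1,22.67
"""

# read_csv accepts any file-like object, so we wrap the string in StringIO.
toy = pandas.read_csv(io.StringIO(csv_text))

print(toy.shape)          # (rows, columns)
print(list(toy.columns))  # column names taken from the header row
print(toy.head())         # first rows of the dataframe
```

The same calls work unchanged when the argument to `read_csv` is a file path or URL instead of an in-memory string.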

In [1]:
import pandas
dataframe = pandas.read_csv('https://raw.githubusercontent.com/bryankolaczkowski/ALS3200C/main/phenopred.data.csv')
dataframe.head()

Unnamed: 0,SID,SNP0,SNP1,SNP2,SNP3,SNP4,SNP5,SNP6,SNP7,SNP8,SNP9,SNP10,SNP11,SNP12,SNP13,SNP14,SNP15,SNP16,SNP17,SNP18,SNP19,SNP20,SNP21,SNP22,SNP23,SNP24,SNP25,SNP26,SNP27,SNP28,SNP29,SNP30,SNP31,SNP32,SNP33,SNP34,SNP35,SNP36,SNP37,SNP38,...,SNP17126,SNP17127,SNP17128,SNP17129,SNP17130,SNP17131,SNP17132,SNP17133,SNP17134,SNP17135,SNP17136,SNP17137,SNP17138,SNP17139,SNP17140,SNP17141,SNP17142,SNP17143,SNP17144,SNP17145,SNP17146,SNP17147,SNP17148,SNP17149,SNP17150,SNP17151,SNP17152,SNP17153,SNP17154,SNP17155,SNP17156,SNP17157,SNP17158,SNP17159,SNP17160,SNP17161,SNP17162,SNP17163,SNP17164,LS
0,S0,0,0,0,1,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,1,1,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,46.83
1,S1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,22.67
2,S2,0,0,0,0,0,0,1,1,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,...,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,45.55
3,S3,0,0,0,1,1,0,1,1,0,0,1,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,34.45
4,S4,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,39.15


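As you can see from the output, each row describes one fly line: the first column is the row index, "SID" is the line's identifier, "SNP0" through "SNP17164" hold the 17,165 SNP values coded as 0 or 1, and the final "LS" column holds the line's longevity value.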

In [2]:
train_dataframe = dataframe.sample(frac=0.8, random_state=402201)
valid_dataframe = dataframe.drop(train_dataframe.index)
print(train_dataframe.shape, valid_dataframe.shape)

(146, 17167) (36, 17167)
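The split above draws 80% of the rows at random for training and keeps the remaining rows for validation by dropping the sampled index labels. A minimal sketch on a made-up 10-row dataframe (the column name `value` and the seed are arbitrary choices for this example) suggests how to check that the two pieces are disjoint and together cover every row:

```python
import pandas

# Made-up dataframe with 10 rows, for illustration only.
toy = pandas.DataFrame({'value': range(10)})

# Sample 80% of the rows; random_state makes the draw reproducible.
toy_train = toy.sample(frac=0.8, random_state=0)
# Every row that was not sampled becomes the validation set.
toy_valid = toy.drop(toy_train.index)

print(toy_train.shape, toy_valid.shape)

# No index label should appear in both pieces.
overlap = set(toy_train.index) & set(toy_valid.index)
print(overlap)
```

For the real data, 80% of 182 fly lines comes out to the 146 training rows and 36 validation rows printed above.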


In [3]:
snp_ids = [ x for x in dataframe.columns if x.find('SNP') == 0]
train_x = train_dataframe[snp_ids].to_numpy()
valid_x = valid_dataframe[snp_ids].to_numpy()
print(train_x.shape, valid_x.shape)

(146, 17165) (36, 17165)
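The expression `x.find('SNP') == 0` is true exactly when a column name begins with `SNP`, so the filter keeps the SNP columns and leaves out the identifier and longevity columns. The sketch below, using a short made-up list of column names, checks that `str.startswith` makes the same selection and may read more clearly:

```python
# Hypothetical column names in the same pattern as the data file.
columns = ['SID', 'SNP0', 'SNP1', 'SNP2', 'LS']

# Filter used in the cell above: find() returns 0 when the match is at the start.
by_find = [x for x in columns if x.find('SNP') == 0]
# Equivalent filter using startswith().
by_startswith = [x for x in columns if x.startswith('SNP')]

print(by_find)
print(by_find == by_startswith)
```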


In [4]:
train_y = train_dataframe['LS'].to_numpy()
valid_y = valid_dataframe['LS'].to_numpy()
print(train_y.shape, valid_y.shape)

(146,) (36,)


In [5]:
import tensorflow as tf

train_data = tf.data.Dataset.from_tensor_slices((train_x,train_y)).batch(10)
valid_data = tf.data.Dataset.from_tensor_slices((valid_x,valid_y)).batch(36)
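Batching groups consecutive examples so that the model sees several fly lines per update. As a rough sketch in plain Python (not TensorFlow code), the batch counts implied by the sizes above can be worked out as follows, assuming the default `batch` behaviour in which the final, smaller batch is kept rather than dropped:

```python
import math

n_train, n_valid = 146, 36        # row counts printed earlier
train_batch, valid_batch = 10, 36 # batch sizes used in the cell above

# Number of batches in one pass over each dataset.
train_batches = math.ceil(n_train / train_batch)
valid_batches = math.ceil(n_valid / valid_batch)

# Size of the last training batch, which holds the leftover rows.
last_train_batch = n_train - (train_batches - 1) * train_batch

print(train_batches, last_train_batch, valid_batches)
```

Under that assumption, each training epoch makes 15 updates (14 full batches of 10 and one batch of 6), while the validation data is evaluated as a single batch of 36.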


In [None]:
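# A linear model: heavy dropout on the SNP inputs, then a single output unit
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.InputLayer(input_shape=[17165]))
model.add(tf.keras.layers.Dropout(rate=0.98))
model.add(tf.keras.layers.Dense(units=1))
model.compile(optimizer=tf.keras.optimizers.RMSprop(),
              loss=tf.keras.losses.MeanAbsoluteError())
model.summary()

# Train on the batched training data, reporting validation loss each epoch
model.fit(train_data, epochs=1000, validation_data=valid_data)

# Predict longevity for the training and validation fly lines
train_y_hat = model.predict(train_x)
valid_y_hat = model.predict(valid_x)

# Plot predicted against observed longevity; the line marks perfect prediction
import matplotlib.pyplot as plt
plt.plot([10,60],[10,60])
plt.scatter(train_y, train_y_hat, marker='o', label='training')
plt.scatter(valid_y, valid_y_hat, marker='+', label='validation')
plt.xlabel('observed longevity')
plt.ylabel('predicted longevity')
plt.legend()
plt.show()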
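The model has a single dense unit, so it computes a weighted sum of the SNP values plus a bias, and it is scored by mean absolute error: the average of |y - y_hat| over the fly lines. A minimal NumPy sketch shows both the parameter count and the loss calculation. The prediction values below are made up for illustration and are not results from the model.

```python
import numpy as np

# One weight per SNP input plus one bias term for the single output unit.
# The dropout layer has no trainable parameters.
n_snps = 17165
n_parameters = n_snps * 1 + 1
print(n_parameters)

# Made-up longevity values and predictions, for illustration only.
y_true = np.array([46.83, 22.67, 45.55, 34.45])
y_pred = np.array([40.0, 30.0, 45.0, 35.0])

# Mean absolute error: the average size of the prediction errors.
mae = np.mean(np.abs(y_true - y_pred))
print(float(mae))
```

Because the loss is in the same units as the longevity values, it can be read directly as the typical size of a prediction error.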