# Autoencoding Gene Expressions

In this notebook I build an autoencoder that takes the expression values of ~19,000 protein-coding genes as input, encodes them into a much smaller number of dimensions (which I vary), and then decodes them back to the original input width.

I first split the data into training (67%) and testing (33%) sets and then normalize each sample to unit L2 norm. Next, I build an autoencoder with 7 layers (counting the input layer) and 50 dimensions in the bottleneck layer. I use `adam` as the optimizer and `binary_crossentropy` as the loss function to compare the decoder's output with the original input. The model reaches a loss of 0.0041 on the testing set, which suggests a fairly high-quality compression by the encoder.

I then experiment with different bottleneck sizes (10, 20, 30, 40, 60, and 80) to document the change in loss. As expected, the lowest loss on the testing set occurs when the bottleneck layer is widest, but the difference is very small. A narrower bottleneck is expected to give a higher loss, because forcing the data into fewer dimensions discards more information.

Next, I vary the depth of the network to see its impact on loss. The model with a single hidden layer converges later and at a higher loss than the model with 15 hidden layers. However, the shallow model takes considerably less time to train.

Using autoencoders on gene expression data can help us cluster different types of cancer, since the model forces the data into a lower-dimensional representation, analogous to more traditional unsupervised learning algorithms. It can also help generate synthetic data, which is very valuable given the relatively high cost of RNA sequencing and the numerous barriers researchers face in accessing the resources needed to collect and construct organic datasets.

In [100]:
from keras.layers import Input, Dense
from keras.models import Model
from keras import optimizers
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.preprocessing import StandardScaler, Normalizer

## Load Data

In [3]:
nt_coding = pd.read_csv('data/nt.coding.csv')
# Drop the 'Type' column; it is not used as an autoencoder input.
nt_coding.drop('Type', axis=1, inplace=True)

Normalize each sample to unit L2 norm, which (for non-negative expression values) places every value in the [0, 1] range:

In [40]:
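# Hold out 33% of the samples for testing; the fixed seed makes the split reproducible.
X_train, X_test = train_test_split(nt_coding, test_size=0.33, random_state=1)

# Normalizer rescales each row independently (default norm='l2'),
# so fitting it learns nothing from the training set.
scaler = Normalizer()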

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
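
As a sanity check on what `Normalizer` does here: with its default `norm='l2'` it rescales each sample (row) to unit Euclidean length. The sketch below reproduces that behaviour with plain NumPy on a small made-up matrix. The helper name `l2_normalize_rows` and the toy values are illustrative assumptions, not part of the original notebook or of `nt.coding.csv`.

```python
import numpy as np


def l2_normalize_rows(X):
    """Scale each row of X to unit L2 norm (all-zero rows are left unchanged)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid dividing an all-zero row by zero
    return X / norms


# Toy, non-negative "expression" matrix: 3 samples x 4 genes (made-up values).
toy = np.array([[3.0, 4.0, 0.0, 0.0],
                [1.0, 1.0, 1.0, 1.0],
                [0.0, 0.0, 0.0, 5.0]])

toy_normalized = l2_normalize_rows(toy)
row_norms = np.linalg.norm(toy_normalized, axis=1)

print(toy_normalized)
print(row_norms)  # every row now has length 1
```

Because every row is divided by its own norm, non-negative inputs end up in [0, 1], which is what the sigmoid output layer and the binary cross-entropy loss used later require.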

## Building the network

In [115]:
input_layer = Input(shape=(X_train.shape[1],))
encoded = Dense(256, activation='relu')(input_layer)
encoded = Dense(128, activation='relu')(encoded)
encoded = Dense(50, activation='relu')(encoded)

decoded = Dense(128, activation='relu')(encoded)
decoded = Dense(256, activation='relu')(decoded)
decoded = Dense(X_train.shape[1], activation='sigmoid')(decoded)

In [116]:
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
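
The same encoder/decoder stack is rebuilt by hand several times in the experiments that follow. One way to avoid the repetition, sketched here under the assumption that the decoder always mirrors the encoder, is to generate the layers from a list of widths. `symmetric_widths` and `build_autoencoder` are hypothetical helpers, not part of the original notebook, and the input width of 19,000 is taken from the approximate figure in the introduction.

```python
def symmetric_widths(input_dim, encoder_widths):
    """Return the Dense layer widths of a mirrored autoencoder.

    encoder_widths lists the encoder layers in order, ending with the bottleneck.
    """
    decoder_widths = list(reversed(encoder_widths[:-1]))
    return list(encoder_widths) + decoder_widths + [input_dim]


def build_autoencoder(input_dim, encoder_widths):
    """Build and compile a dense autoencoder (hypothetical helper)."""
    # Imported here so that symmetric_widths can be used without Keras installed.
    from keras.layers import Input, Dense
    from keras.models import Model

    widths = symmetric_widths(input_dim, encoder_widths)
    input_layer = Input(shape=(input_dim,))
    x = input_layer
    for width in widths[:-1]:
        x = Dense(width, activation='relu')(x)
    # The final layer restores the input width; sigmoid keeps outputs in (0, 1).
    output = Dense(widths[-1], activation='sigmoid')(x)

    model = Model(input_layer, output)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model


# Widths of the network defined above, assuming 19,000 input genes.
widths = symmetric_widths(19000, [256, 128, 50])
print(widths)  # [256, 128, 50, 128, 256, 19000]
```

With this helper, a call such as `build_autoencoder(X_train.shape[1], [256, 128, size])` would stand in for the layer definitions inside the bottleneck loop.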

In [117]:
autoencoder.fit(X_train, X_train,
                epochs=20,
                batch_size=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fa52b1b3400>

## Evaluate on the testing set

In [73]:
loss = autoencoder.evaluate(X_test, X_test, verbose=2)

15/15 - 0s - loss: 0.0041
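
One caveat when reading this number: when the targets are not exactly 0 or 1, binary cross-entropy does not reach zero even for a perfect reconstruction. Its minimum, attained when the prediction equals the target, is the mean binary entropy of the targets. The NumPy sketch below computes both quantities on made-up values. `binary_crossentropy` here is a hand-written stand-in rather than the Keras implementation, and the clipping constant `eps` is an assumption.

```python
import numpy as np


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean element-wise binary cross-entropy; eps is an assumed clipping constant."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_element = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(per_element))


def loss_floor(y_true, eps=1e-7):
    """Loss of a perfect reconstruction, i.e. the lowest value the loss can take."""
    return binary_crossentropy(y_true, y_true, eps)


# Made-up targets: small positive values, similar in spirit to L2-normalized rows.
rng = np.random.default_rng(0)
targets = rng.uniform(0.0, 0.02, size=(5, 100))

perfect = loss_floor(targets)
noisy = binary_crossentropy(targets, targets + 0.005)

print(f"floor (perfect reconstruction): {perfect:.5f}")
print(f"loss with a +0.005 error:       {noisy:.5f}")
```

Evaluating `loss_floor(X_test)` and comparing it with the reported 0.0041 would show how much of that loss is actual reconstruction error.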


## Experimenting with different bottleneck sizes

In [76]:
loss_dict = dict()
del autoencoder, encoded, decoded
for size in [10, 20, 30, 40, 60, 80]:
    input_layer = Input(shape=(X_train.shape[1],))
    encoded = Dense(256, activation='relu')(input_layer)
    encoded = Dense(128, activation='relu')(encoded)
    encoded = Dense(size, activation='relu')(encoded)

    decoded = Dense(128, activation='relu')(encoded)
    decoded = Dense(256, activation='relu')(decoded)
    decoded = Dense(X_train.shape[1], activation='sigmoid')(decoded)
    autoencoder = Model(input_layer, decoded)
    autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
    autoencoder.fit(X_train, X_train,
                    epochs=20,
                    batch_size=20)
    loss_dict[size] = autoencoder.evaluate(X_test, X_test, verbose=2)
    del autoencoder, encoded, decoded

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
15/15 - 0s - loss: 0.0041
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
15/15 - 0s - loss: 0.0041
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
15/15 - 0s - loss: 0.0041
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20

Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
15/15 - 0s - loss: 0.0041


We get the lowest loss on the testing set when the bottleneck layer is widest, but the difference is very small. A narrower bottleneck is expected to give a higher loss, because forcing the data into fewer dimensions discards more information.

In [89]:
loss_dict

{10: 0.004094595089554787,
 20: 0.004099169746041298,
 30: 0.0040933662094175816,
 40: 0.00409733084961772,
 60: 0.004088591318577528,
 80: 0.004073957446962595}
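
To put "very small" into numbers, the short sketch below works directly on the values printed above. It finds the best and worst bottleneck sizes, the relative gap between them, and whether the loss falls steadily as the bottleneck widens. The variable names are illustrative.

```python
# Test losses copied from the output above.
loss_dict = {10: 0.004094595089554787,
             20: 0.004099169746041298,
             30: 0.0040933662094175816,
             40: 0.00409733084961772,
             60: 0.004088591318577528,
             80: 0.004073957446962595}

best_size = min(loss_dict, key=loss_dict.get)
worst_size = max(loss_dict, key=loss_dict.get)
relative_spread = (loss_dict[worst_size] - loss_dict[best_size]) / loss_dict[best_size]

# Does the loss decrease at every step as the bottleneck gets wider?
ordered_losses = [loss_dict[size] for size in sorted(loss_dict)]
strictly_decreasing = all(a > b for a, b in zip(ordered_losses, ordered_losses[1:]))

print(f"lowest loss at bottleneck size {best_size}")
print(f"highest loss at bottleneck size {worst_size}")
print(f"relative spread: {relative_spread:.2%}")
print(f"strictly decreasing with size: {strictly_decreasing}")
```

The spread between the best and worst sizes is roughly 0.6%, and the losses do not decrease at every step. Since each size was trained only once, differences this small may be within run-to-run variation.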

## Experimenting with different depths

### Depth = 1

In [97]:
input_layer = Input(shape=(X_train.shape[1],))
encoded = Dense(50, activation='relu')(input_layer)
decoded = Dense(X_train.shape[1], activation='sigmoid')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X_train, X_train,
                epochs=20,
                batch_size=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fa540276f40>

In [98]:
autoencoder.evaluate(X_test, X_test, verbose=2)

15/15 - 0s - loss: 0.0049


0.004860951565206051

### Depth = 15

In [93]:
input_layer = Input(shape=(X_train.shape[1],))
encoded = Dense(5000, activation='relu')(input_layer)
encoded = Dense(2048, activation='relu')(encoded)
encoded = Dense(1024, activation='relu')(encoded)
encoded = Dense(512, activation='relu')(encoded)
encoded = Dense(256, activation='relu')(encoded)
encoded = Dense(128, activation='relu')(encoded)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(50, activation='relu')(encoded)

decoded = Dense(64, activation='relu')(encoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(256, activation='relu')(decoded)
decoded = Dense(512, activation='relu')(decoded)
decoded = Dense(1024, activation='relu')(decoded)
decoded = Dense(2048, activation='relu')(decoded)
decoded = Dense(5000, activation='relu')(decoded)
decoded = Dense(X_train.shape[1], activation='sigmoid')(decoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X_train, X_train,
                epochs=20,
                batch_size=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7fa54092f550>

In [94]:
autoencoder.evaluate(X_test, X_test, verbose=2)

15/15 - 2s - loss: 0.0041


0.00411525834351778

The deeper model takes much longer to train but converges faster and at a lower loss.
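
The difference in training time is consistent with the difference in model size. As a rough sketch, the snippet below counts the trainable parameters of the two architectures (weights plus biases for each `Dense` layer). It assumes an input width of 19,000, based on the approximate figure in the introduction; the exact counts depend on `X_train.shape[1]`. `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(input_dim, widths):
    """Number of trainable parameters in a stack of Dense layers."""
    total = 0
    previous = input_dim
    for width in widths:
        total += previous * width + width  # weight matrix + bias vector
        previous = width
    return total


INPUT_DIM = 19000  # assumption: the text gives ~19,000 genes

# Depth = 1: a single hidden layer of 50 units, then the output layer.
shallow = [50, INPUT_DIM]

# Depth = 15: the encoder, its mirror image without the bottleneck, then the output layer.
deep_encoder = [5000, 2048, 1024, 512, 256, 128, 64, 50]
deep = deep_encoder + list(reversed(deep_encoder[:-1])) + [INPUT_DIM]

shallow_params = dense_param_count(INPUT_DIM, shallow)
deep_params = dense_param_count(INPUT_DIM, deep)

print(f"shallow model: {shallow_params:,} parameters")
print(f"deep model:    {deep_params:,} parameters")
print(f"ratio:         {deep_params / shallow_params:.1f}x")
```

Under this assumption the deep model has on the order of a hundred times more parameters than the shallow one, most of them in the two 5000-unit layers next to the input and output.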