# OpenVaccine Covid-19 mRNA Vaccine Degradation Prediction

This notebook is a solution for the OpenVaccine competition on Kaggle. The competition asks for a model that predicts degradation values (under various conditions) at each base of an mRNA vaccine sequence. Four different models are explored:

1) Deep GRU Network
2) Simple seq2seq model with an encoder and decoder
3) Seq2Seq model with an attention mechanism
4) 1-D Convolutional Network

From my experiments, I concluded that the Deep GRU Network was the best of these four architectures. However, several solutions based on Graph Neural Networks (GNN), Graph Convolutional Networks (GCN), and Transformers have scored higher on the leaderboard. This notebook is therefore primarily intended for beginners like me who simply want to understand the principles of sequence-to-sequence models. Feel free to play around with the model architectures and hyperparameters :)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import tensorflow as tf
import json
from tensorflow import keras

from sklearn.model_selection import train_test_split

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

**Importing the competition data**

In [None]:
train = pd.read_json('../input/stanford-covid-vaccine/train.json', lines=True)

In [None]:
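# Keep only the samples that pass the signal-to-noise filter; copy to get an independent frame
train1 = train.loc[train['SN_filter'] == 1].copy()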

In [None]:
train

**Tokenizing the structure, sequence and loop type**

In [None]:
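def preprocess_inputs(df):
    # Map every sequence, structure and loop-type character to an integer token
    token2int = {x:i for i,x in enumerate('AUCG().EISHXMB')}
    cols = ['sequence','structure','predicted_loop_type']
    # Encode into a new frame so the caller's DataFrame is left unchanged
    # (pandas >= 2.1 deprecates applymap in favour of DataFrame.map)
    encoded = df[cols].applymap(lambda seq: [token2int[x] for x in seq])
    # (n_samples, 3, seq_length) -> (n_samples, seq_length, 3)
    return np.transpose(np.array(encoded.values.tolist()),(0,2,1))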
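
To make the tokenization concrete, here is a minimal sketch on a made-up five-base sample (the strings below are illustrative, not taken from the competition data). It uses the same vocabulary as `preprocess_inputs` and shows how three token lists become one `(seq_length, 3)` array.

```python
import numpy as np

# Same vocabulary as preprocess_inputs: bases, structure symbols, loop types
token2int = {x: i for i, x in enumerate('AUCG().EISHXMB')}

# Hypothetical toy sample, for illustration only
sample = {
    'sequence': 'GGAAC',
    'structure': '(..).',
    'predicted_loop_type': 'SHHSE',
}

# Encode each string, stack to (3, seq_length), then transpose to (seq_length, 3)
encoded = np.array([[token2int[ch] for ch in sample[col]]
                    for col in ('sequence', 'structure', 'predicted_loop_type')])
tokens = encoded.T

print(tokens.shape)  # (5, 3)
print(tokens[0])     # tokens for the first base: 'G', '(', 'S'
```

Each row of `tokens` holds the three integer codes for one base position, which is the layout the models expect as input.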

In [None]:
train_inputs = preprocess_inputs(train1)

In [None]:
train_inputs.shape

**Importing the Bpps unpaired probability np array**

In [None]:

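def read_bpps(df, seq_length):
    # One row per sample, one column per base position
    unpaired_probability = np.empty([len(df), seq_length])

    for i in range(len(df)):
        bpps = np.load('../input/stanford-covid-vaccine/bpps/' + df['id'].iloc[i] + '.npy')
        # Row sums of the base-pairing matrix give the probability that each base is paired,
        # so the unpaired probability is the complement of that sum
        unpaired_probability[i, :] = 1 - np.sum(bpps, axis=1)

    return unpaired_probability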
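
As a sanity check on what the row sum means, here is a small sketch with a made-up 3x3 base-pairing matrix (the values are assumptions for illustration). Entry `(i, j)` is the probability that bases `i` and `j` pair, so a row sum is the probability that base `i` is paired with anything, and one minus that sum is the probability that it is unpaired.

```python
import numpy as np

# Hypothetical symmetric base-pairing probability matrix for a 3-base sequence
bpps = np.array([
    [0.0, 0.1, 0.6],
    [0.1, 0.0, 0.0],
    [0.6, 0.0, 0.0],
])

paired = bpps.sum(axis=1)   # probability that each base is paired
unpaired = 1.0 - paired     # probability that each base is unpaired

print(paired)
print(unpaired)
```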

In [None]:
prob_train = read_bpps(train1,107)

In [None]:
prob_train.shape

**Retrieving the labels**

In [None]:
label_cols = ['reactivity','deg_pH10','deg_Mg_pH10','deg_Mg_50C','deg_50C']
train_labels = np.transpose(np.array(train1[label_cols].values.tolist()),(0,2,1))
train_labels.shape

**Building the model**

In [None]:
# Install livelossplot for live loss curves (GitHub no longer serves the git:// protocol, so use https)
%pip install git+https://github.com/stared/livelossplot.git

In [None]:
import tensorflow.keras.layers as L 

In [None]:
from livelossplot import PlotLossesKeras

In [None]:
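# Custom loss function based on the competition metric (mean column-wise RMSE)

def MCRMSE(y_true, y_pred):
    # y_true, y_pred: (batch, scored_length, n_targets)
    # Mean squared error over positions, kept separate for each target column
    colwise_mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=1)
    # RMSE per column, then averaged over the target columns
    return tf.reduce_mean(tf.sqrt(colwise_mse), axis=1)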
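
The loss averages the squared error over positions for each target column, takes the square root, and then averages over the columns. A NumPy version on made-up numbers (not competition data) makes the arithmetic easy to check by hand; `mcrmse_np` is a hypothetical helper written only for this illustration.

```python
import numpy as np

def mcrmse_np(y_true, y_pred):
    # y_true, y_pred: (batch, positions, targets)
    colwise_mse = np.mean((y_true - y_pred) ** 2, axis=1)  # (batch, targets)
    return np.mean(np.sqrt(colwise_mse), axis=1)           # (batch,)

# One sample, two positions, two targets
y_true = np.zeros((1, 2, 2))
y_pred = np.array([[[1.0, 3.0],
                    [1.0, 3.0]]])

# Column RMSEs are 1 and 3, so their mean is 2
score = mcrmse_np(y_true, y_pred)
print(score)
```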

**Model 1: Seq to Seq using GRUs, with return_sequences = True**

In [None]:
token2int = {x:i for i,x in enumerate('AUCG().EISHXMB')}

In [None]:
def build_gru_model(seq_length, scored_length):

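    # Three integer tokens per position: sequence, structure and loop type
    inputs = L.Input(shape = (seq_length,3))
    # Embedding output has shape (batch, seq_length, 3, 200)
    embed = L.Embedding(len(token2int),200,input_length = seq_length)(inputs)

    # Merge the three 200-d embeddings of each position into a single 600-d vector
    reshape = tf.reshape(embed,shape = (-1,embed.shape[1],embed.shape[2]*embed.shape[3]))

    # Per-position probability feature computed from the BPPS matrices
    input_prob = L.Input(shape = (seq_length,1))

    concat = L.Concatenate()([reshape, input_prob])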
                    
    gru1 = L.Bidirectional(L.GRU(128, dropout=0.2, return_sequences=True, kernel_initializer='orthogonal'))(concat)
    gru2 = L.Bidirectional(L.GRU(256, dropout=0.2, return_sequences=True, kernel_initializer='orthogonal'))(gru1)
    gru3 = L.Bidirectional(L.GRU(128, dropout=0.2, return_sequences=True, kernel_initializer='orthogonal'))(gru2)
    
    
    
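    # Dense layers are applied independently at every position
    dense = L.Dense(64, activation = 'relu')(gru3)
    dense = L.Dense(32, activation = 'relu')(dense)
    # Only the first scored_length positions have labels, so drop the rest
    trunc = dense[:,:scored_length]
    # One output per target column in label_cols
    out = L.Dense(5)(trunc)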


    model = tf.keras.Model(inputs = [inputs,input_prob], outputs = out)

    model.compile(tf.optimizers.Adam(), loss = MCRMSE)
    
    return model
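
The reshape step in the model can be hard to picture, so here is a NumPy sketch of the tensor shapes only. The random table stands in for the learned embedding weights and the batch size of 2 is arbitrary; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_length, vocab, embed_dim = 2, 107, 14, 200

# Stand-in for the learned embedding table (one 200-d vector per token)
embedding_table = rng.normal(size=(vocab, embed_dim))

# Integer tokens: (batch, seq_length, 3) for sequence, structure, loop type
tokens = rng.integers(0, vocab, size=(batch, seq_length, 3))
# Per-position probability feature: (batch, seq_length, 1)
prob = rng.random(size=(batch, seq_length, 1))

# Embedding lookup adds a trailing embedding axis: (2, 107, 3, 200)
embedded = embedding_table[tokens]
# Flatten the three embeddings of each position: (2, 107, 600)
flat = embedded.reshape(batch, seq_length, 3 * embed_dim)
# Append the probability feature: (2, 107, 601)
features = np.concatenate([flat, prob], axis=-1)

print(embedded.shape, flat.shape, features.shape)
```

The recurrent layers therefore see 601 features per position: 600 from the embeddings plus the probability feature.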


**Model 2: Simple Seq to Seq model with an encoder decoder**

In [None]:
def build_seq2seq(seq_length,scored_length):
    inputs = L.Input(shape = (seq_length,3))
    embed = L.Embedding(len(token2int),200,input_length = seq_length)(inputs)

    reshape = tf.reshape(embed,shape = (-1,embed.shape[1],embed.shape[2]*embed.shape[3]))
    
    input_prob = L.Input(shape = (seq_length,1))
    
    
    concat = L.Concatenate()([reshape, input_prob])


    encoder_last_h1, encoder_last_h2, encoder_last_c = L.LSTM(
        128, activation='elu', dropout=0.2, recurrent_dropout=0.2, 
        return_sequences=False, return_state=True)(concat)
    
    encoder_last_h1 = L.BatchNormalization(momentum=0.6)(encoder_last_h1)
    encoder_last_c = L.BatchNormalization(momentum=0.6)(encoder_last_c)

    decoder = L.RepeatVector(seq_length)(encoder_last_h1)
    decoder = L.LSTM(128, activation='elu', dropout=0.2, recurrent_dropout=0.2, return_state=False, return_sequences=True)(
        decoder, initial_state=[encoder_last_h1, encoder_last_c])
    
    out = L.TimeDistributed(L.Dense(5))(decoder)
    out = out[:,:scored_length]
    
    model = tf.keras.Model(inputs = [inputs,input_prob], outputs = out)

    model.compile(tf.optimizers.Adam(), loss = MCRMSE)
    
    return model

**Model 3: Seq2Seq with attention**

In [None]:
def build_seq2seq_attention(seq_length,scored_length):
    inputs = L.Input(shape = (seq_length,3))
    embed = L.Embedding(len(token2int),200,input_length = seq_length)(inputs)

    reshape = tf.reshape(embed,shape = (-1,embed.shape[1],embed.shape[2]*embed.shape[3]))
    
    input_prob = L.Input(shape = (seq_length,1))
    
    
    concat = L.Concatenate()([reshape, input_prob])
    
    
    encoder_stack_h, encoder_last_h, encoder_last_c = L.LSTM(
    512, activation='elu', dropout=0.2, recurrent_dropout=0.2, 
    return_state=True, return_sequences=True)(concat)
    
    
    encoder_last_h = L.BatchNormalization(momentum=0.6)(encoder_last_h)
    encoder_last_c = L.BatchNormalization(momentum=0.6)(encoder_last_c)

    
    decoder_input = L.RepeatVector(seq_length)(encoder_last_h)
    
    
    decoder_stack_h = L.LSTM(512, activation='elu', dropout=0.2, recurrent_dropout=0.2,
         return_state=False, return_sequences=True)(
         decoder_input, initial_state=[encoder_last_h, encoder_last_c])

    
    attention = L.dot([decoder_stack_h, encoder_stack_h], axes=[2, 2])
    attention = L.Activation('softmax')(attention)
    
    context = L.dot([attention, encoder_stack_h], axes=[2,1])
    context = L.BatchNormalization(momentum=0.6)(context)
    
    decoder_combined_context = L.concatenate([context, decoder_stack_h])
    
    out = L.TimeDistributed(L.Dense(5))(decoder_combined_context)
    
    out = out[:,:scored_length]
    
    model = tf.keras.Model(inputs = [inputs,input_prob], outputs = out)

    model.compile(tf.optimizers.Adam(), loss = MCRMSE)
    
    return model
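
The attention block in this model is plain dot-product attention: score every decoder step against every encoder step, apply a softmax over the encoder axis, and take the weighted sum of the encoder states. Below is a NumPy sketch for a single sample, with small made-up dimensions instead of the 107 steps and 512 units used in the model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
steps, units = 4, 8  # illustrative sizes

encoder_h = rng.normal(size=(steps, units))  # encoder hidden state at each step
decoder_h = rng.normal(size=(steps, units))  # decoder hidden state at each step

scores = decoder_h @ encoder_h.T     # (decoder steps, encoder steps)
weights = softmax(scores, axis=-1)   # each row sums to 1
context = weights @ encoder_h        # (decoder steps, units)

# Context and decoder state are concatenated before the output layer
combined = np.concatenate([context, decoder_h], axis=-1)
print(weights.shape, context.shape, combined.shape)
```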

**Model 4: 1D Convolutional Net**

In [None]:
def build_1Dconv(seq_length, scored_length,regularization = 1e-4):
    
    l1 = tf.keras.regularizers.l1(regularization)
    
    inputs = L.Input(shape = (seq_length,3))
    embed = L.Embedding(len(token2int),200,input_length = seq_length)(inputs)

    reshape = tf.reshape(embed,shape = (-1,embed.shape[1],embed.shape[2]*embed.shape[3]))
    
    input_prob = L.Input(shape = (seq_length,1))
    
    concat = L.Concatenate()([reshape, input_prob])
    
    conv1 = L.Conv1D(32,4,activation = 'relu',padding = 'same', activity_regularizer = l1)(concat)
    conv2 = L.Conv1D(64,8, activation = 'relu',padding = 'same',activity_regularizer = l1)(conv1)

    conv3 = L.Conv1D(128,16, activation = 'relu', padding = 'same',activity_regularizer = l1)(conv2)
    conv4 = L.Conv1D(256,32, activation = 'relu',padding = 'same',activity_regularizer = l1)(conv3)

    out = L.Dense(5)(conv4)
    
    out = out[:,:scored_length]
    
    model = tf.keras.Model(inputs = [inputs,input_prob], outputs = out)

    model.compile(tf.optimizers.Adam(), loss = MCRMSE)
    
    return model
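
One way to reason about the kernel sizes chosen here is the receptive field: how many input positions can influence one output position. For stacked stride-1, undilated convolutions it is `1 + sum(k - 1)` over the kernel sizes. The helper below is a hypothetical illustration, not part of the notebook's pipeline.

```python
def receptive_field(kernel_sizes):
    # Receptive field of stacked 1D convolutions with stride 1 and no dilation
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field

# Kernel sizes of the four Conv1D layers in build_1Dconv
print(receptive_field([4, 8, 16, 32]))  # 57
```

So each output position can draw on up to 57 of the 107 input positions, with zero padding filling in near the ends of the sequence.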
    
    

In [None]:
# Scratch cell: check the output shape of a smaller three-layer convolutional stack
seq_length = 107
inputs = L.Input(shape = (seq_length,3))
embed = L.Embedding(len(token2int),200,input_length = seq_length)(inputs)

reshape = tf.reshape(embed,shape = (-1,embed.shape[1],embed.shape[2]*embed.shape[3]))
    
input_prob = L.Input(shape = (seq_length,1))
    
concat = L.Concatenate()([reshape, input_prob])
    
conv1 = L.Conv1D(32,16,activation = 'relu',padding = 'same')(concat)
conv2 = L.Conv1D(64,8, activation = 'relu',padding = 'same')(conv1)

conv3 = L.Conv1D(128,4, activation = 'relu', padding = 'same')(conv2)

out = L.Dense(5)(conv3)

print(out)

**Preparing the training data**

In [None]:
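# Split all three arrays in a single call so that the rows of the token inputs,
# the probability feature and the labels stay aligned (random_state is an arbitrary fixed seed)
x_train, x_val, prob_train_actual, prob_train_val, y_train, y_val = train_test_split(
    train_inputs, prob_train, train_labels, test_size = 0.1, random_state = 42)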
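
Because the token inputs, the probability feature and the labels all describe the same samples, they have to be shuffled with the same permutation. `train_test_split` does this when all arrays are passed in a single call; the NumPy sketch below shows the underlying idea with a shared index (toy arrays and an arbitrary seed, both assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_samples = 10

# Toy arrays whose rows are tagged by sample index so alignment is easy to check
x = np.arange(n_samples).reshape(-1, 1)
prob = np.arange(n_samples).reshape(-1, 1) * 10
y = np.arange(n_samples).reshape(-1, 1) * 100

# One shared permutation for all arrays
perm = rng.permutation(n_samples)
n_val = int(0.1 * n_samples)
val_idx, train_idx = perm[:n_val], perm[n_val:]

x_train, prob_tr, y_train = x[train_idx], prob[train_idx], y[train_idx]
x_val, prob_va, y_val = x[val_idx], prob[val_idx], y[val_idx]

# Rows still refer to the same sample in every array
print(np.array_equal(prob_tr, x_train * 10), np.array_equal(y_train, x_train * 100))
```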

**Training Model 1**

In [None]:
model = build_gru_model(107,68)

In [None]:
model.summary()

In [None]:
history = model.fit(
    [x_train,prob_train_actual], y_train,
    validation_data=([x_val,prob_train_val], y_val),
    batch_size=64,
    epochs=25,
    verbose=1,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau(patience=5),
        tf.keras.callbacks.ModelCheckpoint('model.h5', save_best_only = True),
        PlotLossesKeras()
    ]
)

**Training Model 2**

In [None]:
model = build_seq2seq(107,68)
model.summary()

In [None]:
history = model.fit(
    [x_train,prob_train_actual], y_train,
    validation_data=([x_val,prob_train_val], y_val),
    batch_size=64,
    epochs=25,
    verbose=1,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau(patience=5),
        tf.keras.callbacks.ModelCheckpoint('model.h5', save_best_only = True),
        PlotLossesKeras()
    ]
)

**Training Model 3**

In [None]:
model = build_seq2seq_attention(107,68)
model.summary()

In [None]:
history = model.fit(
    [x_train,prob_train_actual], y_train,
    validation_data=([x_val,prob_train_val], y_val),
    batch_size=64,
    epochs=25,
    verbose=1,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau(patience=5),
        tf.keras.callbacks.ModelCheckpoint('model.h5', save_best_only = True),
        PlotLossesKeras()
    ]
)

**Training Model 4**

In [None]:
model = build_1Dconv(107,68,1e-4)
model.summary()

In [None]:
history = model.fit(
    [x_train,prob_train_actual], y_train,
    validation_data=([x_val,prob_train_val], y_val),
    batch_size=256,
    epochs=100,
    verbose=1,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau(patience=5),
        tf.keras.callbacks.ModelCheckpoint('model.h5', save_best_only = True),
        PlotLossesKeras()
    ]
)

**Making the predictions for the test set**

In [None]:
test = pd.read_json('../input/stanford-covid-vaccine/test.json', lines=True)

In [None]:
public_test = test.loc[test['seq_length'] == 107]
private_test = test.loc[test['seq_length'] == 130]

In [None]:
prob_public_test = read_bpps(public_test, 107)
prob_private_test = read_bpps(private_test,130)

In [None]:
private_test_input = preprocess_inputs(private_test)
public_test_input = preprocess_inputs(public_test)

In [None]:
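# 'model.h5' holds the best checkpoint of the most recently trained model. With the cells above
# run in order that is the 1D convolutional net, so the same architecture is rebuilt here.
# No layer weight depends on the sequence length, so the weights load for both 107 and 130.
public_model = build_1Dconv(107,107)
private_model = build_1Dconv(130,130)

In [None]:
public_model.load_weights('model.h5')
private_model.load_weights('model.h5')

In [None]:
public_preds = public_model.predict([public_test_input,prob_public_test])
private_preds = private_model.predict([private_test_input,prob_private_test])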

**Preparing the submission file**

In [None]:
preds_ls = []

for df, preds in [(public_test, public_preds), (private_test, private_preds)]:
    for i, uid in enumerate(df.id):
        single_pred = preds[i]

        single_df = pd.DataFrame(single_pred, columns=label_cols)
        single_df['id_seqpos'] = [f'{uid}_{x}' for x in range(single_df.shape[0])]

        preds_ls.append(single_df)

preds_df = pd.concat(preds_ls)
preds_df.head()

In [None]:
sample_df = pd.read_csv('../input/stanford-covid-vaccine/sample_submission.csv')

In [None]:
submission = sample_df[['id_seqpos']].merge(preds_df, on=['id_seqpos'])
submission.to_csv('submission12.csv', index=False)