# OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction

In this [Kaggle competition](https://www.kaggle.com/c/stanford-covid-vaccine/overview) we try to develop models and design rules for RNA degradation. As the overview of the competition states:

>mRNA vaccines have taken the lead as the fastest vaccine candidates for COVID-19, but currently, they face key potential limitations. One of the biggest challenges right now is how to design super stable messenger RNA molecules (mRNA). Conventional vaccines (like your seasonal flu shots) are packaged in disposable syringes and shipped under refrigeration around the world, but that is not currently possible for mRNA vaccines.
>
>Researchers have observed that RNA molecules have the tendency to spontaneously degrade. This is a serious limitation--a single cut can render the mRNA vaccine useless. Currently, little is known on the details of where in the backbone of a given RNA is most prone to being affected. Without this knowledge, current mRNA vaccines against COVID-19 must be prepared and shipped under intense refrigeration, and are unlikely to reach more than a tiny fraction of human beings on the planet unless they can be stabilized.

<img src="images/banner.png" width="1000" style="margin-left: auto; margin-right: auto;"> 

The model should predict likely degradation rates at each base of an RNA molecule. The training data set is comprised of over 3000 RNA molecules and their degradation rates at each position.

# Install necessary packages

We can install the necessary package by either running `pip install --user <package_name>` or include everything in a `requirements.txt` file and run `pip install --user -r requirements.txt`. Since we only need two extra packages here we can use the former command.

> NOTE: Do not forget to use the `--user` argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline

In [1]:
# !pip install --user pandas requests

# Imports

In this section we import the packages we need for this example. Make it a habbit to gather your imports in a single place. It will make your life easier if you are going to transform this notebook into a Kubeflow pipeline using Kale.

In [2]:
import json
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer

from kale.common.serveutils import serve

# Project hyper-parameters

In this cell, we define the different hyper-parameters. Defining them in one place makes it easier to experiment with their values and also facilitates the execution of HP Tuning experiments using Kale and Katib.

In [3]:
# Hyper-parameters
LR = 1e-3
EPOCHS = 10
BATCH_SIZE = 64
EMBED_DIM = 100
HIDDEN_DIM = 128
DROPOUT = .5
SP_DROPOUT = .3
TRAIN_SEQUENCE_LENGTH = 107

Set random seed for reproducibility and ignore warning messages.

In [4]:
tf.random.set_seed(42)
np.random.seed(42)

tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

# Load and preprocess data

In this section, we load and process the dataset to get it in a ready-to-use form by the model. First, let us load and analyze the data.

## Load data

The data are in `json` format, thus, we use the handy `read_json` pandas method. There is one train data set and two test sets (one public and one private).

In [5]:
train_df = pd.read_json("data/train.json", lines=True)
test_df = pd.read_json("data/test.json", lines=True)

We also load the `sample_submission.csv` file, which will prove handy when we will be creating our submission to the competition.

In [6]:
sample_submission_df = pd.read_csv("data/sample_submission.csv")

Let us now explore the data, their dimensions and what each column mean. To this end, we use the pandas `head` method to visualize a small sample (five rows by default) of our data set.

In [7]:
train_df.head()

Unnamed: 0,index,id,sequence,structure,predicted_loop_type,signal_to_noise,SN_filter,seq_length,seq_scored,reactivity_error,deg_error_Mg_pH10,deg_error_pH10,deg_error_Mg_50C,deg_error_50C,reactivity,deg_Mg_pH10,deg_pH10,deg_Mg_50C,deg_50C
0,0,id_001f94081,GGAAAAGCUCUAAUAACAGGAGACUAGGACUACGUAUUUCUAGGUA...,.....((((((.......)))).)).((.....((..((((((......,EEEEESSSSSSHHHHHHHSSSSBSSXSSIIIIISSIISSSSSSHHH...,6.894,1,107,68,"[0.1359, 0.20700000000000002, 0.1633, 0.1452, ...","[0.26130000000000003, 0.38420000000000004, 0.1...","[0.2631, 0.28600000000000003, 0.0964, 0.1574, ...","[0.1501, 0.275, 0.0947, 0.18660000000000002, 0...","[0.2167, 0.34750000000000003, 0.188, 0.2124, 0...","[0.3297, 1.5693000000000001, 1.1227, 0.8686, 0...","[0.7556, 2.983, 0.2526, 1.3789, 0.637600000000...","[2.3375, 3.5060000000000002, 0.3008, 1.0108, 0...","[0.35810000000000003, 2.9683, 0.2589, 1.4552, ...","[0.6382, 3.4773, 0.9988, 1.3228, 0.78770000000..."
1,1,id_0049f53ba,GGAAAAAGCGCGCGCGGUUAGCGCGCGCUUUUGCGCGCGCUGUACC...,.....(((((((((((((((((((((((....)))))))))).)))...,EEEEESSSSSSSSSSSSSSSSSSSSSSSHHHHSSSSSSSSSSBSSS...,0.193,0,107,68,"[2.8272, 2.8272, 2.8272, 4.7343, 2.5676, 2.567...","[73705.3985, 73705.3985, 73705.3985, 73705.398...","[10.1986, 9.2418, 5.0933, 5.0933, 5.0933, 5.09...","[16.6174, 13.868, 8.1968, 8.1968, 8.1968, 8.19...","[15.4857, 7.9596, 13.3957, 5.8777, 5.8777, 5.8...","[0.0, 0.0, 0.0, 2.2965, 0.0, 0.0, 0.0, 0.0, 0....","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[4.947, 4.4523, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[4.8511, 4.0426, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...","[7.6692, 0.0, 10.9561, 0.0, 0.0, 0.0, 0.0, 0.0..."
2,2,id_006f36f57,GGAAAGUGCUCAGAUAAGCUAAGCUCGAAUAGCAAUCGAAUAGAAU...,.....((((.((.....((((.(((.....)))..((((......)...,EEEEESSSSISSIIIIISSSSMSSSHHHHHSSSMMSSSSHHHHHHS...,8.8,1,107,68,"[0.0931, 0.13290000000000002, 0.11280000000000...","[0.1365, 0.2237, 0.1812, 0.1333, 0.1148, 0.160...","[0.17020000000000002, 0.178, 0.111, 0.091, 0.0...","[0.1033, 0.1464, 0.1126, 0.09620000000000001, ...","[0.14980000000000002, 0.1761, 0.1517, 0.116700...","[0.44820000000000004, 1.4822, 1.1819, 0.743400...","[0.2504, 1.4021, 0.9804, 0.49670000000000003, ...","[2.243, 2.9361, 1.0553, 0.721, 0.6396000000000...","[0.5163, 1.6823000000000001, 1.0426, 0.7902, 0...","[0.9501000000000001, 1.7974999999999999, 1.499..."
3,3,id_0082d463b,GGAAAAGCGCGCGCGCGCGCGCGAAAAAGCGCGCGCGCGCGCGCGC...,......((((((((((((((((......))))))))))))))))((...,EEEEEESSSSSSSSSSSSSSSSHHHHHHSSSSSSSSSSSSSSSSSS...,0.104,0,107,68,"[3.5229, 6.0748, 3.0374, 3.0374, 3.0374, 3.037...","[73705.3985, 73705.3985, 73705.3985, 73705.398...","[11.8007, 12.7566, 5.7733, 5.7733, 5.7733, 5.7...","[121286.7181, 121286.7182, 121286.7181, 121286...","[15.3995, 8.1124, 7.7824, 7.7824, 7.7824, 7.78...","[0.0, 2.2399, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0....","[0.0, -0.5083, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0...","[3.4248, 6.8128, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...","[0.0, -0.8365, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0...","[7.6692, -1.3223, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0..."
4,4,id_0087940f4,GGAAAAUAUAUAAUAUAUUAUAUAAAUAUAUUAUAGAAGUAUAAUA...,.....(((((((.((((((((((((.(((((((((....)))))))...,EEEEESSSSSSSBSSSSSSSSSSSSBSSSSSSSSSHHHHSSSSSSS...,0.423,0,107,68,"[1.665, 2.1728, 2.0041, 1.2405, 0.620200000000...","[4.2139, 3.9637000000000002, 3.2467, 2.4716, 1...","[3.0942, 3.015, 2.1212, 2.0552, 0.881500000000...","[2.6717, 2.4818, 1.9919, 2.5484999999999998, 1...","[1.3285, 3.6173, 1.3057, 1.3021, 1.1507, 1.150...","[0.8267, 2.6577, 2.8481, 0.40090000000000003, ...","[2.1058, 3.138, 2.5437000000000003, 1.0932, 0....","[4.7366, 4.6243, 1.2068, 1.1538, 0.0, 0.0, 0.7...","[2.2052, 1.7947000000000002, 0.7457, 3.1233, 0...","[0.0, 5.1198, -0.3551, -0.3518, 0.0, 0.0, 0.0,..."


We see a lot of strange entries, so, let us try to see what they are:

* `sequence`: An 107 characters long string in Train and Public Test (130 in Private Test), which describes the RNA sequence, a combination of A, G, U, and C for each sample.
* `structure`: An 107 characters long string in Train and Public Test (130 in Private Test), which is a compination of `(`, `)`, and `.` characters that describe whether a base is estimated to be paired or unpaired. Paired bases are denoted by opening and closing parentheses (e.g. (....) means that base 0 is paired to base 5, and bases 1-4 are unpaired).
* `predicted_loop_type`: An 107 characters long string, which describes the structural context (also referred to as 'loop type') of each character in sequence. Loop types assigned by bpRNA from Vienna RNAfold 2 structure. From the bpRNA_documentation: `S`: paired "Stem" `M`: Multiloop `I`: Internal loop `B`: Bulge `H`: Hairpin loop `E`: dangling End `X`: eXternal loop.

Then, we have `signal_to_noise`, which is quality control feature. It records the measurements relative to their errors; the higher value the more confident measurements are.

The `*_error_*` columns calculate the errors in experimental values obtained in corresponding `reactivity` and `deg_*` columns.

The last five columns (i.e., `recreativity` and `deg_*`) are out depended variables, our targets. Thus, for every base in the molecule we should predict five different values.

The `bpps` folder stands for Base Pairing Probabilities. In contains numpy arrays that correspond to the Base Pairing Probability Matrix (BPPM). The `bpps` are pre-calculated for each sequence. They are matrices of base pair probabilities. Biophysically speaking, this matrix gives the probability that each pair of nucleotides, in the RNA, forms a base pair (given a particular model of RNA folding). Thus, let us provide some helper functions to load these arrays.

In [8]:
def read_bpps_sum(df):
    bpps_arr = []
    for mol_id in df.id.to_list():
        bpps_arr.append(np.load(f"data/bpps/{mol_id}.npy").max(axis=1))
    return bpps_arr


def read_bpps_max(df):
    bpps_arr = []
    for mol_id in df.id.to_list():
        bpps_arr.append(np.load(f"data/bpps/{mol_id}.npy").sum(axis=1))
    return bpps_arr


def read_bpps_nb(df):
    # normalized non-zero number
    # from https://www.kaggle.com/symyksr/openvaccine-deepergcn 
    bpps_nb_mean = 0.077522 # mean of bpps_nb across all training data
    bpps_nb_std = 0.08914   # std of bpps_nb across all training data
    bpps_arr = []
    for mol_id in df.id.to_list():
        bpps = np.load(f"data/bpps/{mol_id}.npy")
        bpps_nb = (bpps > 0).sum(axis=0) / bpps.shape[0]
        bpps_nb = (bpps_nb - bpps_nb_mean) / bpps_nb_std
        bpps_arr.append(bpps_nb)
    return bpps_arr

These are the main columns we care about. For more details, visit the competition [info](https://www.kaggle.com/c/stanford-covid-vaccine/data).

## Preprocess data

We are now ready to preprocess the data set. First, we define the symbols that encode certain features (e.g. the base symbol or the structure), the features and the target variables.

In [9]:
symbols = "().ACGUBEHIMSX"
feat_cols = ["sequence", "structure", "predicted_loop_type", "bpps_sum", "bpps_max", "bpps_nb"]
target_cols = ["reactivity", "deg_Mg_pH10", "deg_Mg_50C", "deg_pH10", "deg_50C"]
error_cols = ["reactivity_error", "deg_error_Mg_pH10", "deg_error_Mg_50C", "deg_error_pH10", "deg_error_50C"]

In order to encode values like strings or characters and feed them to the neural network, we need to tokenize them. The `Tokenizer` class will assign a number to each character.

In [10]:
tokenizer = Tokenizer(char_level=True, filters="")
tokenizer.fit_on_texts(symbols)

Moreover, the tokenizer keeps a dictionary, `word_index`, from which we can get the number of elements in our vocabulary. In this case, we only have a few elements, but if we have passed a whole book, that function would be handy.

> NOTE: We should add `1` to the length of the `word_index` dictionary to get the correct number of elements.

In [11]:
# get the number of elements in the vocabulary
vocab_size = len(tokenizer.word_index) + 1

We are now ready to process our features. First, we transform each character sequence (i.e., `sequence`, `structure`, `predicted_loop_type`) into number sequences and concatenate them together. Then, we add the previously extracted `bpps_*` features on top. The resulting shape should be `(num_examples, 107, 6)`.

> Now, we should do this in a way that would permit us to use this processing function with KFServing. Thus, since Numpy arrays are not JSON serializable, this function should accept and return pure Python lists.

In [12]:
def process_features(example):
    sequence_sentences = example[0]
    structure_sentences = example[1]
    loop_sentences = example[2]
    bpps_sum = example[3]
    bpps_max = example[4]
    bpps_nb = example[5]
    
    # transform character sequences into number sequences
    sequence_tokens = np.array(
        tokenizer.texts_to_sequences(sequence_sentences)
    )
    structure_tokens = np.array(
        tokenizer.texts_to_sequences(structure_sentences)
    )
    loop_tokens = np.array(
        tokenizer.texts_to_sequences(loop_sentences)
    )
    
    # concatenate the tokenized sequences
    sequences = np.stack(
        (sequence_tokens, structure_tokens, loop_tokens),
        axis=1
    )
    sequences = np.transpose(sequences, (2, 0, 1))
    
    # add the `bpps` features on top
    bpps_sum = np.array(bpps_sum)[np.newaxis, :, np.newaxis]
    bpps_max = np.array(bpps_max)[np.newaxis, :, np.newaxis]
    bpps_nb = np.array(bpps_nb)[np.newaxis, :, np.newaxis]
    
    prepared = np.concatenate([sequences, bpps_sum, bpps_max, bpps_nb], 2)
    prepared = prepared.tolist()
    
    return prepared[0]

In the same way we process the labels. We should just extract them and transform them into the correct shape. The resulting shape should be `(num_examples, 68, 5)`.

In [13]:
def process_labels(df):
    df = df.copy()
    
    labels = np.array(df[target_cols].values.tolist())
    labels = np.transpose(labels, (0, 2, 1))
    
    return labels

Before running our dataset from the preprocess function, we need to add the extra info to the dataframe. Also, we need to separate the public and private test dataframes.

In [14]:
train_df['bpps_sum'] = read_bpps_sum(train_df)
train_df['bpps_max'] = read_bpps_max(train_df)
train_df['bpps_nb'] = read_bpps_nb(train_df)

test_df['bpps_sum'] = read_bpps_sum(test_df)
test_df['bpps_max'] = read_bpps_max(test_df)
test_df['bpps_nb'] = read_bpps_nb(test_df)

Turn numpy arrays to Python lists.

In [15]:
train_df['bpps_sum'] = train_df['bpps_sum'].map(lambda x: list(x))
train_df['bpps_max'] = train_df['bpps_max'].map(lambda x: list(x))
train_df['bpps_nb'] = train_df['bpps_nb'].map(lambda x: list(x))

test_df['bpps_sum'] = test_df['bpps_sum'].map(lambda x: list(x))
test_df['bpps_max'] = test_df['bpps_max'].map(lambda x: list(x))
test_df['bpps_nb'] = test_df['bpps_nb'].map(lambda x: list(x))

In [16]:
public_test_df = test_df.query("seq_length == 107")
private_test_df = test_df.query("seq_length == 130")

We are now ready to process the data set and make the features ready to be consumed by the model.

In [17]:
x_train = [process_features(row.tolist()) for _, row in train_df[feat_cols].iterrows()]
y_train = process_labels(train_df)

x_public_test = [process_features(row.tolist()) for _, row in public_test_df[feat_cols].iterrows()]
x_private_test = [process_features(row.tolist()) for _, row in private_test_df[feat_cols].iterrows()]

# Define and train the model

We are now ready to define our model. We have to do with sequences, thus, it makes sense to use RNNs. More specifically, we will use bidirectional Gated Recurrent Units (GRUs) and Long Short Term Memory cells (LSTM). The output layer shoud produce 5 numbers, so we can see this as a regression problem.

First let us define two helper functions for GRUs and LSTMs and then, define the body of the full model.

In [18]:
def gru_layer(hidden_dim, dropout):
    return tf.keras.layers.Bidirectional(
         tf.keras.layers.GRU(hidden_dim, dropout=dropout, return_sequences=True, kernel_initializer = 'orthogonal')
    )

In [19]:
def lstm_layer(hidden_dim, dropout):
    return tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(hidden_dim, dropout=dropout, return_sequences=True, kernel_initializer = 'orthogonal')
    )

The model has an embedding layer. The embedding layer projects the tokenized categorical input into a high-dimensional latent space. For this example we treat the dimensionality of the embedding space as a hyper-parameter that we can use to fine-tune the model.

In [20]:
def build_model(vocab_size, seq_length=TRAIN_SEQUENCE_LENGTH, pred_len=68,
                embed_dim=EMBED_DIM,
                hidden_dim=HIDDEN_DIM, dropout=DROPOUT, sp_dropout=SP_DROPOUT):
    inputs = tf.keras.layers.Input(shape=(seq_length, 6))
    
    # split categorical and numerical features and concatenate them later.
    cat_features = inputs[:, :, :3]
    num_features = inputs[:, :, 3:]

    # embed the categorical inputs
    embed = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)(cat_features)
    
    reshaped = tf.reshape(
        embed, shape=(-1, embed.shape[1],  embed.shape[2] * embed.shape[3])
    )
      
    # concatenate the numberical and embedded categorical features
    concatenated = tf.keras.layers.concatenate([reshaped, num_features], axis=2)
    
    hidden = tf.keras.layers.SpatialDropout1D(sp_dropout)(concatenated)
    
    hidden = gru_layer(hidden_dim, dropout)(hidden)
    hidden = lstm_layer(hidden_dim, dropout)(hidden)
    
    truncated = hidden[:, :pred_len]
    
    out = tf.keras.layers.Dense(5, activation="linear")(truncated)
    
    model = tf.keras.Model(inputs=inputs, outputs=out)
    
    return model

In [21]:
model = build_model(vocab_size)

In [22]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 107, 6)]     0                                            
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens [(None, 107, 3)]     0           input_1[0][0]                    
__________________________________________________________________________________________________
embedding (Embedding)           (None, 107, 3, 100)  1500        tf_op_layer_strided_slice[0][0]  
__________________________________________________________________________________________________
tf_op_layer_Reshape (TensorFlow [(None, 107, 300)]   0           embedding[0][0]                  
______________________________________________________________________________________________

Submissions are scored using MCRMSE (mean columnwise root mean squared error):

<img src="images/mcrmse.png" width="250" style="margin-left: auto; margin-right: auto;">

Thus, we should code this metric and use it as our objective (loss) function.

In [23]:
class MeanColumnwiseRMSE(tf.keras.losses.Loss):
    def __init__(self, name='MeanColumnwiseRMSE'):
        super().__init__(name=name)

    def call(self, y_true, y_pred):
        colwise_mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=1)
        return tf.reduce_mean(tf.sqrt(colwise_mse), axis=1)

We are now ready to compile and fit the model.

In [24]:
model.compile(tf.optimizers.Adam(learning_rate=LR), loss=MeanColumnwiseRMSE())

In [25]:
history = model.fit(np.array(x_train), np.array(y_train), 
                    validation_split=.1, batch_size=BATCH_SIZE, epochs=EPOCHS)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [26]:
model.save('saved_model_v0.0.0')

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
[INFO]:tensorflow:Assets written to: saved_model_v0.0.0/assets


## Evaluate the model

Finally, we are ready to evaluate the model using the two test sets.

In [27]:
model_public = build_model(vocab_size, seq_length=107, pred_len=107)
model_private = build_model(vocab_size, seq_length=130, pred_len=130)

model_public.set_weights(model.get_weights())
model_private.set_weights(model.get_weights())

In [28]:
public_preds = model_public.predict(np.array(x_public_test))
private_preds = model_private.predict(np.array(x_private_test))

## Serve the model

At this point we can use Kale to create an InferenceService and `serve` the model using KFServing and Kale's helper function. Then we can directly query the model using the `predict` method.

In [29]:
kfserving = serve(model)

2020-10-12 11:08:02 Kale serveutils:158       [INFO]     Starting serve procedure for model '<tensorflow.python.keras.engine.training.Model object at 0x7fdd9c0e6ac8>'
2020-10-12 11:08:02 Kale podutils:83          [INFO]     Getting the current container name...
2020-10-12 11:08:02 Kale podutils:89          [INFO]     Using NB_PREFIX env var '/notebook/kubeflow-user/kubecon-vaccine'. Container name: 'kubecon-vaccine'
2020-10-12 11:08:02 Kale serveutils:173       [INFO]     Model is contained in volume 'workspace-kubecon-vaccine-vphhrseg1'
2020-10-12 11:08:02 Kale serveutils:183       [INFO]     Dumping the model to '/home/jovyan/kubecon-vaccine-0-cw66u-kale-serve.model' ...
2020-10-12 11:08:02 Kale marshalling          [INFO]     Saving TF Keras model: kubecon-vaccine-0-cw66u-kale-serve.model
[INFO]:tensorflow:Assets written to: /home/jovyan/kubecon-vaccine-0-cw66u-kale-serve.model.tfkeras/1/assets
2020-10-12 11:08:31 Kale serveutils:190       [INFO]     Model saved successfully
2020-10

  infs_spec = yaml.load(RAW_TEMPLATE.format(name=name))
  yaml_predictor_spec = yaml.load(predictor_spec)


2020-10-12 11:08:45 Kale serveutils:337       [INFO]     Waiting for InferenceService 'kubecon-vaccine-0-cw66u' to become ready...
2020-10-12 11:08:48 Kale serveutils:337       [INFO]     Waiting for InferenceService 'kubecon-vaccine-0-cw66u' to become ready...
2020-10-12 11:08:51 Kale serveutils:337       [INFO]     Waiting for InferenceService 'kubecon-vaccine-0-cw66u' to become ready...
2020-10-12 11:08:54 Kale serveutils:337       [INFO]     Waiting for InferenceService 'kubecon-vaccine-0-cw66u' to become ready...
2020-10-12 11:08:57 Kale serveutils:337       [INFO]     Waiting for InferenceService 'kubecon-vaccine-0-cw66u' to become ready...
2020-10-12 11:09:00 Kale serveutils:337       [INFO]     Waiting for InferenceService 'kubecon-vaccine-0-cw66u' to become ready...
2020-10-12 11:09:03 Kale serveutils:337       [INFO]     Waiting for InferenceService 'kubecon-vaccine-0-cw66u' to become ready...
2020-10-12 11:09:06 Kale serveutils:348       [INFO]     InferenceService kubecon-v

In [30]:
data = json.dumps({"signature_def": "serving_default", "instances": x_public_test})
predictions = kfserving.predict(data)

2020-10-12 11:09:07 Kale serveutils:125       [INFO]     Sending a request to the InferenceService...
2020-10-12 11:09:07 Kale serveutils:126       [INFO]     Getting InferenceService predictor's host...
2020-10-12 11:09:16 Kale serveutils:134       [INFO]     Request submitted successfully!
2020-10-12 11:09:17 Kale serveutils:137       [INFO]     Response: {
    "predictions": [[[0.660725594, 0.642560899, 0.587522388, 2.13440228,  .....  [0.322829247, 0.573023736, 0.444859535, 0.581634283, 0.502082646]]
    ]
}


When we're done, we can delete the inference service using the handy `delete` method.

In [31]:
kfserving.delete()

2020-10-12 11:09:17 Kale serveutils:110       [INFO]     Deleting InferenceServer named 'kubecon-vaccine-0-cw66u'...
2020-10-12 11:09:17 Kale serveutils:116       [INFO]     Successfully deleted InferenceService.


# Submission

Last but note least, we create our submission to the Kaggle competition. The submission is just a `csv` file with the specified columns.

In [32]:
preds_ls = []

for df, preds in [(public_test_df, public_preds), (private_test_df, private_preds)]:
    for i, uid in enumerate(df.id):
        single_pred = preds[i]

        single_df = pd.DataFrame(single_pred, columns=target_cols)
        single_df['id_seqpos'] = [f'{uid}_{x}' for x in range(single_df.shape[0])]

        preds_ls.append(single_df)

preds_df = pd.concat(preds_ls)
preds_df.head()

Unnamed: 0,reactivity,deg_Mg_pH10,deg_Mg_50C,deg_pH10,deg_50C,id_seqpos
0,0.660726,0.642561,0.587523,2.134402,0.761572,id_00073f8be_0
1,1.959678,2.845435,2.877453,3.755889,2.529836,id_00073f8be_1
2,1.20302,0.551606,0.759268,0.712953,0.890268,id_00073f8be_2
3,1.295736,1.12896,1.565803,1.145989,1.378613,id_00073f8be_3
4,0.840669,0.649475,0.849459,0.709274,0.868614,id_00073f8be_4


In [33]:
submission = sample_submission_df[['id_seqpos']].merge(preds_df, on=['id_seqpos'])
submission.to_csv('submission.csv', index=False)