# CITEseq Keras Quickstart

This notebook shows how to tune and cross-validate a Keras model for the CITEseq part of the *Multimodal Single-Cell Integration* competition.

It does not show the EDA - see the separate notebook [MSCI EDA which makes sense ⭐️⭐️⭐️⭐️⭐️](https://www.kaggle.com/ambrosm/msci-eda-which-makes-sense).

The CITEseq predictions of the Keras model are then concatenated with the Multiome predictions of @jsmithperera's [Multiome Quickstart w/ Sparse M + tSVD = 32](https://www.kaggle.com/code/jsmithperera/multiome-quickstart-w-sparse-m-tsvd-32) to a complete submission file.

## Summary

The CITEseq part of the competition has sizeable datasets, when compared to the standard 16 GByte RAM of Kaggle notebooks:
- The training input has shape 70988 × 22050 (10.6 GByte).
- The training labels have shape 70988 × 140.
- The test input has shape 48663 × 22050 (4.3 GByte).

Our solution strategy has five elements:
1. **Dimensionality reduction:** To reduce the size of the 10.6 GByte input data, we project the 22050 features to a space with only 64 dimensions by applying a truncated SVD. To these 64 dimensions, we add 144 features whose names show their importance.
2. **The model:** The model is a sequential dense network with four hidden layers.
3. **The loss function:** The competition is scored by the average Pearson correlation coefficient between the predictions and the ground truth. As this scoring function is differentiable, we can use it directly as the loss function for a neural network. This gives neural networks an advantage over algorithms which use mean squared error as a surrogate loss.
4. **Hyperparameter tuning with KerasTuner:** We tune the hyperparameters with [KerasTuner](https://keras.io/keras_tuner/).
5. **Cross-validation:** Submitting unvalidated models and relying only on the public leaderboard is bad practice. The model in this notebook is fully cross-validated with a 3-fold GroupKFold.


In [1]:
# !pip install keras-tuner

In [2]:
%cd C:/Users/Owner/Documents/dev/open-problem/multiome-mlp

C:\Users\Owner\Documents\dev\open-problem\multiome-mlp


In [3]:
import os, gc, pickle, datetime, scipy.sparse
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from colorama import Fore, Back, Style

from sklearn.model_selection import GroupKFold, train_test_split
from sklearn.preprocessing import StandardScaler, scale, MinMaxScaler
from sklearn.decomposition import TruncatedSVD

import tensorflow as tf
import tensorflow.keras.backend as K
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.callbacks import ReduceLROnPlateau, LearningRateScheduler, EarlyStopping
from tensorflow.keras.layers import Dense, Input, Concatenate, Dropout
from tensorflow.keras.utils import plot_model
import keras_tuner

DATA_DIR = "C:/Users/Owner/Documents/dev/open-problem/open-problems-multimodal/"
FP_CELL_METADATA = os.path.join(DATA_DIR,"metadata.csv")

FP_CITE_TRAIN_INPUTS = os.path.join(DATA_DIR,"train_cite_inputs.h5")
FP_CITE_TRAIN_TARGETS = os.path.join(DATA_DIR,"train_cite_targets.h5")
FP_CITE_TEST_INPUTS = os.path.join(DATA_DIR,"test_cite_inputs.h5")

FP_MULTIOME_TRAIN_INPUTS = os.path.join(DATA_DIR,"train_multi_inputs.h5")
FP_MULTIOME_TRAIN_TARGETS = os.path.join(DATA_DIR,"train_multi_targets.h5")
FP_MULTIOME_TEST_INPUTS = os.path.join(DATA_DIR,"test_multi_inputs.h5")

FP_SUBMISSION = os.path.join(DATA_DIR,"sample_submission.csv")
FP_EVALUATION_IDS = os.path.join(DATA_DIR,"evaluation_ids.csv")

TUNE = False
SUBMIT = True

USE_SAVED_PCA = True

submission_name = "submission_multi_mlp_svd512-256_wdo.csv"

A little trick to save time with pip: If the module is already installed (after a restart of the notebook, for instance), pip wastes 10 seconds by checking whether a newer version exists. We can skip this check by testing for the presence of the module in a simple if statement.

In [4]:
%%time
# If you see a warning "Failed to establish a new connection" running this cell,
# go to "Settings" on the right hand side, 
# and turn on internet. Note, you need to be phone verified.
# We need this library to read HDF files.
# if not os.path.exists('/opt/conda/lib/python3.7/site-packages/tables'):
#     !pip install --quiet tables
    

CPU times: total: 0 ns
Wall time: 0 ns
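
The presence test described above can also be written with the standard library instead of a hard-coded `site-packages` path. This is a minimal sketch, not the notebook's own cell: `module_available` is a hypothetical helper, and the module name `tables` is taken from the commented-out install line.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be imported in the current environment."""
    # find_spec only looks the module up; it does not import it.
    return importlib.util.find_spec(name) is not None


# Only the missing case would need the slow `pip install` call.
if module_available("tables"):
    print("tables is already installed, skipping pip")
else:
    print("tables is missing, run: pip install --quiet tables")
```

Unlike a path check, this does not depend on the Python version or on where the environment keeps its packages.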


# The scoring function

This competition has a special metric: For every row, it computes the Pearson correlation between y_true and y_pred, and then all these correlation coefficients are averaged. We implement two variants of the metric: The first one is for numpy arrays, the second one for tensors - thanks to @lucasmorin for the [original tensor implementation](https://www.kaggle.com/competitions/open-problems-multimodal/discussion/347595).

In [5]:
def correlation_score(y_true, y_pred):
    """Scores the predictions according to the competition rules. 
    
    It is assumed that the predictions are not constant.
    
    Returns the average of each sample's Pearson correlation coefficient"""
    if isinstance(y_true, pd.DataFrame): y_true = y_true.values
    if isinstance(y_pred, pd.DataFrame): y_pred = y_pred.values
    corrsum = 0
    for i in range(len(y_true)):
        corrsum += np.corrcoef(y_true[i], y_pred[i])[1, 0]
    return corrsum / len(y_true)

def negative_correlation_loss(y_true, y_pred):
    """Negative correlation loss function for Keras
    
    Precondition:
    y_true.mean(axis=1) == 0
    y_true.std(axis=1) == 1
    
    Returns:
    -1 = perfect positive correlation
    1 = totally negative correlation
    """
    my = K.mean(tf.convert_to_tensor(y_pred), axis=1)
    my = tf.tile(tf.expand_dims(my, axis=1), (1, y_true.shape[1]))
    ym = y_pred - my
    r_num = K.sum(tf.multiply(y_true, ym), axis=1)
    r_den = tf.sqrt(K.sum(K.square(ym), axis=1) * float(y_true.shape[-1]))
    r = tf.reduce_mean(r_num / r_den)
    return - r
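
To make the relation between the two functions above concrete, here is a NumPy-only sketch. It assumes small random toy arrays, and `rowwise_pearson` is a hypothetical helper that is not part of the notebook. It checks that a vectorized row-wise Pearson correlation matches the `np.corrcoef` loop, and that the formula inside `negative_correlation_loss` gives the same value once `y_true` is normalized row-wise to mean 0 and standard deviation 1.

```python
import numpy as np


def rowwise_pearson(y_true, y_pred):
    """Pearson correlation of each row of y_true with the same row of y_pred."""
    yt = y_true - y_true.mean(axis=1, keepdims=True)
    yp = y_pred - y_pred.mean(axis=1, keepdims=True)
    num = (yt * yp).sum(axis=1)
    den = np.sqrt((yt ** 2).sum(axis=1) * (yp ** 2).sum(axis=1))
    return num / den


rng = np.random.default_rng(0)
y_true = rng.normal(size=(5, 8))
y_pred = y_true + rng.normal(scale=0.5, size=(5, 8))

# Reference: the loop over np.corrcoef used by correlation_score
loop_score = np.mean([np.corrcoef(t, p)[1, 0] for t, p in zip(y_true, y_pred)])
vector_score = rowwise_pearson(y_true, y_pred).mean()

# Loss formula with row-normalized targets (mean 0, population std 1)
y_norm = (y_true - y_true.mean(axis=1, keepdims=True)) / y_true.std(axis=1, keepdims=True)
ym = y_pred - y_pred.mean(axis=1, keepdims=True)
r_num = (y_norm * ym).sum(axis=1)
r_den = np.sqrt((ym ** 2).sum(axis=1) * y_true.shape[1])
loss = -np.mean(r_num / r_den)

print(loop_score, vector_score, -loss)
```

The three printed numbers should agree up to floating-point error. If the precondition on `y_true` is violated, the loss formula no longer equals the negative correlation.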


# Data loading and preprocessing

The metadata is used only for the `GroupKFold`: 

In [6]:
metadata_df = pd.read_csv(FP_CELL_METADATA, index_col='cell_id')
metadata_df = metadata_df[metadata_df.technology=="multiome"]
metadata_df.shape



(161877, 4)

In [7]:
metadata_df.head()
meta = metadata_df[:105942]
meta_test = metadata_df[105942:]

We now define two sets of features:
- `constant_cols` is the set of all features which are constant in the train or test dataset. These columns will be discarded immediately after loading.
- `important_cols` is the set of all features whose name matches the name of a target protein. If a gene is named 'ENSG00000114013_CD86', it should be related to a protein named 'CD86'. These features will be used for the model unchanged, that is, they don't undergo dimensionality reduction. 

We read train and test datasets, keep the important columns and convert the rest to sparse matrices.

In [8]:
%%time

# Read train and convert to sparse matrix
X = scipy.sparse.load_npz("C:/Users/Owner/Documents/dev/open-problem/multimodal-single-cell-as-sparse-matrix/train_multi_inputs_values.sparse.npz").astype('float16', copy=False)


# Read test and convert to sparse matrix
Xt = scipy.sparse.load_npz("C:/Users/Owner/Documents/dev/open-problem/multimodal-single-cell-as-sparse-matrix/test_multi_inputs_values.sparse.npz").astype('float16', copy=False)

CPU times: total: 31.9 s
Wall time: 31.9 s


We apply the truncated SVD to train and test together. The truncated SVD is memory-efficient because it works directly on the sparse input. In the code below, the SVD keeps 512 components and the step which appends the important features is commented out, so the arrays `X` and `Xt`, which will be the input to the Keras model, contain only the SVD output.

In [9]:
X.shape

(105942, 228942)

In [10]:
Xt.shape

(55935, 228942)

In [11]:
%%time

# Apply the singular value decomposition
both = scipy.sparse.vstack([X, Xt])
print(f"Shape of both before SVD: {both.shape}")


Shape of both before SVD: (161877, 228942)
CPU times: total: 688 ms
Wall time: 681 ms


In [12]:
def save(name, model):
    with open(name, 'wb') as f:
        pickle.dump(model, f)

In [13]:
if USE_SAVED_PCA:
    # Reuse a previously fitted SVD; the context manager closes the file handle
    with open('pca.pkl', 'rb') as f:
        svd = pickle.load(f)
    both = svd.transform(both)
else:
    svd = TruncatedSVD(n_components=512, random_state=1) # 512 is possible
    both = svd.fit_transform(both)
    print(f"Shape of both after SVD:  {both.shape}")
    save('pca.pkl', svd)
    
    
# Hstack the svd output with the important features
X = both[:105942]
Xt = both[105942:]
del both
# X = np.hstack([X, X0])
# Xt = np.hstack([Xt, X0t])
print(f"Reduced X shape:  {str(X.shape):14} {X.size*4/1024/1024/1024:2.3f} GByte")
print(f"Reduced Xt shape: {str(Xt.shape):14} {Xt.size*4/1024/1024/1024:2.3f} GByte")
gc.collect()

Reduced X shape:  (105942, 512)  0.202 GByte
Reduced Xt shape: (55935, 512)   0.107 GByte


148
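
The GByte figures printed above follow from simple arithmetic. The sketch below reproduces them. It assumes 4 bytes per element, as the print statements do, and `dense_gbyte` is a hypothetical helper that is not part of the notebook.

```python
def dense_gbyte(shape, bytes_per_element=4):
    """Size of a dense array in GByte (1024**3 bytes), as in the print above."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 1024 ** 3


print(f"{dense_gbyte((105942, 512)):.3f}")      # reduced train matrix
print(f"{dense_gbyte((55935, 512)):.3f}")       # reduced test matrix
print(f"{dense_gbyte((105942, 228942)):.1f}")   # train matrix if it were stored densely
```

The last line shows why the inputs are kept sparse until after the SVD: a dense copy of the training matrix would not fit into memory.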

Finally, we read the training targets and reduce them to 256 dimensions with a second truncated SVD:

In [14]:
# # Read Y
# Y = pd.read_hdf(FP_MULTIOME_TEST_INPUTS)
# y_columns = list(Y.columns)
# Y = Y.values

# # Normalize the targets row-wise: This doesn't change the correlations,
# # and negative_correlation_loss depends on it
# Y -= Y.mean(axis=1).reshape(-1, 1)
# Y /= Y.std(axis=1).reshape(-1, 1)
    
# print(f"Y shape: {str(Y.shape):14} {Y.size*4/1024/1024/1024:2.3f} GByte")
train_targets = scipy.sparse.load_npz("C:/Users/Owner/Documents/dev/open-problem/multimodal-single-cell-as-sparse-matrix/train_multi_targets_values.sparse.npz")

if USE_SAVED_PCA:
    # Reuse a previously fitted target SVD; the context manager closes the file handle
    with open('pca2.pkl', 'rb') as f:
        pca2 = pickle.load(f)
    train_target = pca2.transform(train_targets)
else:
    pca2 = TruncatedSVD(n_components=256, random_state=42)
    train_target = pca2.fit_transform(train_targets)
    print(pca2.explained_variance_ratio_.sum())
    save('pca2.pkl', pca2)
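
Because the model is trained on the reduced targets, its predictions have to be mapped back to the original columns, which the notebook later does with `@ pca2.components_`. The following NumPy-only sketch illustrates that round trip on toy data. It uses `np.linalg.svd` as a stand-in for `TruncatedSVD`, which is an assumption made to keep the example small; all array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy targets with exact rank 3: 20 cells, 10 genes
Y = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 10))

k = 3
# Rows of Vt are the right singular vectors; they play the role of components_
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
components = Vt[:k]                      # shape (k, n_genes)

Y_reduced = Y @ components.T             # projection onto k dimensions, shape (20, k)
Y_restored = Y_reduced @ components      # the `pred @ pca2.components_` step

print(Y_reduced.shape, Y_restored.shape)
print(np.allclose(Y, Y_restored))        # exact here only because rank(Y) == k
```

With real data the reconstruction is lossy, which is why the notebook prints the sum of `explained_variance_ratio_` when it fits the target SVD.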

# The model

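Our model is a network of four dense hidden layers whose outputs are concatenated before the output layer. The hyperparameters will be tuned with KerasTuner.

The code below compiles the model with mean squared error as the loss and root mean squared error as the metric on the SVD-reduced targets; the `negative_correlation_loss` defined above is not used in this cell.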

In [15]:
LR_START = 0.01
BATCH_SIZE = 256

def my_model(hp, n_inputs=X.shape[1]):
    """Sequential neural network
    
    Returns a compiled instance of tensorflow.keras.models.Model.
    """
    activation = 'swish'
    reg1 = hp.Float("reg1", min_value=1e-8, max_value=1e-4, sampling="log")
    reg2 = hp.Float("reg2", min_value=1e-10, max_value=1e-5, sampling="log")
    
    inputs = Input(shape=(n_inputs, ))
    x0 = Dense(hp.Int('units1', min_value=128, max_value=384, step=64), kernel_regularizer=tf.keras.regularizers.l2(reg1),
              activation=activation,
             )(inputs)
    do1 = Dropout(hp.Choice('do1', [0.1]))(x0)
    x1 = Dense(hp.Int('units2', min_value=128, max_value=384, step=64), kernel_regularizer=tf.keras.regularizers.l2(reg1),
              activation=activation,
             )(do1)
    do2 = Dropout(hp.Choice('do2', [0.1]))(x1)
    x2 = Dense(hp.Int('units3', min_value=128, max_value=384, step=64), kernel_regularizer=tf.keras.regularizers.l2(reg1),
              activation=activation,
             )(do2)
    do3 = Dropout(hp.Choice('do3', [0.1]))(x2)
    x3 = Dense(hp.Int('units4', min_value=64, max_value=256, step=64), kernel_regularizer=tf.keras.regularizers.l2(reg1),
              activation=activation,
             )(do3)
    do4 = Dropout(hp.Choice('do4', [0.1]))(x3)
    x = Concatenate()([x0, x1, x2, x3])
    do5 = Dropout(hp.Choice('do5', [0.2]))(x)
    x = Dense(train_target.shape[1], kernel_regularizer=tf.keras.regularizers.l2(reg2),
              #activation=activation,
             )(do5)
    regressor = Model(inputs, x)
    regressor.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=hp.Float('lr', min_value=0.001, max_value=0.01, step=0.001)),
                      metrics=[tf.keras.metrics.RootMeanSquaredError()],
                      loss=tf.keras.losses.MeanSquaredError(),
                     )
   
    return regressor

display(plot_model(my_model(keras_tuner.HyperParameters()), show_layer_names=False, show_shapes=True, dpi=72))

You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) for plot_model to work.


None

# Tuning with KerasTuner

Now we let [KerasTuner](https://keras.io/keras_tuner/) optimize the hyperparameters. The tunable hyperparameters are:
- the sizes of the hidden layers
- the regularization factors
- the learning rate

If you want to save time, you can either set `max_trials` to a lower value or skip tuning completely and set `best_hp.values` manually. If you don't want to see all the output of the tuner, you can set `verbose` to 0 in the call to `tuner.search()`.

In [16]:
%%time
if TUNE:
    tuner = keras_tuner.BayesianOptimization(
        my_model,
        overwrite=True,
        objective=keras_tuner.Objective("val_root_mean_squared_error", direction="min"),
        max_trials=100,
        directory='temp',
        seed=1)
    lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, 
                           patience=4, verbose=0)
    es = EarlyStopping(monitor="val_loss",
                       patience=8, 
                       verbose=0,
                       mode="min", 
                       restore_best_weights=True)
    callbacks = [lr, es, tf.keras.callbacks.TerminateOnNaN()]
    X_tr, X_va, y_tr, y_va = train_test_split(X, train_target, test_size=0.2, random_state=10)
    tuner.search(X_tr, y_tr,
                 epochs=100,
                 validation_data=(X_va, y_va),
                 batch_size=BATCH_SIZE,
                 callbacks=callbacks, verbose=2)
    del X_tr, X_va, y_tr, y_va, lr, es, callbacks


CPU times: total: 0 ns
Wall time: 0 ns


In [17]:
if TUNE:
    tuner.results_summary()
    
    # Table of the 10 best trials
    display(pd.DataFrame([hp.values for hp in tuner.get_best_hyperparameters(10)]))
    
    # Keep the best hyperparameters
    best_hp = tuner.get_best_hyperparameters(1)[0]

In [18]:
# Hyperparameters can be set manually
if not TUNE:
    best_hp = keras_tuner.HyperParameters()
    best_hp.values = {'reg1': 0.0001,
                      'reg2': 1.000000e-5,
                      'units1': 128,
                      'units2': 384,
                      'units3': 128,
                      'units4': 256,
                      'do1': 0.1,
                      'do2': 0.1,
                      'do3': 0.1,
                      'do4': 0.1,
                      'do5': 0.1,
                      'lr': 0.001,
                     }
    
#     Output exceeds the size limit. Open the full output data in a text editor
# Trial 47 Complete [00h 01m 45s]
# val_root_mean_squared_error: 3.435828447341919

# Best val_root_mean_squared_error So Far: 3.4250566959381104
# Total elapsed time: 01h 21m 54s

# Search: Running Trial #48

# Value             |Best Value So Far |Hyperparameter
# 0.0001            |0.0001            |reg1
# 1e-05             |1e-05             |reg2
# 128               |128               |units1
# 0.1               |0.1               |do1
# 384               |384               |units2
# 0.1               |0.1               |do2
# 128               |128               |units3
# 0.1               |0.1               |do3
# 256               |256               |units4
# 0.1               |0.1               |do4
# 0.1               |0.1               |do5
# 0.001             |0.001             |lr


# Cross-validation

For cross-validation of the tuned model, we create three folds. In every fold, we train on the data of two donors and predict the third one. This scheme mimics the situation of the public leaderboard, where we train on three donors and predict the fourth one (see [EDA](https://www.kaggle.com/ambrosm/msci-eda-which-makes-sense)). 

The models are saved so that we can use them to compute the test predictions later.
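
The donor-wise split described above can be sketched in plain NumPy. This is a hedged illustration rather than the `GroupKFold` implementation: `leave_one_group_out` is a hypothetical helper and the donor labels are made up. When the number of folds equals the number of donors, `GroupKFold` likewise puts each donor into exactly one validation fold, although the fold order may differ.

```python
import numpy as np


def leave_one_group_out(groups):
    """Yield (train_idx, val_idx) pairs, one pair per distinct group."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        val_mask = groups == g
        yield np.flatnonzero(~val_mask), np.flatnonzero(val_mask)


# Hypothetical donor labels for 9 cells (the real ones come from meta.donor)
donors = np.array(["A", "A", "B", "C", "B", "A", "C", "C", "B"])

for fold, (idx_tr, idx_va) in enumerate(leave_one_group_out(donors)):
    train_donors = set(donors[idx_tr].tolist())
    val_donors = set(donors[idx_va].tolist())
    # No donor appears on both sides of the split
    assert train_donors.isdisjoint(val_donors)
    print(fold, sorted(train_donors), sorted(val_donors))
```

Keeping all cells of a donor on one side of the split is what makes the validation score comparable to predicting an unseen donor.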

In [19]:
# Cross-validation
VERBOSE = 2 # set to 2 for more output, set to 0 for less output
EPOCHS = 1000
N_SPLITS = 3

np.random.seed(1)
tf.random.set_seed(1)

kf = GroupKFold(n_splits=N_SPLITS)
score_list = []
for fold, (idx_tr, idx_va) in enumerate(kf.split(X, groups=meta.donor)):
    start_time = datetime.datetime.now()
    model = None
    gc.collect()
    X_tr = X[idx_tr]
    y_tr = train_target[idx_tr]
    X_va = X[idx_va]
    y_va = train_target[idx_va]
    y_va_raw = train_targets[idx_va]

    lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, 
                           patience=4, verbose=VERBOSE)
    es = EarlyStopping(monitor="val_loss",
                       patience=12, 
                       verbose=0,
                       mode="min", 
                       restore_best_weights=True)
    ckpt = tf.keras.callbacks.ModelCheckpoint(f"model/model_{fold}_ckpt", save_best_only=True)
    callbacks = [lr, es, tf.keras.callbacks.TerminateOnNaN(), ckpt]

    # Construct and compile the model
    model = my_model(best_hp, X_tr.shape[1])

    # Train the model
    history = model.fit(X_tr, y_tr, 
                        validation_data=(X_va, y_va), 
                        epochs=EPOCHS,
                        verbose=VERBOSE,
                        batch_size=BATCH_SIZE,
                        shuffle=True,
                        callbacks=callbacks)
    del X_tr, y_tr
    if SUBMIT:
        model.save(f"model/model_{fold}")
    history = history.history
    callbacks, lr = None, None
    
    # We validate the model
    y_va_pred = model.predict(X_va, batch_size=len(X_va))
    corrscore = correlation_score(y_va_raw.todense(), y_va_pred@pca2.components_)

    print(f"Fold {fold}: {es.stopped_epoch:3} epochs, corr =  {corrscore:.5f}")
    del es, X_va#, y_va, y_va_pred
    score_list.append(corrscore)

# Show overall score
print(f"{Fore.GREEN}{Style.BRIGHT}Average  corr = {np.array(score_list).mean():.5f}{Style.RESET_ALL}")


Epoch 1/1000
INFO:tensorflow:Assets written to: model\model_0_ckpt\assets
254/254 - 4s - loss: 39.6057 - root_mean_squared_error: 6.2866 - val_loss: 17.3611 - val_root_mean_squared_error: 4.1560 - lr: 0.0010 - 4s/epoch - 15ms/step
Epoch 2/1000
INFO:tensorflow:Assets written to: model\model_0_ckpt\assets
254/254 - 2s - loss: 17.1003 - root_mean_squared_error: 4.1244 - val_loss: 15.3912 - val_root_mean_squared_error: 3.9116 - lr: 0.0010 - 2s/epoch - 9ms/step
Epoch 3/1000
INFO:tensorflow:Assets written to: model\model_0_ckpt\assets
254/254 - 3s - loss: 15.2891 - root_mean_squared_error: 3.8984 - val_loss: 14.6773 - val_root_mean_squared_error: 3.8189 - lr: 0.0010 - 3s/epoch - 11ms/step
Epoch 4/1000
INFO:tensorflow:Assets written to: model\model_0_ckpt\assets
254/254 - 3s - loss: 14.4520 - root_mean_squared_error: 3.7891 - val_loss: 14.2574 - val_root_mean_squared_error: 3.7632 - lr: 0.0010 - 3s/epoch - 11ms/step
Epoch 5/1000
INFO:tensorflow:Assets written to: model\model_0_ckpt\assets
254

Cross-validation shows us the average correlation between predictions and ground truth. The histogram additionally shows how the correlations of the cells are distributed. While most correlations are around 0.9, there exist a few predictions with negative correlations.

In [20]:
multi_test_x = scipy.sparse.load_npz("C:/Users/Owner/Documents/dev/open-problem/multimodal-single-cell-as-sparse-matrix/test_multi_inputs_values.sparse.npz")
multi_test_x = svd.transform(multi_test_x)

In [21]:
n=1
test_len = multi_test_x.shape[0]
d = test_len//n
x = []
for i in range(n):
    x.append(multi_test_x[i*d:i*d+d])
del multi_test_x
gc.collect()

29696

In [22]:
preds = np.zeros((test_len, 23418), dtype='float16')
for i,xx in enumerate(x):
    for fold in range(N_SPLITS):
        print(f"Predicting with fold {fold}")
        model = load_model(f"model/model_{fold}")
        preds[i*d:i*d+d,:] += (model.predict(xx)@pca2.components_)/N_SPLITS
        gc.collect()
    print('')
    del xx
gc.collect()

Predicting with fold 0
Predicting with fold 1
Predicting with fold 2



0

In [23]:
# Read the table of rows and columns required for submission
eval_ids = pd.read_parquet("C:/Users/Owner/Documents/dev/open-problem/multimodal-single-cell-as-sparse-matrix/evaluation.parquet")
# Convert the string columns to more efficient categorical types
#eval_ids.cell_id = eval_ids.cell_id.apply(lambda s: int(s, base=16))
eval_ids.cell_id = eval_ids.cell_id.astype(pd.CategoricalDtype())
eval_ids.gene_id = eval_ids.gene_id.astype(pd.CategoricalDtype())

In [24]:
# Prepare an empty series which will be filled with predictions
submission = pd.Series(name='target',
                       index=pd.MultiIndex.from_frame(eval_ids), 
                       dtype=np.float32)
submission

row_id    cell_id       gene_id        
0         c2150f55becb  CD86              NaN
1         c2150f55becb  CD274             NaN
2         c2150f55becb  CD270             NaN
3         c2150f55becb  CD155             NaN
4         c2150f55becb  CD112             NaN
                                           ..
65744175  2c53aa67933d  ENSG00000134419   NaN
65744176  2c53aa67933d  ENSG00000186862   NaN
65744177  2c53aa67933d  ENSG00000170959   NaN
65744178  2c53aa67933d  ENSG00000107874   NaN
65744179  2c53aa67933d  ENSG00000166012   NaN
Name: target, Length: 65744180, dtype: float32

In [25]:
np.save('preds.npy', preds)


In [28]:
%%time
y_columns = np.load("C:/Users/Owner/Documents/dev/open-problem/multimodal-single-cell-as-sparse-matrix/train_multi_targets_idxcol.npz",
                   allow_pickle=True)["columns"]

test_index = np.load("C:/Users/Owner/Documents/dev/open-problem/multimodal-single-cell-as-sparse-matrix/test_multi_inputs_idxcol.npz",
                    allow_pickle=True)["index"]

CPU times: total: 31.2 ms
Wall time: 34 ms


In [29]:
cell_dict = dict((k,v) for v,k in enumerate(test_index)) 
assert len(cell_dict)  == len(test_index)

gene_dict = dict((k,v) for v,k in enumerate(y_columns))
assert len(gene_dict) == len(y_columns)

In [30]:
eval_ids_cell_num = eval_ids.cell_id.apply(lambda x:cell_dict.get(x, -1))
eval_ids_gene_num = eval_ids.gene_id.apply(lambda x:gene_dict.get(x, -1))

valid_multi_rows = (eval_ids_gene_num !=-1) & (eval_ids_cell_num!=-1)

In [31]:
# Pass the mask to iloc as a plain boolean array rather than a Series
submission.iloc[valid_multi_rows.to_numpy()] = preds[eval_ids_cell_num[valid_multi_rows].to_numpy(),
                                                     eval_ids_gene_num[valid_multi_rows].to_numpy()]
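
The lookup performed in the last few cells can be illustrated on toy data. The sketch below is NumPy-only and all values are hypothetical. It maps cell and gene identifiers to row and column numbers, marks pairs that do not belong to this prediction matrix with -1, and fills only the valid positions.

```python
import numpy as np

# Hypothetical toy data: 3 test cells, 2 target genes
test_index = np.array(["c1", "c2", "c3"])
y_columns = np.array(["g1", "g2"])
preds = np.arange(6, dtype=np.float32).reshape(3, 2)

# Requested (cell, gene) pairs; the last one belongs to another technology
eval_cells = ["c3", "c1", "c2", "zz"]
eval_genes = ["g1", "g2", "g2", "g1"]

cell_dict = {k: v for v, k in enumerate(test_index)}
gene_dict = {k: v for v, k in enumerate(y_columns)}
cell_num = np.array([cell_dict.get(c, -1) for c in eval_cells])
gene_num = np.array([gene_dict.get(g, -1) for g in eval_genes])
valid = (cell_num != -1) & (gene_num != -1)

target = np.full(len(eval_cells), np.nan, dtype=np.float32)
# Paired fancy indexing picks preds[row_i, col_i] for every valid pair
target[valid] = preds[cell_num[valid], gene_num[valid]]
print(target)
```

The entries left as NaN are the ones that the notebook later fills from the other technology's submission file.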

In [32]:
del eval_ids_cell_num, eval_ids_gene_num, valid_multi_rows, eval_ids, test_index, y_columns
gc.collect()

40

In [33]:
submission

row_id    cell_id       gene_id        
0         c2150f55becb  CD86                    NaN
1         c2150f55becb  CD274                   NaN
2         c2150f55becb  CD270                   NaN
3         c2150f55becb  CD155                   NaN
4         c2150f55becb  CD112                   NaN
                                             ...   
65744175  2c53aa67933d  ENSG00000134419    5.695312
65744176  2c53aa67933d  ENSG00000186862    0.036163
65744177  2c53aa67933d  ENSG00000170959    0.034607
65744178  2c53aa67933d  ENSG00000107874    1.027344
65744179  2c53aa67933d  ENSG00000166012    5.078125
Name: target, Length: 65744180, dtype: float32

In [34]:
submission.reset_index(drop=True, inplace=True)
submission.index.name = 'row_id'

In [35]:
cite_submission = pd.read_csv("C:/Users/Owner/Documents/dev/open-problem/citeseq/submission_svd256_wdo.csv")
cite_submission = cite_submission.set_index("row_id")
cite_submission = cite_submission["target"]

In [36]:
submission[submission.isnull()] = cite_submission[submission.isnull()]

In [37]:
submission.isnull().any()

False

In [38]:
submission.to_csv(submission_name)