# Read test data
tst = pd.read_parquet('/kaggle/input/leash-BELKA/test.parquet')

# Initialize 'binds' column with zeros
tst['binds'] = 0

# Fill 'binds' column with predictions for each protein
tst.loc[tst['protein_name']=='BRD4', 'binds'] = cnn_preds[(tst['protein_name']=='BRD4').values, 0]
tst.loc[tst['protein_name']=='HSA', 'binds'] = cnn_preds[(tst['protein_name']=='HSA').values, 1]
tst.loc[tst['protein_name']=='sEH', 'binds'] = cnn_preds[(tst['protein_name']=='sEH').values, 2]

# Save submission file
tst[['id', 'binds']].to_csv('submission.csv', index=False)
# Leash Bio - Predict New Medicines with BELKA

## Introduction
Small-molecule drugs typically work by binding to specific proteins. Because the space of drug-like molecules is vast, screening candidates with traditional experimental methods is laborious and time-consuming. The Leash Bio competition, "Predict New Medicines with BELKA," asks participants to use machine learning to predict whether a small molecule binds a given protein target.

## Dataset Overview
The competition dataset comprises binary classifications indicating whether a small molecule binds to one of three protein targets. The data were collected using DNA-encoded chemical library (DEL) technology. Each example includes SMILES representations of building blocks and the fully assembled molecule, along with protein target names and binary binding classifications.

### Files
- **train/test.[csv/parquet]:** Contains training or test data in both csv and parquet formats.
  - `id`: Unique identifier for the molecule-binding target pair.
  - `buildingblock1_smiles`, `buildingblock2_smiles`, `buildingblock3_smiles`: SMILES representations of building blocks.
  - `molecule_smiles`: SMILES representation of the fully assembled molecule.
  - `protein_name`: Name of the protein target.
  - `binds`: Binary class label indicating whether the molecule binds to the protein (not available for the test set).
- **sample_submission.csv:** Sample submission file in the correct format.

### Competition Data
Leash Biosciences provides approximately 98M training examples per protein, 200K validation examples per protein, and 360K test molecules per protein. The test set contains building blocks that do not appear in the training set, so test performance reflects how well a model generalizes to unseen chemistry. The data are highly imbalanced: only about 0.5% of examples are binders.
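
With roughly 0.5% binders, a scorer that carries no signal already has a predictable baseline: the average precision of a random ranking sits close to the positive rate. The sketch below illustrates this on synthetic labels. The `average_precision` helper is a hypothetical, simplified implementation (it does not handle tied scores the way a library metric would) and is not the competition's scoring code.

```python
import numpy as np

def average_precision(y_true, y_score):
    """Mean of precision@k, taken at the rank of every positive (no tie handling)."""
    order = np.argsort(-y_score, kind="stable")   # highest score first
    y_sorted = y_true[order]
    hits = np.cumsum(y_sorted)                    # positives seen so far
    ranks = np.arange(1, len(y_sorted) + 1)
    is_pos = y_sorted == 1
    return float((hits[is_pos] / ranks[is_pos]).mean())

rng = np.random.default_rng(0)
n = 200_000
y = np.zeros(n, dtype=np.int64)
y[rng.choice(n, size=1000, replace=False)] = 1    # synthetic labels, 0.5% positives

ap_random = average_precision(y, rng.random(n))       # scores unrelated to labels
ap_perfect = average_precision(y, y.astype(float))    # every positive ranked first
print(f"positive rate={y.mean():.4f}  random AP={ap_random:.4f}  perfect AP={ap_perfect:.1f}")
```

Any reported average precision is best read against this baseline rather than against 0.5.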

## Protein Targets
The competition focuses on predicting whether molecules bind to three protein targets:

1. **EPHX2 (sEH):** Encoded by the EPHX2 genetic locus, soluble epoxide hydrolase (sEH) is a potential drug target for conditions like high blood pressure and diabetes. Crystal structures and amino acid sequences are provided for contestants wishing to incorporate protein structural information.
2. **BRD4:** Bromodomain-containing protein 4 (BRD4) plays roles in cancer progression, and inhibiting its activity is a strategy for cancer treatment. Crystal structures and sequences are available for contestants.
3. **ALB (HSA):** Human serum albumin (HSA) is the most common protein in blood and plays a crucial role in drug absorption and transport, so predicting ALB binding is relevant to drug discovery across many diseases.

<table>
  <tr>
      <th><h2>Protein Name</h2></th>
      <th><h2>Structure</h2></th>
  </tr>
  <tr>
      <td><h2>EPHX2 (sEH)</h2></td>
    <td><img src="https://cdn1.sinobiological.com/styles/default/images/protein-structure/CTSS-protein-structure.jpg" alt="EPHX2 (sEH) protein structure" width="500" height="500"></td>
  </tr>
  <tr>
      <td><h2>BRD4</h2></td>
    <td><img src="https://www.pinclipart.com/picdir/big/70-700834_protein-brd4-pdb-2oss-by-emw-brd4-protein.png" alt="BRD4 protein structure" width="500" height="500"></td>
  </tr>
  <tr>
      <td><h2>ALB (HSA)</h2></td>
    <td><img src="https://cdn.rcsb.org/images/structures/1e78_assembly-1.jpeg" alt="ALB (HSA) protein structure" width="500" height="500"></td>
  </tr>
</table>

In [1]:
!pip install fastparquet -q


In [None]:
!pip install seaborn

In [16]:
import gc  # Import garbage collection module for memory management
import os  # Import OS module to interact with the operating system
import pickle  # Import pickle module for object serialization
import random  # Import random module for generating random numbers
import joblib  # Import joblib module for efficient serialization
import numpy as np  # Import NumPy for numerical operations
import pandas as pd  # Import pandas for data manipulation
from tqdm import tqdm  # Import tqdm for progress bars
from sklearn.model_selection import KFold  # Import KFold for (non-stratified) cross-validation
from sklearn.metrics import average_precision_score as APS, roc_auc_score, classification_report, auc, roc_curve  # Import various metrics from scikit-learn
import tensorflow as tf  # Import TensorFlow for deep learning
import matplotlib.pyplot as plt  # Import Matplotlib for plotting

In [3]:
class CFG:
    PREPROCESS = False  # Flag to indicate whether preprocessing is needed
    EPOCHS = 20  # Number of training epochs
    BATCH_SIZE = 4096  # Size of each batch for training
    LR = 1e-3  # Learning rate for the optimizer
    WD = 0.05  # Weight decay for regularization

    NBR_FOLDS = 15  # Number of folds for cross-validation
    SELECTED_FOLDS = [0]  # List of selected folds to be used in training

    SEED = 2024  # Random seed for reproducibility

In [4]:
def set_seeds(seed):
    os.environ['PYTHONHASHSEED'] = str(seed)  # Set PYTHONHASHSEED environment variable for reproducibility
    random.seed(seed)  # Set the seed for the random module
    tf.random.set_seed(seed)  # Set the seed for TensorFlow
    np.random.seed(seed)  # Set the seed for NumPy

set_seeds(seed=CFG.SEED)  # Call the function with the seed defined in CFG

In [5]:
import tensorflow as tf

# Detect hardware, return appropriate distribution strategy
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect(tpu="local")  # Connect to local TPU
    strategy = tf.distribute.TPUStrategy(tpu)  # Create a TPUStrategy for distribution
    print("Running on TPU")  # Confirm running on TPU
    print("REPLICAS: ", strategy.num_replicas_in_sync)  # Print the number of TPU replicas
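except (tf.errors.NotFoundError, ValueError):
    strategy = tf.distribute.get_strategy()  # Fall back to the default strategy so `strategy` is always defined
    print("Not on TPU")  # Handle case where TPU is not found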

INFO:tensorflow:Deallocate tpu buffers before initializing tpu system.
INFO:tensorflow:Initializing the TPU system: local
INFO:tensorflow:Finished initializing TPU system.
INFO:tensorflow:Found TPU system:
INFO:tensorflow:*** Num TPU Cores: 8
INFO:tensorflow:*** Num TPU Workers: 1
INFO:tensorflow:*** Num TPU Cores Per Worker: 8
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:0, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:1, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:2, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:3, TPU, 0, 0)
INFO:tensorflow:*** Available Device: _DeviceAttributes(/job:localhost/replica:0/task:0/device:TPU:4, TPU

# Preprocessing

In [11]:
enc = {'l': 1, 'y': 2, '@': 3, '3': 4, 'H': 5, 'S': 6, 'F': 7, 'C': 8, 'r': 9, 's': 10, '/': 11, 'c': 12, 'o': 13,
           '+': 14, 'I': 15, '5': 16, '(': 17, '2': 18, ')': 19, '9': 20, 'i': 21, '#': 22, '6': 23, '8': 24, '4': 25, '=': 26,
           '1': 27, 'O': 28, '[': 29, 'D': 30, 'B': 31, ']': 32, 'N': 33, '7': 34, 'n': 35, '-': 0}
if CFG.PREPROCESS:    
    train_raw = pd.read_parquet('/kaggle/input/leash-BELKA/train.parquet')
    smiles = train_raw[train_raw['protein_name']=='BRD4']['molecule_smiles'].values
    assert (smiles!=train_raw[train_raw['protein_name']=='HSA']['molecule_smiles'].values).sum() == 0
    assert (smiles!=train_raw[train_raw['protein_name']=='sEH']['molecule_smiles'].values).sum() == 0
    def encode_smile(smile):
        tmp = [enc[i] for i in smile]
        tmp = tmp + [0]*(142-len(tmp))
        return np.array(tmp).astype(np.uint8)

    smiles_enc = joblib.Parallel(n_jobs=96)(joblib.delayed(encode_smile)(smile) for smile in tqdm(smiles))
    smiles_enc = np.stack(smiles_enc)
    train = pd.DataFrame(smiles_enc, columns = [f'enc{i}' for i in range(142)])
    train['bind1'] = train_raw[train_raw['protein_name']=='BRD4']['binds'].values
    train['bind2'] = train_raw[train_raw['protein_name']=='HSA']['binds'].values
    train['bind3'] = train_raw[train_raw['protein_name']=='sEH']['binds'].values
    train.to_parquet('train_enc.parquet')

    test_raw = pd.read_parquet('/kaggle/input/leash-BELKA/test.parquet')
    smiles = test_raw['molecule_smiles'].values

    smiles_enc = joblib.Parallel(n_jobs=96)(joblib.delayed(encode_smile)(smile) for smile in tqdm(smiles))
    smiles_enc = np.stack(smiles_enc)
    test = pd.DataFrame(smiles_enc, columns = [f'enc{i}' for i in range(142)])
    test.to_parquet('test_enc.parquet')

else:
    train = pd.read_parquet('/kaggle/input/belka-enc-dataset/train_enc.parquet')
    test = pd.read_parquet('/kaggle/input/belka-enc-dataset/test_enc.parquet')
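
The preprocessing branch above turns the long table (one row per molecule and protein) into a wide table with one row per molecule and three label columns. A minimal sketch of that reshaping on synthetic data follows; the arrays are made up for illustration, and the approach relies on the same assumption the notebook asserts, namely that the molecules appear in the same order for every protein.

```python
import numpy as np

# Synthetic long-format rows: one row per (molecule, protein) pair.
protein = np.array(["BRD4", "HSA", "sEH", "BRD4", "HSA", "sEH"])
smiles = np.array(["CCO", "CCO", "CCO", "c1ccccc1", "c1ccccc1", "c1ccccc1"])
binds = np.array([0, 1, 0, 0, 0, 1])

names = ["BRD4", "HSA", "sEH"]   # becomes the column order bind1, bind2, bind3

# The per-protein slices must list the molecules in the same order,
# otherwise stacking the labels side by side would mix up molecules.
ref = smiles[protein == names[0]]
for name in names[1:]:
    assert (smiles[protein == name] == ref).all()

# One row per molecule, one label column per protein.
wide_labels = np.stack([binds[protein == name] for name in names], axis=1)
print(ref.tolist())
print(wide_labels.tolist())
```

The column order chosen here is what later ties output unit 0 of the model to BRD4, unit 1 to HSA, and unit 2 to sEH.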

# Modeling

In [12]:
def encode_smile(smile):
    tmp = [enc.get(char, 0) for char in smile]  # Use 0 for any character not in the dictionary
    tmp = tmp + [0] * (142 - len(tmp))
    encoded = np.array(tmp).astype(np.uint8)
    if (encoded >= 37).any():
        raise ValueError(f"Encoded value out of range in SMILES: {smile}")
    return encoded
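
To make the encoding concrete, here is a small standalone sketch that mirrors `encode_smile` on a few inputs. The helper name `encode_smiles_padded` is hypothetical; the dictionary is copied from the preprocessing cell. Note that `'-'` maps to 0 in that dictionary, the same value used for padding and for unknown characters, so with `mask_zero=True` these three cases are indistinguishable to the embedding layer.

```python
import numpy as np

# Character table copied from the preprocessing cell.
enc = {'l': 1, 'y': 2, '@': 3, '3': 4, 'H': 5, 'S': 6, 'F': 7, 'C': 8, 'r': 9, 's': 10, '/': 11, 'c': 12, 'o': 13,
       '+': 14, 'I': 15, '5': 16, '(': 17, '2': 18, ')': 19, '9': 20, 'i': 21, '#': 22, '6': 23, '8': 24, '4': 25, '=': 26,
       '1': 27, 'O': 28, '[': 29, 'D': 30, 'B': 31, ']': 32, 'N': 33, '7': 34, 'n': 35, '-': 0}

def encode_smiles_padded(smile, max_len=142):
    """Map each character to its integer code and right-pad with zeros to max_len."""
    codes = [enc.get(ch, 0) for ch in smile]      # unknown characters become 0
    codes = codes + [0] * (max_len - len(codes))  # no-op if the string is already too long
    return np.array(codes, dtype=np.uint8)

ethanol = encode_smiles_padded("CCO")
print(ethanol[:5].tolist(), len(ethanol))            # first codes, then the padded length
print(encode_smiles_padded("CPC")[:3].tolist())      # 'P' is not in the table
print(encode_smiles_padded("C-C")[:3].tolist())      # '-' shares code 0 with padding
```

Strings longer than `max_len` are not truncated by this logic, so a length check before stacking may be worth adding if the input is not known to fit.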

In [13]:
# CNN model function
def cnn_model():
    INP_LEN = 142
    NUM_FILTERS = 32
    hidden_dim = 128

    inputs = tf.keras.layers.Input(shape=(INP_LEN,), dtype='int32')
    x = tf.keras.layers.Embedding(input_dim=37, output_dim=hidden_dim, input_length=INP_LEN, mask_zero=True)(inputs)
    x = tf.keras.layers.Conv1D(filters=NUM_FILTERS, kernel_size=3, activation='relu', padding='valid', strides=1)(x)
    x = tf.keras.layers.Conv1D(filters=NUM_FILTERS * 2, kernel_size=3, activation='relu', padding='valid', strides=1)(x)
    x = tf.keras.layers.Conv1D(filters=NUM_FILTERS * 3, kernel_size=3, activation='relu', padding='valid', strides=1)(x)
    x = tf.keras.layers.GlobalMaxPooling1D()(x)
    x = tf.keras.layers.Dense(1024, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.1)(x)
    x = tf.keras.layers.Dense(1024, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.1)(x)
    x = tf.keras.layers.Dense(512, activation='relu')(x)
    x = tf.keras.layers.Dropout(0.1)(x)
    outputs = tf.keras.layers.Dense(3, activation='sigmoid')(x)

    model = tf.keras.models.Model(inputs=inputs, outputs=outputs)
    optimizer = tf.keras.optimizers.Adam(learning_rate=CFG.LR, weight_decay=CFG.WD)
    loss = 'binary_crossentropy'
    weighted_metrics = [tf.keras.metrics.AUC(curve='PR', name='avg_precision')]
    model.compile(
        loss=loss,
        optimizer=optimizer,
        weighted_metrics=weighted_metrics,
    )
    return model
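
As a sanity check on the architecture, the sequence length after each `padding='valid'` convolution and the number of trainable parameters can be worked out by hand. The sketch below does that arithmetic in plain Python using the layer sizes from `cnn_model`; it is a back-of-the-envelope calculation, and the total should be compared against `model.summary()` rather than taken as authoritative.

```python
def conv1d_valid_len(length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding."""
    return (length - kernel_size) // stride + 1

seq_len, vocab_size, emb_dim = 142, 37, 128
conv_filters = [32, 64, 96]          # NUM_FILTERS * 1, 2, 3
dense_units = [1024, 1024, 512, 3]   # three hidden layers plus the 3-unit output

params = vocab_size * emb_dim        # embedding table
lengths = []
in_channels = emb_dim
for filters in conv_filters:
    seq_len = conv1d_valid_len(seq_len, kernel_size=3)
    lengths.append(seq_len)
    params += 3 * in_channels * filters + filters   # kernel weights + biases
    in_channels = filters

in_dim = in_channels                 # global max pooling keeps one value per filter
for units in dense_units:
    params += in_dim * units + units # weights + biases
    in_dim = units

print("sequence lengths after each conv:", lengths)
print("total parameters:", params)
```

Most of the parameters sit in the dense layers, not in the convolutions, which is worth keeping in mind when tuning the dropout rate or layer widths.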

# Train & Inference

In [14]:
def train_and_evaluate(model_fn, model_name):
    kf = KFold(n_splits=CFG.NBR_FOLDS, shuffle=True, random_state=CFG.SEED)
    oof_predictions = []
    test_predictions = []
    histories = []
    fold = 0

    for train_index, val_index in kf.split(X_train):
        fold += 1
        print(f"Training fold {fold}...")

        X_tr, X_val = X_train[train_index], X_train[val_index]
        y_tr, y_val = y_train[train_index], y_train[val_index]

        model = model_fn()

        checkpoint = tf.keras.callbacks.ModelCheckpoint(
            f"{model_name}_model-{fold}.h5", save_best_only=True, monitor='val_loss', mode='min')
        reduce_lr_loss = tf.keras.callbacks.ReduceLROnPlateau(
            monitor='val_loss', factor=0.1, patience=3, verbose=1, min_lr=1e-6)
        es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=7, verbose=1, mode='min', restore_best_weights=True)

        history = model.fit(
            X_tr, y_tr,
            validation_data=(X_val, y_val),
            epochs=CFG.EPOCHS,
            callbacks=[checkpoint, reduce_lr_loss, es, TqdmCallback(verbose=1)],
            batch_size=CFG.BATCH_SIZE,
            verbose=0  # Set verbose to 0 for the progress bar
        )

        histories.append(history.history)
        model.load_weights(f"{model_name}_model-{fold}.h5")
        val_preds = model.predict(X_val)
        test_preds = model.predict(X_test)

        oof_predictions.append(val_preds)
        test_predictions.append(test_preds)

    oof_predictions = np.concatenate(oof_predictions, axis=0)
    test_predictions = np.mean(test_predictions, axis=0)
    return oof_predictions, test_predictions, histories
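
One caveat about `train_and_evaluate`: the validation predictions are concatenated in fold order, and because the split is shuffled, row `i` of `oof_predictions` does not correspond to row `i` of `y_train`. A common alternative is to write each fold's predictions into a preallocated array at the validation indices. The sketch below shows the idea with NumPy only; `kfold_indices` is a hypothetical stand-in for a shuffled K-fold splitter, and the "predictions" are placeholders chosen so the alignment is easy to verify.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed):
    """Hypothetical shuffled K-fold splitter: yields (train_idx, val_idx) pairs."""
    perm = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(perm, n_splits)
    for i in range(n_splits):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]

n_samples, n_targets = 10, 3
oof = np.full((n_samples, n_targets), np.nan)   # NaN marks rows not yet predicted

for train_idx, val_idx in kfold_indices(n_samples, n_splits=5, seed=2024):
    # Placeholder for model.predict(X[val_idx]): each row predicts its own index.
    val_preds = np.repeat(val_idx[:, None], n_targets, axis=1).astype(float)
    oof[val_idx] = val_preds                    # write back at the original row positions

assert not np.isnan(oof).any()                  # every row was validated exactly once
print(oof[:, 0])
```

With the predictions aligned this way, an out-of-fold metric can be computed directly against the label array.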

In [23]:
# Tqdm callback for progress bar
class TqdmCallback(tf.keras.callbacks.Callback):
    def __init__(self, verbose=1):
        super().__init__()
        self.verbose = verbose

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}  # Avoid a mutable default argument
        if self.verbose:
            loss = logs.get('loss', float('nan'))
            val_loss = logs.get('val_loss', float('nan'))
            tqdm.write(f"Epoch {epoch + 1}: loss = {loss:.4f}, val_loss = {val_loss:.4f}")

In [24]:
# Sample data preparation (replace with your actual data)
X_train = np.array([encode_smile('CCO') for _ in range(1000)])
y_train = np.random.randint(0, 2, (1000, 3))
X_test = np.array([encode_smile('CCO') for _ in range(200)])
y_test = np.random.randint(0, 2, (200, 3))

# Train and evaluate CNN model
cnn_oof, cnn_preds, cnn_histories = train_and_evaluate(cnn_model, "cnn")

Training fold 1...
Epoch 1: loss = 0.6932, val_loss = 0.6930
Epoch 2: loss = 0.6929, val_loss = 0.6930
Epoch 3: loss = 0.6928, val_loss = 0.6932

Epoch 4: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 4: loss = 0.6927, val_loss = 0.6935
Epoch 5: loss = 0.6928, val_loss = 0.6935
Epoch 6: loss = 0.6928, val_loss = 0.6935

Epoch 7: ReduceLROnPlateau reducing learning rate to 1.0000000474974514e-05.
Epoch 7: loss = 0.6929, val_loss = 0.6935
Restoring model weights from the end of the best epoch: 1.
Epoch 8: loss = 0.6929, val_loss = 0.6935
Epoch 8: early stopping
Training fold 2...
Epoch 1: loss = 0.6932, val_loss = 0.6930
Epoch 2: loss = 0.6931, val_loss = 0.6924
Epoch 3: loss = 0.6930, val_loss = 0.6917
Epoch 4: loss = 0.6929, val_loss = 0.6908
Epoch 5: loss = 0.6929, val_loss = 0.6902
Epoch 6: loss = 0.6929, val_loss = 0.6902
Epoch 7: loss = 0.6930, val_loss = 0.6906

Epoch 8: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.
Epoch 8: loss =
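
The fold 1 log can be read against the callback settings (`ReduceLROnPlateau` with `patience=3`, `EarlyStopping` with `patience=7`). The sketch below is a simplified re-implementation of the two patience rules, written for illustration and not taken from Keras; the default `min_delta` of `1e-4` for the plateau rule is an assumption, and the input is the rounded `val_loss` values printed above. The losses hovering around 0.693 (about ln 2) are what one would expect from the random sample labels.

```python
def simulate_callbacks(val_losses, lr=1e-3, factor=0.1, lr_patience=3,
                       min_lr=1e-6, min_delta=1e-4, es_patience=7):
    """Return (epochs where the LR was reduced, epoch where training stopped)."""
    lr_best = es_best = float("inf")
    lr_wait = es_wait = 0
    reductions, stop_epoch = [], None
    for epoch, loss in enumerate(val_losses, start=1):
        # Plateau rule: count epochs without a meaningful improvement.
        if loss < lr_best - min_delta:
            lr_best, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                if lr > min_lr:
                    lr = max(lr * factor, min_lr)
                    reductions.append(epoch)
                lr_wait = 0
        # Early-stopping rule: stop after es_patience epochs without a new best.
        es_wait += 1
        if loss < es_best:
            es_best, es_wait = loss, 0
        if es_wait >= es_patience:
            stop_epoch = epoch
            break
    return reductions, stop_epoch

fold1_val_loss = [0.6930, 0.6930, 0.6932, 0.6935, 0.6935, 0.6935, 0.6935, 0.6935]
result = simulate_callbacks(fold1_val_loss)
print(result)
```

Under these assumptions the simulation reduces the learning rate at epochs 4 and 7 and stops at epoch 8, which is consistent with the messages in the fold 1 log.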

# Submission

In [33]:
# Debug the shapes of the boolean index and cnn_preds
print("Boolean index shape:", (tst['protein_name']=='BRD4').values.shape)
print("cnn_preds shape:", cnn_preds.shape)

# Check if the boolean index and cnn_preds correspond to the same test data samples
print("Unique protein names in tst:", tst['protein_name'].unique())
print("Unique protein names in cnn_preds:", np.unique([tst['protein_name'][idx] for idx in np.argmax(cnn_preds, axis=1)]))

Boolean index shape: (1674896,)
cnn_preds shape: (200, 3)
Unique protein names in tst: ['BRD4' 'HSA' 'sEH']
Unique protein names in cnn_preds: ['sEH']


In [34]:
# Read test data
tst = pd.read_parquet('/kaggle/input/leash-BELKA/test.parquet')

# Initialize 'binds' column with zeros (float, so probabilities can be stored)
tst['binds'] = 0.0

# cnn_preds must hold one row per test row, with columns ordered BRD4, HSA, sEH.
# Predictions made on the 200-row sample data above do not satisfy this.
assert cnn_preds.shape == (len(tst), 3), "cnn_preds must be predicted on the full encoded test set"

# Fill 'binds' column with the row-aligned prediction for each protein
for col, name in enumerate(['BRD4', 'HSA', 'sEH']):
    mask = (tst['protein_name'] == name).values
    tst.loc[mask, 'binds'] = cnn_preds[mask, col]

# Save submission file
tst[['id', 'binds']].to_csv('submission.csv', index=False)
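
The mapping from the model's three output columns back to one value per test row can also be written as a single indexing step. The sketch below uses synthetic predictions and assumes the column order BRD4, HSA, sEH (matching `bind1`..`bind3` in the preprocessing cell); the `protein_to_col` dictionary is a name introduced here for illustration.

```python
import numpy as np

# Column order assumed to match bind1, bind2, bind3 from preprocessing.
protein_to_col = {"BRD4": 0, "HSA": 1, "sEH": 2}

# Synthetic test rows; a molecule need not appear for all three proteins.
protein_name = np.array(["BRD4", "HSA", "sEH", "HSA", "sEH"])
preds = np.random.default_rng(0).random((len(protein_name), 3))  # stand-in for model output

# Predictions must be row-aligned with the test table before indexing.
assert preds.shape == (len(protein_name), 3)

col = np.array([protein_to_col[name] for name in protein_name])
binds = preds[np.arange(len(protein_name)), col]   # pick one column per row
print(binds.shape)
```

Each row keeps only the probability for its own protein, which is the value the submission file expects in `binds`.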