# High-Throughput Molecular Screening Demo

This notebook demonstrates how to perform large-scale batch inference using the `deepmirror` public API.
We'll screen 15 million molecules with a pre-trained model, splitting the submission into chunks of at most 5 million rows each.
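
The chunking arithmetic can be sketched on its own before touching any data. This is a minimal illustration, not part of the `deepmirror` client: `chunk_bounds` is a hypothetical helper name, and the 5M cap is taken from the text above.

```python
def chunk_bounds(total_rows: int, max_rows: int) -> list[tuple[int, int]]:
    """Return half-open (start, stop) row ranges that cover total_rows.

    Hypothetical helper for illustration; every range holds at most max_rows rows.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    return [
        (start, min(start + max_rows, total_rows))
        for start in range(0, total_rows, max_rows)
    ]


# The notebook's numbers: 15M rows with a cap of 5M rows per chunk.
bounds = chunk_bounds(15_000_000, 5_000_000)
for start, stop in bounds:
    print(f"rows {start:>10,} to {stop:>10,} ({stop - start:,} rows)")
```

With these values the library is split into three equally sized submissions. When the total is not a multiple of the cap, the last range is simply shorter.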

## Prerequisites

Before running this notebook, ensure that:
1. You are logged in via terminal using `dm login <YOUR_EMAIL>`
2. Your API token is saved to a file
3. Your input data is in **parquet** format with two columns:
   - `"ID"` (unique identifier)
   - `"SMILES"` (molecular structure)
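
A quick pre-flight check of the input table might look like the sketch below. The column names come from the list above; `check_screening_frame` is a hypothetical helper, and the missing-value check is an assumption about what a sensible input looks like, not a documented API requirement.

```python
import pandas as pd

REQUIRED_COLUMNS = ("ID", "SMILES")  # column names from the prerequisites list


def check_screening_frame(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means no issues were detected.

    Hypothetical helper for illustration only.
    """
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    if df["ID"].duplicated().any():
        problems.append("ID values are not unique")
    # Assumption: rows without a SMILES string are not useful to submit.
    if df["SMILES"].isna().any():
        problems.append("SMILES column contains missing values")
    return problems


# Small example with one duplicated ID and one missing SMILES string.
example = pd.DataFrame({"ID": [0, 1, 1], "SMILES": ["CCO", None, "c1ccccc1"]})
issues = check_screening_frame(example)
print(issues)
```

Running a check like this before writing the Parquet file surfaces problems locally instead of after a multi-million-row upload.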

To install the `deepmirror` client library and log in, uncomment and run the cell below:


In [None]:
# !pip install deepmirror
# !dm login <YOUR_EMAIL>

## Step 1: Setup

In [None]:
import datetime
import io
import tempfile
import time
from pathlib import Path

import pandas as pd
from tqdm import tqdm

from deepmirror.api import (
    create_batch_inference,
    download_batch_results,
    get_batch_inference,
    list_models,
)

MAX_ROWS_PER_BATCH = 5_000_000
TOTAL_ROWS = 15_000_000

In this example, we artificially create a 15M-row dataset by repeating a smaller dataset multiple times.


In [None]:
root = Path.cwd().parent
csv_path = root / "data" / "data-reg.csv"
df = pd.read_csv(csv_path)[["SMILES"]]

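# Round up so the result reaches TOTAL_ROWS even when len(df) does not divide it evenly, then trim.
repeat_factor = -(-TOTAL_ROWS // len(df))
df = pd.concat([df] * repeat_factor, ignore_index=True).iloc[:TOTAL_ROWS].copy()
df["ID"] = df.index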

df.tail()

## Step 2: Save screening library to Parquet

In [None]:
with tempfile.NamedTemporaryFile(
    suffix=".parquet", delete=False
) as tmp_parquet:
    df.to_parquet(tmp_parquet.name)
    screening_file = tmp_parquet.name

## Step 3: Select your model

In [None]:
models = list_models()
model_id = models[0]["model_id"]  # Replace with your desired model ID

## Step 4: Submit batch jobs

In [None]:
screening_df = pd.read_parquet(screening_file)
assert "ID" in screening_df.columns
assert "SMILES" in screening_df.columns

jobs = []
for i in range(0, len(screening_df), MAX_ROWS_PER_BATCH):
    chunk = screening_df.iloc[i : i + MAX_ROWS_PER_BATCH]

    with tempfile.NamedTemporaryFile(
        suffix=".parquet", delete=False
    ) as tmp_chunk:
        chunk.to_parquet(tmp_chunk.name)
        job = create_batch_inference(
            model_id=model_id, file_path=tmp_chunk.name
        )
        jobs.append(job)

In [None]:
jobs_df = pd.DataFrame(jobs)
timestamp = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
jobs_df.to_csv(f"batch-inference-{timestamp}.csv", index=False)
jobs_df

## Step 5: Monitor job progress

In [None]:
jobs_df = jobs_df.copy()
done = set()
bar = tqdm(
    total=len(jobs_df), desc="Batch inference jobs", position=0, leave=True
)

while not jobs_df["status"].isin(["completed", "failed"]).all():
    status_summary = []

    for idx, row in jobs_df.iterrows():
        if row["status"] in ("completed", "failed"):
            status_summary.append(
                f"Job {row['task_id'][:6]}...: {row['status']} ({row['progress']}%)"
            )
            continue

        status = get_batch_inference(row["task_id"])
        jobs_df.at[idx, "status"] = status["status"]
        jobs_df.at[idx, "progress"] = status["progress"]

        line = f"Job {row['task_id'][:6]}...: {status['status']} ({status['progress']}%)"
        status_summary.append(line)

        if status["status"] in ("completed", "failed") and idx not in done:
            bar.update(1)
            done.add(idx)

    bar.set_postfix_str("\n".join(status_summary))
    time.sleep(5)

bar.close()

## Step 6: Review job completion

In [None]:
completed = jobs_df[jobs_df["status"] == "completed"]
failed = jobs_df[jobs_df["status"] == "failed"]

print(f"Completed jobs: {len(completed)}")
print(f"Failed jobs: {len(failed)}")

## Step 7: Download predictions

In [None]:
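for _, job in completed.iterrows():
    # Results arrive as Parquet bytes; parse them in memory.
    result_bytes = download_batch_results(job["task_id"])
    result_df = pd.read_parquet(io.BytesIO(result_bytes))
    # Write one CSV per job, named after its task ID.
    result_df.to_csv(f"example-output-{job['task_id']}.csv", index=False)

# Preview the last downloaded result (requires at least one completed job).
result_df.head()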
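
The loop above writes one CSV per job and keeps only the last frame in memory. If a single table is more convenient, the per-job frames could be collected and concatenated, as in this sketch. `combine_results` is a hypothetical helper, and the `prediction` column in the example is a made-up placeholder because the actual result schema is not shown in this notebook.

```python
import pandas as pd


def combine_results(frames_by_task: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Stack per-job result frames and tag each row with its originating task ID."""
    if not frames_by_task:
        # No completed jobs: return an empty frame instead of failing.
        return pd.DataFrame()
    tagged = [
        frame.assign(task_id=task_id)
        for task_id, frame in frames_by_task.items()
    ]
    return pd.concat(tagged, ignore_index=True)


# Toy frames standing in for downloaded results; "prediction" is a placeholder column.
frames = {
    "t1": pd.DataFrame({"ID": [0, 1], "prediction": [0.1, 0.7]}),
    "t2": pd.DataFrame({"ID": [2], "prediction": [0.4]}),
}
combined = combine_results(frames)
print(combined)
```

Returning an empty frame when nothing completed keeps downstream cells from raising on an undefined variable.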