# 🧪 Synthetic Dataset Generator Script

## 📦 Output Columns
Technology_Category – Conjugation class, Tier I/II/III (Random, Site-Specific Non-Selective, Site-Specific Selective)

Platform – Specific conjugation method

DAR_Mean – Average drug-to-antibody ratio (DAR)

DAR_Std – Standard deviation of DAR

DAR_CV – Coefficient of variation (DAR_Std / DAR_Mean)

Stability_Score – Proxy for chemical/biological stability

Homogeneity – Normalized score (0–1) drawn from a platform-specific range; in this script it is sampled independently of DAR_CV

Expression_Ease – Ease of protein expression with the method

Cost_Index – Inversely related to expression ease

CMC_Risk – Risk category (Low/Medium/High), sampled with probabilities that depend on Homogeneity

Scalability – Manufacturability score (0–1)
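As a small worked example of the DAR columns, the sketch below computes the coefficient of variation as standard deviation divided by mean, rounded to two decimals as the generator does. The helper name `dar_cv` and the sample numbers are illustrative assumptions, not part of the original script.

```python
def dar_cv(dar_mean: float, dar_std: float) -> float:
    """Coefficient of variation of DAR: std / mean, rounded to 2 decimals."""
    if dar_mean <= 0:
        raise ValueError("dar_mean must be positive")
    return round(dar_std / dar_mean, 2)

# Illustrative batch: mean DAR of 4.0 with a standard deviation of 0.2
example_cv = dar_cv(dar_mean=4.0, dar_std=0.2)
print(example_cv)  # 0.05
```

A lower value indicates a tighter DAR distribution within the batch.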

## Antibody Therapeutics (2025) Review

Public information from ADC developers: WuXi Biologics, Synaffix, Genentech, Daiichi Sankyo, ImmunoGen, Seagen

General DAR and pharmacokinetic variability literature

## 🔬 Features included:
DAR metrics: Mean, Std, Coefficient of Variation

Homogeneity, Expression Ease, Stability, Scalability

CMC Risk & Latency to Clinic (simulated years)

Vendor & Approved Usage Drugs from known ADC examples (Enhertu, Adcetris, Polivy, etc.)

Cost Index: Inversely related to expression, adjusted with random noise

This dataset is ideal for:

Multi-criteria decision modeling

Visualization dashboards

Platform recommendation simulations
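To illustrate the multi-criteria use case, here is a minimal weighted-sum sketch. The platform names, averaged scores, and weights are made-up placeholders, not values from the dataset or from any published ranking; in practice the table would come from per-platform means of the generated CSV.

```python
import pandas as pd

# Hypothetical per-platform averages (placeholders, not real results).
platform_means = pd.DataFrame(
    {
        "Homogeneity": [0.35, 0.90, 0.97],
        "Stability_Score": [0.70, 0.80, 0.85],
        "Scalability": [0.90, 0.50, 0.80],
        "Cost_Index": [0.40, 0.60, 0.50],
    },
    index=["Platform A", "Platform B", "Platform C"],
)

# Assumed weights; Cost_Index is negative because a lower cost is better.
weights = pd.Series(
    {"Homogeneity": 0.4, "Stability_Score": 0.3, "Scalability": 0.2, "Cost_Index": -0.1}
)

# Weighted sum per platform, then rank from best to worst.
scores = platform_means.mul(weights, axis=1).sum(axis=1)
ranking = scores.sort_values(ascending=False)
print(ranking)
```

The weights encode a subjective preference, so changing them can reorder the ranking; a sensitivity check over several weight sets is a sensible next step.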

## Data Design Notes:
Platform definitions and metric ranges are based on the 2025 review in Antibody Therapeutics, public ADC tech specs (e.g., WuXiDAR4™, Seagen platforms), and general DAR literature.

DAR, Homogeneity, CMC Risk: Reflect realistic trade-offs in conjugation strategies.

Stability, Cost, Scalability: Inferred from known manufacturability profiles and engineering complexity.

Multiple rows per platform simulate batch variation.
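The homogeneity-to-risk trade-off can be isolated as a small function. This sketch reuses the thresholds and probabilities from the generator script but swaps in the standard-library `random.Random.choices` for `np.random.choice`, so its draws will not match the generator's; the names `cmc_risk_probs` and `sample_cmc_risk` are hypothetical.

```python
import random

RISK_LEVELS = ["Low", "Medium", "High"]

def cmc_risk_probs(homogeneity: float) -> list:
    """Return [P(Low), P(Medium), P(High)] for a given homogeneity score."""
    if homogeneity > 0.9:
        return [0.65, 0.30, 0.05]
    if homogeneity > 0.75:
        return [0.40, 0.45, 0.15]
    return [0.25, 0.40, 0.35]

def sample_cmc_risk(homogeneity: float, rng: random.Random) -> str:
    """Draw one risk label using the homogeneity-dependent weights."""
    return rng.choices(RISK_LEVELS, weights=cmc_risk_probs(homogeneity), k=1)[0]

rng = random.Random(123)  # seeded so repeated runs give the same draws
for h in (0.35, 0.80, 0.95):
    print(h, cmc_risk_probs(h), sample_cmc_risk(h, rng))
```

Because the label is sampled rather than assigned, a highly homogeneous batch can still come out as "High" risk, just with low probability.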

In [7]:
import pandas as pd
import numpy as np

# Seed for reproducibility
np.random.seed(123)

# Define ADC conjugation platforms based on real-world examples
platforms_info = [
    # Random
    {"Platform": "Lysine-Based", "Vendor": "Generic", "Category": "Random", "Typical_DAR": (3.5, 4.5),
     "Homogeneity": (0.2, 0.5), "Scalability": (0.85, 1.0), "Approved_Usage": ["Adcetris", "Kadcyla"]},

    # Site-Specific Non-Selective
    {"Platform": "Interchain Cysteine", "Vendor": "Genentech", "Category": "Site-Specific Non-Selective",
     "Typical_DAR": (2.0, 3.4), "Homogeneity": (0.6, 0.8), "Scalability": (0.6, 0.9), "Approved_Usage": ["SGN-CD33A"]},

    {"Platform": "GlycoConnect™", "Vendor": "Synaffix", "Category": "Site-Specific Non-Selective",
     "Typical_DAR": (2.2, 3.0), "Homogeneity": (0.7, 0.9), "Scalability": (0.5, 0.8), "Approved_Usage": []},

    # Site-Specific Selective
    {"Platform": "THIOMAB™", "Vendor": "Genentech", "Category": "Site-Specific Selective",
     "Typical_DAR": (2.0, 2.5), "Homogeneity": (0.85, 0.98), "Scalability": (0.4, 0.6), "Approved_Usage": ["Polivy"]},

    {"Platform": "DXd Linker-Payload", "Vendor": "Daiichi Sankyo", "Category": "Site-Specific Selective",
     "Typical_DAR": (3.0, 3.8), "Homogeneity": (0.88, 0.99), "Scalability": (0.5, 0.7), "Approved_Usage": ["Enhertu"]},

    {"Platform": "WuXiDAR4™", "Vendor": "WuXi XDC", "Category": "Site-Specific Selective",
     "Typical_DAR": (3.4, 4.0), "Homogeneity": (0.95, 1.0), "Scalability": (0.7, 0.9), "Approved_Usage": []}
]

# Generate synthetic entries
rows = []
for entry in platforms_info:
    for _ in range(8):  # Simulate 8 batches per platform
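        # DAR statistics: mean from the platform's typical range, std from a shared range
        dar_mean = round(np.random.uniform(*entry["Typical_DAR"]), 2)
        dar_std = round(np.random.uniform(0.05, 0.4), 2)
        dar_cv = round(dar_std / dar_mean, 2)
        homogeneity = round(np.random.uniform(*entry["Homogeneity"]), 2)
        # Highly homogeneous batches (> 0.8) may reach a stability of 1.0; others cap at 0.85
        stability = round(np.random.uniform(0.6, 1.0 if homogeneity > 0.8 else 0.85), 2)
        expression = round(np.random.uniform(0.2, 0.95), 2)
        # Cost falls as expression ease rises, with Gaussian noise, clamped to [0.05, 1.0]
        cost = round(1.05 - expression + np.random.normal(0, 0.04), 2)
        cost = max(min(cost, 1.0), 0.05)
        scalability = round(np.random.uniform(*entry["Scalability"]), 2)
        # Low-homogeneity batches (< 0.6) average 5 years to clinic, others 2.5; floor at 0.5
        latency = round(np.random.normal(loc=5 if homogeneity < 0.6 else 2.5, scale=1.0), 2)
        latency = max(latency, 0.5)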

        # Adjust CMC Risk based on homogeneity
        if homogeneity > 0.9:
            probs = [0.65, 0.3, 0.05]
        elif homogeneity > 0.75:
            probs = [0.4, 0.45, 0.15]
        else:
            probs = [0.25, 0.4, 0.35]
        cmc_risk = np.random.choice(["Low", "Medium", "High"], p=probs)

        rows.append({
            "Technology_Category": entry["Category"],
            "Platform": entry["Platform"],
            "Vendor": entry["Vendor"],
            "DAR_Mean": dar_mean,
            "DAR_Std": dar_std,
            "DAR_CV": dar_cv,
            "Homogeneity": homogeneity,
            "Stability_Score": stability,
            "Expression_Ease": expression,
            "Cost_Index": cost,
            "CMC_Risk": cmc_risk,
            "Scalability": scalability,
            "Latency_to_Clinic_yrs": latency,
            "Approved_Usage": ", ".join(entry["Approved_Usage"]) if entry["Approved_Usage"] else "None"
        })

# Save the DataFrame
df_rich = pd.DataFrame(rows)

df_rich.to_csv("adc_conjugation_synthetic_advanced.csv", index=False)

# Trigger download (works only inside Google Colab; skipped elsewhere)
try:
    from google.colab import files
    files.download("adc_conjugation_synthetic_advanced.csv")
except ImportError:
    pass
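Before exploring the data, it can help to confirm that the score columns stay within their intended bounds. The check below is a sketch under the assumption that the five listed columns are meant to lie in [0, 1], which is read off the generator's sampling ranges; `find_out_of_range` and the demo frame are hypothetical.

```python
import pandas as pd

# Columns the generator keeps within [0, 1] (an assumption based on its sampling ranges).
UNIT_INTERVAL_COLUMNS = [
    "Homogeneity", "Stability_Score", "Expression_Ease", "Cost_Index", "Scalability",
]

def find_out_of_range(df: pd.DataFrame, columns=UNIT_INTERVAL_COLUMNS) -> list:
    """Return the names of columns containing any value outside [0, 1]."""
    return [col for col in columns if not df[col].between(0, 1).all()]

# Tiny hand-made frame standing in for the generated data; the 1.20 is a deliberate error.
demo = pd.DataFrame({
    "Homogeneity": [0.35, 0.92],
    "Stability_Score": [0.70, 0.88],
    "Expression_Ease": [0.50, 0.61],
    "Cost_Index": [0.55, 1.20],
    "Scalability": [0.90, 0.75],
})
bad_columns = find_out_of_range(demo)
print(bad_columns)  # ['Cost_Index']
```

Calling `find_out_of_range(df_rich)` on the frame built above should return an empty list if the generator ran unmodified.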



# Report in HTML to PDF
How to use this:
Make sure wkhtmltopdf is installed and on your system PATH.

Run the script in your Python environment where your CSV file is accessible.

It outputs:

adc_data_report.html (viewable in any browser)

adc_data_report.pdf (print-ready, with the matplotlib plots embedded as images; the interactive Plotly scatter may appear only as a static rendering or be omitted)

Notes:
The Plotly scatter is embedded as an interactive HTML snippet inside the HTML report, but PDF cannot render interactive JavaScript, so it will appear as a static fallback image or might be omitted in the PDF.

If you want full interactive reports, hosting the HTML is best.

For complex PDF layouts, you could also look at ReportLab or WeasyPrint.

Requirements:

pdfkit + wkhtmltopdf (to convert HTML to PDF), or reportlab / fpdf (more complex)

jinja2 (for HTML templating)

On Ubuntu: sudo apt install wkhtmltopdf

On Windows/macOS: download from https://wkhtmltopdf.org/downloads.html and add to PATH
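A quick way to confirm the PATH requirement before running the report is to look the binary up from Python. This is a minimal standard-library sketch; the helper name `find_wkhtmltopdf` is an assumption for illustration.

```python
import shutil

def find_wkhtmltopdf(binary_name: str = "wkhtmltopdf"):
    """Return the full path of the wkhtmltopdf executable, or None if not on PATH."""
    return shutil.which(binary_name)

wkhtmltopdf_path = find_wkhtmltopdf()
if wkhtmltopdf_path is None:
    print("wkhtmltopdf not found on PATH; install it before running pdfkit.")
else:
    print(f"Using wkhtmltopdf at {wkhtmltopdf_path}")
```

If the binary lives outside PATH, pdfkit also accepts an explicit location: build `pdfkit.configuration(wkhtmltopdf="/path/to/wkhtmltopdf")` and pass it as the `configuration=` argument of `pdfkit.from_string`.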


In [2]:
!pip install pdfkit jinja2

!apt-get install -y wkhtmltopdf


Collecting pdfkit
  Downloading pdfkit-1.0.0-py3-none-any.whl.metadata (9.3 kB)
Downloading pdfkit-1.0.0-py3-none-any.whl (12 kB)
Installing collected packages: pdfkit
Successfully installed pdfkit-1.0.0

# Interactive exploratory script:

Dataset summary and shape

Correlation heatmap

Boxplots and violin plots by platform

Interactive scatterplot (Plotly) for DAR vs. Homogeneity

Per-platform metric averages

You can now run this script to visually explore trends, correlations, and trade-offs between conjugation technologies.

In [8]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import pdfkit
from jinja2 import Template
import base64
from io import BytesIO


# Load dataset
df = pd.read_csv("adc_conjugation_synthetic_advanced.csv")

# Helper: Save matplotlib figure to base64 string for embedding in HTML
def fig_to_base64(fig):
    buf = BytesIO()
    fig.savefig(buf, format='png', bbox_inches='tight')
    plt.close(fig)
    buf.seek(0)
    img_base64 = base64.b64encode(buf.read()).decode('utf-8')
    return img_base64

# Generate matplotlib plots and convert to base64 strings
def generate_plots():
    plots = {}

    # Correlation heatmap
    fig, ax = plt.subplots(figsize=(10, 8))
    sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap="coolwarm", ax=ax)
    ax.set_title("Correlation Matrix")
    plots['corr_heatmap'] = fig_to_base64(fig)

    # Boxplot DAR Mean by Platform
    fig, ax = plt.subplots(figsize=(12, 6))
    sns.boxplot(data=df, x="Platform", y="DAR_Mean", ax=ax)
    ax.set_title("DAR Mean by Platform")
    plt.xticks(rotation=45)
    plots['dar_boxplot'] = fig_to_base64(fig)

    # Violin plot Expression Ease by Platform
    fig, ax = plt.subplots(figsize=(12, 6))
    sns.violinplot(data=df, x="Platform", y="Expression_Ease", inner="quartile", ax=ax)
    ax.set_title("Expression Ease by Platform")
    plt.xticks(rotation=45)
    plots['expression_violin'] = fig_to_base64(fig)

    return plots

plots = generate_plots()

# Generate interactive Plotly scatter as HTML div string
fig_scatter = px.scatter(
    df,
    x="DAR_Mean",
    y="Homogeneity",
    color="Technology_Category",
    symbol="Platform",
    hover_data=["Vendor", "CMC_Risk", "Stability_Score", "Approved_Usage"],
    title="DAR Mean vs Homogeneity by Technology Category"
)
scatter_html = fig_scatter.to_html(full_html=False, include_plotlyjs='cdn')

# Prepare summary tables as HTML
platform_counts_html = df['Platform'].value_counts().to_frame().to_html()
cmc_counts_html = df['CMC_Risk'].value_counts().to_frame().to_html()
avg_metrics_html = df.groupby("Platform")[["DAR_Mean", "Homogeneity", "Stability_Score", "Cost_Index", "Scalability"]].mean().round(2).to_html()

# Jinja2 HTML template for the report
html_template = """
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>ADC Conjugation Technologies - Data Exploration Report</title>
  <style>
    body { font-family: Arial, sans-serif; margin: 30px; }
    h1, h2, h3 { color: #2C3E50; }
    table { border-collapse: collapse; width: 60%; margin-bottom: 20px; }
    th, td { border: 1px solid #ddd; padding: 8px; text-align: center; }
    th { background-color: #2980B9; color: white; }
    img { max-width: 100%; height: auto; margin-bottom: 30px; }
    .plotly-graph-div { margin-bottom: 30px; }
    .section { margin-bottom: 50px; }
    .footer { font-size: 0.8em; color: #999; margin-top: 40px; }
  </style>
</head>
<body>

<h1>ADC Conjugation Technologies - Data Exploration Report</h1>

<div class="section">
  <h2>Dataset Overview</h2>
  <p><b>Total samples:</b> {{ total_samples }}</p>
  <p><b>Number of features:</b> {{ total_features }}</p>
  <p><b>Platforms represented:</b> {{ platforms_count }}</p>
  <p><b>Technology categories:</b> {{ tech_categories }}</p>
  <h3>Sample Data Snapshot</h3>
  {{ sample_data|safe }}
</div>

<div class="section">
  <h2>Distribution of Platforms</h2>
  {{ platform_counts|safe }}
  <p><i>Interpretation:</i> The generator simulates the same number of batches for every platform, so the counts above should be balanced; uneven counts would indicate a modified or filtered dataset.</p>
</div>

<div class="section">
  <h2>CMC Risk Profile</h2>
  {{ cmc_counts|safe }}
  <p><i>Interpretation:</i> Distribution of manufacturing risk categories indicating complexity and uncertainty in development.</p>
</div>

<div class="section">
  <h2>Correlation Analysis</h2>
  <img src="data:image/png;base64,{{ corr_heatmap }}" alt="Correlation Heatmap"/>
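  <p><i>Interpretation:</i> Cost Index correlates inversely with Expression Ease by construction. Any positive association between Homogeneity and Stability Score is expected to be modest, because homogeneity only raises the upper bound of the stability range; read the actual value from the heatmap.</p>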
</div>

<div class="section">
  <h2>DAR Mean Variation Across Platforms</h2>
  <img src="data:image/png;base64,{{ dar_boxplot }}" alt="DAR Mean Boxplot"/>
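  <p><i>Interpretation:</i> Each platform's DAR mean is sampled from its own configured range, so box position and width mirror those ranges; for example, THIOMAB™ (2.0 to 2.5) is narrow, while Interchain Cysteine (2.0 to 3.4) is wide.</p>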
</div>

<div class="section">
  <h2>Expression Ease by Platform</h2>
  <img src="data:image/png;base64,{{ expression_violin }}" alt="Expression Ease Violin Plot"/>
  <p><i>Interpretation:</i> Expression ease is drawn from the same uniform range (0.2 to 0.95) for every platform in this synthetic dataset, so differences between platforms reflect sampling noise rather than engineering complexity.</p>
</div>

<div class="section">
  <h2>Interactive Scatterplot: DAR Mean vs Homogeneity</h2>
  {{ scatter_html|safe }}
  <p><i>Interpretation:</i> Site-specific selective platforms cluster at high homogeneity (0.85 and above), while random conjugation sits at low homogeneity (0.2 to 0.5). DAR mean and homogeneity are sampled independently within each platform.</p>
</div>

<div class="section">
  <h2>Average Metrics Per Platform</h2>
  {{ avg_metrics|safe }}
  <p><i>Interpretation:</i> Summarizes platform performance across key attributes to aid comparative analysis.</p>
</div>

<div class="footer">
  <p>Report generated automatically by ADC Conjugation Technologies Data Exploration Tool.</p>
</div>

</body>
</html>
"""

# Render HTML with Jinja2
template = Template(html_template)
html_report = template.render(
    total_samples=df.shape[0],
    total_features=df.shape[1],
    platforms_count=df['Platform'].nunique(),
    tech_categories=", ".join(df['Technology_Category'].unique()),
    sample_data=df.head().to_html(),
    platform_counts=platform_counts_html,
    cmc_counts=cmc_counts_html,
    corr_heatmap=plots['corr_heatmap'],
    dar_boxplot=plots['dar_boxplot'],
    expression_violin=plots['expression_violin'],
    scatter_html=scatter_html,
    avg_metrics=avg_metrics_html
)

# Save HTML report (optional); write as UTF-8 to match the template's meta charset
with open("adc_data_report.html", "w", encoding="utf-8") as f:
    f.write(html_report)

# Convert HTML to PDF
pdfkit.from_string(html_report, "adc_data_report.pdf")

print("✅ PDF report generated: adc_data_report.pdf")

# Trigger download (works only inside Google Colab; skipped elsewhere)
try:
    from google.colab import files
    files.download("adc_data_report.pdf")
except ImportError:
    pass



✅ PDF report generated: adc_data_report.pdf
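The report embeds each matplotlib figure through a base64 data URI in the `<img src="...">` attribute. The standard-library sketch below shows that encoding on its own; `to_data_uri` is a hypothetical helper, and the 8-byte PNG file signature stands in for real image bytes.

```python
import base64

def to_data_uri(raw_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw bytes as a data URI usable in an <img src="..."> attribute."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

# The PNG file signature is used here only as a short, well-known byte string.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)
print(uri)  # data:image/png;base64,iVBORw0KGgo=
```

Embedding images this way keeps the HTML report self-contained, at the cost of a file roughly one third larger than the raw image bytes.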

