# Comparative Non-B DNA Structural Motif Analysis
## Nine-Genome Study — Classes, Subclasses, Hybrids, and Clusters

This notebook provides a **rigorous comparative analysis** of Non-B DNA structural motifs
detected by **NonBDNAFinder** across nine genomes from two domains of life: eight bacteria
(two of them obligate endosymbionts) and one eukaryote.

| Group | Organism | Lifestyle, genome size |
|--------|----------|-----------|
| Bacteria | *Escherichia coli* | Free-living, ~4.6 Mb |
| Bacteria | *Helicobacter pylori* | Pathogen, ~1.7 Mb |
| Bacteria | *Staphylococcus aureus* | Pathogen, ~2.8 Mb |
| Bacteria | *Streptococcus pneumoniae* | Pathogen, ~2.1 Mb |
| Bacteria | *Cellulomonas shaoxiangyii* | Soil, ~3.9 Mb |
| Bacteria | *Miltoncostaea marina* | Marine, ~3.4 Mb |
| Obligate endosymbiont | *Buchnera aphidicola* | Intracellular, ~452 kb |
| Obligate endosymbiont | *Candidatus Carsonella ruddii* | Intracellular, ~174 kb |
| Eukaryote | *Saccharomyces cerevisiae* | Free-living, ~12.2 Mb |

The analysis covers:
1. **Genome-level overview** — size, total motifs, density, coverage, occupancy
2. **Class-level analysis** — raw counts and normalised densities of all 11 Non-B DNA classes
3. **Subclass-level analysis** — major structural variants within each class
4. **Hybrid region analysis** — overlapping multi-class loci
5. **Cluster region analysis** — high-density Non-B DNA windows
6. **Structural complexity metrics** — SLI, SCI, WSC, overlap depth
7. **Diversity indices** — Simpson D, effective class number (N_eff), structural dominance


## Cell 1 · Setup — Imports and Data Loading

In [None]:
# ── Imports ───────────────────────────────────────────────────────────────────
import os, re, warnings
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import matplotlib.patches as mpatches
import matplotlib.ticker as ticker
import seaborn as sns
from scipy import stats
from IPython.display import display, HTML

warnings.filterwarnings("ignore")
matplotlib.rcParams.update({
    "figure.dpi": 150,
    "font.size": 10,
    "axes.titlesize": 11,
    "axes.labelsize": 10,
})

# ── Locate genome directories ─────────────────────────────────────────────────
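# Notebooks do not define __file__; the genome folders are expected in the working directory
BASE_DIR = os.getcwd()
GENOME_DIRS = sorted(d for d in os.listdir(BASE_DIR)
                     if d.endswith("_genome") and os.path.isdir(os.path.join(BASE_DIR, d)))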
ORGANISMS   = [d.replace("_genome", "") for d in GENOME_DIRS]

SHORT_NAMES = {
    "Buchnera aphidicola":           "B. aphidicola",
    "Candidatus Carsonella ruddii":  "Ca. Carsonella",
    "Cellulomonas shaoxiangyii":     "C. shaoxiangyii",
    "Escherichia coli":              "E. coli",
    "Helicobacter pylori":           "H. pylori",
    "Miltoncostaea marina":          "M. marina",
    "Saccharomyces cerevisiae":      "S. cerevisiae",
    "Staphylococcus aureus":         "S. aureus",
    "Streptococcus pneumoniae":      "S. pneumoniae",
}
SHORT = [SHORT_NAMES.get(o, o) for o in ORGANISMS]

print(f"Found {len(GENOME_DIRS)} genome directories:")
for org, sn in zip(ORGANISMS, SHORT):
    print(f"  {org}  →  {sn}")

# ── Helper: parse numeric value from the CSV ─────────────────────────────────
def parse_val(v):
    """Strip units / commas and return float (NaN if unparseable)."""
    if pd.isna(v):
        return np.nan
    s = str(v).strip()
    s = re.sub(r"[,\s]*(bp|/\s*kb|%)?$", "", s)  # remove trailing unit
    s = s.replace(",", "").replace("%", "")
    try:
        return float(s)
    except ValueError:
        return np.nan

# ── Load comprehensive_genome_stats.csv for every genome ─────────────────────
stats_rows = []
for gd, org in zip(GENOME_DIRS, ORGANISMS):
    path = os.path.join(BASE_DIR, gd, "comprehensive_genome_stats.csv")
    df   = pd.read_csv(path)
    row  = {"Organism": org, "Short": SHORT_NAMES.get(org, org)}
    for _, r in df.iterrows():
        row[r["Metric"]] = parse_val(r["Value"])
    stats_rows.append(row)

STATS = pd.DataFrame(stats_rows).set_index("Organism")
print("\nGenome stats loaded:", STATS.shape)

# ── Sort genomes by genome size (ascending) ──────────────────────────────────
STATS = STATS.sort_values("Genome Length")
ORGANISMS = list(STATS.index)
SHORT = [SHORT_NAMES.get(o, o) for o in ORGANISMS]
GENOME_DIRS = [o + "_genome" for o in ORGANISMS]

# ── Load motifs.xlsx for every genome (cached) ───────────────────────────────
MOTIFS = {}
for gd, org in zip(GENOME_DIRS, ORGANISMS):
    path = os.path.join(BASE_DIR, gd, "motifs.xlsx")
    MOTIFS[org] = pd.read_excel(path)
    print(f"  {SHORT_NAMES.get(org,org):22s}: {len(MOTIFS[org]):>6,} motifs loaded")

# ── Colour palette for classes ────────────────────────────────────────────────
# Taxonomy order: bent/curved → palindromic/repeat → multi-stranded
# → alternative-helix → RNA-hybrid → quartet → composite
ALL_CLASSES = [
    "Curved_DNA", "A-philic_DNA", "Cruciform", "Slipped_DNA",
    "Triplex", "Z-DNA", "R-Loop", "G-Quadruplex", "i-Motif",
    "Hybrid", "Non-B_DNA_Clusters",
]
CLASS_COLORS = dict(zip(ALL_CLASSES, plt.cm.tab20.colors[:len(ALL_CLASSES)]))
print("\nAll Non-B DNA classes (taxonomy order):", ALL_CLASSES)
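
To illustrate what `parse_val` is meant to do with typical `Value` strings, here is a standalone sketch that mirrors the helper using only the standard library (`math.nan` in place of `np.nan`, a `None` check in place of `pd.isna`). The function name `parse_val_sketch` and the sample strings are assumptions for illustration; they are not taken from the actual CSV files.

```python
import math
import re

def parse_val_sketch(v):
    """Stdlib mirror of parse_val: strip a trailing unit and thousands separators."""
    if v is None:
        return math.nan
    s = str(v).strip()
    s = re.sub(r"[,\s]*(bp|/\s*kb|%)?$", "", s)   # drop a trailing "bp", "/kb" or "%"
    s = s.replace(",", "").replace("%", "")
    try:
        return float(s)
    except ValueError:
        return math.nan

# Hypothetical examples of the formats the helper is expected to handle
samples = ["4,641,652 bp", "8.08 / kb", "9.22%", "0.0921", "n/a"]
parsed = [parse_val_sketch(s) for s in samples]
print(parsed)
```

Unparseable entries become NaN rather than raising, so a malformed metric shows up later as a missing value in `STATS` instead of stopping the load.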


## Cell 2 · Table 1 — Genome-Level Overview

**Table 1** summarises key genome-wide statistics for all nine organisms.

In [None]:
cols = [
    "Genome Length", "Motifs (excl. Hybrid/Cluster)", "Motifs (incl. Hybrid/Cluster)",
    "Motif Classes", "Motif Density", "Coverage (%)",
    "Hybrid Regions", "Cluster Regions",
    "SCI (Structural Complexity Index)", "Simpson Diversity Index (D)",
]

tbl1 = STATS[cols].copy()
tbl1.index = STATS["Short"]
tbl1.columns = [
    "Genome (bp)", "Motifs (excl.)", "Motifs (incl.)",
    "Classes", "Density (/kb)", "Coverage (%)",
    "Hybrid Regions", "Cluster Regions",
    "SCI", "Simpson D",
]

# Format genome length with commas
tbl1["Genome (bp)"] = tbl1["Genome (bp)"].apply(lambda x: f"{int(x):,}")
for c in ["Motifs (excl.)", "Motifs (incl.)", "Hybrid Regions", "Cluster Regions"]:
    tbl1[c] = tbl1[c].apply(lambda x: f"{int(x):,}")
for c in ["Density (/kb)", "SCI", "Simpson D"]:
    tbl1[c] = tbl1[c].apply(lambda x: f"{x:.4f}")
tbl1["Coverage (%)"] = tbl1["Coverage (%)"].apply(lambda x: f"{x:.4f}%")
tbl1["Classes"] = tbl1["Classes"].apply(lambda x: f"{int(x)}")

print("Table 1. Genome-Level Overview of Non-B DNA Structural Motifs")
print("=" * 120)
display(tbl1)


## Cell 3 · Figure 1 — Total Motif Counts and Densities

**Figure 1A** shows the absolute number of Non-B DNA motifs per genome (excluding Hybrid/Cluster
composite entries).  **Figure 1B** shows motif density (motifs per kb) to normalise for genome
size differences.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

x     = np.arange(len(SHORT))
width = 0.65
pal   = sns.color_palette("tab10", n_colors=len(SHORT))

# 1A – absolute counts
counts = STATS["Motifs (excl. Hybrid/Cluster)"].values
bars   = axes[0].bar(x, counts, width, color=pal, edgecolor="k", linewidth=0.5)
axes[0].set_xticks(x)
axes[0].set_xticklabels(SHORT, rotation=45, ha="right", fontstyle="italic", fontsize=9)
axes[0].set_ylabel("Number of Non-B DNA Motifs")
axes[0].set_title("A  Total Motif Count (excl. Hybrid / Cluster)")
axes[0].yaxis.set_major_formatter(ticker.FuncFormatter(lambda v, _: f"{int(v):,}"))
for bar in bars:
    h = bar.get_height()
    axes[0].text(bar.get_x() + bar.get_width()/2, h*1.01, f"{int(h):,}",
                 ha="center", va="bottom", fontsize=7.5, rotation=0)

# 1B – density
density = STATS["Motif Density"].values
bars2   = axes[1].bar(x, density, width, color=pal, edgecolor="k", linewidth=0.5)
axes[1].set_xticks(x)
axes[1].set_xticklabels(SHORT, rotation=45, ha="right", fontstyle="italic", fontsize=9)
axes[1].set_ylabel("Motif Density (motifs / kb)")
axes[1].set_title("B  Motif Density (motifs per kb)")
for bar in bars2:
    h = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2, h*1.01, f"{h:.2f}",
                 ha="center", va="bottom", fontsize=7.5)

plt.tight_layout()
plt.savefig("Figure1_Motif_Counts_and_Density.pdf", bbox_inches="tight")
plt.savefig("Figure1_Motif_Counts_and_Density.png", dpi=200, bbox_inches="tight")
plt.show()
print("Figure 1 saved.")


## Cell 4 · Table 2 & Figure 2 — Non-B DNA Class Distribution (Raw Counts)

**Table 2** presents the absolute count of each Non-B DNA class across all nine genomes.
**Figure 2** visualises these counts as a grouped-bar chart.

In [None]:
# Build class-count matrix
class_counts = {}
for org in ORGANISMS:
    vc = MOTIFS[org]["Class"].value_counts()
    class_counts[SHORT_NAMES.get(org, org)] = vc

CLASS_TABLE_RAW = pd.DataFrame(class_counts, index=ALL_CLASSES).T.fillna(0).astype(int)
print("Table 2. Non-B DNA Class Counts per Genome")
print("=" * 100)
display(CLASS_TABLE_RAW)


In [None]:
fig, ax = plt.subplots(figsize=(16, 6))

n_org   = len(SHORT)
n_cls   = len(ALL_CLASSES)
x       = np.arange(n_org)
w       = 0.07
offsets = np.linspace(-(n_cls - 1)*w/2, (n_cls - 1)*w/2, n_cls)

for i, cls in enumerate(ALL_CLASSES):
    vals = CLASS_TABLE_RAW[cls].values
    ax.bar(x + offsets[i], vals, w * 0.95, label=cls,
           color=CLASS_COLORS[cls], edgecolor="none")

ax.set_xticks(x)
ax.set_xticklabels(SHORT, rotation=45, ha="right", fontstyle="italic", fontsize=9)
ax.set_ylabel("Motif Count")
ax.set_title("Figure 2  Non-B DNA Class Distribution Across Nine Genomes (Raw Counts)")
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda v, _: f"{int(v):,}"))
ax.legend(ncol=4, fontsize=8, loc="upper right", framealpha=0.8)
plt.tight_layout()
plt.savefig("Figure2_Class_Raw_Counts.pdf",  bbox_inches="tight")
plt.savefig("Figure2_Class_Raw_Counts.png",  dpi=200, bbox_inches="tight")
plt.show()
print("Figure 2 saved.")


## Cell 5 · Table 3 & Figure 3 — Class Density Normalised per Megabase

To account for genome-size differences, **Table 3** expresses each class count
as motifs per Mb.  **Figure 3** presents the same data as a heatmap.
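
The per-Mb normalisation is a single division per genome, but it only gives correct results if each row is divided by its own genome size. A minimal sketch of the same idea, keyed by genome name rather than by row position; the genome names, counts, and lengths are hypothetical.

```python
# Hypothetical counts and genome lengths (bp); not values from the study
counts = {
    "Genome A": {"Curved_DNA": 500, "Cruciform": 120},
    "Genome B": {"Curved_DNA": 300, "Cruciform": 900},
}
# Deliberately listed in a different order to show that lookup is by name
genome_bp = {"Genome B": 4_000_000, "Genome A": 500_000}

def per_mb(counts, genome_bp):
    """Return class counts divided by genome size in Mb, matched by genome name."""
    out = {}
    for genome, by_class in counts.items():
        size_mb = genome_bp[genome] / 1e6
        out[genome] = {cls: n / size_mb for cls, n in by_class.items()}
    return out

density = per_mb(counts, genome_bp)
print(density)
```

In pandas, a comparable label-aligned division can be obtained with `DataFrame.div(sizes, axis=0)` when `sizes` is a Series indexed by the same row labels as the count table.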

In [None]:
genome_sizes_mb = STATS["Genome Length"].values / 1e6

# Rows of CLASS_TABLE_RAW follow ORGANISMS, which is in the same order as STATS
CLASS_TABLE_NORM = CLASS_TABLE_RAW.astype(float).div(genome_sizes_mb, axis=0)

print("Table 3. Non-B DNA Class Density (motifs / Mb), rows in ascending genome-size order")
display(CLASS_TABLE_NORM.round(1))


In [None]:
fig, ax = plt.subplots(figsize=(14, 6))

# Drop columns that are zero everywhere for clarity
nonzero_cls = CLASS_TABLE_NORM.columns[(CLASS_TABLE_NORM > 0).any(axis=0)]
heat_data   = CLASS_TABLE_NORM[nonzero_cls]

sns.heatmap(
    heat_data,
    annot=True, fmt=".0f", linewidths=0.4, linecolor="white",
    cmap="YlOrRd", cbar_kws={"label": "Motifs per Mb"},
    ax=ax,
)
ax.set_yticklabels(ax.get_yticklabels(), fontstyle="italic", fontsize=9)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", fontsize=9)
ax.set_title("Figure 3  Non-B DNA Class Density Heatmap (motifs / Mb)")
plt.tight_layout()
plt.savefig("Figure3_Class_Density_Heatmap.pdf", bbox_inches="tight")
plt.savefig("Figure3_Class_Density_Heatmap.png", dpi=200, bbox_inches="tight")
plt.show()
print("Figure 3 saved.")


## Cell 6 · Figure 4 — Proportional Class Composition (Stacked Bar)

**Figure 4** shows the *fractional* contribution of each Non-B DNA class to the
total motif repertoire, revealing the dominant structural class in each genome.
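
The fractional composition is each class count divided by the genome's total, and the dominant class is the one with the largest share. A small sketch with hypothetical counts; the zero-total guard is an addition for illustration and the function name is not part of the notebook.

```python
def class_fractions(class_counts):
    """Convert raw class counts into fractions that sum to 1."""
    total = sum(class_counts.values())
    if total == 0:                      # genome with no motifs: avoid dividing by zero
        return {cls: 0.0 for cls in class_counts}
    return {cls: n / total for cls, n in class_counts.items()}

counts = {"Curved_DNA": 60, "Cruciform": 30, "Z-DNA": 10}   # hypothetical
fractions = class_fractions(counts)
dominant = max(fractions, key=fractions.get)
print(fractions, dominant)
```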

In [None]:
# Use counts including Hybrid / Cluster for proportions
class_prop = CLASS_TABLE_RAW.copy().astype(float)
row_sums   = class_prop.sum(axis=1)
for col in class_prop.columns:
    class_prop[col] /= row_sums

fig, ax = plt.subplots(figsize=(13, 6))
bottom  = np.zeros(len(SHORT))
for cls in ALL_CLASSES:
    vals = class_prop[cls].values if cls in class_prop.columns else np.zeros(len(SHORT))
    ax.bar(np.arange(len(SHORT)), vals, 0.7, bottom=bottom,
           label=cls, color=CLASS_COLORS[cls], edgecolor="none")
    bottom += vals

ax.set_xticks(np.arange(len(SHORT)))
ax.set_xticklabels(SHORT, rotation=45, ha="right", fontstyle="italic", fontsize=9)
ax.set_ylabel("Fraction of Total Non-B DNA Motifs")
ax.set_title("Figure 4  Proportional Class Composition of Non-B DNA Motifs")
ax.set_ylim(0, 1.02)
ax.legend(ncol=2, fontsize=8, bbox_to_anchor=(1.01, 1), loc="upper left", framealpha=0.9)
plt.tight_layout()
plt.savefig("Figure4_Class_Proportions_Stacked.pdf", bbox_inches="tight")
plt.savefig("Figure4_Class_Proportions_Stacked.png", dpi=200, bbox_inches="tight")
plt.show()
print("Figure 4 saved.")


## Cell 7 · Table 4 — Subclass Distribution Across Genomes

**Table 4** shows the count of every detected Non-B DNA *subclass* per genome,
providing finer structural resolution than the class-level tables.

In [None]:
subclass_counts = {}
for org in ORGANISMS:
    vc = MOTIFS[org]["Subclass"].value_counts()
    subclass_counts[SHORT_NAMES.get(org, org)] = vc

SUBCLASS_TABLE = pd.DataFrame(subclass_counts).T.fillna(0).astype(int)
# Keep only subclasses present in at least one genome
SUBCLASS_TABLE = SUBCLASS_TABLE.loc[:, (SUBCLASS_TABLE > 0).any(axis=0)]

# Sort subclasses by total count
col_order = SUBCLASS_TABLE.sum(axis=0).sort_values(ascending=False).index
SUBCLASS_TABLE = SUBCLASS_TABLE[col_order]

print("Table 4. Non-B DNA Subclass Counts per Genome  (top 20 subclasses)")
print("=" * 120)
display(SUBCLASS_TABLE.iloc[:, :20])


In [None]:
# Full subclass table as an Excel file
SUBCLASS_TABLE.to_excel("Table4_Subclass_Distribution.xlsx")
print("Full subclass distribution saved to Table4_Subclass_Distribution.xlsx")
print(f"Total unique subclasses across all genomes: {SUBCLASS_TABLE.shape[1]}")


## Cell 8 · Figure 5 — Subclass Density Heatmap (Top 25 Subclasses)

**Figure 5** is a heatmap of the 25 Non-B DNA subclasses with the highest peak
density in any single genome, expressed per Mb to enable cross-genome comparison.
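
Ranking subclasses by their peak density in any one genome can select a different set than ranking by their summed density across genomes. A brief sketch of the difference, using hypothetical subclass names and densities:

```python
import heapq

# Hypothetical per-Mb densities: subclass -> density in each of three genomes
density = {
    "sub_A": [5.0, 6.0, 7.0],      # moderately dense everywhere
    "sub_B": [0.0, 0.0, 15.0],     # present in one genome only, but dense there
    "sub_C": [10.0, 12.0, 11.0],
}

def top_k(density, k, score):
    """Return the k subclass names with the largest score(densities)."""
    return heapq.nlargest(k, density, key=lambda name: score(density[name]))

by_peak = top_k(density, 2, max)    # criterion based on the maximum in any genome
by_total = top_k(density, 2, sum)   # criterion based on the sum over genomes
print(by_peak, by_total)
```

Ranking by peak favours subclasses that are strongly enriched in a single genome, which is worth keeping in mind when reading the heatmap columns.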

In [None]:
# Normalise to per-Mb (rows follow ORGANISMS, the same order as genome_sizes_mb)
SUBCLASS_NORM = SUBCLASS_TABLE.astype(float).div(genome_sizes_mb, axis=0)

# Top 25 subclasses by max density in any organism
top25 = SUBCLASS_NORM.max(axis=0).nlargest(25).index
heat25 = SUBCLASS_NORM[top25]

fig, ax = plt.subplots(figsize=(16, 7))
sns.heatmap(
    heat25,
    annot=True, fmt=".0f", linewidths=0.3, linecolor="white",
    cmap="Blues", cbar_kws={"label": "Subclass Density (per Mb)"},
    ax=ax,
)
ax.set_yticklabels(ax.get_yticklabels(), fontstyle="italic", fontsize=9)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", fontsize=9)
ax.set_title("Figure 5  Top 25 Non-B DNA Subclass Density Heatmap (per Mb)")
plt.tight_layout()
plt.savefig("Figure5_Subclass_Heatmap.pdf", bbox_inches="tight")
plt.savefig("Figure5_Subclass_Heatmap.png", dpi=200, bbox_inches="tight")
plt.show()
print("Figure 5 saved.")


## Cell 9 · Table 5 & Figure 6 — Hybrid Region Analysis

**Hybrid regions** are genomic loci where two structurally distinct Non-B DNA
motifs overlap, implying cooperative or competitive structural potential.
**Table 5** lists counts and coverage; **Figure 6** visualises hybrid composition.
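
The notion of a hybrid locus can be sketched as a search for overlapping intervals that belong to different classes. The code below is only an illustration under stated assumptions (1-based inclusive coordinates, any overlap of at least one base counts); it is not NonBDNAFinder's actual rule, and the motif coordinates are hypothetical.

```python
def find_overlaps(motifs):
    """Return pairs of motifs from different classes whose intervals overlap.

    motifs: list of (start, end, cls) tuples with inclusive coordinates (assumed).
    """
    ordered = sorted(motifs)
    pairs = []
    for i, (s1, e1, c1) in enumerate(ordered):
        for s2, e2, c2 in ordered[i + 1:]:
            if s2 > e1:            # sorted by start, so no later motif can overlap
                break
            if c1 != c2:           # same-class overlaps are not counted as hybrids
                pairs.append(((s1, e1, c1), (s2, e2, c2)))
    return pairs

motifs = [
    (100, 140, "G-Quadruplex"),
    (130, 170, "Z-DNA"),
    (135, 150, "G-Quadruplex"),
    (400, 420, "Cruciform"),
]
hybrids = find_overlaps(motifs)
print(len(hybrids), "cross-class overlaps")
```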

In [None]:
# Extract hybrid entries from motifs data
HYBRID_TYPES = {}
for org in ORGANISMS:
    df_h = MOTIFS[org][MOTIFS[org]["Class"] == "Hybrid"]
    HYBRID_TYPES[SHORT_NAMES.get(org, org)] = df_h["Subclass"].value_counts()

HYBRID_TABLE = pd.DataFrame(HYBRID_TYPES).T.fillna(0).astype(int)
HYBRID_TABLE = HYBRID_TABLE.loc[:, (HYBRID_TABLE > 0).any(axis=0)]
col_order_h  = HYBRID_TABLE.sum(axis=0).sort_values(ascending=False).index
HYBRID_TABLE = HYBRID_TABLE[col_order_h]

# Summary stats from comprehensive CSV
tbl5_summary = STATS[["Short", "Hybrid Regions", "Hybrid Coverage"]].copy()
tbl5_summary.index = STATS["Short"]
tbl5_summary = tbl5_summary.drop(columns="Short")
tbl5_summary["Hybrid Coverage"] = tbl5_summary["Hybrid Coverage"].apply(lambda x: f"{x:.4f}%")
tbl5_summary["Hybrid Regions"]  = tbl5_summary["Hybrid Regions"].apply(lambda x: f"{int(x):,}")

print("Table 5A. Hybrid Region Summary")
display(tbl5_summary)
print()
print("Table 5B. Hybrid Subtype Counts per Genome")
display(HYBRID_TABLE)


In [None]:
# Per-Mb hybrid subtype densities (kept for reference; not plotted below)
hybrid_norm = HYBRID_TABLE.copy().astype(float)
for i, sn in enumerate(hybrid_norm.index):
    hybrid_norm.iloc[i] /= genome_sizes_mb[i]

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 6A – raw hybrid counts by subtype (stacked bar)
if not HYBRID_TABLE.empty:
    pal_h = sns.color_palette("Set2", n_colors=len(HYBRID_TABLE.columns))
    bot   = np.zeros(len(HYBRID_TABLE))
    for j, col in enumerate(HYBRID_TABLE.columns):
        axes[0].bar(np.arange(len(HYBRID_TABLE)), HYBRID_TABLE[col].values,
                    0.7, bottom=bot, label=col, color=pal_h[j], edgecolor="none")
        bot += HYBRID_TABLE[col].values
    axes[0].set_xticks(np.arange(len(HYBRID_TABLE)))
    axes[0].set_xticklabels(HYBRID_TABLE.index, rotation=45, ha="right", fontstyle="italic", fontsize=9)
    axes[0].set_ylabel("Hybrid Region Count")
    axes[0].set_title("A  Hybrid Subtypes per Genome (raw counts)")
    axes[0].legend(ncol=1, fontsize=7.5, bbox_to_anchor=(1.01, 1), loc="upper left")

# 6B: total hybrid regions per genome (hybrid coverage is reported in Table 5A)
bars6b = axes[1].bar(np.arange(len(SHORT)), STATS["Hybrid Regions"].values,
                     0.7, color=sns.color_palette("Paired", n_colors=len(SHORT)),
                     edgecolor="k", linewidth=0.4)
axes[1].set_xticks(np.arange(len(SHORT)))
axes[1].set_xticklabels(SHORT, rotation=45, ha="right", fontstyle="italic", fontsize=9)
axes[1].set_ylabel("Number of Hybrid Regions")
axes[1].set_title("B  Total Hybrid Regions per Genome")
for bar in bars6b:
    h = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2, h + 2, f"{int(h)}", ha="center", fontsize=8)

plt.tight_layout()
plt.savefig("Figure6_Hybrid_Analysis.pdf", bbox_inches="tight")
plt.savefig("Figure6_Hybrid_Analysis.png", dpi=200, bbox_inches="tight")
plt.show()
print("Figure 6 saved.")


## Cell 10 · Table 6 & Figure 7 — Non-B DNA Cluster Region Analysis

**Cluster regions** (`Non-B_DNA_Clusters`) are dense genomic windows harbouring
multiple distinct Non-B DNA motifs within a narrow sequence context, often
associated with replication stress hotspots and genomic instability.
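
One way to picture a cluster region is a fixed-width window that contains several motifs from several classes. The sketch below flags such windows; the window width, both thresholds, and the choice to anchor windows at motif starts are illustrative assumptions, not the criteria NonBDNAFinder applies.

```python
from bisect import bisect_left

def dense_windows(motifs, window=1000, min_motifs=4, min_classes=3):
    """Flag windows anchored at each motif start that meet both thresholds.

    motifs: list of (start, cls) tuples; all parameters are hypothetical defaults.
    """
    ordered = sorted(motifs)
    starts = [s for s, _ in ordered]
    hits = []
    for i, s in enumerate(starts):
        j = bisect_left(starts, s + window)      # first motif outside [s, s + window)
        members = ordered[i:j]
        classes = {cls for _, cls in members}
        if len(members) >= min_motifs and len(classes) >= min_classes:
            hits.append((s, s + window, len(members), len(classes)))
    return hits

motifs = [
    (100, "Curved_DNA"), (250, "Cruciform"), (400, "Z-DNA"),
    (900, "Curved_DNA"), (5000, "Triplex"), (5200, "Triplex"),
]
clusters = dense_windows(motifs)
print(clusters)
```

Overlapping hits are not merged here; a fuller implementation would probably collapse adjacent windows into a single region.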

In [None]:
# Cluster subtype breakdown
CLUSTER_TYPES = {}
for org in ORGANISMS:
    df_c = MOTIFS[org][MOTIFS[org]["Class"] == "Non-B_DNA_Clusters"]
    CLUSTER_TYPES[SHORT_NAMES.get(org, org)] = df_c["Subclass"].value_counts()

CLUSTER_TABLE = pd.DataFrame(CLUSTER_TYPES).T.fillna(0).astype(int)
CLUSTER_TABLE = CLUSTER_TABLE.loc[:, (CLUSTER_TABLE > 0).any(axis=0)]
col_order_c   = CLUSTER_TABLE.sum(axis=0).sort_values(ascending=False).index
CLUSTER_TABLE = CLUSTER_TABLE[col_order_c]

# Summary
tbl6_summary = STATS[["Short", "Cluster Regions", "Cluster Coverage"]].copy()
tbl6_summary.index = STATS["Short"]
tbl6_summary = tbl6_summary.drop(columns="Short")
tbl6_summary["Cluster Coverage"] = tbl6_summary["Cluster Coverage"].apply(lambda x: f"{x:.4f}%")
tbl6_summary["Cluster Regions"]  = tbl6_summary["Cluster Regions"].apply(lambda x: f"{int(x):,}")

print("Table 6A. Cluster Region Summary")
display(tbl6_summary)
print()
print("Table 6B. Cluster Subtype Composition")
display(CLUSTER_TABLE)


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# 7A – cluster counts (stacked by subtype)
if not CLUSTER_TABLE.empty:
    pal_c = sns.color_palette("tab20b", n_colors=len(CLUSTER_TABLE.columns))
    bot   = np.zeros(len(CLUSTER_TABLE))
    for j, col in enumerate(CLUSTER_TABLE.columns):
        axes[0].bar(np.arange(len(CLUSTER_TABLE)), CLUSTER_TABLE[col].values,
                    0.7, bottom=bot, label=col, color=pal_c[j], edgecolor="none")
        bot += CLUSTER_TABLE[col].values
    axes[0].set_xticks(np.arange(len(CLUSTER_TABLE)))
    axes[0].set_xticklabels(CLUSTER_TABLE.index, rotation=45, ha="right", fontstyle="italic", fontsize=9)
    axes[0].set_ylabel("Cluster Count")
    axes[0].set_title("A  Non-B DNA Cluster Subtypes per Genome")
    axes[0].legend(ncol=1, fontsize=8, bbox_to_anchor=(1.01, 1), loc="upper left")

# 7B – Hybrid vs Cluster density (per Mb) scatter
hd = STATS["Hybrid Regions"].values  / genome_sizes_mb
cd = STATS["Cluster Regions"].values / genome_sizes_mb
sc = axes[1].scatter(hd, cd, c=sns.color_palette("tab10", n_colors=len(SHORT)),
                     s=120, edgecolors="k", linewidth=0.5, zorder=3)
for i, sn in enumerate(SHORT):
    axes[1].annotate(sn, (hd[i], cd[i]), fontsize=7.5, fontstyle="italic",
                     xytext=(4, 4), textcoords="offset points")
axes[1].set_xlabel("Hybrid Region Density (per Mb)")
axes[1].set_ylabel("Cluster Region Density (per Mb)")
axes[1].set_title("B  Hybrid vs. Cluster Density (per Mb)")
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig("Figure7_Cluster_Analysis.pdf", bbox_inches="tight")
plt.savefig("Figure7_Cluster_Analysis.png", dpi=200, bbox_inches="tight")
plt.show()
print("Figure 7 saved.")


## Cell 11 · Table 7 & Figure 8 — Structural Complexity and Occupancy Metrics

**Table 7** compares the suite of structural metrics computed by NonBDNAFinder.
**Figure 8** provides a radar (spider) chart for multi-variate visual comparison.
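
Because the metrics live on very different scales, the radar chart rescales each one to the 0 to 1 range across genomes. A minimal sketch of that min-max step, including the constant-column case, with hypothetical values:

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

metric = [2.0, 4.0, 10.0]      # hypothetical metric across three genomes
constant = [3.0, 3.0, 3.0]     # a metric that does not vary between genomes
print(min_max(metric), min_max(constant))
```

A consequence of this scaling is that the genome with the lowest value sits at 0 on that axis regardless of the absolute magnitude, so the radar shows relative rank within this set of genomes rather than absolute levels.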

In [None]:
metric_cols = [
    "SLI", "Structural Intensity", "Weighted Structural Coverage",
    "SCI (Structural Complexity Index)", "Mean Overlap Depth",
    "CV (Clustering Coefficient)", "Max Local Density (W=1,000 bp)",
    "Mean Inter-Motif Distance",
]
tbl7 = STATS[metric_cols].copy()
tbl7.index = STATS["Short"]
tbl7.columns = ["SLI", "SI", "WSC", "SCI", "Mean Overlap Depth",
                "CV (CC)", "Max Local Density", "Mean IMD (bp)"]

print("Table 7. Structural Complexity and Occupancy Metrics")
print("=" * 100)
display(tbl7.round(4))


In [None]:
# Radar chart
radar_metrics = ["SLI", "Structural Intensity", "Weighted Structural Coverage",
                 "SCI (Structural Complexity Index)", "Mean Overlap Depth",
                 "CV (Clustering Coefficient)", "Max Local Density (W=1,000 bp)"]
radar_labels  = ["SLI", "SI", "WSC", "SCI", "Overlap Depth", "CV(CC)", "Max Local Density"]

N       = len(radar_metrics)
angles  = np.linspace(0, 2 * np.pi, N, endpoint=False).tolist()
angles += angles[:1]  # close the polygon

pal_r = sns.color_palette("tab10", n_colors=len(SHORT))

fig, ax = plt.subplots(figsize=(9, 9), subplot_kw=dict(polar=True))

# Normalise each metric 0-1
normed = STATS[radar_metrics].copy()
for c in radar_metrics:
    mn, mx = normed[c].min(), normed[c].max()
    if mx > mn:
        normed[c] = (normed[c] - mn) / (mx - mn)
    else:
        normed[c] = 0.0

for i, (org, sn) in enumerate(zip(ORGANISMS, SHORT)):
    vals = normed.loc[org, radar_metrics].tolist()
    vals += vals[:1]
    ax.plot(angles, vals, "-o", linewidth=1.5, color=pal_r[i], label=sn)
    ax.fill(angles, vals, alpha=0.05, color=pal_r[i])

ax.set_xticks(angles[:-1])
ax.set_xticklabels(radar_labels, size=9)
ax.set_yticks([0.25, 0.5, 0.75, 1.0])
ax.set_yticklabels(["0.25", "0.50", "0.75", "1.00"], size=7, color="grey")
ax.set_title("Figure 8  Structural Complexity Profile\n(min-max normalised across genomes)", pad=20, fontsize=11)
ax.legend(loc="upper right", bbox_to_anchor=(1.35, 1.1), fontsize=8, framealpha=0.9)

plt.tight_layout()
plt.savefig("Figure8_Structural_Metrics_Radar.pdf", bbox_inches="tight")
plt.savefig("Figure8_Structural_Metrics_Radar.png", dpi=200, bbox_inches="tight")
plt.show()
print("Figure 8 saved.")


## Cell 12 · Table 8 & Figure 9 — Structural Diversity Indices

**Table 8** and **Figure 9** compare diversity metrics — the Simpson Diversity
Index (D), Effective Class Number (N_eff), and Structural Dominance Ratio — that
quantify how evenly distributed Non-B DNA classes are across each genome.
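
For orientation, the sketch below computes these three quantities from class counts using common textbook definitions: Simpson D as 1 minus the sum of squared proportions, N_eff as the reciprocal of that sum, and dominance as the largest class share. These definitions are assumptions; NonBDNAFinder may define the metrics differently (particularly the dominance ratio), and the counts are hypothetical.

```python
def diversity(class_counts):
    """Simpson D, effective class number and dominance from raw class counts."""
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("at least one motif is required")
    props = [n / total for n in class_counts.values() if n > 0]
    sum_sq = sum(p * p for p in props)
    return {
        "simpson_d": 1.0 - sum_sq,    # assumed Gini-Simpson form
        "n_eff": 1.0 / sum_sq,        # assumed inverse-Simpson form
        "dominance": max(props),      # assumed: share of the largest class
    }

even = diversity({"A": 25, "B": 25, "C": 25, "D": 25})
skewed = diversity({"A": 97, "B": 1, "C": 1, "D": 1})
print(even, skewed)
```

With four equally abundant classes, N_eff equals the number of classes. When one class dominates, N_eff approaches 1 even though four classes are present.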

In [None]:
div_cols = ["Simpson Diversity Index (D)", "Effective Class Number (Neff)",
            "Structural Dominance Ratio", "Max Class Diversity", "Max Cluster Score"]
tbl8 = STATS[div_cols].copy()
tbl8.index = STATS["Short"]
tbl8.columns = ["Simpson D", "Neff", "Dominance Ratio", "Max Class Diversity", "Max Cluster Score"]
print("Table 8. Structural Diversity Indices")
print("=" * 80)
display(tbl8.round(4))


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
pal9 = sns.color_palette("tab10", n_colors=len(SHORT))

for ax_i, (metric, label, title) in enumerate([
    ("Simpson Diversity Index (D)", "Simpson Diversity Index (D)",
     "A  Simpson Diversity Index"),
    ("Effective Class Number (Neff)", "N$_{eff}$ (Effective Class Number)",
     "B  Effective Class Number"),
    ("Structural Dominance Ratio", "Structural Dominance Ratio",
     "C  Structural Dominance Ratio"),
]):
    vals = STATS[metric].values
    bars = axes[ax_i].bar(np.arange(len(SHORT)), vals, 0.7,
                          color=pal9, edgecolor="k", linewidth=0.4)
    axes[ax_i].set_xticks(np.arange(len(SHORT)))
    axes[ax_i].set_xticklabels(SHORT, rotation=45, ha="right", fontstyle="italic", fontsize=9)
    axes[ax_i].set_ylabel(label)
    axes[ax_i].set_title(title)
    for bar in bars:
        h = bar.get_height()
        axes[ax_i].text(bar.get_x() + bar.get_width()/2, h + 0.005,
                        f"{h:.3f}", ha="center", fontsize=7.5)

plt.suptitle("Figure 9  Structural Diversity Indices Across Nine Genomes", y=1.02, fontsize=12)
plt.tight_layout()
plt.savefig("Figure9_Diversity_Indices.pdf", bbox_inches="tight")
plt.savefig("Figure9_Diversity_Indices.png", dpi=200, bbox_inches="tight")
plt.show()
print("Figure 9 saved.")


## Cell 13 · Figure 10 — Genome Size vs. Key Non-B DNA Metrics

**Figure 10** explores whether genome size predicts Non-B DNA burden, testing
the hypothesis that larger genomes accumulate proportionally more structural motifs.
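
With only nine genomes, a single large genome can carry much of a Pearson correlation. The sketch below computes r from its definition on hypothetical data (eight small genomes with no trend plus one large outlier) to show how sensitive the coefficient is to one point. The numbers are invented for illustration and are not the study's values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: the last genome is much larger than the rest
size_mb = [1, 2, 3, 4, 1, 2, 3, 4, 12]
motifs = [5, 3, 4, 6, 4, 6, 5, 3, 20]

r_all = pearson_r(size_mb, motifs)
r_without_outlier = pearson_r(size_mb[:-1], motifs[:-1])
print(f"all nine: r = {r_all:.2f}; without the largest genome: r = {r_without_outlier:.2f}")
```

For this reason it may be worth reading the panel-level r and p values alongside the scatter itself, and considering a rank-based coefficient as a robustness check.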

In [None]:
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
pal10 = sns.color_palette("tab10", n_colors=len(SHORT))

scatter_pairs = [
    ("Genome Length", "Motifs (excl. Hybrid/Cluster)",
     "Genome Size (Mb)", "Total Motifs", "A"),
    ("Genome Length", "Motif Density",
     "Genome Size (Mb)", "Motif Density (/ kb)", "B"),
    ("Genome Length", "SCI (Structural Complexity Index)",
     "Genome Size (Mb)", "SCI", "C"),
    ("Motif Density", "Simpson Diversity Index (D)",
     "Motif Density (/ kb)", "Simpson D", "D"),
    ("Genome Length", "Hybrid Regions",
     "Genome Size (Mb)", "Hybrid Regions", "E"),
    ("Genome Length", "Cluster Regions",
     "Genome Size (Mb)", "Cluster Regions", "F"),
]

for ax, (xcol, ycol, xlabel, ylabel, panel) in zip(axes.flat, scatter_pairs):
    xv = STATS[xcol].values.copy()
    yv = STATS[ycol].values.copy()
    if xcol == "Genome Length":
        xv = xv / 1e6
    ax.scatter(xv, yv, c=pal10, s=100, edgecolors="k", linewidth=0.5, zorder=3)
    for i, sn in enumerate(SHORT):
        ax.annotate(sn, (xv[i], yv[i]), fontsize=7.5, fontstyle="italic",
                    xytext=(4, 3), textcoords="offset points")
    # Pearson correlation
    r, p = stats.pearsonr(xv, yv)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(f"{panel}  {ylabel} vs {xlabel}\n(r = {r:.3f}, p = {p:.3f})")
    ax.grid(alpha=0.3)

plt.suptitle("Figure 10  Genome Size vs. Non-B DNA Metrics", y=1.01, fontsize=12)
plt.tight_layout()
plt.savefig("Figure10_Genome_Size_Correlations.pdf", bbox_inches="tight")
plt.savefig("Figure10_Genome_Size_Correlations.png", dpi=200, bbox_inches="tight")
plt.show()
print("Figure 10 saved.")


## Cell 14 · Figure 11 — Coverage and Structural Occupancy

**Figure 11** contrasts the fraction of each genome physically covered by
Non-B DNA motifs (**coverage**) with the *Structural Landscape Index* (**SLI**),
which weights coverage by the depth of structural overlap.
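
Coverage counts each covered base once, however many motifs overlap it, so it is obtained from the union of motif intervals. The sketch below merges intervals to get the covered fraction and also reports the mean overlap depth over covered bases. It assumes 1-based inclusive coordinates and does not attempt to reproduce the SLI formula, which is not given here.

```python
def coverage_stats(intervals, genome_length):
    """Covered fraction and mean overlap depth from (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:        # overlaps the previous block
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    covered = sum(e - s + 1 for s, e in merged)      # bases covered at least once
    summed = sum(e - s + 1 for s, e in intervals)    # bases counted once per motif
    return {
        "coverage_fraction": covered / genome_length,
        "mean_depth": summed / covered if covered else 0.0,
    }

# Hypothetical motifs on a 1,000 bp sequence
result = coverage_stats([(1, 100), (51, 150), (301, 350)], genome_length=1000)
print(result)
```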

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))
x    = np.arange(len(SHORT))
w    = 0.35

cov_pct = STATS["Coverage Fraction"].values * 100
sli     = STATS["SLI"].values * 100

bars_cov = ax.bar(x - w/2, cov_pct, w, label="Coverage (%)", color="#4C72B0", edgecolor="k", lw=0.4)
bars_sli = ax.bar(x + w/2, sli,     w, label="SLI × 100",    color="#DD8452", edgecolor="k", lw=0.4)

ax.set_xticks(x)
ax.set_xticklabels(SHORT, rotation=45, ha="right", fontstyle="italic", fontsize=9)
ax.set_ylabel("Percentage (%)")
ax.set_title("Figure 11  Non-B DNA Genome Coverage vs. Structural Landscape Index (SLI)")
ax.legend(fontsize=9)
for bar in list(bars_cov) + list(bars_sli):
    h = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, h + 0.01, f"{h:.2f}",
            ha="center", va="bottom", fontsize=7)

plt.tight_layout()
plt.savefig("Figure11_Coverage_SLI.pdf", bbox_inches="tight")
plt.savefig("Figure11_Coverage_SLI.png", dpi=200, bbox_inches="tight")
plt.show()
print("Figure 11 saved.")


## Cell 15 · Export All Tables to Excel

Export a multi-sheet workbook consolidating all comparative tables.

In [None]:
with pd.ExcelWriter("Comparative_NonBDNA_Analysis_Tables.xlsx", engine="openpyxl") as writer:
    tbl1.to_excel(writer, sheet_name="Table1_Overview")
    CLASS_TABLE_RAW.to_excel(writer, sheet_name="Table2_Class_Raw_Counts")
    CLASS_TABLE_NORM.round(1).to_excel(writer, sheet_name="Table3_Class_per_Mb")
    SUBCLASS_TABLE.to_excel(writer, sheet_name="Table4_Subclass_Counts")
    HYBRID_TABLE.to_excel(writer, sheet_name="Table5_Hybrid_Subtypes")
    CLUSTER_TABLE.to_excel(writer, sheet_name="Table6_Cluster_Subtypes")
    tbl7.round(4).to_excel(writer, sheet_name="Table7_Structural_Metrics")
    tbl8.round(4).to_excel(writer, sheet_name="Table8_Diversity_Indices")

print("All tables exported to Comparative_NonBDNA_Analysis_Tables.xlsx")


---
## Cell 16 · Written Results Section

> The text below is a **ready-to-use results narrative** generated from the data
> in Cells 1–15.  All numerical values are drawn directly from the analysis.
> Genomes are presented in **ascending genome-size order**; motif classes follow
> **structural taxonomy order** (bent/curved → palindromic/repeat → multi-stranded
> → alternative-helix → RNA-hybrid → G/C-quartet → composite).
> Figures and tables referenced (Fig. 1–Fig. 11; Tables 1–8) correspond
> to those produced by this notebook.

---

### 3. Results

#### 3.1 Genome-Level Overview of Non-B DNA Structural Motifs

NonBDNAFinder was applied to nine genomes spanning diverse lineages, presented
here in ascending genome-size order (Table 1; Fig. 1A):

| Organism | Size | Domain / Lifestyle |
|---|---|---|
| *Candidatus Carsonella ruddii* | 174 kb | Obligate endosymbiont |
| *Buchnera aphidicola* | 452 kb | Obligate endosymbiont |
| *Helicobacter pylori* | 1.7 Mb | Pathogen |
| *Streptococcus pneumoniae* | 2.1 Mb | Pathogen |
| *Staphylococcus aureus* | 2.8 Mb | Pathogen |
| *Miltoncostaea marina* | 3.4 Mb | Marine bacterium |
| *Cellulomonas shaoxiangyii* | 3.9 Mb | Soil bacterium |
| *Escherichia coli* | 4.6 Mb | Free-living bacterium |
| *Saccharomyces cerevisiae* | 12.2 Mb | Eukaryote |

The tool detected between **1,405** (*Ca. Carsonella*, smallest genome) and
**42,016** (*M. marina*) non-B DNA motifs per genome, excluding composite
Hybrid and Cluster entries (Fig. 1A).  Absolute motif count does not scale
linearly with genome size: the two endosymbionts (≤ 452 kb) together harbour
only 4,908 motifs in total, whereas the GC-rich *M. marina* (3.4 Mb) alone
contains 42,016, roughly 8.6-fold more motifs from about 5.4-fold more sequence
(3.4 Mb versus a combined 0.63 Mb).

Motif density (motifs per kb) ranged from **1.58 /kb** in *S. cerevisiae* (the
largest genome) to **10.73 /kb** in *C. shaoxiangyii* (Fig. 1B), indicating
that structural density does not scale with genome size (Pearson r < 0.2;
Fig. 10B).  The two smallest genomes exhibited moderate densities (8.08 /kb
for *Ca. Carsonella*; 7.76 /kb for *B. aphidicola*), consistent with AT-biased,
compact genomes prone to curvature-forming A-tracts.  The eukaryote
*S. cerevisiae* had the lowest density despite having the largest genome,
reflecting comparatively fewer high-scoring structural loci per unit of sequence.
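
As a quick consistency check, the endosymbiont densities can be recomputed from the motif counts quoted above and the rounded genome sizes in the table. Because the sizes are rounded to the nearest kb, small differences from the reported values are expected; the *B. aphidicola* count is derived here as the combined total minus the *Ca. Carsonella* count.

```python
# Motif counts come from the text; genome sizes are the rounded table values,
# so the recomputed densities are approximate.
genomes = {
    "Ca. Carsonella": {"motifs": 1405, "size_kb": 174},
    "B. aphidicola": {"motifs": 4908 - 1405, "size_kb": 452},
}

density = {name: g["motifs"] / g["size_kb"] for name, g in genomes.items()}
for name, d in density.items():
    print(f"{name}: {d:.2f} motifs/kb")
```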

Genome coverage by non-B DNA motifs ranged from **2.40 %** (*S. cerevisiae*)
to **9.22 %** (*Ca. Carsonella*), while the Structural Landscape Index (SLI)
spanned 0.024–0.092 (Table 1; Fig. 11).  These data show that non-B DNA
motifs occupy a measurable fraction (roughly 2–9 %) of every genome examined,
irrespective of genome size.
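
Coverage is a union measure: bases shared by overlapping motifs must be counted once. The sketch below merges intervals before summing. It assumes half-open `(start, end)` coordinates, and it treats SLI as the covered fraction only because the reported values coincide numerically (2.40 % and 0.024; 9.22 % and 0.092); the notebook's own SLI definition may differ. The function names and toy coordinates are hypothetical.

```python
def covered_bases(intervals):
    """Count bases covered by at least one half-open [start, end) interval."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the running block: flush it and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def coverage_fraction(intervals, genome_length):
    """Fraction of the genome covered by the union of the intervals."""
    return covered_bases(intervals) / genome_length


# Hypothetical motif coordinates on a 1,000 bp toy genome.
toy_motifs = [(10, 40), (30, 60), (100, 120)]
frac = coverage_fraction(toy_motifs, genome_length=1_000)
print(f"coverage = {frac:.3f} ({frac:.1%})")
```

Summing raw motif lengths instead (30 + 30 + 20 = 80 bp here) would overstate coverage wherever motifs overlap; the merged total is 70 bp.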

---

#### 3.2 Class-Level Comparative Analysis (Taxonomy Order)

Up to eleven structural classes were detected across the nine genomes
(Table 2; Fig. 2–4).  Below we discuss each class following the structural
taxonomy order: **bent/curved DNA → palindromic/repeat → multi-stranded →
alternative-helix → RNA-hybrid → G/C-quartet → composite**.

**Curved DNA** was the single most prevalent class in both endosymbionts
(76.2 % of *B. aphidicola* motifs; 77.8 % of *Ca. Carsonella* motifs; Fig. 4),
consistent with the extreme AT-bias of obligate endosymbiont genomes generating
dense phased A-tracts.  In contrast, GC-rich bacteria (*M. marina*, *C. shaoxiangyii*)
and the eukaryote *S. cerevisiae* showed substantially lower Curved DNA fractions
(≤ 15 %), reflecting the sequence requirement for A-tract periodicity.

**A-philic DNA** was enriched in GC-rich soil/marine bacteria (*C. shaoxiangyii*:
3.6 %; *M. marina*: 4.7 %) but was nearly absent in the two endosymbionts
(< 0.5 %), the inverse of the AT-content dependency seen for A-tract-mediated
bending.

**Cruciform structures** were universally detected, ranking among the top-three
classes in all nine genomes.  They were proportionally most abundant in the
endosymbionts (16.6 % of *B. aphidicola*; 19.2 % of *Ca. Carsonella*) and in
GC-rich bacteria (*C. shaoxiangyii*: 11.8 %, *M. marina*: 12.4 %), suggesting
that inverted-repeat density is elevated both in AT-rich compact genomes and
GC-rich high-complexity genomes.

**Slipped DNA** (STR and direct-repeat subtypes) was most prevalent in
*S. cerevisiae* (1,224 Slipped_DNA loci), consistent with the eukaryotic
expansion of tandem repeat elements.  The smallest genomes (≤ 452 kb) harboured
the fewest slipped structures, reflecting their overall repeat-poor composition.

**Triplex DNA** (H-DNA) was broadly distributed but showed a clear ascending
trend with genome size: from 7 loci in *Ca. Carsonella* to 629 in
*S. cerevisiae*, consistent with the requirement for long mirror-repeat purine
or pyrimidine runs that accumulate preferentially in larger genomes.

**Z-DNA** was undetected in the two AT-rich endosymbionts (*B. aphidicola* and
*Ca. Carsonella*) and was most abundant in high-GC taxa: 8,740 loci in
*C. shaoxiangyii* (20.9 %) and 7,835 in *M. marina* (18.7 %), consistent with
the alternating purine–pyrimidine sequence requirement.

**R-Loops** were broadly distributed across genome sizes but varied by nearly
three orders of magnitude in density: from 1 /Mb in *Ca. Carsonella* to
970 /Mb in *H. pylori* (Table 3).  *H. pylori* (1.7 Mb, the third-smallest
genome) displayed a notably R-loop-enriched profile (21.6 % of motifs),
possibly linked to its compact genome and active transcription landscape under
host-environment stress.

**G-Quadruplex (G4) structures** were the dominant class in GC-rich genomes.
*C. shaoxiangyii* (65.9 % GC) harboured 24,198 G4 motifs (57.8 % of all motifs)
and *M. marina* contained 25,104 G4 motifs (59.8 %), both far exceeding any
other genome.  The two endosymbionts, at the opposite end of the size/GC spectrum,
possessed only 16 (*B. aphidicola*) and 10 (*Ca. Carsonella*) G4 loci (< 1 % each).
*S. cerevisiae* (12.2 Mb) harboured 6,310 G4 loci in absolute terms, but these
represent only 32.8 % of the total repertoire given the genome’s structural
diversity.

**i-Motif structures** (C-rich, complementary to G4) closely tracked G4 density:
they were most numerous in *C. shaoxiangyii* (485 loci) and *M. marina* and
near-absent in the endosymbionts (< 5 loci each), consistent with the
co-distribution expected from their complementary sequence requirements.

The normalised density heatmap (Fig. 3, rows ordered by genome size) cleanly
separates three structural archetypes: **(i)** smallest genomes (*Ca. Carsonella*,
*B. aphidicola*) dominated by Curved_DNA/Cruciform, **(ii)** GC-rich mid-size
bacteria (*M. marina*, *C. shaoxiangyii*) with extreme G4/Z-DNA densities, and
**(iii)** the remaining genomes with mixed, intermediate profiles.
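
A normalised density view rests on a per-Mb conversion, which is what allows genomes of very different sizes to share one colour scale. As a rough sketch using only the G4 counts and rounded genome sizes quoted in this section (the helper `per_mb` is a hypothetical name, not a NonBDNAFinder function):

```python
# Rounded genome sizes (Mb) and G4 counts as quoted in the text.
sizes_mb = {
    "B. aphidicola": 0.452,
    "M. marina": 3.4,
    "C. shaoxiangyii": 3.9,
    "S. cerevisiae": 12.2,
}
g4_counts = {
    "B. aphidicola": 16,
    "M. marina": 25_104,
    "C. shaoxiangyii": 24_198,
    "S. cerevisiae": 6_310,
}


def per_mb(counts, sizes):
    """Convert raw locus counts to loci per megabase, genome by genome."""
    return {genome: counts[genome] / sizes[genome] for genome in counts}


g4_per_mb = per_mb(g4_counts, sizes_mb)

# Print in ascending genome-size order, as in the heatmap rows.
for genome in sorted(g4_per_mb, key=sizes_mb.get):
    print(f"{genome:<16} {g4_per_mb[genome]:>8.1f} G4 loci/Mb")
```

On these inputs the *S. cerevisiae* G4 density is more than ten-fold below that of *C. shaoxiangyii*, even though its absolute count is about a quarter as large, which illustrates why raw counts and densities are reported separately.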

---

#### 3.3 Subclass-Level Analysis

Within each class, NonBDNAFinder resolved 76 distinct subclasses across the nine
genomes (Table 4; Fig. 5).  Key findings by taxonomy-order class are:

**Curved DNA subclasses.**  “Global Curvature” and “Local Curvature” were the
two principal subtypes.  In *B. aphidicola*, 1,541 of 2,671 Curved_DNA loci were
of the global type, indicating macroscopic sequence-directed bending that may
help compact the chromosome.  *Ca. Carsonella*, by contrast, displayed an almost
exclusively Local Curvature profile, which may reflect the limited room for
large-scale A-tract arrays in a 174 kb genome.

**Cruciform subclasses.**  “Cruciform-forming IRs” was the sole reported subclass.
Its absolute count scaled broadly with genome size (~2–3× higher in the GC-rich
3–4 Mb bacteria than in the sub-Mb endosymbionts), whereas its per-Mb density
was higher in the endosymbionts.

**Slipped DNA subclasses.**  “Direct Repeat” and “STR” structures were most
prevalent in *S. cerevisiae* (1,224 Slipped_DNA loci), consistent with
the eukaryotic expansion of repeat elements.  Across bacteria, STR density
was relatively stable (1–5 /Mb), with no systematic genome-size trend.

**G-Quadruplex subclasses.**  “Two-tetrad weak PQS” was the most prevalent G4
subtype in *E. coli* (5,823 of 6,126 G4 loci), while “Canonical intramolecular G4”
dominated in GC-rich bacteria.  “Bulged G4” was detected in all nine genomes,
indicating universal tolerance of imperfect G4 sequences.  “Higher-order G4
array/G4-wire” motifs appeared exclusively in *C. shaoxiangyii* and *M. marina*.

**i-Motif subclasses.**  “Canonical i-Motif” and “Extended-loop canonical”
subtypes were detected.  *C. shaoxiangyii* harboured 485 i-Motif loci — the
highest of any genome — suggesting C-rich stretches complementary to G4 arrays.

---

#### 3.4 Hybrid Region Analysis

Hybrid regions — loci where two structurally distinct non-B DNA motifs overlap
— numbered from **19** (*S. pneumoniae*) to **2,563** (*M. marina*) per genome
(Table 5; Fig. 6).  Their density (per Mb) ranged from 8.9 /Mb (*S. aureus*)
to 761 /Mb (*M. marina*).  There is no monotonic genome-size trend: the
smallest genomes (*Ca. Carsonella*, *B. aphidicola*) had moderate hybrid densities
(149 /Mb and 176 /Mb respectively), while the multi-megabase genomes varied widely
(*S. cerevisiae*: 31 /Mb; *M. marina*: 761 /Mb), suggesting that hybrid formation
is driven by sequence composition rather than genome size per se.
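
To make the hybrid definition concrete, the sketch below reports every pair of overlapping motifs that belong to different classes. It is an illustration only: NonBDNAFinder's actual overlap threshold and any merging of adjacent hybrids are not described here, so the rule used (any shared base, half-open coordinates), the function name, and the toy coordinates are all assumptions.

```python
def find_hybrid_pairs(motifs):
    """Return overlaps between motifs of different classes.

    motifs: iterable of (start, end, class_name) with half-open coordinates.
    Each result is (overlap_start, overlap_end, (class_a, class_b)).
    """
    ordered = sorted(motifs)
    hybrids = []
    for i, (s1, e1, c1) in enumerate(ordered):
        for s2, e2, c2 in ordered[i + 1:]:
            if s2 >= e1:
                break  # sorted by start, so no later motif can overlap motif i
            if c1 != c2:
                hybrids.append((s2, min(e1, e2), tuple(sorted((c1, c2)))))
    return hybrids


# Hypothetical motif calls on a toy sequence.
toy = [
    (100, 130, "G-Quadruplex"),
    (120, 150, "Z-DNA"),
    (125, 140, "G-Quadruplex"),
    (400, 420, "Cruciform"),
]
hybrids = find_hybrid_pairs(toy)
for start, end, classes in hybrids:
    print(start, end, "+".join(classes))

# Density per Mb is then count / genome size, e.g. the yeast figures above.
yeast_hybrids_per_mb = 381 / 12.2
print(f"S. cerevisiae: {yeast_hybrids_per_mb:.1f} hybrids/Mb")
```

Same-class overlaps (the two G-Quadruplex calls) are deliberately ignored, and the two G-Quadruplex/Z-DNA overlaps are reported separately; a production tool would probably merge them into one hybrid region.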

The most common hybrid subtypes across genomes were:
- **Cruciform–G-Quadruplex overlaps**: prevalent in GC-rich bacteria
  (*C. shaoxiangyii*, *M. marina*, *E. coli*), indicating co-localisation of
  inverted repeats with G-rich tracts.
- **R-Loop–G-Quadruplex overlaps**: detected in *E. coli*, *H. pylori*, and
  *S. cerevisiae*, consistent with the known interplay between transcription-
  associated R-loops and G4 formation on the non-template strand.
- **G-Quadruplex–Z-DNA overlaps**: found almost exclusively in GC-rich taxa,
  reflecting their shared GC-sequence requirements.
- **Cruciform–Curved_DNA overlaps**: enriched in the AT-rich endosymbionts,
  consistent with the co-occurrence of A-tract arrays and short inverted repeats
  in AT-biased compact genomes.

*S. cerevisiae* had a modest absolute count of hybrid regions (381, far fewer
than the 2,563 in *M. marina*) but the highest maximum cluster score (0.532),
consistent with complex multi-motif regulatory loci in eukaryotic chromatin.

---

#### 3.5 Non-B DNA Cluster Regions

Non-B DNA cluster regions are dense windows where ≥ 3 distinct structural classes
co-occur within a short sequence span.  Their count ranged from **17**
(*S. pneumoniae*, 2.1 Mb) to **4,543** (*C. shaoxiangyii*, 3.9 Mb; Table 6;
Fig. 7A).  Cluster counts do not scale with genome size: both endosymbionts
(≤ 452 kb) exhibited disproportionately large cluster densities per Mb
(*Ca. Carsonella*: 149 /Mb; *B. aphidicola*: 389 /Mb), likely reflecting the
structural consequences of long-term AT mutational pressure generating co-located
Curved_DNA and Cruciform loci.

Mixed_Cluster_3_classes was the most common subtype in most genomes, while
Mixed_Cluster_4–7_classes were found exclusively in *C. shaoxiangyii* and
*M. marina*, consistent with the unusually high number of structural classes
that co-occur per locus in GC-rich bacterial genomes.  A scatter analysis of
hybrid vs. cluster density (Fig. 7B) revealed a positive association
(r = 0.72), indicating that genomes with dense hybrid loci also tend to harbour
dense cluster regions.
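
The cluster criterion (at least three distinct classes within a short span) can be sketched as a window scan over sorted motif starts. The 300 bp window, the merging rule, and the function name are assumptions made for illustration; only the `Mixed_Cluster_N_classes` label pattern is taken from the text.

```python
def find_clusters(motifs, window=300, min_classes=3):
    """Return (start, end, classes) spans holding >= min_classes classes.

    motifs: iterable of (start, end, class_name). A window opens at each
    motif start and collects every motif starting within `window` bp.
    """
    ordered = sorted(motifs)
    clusters = []
    for i, (start, _, _) in enumerate(ordered):
        members = [m for m in ordered[i:] if m[0] < start + window]
        classes = {m[2] for m in members}
        if len(classes) < min_classes:
            continue
        span_end = max(m[1] for m in members)
        if clusters and start <= clusters[-1][1]:
            # Overlaps the previous cluster: merge spans and class sets.
            prev_start, prev_end, prev_classes = clusters.pop()
            clusters.append(
                (prev_start, max(prev_end, span_end), prev_classes | classes)
            )
        else:
            clusters.append((start, span_end, classes))
    return clusters


# Hypothetical motif calls: one dense AT-type locus and one sparse pair.
toy = [
    (0, 20, "Curved_DNA"),
    (50, 80, "Cruciform"),
    (120, 150, "R-Loop"),
    (200, 230, "Cruciform"),
    (5000, 5030, "Z-DNA"),
    (5100, 5125, "G-Quadruplex"),
]
clusters = find_clusters(toy)
labels = [f"Mixed_Cluster_{len(c)}_classes" for _, _, c in clusters]
print(clusters, labels)
```

Only the first locus qualifies: the Z-DNA/G-Quadruplex pair has two classes and falls below the threshold.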

---

#### 3.6 Structural Complexity and Occupancy Metrics

To capture genome-wide structural complexity beyond simple counts, six derived
metrics were examined (Table 7; Fig. 8).  Results are discussed from the smallest
genome to the largest.

The two endosymbionts (*Ca. Carsonella*, *B. aphidicola*) showed low
Structural Complexity Index (SCI = 0.099–0.143) but high SLI (0.092 and 0.078,
respectively, the former being the maximum of the observed 0.024–0.092 range),
reflecting structurally simple but densely packed genomes dominated by a single
class.  *H. pylori* (1.7 Mb) showed moderate values across all metrics.
*S. pneumoniae* and *S. aureus* (2–3 Mb) occupied the lower-complexity end
with SCI ≤ 0.12.

The GC-rich mid-size bacteria *M. marina* and *C. shaoxiangyii* (3–4 Mb) scored
highest on SCI (0.239–0.271), Structural Intensity (SI 0.306–0.324), and
Weighted Structural Coverage (WSC).  *E. coli* (4.6 Mb) fell below these
despite its larger size, consistent with its lower GC content.

*S. cerevisiae* (12.2 Mb, the largest genome) had the lowest SCI (0.083) but
the highest Coefficient of Variation (CV = 3.48) and the highest Max Local
Density (0.076 in 1,000 bp windows), indicating that non-B DNA motifs are
strongly clustered at specific loci (consistent with known hotspots at rDNA
and telomeric regions) rather than being evenly distributed across the large
eukaryotic genome.
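
A coefficient of variation over fixed windows is one way to quantify this kind of spatial clustering. The sketch below bins motif start positions into 1,000 bp windows and divides the standard deviation of the per-window counts by their mean. Whether the notebook uses counts or covered bases, and population or sample standard deviation, is not stated, so those choices (and the function names) are assumptions.

```python
from statistics import mean, pstdev


def window_counts(starts, genome_length, window=1_000):
    """Count motif starts per fixed-size window along the genome."""
    n_windows = -(-genome_length // window)  # ceiling division
    counts = [0] * n_windows
    for start in starts:
        counts[start // window] += 1
    return counts


def coefficient_of_variation(counts):
    """Population standard deviation of the counts divided by their mean."""
    mu = mean(counts)
    return pstdev(counts) / mu if mu else float("nan")


# Two hypothetical 4 kb genomes with four motifs each.
even_starts = [500, 1_500, 2_500, 3_500]   # one motif per window
clustered_starts = [10, 20, 30, 40]        # all motifs in the first window

cv_even = coefficient_of_variation(window_counts(even_starts, 4_000))
cv_clustered = coefficient_of_variation(window_counts(clustered_starts, 4_000))
print(f"even: CV = {cv_even:.2f}; clustered: CV = {cv_clustered:.2f}")
```

Both toy genomes have the same overall density, yet the clustered one has a much larger CV, which is the pattern the text describes for *S. cerevisiae*.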

These data suggest that **GC content drives SCI and SI**, while **genome
architecture (eukaryotic vs. bacterial)** drives local clustering and CV,
independently of total genome size.

---

#### 3.7 Structural Diversity Indices

**Simpson Diversity Index (D)** ranged from **0.283** (*Ca. Carsonella*, the
smallest genome, dominated by a single class) to **0.713** (*S. cerevisiae*,
the largest genome, with the most even distribution of classes).
A general trend of increasing D with genome size is visible, but is strongly
modulated by GC content: the GC-rich *M. marina* and *C. shaoxiangyii* have
lower D (0.48–0.52) than same-sized genomes with lower GC, because their
repertoires are numerically dominated by G-Quadruplex and Z-DNA (Fig. 9A;
Table 8).

**Effective Class Number (N_eff)** mirrors this pattern, ranging from **1.39**
(*Ca. Carsonella*) to **3.48** (*S. cerevisiae*).  The two smallest genomes have
N_eff < 2, indicating that only one or two structural classes effectively
characterise their non-B DNA landscapes (Fig. 9B).
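
The reported D and N_eff endpoints are consistent with the standard Gini–Simpson form, D = 1 − Σp², and its inverse-Simpson counterpart, N_eff = 1/Σp² = 1/(1 − D), where p is each class's share of the motifs. The sketch below assumes those definitions (the notebook's exact formulas are not shown in this section) and checks the relationship against the quoted values; the class counts are hypothetical.

```python
def simpson_diversity(counts):
    """Gini-Simpson index: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def effective_class_number(counts):
    """Inverse Simpson index: number of equally common classes implied."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts)


def n_eff_from_d(d):
    """Convert a Gini-Simpson D into the effective class number."""
    return 1.0 / (1.0 - d)


# Hypothetical class counts: an even two-class genome and a skewed one.
even = [50, 50]
skewed = [90, 10]
print(simpson_diversity(even), effective_class_number(even))
print(simpson_diversity(skewed), effective_class_number(skewed))

# Consistency check against the endpoints quoted in the text.
print(round(n_eff_from_d(0.283), 2), round(n_eff_from_d(0.713), 2))
```

Applying `n_eff_from_d` to the reported D values of 0.283 and 0.713 gives 1.39 and 3.48, matching the N_eff range above.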

**Structural Dominance Ratio** was highest in *Ca. Carsonella* (0.780) and
lowest in *S. cerevisiae* (0.439) and *H. pylori* (0.498).  Among genomes of
comparable size (2–4 Mb), the dominance ratio inversely correlates with the
number of distinct classes detected, reinforcing the view that structural
diversity is primarily set by sequence composition rather than genome length
(Fig. 9C).

Taken together, the diversity analysis indicates that **genome-size order places
structurally simple, composition-biased genomes at the bottom and structurally
diverse, larger genomes at the top**, but this trend is substantially modified
by GC content: the mid-range GC-rich bacteria are structural outliers with
lower diversity than their size would predict.

---

#### 3.8 Summary

This cross-species comparative analysis, with genomes arrayed from smallest
(*Ca. Carsonella*, 174 kb) to largest (*S. cerevisiae*, 12.2 Mb) and motif
classes discussed in structural taxonomy order, reveals that non-B DNA motifs
are ubiquitous yet highly variable in type, abundance, and organisation.
Genome size itself is a poor predictor of non-B DNA density or structural
complexity.  Instead, **GC content** appears to be the primary determinant: high-GC
bacteria carry the densest G-Quadruplex/Z-DNA landscapes and highest SCI,
while AT-rich obligate endosymbionts are instead dominated by Curved_DNA and
Cruciform structures at comparable overall motif densities.  Pathogenic
bacteria (3rd–5th in size order) occupy intermediate positions, with notable
R-Loop enrichment in *H. pylori* potentially linked to transcription–replication
conflicts during infection.  The eukaryote *S. cerevisiae* (largest genome)
exhibits the highest structural diversity (Simpson D = 0.713), the most
spatially clustered motif distribution (CV = 3.48), and the most even class
representation (N_eff = 3.48), consistent with chromatin-level regulation of
non-canonical DNA structures across a large, compartmentalised genome.
These findings underscore the utility of NonBDNAFinder for genome-wide structural
genomics comparisons and suggest testable hypotheses linking non-B DNA content
to genome stability, mutation spectra, and host–pathogen interactions.
