# Replicon Call Audit: pf32/wp Cross-Family Conflicts

**Purpose:** Identify contigs where pf32 and wp databases assign different 
replicon *families* (not just variants within the same family). These are 
potential classification errors in the existing 10.2 calls.


**Categories:**
1. `chromosome_paralog` — pf32 hits chromosomal pf32 paralogs, wp identifies actual plasmid
2. `lp56_cp32` — pf32 says lp56 (which carries cp32-derived pf32 genes), wp says cp32-X
3. `other_mismatch` — different families, shared sequence tiebreaks

**Exclusions:**
- Fusion calls (containing `:::`)
  - Fusion detection is functionality I added to plasmid_caller after all original annotation was performed, so we keep the existing calls for these contigs. This will be discussed in detail in future work.
- Large contigs (≥100kb) correctly resolved as chromosome by CHROMOSOME_MIN_BP logic
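
The family-versus-variant distinction that drives this audit can be illustrated with a small standalone sketch. `family_of` is a hypothetical helper written for this example; it uses a simplified pattern and is not the notebook's `get_family`. The call names are taken from the categories above.

```python
import re

def family_of(call):
    """Hypothetical helper: reduce a replicon call to its family prefix."""
    # Look only at the first component of a fusion (":::") or combined ("+") call.
    primary = call.lower().split(":::")[0].split("+")[0]
    m = re.match(r"(cp\d+|lp\d+)", primary)
    return m.group(1) if m else primary

pairs = [
    ("lp28-3", "lp28-4"),        # variants within one family
    ("lp56", "cp32-6"),          # different families
    ("chromosome", "lp21-cp9"),  # chromosome vs. plasmid
]
for pf32_call, wp_call in pairs:
    same = family_of(pf32_call) == family_of(wp_call)
    label = "same family" if same else "cross-family"
    print(f"pf32={pf32_call} wp={wp_call}: {label}")
```

Only the cross-family pairs would be candidates for this audit; same-family disagreements are treated as tiebreaks.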

In [1]:
import pandas as pd
import re

# --- Config ---
SUMMARY_PATH = "calls/summary_best_hits.tsv"  # latest hits from the most up-to-date plasmid_caller, parsed from the existing BLAST results
OLD_CALLS_PATH = "best_hits_1000bp_v10.2.csv"

# --- Helper ---
def get_family(name):
    """Extract replicon family from a call name."""
    if pd.isna(name):
        return None
    name = name.lower()
    if name == "chromosome":
        return "chromosome"
    m = re.match(r"(cp32|lp\d+|cp\d+|lp5)", name.split(":::")[0].split("+")[0])
    return m.group(1) if m else name

def categorize_error(r):
    """Categorize a cross-family pf32/wp conflict."""
    if r.plasmid_name_pf32 == "chromosome" and r.plasmid_name_wp != "chromosome":
        return "chromosome_paralog"
    if r.plasmid_name_pf32 == "lp56" and str(r.plasmid_name_wp).startswith("cp32"):
        return "lp56_cp32"
    return "other_mismatch"

# --- Load data ---
df = pd.read_csv(SUMMARY_PATH, sep="\t")
old = pd.read_csv(OLD_CALLS_PATH)
old_keys = set(zip(old["assembly_id"], old["contig_id"]))

print(f"New caller summary: {len(df)} contigs")
print(f"Old 10.2 calls: {len(old)} contigs")

# --- Find all pf32/wp conflicts ---
conflict = df[
    df["plasmid_name_pf32"].notna()
    & df["plasmid_name_wp"].notna()
    & (df["plasmid_name_pf32"] != df["plasmid_name_wp"])
].copy()
conflict["fam_pf32"] = conflict["plasmid_name_pf32"].apply(get_family)
conflict["fam_wp"] = conflict["plasmid_name_wp"].apply(get_family)

same_fam = conflict[conflict["fam_pf32"] == conflict["fam_wp"]]
diff_fam = conflict[conflict["fam_pf32"] != conflict["fam_wp"]].copy()

print(f"\nTotal pf32/wp conflicts: {len(conflict)}")
print(f"  Same family (tiebreaks, not errors): {len(same_fam)}")
print(f"  Different family (potential errors): {len(diff_fam)}")

# --- Filter out fusions (see above) and correctly resolved large chromosomes ---
errors = diff_fam[
    ~(diff_fam["final_call"].str.contains(":::", na=False))
    & ~((diff_fam["contig_len"] >= 100_000) & (diff_fam["final_call"] == "chromosome"))
].copy()

fusions_excluded = len(diff_fam) - len(
    diff_fam[~(diff_fam["final_call"].str.contains(":::", na=False))]
)
large_chrom_excluded = len(diff_fam) - fusions_excluded - len(errors)

print(f"\nExcluded:")
print(f"  Fusions (:::): {fusions_excluded}")
print(f"  Large chromosomes (≥100kb): {large_chrom_excluded}")
print(f"  Remaining errors to audit: {len(errors)}")

# --- Categorize and check if in 10.2 ---
errors["error_type"] = errors.apply(categorize_error, axis=1)


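# Normalize old keys once: (assembly_id, first space-delimited token of contig_id)
old_keys_norm = {(a, str(c).split(" ")[0]) for a, c in old_keys}


def check_in_old(row):
    """Check if this contig exists in the 10.2 calls table.

    Compares the first token of contig_id exactly, so that e.g. "contig_1"
    in 10.2 does not match "contig_11" the way a prefix match would.
    """
    return (row.assembly_id, str(row.contig_id).split(" ")[0]) in old_keys_norm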


errors["in_10.2"] = errors.apply(check_in_old, axis=1)

# --- Report ---
print(f"\n{'='*100}")
for t in ["chromosome_paralog", "lp56_cp32", "other_mismatch"]:
    subset = errors[errors.error_type == t].sort_values("contig_len", ascending=False)
    in_old_count = subset["in_10.2"].sum()
    print(f"\n=== {t} ({len(subset)} total, {in_old_count} in 10.2) ===")
    for _, r in subset.iterrows():
        marker = " " if r["in_10.2"] else "*"
        print(
            f" {marker} {r.assembly_id}/{r.contig_id} ({r.contig_len}bp): "
            f"pf32={r.plasmid_name_pf32} wp={r.plasmid_name_wp} final={r.final_call}"
        )

print(f"\n{'='*100}")
print(f"Legend: * = not in 10.2 (below 1000bp or key mismatch)")
print(f"\nTotal errors in 10.2: {errors['in_10.2'].sum()}")
print(f"Total errors not in 10.2: {(~errors['in_10.2']).sum()}")

New caller summary: 4016 contigs
Old 10.2 calls: 1685 contigs

Total pf32/wp conflicts: 237
  Same family (tiebreaks, not errors): 190
  Different family (potential errors): 47

Excluded:
  Fusions (:::): 8
  Large chromosomes (≥100kb): 6
  Remaining errors to audit: 33


=== chromosome_paralog (17 total, 17 in 10.2) ===
   URI112H/contig_2 [gcode=11] [topology=linear] (71363bp): pf32=chromosome wp=lp28-3 final=chromosome*
   UCT30H/contig_6 [gcode=11] [topology=linear] (34559bp): pf32=chromosome wp=lp21-cp9 final=chromosome
   URI33H/contig_6 [gcode=11] [topology=linear] (30965bp): pf32=chromosome wp=lp21-cp9 final=chromosome
   URI111H/contig_9 [gcode=11] [topology=linear] (30763bp): pf32=chromosome wp=lp21-cp9 final=chromosome
   UCT31H/contig_12 [gcode=11] [topology=linear] (30338bp): pf32=chromosome wp=lp21-cp9 final=chromosome
   URI41H/contig_11 [gcode=11] [topology=linear] (30141bp): pf32=chromosome wp=lp21-cp9 final=chromosome
   URI88H/contig_14 [gcode=11] [topology=linear] (

## Fusion Calls (extra_annotation category)

These are contigs where the new caller detected multiple potential pf32 loci,
producing compound calls like `lp28-3:::lp36`. The prior 10.2 calls
used only the best pf32 locus for classification, so we default to those
prior calls here and only flag cross-family hits.

To be discussed in future work.
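
A compound call can be taken apart by splitting on `:::`. The sketch below assumes the first component is the one treated as primary, which is how the cell that follows derives `primary_call`. `split_fusion` is a hypothetical helper for illustration, not part of plasmid_caller.

```python
def split_fusion(call, sep=":::"):
    """Split a compound call into (primary, extras); a plain call has no extras."""
    parts = call.split(sep)
    return parts[0], parts[1:]

for call in ["lp28-3:::lp36", "lp38"]:
    primary, extras = split_fusion(call)
    print(f"{call}: primary={primary} extras={extras}")
```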

In [3]:
# --- Fusion calls: new caller found multiple pf32 loci ---
fusions = diff_fam[
    diff_fam["final_call"].str.contains(":::", na=False)
].copy()

fusions["in_10.2"] = fusions.apply(check_in_old, axis=1)

# Extract primary call (first component before :::)
fusions["primary_call"] = fusions["final_call"].str.split(":::").str[0]

print(f"Fusion calls (cross-family only): {len(fusions)}")
print(f"In 10.2: {fusions['in_10.2'].sum()}")
print()

for _, r in fusions.sort_values("contig_len", ascending=False).iterrows():
    marker = " " if r["in_10.2"] else "*"
    print(
        f" {marker} {r.assembly_id}/{r.contig_id} ({r.contig_len}bp): "
        f"10.2 would be='{r.primary_call}' | full={r.final_call} | "
        f"pf32={r.plasmid_name_pf32} wp={r.plasmid_name_wp}"
    )

Fusion calls (cross-family only): 8
In 10.2: 8

   UCT35H/contig_2 [gcode=11] [topology=linear] (88152bp): 10.2 would be='lp28-3' | full=lp28-3:::lp28-4 | pf32=lp28-4 wp=lp25
   UCT29H/contig_2 [gcode=11] [topology=linear] (75871bp): 10.2 would be='lp38' | full=lp38:::lp36 | pf32=lp38 wp=lp36
   URI46H/contig_2 [gcode=11] [topology=linear] (64384bp): 10.2 would be='chromosome' | full=chromosome:::lp28-5 | pf32=chromosome wp=lp28-5
   URI88H/contig_2 [gcode=11] [topology=linear] (59619bp): 10.2 would be='lp28-4' | full=lp28-4:::lp25 | pf32=lp28-4 wp=lp25
   UCT30H/contig_2 [gcode=11] [topology=linear] (57651bp): 10.2 would be='lp28-4' | full=lp28-4:::lp25 | pf32=lp28-4 wp=lp25
   ESI361H/contig_2 [gcode=11] [topology=linear] (56559bp): 10.2 would be='lp36' | full=lp36:::lp28-4 | pf32=lp36 wp=lp28-4
   UCT35H/contig_4 [gcode=11] [topology=linear] (50141bp): 10.2 would be='lp28-1' | full=lp28-1:::lp36 | pf32=lp28-1 wp=lp36
   UCT92H/contig_3 [gcode=11] [topology=linear] (42605bp): 10.2 wo

In [5]:
df = pd.read_csv('comparison/final_comparison.tsv', sep='\t')
print(df.category.value_counts().to_string())
print()
ea = df[df.category=='extra_annotation']
print(f'\nExtra annotation ({len(ea)}):')
for _, r in ea.sort_values('contig_len', ascending=False).iterrows():
    print(f'  {r.assembly_id}/{r.contig_id} ({r.contig_len}bp): {r.old_call} -> {r.new_call}')

category
new_only                2331
exact_match             1560
annotation_suffix         52
extra_annotation          38
same_family_tiebreak      22
different                 13


Extra annotation (38):
  UCT35H/contig_2 [gcode=11] [topology=linear] (88152bp): lp28-3 -> lp28-3:::lp28-4
  URI103H/contig_2 [gcode=11] [topology=linear] (86873bp): lp28-3 -> lp28-3:::lp36
  URI86H/contig_2 [gcode=11] [topology=linear] (86452bp): lp28-3 -> lp28-4:::lp28-3
  URI47H/contig_2 [gcode=11] [topology=linear] (83849bp): lp28-3 -> lp28-3:::lp36
  UCT29H/contig_2 [gcode=11] [topology=linear] (75871bp): lp38 -> lp38:::lp36
  UCT50H/contig_2 [gcode=11] [topology=linear] (66334bp): lp28-2 -> lp28-4:::lp28-2
  URI46H/contig_2 [gcode=11] [topology=linear] (64384bp): chromosome -> chromosome:::lp28-5
  GCF_002151485.1_ASM215148v1_genomic/NZ_CP019921.1 Borreliella burgdorferi strain PAbe plasmid p_cp32-9-4, complete sequence (62238bp): cp32-4 -> cp32-9:::cp32-4
  GCF_002151505.1_ASM215150v1_genomic/NZ_C

# Generate Override Candidates

Reads the comparison table from compare_calls.py and produces an editable
TSV of all contigs that are not `exact_match` or `new_only`, plus any
`exact_match` contigs with a pf32/wp cross-family conflict.

**Workflow:**
1. Run this cell to generate `override_candidates.tsv`
2. Review the table and fill in the `override_call` column for any contigs you want to override
3. Save the rows you changed as `overrides.tsv` (rename `override_call` to `resolved_call`; keep `assembly_id`, `contig_id`, `resolved_call`)
4. Re-run compare_calls.py with `--overrides overrides.tsv`

**Format:**
```
assembly_id  contig_id  contig_len  old_call  new_call  resolved_call  override_call  category  action
```
- `resolved_call` = what the comparison script chose (defaults to old call)
- `override_call` = empty by default; fill it in to change the final call for that contig
- `action` = suggested action: KEEP, REVIEW, or OVERRIDE
- The remaining columns carry the pf32 and wp hit statistics from the summary table
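
For orientation, here is a minimal sketch of how an overrides file with the columns `assembly_id`, `contig_id`, `resolved_call` could be applied to a table of resolved calls. This is an assumption about the general approach, not the actual implementation in compare_calls.py. `apply_overrides` and the toy rows (`ASM_A`, etc.) are hypothetical.

```python
import csv
import io

# Toy stand-ins for the comparison table and overrides.tsv (illustrative values only).
comparison = [
    {"assembly_id": "ASM_A", "contig_id": "contig_6", "resolved_call": "chromosome"},
    {"assembly_id": "ASM_A", "contig_id": "contig_2", "resolved_call": "lp28-4"},
]
overrides_tsv = "assembly_id\tcontig_id\tresolved_call\nASM_A\tcontig_6\tlp21-cp9\n"

def apply_overrides(rows, overrides_text):
    """Return copies of rows with resolved_call replaced wherever an override exists."""
    reader = csv.DictReader(io.StringIO(overrides_text), delimiter="\t")
    lookup = {(r["assembly_id"], r["contig_id"]): r["resolved_call"] for r in reader}
    out = []
    for row in rows:
        key = (row["assembly_id"], row["contig_id"])
        # Fall back to the existing call when no override is listed for this contig.
        out.append({**row, "resolved_call": lookup.get(key, row["resolved_call"])})
    return out

resolved = apply_overrides(comparison, overrides_tsv)
for row in resolved:
    print(row["assembly_id"], row["contig_id"], row["resolved_call"])
```

Keying on the `(assembly_id, contig_id)` pair means contigs without an entry in the overrides file pass through unchanged.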

In [16]:
import pandas as pd
import re

COMPARISON_PATH = "comparison/final_comparison.tsv"
SUMMARY_PATH = "calls/summary_best_hits.tsv"
OUTPUT_PATH = "comparison/override_candidates.tsv"

df = pd.read_csv(COMPARISON_PATH, sep="\t")
summary = pd.read_csv(SUMMARY_PATH, sep="\t")

# Strip brackets for join key
def strip_brackets(s):
    return re.sub(r'\s*\[.*?\]', '', str(s)).strip()

df["_join_key"] = df["assembly_id"] + "||" + df["contig_id"].apply(strip_brackets)
summary["_join_key"] = summary["assembly_id"] + "||" + summary["contig_id"].apply(strip_brackets)

# Select hit columns from summary
hit_cols = [
    "_join_key",
    "plasmid_name_pf32", "overall_percent_identity_pf32",
    "query_coverage_percent_pf32", "query_covered_length_pf32",
    "plasmid_name_wp", "overall_percent_identity_wp",
    "query_coverage_percent_wp", "ref_length_wp",
    "ref_covered_length_wp",
]
# Only keep columns that exist
hit_cols = [c for c in hit_cols if c in summary.columns]
hit_info = summary[hit_cols].drop_duplicates(subset=["_join_key"], keep="first")

# Merge
df = df.merge(hit_info, on="_join_key", how="left")
df = df.drop(columns=["_join_key"])

# Helper to extract replicon family
def get_family(name):
    if pd.isna(name):
        return None
    name = str(name).lower()
    if name == "chromosome":
        return "chromosome"
    m = re.match(r"(cp32|lp\d+|cp\d+|lp5)", name.split(":::")[0].split("+")[0])
    return m.group(1) if m else name

# Identify pf32/wp cross-family conflicts (these may be exact_match in comparison)
df["fam_pf32"] = df["plasmid_name_pf32"].apply(get_family)
df["fam_wp"] = df["plasmid_name_wp"].apply(get_family)
df["pf32_wp_conflict"] = (
    df["plasmid_name_pf32"].notna()
    & df["plasmid_name_wp"].notna()
    & (df["fam_pf32"] != df["fam_wp"])
)

# Filter: non-exact-match categories OR pf32/wp cross-family conflicts
candidates = df[
    ~df["category"].isin(["exact_match", "new_only"])
    | df["pf32_wp_conflict"]
].copy().sort_values(["category", "contig_len"], ascending=[True, False])

# Drop helper columns from output
candidates = candidates.drop(columns=["fam_pf32", "fam_wp", "pf32_wp_conflict"])

# Add suggested action
def suggest_action(r):
    if r["category"] in ("annotation_suffix", "same_family_tiebreak"):
        return "KEEP"
    if r["category"] == "exact_match":
        # Only here because of pf32/wp cross-family conflict
        pf32 = str(r.get("plasmid_name_pf32", "")).lower()
        wp = str(r.get("plasmid_name_wp", "")).lower()
        if pf32 == "chromosome" and wp != "chromosome":
            return "OVERRIDE - chromosome paralog bug"
        if pf32 == "lp56" and wp.startswith("cp32"):
            return "REVIEW - lp56/cp32 conflict"
        return "REVIEW - pf32/wp family conflict"
    if r["category"] == "extra_annotation":
        # Flag fusions where primary component differs from old call
        if ":::" in str(r["new_call"]):
            primary = str(r["new_call"]).split(":::")[0]
            if primary.lower() != str(r["old_call"]).lower():
                return "REVIEW - primary differs from 10.2"
        return "KEEP"
    if r["category"] == "different":
        if str(r["old_call"]).lower() == "chromosome" and str(r["new_call"]).lower() != "chromosome":
            return "OVERRIDE - chromosome paralog bug"
        return "REVIEW"
    return "REVIEW"

candidates["action"] = candidates.apply(suggest_action, axis=1)

# Add ref_cov_pct for wp hits
candidates["wp_ref_cov_pct"] = pd.to_numeric(
    candidates.get("ref_covered_length_wp"), errors="coerce"
) / pd.to_numeric(
    candidates.get("ref_length_wp"), errors="coerce"
) * 100
candidates["wp_ref_cov_pct"] = candidates["wp_ref_cov_pct"].round(1)

# Reorder columns for readability
col_order = [
    "assembly_id", "contig_id", "contig_len",
    "old_call", "new_call", "resolved_call", "override_call",
    "category", "action",
    "plasmid_name_pf32", "overall_percent_identity_pf32", 
    "query_coverage_percent_pf32", "query_covered_length_pf32",
    "plasmid_name_wp", "overall_percent_identity_wp",
    "query_coverage_percent_wp", "ref_length_wp", 
    "ref_covered_length_wp", "wp_ref_cov_pct",
]

# Add empty override column for manual edits
candidates["override_call"] = ""

col_order = [c for c in col_order if c in candidates.columns]
candidates = candidates[col_order]

# Summary
print(f"Override candidates: {len(candidates)}")
print(f"\nBy category:")
print(candidates["category"].value_counts().to_string())
print(f"\nBy suggested action:")
print(candidates["action"].value_counts().to_string())

# Write
candidates.to_csv(OUTPUT_PATH, sep="\t", index=False)
print(f"\nWrote -> {OUTPUT_PATH}")
print(f"\nWorkflow:")
print(f"  1. Open {OUTPUT_PATH} in a spreadsheet or text editor")
print(f"  2. Fill in 'override_call' column for any contigs you want to change")
print(f"  3. To generate overrides.tsv for compare_calls.py:")
print(f"     Filter to rows where override_call is not empty,")
print(f"     rename override_call -> resolved_call,")
print(f"     keep columns: assembly_id, contig_id, resolved_call")

Override candidates: 163

By category:
category
annotation_suffix       52
exact_match             38
extra_annotation        38
same_family_tiebreak    22
different               13

By suggested action:
action
KEEP                                  103
OVERRIDE - chromosome paralog bug      21
REVIEW - pf32/wp family conflict       12
REVIEW - lp56/cp32 conflict            10
REVIEW - primary differs from 10.2      9
REVIEW                                  8

Wrote -> comparison/override_candidates.tsv

Workflow:
  1. Open comparison/override_candidates.tsv in a spreadsheet or text editor
  2. Fill in 'override_call' column for any contigs you want to change
  3. To generate overrides.tsv for compare_calls.py:
     Filter to rows where override_call is not empty,
     rename override_call -> resolved_call,
     keep columns: assembly_id, contig_id, resolved_call


## Extract Overrides from Edited Candidates Table

After manually reviewing `override_candidates.tsv` and filling in the
`override_call` column for contigs that need correction, this cell:

1. Reads the edited candidates table
2. Filters to rows where `override_call` is populated
3. Writes `overrides.tsv` in the format expected by compare_calls.py
4. Summarizes what was overridden and why

In [17]:
CANDIDATES_PATH = "comparison/override_candidates.tsv"
OVERRIDES_PATH = "comparison/overrides.tsv"

df = pd.read_csv(CANDIDATES_PATH, sep="\t")

# Filter to rows with an override
overrides = df[
    df["override_call"].notna() & (df["override_call"].astype(str).str.strip() != "")
].copy()

print(f"Total overrides: {len(overrides)}")
print(f"\nBy category:")
print(overrides["category"].value_counts().to_string())
print(f"\nBy action:")
print(overrides["action"].value_counts().to_string())

print(f"\n{'='*100}")
print(f"{'Assembly':<45} {'Contig Len':>10}  {'Old Call':<15} {'Override To':<15} {'Category'}")
print(f"{'-'*45} {'-'*10}  {'-'*15} {'-'*15} {'-'*20}")
for _, r in overrides.sort_values("contig_len", ascending=False).iterrows():
    asm = r["assembly_id"][:44]
    print(
        f"{asm:<45} {r['contig_len']:>10}  "
        f"{str(r['old_call']):<15} {str(r['override_call']):<15} {r['category']}"
    )

# Write overrides.tsv for compare_calls.py
out = overrides[["assembly_id", "contig_id", "override_call"]].rename(
    columns={"override_call": "resolved_call"}
)
out.to_csv(OVERRIDES_PATH, sep="\t", index=False)

print(f"\n{'='*100}")
print(f"Wrote {len(out)} overrides -> {OVERRIDES_PATH}")
print(f"\nNext: run compare_calls.py with --overrides {OVERRIDES_PATH}")

Total overrides: 29

By category:
category
exact_match         16
different           10
extra_annotation     3

By action:
action
OVERRIDE - chromosome paralog bug     21
REVIEW                                 5
KEEP                                   2
REVIEW - primary differs from 10.2     1

Assembly                                      Contig Len  Old Call        Override To     Category
--------------------------------------------- ----------  --------------- --------------- --------------------
GCF_002151505.1_ASM215150v1_genomic                61418  cp32-1+5        cp32-1+5        extra_annotation
ESI26H                                             61321  cp32-1+5        cp32-1+5        extra_annotation
UCT30H                                             34559  chromosome      lp21-cp9        exact_match
URI89H                                             32382  lp56            cp32-6          extra_annotation
URI33H                                             30965  chromosome   

## Generate v11 Calls Table (Full Columns)

Merges resolved calls with wp alignment stats to match
the column format of best_hits_1000bp_v10.2.csv.
Preserves original call_method from 10.2; only marks
overridden contigs as v11_override.
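
The `call_method` rule described above can be stated as a small function. This is a sketch of the rule only; `v11_call_method` is a hypothetical name, and the method values paired with each example call are illustrative rather than taken from the tables.

```python
def v11_call_method(old_call, resolved_call, old_method):
    """Keep the 10.2 call_method unless the resolved call differs from the old call."""
    return "v11_override" if old_call != resolved_call else old_method

examples = [
    ("chromosome", "lp21-cp9", "pf32"),  # call changed: marked as an override
    ("cp32-1+5", "cp32-1+5", "wp"),      # call unchanged: original method preserved
]
for old_call, resolved_call, old_method in examples:
    method = v11_call_method(old_call, resolved_call, old_method)
    print(f"{old_call} -> {resolved_call}: call_method={method}")
```

Under this rule, an override whose value equals the old call leaves `call_method` untouched, so the number of `v11_override` rows can be smaller than the number of rows in the overrides file.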

In [3]:
import pandas as pd
import re

def strip_brackets(s):
    return re.sub(r'\s*\[.*?\]', '', str(s)).strip()

# Load resolved calls and summary
comp = pd.read_csv("comparison/final_comparison.tsv", sep="\t")
summary = pd.read_csv("calls/summary_best_hits.tsv", sep="\t")
old = pd.read_csv("best_hits_1000bp_v10.2.csv")

# Keep only v11 contigs: drop new_only rows, which are all short (<1000bp) and filtered out.
v11 = comp[comp["category"] != "new_only"][
    ["assembly_id", "contig_id", "contig_len", "old_call", "resolved_call", "category"]
].copy()

# Join keys
v11["_key"] = v11["assembly_id"] + "||" + v11["contig_id"].apply(strip_brackets)
summary["_key"] = summary["assembly_id"] + "||" + summary["contig_id"].apply(strip_brackets)
old["_key"] = old["assembly_id"] + "||" + old["contig_id"].apply(strip_brackets)

# Pull wp columns for the hit stats
wp_cols = {
    "plasmid_id_wp": "plasmid_id",
    "plasmid_name_wp": "wp_plasmid_name",
    "strain_wp": "strain",
    "query_length_wp": "query_length",
    "ref_length_wp": "ref_length",
    "overall_percent_identity_wp": "overall_percent_identity",
    "query_covered_length_wp": "query_covered_length",
    "ref_covered_length_wp": "ref_covered_length",
    "covered_intervals_wp": "covered_intervals",
    "query_intervals_wp": "query_intervals",
    "subject_hit_coords_wp": "subject_hit_coords",
    "query_coverage_percent_wp": "query_coverage_percent",
}

select = ["_key"] + list(wp_cols.keys())
select = [c for c in select if c in summary.columns]
hits = summary[select].drop_duplicates(subset=["_key"], keep="first").rename(columns=wp_cols)

# Pull original call_method from 10.2
old_method = old[["_key", "call_method"]].drop_duplicates(subset=["_key"], keep="first")

v11 = v11.merge(hits, on="_key", how="left")
v11 = v11.merge(old_method, on="_key", how="left")
v11 = v11.drop(columns=["_key"])

# Use resolved_call as plasmid_name
v11["plasmid_name"] = v11["resolved_call"]

# Mark overridden contigs
v11.loc[v11["old_call"] != v11["resolved_call"], "call_method"] = "v11_override"

v11 = v11.drop(columns=["resolved_call", "old_call", "category", "wp_plasmid_name"])

# Match column order from 10.2
col_order = [
    "assembly_id", "contig_id", "contig_len", "plasmid_id",
    "plasmid_name", "strain", "query_length", "ref_length",
    "overall_percent_identity", "query_covered_length",
    "ref_covered_length", "covered_intervals", "query_intervals",
    "subject_hit_coords", "query_coverage_percent", "call_method",
]
v11 = v11[col_order]

print(f"Total contigs: {len(v11)}")
print(f"Columns: {list(v11.columns)}")
print(f"Empty plasmid_name: {v11.plasmid_name.isna().sum()}")
print(f"\nCall method counts:")
print(v11["call_method"].value_counts().to_string())
v11.to_csv("best_hits_1000bp_v11.csv", index=False)
print(f"\nWrote -> best_hits_1000bp_v11.csv")

Total contigs: 1685
Columns: ['assembly_id', 'contig_id', 'contig_len', 'plasmid_id', 'plasmid_name', 'strain', 'query_length', 'ref_length', 'overall_percent_identity', 'query_covered_length', 'ref_covered_length', 'covered_intervals', 'query_intervals', 'subject_hit_coords', 'query_coverage_percent', 'call_method']
Empty plasmid_name: 0

Call method counts:
call_method
pf32            1407
wp               260
v11_override      18

Wrote -> best_hits_1000bp_v11.csv


In [8]:
df = pd.read_csv("best_hits_1000bp_v10.2.csv", sep=",")

In [11]:
df['call_method'].value_counts()

call_method
pf32    1422
wp       263
Name: count, dtype: int64