# Summary of epitope deconvolution results
In `serum_epitope_deconvolution.ipynb`, I tested a wide range of uniqueness and spatial regularization parameters. Most parameter combinations produced nearly identical models. Here, I summarize the 'typical' model I get for each serum, and the parameters that did change the epitope deconvolution.

In [1]:
import pickle
import warnings

import altair as alt
import pandas as pd
import polyclonal

warnings.filterwarnings('ignore')

In [2]:
import os
os.chdir('../../')

In [3]:
spatial_distances = polyclonal.pdb_utils.inter_residue_distances(
    "data/PDBs/4o5n.pdb",
    target_chains=["A", "B"],
)

spatial_distances

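| | site_1 | site_2 | distance | chain_1 | chain_2 |
|---|---|---|---|---|---|
| 0 | 9 | 10 | 1.328212 | A | A |
| 1 | 9 | 11 | 3.469929 | B | B |
| 2 | 9 | 12 | 6.336130 | B | B |
| 3 | 9 | 13 | 9.189821 | B | B |
| 4 | 9 | 14 | 8.930696 | B | A |
| ... | ... | ... | ... | ... | ... |
| 260276 | 497 | 499 | 15.936294 | B | B |
| 260277 | 497 | 500 | 16.632641 | B | B |
| 260278 | 498 | 499 | 23.859705 | B | B |
| 260279 | 498 | 500 | 13.285421 | B | B |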


## AUSAB-05

In [4]:
prob_escape = pd.read_csv(
    "results/prob_escape/libA_221223_1_AUSAB-05_1_prob_escape.csv", keep_default_na=False, na_values="nan"
).query(
    "`no-antibody_count` >= no_antibody_count_threshold"
)  # filter for those with sufficient no-antibody counts
assert prob_escape.notnull().all().all()

# keep only the two serum concentrations used for fitting
prob_escape_filtered_05 = prob_escape.loc[
    prob_escape["antibody_concentration"].isin([0.0074, 0.0111])
]

In [10]:
model = polyclonal.Polyclonal(
    n_epitopes=1,
    data_to_fit=prob_escape_filtered_05.rename(
        columns={
            "antibody_concentration": "concentration",
            "aa_substitutions_reference": "aa_substitutions",
        }
    ),
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
)

# fit model
opt_res = model.fit(
    logfreq=200,
    reg_escape_weight=0.1,
)

# display results
display(model.activity_wt_barplot())
display(model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, init_floor_at_zero=False))

# First fitting site-level model.
# Starting optimization of 503 parameters at Wed Jan  4 15:16:58 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.025208       44868       44865           0           0           0              0               0       3.6049
          42      1.4649      612.49      606.83      2.1069           0           0              0               0       3.5517
# Successfully finished at Wed Jan  4 15:16:59 2023.
# Starting optimization of 3244 parameters at Wed Jan  4 15:16:59 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.022914      774.35      748.62       22.18  9.2654e-33           0              0               0       3.5517
          61      1.7758      746.34      739.99       2.199    0.088467           0              0               0       4.0

**I tested `reg_spatial2_weight` values from 1e-7 to 1e-1 and `reg_uniqueness2_weight` values from 0.1 to 2, with 2 epitopes assigned. All models looked pretty much the same as the one below. None of them gave clear, unique sites in the different epitopes.**

In [20]:
reference_sites = pd.read_csv("data/site_map.csv")["reference_site"].tolist()

model = polyclonal.Polyclonal(
    n_epitopes=2,
    data_to_fit=prob_escape_filtered_05.rename(
        columns={
            "antibody_concentration": "concentration",
            "aa_substitutions_reference": "aa_substitutions",
        }
    ),
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
    sites=reference_sites,
    spatial_distances=spatial_distances,
)

# fit model
opt_res = model.fit(
    logfreq=200,
    reg_escape_weight=0.1,
    reg_uniqueness_weight=0,
    reg_uniqueness2_weight=1,
    reg_spatial_weight=0.0,
    reg_spatial2_weight=0.0005,
)

# display results
display(model.activity_wt_barplot())
display(model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, init_floor_at_zero=False))

# First fitting site-level model.
# Starting optimization of 1006 parameters at Wed Jan  4 15:22:37 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.046578       44409       44401           0           0           0              0               0       8.2095
          39      2.2545      613.69      609.49     0.59899           0     0.60297              0       0.0040473       2.9898
# Successfully finished at Wed Jan  4 15:22:39 2023.
# Starting optimization of 6488 parameters at Wed Jan  4 15:22:39 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.062915      760.61       748.6      7.5375  4.7685e-33     0.60297              0         0.88281       2.9898
          34       2.274      747.26      742.44      1.3237    0.022715    0.072957              0       0.0091632        3
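
The sweep described before this cell can be sketched roughly as follows. This is a minimal outline, not the code from `serum_epitope_deconvolution.ipynb`: the endpoints of the two ranges come from the note above, but the intermediate grid values and the helper names `regularization_grid`, `run_sweep`, and `make_model` are assumptions introduced here for illustration.

```python
import itertools

import numpy as np


def regularization_grid(spatial2_weights, uniqueness2_weights):
    """Return one dict of fit keyword arguments per weight combination."""
    return [
        {"reg_spatial2_weight": float(s), "reg_uniqueness2_weight": float(u)}
        for s, u in itertools.product(spatial2_weights, uniqueness2_weights)
    ]


def run_sweep(make_model, grid, **fixed_fit_kwargs):
    """Fit a fresh model for every grid entry and collect the fitted models.

    `make_model` is a hypothetical zero-argument callable that builds a new,
    unfitted model, e.g. a lambda wrapping the `polyclonal.Polyclonal(...)`
    call used in the cells of this notebook.
    """
    fitted = []
    for weights in grid:
        model = make_model()  # new model each time so fits do not share state
        model.fit(**fixed_fit_kwargs, **weights)
        fitted.append((weights, model))
    return fitted


# Endpoints (1e-7 to 1e-1, 0.1 to 2) are from the text; the spacing is assumed.
grid = regularization_grid(
    spatial2_weights=np.logspace(-7, -1, num=7),
    uniqueness2_weights=[0.1, 0.5, 1, 2],
)
```

With the fixed arguments used throughout this notebook, a call would look like `run_sweep(make_model, grid, logfreq=200, reg_escape_weight=0.1, reg_uniqueness_weight=0, reg_spatial_weight=0.0)`. Building a new model per grid entry keeps each fit independent of the previous one.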

**The exception is that very low `reg_spatial2_weight` values sharply increase the escape value of site 92 and assign it to a distinct epitope, while all other sites stay in epitope 1. It seems odd to me that changing this regularization weight changes the actual site escape values for certain mutations.**

In [23]:
reference_sites = pd.read_csv("data/site_map.csv")["reference_site"].tolist()

model = polyclonal.Polyclonal(
    n_epitopes=2,
    data_to_fit=prob_escape_filtered_05.rename(
        columns={
            "antibody_concentration": "concentration",
            "aa_substitutions_reference": "aa_substitutions",
        }
    ),
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
    sites=reference_sites,
    spatial_distances=spatial_distances,
)

# fit model
opt_res = model.fit(
    logfreq=200,
    reg_escape_weight=0.1,
    reg_uniqueness_weight=0,
    reg_uniqueness2_weight=1,
    reg_spatial_weight=0.0,
    reg_spatial2_weight=0.000001,
)

# display results
display(model.activity_wt_barplot())
display(model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, init_floor_at_zero=False))

# First fitting site-level model.
# Starting optimization of 1006 parameters at Wed Jan  4 15:29:29 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.037509       44409       44401           0           0           0              0               0       8.2095
          50      2.4905      611.55      606.51      2.1262           0    0.063176              0       0.0096261       2.8457
# Successfully finished at Wed Jan  4 15:29:31 2023.
# Starting optimization of 6488 parameters at Wed Jan  4 15:29:31 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.057066      776.96      750.26      22.485  1.6397e-32    0.063176              0          1.2974       2.8457
          98      6.7372      746.98      741.16      2.3196    0.078515   0.0016549              0        0.019841       3.
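
A quick sanity check on why a regularization weight can move fitted escape values: in the logs above, the `loss` column looks like the sum of `fit_loss` and the `reg_*` columns. If that holds, the optimizer minimizes the penalized total rather than `fit_loss` alone, so changing a weight shifts the minimum and can shift individual escape values with it. The sketch below only checks the arithmetic on the final site-level rows copied from the two AUSAB-05 two-epitope logs; the helper names are made up here, and I have not confirmed the decomposition against the `polyclonal` source.

```python
# Column order as printed in the fit logs above.
LOG_COLUMNS = [
    "step", "time_sec", "loss", "fit_loss", "reg_escape", "reg_spread",
    "reg_spatial", "reg_uniqueness", "reg_uniqueness2", "reg_activity",
]


def parse_log_row(row: str) -> dict:
    """Map one whitespace-separated log row onto the column names."""
    return dict(zip(LOG_COLUMNS, (float(x) for x in row.split())))


def penalty_sum(parsed: dict) -> float:
    """Sum every regularization column in a parsed log row."""
    return sum(v for k, v in parsed.items() if k.startswith("reg_"))


# Final rows of the site-level fits, copied from the logs above.
rows = {
    "reg_spatial2_weight=0.0005":
        "39 2.2545 613.69 609.49 0.59899 0 0.60297 0 0.0040473 2.9898",
    "reg_spatial2_weight=0.000001":
        "50 2.4905 611.55 606.51 2.1262 0 0.063176 0 0.0096261 2.8457",
}

gaps = {}
for label, row in rows.items():
    parsed = parse_log_row(row)
    # Difference between the printed loss and fit_loss plus all penalties.
    gaps[label] = parsed["loss"] - (parsed["fit_loss"] + penalty_sum(parsed))
    print(f"{label}: loss - (fit_loss + penalties) = {gaps[label]:.4f}")
```

Both gaps are within the rounding of the printed values, which is consistent with the penalties being part of the objective. This does not rule out a bug, but it means some dependence of escape values on the regularization weights is expected.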

## AUSAB-07

In [21]:
prob_escape = pd.read_csv(
    "results/prob_escape/libA_221223_1_AUSAB-07_1_prob_escape.csv", keep_default_na=False, na_values="nan"
).query(
    "`no-antibody_count` >= no_antibody_count_threshold"
)  # filter for those with sufficient no-antibody counts
assert prob_escape.notnull().all().all()

# keep only the four serum concentrations used for fitting
prob_escape_filtered_07 = prob_escape.loc[
    prob_escape["antibody_concentration"].isin([0.0010, 0.0015, 0.0023, 0.0034])
]

**This serum pretty clearly targets a single site, but this is what forcing 2 epitopes looks like:**

In [22]:
model = polyclonal.Polyclonal(
    n_epitopes=2,
    data_to_fit=prob_escape_filtered_07.rename(
        columns={
            "antibody_concentration": "concentration",
            "aa_substitutions_reference": "aa_substitutions",
        }
    ),
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
    sites=reference_sites,
    spatial_distances=spatial_distances,
)

# fit model
opt_res = model.fit(
    logfreq=200,
    reg_escape_weight=0.1,
    reg_uniqueness_weight=0,
    reg_uniqueness2_weight=1,
    reg_spatial_weight=0.0,
    reg_spatial2_weight=0.0005,
)

# display results
display(model.activity_wt_barplot())
display(model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, init_floor_at_zero=False))

# First fitting site-level model.
# Starting optimization of 1006 parameters at Wed Jan  4 15:26:35 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.055609       89018       89007           0           0           0              0               0       11.386
         134      8.9468      2440.9      2424.7      2.7599           0      10.581              0        0.082892        2.719
# Successfully finished at Wed Jan  4 15:26:44 2023.
# Starting optimization of 6488 parameters at Wed Jan  4 15:26:44 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.079578      2974.3      2907.1      31.607  4.2721e-32      10.581              0          22.197        2.719
         144      11.484      2783.8      2727.1      30.677      1.2779      18.976              0          2.4925       3.

**If I force 3 epitopes:**

In [24]:
reference_sites = pd.read_csv("data/site_map.csv")["reference_site"].tolist()

model = polyclonal.Polyclonal(
    n_epitopes=3,
    data_to_fit=prob_escape_filtered_07.rename(
        columns={
            "antibody_concentration": "concentration",
            "aa_substitutions_reference": "aa_substitutions",
        }
    ),
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
    sites=reference_sites,
    spatial_distances=spatial_distances,
)

# fit model
opt_res = model.fit(
    logfreq=200,
    reg_escape_weight=0.1,
    reg_uniqueness_weight=0,
    reg_uniqueness2_weight=1,
    reg_spatial_weight=0.0,
    reg_spatial2_weight=0.0005,
)

# display results
display(model.activity_wt_barplot())
display(model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, init_floor_at_zero=False))

# First fitting site-level model.
# Starting optimization of 1509 parameters at Wed Jan  4 15:31:01 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.067572       88678       88661           0           0           0              0               0       17.079
         155      11.848      2436.1      2417.5       3.961           0      12.505              0         0.17667       1.9894
# Successfully finished at Wed Jan  4 15:31:13 2023.
# Starting optimization of 9732 parameters at Wed Jan  4 15:31:13 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0     0.10928      3014.9      2909.5      47.138  2.2339e-31      12.505              0          43.734       1.9894
         200      21.989        2792      2727.9       35.04       1.752      18.898              0          2.8256       5.

**Again, changing the number of epitopes or the regularization values changes some site escape scores. Site 159 went from -11 in the single- and double-epitope models to -35 in the 3-epitope model. This is probably not a concern if it only shows up in more extreme scenarios, but I want to make sure it isn't caused by a bug in the modeling.**
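
To track shifts like this systematically rather than by eye, two fitted models can be compared site by site. The sketch below assumes each model's per-mutation escape values are available as a data frame with `site` and `escape` columns (I believe `model.mut_escape_df` has this shape, but treat that as an assumption). Because epitope labels are not comparable between models, it takes the most extreme escape value per site across all epitopes. The function names and the toy numbers are illustrative only; just the -11 and -35 for site 159 come from the note above.

```python
import pandas as pd


def site_extreme_escape(mut_escape_df: pd.DataFrame) -> pd.Series:
    """Most extreme (largest absolute) escape value per site, across epitopes."""
    df = mut_escape_df.reset_index(drop=True)
    df = df.assign(abs_escape=df["escape"].abs())
    rows = df.loc[df.groupby("site")["abs_escape"].idxmax()]
    return rows.set_index("site")["escape"]


def compare_site_escape(df_a, df_b, top_n=5):
    """Rank sites by how much their extreme escape value differs between models."""
    merged = pd.concat(
        {"model_a": site_extreme_escape(df_a), "model_b": site_extreme_escape(df_b)},
        axis=1,
    ).dropna()
    merged["shift"] = merged["model_b"] - merged["model_a"]
    order = merged["shift"].abs().sort_values(ascending=False).index
    return merged.loc[order].head(top_n)


# Toy inputs: only the site 159 values (-11, -35) are taken from the text.
two_epitope = pd.DataFrame(
    {"epitope": [1, 1, 2, 2], "site": [159, 160, 159, 160],
     "escape": [-11.0, 2.0, -0.5, 0.1]}
)
three_epitope = pd.DataFrame(
    {"epitope": [1, 2, 3, 1, 2, 3], "site": [159, 159, 159, 160, 160, 160],
     "escape": [-35.0, -1.0, 0.0, 2.1, 0.0, 0.2]}
)

shifts = compare_site_escape(two_epitope, three_epitope)
print(shifts)
```

Sites at the top of `shifts` are the ones whose escape values depend most on the number of epitopes or the regularization settings, which makes them the first candidates to inspect when checking for a modeling bug.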

## AUSAB-11

In [25]:
prob_escape = pd.read_csv(
    "results/prob_escape/libA_221223_1_AUSAB-11_1_prob_escape.csv", keep_default_na=False, na_values="nan"
).query(
    "`no-antibody_count` >= no_antibody_count_threshold"
)  # filter for those with sufficient no-antibody counts
assert prob_escape.notnull().all().all()

# keep only the three serum concentrations used for fitting
prob_escape_filtered_11 = prob_escape.loc[
    prob_escape["antibody_concentration"].isin([0.0067, 0.0100, 0.0150])
]

In [26]:
model = polyclonal.Polyclonal(
    n_epitopes=2,
    data_to_fit=prob_escape_filtered_11.rename(
        columns={
            "antibody_concentration": "concentration",
            "aa_substitutions_reference": "aa_substitutions",
        }
    ),
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
    sites=reference_sites,
    spatial_distances=spatial_distances,
)

# fit model
opt_res = model.fit(
    logfreq=200,
    reg_escape_weight=0.1,
    reg_uniqueness_weight=0,
    reg_uniqueness2_weight=1,
    reg_spatial_weight=0.0,
    reg_spatial2_weight=0.0005,
)

# display results
display(model.activity_wt_barplot())
display(model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, init_floor_at_zero=False))

# First fitting site-level model.
# Starting optimization of 1006 parameters at Wed Jan  4 15:36:48 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.051684       58261       58253           0           0           0              0               0       8.0095
          15      1.4162      5971.9      5913.2      2.6324           0      54.974              0         0.22631      0.90466
# Successfully finished at Wed Jan  4 15:36:49 2023.
# Starting optimization of 6488 parameters at Wed Jan  4 15:36:49 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.069603      7099.1      6921.1      40.707  5.9029e-32      54.974              0          81.363      0.90466
         146      11.239        6479      6420.2       32.48     0.82307      20.261              0          3.4318       1.

**Most regularization parameters I tested resulted in models similar to this one. If I lower the uniqueness weights, site assignment doesn't really change; certain sites just get assigned to both epitopes.**

**Dropping `reg_spatial2_weight` a lot (<= 7e-6) results in site 189 getting pulled into a different epitope from sites 201, 212, and 244:**

In [31]:
model = polyclonal.Polyclonal(
    n_epitopes=2,
    data_to_fit=prob_escape_filtered_11.rename(
        columns={
            "antibody_concentration": "concentration",
            "aa_substitutions_reference": "aa_substitutions",
        }
    ),
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
    sites=reference_sites,
    spatial_distances=spatial_distances,
)

# fit model
opt_res = model.fit(
    logfreq=200,
    reg_escape_weight=0.1,
    reg_uniqueness_weight=0,
    reg_uniqueness2_weight=1,
    reg_spatial_weight=0.0,
    reg_spatial2_weight=0.000007,
)

# display results
display(model.activity_wt_barplot())
display(model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, init_floor_at_zero=False))

# First fitting site-level model.
# Starting optimization of 1006 parameters at Wed Jan  4 15:45:26 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.044581       58261       58253           0           0           0              0               0       8.0095
         200       10.99      5632.1      5612.1      10.542           0      7.9757              0         0.60451      0.94904
         287      14.918      5631.7      5612.1      10.847           0      7.3251              0         0.51332      0.94524
# Successfully finished at Wed Jan  4 15:45:41 2023.
# Starting optimization of 6488 parameters at Wed Jan  4 15:45:41 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0     0.06561      6817.2      6682.8      77.611  6.5377e-32      7.3251              0          48.506      0.9

**This makes sense spatially, as 189 sticks out from the head and is fairly distant from these other sites, which sit lower on the head. But the deconvolution of other sites also gets messy. I think I need a higher spatial regularization penalty to separate all these negative effects into reasonable epitopes.**
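
The spatial argument can be checked directly against the `spatial_distances` table loaded at the top of this notebook. The sketch below assumes only the `site_1`, `site_2`, and `distance` columns shown in that output; the helper name is made up, and taking the minimum over chain pairs is my choice rather than anything `polyclonal` prescribes. The toy distances are invented so the example runs on its own.

```python
import pandas as pd


def distances_from_site(spatial_distances, focal_site, other_sites):
    """Shortest listed distance from `focal_site` to each site in `other_sites`."""
    d = spatial_distances
    # A site pair may be stored in either order, so look in both columns.
    fwd = d[(d["site_1"] == focal_site) & d["site_2"].isin(other_sites)]
    rev = d[(d["site_2"] == focal_site) & d["site_1"].isin(other_sites)]
    pairs = pd.concat(
        [
            fwd[["site_2", "distance"]].rename(columns={"site_2": "other_site"}),
            rev[["site_1", "distance"]].rename(columns={"site_1": "other_site"}),
        ]
    )
    # Several chain combinations can exist per site pair; keep the closest.
    return pairs.groupby("other_site")["distance"].min().sort_values()


# Toy table with invented distances, mimicking the real column layout.
toy_distances = pd.DataFrame(
    {
        "site_1": [189, 189, 189, 189, 186],
        "site_2": [201, 212, 212, 244, 189],
        "distance": [30.0, 25.0, 40.0, 35.0, 8.0],
        "chain_1": ["A", "A", "A", "A", "A"],
        "chain_2": ["A", "A", "B", "A", "A"],
    }
)

nearest = distances_from_site(toy_distances, 189, [201, 212, 244])
print(nearest)
```

On the real data the call would be `distances_from_site(spatial_distances, 189, [201, 212, 244])`, which gives concrete numbers to weigh against the spatial penalty.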

## AUSAB-13

In [41]:
prob_escape_13 = pd.read_csv(
    "results/prob_escape/libA_221027_1_AUSAB-13_1_prob_escape.csv", keep_default_na=False, na_values="nan"
).query(
    "`no-antibody_count` >= no_antibody_count_threshold"
)  # filter for those with sufficient no-antibody counts
assert prob_escape_13.notnull().all().all()

In [42]:
model = polyclonal.Polyclonal(
    n_epitopes=2,
    data_to_fit=prob_escape_13.rename(
        columns={
            "antibody_concentration": "concentration",
            "aa_substitutions_reference": "aa_substitutions",
        }
    ),
    alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
    sites=reference_sites,
    spatial_distances=spatial_distances,
)

# fit model
opt_res = model.fit(
    logfreq=200,
    reg_escape_weight=0.1,
    reg_uniqueness_weight=0,
    reg_uniqueness2_weight=1,
    reg_spatial_weight=0.0,
    reg_spatial2_weight=0.0005,
)

# display results
display(model.activity_wt_barplot())
display(model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, init_floor_at_zero=False))

# First fitting site-level model.
# Starting optimization of 1006 parameters at Wed Jan  4 16:00:37 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.076069  1.6416e+05  1.6415e+05           0           0           0              0               0       11.001
          86      9.0168      848.47      837.85      1.7829           0      3.5556              0         0.15686        5.127
# Successfully finished at Wed Jan  4 16:00:46 2023.
# Starting optimization of 6474 parameters at Wed Jan  4 16:00:46 2023.
        step    time_sec        loss    fit_loss  reg_escape  reg_spread reg_spatial reg_uniqueness reg_uniqueness2 reg_activity
           0    0.098001      1093.7      1006.5      19.331  4.0706e-32      3.5556              0          59.188        5.127
         170      17.423      949.25      920.68      14.825      1.7835      6.2598              0         0.58725       5.

**It's pretty clear that most neutralization activity comes from epitope 1: that cluster of peaks is all in antigenic site A. But the assignment of the minor peaks to epitope 1 vs. 2 also makes a lot of sense. We see slight escape at sites 48, 92, and 201, which are all on the stalk / lower head, while the small peaks at 186 and 189 are in antigenic site B, which is adjacent to antigenic site A, so it makes sense that they group with the major escape sites.**