# Testing single-concentration models for AUSAB-05
Sera AUSAB-05 and AUSAB-02 are less potent, and at their IC99 concentrations they also neutralize around 40% of the H6 standard. This means we can't accurately fit models using concentrations at or above the IC99 as usual. This may be a more significant issue for multi-selection models, because the apparent average prob escape **increases** as the selection becomes more potent (higher serum concentration). Here, I fit a model on each single concentration to see what the data look like, and also include AUSAB-11 as a comparison.
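
A rough way to see why neutralizing the standard matters, assuming prob escape is each variant's count normalized to the neutralization standard (an assumption here, though it matches the count columns in the prob escape tables used in this notebook). The symbols $n$, $s$, and $f$ are introduced only for this sketch:

```latex
\begin{aligned}
p_v &= \frac{n_v^{\text{ab}} / s^{\text{ab}}}{n_v^{\text{no-ab}} / s^{\text{no-ab}}}
  && \text{(assumed definition; } n = \text{variant counts, } s = \text{standard counts)} \\
\tilde{p}_v &= \frac{n_v^{\text{ab}} / \left[(1 - f)\, s^{\text{ab}}\right]}{n_v^{\text{no-ab}} / s^{\text{no-ab}}}
  = \frac{p_v}{1 - f}
  && \text{(a fraction } f \text{ of the standard is neutralized)} \\
f \approx 0.4 \;&\Rightarrow\; \tilde{p}_v \approx 1.67\, p_v
\end{aligned}
```

Under this assumption every variant's apparent escape is inflated by the same factor, and since $f$ should grow with serum concentration, the inflation would be largest in the most potent selections.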

In [1]:
import pickle

import altair as alt

import pandas as pd

import polyclonal

import warnings
warnings.filterwarnings('ignore')

from IPython.utils import io

In [2]:
import os
os.chdir('../../')

In [3]:
# set up function for mean prob escape chart to avoid clutter from large block of code

def plot_avg_escape(prob_escape):
    max_aa_subs = 4  # group if >= this many substitutions
    
    mean_prob_escape = (
        prob_escape.assign(
            n_subs=lambda x: (
                x["aa_substitutions_reference"]
                .str.split()
                .map(len)
                .clip(upper=max_aa_subs)
                .map(lambda n: str(n) if n < max_aa_subs else f">{max_aa_subs - 1}")
            )
        )
        .groupby(["antibody_concentration", "n_subs"], as_index=False)
        .aggregate({"prob_escape": "mean", "prob_escape_uncensored": "mean"})
        .rename(
            columns={
                "prob_escape": "censored to [0, 1]",
                "prob_escape_uncensored": "not censored",
            }
        )
        .melt(
            id_vars=["antibody_concentration", "n_subs"],
            var_name="censored",
            value_name="probability escape",
        )
    )

    mean_prob_escape_chart = (
        alt.Chart(mean_prob_escape)
        .encode(
            x=alt.X("antibody_concentration"),
            y=alt.Y(
                "probability escape",
                scale=alt.Scale(type="symlog", constant=0.05),
            ),
            column=alt.Column("censored", title=None),
            color=alt.Color("n_subs", title="n substitutions"),
            tooltip=[
                alt.Tooltip(c, format=".3g") if mean_prob_escape[c].dtype == float else c
                for c in mean_prob_escape.columns
            ],
        )
        .mark_line(point=True, size=0.5)
        .properties(width=200, height=125)
        .configure_axis(grid=False)
    )

    return mean_prob_escape_chart
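
As a quick sanity check of the `n_subs` binning used in `plot_avg_escape`, here is a minimal sketch that runs the same chain on a few example substitution strings. The `toy_subs` series is only for illustration and is not part of the analysis.

```python
import pandas as pd

max_aa_subs = 4  # same threshold as in plot_avg_escape

# Example space-delimited substitution strings, for illustration only
toy_subs = pd.Series(
    [
        "K278I",
        "R92S V347M R383S",
        "Q75M A163S S199A L367H",
        "N8E N38M T160Q Q173D R222L G275I D419E",
    ]
)

n_subs = (
    toy_subs.str.split()       # one list of substitutions per variant
    .map(len)                  # count substitutions
    .clip(upper=max_aa_subs)   # cap the count at the threshold
    .map(lambda n: str(n) if n < max_aa_subs else f">{max_aa_subs - 1}")
)

print(n_subs.tolist())
```

Variants with 4 or more substitutions all end up in the single `>3` group, so the chart shows at most five lines per panel.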

In [4]:
def generate_model(
    prob_escape_df,
    n_epitopes=1
):
    """Fit a Polyclonal model to `prob_escape_df` and return its mutation-escape plot."""

    model = polyclonal.Polyclonal(
        n_epitopes=n_epitopes,
        data_to_fit=prob_escape_df.rename(
            columns={
                "antibody_concentration": "concentration",
                "aa_substitutions_reference": "aa_substitutions",
            }
        ),
        alphabet=polyclonal.AAS_WITHSTOP_WITHGAP,
    )

    # fit model, suppressing output text to avoid clutter in notebook
    with io.capture_output() as captured:
        opt_res = model.fit(
            logfreq=200,
            reg_escape_weight=0.1,
        )

    mut_escape_plot = model.mut_escape_plot(addtl_slider_stats={"times_seen": 3}, init_floor_at_zero=False)
    return mut_escape_plot

## Get AUSAB-05 prob escape data

In [5]:
prob_escape_05 = pd.read_csv(
    "results/prob_escape/libA_221223_1_AUSAB-05_1_prob_escape.csv", keep_default_na=False, na_values="nan"
).query(
    "`no-antibody_count` >= no_antibody_count_threshold"
)  # filter for those with sufficient no-antibody counts
assert prob_escape_05.notnull().all().all()
prob_escape_05.head()

Unnamed: 0,library,antibody_sample,no-antibody_sample,aa_substitutions_sequential,n_aa_substitutions,barcode,prob_escape,prob_escape_uncensored,antibody_count,no-antibody_count,antibody_neut_standard_count,no-antibody_neut_standard_count,total_no_antibody_count,no_antibody_count_threshold,aa_substitutions_reference,antibody,antibody_concentration
0,libA,221223_1_antibody_AUSAB-05_0.056_1,221223_1_no-antibody_control_1,K297I,1,ATAACACAAAAAAGTA,0.0582,0.0582,51938,339935,74631,28408,10675748,21,K278I,AUSAB-05,0.056
1,libA,221223_1_antibody_AUSAB-05_0.056_1,221223_1_no-antibody_control_1,R111S V366M R402S,3,TATCTACCTAACGAAA,0.1608,0.1608,36366,86104,74631,28408,10675748,21,R92S V347M R383S,AUSAB-05,0.056
2,libA,221223_1_antibody_AUSAB-05_0.056_1,221223_1_no-antibody_control_1,L89I L263H Q520R,3,CTCTTTAAAATCCATT,0.2285,0.2285,29107,48487,74631,28408,10675748,21,L70I L244H Q501R,AUSAB-05,0.056
3,libA,221223_1_antibody_AUSAB-05_0.056_1,221223_1_no-antibody_control_1,Q94M A182S S218A L386H,4,ACAGAATACCTTAACG,0.2638,0.2638,24177,34880,74631,28408,10675748,21,Q75M A163S S199A L367H,AUSAB-05,0.056
4,libA,221223_1_antibody_AUSAB-05_0.056_1,221223_1_no-antibody_control_1,R220G N235M L263Q,3,CTAACCAGTTAGACAC,0.192,0.192,19784,39213,74631,28408,10675748,21,R201G N216M L244Q,AUSAB-05,0.056
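
The `prob_escape_uncensored` values above look consistent with normalizing each variant's counts to the neutralization standard in both samples. This is a sketch of that check on the first three rows. The formula is my assumption about how the column was computed, inferred from the column names, not something confirmed by the pipeline code.

```python
import pandas as pd

# Count columns copied from the first three rows shown above
rows = pd.DataFrame(
    {
        "antibody_count": [51938, 36366, 29107],
        "no-antibody_count": [339935, 86104, 48487],
        "antibody_neut_standard_count": [74631, 74631, 74631],
        "no-antibody_neut_standard_count": [28408, 28408, 28408],
        "prob_escape_uncensored": [0.0582, 0.1608, 0.2285],
    }
)

# Assumed formula: variant-to-standard ratio with antibody,
# divided by the same ratio without antibody
recomputed = (
    rows["antibody_count"] / rows["antibody_neut_standard_count"]
) / (rows["no-antibody_count"] / rows["no-antibody_neut_standard_count"])

# Largest disagreement with the tabulated (4-decimal) values
max_abs_diff = (recomputed - rows["prob_escape_uncensored"]).abs().max()
print(recomputed.round(4).tolist(), max_abs_diff)
```

If the formula holds, any drop in `antibody_neut_standard_count` caused by the serum neutralizing the standard pushes every variant's value up.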


In [6]:
plot_avg_escape(prob_escape_05)

In [7]:
selection_df_05 = (
    prob_escape_05.groupby("antibody_concentration")
    .aggregate(n_variants=pd.NamedAgg("barcode", "nunique"))
    .reset_index()
)

selections_05 = selection_df_05['antibody_concentration'].tolist()

selections_05

[0.0049, 0.0074, 0.0111, 0.0166, 0.0249, 0.0373, 0.056]

In [8]:
full_model = generate_model(prob_escape_05)
full_model

## Model escape on each single concentration, and visualize escape plots for each
These plots are displayed in order, starting from the lowest concentration and going up in potency.

In [9]:
escape_plots_05 = []

for selection in selections_05:
    single_conc = prob_escape_05.loc[prob_escape_05['antibody_concentration'] == selection]
    single_conc_plot = generate_model(single_conc)
    escape_plots_05.append(single_conc_plot)

In [10]:
escape_plots_05[0]

In [11]:
escape_plots_05[1]

In [12]:
escape_plots_05[2]

In [13]:
escape_plots_05[3]

In [14]:
escape_plots_05[4]

In [15]:
escape_plots_05[5]

In [16]:
escape_plots_05[6]

### Main takeaways:
The model generated with all 7 concentrations looks most similar to the lowest-potency single-concentration model. I'm not sure whether that is partly because this concentration has the least off-target H6 neutralization.

Generally, lower concentrations have more signal from sensitizing mutations. The lowest concentration also has the greatest overall escape magnitudes, which makes sense - more variants are getting through in low-potency selections. The overall magnitude gradually decreases through concentration #4, then increases again through 5, 6, and 7. In other words, it follows the same trend we see with the overall avg_prob_escape.

Positive escape mutations are most clearly resolved in concentrations 2, 3, and 4. Note that 4 is the IC99, and I typically use concentrations 3, 4, and 5 (the IC99 plus one 1.5-fold step above and below it) for modeling. So concentration 5 would probably be more helpful if the standard weren't being neutralized here. Main escape sites in models 2, 3, and 4 are:
* 83: antigenic site E
* 138, 144: antigenic site A
* 160, 192/193: antigenic site B
* 222/224: near antigenic site D, also targeted by AUSAB-11

A model generated on the combination of these 3 selections looks like:

In [17]:
selections_05

[0.0049, 0.0074, 0.0111, 0.0166, 0.0249, 0.0373, 0.056]
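
A small sketch for working with this list: it checks that consecutive concentrations are roughly 1.5-fold apart, and picks selections by their 1-based position instead of hard-coding float values. `pick_by_position` is a hypothetical helper and `toy` is a stand-in data frame, neither is part of the analysis.

```python
import numpy as np
import pandas as pd

selections = [0.0049, 0.0074, 0.0111, 0.0166, 0.0249, 0.0373, 0.056]

# Ratio between each concentration and the one below it
fold_steps = [hi / lo for lo, hi in zip(selections, selections[1:])]


def pick_by_position(df, selections, positions, col="antibody_concentration"):
    """Hypothetical helper: keep rows whose concentration matches the
    1-based `positions` in the sorted `selections` list."""
    wanted = [selections[i - 1] for i in positions]
    # np.isclose tolerates tiny float differences that `==` would reject
    mask = np.isclose(df[col].to_numpy()[:, None], wanted).any(axis=1)
    return df[mask]


# Toy frame with one row per concentration, for illustration only
toy = pd.DataFrame({"antibody_concentration": selections, "row_id": range(7)})
picked = pick_by_position(toy, selections, positions=[2, 3, 4])

print([round(s, 2) for s in fold_steps])
print(picked["antibody_concentration"].tolist())
```

Selecting by position keeps the code in line with how the concentrations are referred to in the text (concentration 2, 3, 4, and so on).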

In [18]:
# keep concentrations 2, 3, and 4 (the IC99 and the two concentrations below it)
prob_escape_05_filtered = prob_escape_05.loc[
    prob_escape_05['antibody_concentration'].isin([0.0074, 0.0111, 0.0166])
]
generate_model(prob_escape_05_filtered)

**So we do see those sites still popping up in the model - 94, 159, and 192/193. But the magnitude is very low, and they're on par with the general stalk mutations that I see in every set of selections.**

For reference, here's what the avg prob escape plot looks like just for these 3 (highest concentration here is the IC99) - 

In [19]:
plot_avg_escape(prob_escape_05_filtered)

## Repeat with AUSAB-11 for comparison
This is another less-potent serum, but did not neutralize H6 to the same extent as AUSAB-05. I was able to resolve clear escape + sensitizing mutations with a multi-selection model. So I'm generating the same set of single-concentration models to show what these limited escape plots look like for a workable serum.

In [20]:
prob_escape_11 = pd.read_csv(
    "results/prob_escape/libA_221223_1_AUSAB-11_1_prob_escape.csv", keep_default_na=False, na_values="nan"
).query(
    "`no-antibody_count` >= no_antibody_count_threshold"
)  # filter for those with sufficient no-antibody counts
assert prob_escape_11.notnull().all().all()
prob_escape_11.head()

Unnamed: 0,library,antibody_sample,no-antibody_sample,aa_substitutions_sequential,n_aa_substitutions,barcode,prob_escape,prob_escape_uncensored,antibody_count,no-antibody_count,antibody_neut_standard_count,no-antibody_neut_standard_count,total_no_antibody_count,no_antibody_count_threshold,aa_substitutions_reference,antibody,antibody_concentration
0,libA,221223_1_antibody_AUSAB-11_0.0338_1,221223_1_no-antibody_control_1,N27E N57M T179Q Q192D R241L G294I D438E,7,TACCTATGAAAAACAT,1.0,7.9063,93000,1665,200696,28408,10675748,21,N8E N38M T160Q Q173D R222L G275I D419E,AUSAB-11,0.0338
1,libA,221223_1_antibody_AUSAB-11_0.0338_1,221223_1_no-antibody_control_1,K297I,1,ATAACACAAAAAAGTA,0.0278,0.0278,66658,339935,200696,28408,10675748,21,K278I,AUSAB-11,0.0338
2,libA,221223_1_antibody_AUSAB-11_0.0338_1,221223_1_no-antibody_control_1,R111S V366M R402S,3,TATCTACCTAACGAAA,0.0601,0.0601,36536,86104,200696,28408,10675748,21,R92S V347M R383S,AUSAB-11,0.0338
3,libA,221223_1_antibody_AUSAB-11_0.0338_1,221223_1_no-antibody_control_1,Y113F N141E K154S A182E L263I Q382T,6,AAGACCAAATTACCCA,0.1836,0.1836,31122,23991,200696,28408,10675748,21,Y94F N122E K135S A163E L244I Q363T,AUSAB-11,0.0338
4,libA,221223_1_antibody_AUSAB-11_0.0338_1,221223_1_no-antibody_control_1,L89I L263H Q520R,3,CTCTTTAAAATCCATT,0.0803,0.0803,27515,48487,200696,28408,10675748,21,L70I L244H Q501R,AUSAB-11,0.0338
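
The first row above shows the difference between the two prob escape columns: `prob_escape_uncensored` is 7.9063 while `prob_escape` is 1.0. Here is a minimal sketch of that censoring, assuming the uncensored value is the standard-normalized count ratio and the censored value is simply that ratio clipped to [0, 1]. Both are assumptions based on the column names and the "censored to [0, 1]" label used in the plotting function.

```python
import pandas as pd

# Counts copied from the first two AUSAB-11 rows shown above
rows = pd.DataFrame(
    {
        "antibody_count": [93000, 66658],
        "no-antibody_count": [1665, 339935],
        "antibody_neut_standard_count": [200696, 200696],
        "no-antibody_neut_standard_count": [28408, 28408],
    }
)

# Assumed uncensored value: variant-to-standard ratio with antibody,
# divided by the same ratio without antibody
uncensored = (
    rows["antibody_count"] / rows["antibody_neut_standard_count"]
) / (rows["no-antibody_count"] / rows["no-antibody_neut_standard_count"])

# Assumed censoring: clip to the valid probability range
censored = uncensored.clip(lower=0, upper=1)

print(uncensored.tolist(), censored.tolist())
```

Values far above 1 like this one come from variants with low no-antibody counts (1665 here), which is why the two panels of the avg prob escape chart can diverge.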


In [21]:
selection_df_11 = (
    prob_escape_11.groupby("antibody_concentration")
    .aggregate(n_variants=pd.NamedAgg("barcode", "nunique"))
    .reset_index()
)

selections_11 = selection_df_11['antibody_concentration'].tolist()

selections_11

[0.003, 0.0045, 0.0067, 0.01, 0.015, 0.0225, 0.0338]
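
The IC99 for this serum, 0.00845, doesn't coincide with any tested concentration. A quick sketch to find which two selections bracket it, using only the standard library and assuming the IC99 falls inside the tested range:

```python
from bisect import bisect_left

selections_11 = [0.003, 0.0045, 0.0067, 0.01, 0.015, 0.0225, 0.0338]
ic99 = 0.00845  # IC99 quoted for AUSAB-11 in this notebook

# Index of the first tested concentration that is >= the IC99
# (assumes the IC99 is strictly inside the tested range)
idx = bisect_left(selections_11, ic99)

# (1-based position, concentration) of the neighbours on each side
below = (idx, selections_11[idx - 1])
above = (idx + 1, selections_11[idx])

print(
    f"IC99 lies between concentration {below[0]} ({below[1]}) "
    f"and concentration {above[0]} ({above[1]})"
)
```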

**First, here's what the model looks like on a set of 3 concentrations roughly equivalent to the AUSAB-05 set (the IC99 and the 2 concentrations below it). Note that IC99 = 0.00845 for this serum, which sits between concentrations 3 and 4, so I went with the lower set (concentrations 1, 2, and 3) here -**

In [22]:
# keep concentrations 1, 2, and 3
prob_escape_11_low3 = prob_escape_11.loc[
    prob_escape_11['antibody_concentration'].isin([0.003, 0.0045, 0.0067])
]
generate_model(prob_escape_11_low3)

**This looks similarly noisy to the AUSAB-05 model above - a few sites pop out, but there's a lot of signal across the protein. Now look at a model if we drop the two lowest concentrations and include the next two higher ones (concentrations 3, 4, and 5) -**

In [23]:
# keep concentrations 3, 4, and 5
prob_escape_11_higher3 = prob_escape_11.loc[
    prob_escape_11['antibody_concentration'].isin([0.0067, 0.01, 0.015])
]
generate_model(prob_escape_11_higher3)

**The general shape of the model doesn't really change, but escape and sensitizing mutations are resolved more clearly against background signal.**

### Generate single-selection models and visualize for comparison

In [24]:
escape_plots_11 = []

for selection in selections_11:
    single_conc = prob_escape_11.loc[prob_escape_11['antibody_concentration'] == selection]
    single_conc_plot = generate_model(single_conc)
    escape_plots_11.append(single_conc_plot)

In [25]:
escape_plots_11[0]

In [26]:
escape_plots_11[1]

In [27]:
escape_plots_11[2]

In [28]:
escape_plots_11[3]

In [29]:
escape_plots_11[4]

In [30]:
escape_plots_11[5]

In [31]:
escape_plots_11[6]

### Main takeaways
Similar to AUSAB-05, the low-potency models have more signal from sensitizing mutations. Positive escape sites start to dominate at concentrations 3, 4, and 5. And then we see diminishing returns at concentrations 6 and 7. These are the concentrations where we get more significant H6 neutralization and start to see an uptick in avg_prob_escape -

In [32]:
plot_avg_escape(prob_escape_11)

So **overall**, increasing the concentration yields clearer resolution of major escape sites. Incorporating these more potent selections into the model helps bump up the magnitude of these positive escape sites, and they become easier to resolve over background signal. But once the serum also begins to neutralize H6, even models fit on just these single concentrations become noisy and uninterpretable.

What this means for my experiments is that it's worth trying to manually normalize selections for AUSAB-05 (and probably AUSAB-02 once I run it), so that we can get data from a higher-potency selection.