# Build distance maps from antigenic escape assay data

[Lee et al. 2019](https://doi.org/10.7554/eLife.49324) identified positions in seasonal influenza A/H3N2 hemagglutinin (HA) where mutations allowed specific viruses (e.g., A/Perth/16/2009) to escape the antibodies present in human polyclonal sera. These assays quantify the degree to which each mutation enables antigenic escape with a measure called `mut_diffsel`. The resulting data contain a `mut_diffsel` value for each amino acid mutation from wildtype at each position in HA for each human serum.

To understand whether we can use these data to improve long-term forecasts of H3N2 populations, we developed the following distance maps for use in forecasting:

1. antigenic escape epitope sites: a binary distance map of positions with the strongest immune escape values across all available human sera. This map reflects putative epitope sites akin to the Koel et al. 2013 sites without any consideration of which mutations occur at those sites. This map is the least likely to overfit to a specific serum sample.

2. antigenic escape weighted epitope sites: a floating point distance map for each position in HA reflecting the proportion of total antigenic escape provided by that position (total positive `mut_diffsel` at that site divided by the total positive `mut_diffsel` across all sites, averaged across all available sera). This map avoids using arbitrary thresholds of `mut_diffsel` values to identify epitope sites by weighting all sites across all sera.

3. antigenic escape mut_diffsel: a site- and mutation-specific floating point distance map reflecting the measured effect of any given pair of ancestral and derived amino acids. This map provides distances based directly on the absolute measurements of escape for each mutation at each position. This map is the most likely to overfit to specific serum samples.

Below, we build each map in order of increasing complexity and export the corresponding maps in the format required by the `augur distance` command.
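
As a rough illustration of how a site-based map of this shape can be applied, the sketch below sums the weights of the positions at which two aligned amino acid sequences differ and falls back to the map's `default` weight elsewhere. The function `site_distance`, the toy map, and the toy sequences are hypothetical. This is a simplified picture of the idea, not the `augur distance` implementation, and it assumes map keys are 1-based positions stored as strings.

```python
def site_distance(ancestral, derived, distance_map, gene):
    """Sum per-site weights over positions where two aligned sequences differ."""
    weights = distance_map["map"].get(gene, {})
    default = distance_map["default"]
    total = 0
    for index, (a, d) in enumerate(zip(ancestral, derived)):
        if a != d:
            # Map keys are assumed to be 1-based positions stored as strings.
            total += weights.get(str(index + 1), default)
    return total


# Hypothetical map with two weighted positions in HA1.
toy_map = {
    "name": "toy_epitope_sites",
    "default": 0,
    "map": {"HA1": {"2": 1, "4": 1}},
}

# Positions 2 and 3 differ, but only position 2 carries a weight.
print(site_distance("ACDE", "AGHE", toy_map, "HA1"))
```

With a binary map, the distance is a count of mutated epitope sites; with a floating point map, the same loop yields a weighted sum.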

## Imports and configuration

In [1]:
import json
import pandas as pd

In [2]:
data_url = "https://raw.githubusercontent.com/dms-view/influenza/master/data/HA/Lee2019/Lee2019.csv"

In [3]:
antigenic_escape_distance_map = "../config/distance_maps/h3n2/ha/antigenic_escape_ep.json"

In [4]:
weighted_antigenic_escape_distance_map = "../config/distance_maps/h3n2/ha/weighted_antigenic_escape_ep.json"

In [5]:
valid_conditions = [
    "2009-age-53",
    "2009-age-64",
    "2009-age-65",
    "2010-age-21"
]

## Load data

In [6]:
df = pd.read_csv(data_url)

In [7]:
df.head()

Unnamed: 0,mut_positive mutdiffsel,site_positive diffsel,condition,label_site,wildtype,mutation,mut_mutdiffsel,site_abs diffsel,site_negative diffsel,site_max diffsel,site_min diffsel,site,protein_chain,protein_site
0,3.6671,15.837,2010-age-21,193,F,D,3.6671,17.384,-1.5467,3.6671,-0.52377,208,A,193
1,3.1793,15.837,2010-age-21,193,F,N,3.1793,17.384,-1.5467,3.6671,-0.52377,208,A,193
2,2.082,15.837,2010-age-21,193,F,Q,2.082,17.384,-1.5467,3.6671,-0.52377,208,A,193
3,2.0098,15.837,2010-age-21,193,F,E,2.0098,17.384,-1.5467,3.6671,-0.52377,208,A,193
4,1.096,15.837,2010-age-21,193,F,L,1.096,17.384,-1.5467,3.6671,-0.52377,208,A,193


In [8]:
df = df[df["condition"].isin(valid_conditions)].copy()

In [9]:
df["condition"].value_counts()

2009-age-53    11320
2010-age-21    11320
2009-age-64    11320
2009-age-65    11320
Name: condition, dtype: int64

In [10]:
df.columns

Index(['mut_positive mutdiffsel', 'site_positive diffsel', 'condition',
       'label_site', 'wildtype', 'mutation', 'mut_mutdiffsel',
       'site_abs diffsel', 'site_negative diffsel', 'site_max diffsel',
       'site_min diffsel', 'site', 'protein_chain', 'protein_site'],
      dtype='object')

In [11]:
df["label_site"].drop_duplicates().sort_values().values

array(['(HA2)1', '(HA2)10', '(HA2)100', '(HA2)101', '(HA2)102',
       '(HA2)103', '(HA2)104', '(HA2)105', '(HA2)106', '(HA2)107',
       '(HA2)108', '(HA2)109', '(HA2)11', '(HA2)110', '(HA2)111',
       '(HA2)112', '(HA2)113', '(HA2)114', '(HA2)115', '(HA2)116',
       '(HA2)117', '(HA2)118', '(HA2)119', '(HA2)12', '(HA2)120',
       '(HA2)121', '(HA2)122', '(HA2)123', '(HA2)124', '(HA2)125',
       '(HA2)126', '(HA2)127', '(HA2)128', '(HA2)129', '(HA2)13',
       '(HA2)130', '(HA2)131', '(HA2)132', '(HA2)133', '(HA2)134',
       '(HA2)135', '(HA2)136', '(HA2)137', '(HA2)138', '(HA2)139',
       '(HA2)14', '(HA2)140', '(HA2)141', '(HA2)142', '(HA2)143',
       '(HA2)144', '(HA2)145', '(HA2)146', '(HA2)147', '(HA2)148',
       '(HA2)149', '(HA2)15', '(HA2)150', '(HA2)151', '(HA2)152',
       '(HA2)153', '(HA2)154', '(HA2)155', '(HA2)156', '(HA2)157',
       '(HA2)158', '(HA2)159', '(HA2)16', '(HA2)160', '(HA2)161',
       '(HA2)162', '(HA2)163', '(HA2)164', '(HA2)165', '(HA2)166',
    

In [12]:
def annotate_gene(site):
    if site.startswith("(HA2)"):
        return "HA2"
    elif site.startswith("-"):
        return "SigPep"
    else:
        return "HA1"

In [13]:
def annotate_gene_position(site):
    if site.startswith("(HA2)"):
        position = int(site[5:])
    elif site.startswith("-"):
        position = int(site) + 16 + 1
    else:
        position = int(site)
        
    return position

In [14]:
df["gene"] = df["label_site"].apply(annotate_gene)

In [15]:
df["gene_site"] = df["label_site"].apply(annotate_gene_position)
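
The two annotation functions can be sanity-checked on a few labels. The sketch below combines them into one hypothetical helper, `parse_label_site`, and assumes (as the offset `16 + 1` in the code above implies) a 16-residue signal peptide whose labels run from `-16` to `-1`.

```python
def parse_label_site(label, signal_peptide_length=16):
    """Return (gene, 1-based position within that gene) for a site label."""
    if label.startswith("(HA2)"):
        return "HA2", int(label[len("(HA2)"):])
    if label.startswith("-"):
        # Labels -16..-1 become positions 1..16 of the signal peptide.
        return "SigPep", int(label) + signal_peptide_length + 1
    return "HA1", int(label)


for label in ["-16", "-1", "193", "(HA2)10"]:
    print(label, parse_label_site(label))
```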

In [16]:
df.head()

Unnamed: 0,mut_positive mutdiffsel,site_positive diffsel,condition,label_site,wildtype,mutation,mut_mutdiffsel,site_abs diffsel,site_negative diffsel,site_max diffsel,site_min diffsel,site,protein_chain,protein_site,gene,gene_site
0,3.6671,15.837,2010-age-21,193,F,D,3.6671,17.384,-1.5467,3.6671,-0.52377,208,A,193,HA1,193
1,3.1793,15.837,2010-age-21,193,F,N,3.1793,17.384,-1.5467,3.6671,-0.52377,208,A,193,HA1,193
2,2.082,15.837,2010-age-21,193,F,Q,2.082,17.384,-1.5467,3.6671,-0.52377,208,A,193,HA1,193
3,2.0098,15.837,2010-age-21,193,F,E,2.0098,17.384,-1.5467,3.6671,-0.52377,208,A,193,HA1,193
4,1.096,15.837,2010-age-21,193,F,L,1.096,17.384,-1.5467,3.6671,-0.52377,208,A,193,HA1,193


## Identify antigenic escape epitope sites

Calculate site-specific total positive mutdiffsel values, summarize their distribution per serum (mean and standard deviation), and identify positions whose values exceed the per-serum mean by more than 4 standard deviations.

In [17]:
site_columns = [
    "condition",
    "site",
    "gene",
    "gene_site",
    "site_positive diffsel"
]

In [18]:
site_specific_df = (
    df.loc[:, site_columns]
    .drop_duplicates()
    .sort_values(["condition", "site"])
    .reset_index(drop=True)
)

In [19]:
site_specific_df.head()

Unnamed: 0,condition,site,gene,gene_site,site_positive diffsel
0,2009-age-53,0,SigPep,1,0.25444
1,2009-age-53,1,SigPep,2,3.0482
2,2009-age-53,2,SigPep,3,1.6275
3,2009-age-53,3,SigPep,4,0.95476
4,2009-age-53,4,SigPep,5,0.90589


In [20]:
site_specific_df.shape

(2264, 5)

In [21]:
condition_specific_distributions = site_specific_df.groupby("condition").aggregate({"site_positive diffsel": ["mean", "std"]})

In [22]:
condition_specific_distributions["upper_threshold"] = (
    condition_specific_distributions["site_positive diffsel", "mean"] + 4 * condition_specific_distributions["site_positive diffsel", "std"]
)

In [23]:
upper_thresholds = condition_specific_distributions.reset_index().loc[:, ["condition", "upper_threshold"]]

In [24]:
upper_thresholds.columns = upper_thresholds.columns.droplevel(level=1)

In [25]:
upper_thresholds

Unnamed: 0,condition,upper_threshold
0,2009-age-53,5.351155
1,2009-age-64,6.556842
2,2009-age-65,5.352842
3,2010-age-21,7.004616


In [26]:
site_specific_df = site_specific_df.merge(
    upper_thresholds,
    on="condition"
)

In [27]:
site_specific_df[site_specific_df["site_positive diffsel"] > site_specific_df["upper_threshold"]]

Unnamed: 0,condition,site,gene,gene_site,site_positive diffsel,upper_threshold
172,2009-age-53,172,HA1,157,16.382,5.351155
175,2009-age-53,175,HA1,160,8.5167,5.351155
740,2009-age-64,174,HA1,159,27.283,6.556842
803,2009-age-64,237,HA1,222,11.799,6.556842
825,2009-age-64,259,HA1,244,7.0883,6.556842
1306,2009-age-65,174,HA1,159,10.816,5.352842
1307,2009-age-65,175,HA1,160,7.1958,5.352842
1340,2009-age-65,208,HA1,193,14.185,5.352842
1857,2010-age-21,159,HA1,144,9.9572,7.004616
1872,2010-age-21,174,HA1,159,11.617,7.004616
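
The thresholding rule can be reproduced on toy data with the standard library alone. The values and serum names below are made up for illustration, and `sites_above_threshold` is a hypothetical helper. `statistics.stdev` computes the sample standard deviation, which matches the default (`ddof=1`) of the pandas `std` aggregation used above.

```python
from statistics import mean, stdev


def sites_above_threshold(values, n_std=4):
    """Return the threshold and the indices of values that exceed it."""
    threshold = mean(values) + n_std * stdev(values)
    flagged = [index for index, value in enumerate(values) if value > threshold]
    return threshold, flagged


# Toy site-level values for two hypothetical sera (not assay data).
toy_values = {
    "serum-a": [1.0] * 10 + [20.0] + [1.0] * 14,  # one strong outlier
    "serum-b": [1.0, 2.0] * 12 + [1.0],           # no outliers
}

for serum, values in toy_values.items():
    threshold, flagged = sites_above_threshold(values)
    print(serum, round(threshold, 2), flagged)
```

Because the threshold is computed per serum, a site is flagged relative to the spread of its own serum rather than to a global cutoff.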


In [28]:
antigenic_epitope_sites = site_specific_df.loc[
    site_specific_df["site_positive diffsel"] > site_specific_df["upper_threshold"],
    ["gene", "gene_site"]
].drop_duplicates().sort_values(["gene", "gene_site"])

In [29]:
antigenic_epitope_sites

Unnamed: 0,gene,gene_site
1857,HA1,144
172,HA1,157
740,HA1,159
175,HA1,160
1340,HA1,193
803,HA1,222
825,HA1,244


In [30]:
# HA1 sites from Koel et al. 2013, kept for comparison with the escape sites above.
koel_sites = [145, 155, 156, 158, 159, 189, 193]
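
One way to use `koel_sites` is to compare it with the escape sites identified above. The sketch below copies the seven HA1 positions from the `antigenic_epitope_sites` output and uses set operations; the variable names other than `koel_sites` are introduced here for illustration.

```python
koel_sites = {145, 155, 156, 158, 159, 189, 193}
# HA1 positions copied from the antigenic_epitope_sites output above.
escape_sites = {144, 157, 159, 160, 193, 222, 244}

shared = sorted(koel_sites & escape_sites)
only_escape = sorted(escape_sites - koel_sites)
only_koel = sorted(koel_sites - escape_sites)

print("shared:", shared)
print("escape only:", only_escape)
print("Koel only:", only_koel)
```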

In [31]:
antigenic_escape_map = {
    "name": "antigenic_escape_epitope_sites",
    "description": "Sites with greater than 4 stddev positive site-level mutdiffsel per serum across four human sera from Lee et al. 2019",
    "default": 0,
    "map": {}
}

In [32]:
for record in antigenic_epitope_sites.to_dict("records"):
    if record["gene"] not in antigenic_escape_map["map"]:
        antigenic_escape_map["map"][record["gene"]] = {}
        
    antigenic_escape_map["map"][record["gene"]][str(record["gene_site"])] = 1

In [33]:
antigenic_escape_map

{'name': 'antigenic_escape_epitope_sites',
 'description': 'Sites with greater than 4 stddev positive site-level mutdiffsel per serum across four human sera from Lee et al. 2019',
 'default': 0,
 'map': {'HA1': {'144': 1,
   '157': 1,
   '159': 1,
   '160': 1,
   '193': 1,
   '222': 1,
   '244': 1}}}

In [34]:
with open(antigenic_escape_distance_map, "w") as oh:
    json.dump(antigenic_escape_map, oh, indent=4, sort_keys=True)
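
JSON object keys are always strings, which is why the positions are converted with `str()` before they are stored. A small round trip on a toy map (hypothetical values) shows one side effect of `sort_keys=True`: keys are ordered as strings, not as numbers.

```python
import json

toy_map = {
    "name": "toy_epitope_sites",
    "default": 0,
    "map": {"HA1": {"9": 1, "10": 1, "144": 1}},
}

# Serialize with the same options as above, then read the text back.
text = json.dumps(toy_map, indent=4, sort_keys=True)
loaded = json.loads(text)

# Keys come back in lexicographic order: "10" sorts before "9".
print(list(loaded["map"]["HA1"]))
```

The ordering only affects how the file reads; lookups by position are unaffected.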

## Identify weighted antigenic escape epitope sites

Instead of using an arbitrary threshold (e.g., 4 standard deviations above the mean), calculate the average proportion of the total positive mutdiffsel value at each position in HA across all sera. The resulting distance map weights each position in HA by its relative contribution to antigenic escape.

In [35]:
total_positive_diffsel = site_specific_df.groupby("condition").aggregate({"site_positive diffsel": "sum"}).reset_index()

In [36]:
total_positive_diffsel = total_positive_diffsel.rename(columns={"site_positive diffsel": "total_positivediffsel"})

In [37]:
total_positive_diffsel

Unnamed: 0,condition,total_positivediffsel
0,2009-age-53,586.67057
1,2009-age-64,430.847894
2,2009-age-65,557.166885
3,2010-age-21,667.777515


In [38]:
site_specific_df = site_specific_df.merge(
    total_positive_diffsel,
    on="condition"
)

In [39]:
site_specific_df.head()

Unnamed: 0,condition,site,gene,gene_site,site_positive diffsel,upper_threshold,total_positivediffsel
0,2009-age-53,0,SigPep,1,0.25444,5.351155,586.67057
1,2009-age-53,1,SigPep,2,3.0482,5.351155,586.67057
2,2009-age-53,2,SigPep,3,1.6275,5.351155,586.67057
3,2009-age-53,3,SigPep,4,0.95476,5.351155,586.67057
4,2009-age-53,4,SigPep,5,0.90589,5.351155,586.67057


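Sum positive diffsel values per site across conditions and divide by the sum of the per-condition totals to get each site's proportion. This approach averages the antigenic escape effect per site across all conditions (sera), with each serum weighted by its total positive diffsel because the sums are pooled before dividing.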

In [40]:
total_site_specific_positivediffsel = site_specific_df.groupby(["gene", "gene_site"]).aggregate({
    "site_positive diffsel": "sum",
    "total_positivediffsel": "sum"
}).reset_index()

In [41]:
total_site_specific_positivediffsel["proportion_of_positivediffsel"] = (
    total_site_specific_positivediffsel["site_positive diffsel"] / total_site_specific_positivediffsel["total_positivediffsel"]
)

In [42]:
total_site_specific_positivediffsel

Unnamed: 0,gene,gene_site,site_positive diffsel,total_positivediffsel,proportion_of_positivediffsel
0,HA1,1,6.49400,2242.462864,0.002896
1,HA1,2,6.62746,2242.462864,0.002955
2,HA1,3,7.46460,2242.462864,0.003329
3,HA1,4,6.44420,2242.462864,0.002874
4,HA1,5,8.00073,2242.462864,0.003568
...,...,...,...,...,...
561,SigPep,12,3.88674,2242.462864,0.001733
562,SigPep,13,5.01756,2242.462864,0.002238
563,SigPep,14,6.13612,2242.462864,0.002736
564,SigPep,15,5.14104,2242.462864,0.002293


In [43]:
total_site_specific_positivediffsel.query("gene == 'HA1' & gene_site == 160")

Unnamed: 0,gene,gene_site,site_positive diffsel,total_positivediffsel,proportion_of_positivediffsel
159,HA1,160,26.2304,2242.462864,0.011697
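
The arithmetic behind this row can be checked by hand from the per-serum totals printed earlier. The sketch below also contrasts the pooled ratio used here with an unweighted mean of per-serum ratios; the second half uses made-up numbers purely to show that the two can differ.

```python
# Per-serum totals of positive diffsel, copied from the output above.
totals = [586.67057, 430.847894, 557.166885, 667.777515]
pooled_denominator = sum(totals)

# Site HA1 160: positive diffsel summed across the four sera.
proportion_160 = 26.2304 / pooled_denominator
print(round(pooled_denominator, 6), round(proportion_160, 6))

# Toy contrast with hypothetical values for one site in two sera.
site_values = [2.0, 6.0]
serum_totals = [10.0, 60.0]
pooled = sum(site_values) / sum(serum_totals)
mean_of_ratios = sum(s / t for s, t in zip(site_values, serum_totals)) / len(site_values)
print(round(pooled, 4), round(mean_of_ratios, 4))
```

The pooled ratio gives more influence to sera with larger totals, whereas the mean of ratios treats every serum equally.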


In [44]:
weighted_antigenic_escape_map = {
    "name": "antigenic_escape_weighted_epitope_sites",
    "description": "Average proportion of positive site-level mutdiffsel per site across four human sera from Lee et al. 2019",
    "default": 0.0,
    "map": {}
}

In [45]:
for record in total_site_specific_positivediffsel.to_dict("records"):
    if record["gene"] not in weighted_antigenic_escape_map["map"]:
        weighted_antigenic_escape_map["map"][record["gene"]] = {}
        
    weighted_antigenic_escape_map["map"][record["gene"]][str(record["gene_site"])] = record["proportion_of_positivediffsel"]

In [46]:
weighted_antigenic_escape_map

{'name': 'antigenic_escape_weighted_epitope_sites',
 'description': 'Average proportion of positive site-level mutdiffsel per site across four human sera from Lee et al. 2019',
 'default': 0.0,
 'map': {'HA1': {'1': 0.002895923096670626,
   '2': 0.0029554380175948114,
   '3': 0.003328750777241693,
   '4': 0.002873715371044787,
   '5': 0.0035678316595666123,
   '6': 0.004024369877718519,
   '7': 0.003497672192243053,
   '8': 0.003872126553849817,
   '9': 0.003224668786055533,
   '10': 0.001976153127000469,
   '11': 0.000530299974784814,
   '12': 0.0016786922366372635,
   '13': 0.0010597916507882046,
   '14': 0.0008310862267944429,
   '15': 0.001329119000558485,
   '16': 0.0004593030353984978,
   '17': 0.0010853468476796741,
   '18': 0.0007279808404099009,
   '19': 0.0015733370917028749,
   '20': 0.00018283468888742791,
   '21': 0.0024813030754831534,
   '22': 0.0009615731145620724,
   '23': 0.0004630105661385224,
   '24': 0.0014904817619582793,
   '25': 0.0022310648177534453,
   '26': 0

In [47]:
with open(weighted_antigenic_escape_distance_map, "w") as oh:
    json.dump(weighted_antigenic_escape_map, oh, indent=4, sort_keys=True)
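
Because every site is divided by the same pooled total, the weights should sum to about 1 when each site has a value for every serum. The sketch below checks that property on a toy map; `check_weighted_map` is a hypothetical helper and the weights are made up.

```python
import math


def check_weighted_map(distance_map, abs_tol=1e-9):
    """Return True when all weights are non-negative and sum to about 1."""
    weights = [
        weight
        for gene_weights in distance_map["map"].values()
        for weight in gene_weights.values()
    ]
    if any(weight < 0 for weight in weights):
        return False
    return math.isclose(sum(weights), 1.0, abs_tol=abs_tol)


# Hypothetical weights spread across the three annotated genes.
toy_map = {
    "name": "toy_weighted_sites",
    "default": 0.0,
    "map": {"SigPep": {"1": 0.1}, "HA1": {"1": 0.6, "2": 0.25}, "HA2": {"1": 0.05}},
}

print(check_weighted_map(toy_map))
```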