## What are the salient differences between strains?

This notebook addresses a [question from Tom Wenseleers](https://twitter.com/TWenseleers/status/1438780125479329792) about the salient differences between two strains, for example between B.1.617.1 and the closely related B.1.617.2 and B.1.617.3. You should be able to run this notebook right after git cloning the repository, and use it to explore the salient differences between other pairs of strains.

First load the precomputed data.

In [1]:
import pandas as pd
strains_df = pd.read_csv("paper/strains.tsv", sep="\t")
mutations_df = pd.read_csv("paper/mutations.tsv", sep="\t")

Convert to dictionaries for easier use.

In [2]:
mutations_per_strain = {
    strain: frozenset(mutations.split(","))
    for strain, mutations in zip(strains_df["strain"], strains_df["mutations"])
}
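
The comprehension above assumes every cell of the `mutations` column is a non-empty comma-separated string. As a hedged sketch of how one might guard against empty or missing cells, the hypothetical helper `parse_mutations` below (not part of the original notebook) returns an empty set in those cases. Plain `"".split(",")` would otherwise yield `[""]`, and pandas typically reads an empty cell as a float NaN, which has no `.split` method.

```python
import math

def parse_mutations(cell):
    """Hypothetical helper: turn a comma-separated cell into a frozenset."""
    # pandas represents a missing cell as float NaN; treat it as "no mutations".
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return frozenset()
    # Drop empty tokens so that "" maps to an empty set rather than {""}.
    return frozenset(token.strip() for token in cell.split(",") if token.strip())

# Illustrative cells, not rows read from strains.tsv.
print(sorted(parse_mutations("S:E484Q,N:A220V")))
print(parse_mutations(""))
print(parse_mutations(float("nan")))
```

Whether such a guard is needed depends on the contents of `paper/strains.tsv`; if every strain lists at least one mutation, the original comprehension is sufficient.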

In [3]:
effect_size = dict(zip(mutations_df["mutation"], mutations_df["Δ log R"]))
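
To read the Δ log R values, it can help to convert them to fold changes. The sketch below assumes, as the column name suggests, that Δ log R is a mutation's additive contribution to the natural log of a strain's relative growth rate R, so that `exp(Δ log R)` is a multiplicative factor. This interpretation is an assumption, not something stated in this notebook, and the helper name `fold_change` is hypothetical.

```python
import math

def fold_change(delta_log_r):
    """Hypothetical helper: multiplicative change in R implied by a Δ log R value.

    Assumes Δ log R is measured in natural-log units.
    """
    return math.exp(delta_log_r)

# Two illustrative effect sizes, one positive and one negative.
for delta in (0.415, -0.0624):
    print(f"Δ log R = {delta: .4f} -> R multiplied by {fold_change(delta):.3f}")
```

Under this assumption, a positive Δ log R corresponds to a factor above 1 and a negative one to a factor below 1.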

Create a helper to explore pairwise differences.

In [4]:
def print_diff(strain1, strain2, max_results=10):
    """Print the mutations unique to one strain, ranked by |Δ log R|."""
    mutations1 = mutations_per_strain[strain1]
    mutations2 = mutations_per_strain[strain2]
    # Symmetric difference: mutations present in exactly one of the two strains.
    diff = [(m, effect_size[m]) for m in mutations1 ^ mutations2]
    # Largest absolute effect first.
    diff.sort(key=lambda me: -abs(me[1]))
    print(f"{strain1} versus {strain2}")
    print("AA mutation     Δ log R    Present in strain")
    print("--------------------------------------------")
    for m, e in diff[:max_results]:
        # Report which of the two strains carries the mutation.
        strain = strain1 if m in mutations1 else strain2
        print(f"{m: <15s} {e: <10.3g} {strain}")
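
The core of `print_diff` is the symmetric difference `mutations1 ^ mutations2`, which keeps the mutations found in exactly one of the two strains, followed by a sort on absolute effect size. A minimal, self-contained sketch on made-up data shows the same steps returning a list instead of printing. The function name `ranked_diff`, the mutation labels, and the effect sizes are all hypothetical.

```python
def ranked_diff(mutations1, mutations2, effect_size):
    """Return (mutation, effect) pairs unique to one set, largest |effect| first."""
    unique = mutations1 ^ mutations2  # symmetric difference of the two sets
    return sorted(
        ((m, effect_size[m]) for m in unique),
        key=lambda me: -abs(me[1]),
    )

# Hypothetical toy data, not taken from strains.tsv or mutations.tsv.
strain_x = frozenset({"S:A1B", "N:C2D", "ORF1a:E3F"})
strain_y = frozenset({"S:A1B", "ORF8:G4H"})
toy_effects = {"S:A1B": 0.3, "N:C2D": 0.05, "ORF1a:E3F": -0.2, "ORF8:G4H": 0.1}

for mutation, effect in ranked_diff(strain_x, strain_y, toy_effects):
    print(mutation, effect)
```

The shared mutation `S:A1B` drops out, and the negative effect of `ORF1a:E3F` ranks first because the sort uses the absolute value.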

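Examine some example pairs of strains. Each table lists the mutations present in only one of the two strains, ranked by the absolute value of Δ log R.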

In [5]:
print_diff("B.1.617.2", "B.1.617.1")

B.1.617.2 versus B.1.617.1
AA mutation     Δ log R    Present in strain
--------------------------------------------
ORF8:L84S       0.235      B.1.617.1
ORF1a:S318L     0.206      B.1.617.1
ORF1a:G392D     0.111      B.1.617.1
ORF1b:P314F     0.11       B.1.617.1
N:A220V         0.0752     B.1.617.2
S:E484Q         0.0703     B.1.617.1
S:A222V         0.0677     B.1.617.2
ORF1a:M585V     -0.0624    B.1.617.2
ORF1a:S2535L    0.0612     B.1.617.1
N:S194L         0.0599     B.1.617.1


In [6]:
print_diff("B.1.617.2", "B.1.617.3")

B.1.617.2 versus B.1.617.3
AA mutation     Δ log R    Present in strain
--------------------------------------------
ORF1a:K680N     0.415      B.1.617.3
N:A220V         0.0752     B.1.617.2
S:E484Q         0.0703     B.1.617.3
S:A222V         0.0677     B.1.617.2
ORF1a:M585V     -0.0624    B.1.617.2
N:S187L         0.0596     B.1.617.3
ORF1a:D2980N    0.0498     B.1.617.2
ORF1a:M3655I    0.0413     B.1.617.3
ORF1a:P309L     0.0409     B.1.617.2
ORF1a:S3675-    0.0409     B.1.617.3
