# Uniqueness Demo
---

Goal: If a set of SNVs is found in a mixture sequencing dataset, what is the probability that a particular lineage is present?

In [13]:
mutations = ['S:A222V']
pangolin_lineage = 'b.1.177'
location_id = 'USA' # optional

To start, it's worth quantifying how often that SNV appears in the lineage of interest versus all other lineages.

In [14]:
import numpy as np
import pandas as pd
from outbreak_data import outbreak_data
from outbreak_tools import outbreak_tools

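# One row per lineage: how many of its sequences carry the mutation(s)
df = outbreak_data.mutations_by_lineage(mutations, location=location_id)

# Mutation-carrying sequences assigned to the lineage of interest
# (.item() raises unless exactly one row matches)
lin_count = df.loc[df['pangolin_lineage'] == pangolin_lineage].mutation_count.item()

# Mutation-carrying sequences assigned to every other lineage
non_lin_count = np.sum(
    df.loc[df['pangolin_lineage'] != pangolin_lineage].mutation_count
)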

print(df)
print(f'{pangolin_lineage} count: {lin_count}')
print(f'non-{pangolin_lineage} count: {non_lin_count}')

    pangolin_lineage  lineage_count  mutation_count  proportion  \
0             ba.1.1         500053             264    0.000528   
1             ay.103         295483             674    0.002281   
2              ay.44         252180             268    0.001063   
3            b.1.1.7         237896             285    0.001198   
4          ba.2.12.1         231938              61    0.000263   
..               ...            ...             ...         ...   
235       b.1.177.46              1               1    1.000000   
236       b.1.177.50              1               1    1.000000   
237              xbc              1               1    1.000000   
238          xbc.1.1              1               1    1.000000   
239            xbc.2              1               1    1.000000   

     proportion_ci_lower  proportion_ci_upper  
0               0.000467             0.000595  
1               0.002114             0.002458  
2               0.000941             0.001196  
3  
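
The counting step can also be tried offline. The sketch below is a minimal illustration that rebuilds a small table from only the six rows visible in this notebook's printed output (five from the table above plus the `b.1.177` row printed by the uniqueness cell), so its totals are not those of the full 240-row result. The names `rows`, `toy`, and `share` are illustrative and not part of `outbreak_data`.

```python
import pandas as pd

# Rows copied from the printed output in this notebook; this is a subset,
# so the totals below do NOT match the full query result.
rows = [
    ("ba.1.1", 500053, 264),
    ("ay.103", 295483, 674),
    ("ay.44", 252180, 268),
    ("b.1.1.7", 237896, 285),
    ("ba.2.12.1", 231938, 61),
    ("b.1.177", 172, 163),
]
toy = pd.DataFrame(
    rows, columns=["pangolin_lineage", "lineage_count", "mutation_count"]
)

target = "b.1.177"
is_target = toy["pangolin_lineage"] == target

# Mutation-carrying sequences inside and outside the lineage of interest
lin_count = int(toy.loc[is_target, "mutation_count"].sum())
non_lin_count = int(toy.loc[~is_target, "mutation_count"].sum())

# Share of mutation-carrying sequences (within this subset) that are b.1.177
share = lin_count / (lin_count + non_lin_count)

print(f"{target} count: {lin_count}")
print(f"non-{target} count: {non_lin_count}")
print(f"share within subset: {share:.4f}")
```

Using `.sum()` on the filtered column instead of `.item()` avoids an exception if the lineage is missing from the table; in that case the count is simply 0.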

---
Next, we quantify how unique the set of mutations is to our lineage of interest. To do this, we compute the conditional probability of observing the lineage given that all mutations in the set $M$ are present. We use Bayes' theorem, where $P(L)$ is the probability that a sequenced genome belongs to lineage $L$, $P(M)$ is the probability of observing all mutations in $M$ in any sequenced genome, and $P(M|L)$ is the probability of observing all mutations in $M$ in a genome of lineage $L$:
$$
P(L|M)=\frac{P(M|L)P(L)}{P(M)} 
$$
---
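
To connect the formula to the counts from the first cell, here is a short worked simplification. It assumes each probability is estimated as a simple ratio of counts from a single dataset; the notebook does not state how `outbreak_tools.uniqueness` estimates each term, so this is an idealization rather than a description of the library. The symbols $N$ (total sequences), $n_L$ (sequences of lineage $L$), $n_M$ (sequences carrying all mutations in $M$), and $n_{M,L}$ (sequences of lineage $L$ carrying all mutations in $M$) are introduced here for illustration.

$$
P(M|L)=\frac{n_{M,L}}{n_L},\qquad P(L)=\frac{n_L}{N},\qquad P(M)=\frac{n_M}{N}
$$

$$
P(L|M)=\frac{P(M|L)P(L)}{P(M)}=\frac{\frac{n_{M,L}}{n_L}\cdot\frac{n_L}{N}}{\frac{n_M}{N}}=\frac{n_{M,L}}{n_M}
$$

Under that assumption, $P(L|M)$ is simply the fraction of mutation-carrying sequences that belong to the lineage of interest.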

In [15]:
P_L_given_M = outbreak_tools.uniqueness(
    mutations,
    pangolin_lineage,
    location=location_id
)

    pangolin_lineage  lineage_count  mutation_count  proportion  \
181          b.1.177            172             163    0.947674   

     proportion_ci_lower  proportion_ci_upper  
181             0.906691             0.973797  
------------------------
P(M|L): 0.9476744186046512
P(L):   0.005461393920350189
P(M):   0.029546951891287206
------------------------
P(L|M): 0.17516606542974836
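
As a sanity check, the three printed terms can be plugged back into Bayes' theorem by hand. This sketch only redoes the arithmetic with the values copied from the output above; it does not call `outbreak_tools`, and the variable names are illustrative.

```python
import math

# Values copied from the printed output above
p_m_given_l = 0.9476744186046512   # P(M|L)
p_l = 0.005461393920350189         # P(L)
p_m = 0.029546951891287206         # P(M)

# Bayes' theorem: P(L|M) = P(M|L) * P(L) / P(M)
p_l_given_m = p_m_given_l * p_l / p_m
print(f"P(L|M): {p_l_given_m}")

# Compare against the value reported by the notebook
assert math.isclose(p_l_given_m, 0.17516606542974836, rel_tol=1e-6)
```

The result reads as follows: although about 95% of `b.1.177` sequences carry `S:A222V`, only about 18% of sequences carrying `S:A222V` are `b.1.177`, so the mutation alone is weak evidence for that lineage.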
