# Building `dms-view` datasets for [Sourisseau *et al.*, 2019](https://research.fhcrc.org/content/dam/stripe/bloom/labfiles/publications/Sourisseau2019.pdf)

This Jupyter notebook builds the `dms-view` data files for the deep mutational scanning (DMS; `Sourisseau2019_DMS.csv`) and the mutational antigenic profiling (MAP; `Sourisseau2019_MAP.csv`) of the Zika virus envelope protein.
The data are read directly from the [paper repo](https://github.com/jbloomlab/ZIKV_DMS_with_EvansLab).

## notebook setup

In [1]:
import pandas as pd
from scipy.stats import entropy

## Deep Mutational Scanning data

The DMS data are shown under two "conditions": raw preferences and rescaled preferences.
For each condition, the dot plot shows one site-level metric (the entropy, the n effective, or the relative solvent accessibility, RSA), and the logo plot shows the mutation-level metrics (the preference value and the mutational effect).
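
As a minimal sketch of the two site-level diversity measures named above, the following computes the Shannon entropy of a site's preferences and the corresponding effective number of amino acids. The preference vectors are made up for illustration, and `n_effective` is a hypothetical helper. Note that `scipy.stats.entropy` uses the natural logarithm unless `base` is given, so the effective number is `exp(H)` for an entropy in nats and `2**H` for an entropy in bits; the two conventions should not be mixed.

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical preference vectors for two sites (each sums to 1).
uniform_site = np.full(20, 1 / 20)                       # all amino acids equally preferred
constrained_site = np.array([0.97] + [0.03 / 19] * 19)   # one strongly preferred amino acid


def n_effective(prefs):
    """Effective number of amino acids: exp of the entropy measured in nats."""
    return float(np.exp(entropy(prefs)))


h_nats = entropy(uniform_site)          # natural-log entropy (scipy default)
h_bits = entropy(uniform_site, base=2)  # the same quantity measured in bits

neff_uniform = n_effective(uniform_site)
neff_constrained = n_effective(constrained_site)
```

With either convention applied consistently, a site that tolerates all 20 amino acids equally has an effective number of 20, and a strongly constrained site has a value close to 1.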

### paths to datafiles on github

In [2]:
base_url = 'https://raw.githubusercontent.com/jbloomlab/ZIKV_DMS_with_EvansLab/master/results/'
muteffects_fname = base_url + 'muteffects/unscaled_muteffects.csv'
prefs_fname = base_url + 'prefs/summary_avgprefs.csv'
rescaledprefs_fname = base_url + 'prefs/rescaled_prefs.csv'
sitesummary_fname = base_url + 'struct_props/struct_props_mut_tol.csv'

### RSA

In [3]:
# read in the per-site RSA for the 5ire structure
RSA = (pd.read_csv(sitesummary_fname)
       .query("pdb == '5ire'")
       [['site', 'RSA']]
       .rename(columns={'RSA': 'site_RSA'}))
RSA.head()

| | site | site_RSA |
|---|---|---|
| 0 | 1 | 0.030457 |
| 4 | 2 | 0.244526 |
| 8 | 3 | 0.023952 |
| 12 | 4 | 0.142132 |
| 16 | 5 | 0.625 |


### mutational effects

In [4]:
# keep the log2 effect of each mutation and rename columns to the names used in the final datafile
mut_effects = (pd.read_csv(muteffects_fname)
               .drop(columns=['mutation', 'effect'])
               .rename(columns={'mutant': 'mutation',
                                'log2effect': 'mut_mutational effect'}))
mut_effects.head()

| | site | wildtype | mutation | mut_mutational effect |
|---|---|---|---|---|
| 0 | 1 | I | A | -5.04603 |
| 1 | 1 | I | C | -4.308696 |
| 2 | 1 | I | D | -3.88453 |
| 3 | 1 | I | E | -4.290569 |
| 4 | 1 | I | F | -4.370049 |


In [5]:
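amino_acids = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
               'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
# stack the raw and rescaled preferences, tagging each table with its condition
df = pd.concat([pd.read_csv(fname)[['site'] + amino_acids].assign(condition=condition)
                for condition, fname in [('raw preferences', prefs_fname),
                                         ('rescaled preferences', rescaledprefs_fname)]])
# site entropy; scipy's `entropy` uses the natural log (nats) by default
df['site_entropy'] = df[amino_acids].apply(entropy, axis=1)
# NOTE: `2**x` treats the entropy as bits although it is computed in nats;
# `exp(x)` would be the consistent choice, but the output below reflects `2**x`
df['site_n effective'] = 2 ** df['site_entropy']
# reshape to long format: one row per site, condition, and amino acid
df = pd.melt(df,
             id_vars=['site', 'condition', 'site_entropy', 'site_n effective'],
             var_name='mutation',
             value_name='mut_value')
# add the per-site RSA and the per-mutation effects
df = pd.merge(df, RSA, on=['site'])
df = pd.merge(df, mut_effects, on=['site', 'mutation'])
df.head()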

| | site | condition | site_entropy | site_n effective | mutation | mut_value | site_RSA | wildtype | mut_mutational effect |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | raw preferences | 1.809948 | 3.506295 | A | 0.005438 | 0.030457 | I | -5.04603 |
| 1 | 1 | raw preferences | 1.809948 | 3.506295 | A | 0.005438 | 0.030457 | I | -5.04603 |
| 2 | 1 | rescaled preferences | 1.317192 | 2.491807 | A | 0.000505 | 0.030457 | I | -5.04603 |
| 3 | 1 | rescaled preferences | 1.317192 | 2.491807 | A | 0.000505 | 0.030457 | I | -5.04603 |
| 4 | 1 | raw preferences | 1.809948 | 3.506295 | C | 0.009065 | 0.030457 | I | -4.308696 |
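
The repeated rows in the preview above suggest that one of the input tables may contain more than one row per merge key. As a hedged sketch of how such cases can be caught, the example below uses toy stand-ins (not the real data) together with the `validate` argument of `pd.merge`, which raises `pandas.errors.MergeError` when the stated key relationship does not hold. Dropping duplicates is only one possible remedy and is an assumption here; it is appropriate only if the repeated rows are truly identical.

```python
import pandas as pd

# Toy stand-ins for the long-format preferences and a per-site table.
prefs_long = pd.DataFrame({'site': [1, 1, 2, 2],
                           'mutation': ['A', 'C', 'A', 'C'],
                           'mut_value': [0.1, 0.9, 0.4, 0.6]})
site_table = pd.DataFrame({'site': [1, 1, 2],            # site 1 appears twice
                           'site_RSA': [0.03, 0.03, 0.24]})

# Count the rows whose site has already been seen.
n_dup_sites = int(site_table.duplicated(subset='site').sum())

# validate='many_to_one' requires the keys of the right-hand table to be unique.
try:
    pd.merge(prefs_long, site_table, on='site', validate='many_to_one')
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False

# One possible remedy (assumption: the repeated rows are exact copies).
deduped = site_table.drop_duplicates()
merged = pd.merge(prefs_long, deduped, on='site', validate='many_to_one')
```

After deduplication the merge keeps one row per site and mutation, so `merged` has the same number of rows as `prefs_long`.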


In [6]:
df['protein_chain'] = 'A'
df['protein_site'] = df['site']
# site label: wildtype residue followed by the site number, e.g. "I 1"
df['label_site'] = df['wildtype'] + ' ' + df['site'].astype(str)
df.head()

| | site | condition | site_entropy | site_n effective | mutation | mut_value | site_RSA | wildtype | mut_mutational effect | protein_chain | protein_site | label_site |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | raw preferences | 1.809948 | 3.506295 | A | 0.005438 | 0.030457 | I | -5.04603 | A | 1 | I 1 |
| 1 | 1 | raw preferences | 1.809948 | 3.506295 | A | 0.005438 | 0.030457 | I | -5.04603 | A | 1 | I 1 |
| 2 | 1 | rescaled preferences | 1.317192 | 2.491807 | A | 0.000505 | 0.030457 | I | -5.04603 | A | 1 | I 1 |
| 3 | 1 | rescaled preferences | 1.317192 | 2.491807 | A | 0.000505 | 0.030457 | I | -5.04603 | A | 1 | I 1 |
| 4 | 1 | raw preferences | 1.809948 | 3.506295 | C | 0.009065 | 0.030457 | I | -4.308696 | A | 1 | I 1 |


In [7]:
df.to_csv('Sourisseau2019_DMS.csv', index=False)
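
After writing the file, it can help to confirm that the table has the layout built in this notebook. The sketch below runs such a check on a one-row stand-in table. The `REQUIRED` list simply mirrors the columns created above, together with the `site_`/`mut_` prefixes used here for metric columns; it is an assumption, not an official `dms-view` specification. The helper name `check_dms_view_columns` is hypothetical.

```python
import pandas as pd

# Columns this notebook builds; treated here as the expected layout (assumption).
REQUIRED = ['site', 'label_site', 'wildtype', 'mutation',
            'condition', 'protein_chain', 'protein_site']


def check_dms_view_columns(df):
    """Return a list of layout problems found in the table (hypothetical helper)."""
    problems = [f'missing column: {c}' for c in REQUIRED if c not in df.columns]
    if not any(c.startswith('site_') for c in df.columns):
        problems.append('no site-level metric (site_*) column')
    if not any(c.startswith('mut_') for c in df.columns):
        problems.append('no mutation-level metric (mut_*) column')
    return problems


# One-row stand-in table with the same kind of columns as the notebook output.
example = pd.DataFrame({'site': [1], 'label_site': ['I 1'], 'wildtype': ['I'],
                        'mutation': ['A'], 'condition': ['raw preferences'],
                        'protein_chain': ['A'], 'protein_site': [1],
                        'site_entropy': [1.809948], 'mut_value': [0.005438]})

problems_ok = check_dms_view_columns(example)
problems_bad = check_dms_view_columns(example.drop(columns=['label_site', 'mut_value']))
```

The same function could be applied to `pd.read_csv('Sourisseau2019_DMS.csv')`; an empty list means that no layout problems were found.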