# Plot data on structures

The goal of this notebook is to plot the data on various CHIKV structures using [`dms-viz`](https://dms-viz.github.io/v0/). [Here's](https://dms-viz.github.io/dms-viz-docs/) the documentation for `dms-viz`. Unfortunately, it's a little slow on larger structures and big datasets like this. You might have to wait a bit for your interactions to register.

Two papers describe the structure of CHIKV E in complex with Mxra8: [this paper](https://www.sciencedirect.com/science/article/pii/S0092867419303940?via%3Dihub) and [this paper](https://www.cell.com/cell/pdf/S0092-8674(19)30392-7.pdf).

For several reasons, I think the structures from [Michael Diamond and Daved Fremont's paper](https://www.cell.com/cell/pdf/S0092-8674(19)30392-7.pdf) are better suited for viewing our data. They used a combination of X-ray crystallography, cryo-EM of viral particles, and computational reconstruction to get structures of 'mature' CHIKV E in complex with mouse Mxra8 on [VLPs](https://www.rcsb.org/structure/6NK6) and [infectious particles](https://www.rcsb.org/structure/6NK7). First, it's easy to orient yourself visually with these structures because they contain the transmembrane domain and capsid. Second, the models reflect the T=4 symmetry of CHIKV E, making it easy to show all 4 binding 'sites' with Mxra8. Third, there are structures with (from infectious particles) and without (from the VLP) E3 retention.

In this notebook, I'll plot the functional scores and the difference in functional scores between cell types on the [VLP](https://www.rcsb.org/structure/6NK6) and [infectious particle](https://www.rcsb.org/structure/6NK7) structures.


In [116]:
import pandas as pd
import os
import sys

## Data Processing

In this section, I'll combine, filter, and annotate the functional selection scores, then calculate the differences between them. First, I'll combine the average functional scores (the effect of mutations on cell entry) for each cell type into a single dataset.

In [117]:
# Average *observed* effect on cell entry for each cell line
TIM1_func_effects = pd.read_csv('../results/func_effects/averages/293T-TIM1_entry_func_effects.csv')
TIM1_func_effects["condition"] = 'TIM1'
MXRA8_func_effects = pd.read_csv('../results/func_effects/averages/293T-Mxra8_entry_func_effects.csv')
MXRA8_func_effects["condition"] = 'MXRA8'
C636_func_effects = pd.read_csv('../results/func_effects/averages/C636_entry_func_effects.csv')
C636_func_effects["condition"] = 'C636'

In [118]:
# Annotations for each site in CHIKV E
CHIKV_E_sitemap = pd.read_csv('./dms-viz/sitemap/CHIKV_sitemap.csv')
CHIKV_E_annotations = CHIKV_E_sitemap.rename(columns={'sequential_site': 'site'})
# Replace NaN with empty strings for annotations
columns_to_replace = ['domain', 'contacts']
CHIKV_E_annotations[columns_to_replace] = CHIKV_E_annotations[columns_to_replace].fillna('')

In [119]:
# Combine all functional effects
combined_func_effects = (
    pd.concat([TIM1_func_effects, MXRA8_func_effects, C636_func_effects])
    .merge(
        CHIKV_E_annotations[['site', 'wildtype', 'region', 'immature_numbering', 'mature_numbering', 'domain', 'contacts']], 
        on=['site', 'wildtype'],
        how='left'
    )
) 
combined_func_effects.head()

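| | site | wildtype | mutant | effect | effect_std | times_seen | n_selections | condition | region | immature_numbering | mature_numbering | domain | contacts |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | I | -5.765 | 0.008958 | 17.25 | 4 | TIM1 | E3 | 0.0 | 0.0 | | |
| 1 | 1 | M | L | -1.297 | 0.1075 | 0.5 | 2 | TIM1 | E3 | 0.0 | 0.0 | | |
| 2 | 1 | M | M | 0.0 | 0.0 | | 4 | TIM1 | E3 | 0.0 | 0.0 | | |
| 3 | 1 | M | T | -5.745 | 0.0483 | 5.0 | 4 | TIM1 | E3 | 0.0 | 0.0 | | |
| 4 | 1 | M | V | -5.748 | 0.01202 | 1.5 | 2 | TIM1 | E3 | 0.0 | 0.0 | | |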
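
The left merge above assumes that each `(site, wildtype)` pair appears exactly once in the annotation table. As a hedged sanity check, here is a minimal sketch on toy data (the `effects` and `annotations` frames are made up for illustration and are not the real files) showing how the `validate` and `indicator` arguments of `DataFrame.merge` can flag duplicated or unmatched annotation rows.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are illustrative only)
effects = pd.DataFrame({
    "site": [1, 1, 2, 3],
    "wildtype": ["M", "M", "S", "L"],
    "mutant": ["I", "L", "A", "P"],
    "effect": [-5.7, -1.3, 0.1, -2.0],
})
annotations = pd.DataFrame({
    "site": [1, 2],
    "wildtype": ["M", "S"],
    "region": ["E3", "E3"],
})

merged = effects.merge(
    annotations,
    on=["site", "wildtype"],
    how="left",
    validate="many_to_one",  # raises MergeError if an annotation key is duplicated
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Rows whose (site, wildtype) key had no match in the annotation table
unannotated = merged[merged["_merge"] == "left_only"]
print(f"{len(unannotated)} of {len(merged)} rows lack annotations")
```

With a left merge, unmatched rows are kept with `NaN` annotations rather than dropped, so a check like this is the only place they would show up.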


In [120]:
# Write to file as input for dms-viz
combined_func_effects.to_csv('./dms-viz/input/all_functional_effects.csv', index=False)
for cell_line in ['TIM1', 'MXRA8', 'C636']:
    combined_func_effects.query(f'condition == "{cell_line}"').to_csv(f'./dms-viz/input/{cell_line}_functional_effects.csv', index=False)

Now, I'll take the functional selection data and calculate the difference between functional scores across all combinations of cell types.

**Note that I'm doing minimal filtering of the raw data. I'm removing observations with `times_seen <= 2` (applied separately for each cell line, so a mutation can be missing from some conditions) and I'm removing stops and gaps (`["*", "-"]`).**

In [121]:
# Pivot the functional effects for each cell type
functional_selection_difference = (
    combined_func_effects
    .query('mutant not in ["*", "-"]')
    .query('times_seen > 2')
    .pivot(index=['site', 'wildtype', 'mutant'], columns='condition', values='effect')
    .rename_axis(None, axis=1)
    .reset_index()
)

# Some mutations have missing data for some cell lines after filtering
missing_data = len(functional_selection_difference[
    functional_selection_difference.isna().any(axis=1)
])
print(f"Missing data in at least one condition for {missing_data} mutations")

Missing data in at least one condition for 28 mutations
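
The filter and pivot have two side effects. Comparisons against `NaN` evaluate to false, so rows with a missing `times_seen` (such as the wildtype row in the table above) are dropped by `times_seen > 2`. A mutation that passes the filter in only some cell lines ends up with `NaN` in the others after the pivot. The sketch below reproduces both behaviors on toy data; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data: a wildtype row with NaN times_seen, a mutation (I) that passes
# the filter in only one condition, and a mutation (L) that passes in both.
toy = pd.DataFrame({
    "site": [1, 1, 1, 1, 1, 1],
    "wildtype": ["M"] * 6,
    "mutant": ["M", "M", "I", "I", "L", "L"],
    "effect": [0.0, 0.0, -5.8, -5.7, -1.3, -1.1],
    "times_seen": [np.nan, np.nan, 17.25, 1.5, 5.0, 4.0],
    "condition": ["TIM1", "C636", "TIM1", "C636", "TIM1", "C636"],
})

wide = (
    toy
    .query("times_seen > 2")  # NaN > 2 is False, so the wildtype rows drop out
    .pivot(index=["site", "wildtype", "mutant"], columns="condition", values="effect")
    .rename_axis(None, axis=1)
    .reset_index()
)

# Count mutations with a missing value in at least one condition
n_missing = int(wide.isna().any(axis=1).sum())
print(wide)
print(f"{n_missing} mutation(s) missing in at least one condition")
```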


In [122]:
# Get the condition columns (everything except the 'site', 'wildtype', and 'mutant' identifiers)
condition_cols = [col for col in functional_selection_difference.columns 
                 if col not in ['site', 'wildtype', 'mutant']]

# Calculate all pairwise differences
for col1 in condition_cols:
    for col2 in condition_cols:
        if col1 != col2:
            new_col_name = f"{col1}_v_{col2}"
            functional_selection_difference[new_col_name] = (
                functional_selection_difference[col1] - functional_selection_difference[col2]
            )

# Melt the comparisons into a long format
functional_selection_difference = functional_selection_difference.melt(
    id_vars=["site", "wildtype", "mutant"],
    value_vars=["C636_v_MXRA8", "C636_v_TIM1", "MXRA8_v_C636", "MXRA8_v_TIM1", "TIM1_v_C636", "TIM1_v_MXRA8"],
    var_name="comparison",
    value_name="difference"
)

# Join the annotations back to the functional effect differences
functional_selection_difference = (
    functional_selection_difference
    .merge(
        CHIKV_E_annotations[['site', 'wildtype', 'region', 'immature_numbering', 'mature_numbering', 'domain', 'contacts']],
        on=['site', 'wildtype'],
        how='left'
    )
)
functional_selection_difference.head()

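| | site | wildtype | mutant | comparison | difference | region | immature_numbering | mature_numbering | domain | contacts |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | I | C636_v_MXRA8 | 0.001 | E3 | 0.0 | 0.0 | | |
| 1 | 1 | M | T | C636_v_MXRA8 | 0.043 | E3 | 0.0 | 0.0 | | |
| 2 | 2 | S | A | C636_v_MXRA8 | 0.0814 | E3 | 1.0 | 1.0 | | |
| 3 | 2 | S | C | C636_v_MXRA8 | 0.0958 | E3 | 1.0 | 1.0 | | |
| 4 | 2 | S | D | C636_v_MXRA8 | 0.3251 | E3 | 1.0 | 1.0 | | |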
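
The nested loop above produces every ordered pair of conditions, so three cell lines give 3 × 2 = 6 comparison columns, and each pair appears in both directions with opposite sign. As an alternative sketch (not what the cell above does), `itertools.permutations` can generate the same names so the list passed to `melt` doesn't need to be hard-coded. The toy values below are illustrative only.

```python
from itertools import permutations

import pandas as pd

# Toy wide table with one effect column per cell line (illustrative values)
wide = pd.DataFrame({
    "site": [1, 2],
    "wildtype": ["M", "S"],
    "mutant": ["I", "A"],
    "C636": [-5.0, 0.5],
    "MXRA8": [-5.5, 0.25],
    "TIM1": [-5.75, 0.0],
})
id_cols = ["site", "wildtype", "mutant"]
condition_cols = [col for col in wide.columns if col not in id_cols]

# Every ordered pair (left, right) gives one "left minus right" column
comparisons = []
for left, right in permutations(condition_cols, 2):
    name = f"{left}_v_{right}"
    wide[name] = wide[left] - wide[right]
    comparisons.append(name)

# Melt using the derived names instead of a hard-coded list
long = wide.melt(
    id_vars=id_cols,
    value_vars=comparisons,
    var_name="comparison",
    value_name="difference",
)
print(comparisons)
```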


In [123]:
# Write to file
functional_selection_difference.to_csv('./dms-viz/input/all_functional_selection_difference.csv', index=False)
for comparison in ["C636_v_MXRA8", "C636_v_TIM1", "MXRA8_v_C636", "MXRA8_v_TIM1", "TIM1_v_C636", "TIM1_v_MXRA8"]:
    functional_selection_difference.query(f'comparison == "{comparison}"').to_csv(f'./dms-viz/input/{comparison}_functional_selection_difference.csv', index=False)

## Make `dms-viz` JSONs

Now, I'll use `configure-dms-viz` to make the `dms-viz` JSON visualization specification files. The following cells run the `configure-dms-viz` command in a shell (via the `!` prefix). Make sure you've got the [`dms-viz` conda environment](./dms-viz/environment.yml) set as the active 'kernel'.

### VLP Structures ([6NK6](https://www.rcsb.org/structure/6NK6))

The **VLP doesn't retain the E3** subunit after it's cleaved by furin, leading to a 1:1 binding mode with Mxra8 at 4 distinct sites within a single unit of the T=4 symmetric viral particle. Additionally, Mxra8 forms 3 distinct types of contact with E: wrapped, interspike, and intraspike.

I'll use the ability of `dms-viz` to hide chains in order to illustrate each of these binding modes.

#### Wrapped

In [124]:
# Make the wrapped version of the site map
CHIKV_E_sitemap['chains'] = CHIKV_E_sitemap['region'].apply(
    lambda region: "E" if region in ["E3", "E2"] else "A" if region in ["E1", "6K"] else None
)
CHIKV_E_sitemap.to_csv('./dms-viz/sitemap/CHIKV_sitemap_6NK6_wrapped.csv', index=False)

In [125]:
!configure-dms-viz format \
    --input ./dms-viz/input/all_functional_effects.csv \
    --sitemap ./dms-viz/sitemap/CHIKV_sitemap_6NK6_wrapped.csv \
    --output ./dms-viz/output/CHIKV_VLP_wrapped_monomer_functional_scores.json \
    --name "CHIKV Func. Scores" \
    --metric "effect" \
    --metric-name "Functional Effect" \
    --exclude-amino-acids "*, -" \
    --included-chains "A E" \
    --excluded-chains "M N P B C D F G H J K L" \
    --condition "condition" \
    --condition-name "Cell Line" \
    --filter-cols "{'n_selections': '# of Selections', 'times_seen': 'Times Seen'}" \
    --filter-limits "{'times_seen': [0, 2, 25]}" \
    --structure "6NK6" \
    --colors "#0072B2,#CC79A7,#4C3549"

Formatting data for visualization using the 'effect' column from './dms-viz/input/all_functional_effects.csv'...
Using sitemap from './dms-viz/sitemap/CHIKV_sitemap_6NK6_wrapped.csv'.
About 94.64% (812 of 858) of the wildtype residues in the data match the corresponding residues in the structure.
About 4.67% (42 of 900) of the data sites are missing from the structure.
Success! The visualization JSON was written to './dms-viz/output/CHIKV_VLP_wrapped_monomer_functional_scores.json'


In [126]:
!configure-dms-viz format \
    --input ./dms-viz/input/all_functional_selection_difference.csv \
    --sitemap ./dms-viz/sitemap/CHIKV_sitemap_6NK6_wrapped.csv \
    --output ./dms-viz/output/CHIKV_VLP_wrapped_monomer_functional_differences.json \
    --name "CHIKV Cell Entry Difference" \
    --metric "difference" \
    --metric-name "Effect Difference" \
    --exclude-amino-acids "*, -" \
    --included-chains "A E" \
    --excluded-chains "M N P B C D F G H J K L" \
    --condition "comparison" \
    --condition-name "Comparison (Left - Right)" \
    --structure "6NK6" \
    --colors "#0072B2,#CC79A7,#4C3549,#009E73,#E69F00,#56B4E9"

Formatting data for visualization using the 'difference' column from './dms-viz/input/all_functional_selection_difference.csv'...
Using sitemap from './dms-viz/sitemap/CHIKV_sitemap_6NK6_wrapped.csv'.
About 94.64% (812 of 858) of the wildtype residues in the data match the corresponding residues in the structure.
About 4.67% (42 of 900) of the data sites are missing from the structure.
Success! The visualization JSON was written to './dms-viz/output/CHIKV_VLP_wrapped_monomer_functional_differences.json'
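
The `configure-dms-viz format` cells in this notebook differ only in a handful of arguments (input, sitemap, output, chains, structure). As a hedged sketch, the argument list could be assembled in Python instead of being repeated by hand. The helper `build_format_command` is hypothetical and not part of `dms-viz`. It only uses flags that already appear in the cells above, and it builds the list without running anything.

```python
import shlex


def build_format_command(input_csv, sitemap, output_json, name, metric,
                         metric_name, condition, condition_name, structure,
                         included_chains, colors, excluded_chains=None):
    """Assemble the argument list for `configure-dms-viz format`.

    Hypothetical convenience helper; it only uses flags that already
    appear in the cells above.
    """
    cmd = [
        "configure-dms-viz", "format",
        "--input", input_csv,
        "--sitemap", sitemap,
        "--output", output_json,
        "--name", name,
        "--metric", metric,
        "--metric-name", metric_name,
        "--exclude-amino-acids", "*, -",
        "--included-chains", included_chains,
        "--condition", condition,
        "--condition-name", condition_name,
        "--structure", structure,
        "--colors", colors,
    ]
    # Only add the flag when there are chains to hide
    if excluded_chains:
        cmd += ["--excluded-chains", excluded_chains]
    return cmd


cmd = build_format_command(
    input_csv="./dms-viz/input/all_functional_selection_difference.csv",
    sitemap="./dms-viz/sitemap/CHIKV_sitemap_6NK6_wrapped.csv",
    output_json="./dms-viz/output/CHIKV_VLP_wrapped_monomer_functional_differences.json",
    name="CHIKV Cell Entry Difference",
    metric="difference",
    metric_name="Effect Difference",
    condition="comparison",
    condition_name="Comparison (Left - Right)",
    structure="6NK6",
    included_chains="A E",
    excluded_chains="M N P B C D F G H J K L",
    colors="#0072B2,#CC79A7,#4C3549,#009E73,#E69F00,#56B4E9",
)

# Print a shell-quoted version for inspection; nothing is executed here
print(shlex.join(cmd))
```

To actually run it, the list could be passed to `subprocess.run(cmd, check=True)` from within the `dms-viz` environment.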


#### Intraspike

The wrapped sitemap also works for this visualization. The only change is that I'll show the 'intraspike' Mxra8 (chain M) by hiding the 'wrapped' Mxra8 (chain O).

In [127]:
!configure-dms-viz format \
    --input ./dms-viz/input/all_functional_effects.csv \
    --sitemap ./dms-viz/sitemap/CHIKV_sitemap_6NK6_wrapped.csv \
    --output ./dms-viz/output/CHIKV_VLP_intraspike_monomer_functional_scores.json \
    --name "CHIKV Func. Scores" \
    --metric "effect" \
    --metric-name "Functional Effect" \
    --exclude-amino-acids "*, -" \
    --included-chains "A E" \
    --excluded-chains "O N P B C D F G H J K L" \
    --condition "condition" \
    --condition-name "Cell Line" \
    --filter-cols "{'n_selections': '# of Selections', 'times_seen': 'Times Seen'}" \
    --filter-limits "{'times_seen': [0, 2, 25]}" \
    --structure "6NK6" \
    --colors "#0072B2,#CC79A7,#4C3549"

Formatting data for visualization using the 'effect' column from './dms-viz/input/all_functional_effects.csv'...
Using sitemap from './dms-viz/sitemap/CHIKV_sitemap_6NK6_wrapped.csv'.
About 94.64% (812 of 858) of the wildtype residues in the data match the corresponding residues in the structure.
About 4.67% (42 of 900) of the data sites are missing from the structure.
Success! The visualization JSON was written to './dms-viz/output/CHIKV_VLP_intraspike_monomer_functional_scores.json'


In [128]:
!configure-dms-viz format \
    --input ./dms-viz/input/all_functional_selection_difference.csv \
    --sitemap ./dms-viz/sitemap/CHIKV_sitemap_6NK6_wrapped.csv \
    --output ./dms-viz/output/CHIKV_VLP_intraspike_monomer_functional_differences.json \
    --name "CHIKV Cell Entry Difference" \
    --metric "difference" \
    --metric-name "Effect Difference" \
    --exclude-amino-acids "*, -" \
    --included-chains "A E" \
    --excluded-chains "O N P B C D F G H J K L" \
    --condition "comparison" \
    --condition-name "Comparison (Left - Right)" \
    --structure "6NK6" \
    --colors "#0072B2,#CC79A7,#4C3549,#009E73,#E69F00,#56B4E9"

Formatting data for visualization using the 'difference' column from './dms-viz/input/all_functional_selection_difference.csv'...
Using sitemap from './dms-viz/sitemap/CHIKV_sitemap_6NK6_wrapped.csv'.
About 94.64% (812 of 858) of the wildtype residues in the data match the corresponding residues in the structure.
About 4.67% (42 of 900) of the data sites are missing from the structure.
Success! The visualization JSON was written to './dms-viz/output/CHIKV_VLP_intraspike_monomer_functional_differences.json'


#### Interspike

This view needs a new sitemap so that the data is shown on the only E1/E2 heterodimer that makes 'interspike' contacts.

In [129]:
# Make the interspike version of the site map
CHIKV_E_sitemap['chains'] = CHIKV_E_sitemap['region'].apply(
    lambda region: "H" if region in ["E3", "E2"] else "D" if region in ["E1", "6K"] else None
)
CHIKV_E_sitemap.to_csv('./dms-viz/sitemap/CHIKV_sitemap_6NK6_interspike.csv', index=False)

In [137]:
!configure-dms-viz format \
    --input ./dms-viz/input/all_functional_effects.csv \
    --sitemap ./dms-viz/sitemap/CHIKV_sitemap_6NK6_interspike.csv \
    --output ./dms-viz/output/CHIKV_VLP_interspike_monomer_functional_scores.json \
    --name "CHIKV Func. Scores" \
    --metric "effect" \
    --metric-name "Functional Effect" \
    --exclude-amino-acids "*, -" \
    --included-chains "D H" \
    --excluded-chains "M N P A B C E F G J K I" \
    --condition "condition" \
    --condition-name "Cell Line" \
    --filter-cols "{'n_selections': '# of Selections', 'times_seen': 'Times Seen'}" \
    --filter-limits "{'times_seen': [0, 2, 25]}" \
    --structure "6NK6" \
    --colors "#0072B2,#CC79A7,#4C3549"

Formatting data for visualization using the 'effect' column from './dms-viz/input/all_functional_effects.csv'...
Using sitemap from './dms-viz/sitemap/CHIKV_sitemap_6NK6_interspike.csv'.
About 94.64% (812 of 858) of the wildtype residues in the data match the corresponding residues in the structure.
About 4.67% (42 of 900) of the data sites are missing from the structure.
Success! The visualization JSON was written to './dms-viz/output/CHIKV_VLP_interspike_monomer_functional_scores.json'


In [138]:
!configure-dms-viz format \
    --input ./dms-viz/input/all_functional_selection_difference.csv \
    --sitemap ./dms-viz/sitemap/CHIKV_sitemap_6NK6_interspike.csv \
    --output ./dms-viz/output/CHIKV_VLP_interspike_monomer_functional_differences.json \
    --name "CHIKV Cell Entry Difference" \
    --metric "difference" \
    --metric-name "Effect Difference" \
    --exclude-amino-acids "*, -" \
    --included-chains "D H" \
    --excluded-chains "M N P A B C E F G J K I" \
    --condition "comparison" \
    --condition-name "Comparison (Left - Right)" \
    --structure "6NK6" \
    --colors "#0072B2,#CC79A7,#4C3549,#009E73,#E69F00,#56B4E9"

Formatting data for visualization using the 'difference' column from './dms-viz/input/all_functional_selection_difference.csv'...
Using sitemap from './dms-viz/sitemap/CHIKV_sitemap_6NK6_interspike.csv'.
About 94.64% (812 of 858) of the wildtype residues in the data match the corresponding residues in the structure.
About 4.67% (42 of 900) of the data sites are missing from the structure.
Success! The visualization JSON was written to './dms-viz/output/CHIKV_VLP_interspike_monomer_functional_differences.json'


#### Full Unit

Finally, I'll plot the data on the full T=4 unit of the icosahedral particle. This will be useful for comparing with the infectious particle.

In [132]:
# Make the full-unit version of the site map
CHIKV_E_sitemap['chains'] = CHIKV_E_sitemap['region'].apply(
    lambda region: "E F G H" if region in ["E3", "E2"] else "A B C D" if region in ["E1", "6K"] else None
)
CHIKV_E_sitemap.to_csv('./dms-viz/sitemap/CHIKV_sitemap_6NK6_full.csv', index=False)
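
The three sitemaps built so far (wrapped, interspike, and full) differ only in which chains the E3/E2 and E1/6K regions map to. A minimal sketch of that mapping follows, using a hypothetical helper `assign_chains` and a toy sitemap with one row per region. The chain IDs are copied from the sitemap cells above, and everything else is illustrative.

```python
import pandas as pd


def assign_chains(region, e2_chains, e1_chains):
    """Map a region label to structure chains (hypothetical helper)."""
    if region in ("E3", "E2"):
        return e2_chains
    if region in ("E1", "6K"):
        return e1_chains
    return None


# Chain IDs taken from the sitemap cells above: (E3/E2 chains, E1/6K chains)
variants = {
    "wrapped": ("E", "A"),
    "interspike": ("H", "D"),
    "full": ("E F G H", "A B C D"),
}

# Toy sitemap with one row per region (illustrative only)
toy_sitemap = pd.DataFrame({"region": ["E3", "E2", "6K", "E1"]})
for label, (e2_chains, e1_chains) in variants.items():
    toy_sitemap[label] = toy_sitemap["region"].apply(
        assign_chains, args=(e2_chains, e1_chains)
    )
print(toy_sitemap)
```

Each column of the toy frame corresponds to the `chains` column of one of the sitemap files written above.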

In [133]:
!configure-dms-viz format \
    --input ./dms-viz/input/all_functional_effects.csv \
    --sitemap ./dms-viz/sitemap/CHIKV_sitemap_6NK6_full.csv \
    --output ./dms-viz/output/CHIKV_VLP_full_functional_scores.json \
    --name "CHIKV Func. Scores" \
    --metric "effect" \
    --metric-name "Functional Effect" \
    --exclude-amino-acids "*, -" \
    --included-chains "A B C D E F G H" \
    --condition "condition" \
    --condition-name "Cell Line" \
    --filter-cols "{'n_selections': '# of Selections', 'times_seen': 'Times Seen'}" \
    --filter-limits "{'times_seen': [0, 2, 25]}" \
    --structure "6NK6" \
    --colors "#0072B2,#CC79A7,#4C3549"

Formatting data for visualization using the 'effect' column from './dms-viz/input/all_functional_effects.csv'...
Using sitemap from './dms-viz/sitemap/CHIKV_sitemap_6NK6_full.csv'.
About 94.64% (812 of 858) of the wildtype residues in the data match the corresponding residues in the structure.
About 4.67% (42 of 900) of the data sites are missing from the structure.
Success! The visualization JSON was written to './dms-viz/output/CHIKV_VLP_full_functional_scores.json'


In [134]:
!configure-dms-viz format \
    --input ./dms-viz/input/all_functional_selection_difference.csv \
    --sitemap ./dms-viz/sitemap/CHIKV_sitemap_6NK6_full.csv \
    --output ./dms-viz/output/CHIKV_VLP_full_functional_differences.json \
    --name "CHIKV Cell Entry Difference" \
    --metric "difference" \
    --metric-name "Effect Difference" \
    --exclude-amino-acids "*, -" \
    --included-chains "A B C D E F G H" \
    --condition "comparison" \
    --condition-name "Comparison (Left - Right)" \
    --structure "6NK6" \
    --colors "#0072B2,#CC79A7,#4C3549,#009E73,#E69F00,#56B4E9"

Formatting data for visualization using the 'difference' column from './dms-viz/input/all_functional_selection_difference.csv'...
Using sitemap from './dms-viz/sitemap/CHIKV_sitemap_6NK6_full.csv'.
About 94.64% (812 of 858) of the wildtype residues in the data match the corresponding residues in the structure.
About 4.67% (42 of 900) of the data sites are missing from the structure.
Success! The visualization JSON was written to './dms-viz/output/CHIKV_VLP_full_functional_differences.json'


### Infectious Particle Structure ([6NK7](https://www.rcsb.org/structure/6NK7))

We also want to understand the effect of E3 retention on Mxra8 binding. We can do this by comparing the VLP (no E3) with the infectious particle (has E3). E3 retention after cleavage by furin occludes 3 of the 4 Mxra8 binding sites on the E trimer.

The chain labels for E1 and E2 are the same in the VLP and infectious particle structures, so we can reuse the same sitemap and only change the structure from 6NK6 to 6NK7.

In [135]:
!configure-dms-viz format \
    --input ./dms-viz/input/all_functional_effects.csv \
    --sitemap ./dms-viz/sitemap/CHIKV_sitemap_6NK6_full.csv \
    --output ./dms-viz/output/CHIKV_infectious_full_functional_scores.json \
    --name "CHIKV Func. Scores" \
    --metric "effect" \
    --metric-name "Functional Effect" \
    --exclude-amino-acids "*, -" \
    --included-chains "A B C D E F G H" \
    --condition "condition" \
    --condition-name "Cell Line" \
    --filter-cols "{'n_selections': '# of Selections', 'times_seen': 'Times Seen'}" \
    --filter-limits "{'times_seen': [0, 2, 25]}" \
    --structure "6NK7" \
    --colors "#0072B2,#CC79A7,#4C3549"

Formatting data for visualization using the 'effect' column from './dms-viz/input/all_functional_effects.csv'...
Using sitemap from './dms-viz/sitemap/CHIKV_sitemap_6NK6_full.csv'.
About 100.00% (858 of 858) of the wildtype residues in the data match the corresponding residues in the structure.
About 4.67% (42 of 900) of the data sites are missing from the structure.
Success! The visualization JSON was written to './dms-viz/output/CHIKV_infectious_full_functional_scores.json'


In [136]:
!configure-dms-viz format \
    --input ./dms-viz/input/all_functional_selection_difference.csv \
    --sitemap ./dms-viz/sitemap/CHIKV_sitemap_6NK6_full.csv \
    --output ./dms-viz/output/CHIKV_infectious_full_functional_differences.json \
    --name "CHIKV Cell Entry Difference" \
    --metric "difference" \
    --metric-name "Effect Difference" \
    --exclude-amino-acids "*, -" \
    --included-chains "A B C D E F G H" \
    --condition "comparison" \
    --condition-name "Comparison (Left - Right)" \
    --structure "6NK7" \
    --colors "#0072B2,#CC79A7,#4C3549,#009E73,#E69F00,#56B4E9"

Formatting data for visualization using the 'difference' column from './dms-viz/input/all_functional_selection_difference.csv'...
Using sitemap from './dms-viz/sitemap/CHIKV_sitemap_6NK6_full.csv'.
About 100.00% (858 of 858) of the wildtype residues in the data match the corresponding residues in the structure.
About 4.67% (42 of 900) of the data sites are missing from the structure.
Success! The visualization JSON was written to './dms-viz/output/CHIKV_infectious_full_functional_differences.json'
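
The summary lines printed by `configure-dms-viz` can be checked with a little arithmetic. The sketch below uses only the counts from the outputs above. Reading the "of 858" denominator as the number of data sites present in the structure is an assumption, based on 858 + 42 = 900.

```python
# Counts copied from the `configure-dms-viz` summaries above
total_sites = 900    # data sites
missing_sites = 42   # data sites missing from the structure
matched_6nk6 = 812   # wildtype matches reported for 6NK6
matched_6nk7 = 858   # wildtype matches reported for 6NK7

# Assumption: the "of 858" denominator is the number of data sites
# that are present in the structure
present_sites = total_sites - missing_sites

pct_missing = f"{missing_sites / total_sites:.2%}"
pct_match_6nk6 = f"{matched_6nk6 / present_sites:.2%}"
pct_match_6nk7 = f"{matched_6nk7 / present_sites:.2%}"
mismatches_6nk6 = present_sites - matched_6nk6

print(f"Sites present in the structure: {present_sites}")
print(f"Missing from the structure: {pct_missing}")
print(f"Wildtype match in 6NK6: {pct_match_6nk6}")
print(f"Wildtype match in 6NK7: {pct_match_6nk7}")
print(f"Wildtype mismatches in 6NK6: {mismatches_6nk6}")
```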
