<a id='contents'></a>

# Figures and analysis

This notebook contains the scripts that produce the main figures and results accompanying the manuscript. It performs basic organization and processing of the data, which are then passed to functions in `figures.py` and `mplot.py` (available at this [GitHub repository](https://github.com/johnbarton/mplot)) for detailed formatting. The resulting figures are stored as PDFs in the `figures` folder.

## Contents

- [Overview and table of contents](#contents)
- [Loading libraries and global variables](#global)
- [Figures and data analysis](#figures)  
    - [Figure 1](#fig1-figs1)  
    - [Figure 2](#fig2-figs2)
    - [Figure 3](#fig3)
    - [Figure 4](#fig4-figs4)
    - [Figure 5](#fig5)
    - [Figure 6](#fig6-figs9)
    - [Extended Data Figure 1](#fig1-figs1)
    - [Extended Data Figure 2](#fig2-figs2)
    - Extended Data Figures 3 and 4 are plotted in Matlab -- see the `src/Matlab` folder for details  
    - [Extended Data Figure 5](#figs3-figs6)
    - [Extended Data Figure 6](#fig4-figs4)
    - [Supplementary Figure 1](#figs5)
    - [Extended Data Figure 7](#figs3-figs6)
    - [Extended Data Figure 8](#figs7-figs8)
    - [Extended Data Figure 9](#figs7-figs8)
    - [Supplementary Figure 2](#fig6-figs9)
    - [Extended Data Figure 10](#figs11)

<a id='global'></a>

## Libraries and variables

In [1]:
# Full library list and version numbers

print('This notebook was prepared using:')

import sys, os
from copy import deepcopy
from importlib import reload
print('python version %s' % sys.version)

import numpy as np
print('numpy version %s' % np.__version__)

import scipy as sp
import scipy.stats as st
print('scipy version %s' % sp.__version__)

import pandas as pd
print('pandas version %s' % pd.__version__)

import matplotlib
import matplotlib.cm as cm
import matplotlib.pyplot as plot
import matplotlib.gridspec as gridspec
import matplotlib.image as mpimg
print('matplotlib version %s' % matplotlib.__version__)

import figures as fig
import mplot as mp


# GLOBAL VARIABLES

NUC = ['-', 'A', 'C', 'G', 'T']
REF = NUC[0]
CONS_TAG = 'CONSENSUS'
HXB2_TAG = 'B.FR.1983.HXB2-LAI-IIIB-BRU.K03455.19535'
B_PPTS = ['700010470', '700010077', '700010058', '700010040', '700010607']
C_PPTS = ['706010164', '705010198', '705010185', '705010162', '704010042', 
          '703010256', '703010159', '703010131', 'cap256']
PPT2LABEL = {'700010470': 'CH470',
             '700010077': 'CH77', 
             '700010058': 'CH58', 
             '700010040': 'CH40',
             '706010164': 'CH164', 
             '705010198': 'CH198', 
             '705010185': 'CH185', 
             '705010162': 'CH162', 
             '704010042': 'CH42', 
             '703010256': 'CH256',
             '703010159': 'CH159', 
             '703010131': 'CH131', 
             '700010607': 'CH607', 
             'cap256':    'CAP256'}

FIGPROPS = { 'transparent' : True, }

# # Code Ocean directories
# HIV_DIR = '../data/HIV'
# MPL_DIR = 'MPL'
# SIM_DIR = '../data/simulation'
# FIG_DIR = '../results'
# HIV_MPL_DIR = '../data/HIV/MPL'

# GitHub directories
HIV_DIR = 'data/HIV'
MPL_DIR = 'src/MPL'
SIM_DIR = 'data/simulation'
FIG_DIR = 'figures'
HIV_MPL_DIR = 'src/MPL/HIV'

TESTS   = [   'example',      'medium_simple',      'medium_complex']
N_VALS  = dict(example=  1000, medium_simple=  1000, medium_complex=1000)
L_VALS  = dict(example=    50, medium_simple=    50, medium_complex=  50)
T0_VALS = dict(example=     0, medium_simple=     0, medium_complex=  10)
T_VALS  = dict(example=   400, medium_simple=  1000, medium_complex= 310)
MU_VALS = dict(example=  1e-3, medium_simple=  1e-4, medium_complex=1e-4)
NB_VALS = dict(example=    10, medium_simple=    10, medium_complex=  10)
ND_VALS = dict(example=    10, medium_simple=    10, medium_complex=  10)
SB_VALS = dict(example= 0.025, medium_simple= 0.025, medium_complex= 0.1)
SD_VALS = dict(example=-0.025, medium_simple=-0.025, medium_complex=-0.1)

N_TRIALS     =  100  # number of independent trials to run for each test set
COMP_NS_VALS = [100] # number of sequence samples to collect per time point 
COMP_DT_VALS = [ 10] # time between sampling events (in discrete generations)

# NOTE: the values below are taken from the output of the second code cell of `HIV-analysis.ipynb`.
# If this pipeline is run on new or different data, these values must be updated!

# ALL SUBTYPES
TOTAL_VARS       = 350045 # number of possible mutations
TOTAL_NS_EPITOPE = 7155   # total number of nonsynonymous mutations in epitopes
TOTAL_NS_REV     = 4383   # total number of nonsynonymous reversions
TOTAL_NS_REV_EPI = 127    # total number of nonsynonymous reversions in epitopes

# # Subtype B only
# TOTAL_VARS       = 133601 # number of possible mutations
# TOTAL_NS_EPITOPE = 3385   # total number of nonsynonymous mutations in epitopes
# TOTAL_NS_REV     = 1736   # total number of nonsynonymous reversions
# TOTAL_NS_REV_EPI = 66     # total number of nonsynonymous reversions in epitopes

# # Subtype C only
# TOTAL_VARS       = 216444 # number of possible mutations
# TOTAL_NS_EPITOPE = 3545   # total number of nonsynonymous mutations in epitopes
# TOTAL_NS_REV     = 2647   # total number of nonsynonymous reversions
# TOTAL_NS_REV_EPI = 54     # total number of nonsynonymous reversions in epitopes

This notebook was prepared using:
python version 3.11.10 | packaged by conda-forge | (main, Oct 16 2024, 01:26:25) [Clang 17.0.6 ]
numpy version 1.26.4
scipy version 1.14.1
pandas version 2.2.2
matplotlib version 3.9.2
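
Because `PPT2LABEL` is maintained by hand alongside `B_PPTS` and `C_PPTS`, a quick consistency check can catch a missing or misspelled identifier before any figures are drawn. The sketch below repeats the definitions from the cell above so that it runs on its own. The helper name `missing_labels` is hypothetical and is not part of `figures.py` or `mplot.py`.

```python
# Definitions copied from the global variables cell so this check is self-contained
B_PPTS = ['700010470', '700010077', '700010058', '700010040', '700010607']
C_PPTS = ['706010164', '705010198', '705010185', '705010162', '704010042',
          '703010256', '703010159', '703010131', 'cap256']
PPT2LABEL = {'700010470': 'CH470', '700010077': 'CH77',  '700010058': 'CH58',
             '700010040': 'CH40',  '706010164': 'CH164', '705010198': 'CH198',
             '705010185': 'CH185', '705010162': 'CH162', '704010042': 'CH42',
             '703010256': 'CH256', '703010159': 'CH159', '703010131': 'CH131',
             '700010607': 'CH607', 'cap256':    'CAP256'}


def missing_labels(ppts, labels):
    """Return the identifiers in `ppts` that have no entry in `labels` (hypothetical helper)."""
    return [p for p in ppts if p not in labels]


# Identifiers without a label, and labels that no identifier list refers to
missing = missing_labels(B_PPTS + C_PPTS, PPT2LABEL)
unused = sorted(set(PPT2LABEL) - set(B_PPTS + C_PPTS))

print('missing labels:', missing)
print('unused labels: ', unused)
```

With the values listed in this notebook, both lists come back empty.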


In [None]:
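# Data file tags for CH505 and CH848 (only CH505 is used in this cell)
ch505 = '703010505-3-poly'
ch848 = '703010848-3-poly'
df = pd.read_csv('data/csv/%s.csv' % (ch505), comment='#', memory_map=True)

# Span between the first and last HXB2 indices, plus the number of distinct
# non-numeric HXB2 indices (e.g., insertions relative to HXB2)
int(df.iloc[-1].HXB2_index) - int(df.iloc[0].HXB2_index) + len(np.unique(df[df.HXB2_index.str.isnumeric()==False].HXB2_index))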
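
To make the arithmetic in the cell above easier to follow, here is a minimal sketch that applies the same expression to a toy table. The function name `hxb2_span_plus_insertions` and the toy index values are hypothetical; the sketch also assumes, as the cell does, that the first and last rows carry purely numeric HXB2 indices, since `int()` would otherwise raise an error.

```python
import numpy as np
import pandas as pd


def hxb2_span_plus_insertions(df):
    """Difference between the last and first HXB2 indices, plus the number
    of distinct non-numeric indices (hypothetical helper)."""
    first = int(df.iloc[0].HXB2_index)
    last = int(df.iloc[-1].HXB2_index)
    # Same filter as in the notebook cell: rows whose index is not purely numeric
    non_numeric = df[df.HXB2_index.str.isnumeric() == False].HXB2_index
    return last - first + len(np.unique(non_numeric))


# Toy table (made-up values): indices 1 to 4 with two lettered entries after index 2
toy = pd.DataFrame({'HXB2_index': ['1', '2', '2a', '2b', '3', '4']})
result = hxb2_span_plus_insertions(toy)
print(result)  # (4 - 1) + 2 = 5
```

Note that the expression counts the distance between the end points rather than the number of numeric indices, so the toy table with four numeric and two lettered indices yields 5 rather than 6.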

In [None]:
from importlib import reload
reload(mp)
reload(fig)

fig.plot_figure_ch505_ch848_circle()

In [None]:
from importlib import reload
reload(mp)
reload(fig)

fig.plot_figure_ch505_structure(filename='fig-ch505-structure.png')

In [None]:
from importlib import reload
reload(mp)
reload(fig)

fig.plot_figure_ch848_structure('fig-ch848-structure.png')

In [None]:
df_ch505 = pd.read_csv('data/csv/selection_RMs_common_mutations_SHIVCH505.csv')
df_ch848 = pd.read_csv('data/csv/selection_RMs_common_mutations_SHIVCH848.csv')

top_x = 20
top_muts = ['N130D', 'N279D', 'K302N', 'Y330H', 'N334S', 'H417R']

for df in [df_ch505, df_ch848]:
    rank_mean  = np.argsort(np.array(df['mean_S']))[::-1]
    rank_joint = np.argsort(np.array(df['RMs']))[::-1]
    
    m_count = 0
    j_count = 0
    print('')
    print('rank\tm mut\tm s\tm ct\tj mut\tj s\tj ct')
    for i in range(top_x):
        m_mut = df.iloc[rank_mean[i]]
        j_mut = df.iloc[rank_joint[i]]
        if str(m_mut.mutation) in top_muts:
            m_count += 1
        if str(j_mut.mutation) in top_muts:
            j_count += 1
        print('%d\t%s\t%.3f\t%d\t%s\t%.3f\t%d' % (i+1, m_mut.mutation, m_mut.mean_S, m_count, j_mut.mutation, j_mut.RMs, j_count))
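
The ranking logic in the cell above can be checked on a small example. The sketch below sorts a toy table in descending order with `np.argsort(...)[::-1]`, as the cell does, and keeps a running count of how many of the top-ranked mutations appear in a reference list. The helper `cumulative_hits`, the mutation names `A1T` and `G5R`, and all numerical values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with the same column names as the notebook's tables
toy = pd.DataFrame({
    'mutation': ['N130D', 'A1T', 'N279D', 'K302N', 'G5R'],
    'mean_S':   [0.05, 0.01, 0.04, 0.02, 0.03],
    'RMs':      [0.02, 0.06, 0.05, 0.01, 0.03],
})
top_muts = ['N130D', 'N279D', 'K302N']


def cumulative_hits(df, column, top_muts, top_x):
    """Running count of reference mutations among the top_x rows ranked by `column`."""
    order = np.argsort(np.array(df[column]))[::-1][:top_x]  # descending rank
    hits = np.isin(df['mutation'].to_numpy()[order], top_muts)
    return np.cumsum(hits)


m_counts = cumulative_hits(toy, 'mean_S', top_muts, top_x=3)
j_counts = cumulative_hits(toy, 'RMs', top_muts, top_x=3)
print(m_counts)  # ranked by mean_S: N130D, N279D, G5R -> [1 2 2]
print(j_counts)  # ranked by RMs:    A1T, N279D, G5R   -> [0 1 1]
```

The two arrays correspond to the `m ct` and `j ct` columns printed by the cell.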

In [None]:
test_frac = 0.04

df_ch505    = pd.read_csv('data/csv/enrichment_CH505_multiply_fraction.csv')
df_ch505_rm = pd.read_csv('data/csv/enrichment_grouped_SHIV_CH505_multiply_fraction.csv')
df_ch848    = pd.read_csv('data/csv/enrichment_CH848_multiply_fraction.csv')
df_ch848_rm = pd.read_csv('data/csv/enrichment_grouped_SHIV_CH848_multiply_fraction.csv')

df_ch505['P']    = 10**df_ch505['log10_P']
df_ch505_rm['P'] = 10**df_ch505_rm['log10_P']
df_ch848['P']    = 10**df_ch848['log10_P']
df_ch848_rm['P'] = 10**df_ch848_rm['log10_P']

print('CH505')
df_ch505_sub = df_ch505[df_ch505['fraction']==test_frac]
print(df_ch505_sub)
print(np.sum(df_ch505_sub['num_cutoff']))

print('\nSHIV.CH505')
df_ch505_rm_sub = df_ch505_rm[df_ch505_rm['fraction']==test_frac]
print(df_ch505_rm_sub)
print(np.sum(df_ch505_rm_sub['num_cutoff']))

print('\nCH848')
df_ch848_sub = df_ch848[df_ch848['fraction']==test_frac]
print(df_ch848_sub)
print(np.sum(df_ch848_sub['num_cutoff']))

print('\nSHIV.CH848')
df_ch848_rm_sub = df_ch848_rm[df_ch848_rm['fraction']==test_frac]
print(df_ch848_rm_sub)
print(np.sum(df_ch848_rm_sub['num_cutoff']))
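
The cell above selects rows with `df['fraction']==test_frac`, which relies on the stored value parsing to exactly the same floating-point number as the literal `0.04`. If that ever becomes a concern, for example when the fractions are computed rather than typed, a tolerance-based comparison with `np.isclose` is one possible alternative. The sketch below uses a toy table with made-up values; the helper name `select_fraction` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy enrichment table (hypothetical values) with the notebook's column names
toy = pd.DataFrame({
    'fraction':   [0.01, 0.04, 0.04, 0.1],
    'log10_P':    [-1.0, -2.0, -3.0, 0.0],
    'num_cutoff': [5, 7, 3, 20],
})
toy['P'] = 10**toy['log10_P']  # same conversion as in the cell above


def select_fraction(df, frac):
    """Rows whose 'fraction' matches `frac` up to floating-point tolerance."""
    return df[np.isclose(df['fraction'], frac)]


sub = select_fraction(toy, 0.04)
total = int(np.sum(sub['num_cutoff']))
print(sub)
print(total)  # 7 + 3 = 10
```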

In [None]:
df_ch505 = pd.read_csv('data/csv/selection_RMs_SHIVCH505.csv')
df_ch848 = pd.read_csv('data/csv/selection_RMs_SHIVCH848.csv')

top_n = 20

for df in [df_ch505, df_ch848]:
    mut_unique = np.array(np.unique(df['mutation']))
    mut_count = np.array([len(np.unique(df[df['mutation']==m]['individual'])) for m in mut_unique])
    s_avg = np.array([np.mean(df[df['mutation']==m]['selection']) for m in mut_unique])
    s_rank = np.argsort(s_avg)[::-1]
    mut_sort = mut_unique[s_rank[:top_n]]
    count_sort = mut_count[s_rank[:top_n]]
    s_sort = s_avg[s_rank[:top_n]]
    print('mut\t# RMs\tavg s')
    for i in range(top_n):
        print('%s\t%d\t%.2f' % (mut_sort[i], count_sort[i], s_sort[i]))
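
The per-mutation summary in the cell above scans the full table once for every unique mutation. A `groupby` with named aggregation is one way to compute the same two quantities, the number of distinct individuals and the mean selection coefficient, in a single pass. The sketch below is an assumed alternative rather than the code used for the manuscript; the toy values, the helper `summarize_mutations`, and the output column names `n_rms` and `avg_s` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values) with the notebook's column names
toy = pd.DataFrame({
    'mutation':   ['A1T', 'A1T', 'G5R', 'G5R', 'G5R', 'K9N'],
    'individual': ['RM1', 'RM2', 'RM1', 'RM1', 'RM3', 'RM2'],
    'selection':  [0.02, 0.04, 0.01, 0.03, 0.02, 0.05],
})


def summarize_mutations(df, top_n):
    """Top `top_n` mutations by mean selection, with the number of distinct individuals."""
    summary = df.groupby('mutation').agg(
        n_rms=('individual', 'nunique'),  # distinct individuals carrying the mutation
        avg_s=('selection', 'mean'),      # mean selection coefficient
    )
    return summary.sort_values('avg_s', ascending=False).head(top_n)


summary = summarize_mutations(toy, top_n=2)
print(summary)
```

When several mutations share the same mean, the order among them may differ from the `np.argsort`-based ranking in the cell above.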
    print('')

In [47]:
from importlib import reload
reload(mp)
reload(fig)

pdata = {
    's_files': ['data/csv/selection_vs_num-of-RMs_CH505.csv', 'data/csv/selection_vs_num-of-RMs_CH848.csv'],
    'tags':    ['CH505', 'CH848']
}

fig.plot_selection_vs_rms(**pdata)

In [None]:
from importlib import reload
reload(mp)
reload(fig)

pdata = {
    's_files':    ['data/csv/fig3A_selection_vs_time.csv', 'data/csv/figS4A_selection_vs_time.csv'],
    'traj_files': ['data/csv/fig3C_trajectory.csv', 'data/csv/figS4C_trajectory.csv'],
    'tags':       ['CH505', 'CH848']
}

fig.plot_trajectory_selection(**pdata)

In [35]:
from importlib import reload
reload(mp)
reload(fig)

pdata = {
    'traj_files': ['data/csv/fig3C_trajectory.csv', 'data/csv/figS4C_trajectory.csv'],
    'tags':       ['CH505', 'CH848']
}

fig.plot_trajectory_expanded(**pdata)

In [None]:
from importlib import reload
reload(mp)
reload(fig)

pdata = {
    's_files':    ['data/csv/selection_RMs_SHIVCH505.csv', 'data/csv/selection_RMs_SHIVCH848.csv'],
    'traj_files': ['data/csv/trajectories_RMs_SHIVCH505.csv', 'data/csv/trajectories_RMs_SHIVCH848.csv'],
    'tags':       ['SHIV.CH505', 'SHIV.CH848'],
    'rms':        ['RM5695', 'RM6167']
}

fig.plot_trajectory_selection_shiv(**pdata)

In [None]:
from importlib import reload
reload(mp)
reload(fig)

pdata = {
    'f_files':    ['data/csv/fig6A_fitness_change.csv', 'data/csv/figS6A_fitness_change.csv'],
    'tags':       ['SHIV.CH505', 'SHIV.CH848'],
    't_breadth':  dict(RM5695=16*7, RM6070=8*7, RM6163=80*7, RM6167=64*7),
    'use_breadth': True,
}

fig.plot_fitness_gain_v_time(**pdata)

In [36]:
from importlib import reload
reload(mp)
reload(fig)

pdata = {
    'f_files':    ['data/csv/fitness_change_w_categories_CH505.csv', 'data/csv/fitness_change_w_categories_CH848.csv'],
    'tags':       ['SHIV.CH505', 'SHIV.CH848'],
    # 't_breadth':  dict(RM5695=16*7, RM6070=8*7, RM6163=80*7, RM6167=64*7),
    'use_breadth': True,
}

fig.plot_fitness_gain_v_time_categories(**pdata)

In [None]:
from importlib import reload
reload(mp)
reload(fig)

pdata = {
    'f_files': ['data/csv/fig4A_fitness_comparison.csv', 'data/csv/fig4B_fitness_comparison.csv'],
    'tags':    ['CH505', 'CH848']
}

fig.plot_fitness_comparison(**pdata)

In [None]:
from importlib import reload
reload(mp)
reload(fig)

pdata = {
    'f_files': ['data/csv/fig4A_fitness_comparison.csv', 'data/csv/fig4B_fitness_comparison.csv'],
    'tags':    ['CH505', 'CH848']
}

fig.plot_fitness_comparison_horizontal(**pdata)

In [None]:
from importlib import reload
reload(mp)
reload(fig)

pdata = {
    'f_files':    ['data/csv/figS3A_fitness_comparison_using_shuffled_seq.csv', 'data/csv/figS3B_fitness_comparison_using_shuffled_seq.csv'],
    'tags':       ['CH505', 'CH848'],
    'filename':   'fig-f-compare-h-shuffle-wide',
    'use_breadth': False
}

fig.plot_fitness_comparison_horizontal(**pdata)

In [None]:
df = pd.read_csv('data/csv/fig3A_selection_vs_time.csv')
df_sub = df[(df['mutation'].str.contains('-142')) & (df['category']=='CH103')]
print(len(df[df['category']=='CH103']), len(df_sub))
df_sub.head()
#print(len(np.unique(df_sub['mutation'])))