# <font color = red> HIV Analysis
</font> Here we combine HIV sequence data from [the Los Alamos National Laboratory HIV Sequence Database](https://www.hiv.lanl.gov/content/index) and immunological data to investigate HIV evolution across 13 individuals. This data is contained in the data/HIV/ directory.

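We also use processed data from the [paper-MPL-inference repository](https://github.com/bartonlab/paper-MPL-inference). Each entry below gives the source file, the destination in this repository, and its content: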
- epitopes.csv $\to$ epitopes.csv: information about epitopes 
- src-MPL-HIV.tar.gz/\*-poly-seq2state.dat $\to$ input/sequence/\*-poly-seq2state.dat: processed sequences readable by MPL
- src-MPL-HIV.tar.gz/Zanini-extended $\to$ input/Zanini-extended: mutation matrix for HIV data
- processed/\*-index.csv $\to$ notrait/processed/\*-index.csv: information about all sites
- interim/\*-poly.csv $\to$ notrait/interim/\*-poly.csv: information about polymorphic sites, which will be modified to interim/*-poly.csv when finding escape sites

## Contents
- ### [Libraries and variables](#lib)
- ### Data analysis
    - #### [Find escape sites](#escape)
    - #### [Get recombination rate over time](#VLr)
    - #### [Run C++ MPL code](#mpl)
    - #### [Deal with data](#deal)
    - #### [Calculate Δs_ij](#sij)

## <a id='lib'></a> Libraries and variables

In [1]:
print('This notebook was prepared using:')

import os
import sys
print('python version %s' % sys.version)

import numpy as np
print('numpy version %s' % np.__version__)

import pandas as pd
print('pandas version %s' % pd.__version__)

from scipy import integrate
import scipy.interpolate as sp_interpolate

import subprocess

import re
from math import isnan

import HIV


# GLOBAL VARIABLES

NUC = ['-', 'A', 'C', 'G', 'T']

ppt = ['700010040', '700010058', '700010077', '700010470', '700010607', '703010131', '703010159', 
       '703010256', '704010042', '705010162', '705010185', '705010198', '706010164']

tags = ['700010040-3', '700010040-5', '700010058-3', '700010058-5', '700010077-3', '700010077-5',
        '700010470-3', '700010470-5', '700010607-3', '700010607-5', '703010131-3', '703010131-5', 
        '703010159-3', '703010159-5', '703010256-3', '703010256-5', '704010042-3', '704010042-5', 
        '705010162-3', '705010162-5', '705010185-3', '705010185-5', '705010198-3', '705010198-5', 
        '706010164-3', '706010164-5']  

# GitHub directories
HIV_DIR = 'data/HIV'
MPL_DIR = 'src'
SIM_DIR = 'data/simulation'
FIG_DIR = 'figures'

This notebook was prepared using:
python version 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:26:08) [Clang 14.0.6 ]
numpy version 1.24.2
pandas version 1.5.3


## Data analysis

###  <a id='escape'></a>Find escape sites
An escape site is a polymorphic site that carries a nonsynonymous mutation in the reading frame of its epitope.

Steps to find escape sites:
1. find all polymorphic sites within one epitope
2. check if these polymorphic sites are nonsynonymous in the reading frame of its epitope
3. check if the epitope is independent using reduced row echelon form (RREF) for the integrated covariance matrix
    - independent,
        - sites containing nonsynonymous mutations $\to$ escape sites (or trait sites)
        - epitope $\to$ escape group (or trait group);
    - dependent,
        - sites $\to$ special sites
        - epitope $\neq$ escape group (or trait group). 
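
Step 2 above can be illustrated with a minimal sketch, assuming the standard genetic code and a codon that is already aligned to the reading frame of the epitope. The helper names `translate` and `is_nonsynonymous` are hypothetical and are not part of the `HIV` module, which may handle gaps and reading frames differently.

```python
# Standard genetic code, written in the conventional T, C, A, G ordering.
BASES = 'TCAG'
AMINO_ACIDS = ('FFLLSSSSYY**CC*W'
               'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR'
               'VVVVAAAADDEEGGGG')
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(codon):
    """Return the amino acid for a codon; 'X' if it contains a gap or unknown base."""
    return CODON_TABLE.get(codon.upper(), 'X')


def is_nonsynonymous(tf_codon, position, nucleotide):
    """Check whether placing `nucleotide` at `position` (0, 1 or 2) of the
    TF codon changes the encoded amino acid."""
    mutant = tf_codon[:position] + nucleotide + tf_codon[position + 1:]
    return translate(mutant) != translate(tf_codon)


print(is_nonsynonymous('GAA', 2, 'G'))  # GAA and GAG both encode E
print(is_nonsynonymous('GAA', 0, 'A'))  # GAA encodes E, AAA encodes K
```

In this sketch a gap in the mutant codon maps to `'X'` and therefore counts as an amino acid change; that is an assumption made for illustration only.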

Special sites: sites in dependent epitopes, whose binary traits are correlated or anticorrelated with some mutations. These sites carry nonsynonymous mutations in the reading frame of their epitope but are not escape sites because the number of escape sites in their epitope is less than 3.

#### Find all possible binary traits

In [2]:
import importlib
importlib.reload(HIV)
    
min_n = 2

for tag in tags:
    HIV.find_nons_mutations(tag)

CH040-3 has 4 binary traits, epitope AR9 : [18, 19, 20, 21], epitope EE9 : [245, 246], epitope RI9 : [257, 258, 259], epitope SR10 : [286, 287, 288, 290, 291], 
CH040-5 has 1 binary traits, epitope CR9 : [31, 32], 
CH058-3 has 1 binary traits, epitope EL9 : [60, 61, 62, 63], 
CH058-5 has 1 binary traits, epitope TW10 : [16, 17], 
CH077-3 has 6 binary traits, epitope DI9 : [28, 31, 32, 34, 35], epitope TL9 : [82, 83, 84, 85], epitope QF10 : [87, 88, 89], epitope DR9 : [122, 123, 124, 125], epitope DG9 : [149, 150, 151, 152, 153, 154, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171], epitope KF9 : [186, 187, 188, 189, 190], 
CH077-5 has 2 binary traits, epitope IW9 : [1], epitope TW10 : [4, 5], 
CH470-3 has 3 binary traits, epitope RK10 : [0, 1, 2, 3], epitope RI9 : [302, 303, 304, 305], epitope RK11 : [352, 353, 355], 
CH470-5 has 4 binary traits, epitope GK9 : [8, 9], epitope DL9 : [23, 24, 25], epitope EK9 : [78, 79, 81], epitope RL9 : [149, 150, 151, 15

#### Determine if this binary trait is independent

Run `g++ src/rref.cpp -O3 -o rref.out` to compile the C++ code.
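
The independence test rests on which columns of the RREF are pivot columns: a binary trait whose column is not a pivot column is a linear combination of other columns, so its effect cannot be separated from theirs. Below is a minimal sketch with exact fractions, assuming (as in the check later in this notebook) that the trait columns are the last columns of the integrated covariance matrix. The function `pivot_columns` and the toy matrices are hypothetical; the notebook itself relies on `src/rref.cpp` and the `HIV` module.

```python
from fractions import Fraction


def pivot_columns(matrix):
    """Return the pivot column indices of the RREF of `matrix` (a list of rows)."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    pivots = []
    r = 0  # row that receives the next pivot
    for c in range(n_cols):
        # Find a row at or below r with a nonzero entry in column c.
        pr = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pr is None:
            continue  # no pivot in this column
        rows[r], rows[pr] = rows[pr], rows[r]
        pivot_value = rows[r][c]
        rows[r] = [x / pivot_value for x in rows[r]]
        # Eliminate column c from every other row.
        for i in range(n_rows):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    return pivots


# Toy matrices: two sites followed by one trait column (last column).
dependent = [[2, 0, 2],
             [0, 1, 1],
             [2, 1, 3]]    # last column = column 0 + column 1
independent = [[2, 0, 1],
               [0, 1, 0],
               [1, 0, 3]]

for name, m in [('dependent', dependent), ('independent', independent)]:
    trait_column = len(m) - 1
    print(name, trait_column in pivot_columns(m))
```

Real covariance matrices contain floating-point values, so a practical implementation would need a tolerance when testing entries against zero.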

In [3]:
import importlib
importlib.reload(HIV)

# Use sympy_matrix.rref() to get RREF
for tag in tags:
    HIV.determine_dependence_new(tag)

CH040-3 : trait EE9, (2 NS), linked variants: 245T, 246A, 
CH077-5 : trait IW9, (2 NS), linked variants: 1C, 1T, 
CH077-5 : trait TW10, (2 NS), linked variants: 4A, 5A, 
CH470-3 : trait RI9, (4 NS), linked variants: 302C, 303A, 304T, 305T, 
CH131-3 : trait KC9, (2 NS), linked variants: 46G, 
CH131-3 : trait SI9, (2 NS), linked variants: 560C, 561A, 
CH256-3 : trait EW10, (2 NS), linked variants: 46A, 
CH256-3 : trait TW9, (2 NS), linked variants: 316A, 
CH256-3 : trait QR9, (3 NS), linked variants: 423G, 424A, 
CH042-3 : trait NL9, (7 NS), linked variants: 660-, 664A, 665C, 
CH162-3 : trait KM9, (2 NS), linked variants: 327G, 
CH162-5 : trait VD9, (3 NS), linked variants: 7C, 8A, 
CH162-5 : trait GN9, (3 NS), linked variants: 30G, 31C, 33G, 
CH162-5 : trait EW9, (6 NS), linked variants: 33A(GN9, WT), 
CH164-3 : trait NY9, (4 NS), linked variants: 89T, 
CH164-3 : trait SY9, (7 NS), linked variants: 182A, 183-, 
CH164-5 : trait EA9, (2 NS), linked variants: 51T, 


#### Output the results
Trait sites: mutations at these sites are nonsynonymous for the epitope, and the epitope is independent.\
Special sites: mutations at these sites are nonsynonymous for the epitope, but the epitope is dependent, so it is hard to disentangle the effects.

In [4]:
import importlib
importlib.reload(HIV)

n_independent = 0
n_dependent = 0
TF_correct = True
for tag in tags:
    traitsite = HIV.read_file('traitsite/traitsite-'+tag+'.dat')
    TFseq     = HIV.read_file_s('traitseq/traitseq-'+tag+'.dat')

    df = pd.read_csv('%s/interim/%s-poly.csv' %(HIV_DIR,tag), comment='#', memory_map=True)

    # Get special sites (nonsynonymous mutation for epitope but not trait sites)
    df_epi = df[(df['epitope'].notna()) & (df['escape'] == True)]
    epitopes = df_epi['epitope'].unique()
    trait_index = []
    sp_index = []
    nonsy_sites = []
    
    for n in range(len(epitopes)):
        df_n = df_epi[(df_epi['epitope'] == epitopes[n]) ]
        nonsy_sites_n = df_n['polymorphic_index'].unique()
        nonsy_sites.append(nonsy_sites_n)
        if any(np.array_equal(nonsy_sites_n, trait) for trait in traitsite):
            trait_index.append(n)
        else:
            sp_index.append(n)

    n_independent += len(trait_index)
    n_dependent   += len(sp_index)
    # if len(traitsite)!= 0:
    #     print(f'CH{tag[-5:]}|{len(traitsite)}', end = '|')
    #     for nn in trait_index:
    #         trait_epi = epitopes[nn]
    #         epiname = trait_epi[0] + trait_epi[-1] + str(len(trait_epi))
    #         print(f'<br>{epiname}: {nonsy_sites[nn]}', end = '')
    #     print('',end = '|')
    # else:
    #     print(f'CH{tag[-5:]}|0|No binary trait', end = '|')

    # if len(sp_index)!= 0:
    #     for nn in sp_index:
    #         sp_epi = epitopes[nn]
    #         epiname = sp_epi[0] + sp_epi[-1] + str(len(sp_epi))
    #         print(f'<br>{epiname}: {nonsy_sites[nn]}', end = '')
    #     print()
    # else:
    #     print(f'No special sites')

print(f'There are totally {n_independent + n_dependent} epitopes, \
{n_independent} of them are independent, which can be seen as binary traits,\
the other {n_dependent} are dependent.')

There are totally 71 epitopes, 37 of them are independent, which can be seen as binary traits,the other 34 are dependent.


tag|number of escape groups|escape sites|special sites
:----:|:----:|:---|:-----
CH040-3|3|AR9: [18 19 20 21]<br>RI9: [257 258 259]<br>SR10: [286 287 288 290 291]|EE9: [245 246]
CH040-5|1|CR9: [31 32]|KL9: [39]<br>HT9: [104]
CH058-3|1|EL9: [60 61 62 63]|No special sites
CH058-5|1|TW10: [16 17]|IW9: [13]
CH077-3|6|DI9: [28 31 32 34 35]<br>TL9: [82 83 84 85]<br>QF10: [87 88 89]<br>DR9: [122 123 124 125]<br>DG9: [149 150 151 152 153 154 156 157 <br>158 159 160 161 162 163 164 165 166 <br>167 168 169 170 171]<br>KF9: [186 187 188 189 190]|No special sites
CH077-5|0|No binary trait|IW9: [1]<br>TW10: [4 5]
CH470-3|2|RK10: [0 1 2 3]<br>RK11: [352 353 355]|LP9: [40]<br>RI9: [302 303 304 305]<br>VK9: [334]
CH470-5|4|GK9: [8 9]<br>DL9: [23 24 25]<br>EK9: [78 79 81]<br>RL9: [149 150 151 152]|QK9: [124]
CH607-3|1|KY11: [198 199 200 201 202]|No special sites
CH607-5|0|No binary trait|No special sites
CH131-3|1|EV11: [618 619 620 621 624]|KC9: [45 46]<br>FY10: [47]<br>VV9: [76]<br>CI9: [189]<br>SI9: [560 561]
CH131-5|1|VF9: [14 15 16 19]|RY9: [259]
CH159-3|2|PL9: [370 371 372 374 375 376]<br>RD9: [417 418 420 421 422 423 424]|No special sites
CH159-5|0|No binary trait|No special sites
CH256-3|3|EK9: [28 29 30 31 32 33]<br>RN9: [345 346 347 348 349]<br>LL9: [362 363 364 366]|EW10: [45 46]<br>TW9: [316 317]<br>QR9: [423 424 427]
CH256-5|1|PS9: [169 170 171 172 174 175 176 177]|DA9: [42]<br>NA9: [208]
CH042-3|3|DR9: [95 96]<br>KY9: [200 201 202 203 204 205 207 208 210]<br>RY9: [765 766]|NL9: [660 661 662 663 664 665]
CH042-5|1|QP9: [29 30]|NQ9: [125]
CH162-3|3|LI9: [33 34]<br>RA9: [290 291 292 293 294 295 296 <br>297 298 299 300 301 302 303 <br>304 305 306 307 308 309 310 <br>311 312 313 314 315 316 317]<br>EH9: [334 335 337]|KM9: [327 328]
CH162-5|0|No binary trait|VD9: [6 7 8]<br>GN9: [30 31 33]<br>EW9: [38 39 40 42 43]<br>VL9: [155]
CH185-3|0|No binary trait|No special sites
CH185-5|1|GY9: [2 3 4 5 6]|No special sites
CH198-3|1|KF9: [194 195 196]|No special sites
CH198-5|0|No binary trait|No special sites
CH164-3|1|EV11: [453 454 455 456 457 458]|NY9: [86 87 88 89]<br>SY9: [182 183 184 185 186 187 188]<br>KY11: [463]
CH164-5|0|No binary trait|PF9: [44]<br>EA9: [50 51]<br>GW10: [170]

#### Check
Check that the TF sequences and trait distances are correct.

In [5]:
import importlib
importlib.reload(HIV)

TF_correct = True
for tag in tags:
    traitsite = HIV.read_file('traitsite/traitsite-'+tag+'.dat')
    TFseq     = HIV.read_file_s('traitseq/traitseq-'+tag+'.dat')
    traitdis  = HIV.read_file('traitdis/traitdis-'+tag+'.dat')
    df = pd.read_csv('%s/interim/%s-poly.csv' %(HIV_DIR,tag), comment='#', memory_map=True)
    
    for i in range(len(traitsite)):        
        for j in range(len(traitsite[i])):
            # Check if there is something wrong with wild type allele
            n_ij   = df[(df['polymorphic_index'] == traitsite[i][j]) & (df['escape'] == False)]
            
            if len(n_ij) != len(TFseq[i][j]):
                TF_correct = False
                print('there is something wrong with TF sequence for %s'%tag)
            for ii in range(len(n_ij)):
                TF     = n_ij.iloc[ii].nucleotide
                TF_nuc = NUC.index(TF)
                if TF_nuc not in TFseq[i][j]:
                    TF_correct = False
                    print('there is something wrong with TF sequence for %s'%tag)
                
        if len(traitsite[i]) != len(traitdis[i])+1:
            TF_correct = False
            print('there is something wrong with the trait distance for %s'%tag)

if TF_correct:
    print('no problem')


no problem


Check whether any trait sites carry synonymous mutations.

In [6]:
for tag in tags:
    traitsite = HIV.read_file('traitsite/traitsite-'+tag+'.dat')
    df = pd.read_csv('%s/interim/%s-poly.csv' %(HIV_DIR,tag), comment='#', memory_map=True)
    for i in range(len(traitsite)):
        for j in range(len(traitsite[i])):
            n_ij   = df[df['polymorphic_index'] == traitsite[i][j]]
            for a in range(len(n_ij)):
                if n_ij.iloc[a].escape == False and n_ij.iloc[a].nucleotide != n_ij.iloc[a].TF:
                    print(f'CH{tag[-5:]} {n_ij.iloc[a].polymorphic_index} {n_ij.iloc[a].nucleotide}')

CH077-3 35 C
CH077-3 161 A
CH162-3 291 T
CH162-3 295 T


### <a id='VLr'></a> Get recombination rate over time

Write the recombination rate files (using a VL-dependent $r$ instead of a constant $r$):\
$r = 1.722 \times 10^{-10} \times \mathrm{VL} + 1.39 \times 10^{-5}$
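
As a worked example of this relation, the sketch below interpolates the viral load linearly between measurements, holds it constant outside the measured range (mirroring the `fill_value` choice in the cell below), and converts it to a recombination rate. The helper names and the toy measurements are hypothetical, and the input is assumed to be $\log_{10}$ viral load, as in the cell below.

```python
from bisect import bisect_right


def interp_clamped(t, times, values):
    """Linear interpolation; outside [times[0], times[-1]] return the edge value."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    k = bisect_right(times, t)  # times[k-1] <= t < times[k]
    t0, t1 = times[k - 1], times[k]
    v0, v1 = values[k - 1], values[k]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def recombination_rate(viral_load):
    """VL-dependent recombination rate r = 1.722e-10 * VL + 1.39e-5."""
    return 1.722e-10 * viral_load + 1.39e-5


# Toy data (not from the data set): days and log10 viral load.
vl_times = [0, 10, 20]
log10_vl = [5.0, 4.0, 4.5]
viral_load = [10 ** x for x in log10_vl]

for day in [0, 5, 30]:
    vl = interp_clamped(day, vl_times, viral_load)
    print(day, vl, recombination_rate(vl))
```

With a viral load of $10^5$, the rate is $1.722 \times 10^{-5} + 1.39 \times 10^{-5} = 3.112 \times 10^{-5}$.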

In [7]:

def get_times(tag):
    seq      = np.loadtxt('%s/input/sequence/%s-poly-seq2state.dat'%(HIV_DIR,tag))
    times = []
    for i in range(len(seq)):
        times.append(seq[i][0])
    return np.unique(times)


for tag in tags:

    ppt_id = tag[:9]   # participant ID; a separate name avoids overwriting the global list `ppt`
    
    df_vl_raw    = pd.read_csv('%s/virus load/%s.csv' %(HIV_DIR,ppt_id), header=None)
    df_vl_raw.columns = ['time', 'virus_load']
    df_vl = df_vl_raw.sort_values(by='time', ascending=True)

    sample_times = get_times(tag)
    
    times = [int(i) for i in df_vl['time'].values]
    virus_load = [np.power(10, i) for i in df_vl['virus_load'].values]
    vl_sample = np.zeros(len(sample_times))
    
#     if sample_times[-1] > times[-1]:
#         print(f'CH{tag[-5:]} last time point is {int(sample_times[-1])}, the last point detected VL is {times[-1]}')
    #     print(f'there are some time points have no VL')

    time_min = np.min([int(times[0]),sample_times[0]])
    time_max = np.max([int(times[-1]),sample_times[-1]])
    whole_time = np.linspace(time_min,time_max,int(time_max-time_min+1))
    
    interpolation = lambda a,b: sp_interpolate.interp1d(a,b,kind='linear',fill_value=(virus_load[0], virus_load[-1]), bounds_error=False)
    AllVL = interpolation(times, virus_load)(whole_time)

    for t_index in range(len(sample_times)):
        t = sample_times[t_index]
        index = list(whole_time).index(t)
        vl_sample[t_index] = AllVL[index]
        
    # recombination rate over time
    f = open('%s/input/r_rates/r-%s.dat'%(HIV_DIR,tag), 'w')
    f.write('%s\n'%'\t'.join([str((1.722*vl_sample[i]/1000000+0.139)/10000) for i in range(len(vl_sample))]))
    f.close()

### <a id='mpl'></a> Run C++ MPL code

"src/HIV.sh" : running MPL code for all HIV data with a VL-dependent r

In [8]:
f = open("src/HIV.sh",'w')
f.write('g++ main.cpp inf.cpp io.cpp -O3 -mcpu=apple-a14 -std=c++11 -lgsl -lgslcblas -o mpl\n')

for tag in tags:
            
    f.write('./mpl -d ../data/HIV -i input/sequence/%s-poly-seq2state.dat '%tag)
    f.write('-o output/sc-%s.dat -g 10 -m input/Zanini-extended.dat -r input/r_rates/r-%s.dat '%(tag,tag))
    f.write('-e input/traitsite/traitsite-%s.dat -es input/traitseq/traitseq-%s.dat '%(tag,tag))
    f.write('-ed input/traitdis/traitdis-%s.dat -sc output/covariance/c-%s.dat\n'%(tag,tag))
    
f.close()

"src/HIV_notrait.sh" : running the previous MPL code (no trait group) for all HIV data\
In this case, we use the same regularization value (here $\gamma = g/N$). \
There is no recombination part since $R_{fl} = 0$ for individual selection.
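
A quick arithmetic check of the regularization, assuming that the trait-aware code takes $\gamma$ directly through `-g` (as in `src/HIV.sh`) while the previous code takes $g$ and $N$ separately. The variable names are for illustration only.

```python
# Values passed to the previous (no-trait) code: -g 1e4 -N 1e3
g, N = 1e4, 1e3
gamma_notrait = g / N

# Value passed to the trait-aware code: -g 10
gamma_trait = 10

print(gamma_notrait == gamma_trait)
```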

In [9]:
f = open("src/HIV_notrait.sh",'w')
f.write('g++ notrait/main.cpp notrait/inf.cpp notrait/io.cpp -O3 -mcpu=apple-a14 -lgsl -lgslcblas -o mpl\n')
for tag in tags:
    f.write('./mpl -d ../data/HIV -i input/sequence/%s-poly-seq2state.dat '%tag)
    f.write('-o notrait/output/sc-%s.dat -g 1e4 -N 1e3 -m input/Zanini-extended.dat\n'%tag)
f.close()

"src/HIV_noR.sh" : running the new MPL code without the recombination term for all HIV data\
We set the recombination rate rr to 0 to remove the recombination flux $R_{fl}$

In [10]:
f = open("src/HIV_noR.sh",'w')
f.write('g++ main.cpp inf.cpp io.cpp -O3 -mcpu=apple-a14 -std=c++11 -lgsl -lgslcblas -o mpl\n')

for tag in tags:
    f.write('./mpl -d ../data/HIV -i input/sequence/%s-poly-seq2state.dat '%tag)
    f.write('-o noR/output/sc-%s.dat -g 10 -m input/Zanini-extended.dat -rr 0 '%(tag))
    f.write('-e input/traitsite/traitsite-%s.dat -es input/traitseq/traitseq-%s.dat '%(tag,tag))
    f.write('-ed input/traitdis/traitdis-%s.dat\n'%tag)
f.close()

### <a id='deal'></a> Deal with data

#### Check if all inferred binary traits are independent
Use the C++ code to calculate the RREF first (go to the covariance file directory, run `g++ rref.cpp -O3 -o rref.out` to compile the code, and then run the Python script).

In [11]:
import importlib
importlib.reload(HIV)

status = subprocess.run('g++ src/rref.cpp -O3 -std=c++11 -o rref.out', shell=True)

HIV.get_independent()

status = subprocess.run('rm rref.out', shell=True)

Done!


In [12]:
import importlib
importlib.reload(HIV)

Independence = True

for tag in tags:

    traitsite = HIV.read_file('traitsite/traitsite-'+tag+'.dat')
    ne = len(traitsite)
    if ne == 0:   # nothing to check for tags without binary traits
        continue
        
    co_rr     = np.loadtxt('data/HIV/output/covariance/rr-c-%s.dat'%tag)
    ll = len(co_rr)
    
    pivots = []
    for row in range(ll):
        for col in range(ll):
            if co_rr[row, col] != 0:
                pivots.append(col)
                break
    
    for n in range(ne):
        column_index = ll-ne+n
        if column_index not in pivots:
            Independence = False
            print(f'{tag} still has some dependent binary trait, {traitsite[n]}')

if Independence == True:
    print('All inferred binary traits are independent')

All inferred binary traits are independent


#### Dealing with mpl results 
Normalize the selection coefficients so that the TF nucleotide at each site has a selection coefficient of zero, and record the information in these two CSV files.
- /analysis/'tag'-analyze.csv 
    - coefficients in 2 cases (old: without escape part; new: with escape part)
    - allele frequencies
- /group/escape_group-'tag'.csv 
    - escape coefficients
    - escape frequencies
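
The normalization described above amounts to subtracting the coefficient of the TF nucleotide from every allele at the same site. A minimal sketch, assuming the coefficients for one site are held in a dictionary keyed by nucleotide; the function `normalize_to_tf` and the numbers are hypothetical and do not reproduce `HIV.analyze_result`.

```python
def normalize_to_tf(coefficients, tf_nucleotide):
    """Shift the coefficients at one site so that the TF nucleotide has s = 0."""
    s_tf = coefficients[tf_nucleotide]
    return {nuc: s - s_tf for nuc, s in coefficients.items()}


# Illustrative values for a single site (not inferred results).
site_coefficients = {'A': 0.01, 'C': -0.02, 'G': 0.03}
normalized = normalize_to_tf(site_coefficients, 'A')
print(normalized)
```

Differences between alleles at a site are unchanged by this shift; only the reference point moves to the TF nucleotide.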

In [13]:
import importlib
importlib.reload(HIV)

for tag in tags:
    HIV.analyze_result(tag,True)
    HIV.analyze_result(tag,False)   # without recombination flux

print('Successfully create the analysis and escape group file')

Successfully create the analysis and escape group file


### <a id='sij'></a> Calculate Δs_ij
To quantify the linkage between different variants ($\Delta \hat s_{ij}$), we modify the input sequences by converting mutant variant $i$ into the wild-type variant. 

$\Delta \hat s_{ij} = \hat s_j - \hat s_j^{\setminus i}$
- $\hat s_j$: selection coefficient for variant $j$
- $\hat s_j^{\setminus i}$: selection coefficient for variant $j$, calculated after replacing variant $i$ with the WT

The modifications for individual loci and escape groups are different:
- individual loci part
   - replace all mutant variant $i$ by its transmitted/founder (TF) nucleotide for all sequences at all time points. 
- escape part
   - change all mutant variants within one epitope into wild type.
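
The two modifications can be sketched as follows, assuming each sequence is a list of nucleotide indices into `NUC`. The functions `revert_variant`, `revert_epitope` and `delta_s` are hypothetical stand-ins for what `HIV.modify_seq` and `HIV.cal_sij` do, and the epitope version here simply resets every listed site to its TF allele, which is an assumption.

```python
NUC = ['-', 'A', 'C', 'G', 'T']


def revert_variant(sequences, site, mutant, tf):
    """Individual locus: replace `mutant` at `site` by the TF allele in every sequence."""
    reverted = []
    for seq in sequences:
        new_seq = list(seq)  # copy, so the input is left untouched
        if new_seq[site] == mutant:
            new_seq[site] = tf
        reverted.append(new_seq)
    return reverted


def revert_epitope(sequences, sites, tf_alleles):
    """Escape group: set every listed epitope site back to its TF allele."""
    reverted = []
    for seq in sequences:
        new_seq = list(seq)
        for site, tf in zip(sites, tf_alleles):
            new_seq[site] = tf
        reverted.append(new_seq)
    return reverted


def delta_s(s_j, s_j_without_i):
    """Delta s_ij = s_j - s_j^{\\i}."""
    return s_j - s_j_without_i


# Toy sequences over three sites (not real data).
A, C, G, T = (NUC.index(n) for n in 'ACGT')
seqs = [[A, C, G], [A, T, G], [C, T, G]]

print(revert_variant(seqs, site=1, mutant=T, tf=C))
print(revert_epitope(seqs, sites=[0, 1], tf_alleles=[A, C]))
```

After the modified sequences are written out, the inference is rerun on them and `delta_s` is applied to the two sets of inferred coefficients.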

In [14]:
import importlib
importlib.reload(HIV)

# create the path
for tag in tags:
    # input sequence
    path = 'data/HIV/input/sequence/'+ tag
    folder = os.path.exists(path)
    if not folder:
        os.makedirs(path)
    # output inferred coefficients
    path = 'data/HIV/output/'+ tag
    folder = os.path.exists(path)
    if not folder:
        os.makedirs(path)

g = open("src/HIV_sij.sh",'w')
g.close()

for tag in tags:
    HIV.modify_seq(tag) # Add content to shell script here

Dealing with these data.

In [15]:
'''Need to run'''

import importlib
importlib.reload(HIV)

for tag in tags:
    HIV.cal_sij(tag)