# <font color="red">HIV Analysis</font>
Here we combine HIV sequence data from [the Los Alamos National Laboratory HIV Sequence Database](https://www.hiv.lanl.gov/content/index) with immunological data to investigate HIV evolution across 13 individuals. The data are stored in the `data/HIV/` directory.

We also use processed data from the [paper-MPL-inference repository](https://github.com/bartonlab/paper-MPL-inference), copied as follows (source $\to$ destination):
- epitopes.csv $\to$ epitopes.csv: information about epitopes
- src-MPL-HIV.tar.gz/\*-poly-seq2state.dat $\to$ input/sequence/\*-poly-seq2state.dat: processed sequences readable by MPL
- src-MPL-HIV.tar.gz/Zanini-extended $\to$ input/Zanini-extended: mutation matrix for HIV data
- processed/\*-index.csv $\to$ notrait/processed/\*-index.csv: information about all sites
- interim/\*-poly.csv $\to$ notrait/interim/\*-poly.csv: information about polymorphic sites; a modified copy is written to interim/\*-poly.csv when escape sites are found


## Contents
- ### [Libraries and variables](#lib)
- ### Data analysis
    - #### [Find escape sites](#escape)
    - #### [Write shell script](#mpl)
    - #### [Deal with data](#deal)
    - #### [Calculate Δs_ij](#sij)

## <a id='lib'></a> Libraries and variables

In [1]:
print('This notebook was prepared using:')

import os
import sys
print('python version %s' % sys.version)

import numpy as np
print('numpy version %s' % np.__version__)

import pandas as pd
print('pandas version %s' % pd.__version__)

import re
from math import isnan

import HIV

# GLOBAL VARIABLES

NUC = ['-', 'A', 'C', 'G', 'T']

ppt = ['700010040', '700010058', '700010077', '700010470', '700010607', '703010131', '703010159', 
       '703010256', '704010042', '705010162', '705010185', '705010198', '706010164']

tags = ['700010040-3', '700010040-5', '700010058-3', '700010058-5', '700010077-3', '700010077-5',
        '700010470-3', '700010470-5', '700010607-3', '700010607-5', '703010131-3', '703010131-5', 
        '703010159-3', '703010159-5', '703010256-3', '703010256-5', '704010042-3', '704010042-5', 
        '705010162-3', '705010162-5', '705010185-3', '705010185-5', '705010198-3', '705010198-5', 
        '706010164-3', '706010164-5']  

# GitHub directories
HIV_DIR = 'data/HIV'
MPL_DIR = 'src'
SIM_DIR = 'data/simulation'
FIG_DIR = 'figures'

This notebook was prepared using:
python version 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:26:08) [Clang 14.0.6 ]
numpy version 1.24.2
pandas version 1.5.3


## Data analysis

###  <a id='escape'></a>Find escape sites
An escape site is a polymorphic site that is nonsynonymous in the reading frame of its epitope.

Steps to find escape sites:
1. Find all polymorphic sites within one epitope.
2. Check whether these polymorphic sites are nonsynonymous in the reading frame of the epitope.
3. Count the sites that meet both conditions (here `min_n` = 2):
    - \> 2: sites $\to$ escape sites (or trait sites), epitope $\to$ escape group (or trait group);
    - <= 2: sites $\to$ special sites, epitope $\neq$ escape group (or trait group).

Special sites: sites with a nonsynonymous mutation in the reading frame of their epitope that are not escape sites, because their epitope contains fewer than 3 such sites.
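The steps above can be sketched in a few lines. This is only an illustration, not the code in `HIV.find_trait_site`: the helper names `is_nonsynonymous` and `classify_epitope` are hypothetical, and the comparison `len(sites) > min_n` is an assumption inferred from the output listed in this section, where two-site groups such as `[245 246]` are reported as special sites.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = 'TCAG'
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {''.join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def is_nonsynonymous(tf_codon, pos, nuc):
    """True if placing `nuc` at position `pos` (0-2) of `tf_codon` changes the amino acid."""
    mutant = tf_codon[:pos] + nuc + tf_codon[pos + 1:]
    # A codon containing a gap ('-') is absent from the table and maps to None.
    return CODON_TABLE.get(mutant) != CODON_TABLE.get(tf_codon)


def classify_epitope(sites, min_n=2):
    """Label the nonsynonymous polymorphic sites of one epitope.

    ASSUMPTION: an epitope is an escape group only if it has more than
    `min_n` such sites; otherwise its sites are 'special'.
    """
    if not sites:
        return 'none'
    return 'escape' if len(sites) > min_n else 'special'


# Step 2: GCT (Ala) -> GCC (Ala) is synonymous, GCT (Ala) -> GAT (Asp) is not.
syn_example = is_nonsynonymous('GCT', 2, 'C')
nonsyn_example = is_nonsynonymous('GCT', 1, 'A')

# Step 3: site lists taken from the 700010040-3 output in this section.
label_group = classify_epitope([18, 19, 20, 21])
label_special = classify_epitope([245, 246])
```

Keeping the codon check separate from the counting step makes it easy to change the threshold without touching the translation logic.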

In [2]:
import importlib
importlib.reload(HIV)
    
min_n = 2

for tag in tags:
    HIV.find_trait_site(tag,min_n)

++ 700010040-3 has 3 (+1) escape groups, they are [18 19 20 21],[257 258 259],[286 287 288 290 291], the special sites are [245 246]
== 700010040-5 has no escape group, the special sites are: [31 32],[39],[104]
-- 700010058-3 has 1 escape groups, they are [60 61 62 63]
== 700010058-5 has no escape group, the special sites are: [13],[16 17]
-- 700010077-3 has 6 escape groups, they are [28 31 32 34 35],[82 83 84 85],[87 88 89],[122 123 124 125],[149 150 151 152 153 154 156 157 158 159 160 161 162 163 164 165 166 167
 168 169 170 171],[186 187 188 189 190]
== 700010077-5 has no escape group, the special sites are: [1],[4 5],[47]
++ 700010470-3 has 3 (+2) escape groups, they are [0 1 2 3],[302 303 304 305],[352 353 355], the special sites are [40],[334]
++ 700010470-5 has 3 (+2) escape groups, they are [23 24 25],[78 79 81],[149 150 151 152], the special sites are [8 9],[124]
-- 700010607-3 has 1 escape groups, they are [198 199 200 201 202]
   700010607-5 has no triat site
++ 703010131-3 

tag|number of escape groups|escape sites|special sites
:----:|:----:|:---|:-----
700010040-3|3|[18,19,20,21]<br>[257,258,259]<br>[286,287,288,290,291]|[245,246]
700010040-5|0|                                                       |[31,32], [39], [104]
700010058-3|1|[60,61,62,63]                                          |
700010058-5|0|                                                       |[13], [16,17]
700010077-3|6|[28,31,32,34,35]<br>[82,83,84,85]<br>[87,88,89]<br>[122,123,124,125]<br>[149,150,151,152,153,154,156,157,158,159,160,<br>161,162,163,164,165,166,167,168,169,170,171]<br>[186,187,188,189,190]|
700010077-5|0|                                                       |[1], [4,5], [47]
700010470-3|3|[0,1,2,3]<br>[302,303,304,305]<br>[352,353,355]        |[40], [334]
700010470-5|3|[23,24,25]<br>[78,79,81]<br>[149,150,151,152]          |[8,9], [124]
700010607-3|1|[198,199,200,201,202]                                  |[]
700010607-5|0|no mutation related to epitope                         |
703010131-3|1|[618,619,620,621,624]                                  |[0], [45,46], [47], [76], [189], [560,561]
703010131-5|1|[14,15,16,19]                                          |[259]
703010159-3|2|[370,371,372,374,375,376]<br>[417,418,420,421,422,423,424]|
703010159-5|0|no mutation related to epitope                         |
703010256-3|4|[28,29,30,31,32,33]<br>[345,346,347,348,349]<br>[362,363,364,366]<br>[423,424,427]|[45,46], [316,317]
703010256-5|1|[169,170,171,172,174,175,176,177]                      |[42], [208]
704010042-3|2|[200,201,202,203,204,205,207,208,210]<br>[660,661,662,663,664,665]|[95,96], [765,766]
704010042-5|0|                                                       |[29,30], [125]
705010162-3|2|[6,7,8]<br>[30,31,33]<br>[38,39,40,42,43]              |[33,34], [327,328]
705010162-5|3|[6,7,8]<br>[30,31,33]<br>[38,39,40,42,43]              |[155]
705010185-3|0|no mutation related to epitope                         |
705010185-5|1|[2,3,4,5,6]                                            |
705010198-3|1|[194,195,196]                                          |
705010198-5|0|no mutation related to epitope                         |
706010164-3|3|[86,87,88,89]<br>[182,183,184,185,186,187,188]<br>[453,454,455,456,457,458]|[463]
706010164-5|0|                                                       |[44], [50,51], [170]

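Check that the TF sequences and trait distances are correct: each trait site's TF nucleotide must match the polymorphic-site table, and each trait group must have one fewer distance than it has sites.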

In [3]:
import importlib
importlib.reload(HIV)

TF_correct = True
for tag in tags:
    traitsite = HIV.read_file('traitsite/traitsite-'+tag+'.dat')
    TFseq     = HIV.read_file('traitseq/traitseq-'+tag+'.dat')
    traitdis  = HIV.read_file('traitdis/traitdis-'+tag+'.dat')
    df = pd.read_csv('%s/interim/%s-poly.csv' %(HIV_DIR,tag), comment='#', memory_map=True)
    for i in range(len(traitsite)):
        for j in range(len(traitsite[i])):
            n_ij   = df[df['polymorphic_index'] == traitsite[i][j]]
            TF     = n_ij.iloc[0].TF
            TF_nuc = NUC.index(TF)
            if TFseq[i][j] != TF_nuc:
                TF_correct = False
                print('there is something wrong with TF sequence for %s'%tag)
        
        if len(traitsite[i]) != len(traitdis[i])+1:
            TF_correct = False
            print('there is something wrong with the trait distance for %s'%tag)
                
if TF_correct:
    print('no problem')

no problem


### <a id='mpl'></a> Write shell script
"src/HIV.sh": runs the MPL code for all HIV data.

In [4]:
# -mcpu=apple-a14 targets Apple silicon; adjust this flag on other machines
with open("src/HIV.sh", 'w') as f:
    # compile command first, then one MPL run per tag
    f.write('g++ main.cpp inf.cpp io.cpp -O3 -mcpu=apple-a14 -lgsl -lgslcblas -o mpl\n')
    for tag in tags:
        f.write('./mpl -d ../data/HIV -i input/sequence/%s-poly-seq2state.dat '%tag)
        f.write('-o output/sc-%s.dat -g 10 -m input/Zanini-extended.dat -rr 1.4e-5 '%tag)
        f.write('-e input/traitsite/traitsite-%s.dat -es input/traitseq/traitseq-%s.dat '%(tag,tag))
        f.write('-ed input/traitdis/traitdis-%s.dat\n'%tag)

"src/HIV_notrait.sh": runs the previous MPL code (no trait groups) for all HIV data.\
In this case, we use the same regularization value (here gamma = g/N).\
There is no recombination term, since $R_{fl} = 0$ for individual selection.
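As a quick check of the regularization bookkeeping, the snippet below compares the two settings. It assumes, as an interpretation of the note above rather than something confirmed by the MPL source, that the trait version uses the value passed to `-g` directly as gamma, while the no-trait version uses gamma = g/N.

```python
# Flags written to src/HIV_notrait.sh: -g 1e4 -N 1e3
g_notrait, N_notrait = 1e4, 1e3
gamma_notrait = g_notrait / N_notrait   # gamma = g/N

# Flag written to src/HIV.sh: -g 10 (ASSUMED to be used as gamma directly)
gamma_trait = 10.0

same_regularization = (gamma_notrait == gamma_trait)
print('gamma (no trait) = %g, gamma (trait) = %g' % (gamma_notrait, gamma_trait))
```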

In [5]:
with open("src/HIV_notrait.sh", 'w') as f:
    # compile the no-trait version, then one MPL run per tag
    f.write('g++ notrait/main.cpp notrait/inf.cpp notrait/io.cpp -O3 -mcpu=apple-a14 -lgsl -lgslcblas -o mpl\n')
    for tag in tags:
        f.write('./mpl -d ../data/HIV -i input/sequence/%s-poly-seq2state.dat '%tag)
        f.write('-o notrait/output/sc-%s.dat -g 1e4 -N 1e3 -m input/Zanini-extended.dat\n'%tag)

"src/HIV_noR.sh": runs the new MPL code for all HIV data without the recombination term.
We set the recombination rate `-rr` to 0 to remove the recombination flux $R_{fl}$.

In [6]:
with open("src/HIV_noR.sh", 'w') as f:
    # same as src/HIV.sh, but with -rr 0 and output written to noR/output
    f.write('g++ main.cpp inf.cpp io.cpp -O3 -mcpu=apple-a14 -lgsl -lgslcblas -o mpl\n')
    for tag in tags:
        f.write('./mpl -d ../data/HIV -i input/sequence/%s-poly-seq2state.dat '%tag)
        f.write('-o noR/output/sc-%s.dat -g 10 -m input/Zanini-extended.dat -rr 0 '%tag)
        f.write('-e input/traitsite/traitsite-%s.dat -es input/traitseq/traitseq-%s.dat '%(tag,tag))
        f.write('-ed input/traitdis/traitdis-%s.dat\n'%tag)

### <a id='deal'></a> Deal with data
Process the MPL results (normalizing the selection coefficients so that the TF nucleotide at each site has a selection coefficient of zero) and record the information in two CSV files:
- /analysis/'tag'-analyze.csv 
    - coefficients in 2 cases (old: without escape part; new: with escape part)
    - allele frequencies
- /group/escape_group-'tag'.csv 
    - escape coefficients
    - escape frequencies
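The normalization mentioned above can be sketched as follows. This is a minimal illustration, not the code in `HIV.analyze_result`: the function name `normalize_to_tf` and the `(L, 5)` array layout (one coefficient per site and per nucleotide in `NUC`) are assumptions made for the example, and the numbers are made up.

```python
import numpy as np

NUC = ['-', 'A', 'C', 'G', 'T']


def normalize_to_tf(s, tf_seq):
    """Shift coefficients so the TF nucleotide at every site has coefficient zero.

    s      : array of shape (L, 5), one coefficient per site and nucleotide in NUC
    tf_seq : length-L string of TF nucleotides
    """
    s = np.asarray(s, dtype=float)
    tf_idx = np.array([NUC.index(n) for n in tf_seq])
    s_tf = s[np.arange(len(tf_idx)), tf_idx]   # TF coefficient at each site
    return s - s_tf[:, None]                   # subtract it from every allele at that site


# Illustrative values only: two sites with TF nucleotides 'A' and '-'.
s_raw = [[0.0, 0.1, 0.2, 0.3, 0.4],
         [0.5, 0.0, 0.0, 0.0, 0.0]]
s_norm = normalize_to_tf(s_raw, 'A-')
```

Subtracting a per-site constant leaves the differences between alleles at a site unchanged, so only the reference point moves.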

In [8]:
import importlib
importlib.reload(HIV)

for tag in tags:
    HIV.analyze_result(tag,True)
    HIV.analyze_result(tag,False)   # without recombination flux

print('Successfully create the analysis and escape group file')

Successfully create the analysis and escape group file


### <a id='sij'></a> Calculate Δs_ij
To quantify the linkage between different variants ($\Delta \hat s_{ij}$), we modify the input sequences by reverting mutant variant $i$ to the wild type and infer the selection coefficients again.

$\Delta \hat s_{ij} = \hat s_j - \hat s_j^{\setminus i}$
- $\hat s_j$: selection coefficient for variant $j$
- $\hat s_j^{\setminus i}$: selection coefficient for variant $j$ inferred after replacing variant $i$ with the wild type (WT)

The modification differs between individual loci and escape groups:
- individual loci
   - replace every occurrence of mutant variant $i$ with its transmitted/founder (TF) nucleotide, in all sequences at all time points.
- escape groups
   - revert all mutant variants within one epitope to the wild type.
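The two modifications can be sketched on a small integer-encoded alignment. This is an illustration under stated assumptions, not the code in `HIV.modify_seq`: the helpers `revert_variant` and `revert_epitope` are hypothetical, sequences are assumed to be rows of nucleotide indices, and the coefficient values are made up.

```python
import numpy as np


def revert_variant(seqs, site, mutant, tf):
    """Copy of `seqs` with `mutant` at `site` replaced by the TF state (individual locus)."""
    out = np.array(seqs, copy=True)
    out[out[:, site] == mutant, site] = tf
    return out


def revert_epitope(seqs, sites, tf_states):
    """Copy of `seqs` with every site of one epitope set to its TF state (escape group)."""
    out = np.array(seqs, copy=True)
    out[:, sites] = tf_states   # TF states are broadcast across all sequences
    return out


# Rows are sequences (all time points pooled), columns are sites,
# entries are nucleotide indices (0='-', 1='A', 2='C', 3='G', 4='T').
seqs = np.array([[1, 2, 3],
                 [1, 4, 3],
                 [1, 4, 2]])

no_variant = revert_variant(seqs, site=1, mutant=4, tf=2)
no_epitope = revert_epitope(seqs, sites=[1, 2], tf_states=[2, 3])

# Delta s_ij = s_j - s_j^{\i}; the two values below are illustrative only.
s_j, s_j_without_i = 0.03, 0.01
delta_s_ij = s_j - s_j_without_i
```

In practice the modified sequences are passed back through the inference to obtain $\hat s_j^{\setminus i}$; the last three lines only show how the difference is formed once both coefficients are available.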

In [9]:
import importlib
importlib.reload(HIV)

# create the directories
for tag in tags:
    # input sequences
    os.makedirs('data/HIV/input/sequence/' + tag, exist_ok=True)
    # output inferred coefficients
    os.makedirs('data/HIV/output/' + tag, exist_ok=True)

# start the shell script with the compile command
with open("src/HIV_sij.sh", 'w') as g:
    g.write('g++ main.cpp inf.cpp io.cpp -O3 -mcpu=apple-a14 -lgsl -lgslcblas -o mpl\n')

for tag in tags:
    HIV.modify_seq(tag) # Add content to shell script here

Process the inferred coefficients to calculate $\Delta \hat s_{ij}$.

In [10]:
import importlib
importlib.reload(HIV)

for tag in tags:
    HIV.cal_sij(tag)