## HIV Analysis
Here we combine HIV sequence data from [the Los Alamos National Laboratory HIV Sequence Database](https://www.hiv.lanl.gov/content/index) and immunological data to investigate HIV evolution across 14 individuals. These data are stored in the `data/HIV/` directory. Supplementary scripts for data processing are stored in `HIV.py`.

We downloaded sequence data for all patients and aligned them with [the HXB2 reference sequence](https://www.hiv.lanl.gov/components/sequence/HIV/asearch/query_one.comp?se_id=K03455) and the corresponding [clade consensus sequence](https://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html#consensus) using the LANL tool [HIValign](https://www.hiv.lanl.gov/content/sequence/VIRALIGN/viralign.html). We recorded immunological data for these patients in .csv files for analysis.

## Contents
- ### [Libraries and variables](#lib)
- ### Data analysis
    - [Find escape sites](#poly)
    - [MPL](#mpl)
    - [Calculate Δs_ij](#sij)

### <a id='lib'></a> Libraries and variables

In [1]:
print('This notebook was prepared using:')

import os
import sys
print('python version %s' % sys.version)

import numpy as np
print('numpy version %s' % np.__version__)

import pandas as pd
print('pandas version %s' % pd.__version__)

import re
from math import isnan

import HIV

# GLOBAL VARIABLES

NUC = ['-', 'A', 'C', 'G', 'T']

ppt = ['700010470', '700010077', '700010058', '700010040', '700010607',
       '706010164', '705010198', '705010185', '705010162', '704010042', 
       '703010256', '703010159', '703010131']


# GitHub directories
HIV_DIR = 'data/HIV'
MPL_DIR = 'src'
SIM_DIR = 'data/simulation'
FIG_DIR = 'figures'

This notebook was prepared using:
python version 3.9.13 (main, Aug 25 2022, 18:24:45) 
[Clang 12.0.0 ]
numpy version 1.21.5
pandas version 1.4.4


###  <a id='poly'></a>Find escape sites
Escape sites are polymorphic sites that are nonsynonymous in the reading frame of their epitope.

Steps to find escape sites:
1. Check whether the site lies within an epitope.
2. Find the reading frame of that epitope and check whether the site is nonsynonymous in that reading frame.
3. Count the escape sites in each epitope; if there are fewer than 3 sites, discard the epitope.
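
The three steps above can be sketched as follows. This is a minimal illustration, not the code in `HIV.find_site`: the function names `is_nonsynonymous` and `escape_sites`, the half-open epitope interval, the `(position, nucleotide)` variant format, and the handling of gaps are all assumptions made for the example. The minimum number of sites is left as a parameter.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering
BASES = 'TCAG'
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {''.join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def is_nonsynonymous(seq, pos, alt, frame_start):
    """Return True if placing nucleotide `alt` at index `pos` of `seq`
    changes the amino acid, reading codons from `frame_start`."""
    offset = (pos - frame_start) % 3
    codon_start = pos - offset
    codon = seq[codon_start:codon_start + 3]
    mutant = codon[:offset] + alt + codon[offset + 1:]
    # Assumption: incomplete codons or codons containing gaps count as changes
    if len(codon) < 3 or '-' in codon or '-' in mutant:
        return True
    return CODON_TABLE[codon] != CODON_TABLE[mutant]


def escape_sites(seq, epitope, variants, min_sites=3):
    """Return the sorted escape sites in `epitope`, or [] if there are too few.

    epitope  : (start, end) half-open interval, read in frame from `start`
    variants : iterable of (position, alternative nucleotide) pairs
    """
    start, end = epitope
    sites = sorted({pos for pos, alt in variants
                    if start <= pos < end                          # step 1
                    and is_nonsynonymous(seq, pos, alt, start)})   # step 2
    return sites if len(sites) >= min_sites else []                # step 3


# Toy example: ATG GCT AAA encodes M A K
tf = 'ATGGCTAAA'
variants = [(1, 'C'), (3, 'A'), (5, 'C'), (6, 'G'), (8, 'G')]
print(escape_sites(tf, (0, 9), variants))  # [1, 3, 6]
```

In the toy example, the variants at positions 5 and 8 are synonymous (GCT to GCC and AAA to AAG), so only three sites remain and the epitope is kept.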


In [3]:
from importlib import reload

reload(HIV)

min_n = 2

for i in range(len(ppt)):
    for n in range(2):
        tag = ppt[i] + '-' + str(3+2*n)
        HIV.find_site(tag,min_n)

700010470-3 has 3 escape groups, they are [0 1 2 3] [302 303 304 305] [352 353 355]
700010470-5 has 3 escape groups, they are [23 24 25] [78 79 81] [149 150 151 152]
700010077-3 has 6 escape groups, they are [28 31 32 34 35] [82 83 84 85] [87 88 89] [122 123 124 125] [149 150 151 152 153 154 156 157 158 159 160 161 162 163 164 165 166 167
 168 169 170 171] [186 187 188 189 190]
700010077-5 has no escape site
700010058-3 has 1 escape groups, they are [60 61 62 63]
700010058-5 has no escape site
700010040-3 has 3 escape groups, they are [18 19 20 21] [257 258 259] [286 287 288 290 291]
700010040-5 has no escape site
700010607-3 has 1 escape groups, they are [198 199 200 201 202]
700010607-5 has no escape site
706010164-3 has 3 escape groups, they are [86 87 88 89] [182 183 184 185 186 187 188] [453 454 455 456 457 458]
706010164-5 has no escape site
705010198-3 has 1 escape groups, they are [194 195 196]
705010198-5 has no escape site
705010185-3 has no escape site
705010185-5 has 1 esca

Check that the stored transmitted/founder (TF) sequence matches the TF nucleotide recorded for each polymorphic site.

In [4]:
from importlib import reload

reload(HIV)

all_ok = True  # set to False if any mismatch is found

for k in range(len(ppt)):
    for n in range(2):
        tag = ppt[k] + '-' + str(3+2*n)
        polysite = HIV.read_file('polysite/'+tag+'-polysite.dat')
        TFseq    = HIV.read_file('polyseq/polysequence-'+tag+'.dat')
        df = pd.read_csv('%s/interim/%s-poly.csv' %(HIV_DIR,tag), comment='#', memory_map=True)
        for i in range(len(polysite)):
            for j in range(len(polysite[i])):
                # TF nucleotide recorded for this polymorphic site
                n_poly = df[df['polymorphic_index'] == polysite[i][j]]
                TF     = n_poly.iloc[0].TF
                TF_in  = NUC.index(TF)
                if TFseq[i][j] != TF_in:
                    all_ok = False
                    print('there is something wrong with TF sequence for %s'%tag)

# only report success if no mismatch was found
if all_ok:
    print('no problem')

no problem


### <a id='mpl'></a> MPL

In [4]:
# write a shell script that compiles MPL and runs it once per tag
with open("%s/run.sh"%MPL_DIR,'w') as f:
    # note: -mcpu=apple-a14 targets Apple silicon; adjust this flag on other machines
    f.write('g++ main.cpp inf.cpp io.cpp -O3 -mcpu=apple-a14 -lgsl -lgslcblas -o mpl\n')

    for k in range(len(ppt)):
        for n in range(2):
            tag = ppt[k] + '-' + str(3+2*n)
            f.write('./mpl -d ../../data/HIV -i input/sequence/%s-poly-seq2state.dat \
                -o output/sc-%s.dat -g 1e4 -N 1e3 -m input/Zanini-extended.dat \
                -p input/polysite/%s-polysite.dat -ps input/polyseq/polysequence-%s.dat\n'%(tag,tag,tag,tag))

Process the MPL results and record the information in the following two CSV files:
- /analysis/'tag'-analyze.csv 
    - coefficients in 2 cases (old: without escape part and new: with escape part)
    - allele frequencies
- /poly/'tag'.csv 
    - escape coefficients
    - escape frequencies

In [12]:
from importlib import reload

reload(HIV)

for i in range(len(ppt)):
    for n in range(2):
        tag = ppt[i] + '-' + str(3+2*n)
        HIV.analyze_result(tag)
        print('%s'%tag)

700010470-3
700010470-5
700010077-3
700010077-5
700010058-3
700010058-5
700010040-3
700010040-5
700010607-3
700010607-5
706010164-3
706010164-5
705010198-3
705010198-5
705010185-3
705010185-5
705010162-3
705010162-5
704010042-3
704010042-5
703010256-3
703010256-5
703010159-3
703010159-5
703010131-3
703010131-5


### <a id='sij'></a> Calculate Δs_ij
$\Delta \hat s_{ij} = \hat s_j - \hat s_j^{\setminus i}$
- $\hat s_j$: selection coefficient for variant $j$
- $\hat s_j^{\setminus i}$: selection coefficient for variant $j$, inferred after replacing variant $i$ with the wild type (WT)
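
As a small numerical illustration of the definition above, the sketch below computes all $\Delta \hat s_{ij}$ at once. It is not the code in `HIV.cal_sij`: the function name `delta_s`, the array layout (row $i$ of `s_without` holds the coefficients inferred with variant $i$ reverted), and the toy values are assumptions made for the example.

```python
import numpy as np


def delta_s(s_full, s_without):
    """Return the matrix D with D[i, j] = s_j - s_j(without i).

    s_full    : shape (L,), coefficients inferred from the original data
    s_without : shape (L, L), row i holds the coefficients inferred after
                variant i has been replaced by the wild type
    """
    s_full = np.asarray(s_full, dtype=float)
    s_without = np.asarray(s_without, dtype=float)
    # Broadcasting subtracts each row of s_without from the full coefficients
    return s_full[np.newaxis, :] - s_without


# Toy values for three variants (hypothetical numbers, not inferred from data)
s_full = [0.02, -0.01, 0.05]
s_without = [[0.00, -0.01, 0.03],
             [0.02,  0.00, 0.05],
             [0.01, -0.01, 0.00]]

ds = delta_s(s_full, s_without)
print(ds[0, 2])  # effect of variant 0 on the coefficient of variant 2
```

The diagonal entries depend on how the coefficient of a reverted variant is stored; here they are placeholders and should not be interpreted.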

To compute $\hat s_j^{\setminus i}$, modify the input sequences by converting mutant variant $i$ to the wild type.
- Individual loci
   - Replace every mutant variant $i$ with its transmitted/founder (TF) nucleotide in all sequences at all time points.
- Escape part
   - Revert all mutant variants within one epitope to the wild type.
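
The two kinds of modification can be sketched with NumPy as below. This is only an illustration under stated assumptions, not the code in `HIV.modify_seq`: it assumes the sequences are held as an integer array of shape (number of sequences, number of sites) with entries indexing `NUC`, and the function names `revert_site` and `revert_epitope` are hypothetical.

```python
import numpy as np


def revert_site(seqs, tf, site):
    """Return a copy of `seqs` with column `site` set to the TF nucleotide."""
    out = np.array(seqs, copy=True)
    out[:, site] = tf[site]
    return out


def revert_epitope(seqs, tf, sites):
    """Return a copy of `seqs` with every site in `sites` set to its TF nucleotide."""
    out = np.array(seqs, copy=True)
    sites = list(sites)
    out[:, sites] = np.asarray(tf)[sites]
    return out


# Toy data: 3 sequences, 4 sites, nucleotides encoded as indices into NUC
tf = np.array([1, 2, 3, 4])
seqs = np.array([[1, 2, 3, 4],
                 [1, 3, 3, 4],
                 [2, 3, 4, 4]])

single = revert_site(seqs, tf, 1)          # individual locus: site 1 reverted
group = revert_epitope(seqs, tf, [0, 2])   # escape part: sites 0 and 2 reverted
print(single)
print(group)
```

Both functions return copies, so the original sequences remain available for the unmodified inference.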

In [72]:
from importlib import reload

reload(HIV)

# create the input and output directories for each tag
for i in range(len(ppt)):
    for n in range(2):
        tag = ppt[i] + '-' + str(3+2*n)
        # input sequence
        os.makedirs('data/HIV/input/sequence/'+ tag, exist_ok=True)
        # output inferred coefficients
        os.makedirs('data/HIV/output/'+ tag, exist_ok=True)
            
g = open('%s/sij.sh'%(MPL_DIR), "w")
g.write('g++ main.cpp inf.cpp io.cpp -O3 -mcpu=apple-a14 -lgsl -lgslcblas -o mpl\n')
g.close()

for i in range(len(ppt)):
    for n in range(2):
        tag = ppt[i] + '-' + str(3+2*n)
        HIV.modify_seq(tag)

Process the resulting data to calculate $\Delta \hat s_{ij}$ for each tag.

In [77]:
from importlib import reload

reload(HIV)

ppt = ['700010470', '700010077', '700010058', '700010040', '700010607',
       '706010164', '705010198', '705010185', '705010162', '704010042', 
       '703010256', '703010159', '703010131']

for k in range(len(ppt)):
    for n in range(2):
        tag = ppt[k] + '-' + str(3+2*n)
        HIV.cal_sij(tag)
