# Age-association analyses of evolutionarily closest HA sequence pairs

## Overview 
In this notebook, we demonstrate the analyses performed in our manuscript entitled "Individual immune selection pressure has limited impact on seasonal influenza virus evolution". Here, we show how we analysed the hemagglutinin (HA) sequences of A/H3N2 viruses collected between 2009 and July 2017.

## Pre-analysis tasks 
### Sequence curation
We collected all sequences from the [GISAID](http://www.gisaid.org/) and [NCBI GenBank](http://www.ncbi.nlm.nih.gov/genbank/) databases that satisfied the following filters: 
  1. \>90% of full HA nucleotide length 
  2. availability of patients' age data 
  3. availability of virus passage history that could be categorized as original clinical material, MDCK, SIAT and egg passage types (Typing of passage histories was based on [Chan et al., 2016](https://www.ncbi.nlm.nih.gov/pubmed/27604224)). 

In total, 10,514 HA nucleotide sequences were obtained for A/H3N2.

We further processed the sequences obtained by: 
  1. Removing duplicated and low-quality sequences (\>10% of residues missing or ambiguous). 
  2. Prioritizing, among sequences of the same virus with different passage histories, by passage type (clinical > MDCK > SIAT > egg) and/or lower passage number. **At this point, there are 8,530 sequences for A/H3N2** (i.e. H3N2-HA-nuc.fa in the [Files](https://github.com/alvinxhan/ageflu/tree/master/files) folder). 
  3. Clustering identical nucleotide sequences using CD-HIT and keeping one representative strain per cluster to yield a final set of non-redundant sequences. You can find the CD-HIT sequence cluster output file (e.g. H3N2-HA-nuc_cd-hit-non-redundant.clstr) in the [Files](https://github.com/alvinxhan/ageflu/tree/master/files) folder. 
  
In total, there were 6,033 non-redundant sequences for A/H3N2. 
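
The quality filter and the passage prioritisation described above could be sketched as follows. This is a minimal illustration, not the curation pipeline used for the manuscript: the function names, the record layout and the set of characters treated as unambiguous are all assumptions made for the example.

```python
# Hypothetical helpers illustrating the quality filter and passage prioritisation.
UNAMBIGUOUS = set('acgt')  # assumption: any other character counts as missing/ambiguous
PASSAGE_RANK = {'clinical': 0, 'MDCK': 1, 'SIAT': 2, 'egg': 3}  # lower rank = preferred

def fraction_ambiguous(seq):
    """Fraction of residues that are not a, c, g or t."""
    seq = seq.lower()
    if not seq:
        return 1.0
    return sum(1 for base in seq if base not in UNAMBIGUOUS) / float(len(seq))

def passes_quality(seq, max_fraction=0.10):
    """Keep a sequence only if at most 10% of its residues are missing or ambiguous."""
    return fraction_ambiguous(seq) <= max_fraction

def pick_preferred(records):
    """Choose one record per virus: best passage type first, then lowest passage number.

    `records` is a list of (passage_type, passage_number, sequence) tuples.
    """
    return min(records, key=lambda rec: (PASSAGE_RANK[rec[0]], rec[1]))

print(passes_quality('acgtacgtan'))  # 1 of 10 residues is ambiguous
print(passes_quality('acgtnnnn--'))  # 6 of 10 residues are ambiguous or missing
print(pick_preferred([('egg', 1, 'acgt'), ('MDCK', 2, 'acgt'), ('MDCK', 1, 'acgt')]))
```

Sorting on a `(passage rank, passage number)` tuple keeps the two criteria in the stated order of precedence.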

### Maximum-likelihood tree reconstruction
Prior to the analyses presented here, we first reconstructed a maximum-likelihood (ML) phylogenetic tree using the HA nucleotide sequences. Due to the large number of sequences (n=6,033 for A/H3N2) and the relatively low observed genetic divergence (overall mean nucleotide p-distance of A/H3N2 calculated by MEGA = 0.00964), conventional phylogenetic methods would be computationally expensive and practically infeasible. As such, we reconstructed our ML trees with RAxML and GARLI, using a nested inference approach that we have developed. More information on this phylogenetic inference methodology can be found in our paper. 
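
For reference, the p-distance between two aligned sequences is the proportion of compared sites at which they differ, and the overall mean is the average over all sequence pairs. The sketch below is a plain-Python illustration of that definition. It is not how MEGA computes the value, and skipping any site where either sequence has a gap or `n` (pairwise deletion) is an assumption of this example.

```python
import itertools

def p_distance(seq_a, seq_b, skip='n-'):
    """Proportion of differing sites among sites that are resolved in both sequences."""
    compared = differing = 0
    for a, b in zip(seq_a.lower(), seq_b.lower()):
        if a in skip or b in skip:
            continue  # pairwise deletion of unresolved sites
        compared += 1
        differing += (a != b)
    return differing / float(compared) if compared else float('nan')

def mean_p_distance(seqs):
    """Average p-distance over all unordered pairs of sequences."""
    dists = [p_distance(a, b) for a, b in itertools.combinations(seqs, 2)]
    return sum(dists) / len(dists)

toy_alignment = ['acgtacgt', 'acgtacga', 'acgaacg-']  # made-up sequences
print(mean_p_distance(toy_alignment))
```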

## Getting evolutionarily closest pairs
We can now extract evolutionarily closely related virus pairs from the inferred phylogenetic tree (rooted, in NEWICK format). You can do so using [ageflu_getpairs.py](https://github.com/alvinxhan/ageflu/blob/master/scripts/ageflu_getpairs.py). 

* Additional inputs include: (i) FASTA alignment of all sequences (including identical nucleotide sequences), (ii) CD-HIT cluster file of non-redundant sequence clusters.
* The age limits of children and adults can be changed using the '--child' and '--adult' arguments.
* Note that the FASTA alignment containing reference sequences for HA-numbering ('H1pdm09_H3_FluB_NumberingRef.fa') must be in the same folder as the ageflu_getpairs.py script.
* Tab-delimited output: ageflu_evol-closest-pairs.*.txt
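
The effect of the child and adult age limits can be pictured with a small helper. This is a sketch, not code from the script: `age_category` is a hypothetical name, the defaults mirror the ranges used later in this notebook, and both bounds of each range are assumed to be inclusive.

```python
def age_category(age, child=(0, 5), adult=(35, 120)):
    """Return 'C' (child), 'A' (adult), 'U' (in between) or None (outside both limits)."""
    if child[0] <= age <= child[1]:
        return 'C'
    if adult[0] <= age <= adult[1]:
        return 'A'
    if child[1] < age < adult[0]:
        return 'U'
    return None

for age in (3, 18, 60, 130):
    print('{} -> {}'.format(age, age_category(age)))
```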

Here, we show the breakdown of the script:

### Importing modules and defining functions

In [28]:
from os.path import expanduser
from decimal import *
import re, sys, itertools
import ete3
import random

random.seed(666)

# parse fasta file
def parsefasta(file, check_alg=1, check_dna=1, check_codon=1):
    # check if file is of FASTA format
    fhandle = filter(None, open(file, 'rU').readlines())
    if not re.search('^>', fhandle[0]): sys.exit('\nERROR: Incorrect sequence file format.\n')

    result = {}
    for key, group in itertools.groupby(fhandle, lambda _: re.search('^>',_)):
        if key:
            header = group.next().strip().replace('>','')
        else:
            sequence = ''.join(map(lambda _:_.strip(),list(group)))
            # dna sequences
            if check_dna == 1 and set(list(sequence))&set(list('rhkdesqpvilmfyw')):
                sys.exit('\nERROR: Input must be DNA FASTA.\n')
            # codon-alignment
            if check_codon == 1 and len(sequence)%3 > 0:
                sys.exit('\nERROR: DNA sequence file given must be codon-aligned.\n')
            result[header] = sequence
    # alignment
    if check_alg == 1 and len(set(map(len, result.values()))) != 1:
        sys.exit('\nERROR: Input sequence file must be an alignment.\n')
    return result

# translate codon-aligned nucleotide sequences
def translatedDNA(dna):
    protein = []
    for _ in xrange(0,len(dna),3):
        codon = dna[_:_+3]
        if codon in dnacodontable:
            if dnacodontable[codon] == 'stop': break
            protein.append(dnacodontable[codon])
        elif re.match('---',codon): protein.append('-')
        else: protein.append('X')
    return ''.join(protein)

# get pairwise substitutions
def get_pairwise_substitutions(anc_seq, desc_seq):
    # nucleotide sequences input
    if set(list(anc_seq)) <= set(list('atgcn-')):
        transitions, transversions, nonsyn_sub, syn_sub = 0, 0, 0, 0
        for _ in xrange(0, len(anc_seq), 3):
            anc_codon = anc_seq[_:_+3]
            desc_codon = desc_seq[_:_+3]
            if anc_codon == desc_codon:
                continue
            else:
                unknown_res_binary = 0
                nuc_sub = 0
                for j in xrange(3):
                    if re.match('(n|-)', anc_codon[j]) or re.match('(n|-)', desc_codon[j]):
                        unknown_res_binary = 1
                        continue
                    elif anc_codon[j] != desc_codon[j]:
                        if re.search('(ag|ga|ct|tc)', ''.join([anc_codon[j], desc_codon[j]])):
                            transitions += 1
                            nuc_sub += 1
                        else:
                            transversions += 1
                            nuc_sub += 1

                if unknown_res_binary == 1:
                    continue
                if dnacodontable[anc_codon] != dnacodontable[desc_codon]:
                    nonsyn_sub += nuc_sub
                else:
                    syn_sub += nuc_sub
        return {'transitions':transitions, 'transversions':transversions, 'nonsyn':nonsyn_sub, 'syn':syn_sub}
    # protein sequences input
    else:
        mutation_list = []
        for _ in xrange(len(anc_seq)):
            if re.search('(-|X)', anc_seq[_]) or re.search('(-|X)', desc_seq[_]):
                continue
            elif anc_seq[_] != desc_seq[_]:
                mutation_list.append('{}{}{}'.format(anc_seq[_], _+1, desc_seq[_]))
        return mutation_list

def get_least_unknown_residue_sequences(seq_dict, dist_dict_to_edit={}):
    seqid_to_unknownrescount = {seqid:len(re.findall('(-|n|X)', sequence, re.I)) for seqid, sequence in seq_dict.items()}
    seqid_with_least_unknown_res = [seqid for seqid, count in seqid_to_unknownrescount.items() if count == min(seqid_to_unknownrescount.values())]
    if len(dist_dict_to_edit) > 0:
        for seqid in list(set(seq_dict.keys())-set(seqid_with_least_unknown_res)):
            del dist_dict_to_edit[seqid]
        return seqid_with_least_unknown_res, dist_dict_to_edit
    else:
        return seqid_with_least_unknown_res
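
To make the codon-wise counting in `get_pairwise_substitutions` concrete, here is a compact, self-contained restatement of its nucleotide branch, written so that it also runs under Python 3. The helper name and the way the codon table is built are choices made for this sketch. As in the function above, every nucleotide change within a codon is counted as nonsynonymous if the encoded amino acid differs, and codons containing `n` or `-` are left out of the syn/nonsyn totals.

```python
import itertools

BASES = 'tcag'
# Standard genetic code in t, c, a, g order; '*' marks stop codons.
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO_ACIDS)}
TRANSITIONS = {('a', 'g'), ('g', 'a'), ('c', 't'), ('t', 'c')}

def count_substitutions(anc_seq, desc_seq):
    """Count transitions, transversions, nonsynonymous and synonymous changes."""
    counts = {'transitions': 0, 'transversions': 0, 'nonsyn': 0, 'syn': 0}
    for i in range(0, len(anc_seq), 3):
        anc_codon, desc_codon = anc_seq[i:i + 3], desc_seq[i:i + 3]
        if anc_codon == desc_codon:
            continue
        has_unknown = False
        nuc_sub = 0
        for a, d in zip(anc_codon, desc_codon):
            if a in 'n-' or d in 'n-':
                has_unknown = True
            elif a != d:
                kind = 'transitions' if (a, d) in TRANSITIONS else 'transversions'
                counts[kind] += 1
                nuc_sub += 1
        if has_unknown:
            continue  # codon cannot be translated, so skip the syn/nonsyn call
        if CODON_TABLE[anc_codon] != CODON_TABLE[desc_codon]:
            counts['nonsyn'] += nuc_sub
        else:
            counts['syn'] += nuc_sub
    return counts

# atg aaa gct (M K A) -> ctg aga gcc (L R A)
print(count_substitutions('atgaaagct', 'ctgagagcc'))
```

In this toy pair there is one transversion (a to c) and two transitions. Two changes are nonsynonymous (M to L, K to R) and one is synonymous (gct to gcc).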


### Define inputs 
In the [ageflu_getpairs.py](https://github.com/alvinxhan/ageflu/blob/master/scripts/ageflu_getpairs.py) script, these inputs are supplied as command-line arguments; here they are set directly in a `params` class. 

In [29]:
class params:
    tree = './files/H3N2-HA-nuc.rooted.nwk' # Phylogenetic tree file in NEWICK format 
    aln = './files/H3N2-HA-nuc.fa' # Nucleotide alignment of HA sequences (pre-CD-HIT)
    nr = './files/H3N2-HA-nuc_cd-hit-non-redundant.clstr' # CD-HIT identical sequence cluster file 
    child = [0, 5] # age range of children 
    adult = [35, 120] # age range of adults 
    maxmut = 5 # maximum number of amino acid substitutions allowable in an evolutionarily closest pair 
    patdist = 0.007 # maximum patristic distance between sequences in a closest pair
    order = 3 # max ordinal of pairs 
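
`patdist` caps the patristic distance, i.e. the sum of branch lengths along the path that connects two tips of the tree. The toy example below illustrates the definition on a hand-written four-tip tree stored as a child-to-parent dictionary. The tree, its branch lengths and the helper names are invented for illustration; the notebook itself parses the real tree with ete3.

```python
# child -> (parent, branch length to parent); 'root' has no entry
TOY_TREE = {
    'n1': ('root', 0.001),
    'n2': ('root', 0.004),
    'A': ('n1', 0.002),
    'B': ('n1', 0.003),
    'C': ('n2', 0.001),
    'D': ('n2', 0.005),
}

def path_to_root(node, tree):
    """Map `node` and each of its ancestors to their distance from `node`."""
    dist, path = 0.0, {node: 0.0}
    while node in tree:
        parent, length = tree[node]
        dist += length
        path[parent] = dist
        node = parent
    return path

def patristic_distance(tip_a, tip_b, tree):
    """Sum of branch lengths on the path between two tips."""
    path_a, path_b = path_to_root(tip_a, tree), path_to_root(tip_b, tree)
    # The most recent common ancestor minimises the summed distance to both tips.
    return min(path_a[node] + path_b[node] for node in path_a if node in path_b)

max_patdist = 0.007  # same value as params.patdist
for tip_a, tip_b in [('A', 'B'), ('A', 'C')]:
    d = patristic_distance(tip_a, tip_b, TOY_TREE)
    print('{}-{}: {:.3f} within limit: {}'.format(tip_a, tip_b, d, d <= max_patdist))
```

Here A and B are 0.005 apart and fall within the limit, whereas A and C are 0.008 apart and do not.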

### Parse FASTA file and sort CD-HIT identical sequence clusters into age categories (child or adult) based on input children and adults age ranges

In [30]:
# query subtype analysed 
try:
    query_subtype = re.search('(H3N2|pH1N1|H1N1pdm09|BVic|BYam)', params.tree).group().upper()
    if query_subtype == 'PH1N1':
        query_subtype = 'H1N1PDM09'
except:
    query_subtype = raw_input('\nWARNING: Can\'t parse query subtype from tree name. Enter subtype (H3N2/pH1N1/BVic/BYam): ')

# output filename
outfname = '{}C{}_{}A{}_PD{}_MM{}_CL{}_{}'.format(params.child[0], params.child[-1], params.adult[0], params.adult[-1], params.patdist, params.maxmut, params.order, re.sub('[^/]*/', '', params.tree))

# parse fasta alignment 
dnacodontable = {'ttt':'F', 'ttc':'F', 'tta':'L', 'ttg':'L', 'ctt':'L', 'ctc':'L', 'cta':'L', 'ctg':'L', 'att':'I', 'atc':'I', 'ata':'I', 'atg':'M', 'gtt':'V', 'gtc':'V', 'gta':'V', 'gtg':'V', 'tct':'S', 'tcc':'S', 'tca':'S', 'tcg':'S', 'cct':'P', 'ccc':'P', 'cca':'P', 'ccg':'P', 'act':'T', 'acc':'T', 'aca':'T', 'acg':'T', 'gct':'A', 'gcc':'A', 'gca':'A', 'gcg':'A', 'tat':'Y', 'tac':'Y', 'taa':'stop', 'tag':'stop', 'cat':'H', 'cac':'H', 'caa':'Q', 'cag':'Q', 'aat':'N', 'aac':'N', 'aaa':'K', 'aag':'K', 'gat':'D', 'gac':'D', 'gaa':'E', 'gag':'E', 'tgt':'C', 'tgc':'C', 'tga':'stop', 'tgg':'W', 'cgt':'R', 'cgc':'R', 'cga':'R', 'cgg':'R', 'agt':'S', 'agc':'S', 'aga':'R', 'agg':'R', 'ggt':'G', 'ggc':'G', 'gga':'G', 'ggg':'G'}

fdat_nuc = parsefasta(params.aln)
fdat_aa = {k:translatedDNA(v) for k,v in fdat_nuc.items()}
isolateid_to_fdatheader = {re.search('(GBISL|EPIISL|GB_ISL_|EPI_ISL_)\d+', header).group().replace('_', ''):header for header in fdat_nuc.keys()}
fdatheader_to_isolateid = {v:k for k,v in isolateid_to_fdatheader.items()}
isolateid_to_age = {isolateid:int(re.search('AGE(\d+)', header).group(1)) for isolateid, header in isolateid_to_fdatheader.items()}

# sequences to ignore (outside of child_min and adult_max age)
isolates_to_ignore = [isolateid for isolateid, age in isolateid_to_age.items() if age < params.child[0] or age > params.adult[-1]]

# parse clstr file and sort each identical sequence cluster (>1 member sequence) into children/adult age categories 
# isolates_to_ignore now include those of ambiguous age categories (between child_max and adult_min) 
# + 0 pat dist sequences with > minimal number of n/-
isolateid_to_representatives, isolateid_to_agecategory = {}, {} 
fhandle = filter(None, open(params.nr, 'rU').readlines())
for key, group in itertools.groupby(fhandle, lambda _: re.search('^>Cluster', _)):
    if not key:
        cluster = list(group)
        if len(cluster) > 1:
            cluster = [re.search('(GBISL|EPIISL|GB_ISL_|EPI_ISL_)\d+', header).group().replace('_', '') for header in cluster]
            cluster_age = [isolateid_to_age[isolateid] for isolateid in cluster]
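            # NOTE: under Python 2, `isolateid` used as the age-category key below is the last member of `cluster`, leaked from the list comprehension above
            # all children or adults sequence clusters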
            if all([params.child[0] <= age <= params.child[-1] for age in cluster_age]):
                isolateid_to_agecategory[isolateid] = 'C'
                for isolateid in cluster:
                    isolateid_to_representatives[isolateid] = cluster
            elif all([params.adult[0] <= age <= params.adult[-1] for age in cluster_age]):
                isolateid_to_agecategory[isolateid] = 'A'
                for isolateid in cluster:
                    isolateid_to_representatives[isolateid] = cluster
            # uncategorized age present in sequence cluster 
            elif all([params.child[0] <= age <= params.adult[-1] for age in cluster_age]) and any([params.child[-1] < age < params.adult[0] for age in cluster_age]):
                isolateid_to_agecategory[isolateid] = 'U'
                for isolateid in cluster:
                    isolateid_to_representatives[isolateid] = [member for _, member in enumerate(cluster) if params.child[-1] < cluster_age[_] < params.adult[0]]
            else:
                isolates_to_ignore = list(set(isolates_to_ignore)|set(cluster))

# print example
example_isolateid = random.choice(isolateid_to_agecategory.keys())
print ('EXAMPLE of the age category of a CD-HIT cluster:\nRepresentative sequence = {}'.format(isolateid_to_fdatheader[example_isolateid]))
print ('Age category: {}'.format(isolateid_to_agecategory[example_isolateid]))
print ('Member sequences:\n{}'.format('\n'.join([isolateid_to_fdatheader[_] for _ in isolateid_to_representatives[example_isolateid]])))

EXAMPLE of the age category of a CD-HIT cluster:
Representative sequence = A/SouthCarolina/22/2016_HA_EPIISL225062_2016.39767283_ORI/CLI_A/H3N2_LAT33.836081LON-81.163724_AGE18
Age category: U
Member sequences:
A/SouthCarolina/19/2016_HA_EPIISL225061_2016.39493498_ORI/CLI_A/H3N2_LAT33.836081LON-81.163724_AGE19
A/SouthCarolina/25/2016_HA_EPIISL225067_2016.39767283_ORI/CLI_A/H3N2_LAT33.836081LON-81.163724_AGE19
A/SouthCarolina/17/2016_HA_EPIISL225128_2016.39219713_ORI/CLI_A/H3N2_LAT33.836081LON-81.163724_AGE25
A/SouthCarolina/22/2016_HA_EPIISL225062_2016.39767283_ORI/CLI_A/H3N2_LAT33.836081LON-81.163724_AGE18
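
The cluster-level rule applied above can be summarised in a few lines. The sketch below restates it for a single cluster given only the ages of its members. `cluster_category` is a hypothetical helper, and the first usage line reuses the ages from the example output above.

```python
def cluster_category(ages, child=(0, 5), adult=(35, 120)):
    """Classify an identical-sequence cluster from the ages of its members.

    'C': every member is a child; 'A': every member is an adult;
    'U': all ages lie between the child minimum and the adult maximum and at
         least one falls strictly between the child and adult ranges;
    None: anything else (such clusters are ignored).
    """
    if all(child[0] <= age <= child[1] for age in ages):
        return 'C'
    if all(adult[0] <= age <= adult[1] for age in ages):
        return 'A'
    in_bounds = all(child[0] <= age <= adult[1] for age in ages)
    has_between = any(child[1] < age < adult[0] for age in ages)
    if in_bounds and has_between:
        return 'U'
    return None

print(cluster_category([19, 19, 25, 18]))  # ages of the South Carolina cluster above
print(cluster_category([2, 4]))
print(cluster_category([3, 40]))
```

Note that a cluster containing only children and adults, such as the last one, matches none of the three categories and is therefore ignored.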


### Parsing phylogenetic tree using ete3

In [None]:
try:
    tree = ete3.Tree(params.tree)
except:
    sys.exit('\nERROR: Unable to parse tree using ete3.\n')
tree.ladderize()
# get nodes to isolate ids
leaf_to_isolateid = {leaf:re.search('(EPI_ISL_|EPIISL|GB_ISL_|GBISL)\d+', leaf.get_leaf_names()[0]).group().replace('_', '') for leaf in tree.get_leaves()}
isolateid_to_leaf = {v:k for k,v in leaf_to_isolateid.items()}
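
The regular expression accepts isolate IDs written with or without underscores, and `replace('_', '')` collapses them to a single form, presumably so that tree leaves, FASTA headers and CD-HIT cluster lines can be matched on the same key. The sketch below shows that normalisation on plain strings, without ete3. Both leaf names are made up for the example and `isolate_id` is a hypothetical helper.

```python
import re

ISOLATE_ID = re.compile(r'(EPI_ISL_|EPIISL|GB_ISL_|GBISL)\d+')

def isolate_id(name):
    """Extract the isolate ID from a sequence or leaf name and strip underscores."""
    match = ISOLATE_ID.search(name)
    if match is None:
        raise ValueError('no isolate ID found in {!r}'.format(name))
    return match.group().replace('_', '')

leaf_names = [
    'A/SouthCarolina/22/2016_HA_EPI_ISL_225062_AGE18',  # made-up variant with underscores
    'A/Example/1/2015_HA_GBISL12345_AGE40',             # hypothetical GenBank-style name
]
leaf_to_id = {name: isolate_id(name) for name in leaf_names}
id_to_leaf = {v: k for k, v in leaf_to_id.items()}  # inverse lookup, as in the cell above
print(leaf_to_id)
```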