## README: Making input fasta files

This script wrangles multiple sequence alignments and fasta files from `Nextstrain/fauna` to produce fasta files containing only the sequences that we want to include in our analysis of Zika in the Americas. It writes several fasta files, including ones that can be read into BEAST and ones that can be read into `Nextstrain/augur`.

In [87]:
#### import libraries ####
from Bio import SeqIO
from Bio import AlignIO
import datetime

date = datetime.datetime.now().strftime("%Y-%m-%d")


In [150]:
#### infile paths #### 
zika_msa_stripped = "/Users/alliblk/Desktop/gitrepos/augur-alliblk/zika/processed/zika_aligned_stripped.mfa"
fauna_file = "/Users/alliblk/Desktop/gitrepos/fauna/data/zika.fasta"

#### outfile paths #### 
americas_file = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/american-zika-{}.fasta'.format(date)
americas_frenchpol_file = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/american-frenchPolyn-zika-{}.fasta'.format(date)

usvi_file = "/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/usvi-{}.fasta".format(date)
usvi_primary_clade_file = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/usvi-primary-clade-{}.fasta'.format(date)


In [148]:
#### geographic exclusion criteria #### 
regions_to_exclude1 = ['southeast_asia', 'oceania', 'japan_korea', 'china', 'europe'] # excludes oceania, so French Polynesia is out
regions_to_exclude2 = ['southeast_asia', 'japan_korea', 'china', 'europe'] # keeps oceania, so French Polynesia is in

#reference info
reference_strain_name = 'PF13/251013_18' #WHO reference strain for Zika, msa is stripped to this seq

## Input data: Make dictionaries that hold data in different structures

* `strain_header_dict` maps strain name in augur form (key) to the full fauna header (value)
* `fauna_dict` maps fauna taxon name (key) to the unaligned, unstripped sequence as it's stored in fauna (value)
* `zika_msa_dict` maps strain name in augur form (key) to the aligned and stripped-to-reference sequence (value). All sequences in this dictionary should therefore be 10769 nucleotides long, and are in frame.
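
As a minimal sketch of the first mapping, assuming headers are pipe-delimited with the strain name in the first field, the strain-to-header lookup can be illustrated on toy data. The header lines and strain names below are made up for illustration and are not real fauna records.

```python
# Toy, made-up FASTA lines in a pipe-delimited header layout (hypothetical records).
toy_fasta_lines = [
    ">STRAIN_A|zika|ACC001|2016-01-01|north_america|usvi",
    "ACGTACGT",
    ">STRAIN_B|zika|ACC002|2016-02-01|south_america|brazil",
    "ACGTNNNN",
]

# Map strain name (first pipe-delimited field, minus '>') to the full header line.
toy_strain_header_dict = {
    line.split("|")[0].replace(">", ""): line.strip()
    for line in toy_fasta_lines
    if line.startswith(">")
}

print(toy_strain_header_dict["STRAIN_A"])
```

Because the strain name is the dictionary key, any other structure keyed by strain name (such as the aligned sequences) can be joined to the full header with a plain dictionary lookup.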

In [151]:


# dict that allows matching of fauna header to MSA header
with open(fauna_file,'rU') as file:
    strain_header_dict={line.split('|')[0].replace('>',''):line.strip() for line in file if line.startswith('>')}

# dict that allows matching of fauna header to fauna sequences (needed for making augur-formatted infiles)
fauna_dict = SeqIO.to_dict(SeqIO.parse(fauna_file, 'fasta'))
#print len(fauna_dict)

# dict that has only strain name as key; value is the aligned, stripped-to-reference sequence
zika_msa = AlignIO.read(zika_msa_stripped,'fasta')
zika_msa_dict = {record.id:record.seq for record in zika_msa}

# every aligned sequence should be as long as the stripped reference (10769 nt)
for key in zika_msa_dict.keys():
    assert len(zika_msa_dict[key]) == 10769
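
A bare `assert` stops at the first sequence with the wrong length without saying which one it was. A minimal sketch of a more informative check is below; the helper name `find_wrong_length` and the toy alignment are hypothetical, and the expected length is passed in rather than hardcoded.

```python
def find_wrong_length(alignment, expected_length):
    """Return {name: length} for every sequence whose length differs from expected_length."""
    return {
        name: len(seq)
        for name, seq in alignment.items()
        if len(seq) != expected_length
    }

# Toy alignment (made-up names and sequences); STRAIN_C is deliberately too short.
toy_alignment = {
    "STRAIN_A": "ACGT-ACG",
    "STRAIN_B": "ACGTAACG",
    "STRAIN_C": "ACGT",
}

offenders = find_wrong_length(toy_alignment, expected_length=8)
print(offenders)
```

An empty dictionary means every sequence has the expected length; anything else names the records to inspect.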

## Alignment QC:

I outline genome exclusion criteria in the README of the `data/fastas` directory. Genomes that should be excluded from my analysis are hardcoded into Augur as `dropped_strains` in `zika.prepare.py`.

In [143]:
informative_bases = ['A','G','C','T']
num_informative_bases = {}

# count unambiguous nucleotides (A, C, G, T) in each unaligned fauna sequence
for key in fauna_dict.keys():
    inform_base_count = 0
    for base in fauna_dict[key].seq.upper():
        if base in informative_bases:
            inform_base_count += 1
    # key the counts by strain name (first field of the fauna header)
    num_informative_bases[key.split('|')[0]] = inform_base_count


In [147]:
low_information_genomes = {key:value for key,value in num_informative_bases.items() if value < 5000}
print len(low_information_genomes)
print low_information_genomes.keys()
#these genomes will get added to the dropped_strains section of augur zika.prepare.py

15
['USA/2016/FL019', 'DOM/2016/MA_WGS16_009', 'DOM/2016/MA_WGS16_020', 'BRA/2016/FC_DQ12D1', 'USA/2016/FL035', 'HTI/2016/MA_WGS16_022', 'DOM/2016/BB_0428', 'HND/2016/HU_ME137', 'Bahia04', 'Bahia05', 'Bahia15', 'DOM/2016/MA_WGS16_031', 'JAM/2016/MA_WGS16_038', 'BRA/2016/FC_DQ68D1', 'Brazil/2016/ZBRY12']
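
The same QC step can be sketched as two small functions that work on plain strings, which makes the threshold explicit and easy to try on toy data. The function names and the toy sequences are hypothetical; only the A/C/G/T definition of an informative base and the "fewer than a threshold" rule come from the cells above.

```python
def count_informative_bases(sequence, informative="ACGT"):
    """Count the characters of `sequence` that are unambiguous nucleotides."""
    return sum(1 for base in str(sequence).upper() if base in informative)

def find_low_information(sequences, min_informative):
    """Return {name: count} for sequences with fewer than `min_informative` informative bases."""
    counts = {name: count_informative_bases(seq) for name, seq in sequences.items()}
    return {name: n for name, n in counts.items() if n < min_informative}

# Toy sequences (made up); the second is mostly N and gap characters.
toy_sequences = {
    "toy_full": "ACGTACGTACGT",
    "toy_gappy": "acNN--NNNNNN",
}

low_info = find_low_information(toy_sequences, min_informative=5)
print(low_info)
```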


## Making fasta files formatted for input into the Augur pipeline

Since Augur builds trees and reconstructs ancestral states faster than BEAST, I use Augur builds to troubleshoot possible issues with the input alignment and to look for outliers (general dataset QC). Augur input fasta files need to be formatted exactly the same way as the fauna output fasta file. So here I keep the header formatting the same as fauna, but subsample to exclude genomes from countries that I don't want to include and samples that appear to be outliers.

In [92]:
# prune dicts geographically
# note: every sample that should be dropped, except those listed in drop_augurPruned,
# is already hardcoded into Augur as a dropped strain,
# so it does not need to be dropped here in this script.
# I've entered them above mainly so I can keep stats on how many samples get dropped and why.
# header fields used below: [0] = strain name, [4] = region
geoPruned_fauna_dict_americasOnly = {key:value for key,value in fauna_dict.items() if key.split('|')[4] not in regions_to_exclude1 and key.split('|')[0] not in drop_augurPruned}
geoPruned_fauna_dict_includeOceania = {key:value for key,value in fauna_dict.items() if key.split('|')[4] not in regions_to_exclude2 and key.split('|')[0] not in drop_augurPruned}

#print out fauna-format files for input into augur

# Americas only
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/augur-americas-only.fasta','w') as file:
    for key in geoPruned_fauna_dict_americasOnly.keys():
        file.write(str('>' + geoPruned_fauna_dict_americasOnly[key].description + '\n' + geoPruned_fauna_dict_americasOnly[key].seq + '\n'))
        
#Americas + french polynesia
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/augur-americas-andfp.fasta','w') as file:
    for key in geoPruned_fauna_dict_includeOceania.keys():
        file.write(str('>' + geoPruned_fauna_dict_includeOceania[key].description + '\n' + geoPruned_fauna_dict_includeOceania[key].seq + '\n'))
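
The filter-then-write pattern above is repeated once per region list, so it could be folded into one helper. The sketch below is only an illustration under stated assumptions: it works on a plain dict of header string to sequence string rather than on Biopython records, the helper name `filtered_fasta_text` is hypothetical, and the toy headers are made up. The field positions (strain at index 0, region at index 4) are the ones used in the cell above.

```python
def filtered_fasta_text(records, excluded_regions, dropped_strains):
    """Build FASTA text from {header: sequence}, skipping excluded regions and dropped strains.

    Assumes pipe-delimited headers with strain name at index 0 and region at index 4.
    """
    entries = []
    for header, sequence in records.items():
        fields = header.split("|")
        if fields[4] in excluded_regions or fields[0] in dropped_strains:
            continue
        entries.append(">{}\n{}\n".format(header, sequence))
    return "".join(entries)

# Toy records (made-up headers and sequences).
toy_records = {
    "STRAIN_A|zika|ACC001|2016-01-01|north_america|usvi": "ACGT",
    "STRAIN_B|zika|ACC002|2014-01-01|oceania|french_polynesia": "ACGA",
    "STRAIN_C|zika|ACC003|2016-03-01|south_america|brazil": "ACGG",
}

americas_text = filtered_fasta_text(toy_records, ["oceania"], ["STRAIN_C"])
print(americas_text)
```

The returned text can then be written to any output path with a single `file.write(...)` call, once per region list.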
        

## Combining the augur processed multiple sequence alignment with the fauna-output fasta

Here I combine attributes of the `Nextstrain/augur` processed Zika MSA with the fasta output from `Nextstrain/fauna`. The fauna download has the strain information in the desired fasta format, with all necessary metadata (sampling date, geography) in the header. The processed multiple sequence alignment, however, has been aligned with mafft and stripped to the WHO ZIKV reference genome, and therefore contains the sequences that we want.

The header from the MSA contains the strain name of the sample, which is also in the fauna header. I therefore use key matching to make a new fasta file that combines the header from the fauna file with the sequences from the augur MSA. When writing out the fastas, the fauna header is trimmed down to remove superfluous information and to ensure that the headers can be read into FigTree.
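
The key matching and header trimming described here can be sketched on toy data as follows. The helper name `trim_header`, the toy header, and the toy sequence are hypothetical; the kept field positions (0, 3, 4, 5) match the ones used when the fastas are written out in this notebook, and reading index 3 as the sampling date is an assumption.

```python
def trim_header(full_header, keep_fields=(0, 3, 4, 5)):
    """Keep only the selected pipe-delimited fields of a header, in order."""
    fields = full_header.split("|")
    return "|".join(fields[i] for i in keep_fields)

# Toy lookups keyed by strain name (made-up values).
toy_headers = {"STRAIN_A": ">STRAIN_A|zika|ACC001|2016-01-01|north_america|usvi|extra|fields"}
toy_msa = {"STRAIN_A": "ACGT-ACGT"}

# Join on strain name: trimmed header from one dict, aligned sequence from the other.
combined = {trim_header(toy_headers[strain]): seq for strain, seq in toy_msa.items()}
print(combined)
```

Since the full header here already starts with `>`, the trimmed header does too, so it can be written straight to a fasta file.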

In [93]:
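# zika_msa_dict_pruned is not built in the cells shown here;
# it is assumed to be zika_msa_dict with unwanted strains removed
output_dict = {}
for key in zika_msa_dict_pruned.keys():
    header = strain_header_dict[key]   # full fauna header, including the leading '>'
    seq = zika_msa_dict_pruned[key]    # aligned, stripped-to-reference sequence
    output_dict[header] = seq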

americas_count = 0
oceania_count = 0
non_americas_non_oceania_count = 0 

for key in output_dict.keys():
    if key.split('|')[4] not in regions_to_exclude1:
        americas_count += 1
    elif key.split('|')[4] == 'oceania':
        oceania_count += 1
    elif key.split('|')[4] in regions_to_exclude1:
        non_americas_non_oceania_count += 1

print len(output_dict.keys())
print americas_count
print oceania_count
print non_americas_non_oceania_count

325
299
26
0


In [94]:
#print out Americas only multiple sequence alignment
with open(americas_file,'w') as out_file:
    for key in output_dict.keys():
        if key.split('|')[4] in regions_to_exclude1:
            continue
        else:
            split_name = key.split('|')
            header = split_name[0] +'|'+ split_name[3] + '|'+ split_name[4] + '|'+ split_name[5]
            out_file.write(str(header + '\n' + output_dict[key] + '\n'))

# Uncomment to print out alignment that still contains french polynesian sequences
#print out Americas and french polynesia multiple sequence alignment
'''
with open(americas_frenchpol_file,'w') as out_file:
    for key in output_dict.keys():
        if key.split('|')[4] in regions_to_exclude2:
            continue
        else:
            split_name = key.split('|')
            header = split_name[0] +'|'+ split_name[3] + '|'+ split_name[4] + '|'+ split_name[5] 
            out_file.write(str(header + '\n' + output_dict[key] + '\n'))
'''
#print out USVI only multiple sequence alignment
with open(usvi_file,'w') as out_file:
    for key in output_dict.keys():
        split_name = key.split('|')
        if split_name[5] == 'usvi':
            header = split_name[0] +'|'+ split_name[3] + '|'+ split_name[4] + '|'+ split_name[5]
            out_file.write(str(header + '\n' + output_dict[key] + '\n'))
        