## README: Making input fasta files

This script wrangles multiple sequence alignments and fasta files from `Nextstrain/fauna` to produce fasta files containing only the sequences that we want to include in our analysis of Zika in the Americas. It writes out several fasta files, including ones that can be read into BEAST and ones that can be read into `Nextstrain/augur`.

In [45]:
#### import libraries ####
from Bio import SeqIO
from Bio import AlignIO
import datetime
import pandas as pd
import numpy as np
import csv
date = datetime.datetime.now().strftime ("%Y-%m-%d")


## Step 1: Remove sequences from geographic areas we do not want to include (this is everything other than Americas and French Polynesia).

In [63]:
fauna_file = "../data/zika-fauna-with-correct-usvi-seqs-2019-11-20.fasta"
fauna_dict = SeqIO.to_dict(SeqIO.parse(fauna_file, 'fasta'))
print('{} sequences were available from Fauna.'.format(len(fauna_dict)))

regions_to_exclude = ['southeast_asia', 'japan_korea', 'south_asia', 'africa', 'china','europe']
reference_strain_name = 'PF13/251013_18' 

geoPruned_fauna_dict = {fauna_dict[key].description:fauna_dict[key].seq for key in fauna_dict.keys() if key.split('|')[4] not in regions_to_exclude}
print ('{} sequences meet the geographic criteria.'.format(len(geoPruned_fauna_dict)))


738 sequences were available from Fauna.
568 sequences meet the geographic criteria.
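
The geographic filter relies on the pipe-delimited layout of fauna headers. The sketch below names only the field positions that this notebook itself indexes (0, 2, 3, 4, 5 and 10); the example header is made up, and the remaining positions are filled with placeholders because they are not verified here. `parse_fauna_header` is a hypothetical helper, not part of fauna.

```python
# Field positions as indexed elsewhere in this notebook:
# 0 = strain, 2 = accession, 3 = sampling date, 4 = region, 5 = country, 10 = authors.
FIELD_INDEX = {"strain": 0, "accession": 2, "date": 3,
               "region": 4, "country": 5, "authors": 10}

def parse_fauna_header(header):
    """Split a pipe-delimited header into the fields this notebook uses."""
    parts = header.lstrip(">").split("|")
    return {name: parts[i] for name, i in FIELD_INDEX.items() if i < len(parts)}

# Hypothetical header for illustration only; f1, f6..f9 are placeholders.
example = "EX/1/2016|f1|XX000000|2016-08-03|north_america|usvi|f6|f7|f8|f9|Example et al"
fields = parse_fauna_header(example)

regions_to_exclude = ["southeast_asia", "japan_korea", "south_asia",
                      "africa", "china", "europe"]
keep = fields["region"] not in regions_to_exclude
print(fields["strain"], fields["region"], fields["country"], keep)
```

Naming the positions once makes it easier to spot an off-by-one index than repeating `key.split('|')[4]` throughout.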


## Step 2: Change region naming so that caribbean and central_america are not counted as "north_america" (which is fauna's default).

In [64]:
central_america = ['panama', 'costa_rica', 'honduras', 'el_salvador', 'nicaragua', 'guatemala', 'belize']
caribbean = ['barbados', 'usvi', 'puerto_rico', 'dominican_republic', 'martinique', 'guadeloupe', 'saint_barthelemy', 'jamaica', 'haiti', 'cuba']

regionallyNamed_dict = {}

for key in geoPruned_fauna_dict.keys():
    country = key.split('|')[5]
    if country in central_america:
        new_key = key.replace('north_america', 'central_america')
    elif country in caribbean:
        new_key = key.replace('north_america', 'caribbean')
    else:
        new_key = key
    regionallyNamed_dict[new_key] = geoPruned_fauna_dict[key]
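
Note that `key.replace('north_america', ...)` rewrites that string wherever it occurs in the header. A stricter variant, sketched below, rewrites only the region field. It assumes region sits at index 4 and country at index 5, as in the indexing above; `rename_region` is a hypothetical helper and the headers are toy examples. Unlike the loop above, it sets the region for a listed country whatever the field held before.

```python
central_america = ['panama', 'costa_rica', 'honduras', 'el_salvador',
                   'nicaragua', 'guatemala', 'belize']
caribbean = ['barbados', 'usvi', 'puerto_rico', 'dominican_republic', 'martinique',
             'guadeloupe', 'saint_barthelemy', 'jamaica', 'haiti', 'cuba']

def rename_region(header, region_index=4, country_index=5):
    """Return the header with only its region field rewritten."""
    parts = header.split('|')
    country = parts[country_index]
    if country in central_america:
        parts[region_index] = 'central_america'
    elif country in caribbean:
        parts[region_index] = 'caribbean'
    return '|'.join(parts)

# Toy headers (not real records).
print(rename_region('EX/1|f1|acc|2016-01-01|north_america|panama'))
print(rename_region('EX/2|f1|acc|2016-01-01|north_america|usvi'))
print(rename_region('EX/3|f1|acc|2016-01-01|north_america|usa'))
```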

## Step 3: Remove sequences that lack sufficient numbers of informative bases to align well. 

Here I want sequences to have at least 80% of their bases be informative calls. I'm taking the sequences that were from the correct region (saved in the `regionallyNamed_dict`), selecting all USVI sequences and then any other sequence in that dictionary that is 80% or more informative base calls, and saving that trimmed down dataset to a new dictionary called `high_qual_seqs`.

In [65]:
## count N's in sequences (alignment checks)
def proportion_n(sequence):
    """Return the fraction of bases that are N (non-informative sites), for QC."""
    sequence = str(sequence)
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty sequence
    return sequence.upper().count('N') / float(len(sequence))

In [66]:
high_qual_seqs = {}
for key in regionallyNamed_dict.keys():
    if key.split('|')[5] == 'usvi': #keep all USVI sequences.
        high_qual_seqs[key] = regionallyNamed_dict[key]
    else:
        n_fraction = proportion_n(regionallyNamed_dict[key])
        if n_fraction < 0.2:
            high_qual_seqs[key] = regionallyNamed_dict[key]

print("There are {} sequences that are sampled from a relevant region and have over 80% informative bases".format(len(high_qual_seqs.keys())))

There are 476 sequences that are sampled from a relevant region and have over 80% informative bases
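
As a worked check of the threshold, here is a minimal sketch on toy sequences (not real data). A sequence passes when fewer than 20% of its bases are N. Only N is counted here, so other ambiguity codes are treated as informative. `fraction_n` is a stand-in written for this example.

```python
def fraction_n(sequence):
    """Fraction of bases that are N (case-insensitive)."""
    seq = str(sequence).upper()
    return seq.count("N") / len(seq) if seq else 0.0

# Three toy 40-base sequences with 4, 8 and 12 N's.
toy = {
    "passing": "ACGT" * 9 + "N" * 4,
    "borderline": "ACGT" * 8 + "N" * 8,
    "failing": "ACGT" * 7 + "N" * 12,
}

for name, seq in toy.items():
    frac = fraction_n(seq)
    print(name, frac, frac < 0.2)
```

The borderline case matters: with a strict `< 0.2` comparison, a sequence with exactly 20% N is dropped.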


## Step 4: Remove any sequences for which we do not have permission to include in published analyses.

From the set of sequences that meet my quality and region thresholds, I need to filter out any that are in GenBank but unpublished and that we don't have permission to include. The input dictionary here is `high_qual_seqs` and the output dictionary is `publishable_seqs_dict`.

In [67]:
# First step: write out the accessions and authors for the sequences that are still left in the preliminary dataset.
# I'll then go through this list by hand and check which sequences have been published on and which are in GenBank but unpublished.
# That list guides the manual permissions check read in below.
with open('../data/sequence-accessions-author-info.csv', 'w') as file:
    for key in high_qual_seqs.keys():
        file.write("%s,%s,%s\n"%(key.split('|')[0], key.split('|')[2], key.split('|')[10]))

In [69]:
# read in the dataframe that has the permissions information,
# then parse that to select out all strains that can be included in a publishable analysis
# these are the strains that should be used, and form the fauna subset we want.
genome_permissions = pd.read_csv('../data/sequence-permissions-info-manually-verified.csv', delimiter =',')

publishable_strains = []
for i in range(len(genome_permissions)):
    record = genome_permissions.iloc[i]
    if record['preliminarily_include'] == 'yes':
        publishable_strains.append(record['strain_name'])

# using the strains in the publishable_strains list, pull out the full fauna headers (and sequences)
# for each strain that can be published on.
# also making sure we are only dealing with sequences within the regions of interest (Americas and Oceania)
publishable_seqs_dict = {}
for strain in publishable_strains:
    for key in high_qual_seqs.keys():
        if key.startswith(strain):
            publishable_seqs_dict[key] = high_qual_seqs[key]
            
print("There are {} genomes that we can include in published analyses.".format(len(publishable_seqs_dict.keys())))


There are 394 genomes that we can include in published analyses.
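
One caveat with `key.startswith(strain)`: a strain name that is a prefix of another strain name will match both headers. A stricter sketch is shown below, assuming the strain name is always the first pipe-delimited field. The headers are toy examples and `select_by_strain` is a hypothetical helper.

```python
def select_by_strain(seqs_by_header, strains):
    """Keep entries whose first header field exactly equals a wanted strain."""
    wanted = set(strains)
    return {header: seq for header, seq in seqs_by_header.items()
            if header.split("|")[0] in wanted}

# Toy data: "EX/1" is a prefix of "EX/10".
toy = {"EX/1|a": "ACGT", "EX/10|b": "ACGT", "EX/2|c": "ACGT"}

prefix_matches = [h for h in toy if h.startswith("EX/1")]
exact_matches = select_by_strain(toy, ["EX/1"])
print(prefix_matches)
print(list(exact_matches))
```

Building a set once also avoids looping over every header for every strain.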


## Step 5: Remove known sequences which are all cell culture passages of the same clinical isolate

There is a large number of Colombian sequences that are all sequences of various passages of a single strain. Here, I want to keep the one sequence that was directly sequenced from the clinical isolate, and remove all of the other sequences. The output dictionary here (with the sequences I want to keep) is `publishable_seqs_no_cell_culture_passage_dict`.

In [70]:
col_flr_strains_to_remove = [strain for strain in publishable_strains if strain.startswith("COL/FLR") and strain != 'COL/FLR/2015']
print("There were {} sequences from different culture passages of COL/FLR/2015 that should be removed.".format(len(col_flr_strains_to_remove)))

publishable_seqs_no_cell_culture_passage_dict = {}
for key in publishable_seqs_dict.keys():
    if key.split('|')[0] not in col_flr_strains_to_remove:
        publishable_seqs_no_cell_culture_passage_dict[key] = publishable_seqs_dict[key]

assert(len(publishable_seqs_no_cell_culture_passage_dict.keys()) == (len(publishable_seqs_dict.keys()) -(len(col_flr_strains_to_remove))))
print("There are now {} sequences that should be included in the build.".format(len(publishable_seqs_no_cell_culture_passage_dict)))

There were 33 sequences from different culture passages of COL/FLR/2015 that should be removed.
There are now 361 sequences that should be included in the build.


## Now, write out the sequences that should be part of the build to an input fasta.

In [71]:
with open("../data/usvi-and-high-qual-americas-input-data.fasta","w") as file:
    for key in publishable_seqs_no_cell_culture_passage_dict.keys():
        file.write(str(">" + key + "\n" + publishable_seqs_no_cell_culture_passage_dict[key].lower() + "\n" ))

In [20]:
import json
from pprint import pprint

with open('/Users/alliblk/Desktop/gitrepos/augur/builds/zika/prepared/zika.json') as zika_json_file:    
    zika_json_data = json.load(zika_json_file)
#pprint(data)
#print data.keys()
print(zika_json_data['sequences']['USVI/45/2016']['attributes']['raw_date'])

2016-08-03


In [39]:
print(zika_json_data['sequences']['USVI/45/2016']['attributes']['country'])

usvi


Quick map of how the JSON will get parsed!

Layer 1: `data.keys()`: info, colors, lat_longs, reference, sequences (the one we care about here is sequences).

Layer 2: `data['sequences'].keys()`: Brings up all strains, with canonicalized strain names (e.g. USVI/45/2016).

Layer 3: `data['sequences'][<strain>].keys()`: Brings up attributes and seq. The seq is the full, unaligned sequence; attributes has further keys.

Layer 4: `data['sequences'][<strain>]['attributes'].keys()`: Brings up division, title, url, country, region, db, accession, raw_date, strain, authors, paper_url, and journal.
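
The four layers can be walked with chained lookups. The sketch below uses a small stand-in dictionary rather than the real prepared JSON: the `country` and `raw_date` values are the ones printed above for USVI/45/2016, the `seq` value is a placeholder, and `get_attribute` is a hypothetical helper.

```python
import json

# Stand-in for the prepared JSON; only the nesting mirrors the layers above.
toy_json = json.loads("""
{
  "info": {}, "colors": {}, "lat_longs": {}, "reference": {},
  "sequences": {
    "USVI/45/2016": {
      "seq": "ACGT",
      "attributes": {"country": "usvi", "raw_date": "2016-08-03"}
    }
  }
}
""")

def get_attribute(data, strain, attribute, default=None):
    """Look up one attribute for one strain, returning default if any layer is missing."""
    return (data.get("sequences", {})
                .get(strain, {})
                .get("attributes", {})
                .get(attribute, default))

print(sorted(toy_json.keys()))
print(get_attribute(toy_json, "USVI/45/2016", "raw_date"))
print(get_attribute(toy_json, "NOT/A/STRAIN", "raw_date", default="unknown"))
```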

/Users/alliblk/Desktop/gitrepos/augur/builds/zika/processed/zika_aligned_stripped.mfa


In [40]:
#### infile paths #### 
zika_msa_stripped = "/Users/alliblk/Desktop/gitrepos/augur/builds/zika/processed/zika_aligned_stripped.mfa"
fauna_file = "/Users/alliblk/Desktop/gitrepos/fauna/data/zika.fasta"

#### outfile paths #### 
americas_file = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/american-zika-{}.fasta'.format(date)
americas_frenchpol_file = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/american-frenchPolyn-zika-{}.fasta'.format(date)

usvi_file = "/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/usvi-{}.fasta".format(date)
usvi_primary_clade_file = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/usvi-primary-clade-{}.fasta'.format(date)


Step 1: Load in all available sequences in from Fauna. This should be all Zika genomes that have been made publicly available on GenBank, ViPR, and GitHub. Note that this does NOT mean that all of the genomes have been published on.

Note that you will need to clone the `nextstrain/fauna` git repo and run fauna to have this file. 

Download zika sequences using the command call in the `builds/ZIKA.README` file:

`python vdb/zika_download.py -db vdb -v zika --fstem zika --resolve_method choose_genbank`

Using this exact command is important because the reference strain PF13/251013_18 that we're using as our outgroup has been sequenced twice. The version of that sequence that we want is Troesemeier et al.'s.

Step 2: Make a subset of the fauna sequences that only includes the geographic regions of interest. Here, I want only Zika sequences from the Americas. I need an outgroup sequence to properly root the tree however, so I'm also going to include a single genome from French Polynesia. 

In [134]:
old_permission_info = pd.read_csv('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/sequences_and_permissions.txt',delimiter='\t')

In [135]:
#print [key.split('|')[2] for key in geoPruned_fauna_dict.keys()]
new_accessions_to_check = [key for key in geoPruned_fauna_dict.keys() if key.split('|')[2] not in pd.Series(old_permission_info['accession_number']).tolist()]

strain_names = [accession.split('|')[0] for accession in new_accessions_to_check]
ncbi_id = [accession.split('|')[2] for accession in new_accessions_to_check]
lead_author =  [accession.split('|')[10].replace('et al','') for accession in new_accessions_to_check]


new_accessions_df = pd.DataFrame(np.column_stack([strain_names, ncbi_id, lead_author]), columns= ['strain_name', 'accession_number','lead_author'])
new_accessions_df.to_csv('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/new_accesions_between_08152017_and_03212018.csv', index=False)
#new_accessions_df = pd.DataFrame(strain_names, ncbi_id, lead_author, columns = ['strain_name', 'accession_number','lead_author','permission_type','publication_pubmed_id','permission_to_use','preliminarily_include'])


In [165]:
VIseqs = [key.split('|')[0] for key in geoPruned_fauna_dict.keys() if key.split('|')[0].startswith('USVI') or key.split('|')[0].startswith('USIV')]
print(len(VIseqs))
ordered_VI_strain_numbers = sorted([int(strain.split('/')[1]) for strain in VIseqs])
print(ordered_VI_strain_numbers)



31
[1, 2, 3, 4, 5, 6, 7, 11, 12, 13, 19, 20, 21, 22, 23, 24, 25, 27, 28, 30, 32, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]


Step 3: Filter sequences so that we only include sequences that we are allowed to publish on. These include any sequences that have been previously released during publication, or any sequences for which the authors have given us permission to use them in our analysis.

This information has been collated in a table that I'm going to import in here. Note that the table is NOT exhaustive for all genomes available from Fauna. Rather, it only includes sequences from the Americas. Sequence accessions have been manually checked and matched with publications if they are published on (with pubmed ID information). If not published on, sequences have a flag that denotes whether author permission was received or not.

This table also has an indicator column labeled `preliminarily_include`. A record says `no` in this column when permission would have had to be requested but the sequence was unlikely to add much to the analysis, so permission was not pursued.

Below I have lists of all the publications as they're listed in the Fauna headers, which represent the title given in GenBank but _not necessarily the title of the actual published manuscript_. They are divided into publications that are published, ones that are not published, and ones that are in fact published but the actual manuscript name is different than the GenBank publication name, and therefore they appear to be unpublished in GenBank.

---------------

`published_submissions` = ['First Complete Genome Sequence of Zika Virus (Flaviviridae, Flavivirus) from an Autochthonous Transmission in Brazil',
'Genetic characterization of the Zika virus epidemic in the US Virgin Islands',
'Isolation of Infective Zika Virus from Urine and Saliva of Patients in Brazil',
'Full-length infectious cDNA clone of Zika virus from 2015 epidemic in Brazil: development and characterization of recombinant viruses in cell lines from human placenta, testis, and brain',
'Zika virus complete genome from Salvador, Bahia, Brazil',
'Complete Genome Sequences of Three Historically Important, Spatiotemporally Distinct, and Genetically Divergent Strains of Zika Virus: MR-766, P6-740, and PRVABC-59',
'Isolation of infectious Zika virus from saliva and prolonged viral RNA shedding in a traveller returning from the Dominican Republic to Italy, January 2016',
'Fatal Zika Virus Infection with Secondary Nonsexual Transmission',
'Quasispecies composition and evolution of a typical Zika virus clinical isolate from Suriname',
'First Complete Genome Sequences of Zika Virus Isolated from Febrile Patient Sera in Ecuador',
'Epidemic establishment and cryptic transmission of Zika virus in Brazil and the Americas',
'Genomic epidemiology reveals multiple introductions of Zika virus into the United States',
'Coinfection With Zika and Dengue-2 Viruses in a Traveler Returning From Haiti, 2016: Clinical Presentation and Genetic Analysis',
'Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples',
'Large returning population of oversea Chinese of Guangdong province combined with tourists increase the risks of ZIKV transmission in Guangdong, China',
'Zika Virus Targets Different Primary Human Placental Cells, Suggesting Two Routes for Vertical Transmission',
'Establishment and cryptic transmission of Zika virus in Brazil and the Americas',
'Zika Virus Outbreak in Haiti in 2014: Molecular and Clinical Data',
'Prolonged Detection of Zika Virus in Vaginal Secretions and Whole Blood',
'Complete Genome Sequences of Identical Zika virus Isolates in a Nursing Mother and Her Infant',
'?',
'Zika virus in the Americas: early epidemiological and genetic findings',
'Full Genome Sequence and sfRNA Interferon Antagonist Activity of Zika Virus from Recife, Brazil',
'Distinct Zika Virus Lineage in Salvador, Bahia, Brazil',
'Genome sequence of a candidate World Health Organization reference strain for Zika virus for nucleic acid testing',
'Zika virus evolution and spread in the Americas',
'Zika virus genome from the Americas',
'Complete coding sequence of Zika virus from Martinique outbreak in 2015',
'Zika Virus Associated with Microcephaly',
'Characterization of a Zika Virus Isolate from Colombia',
'Detection and sequencing of Zika virus from amniotic fluid of fetuses with microcephaly in Brazil: a case study',
'Phylogeny of Zika Virus in Western Hemisphere, 2015',
'Complete genomic sequence of Zika virus isolated from a clinical semen sample',
'Infection dynamics in a traveller with persistent shedding of Zika virus RNA in semen for six months after returning from Haiti to Italy, January 2016',                        
                  ]

`published_submissions_with_incorrect_publication_name` = [ 'Complete Zika Virus Genome Sequences from a Serum Sample and after Isolation in Vero Cells',
'Full coding sequence of the Zika Virus isolate HS-2015-BA-01',
'Large returning population of oversea Chinese of Guangdong province combined with tourists increase the risks of ZIKV transmission in Guangdong, China',
'The isolation and characterization of Zika virus imported into Guangdong, China',
'Full genomic characterization of a Zika virus isolate from Colombia',
'Molecular characterization of Zika virus in Mexico',
'Repurposed drug candidates to treat ZIKV infection in pregnancy',
'Full-genome amplification and sequencing of Zika viruses using a targeted amplification approach',
'Zika virus infection with prolonged maternal viremia and fetal brain abnormalities',
'Complete genome of Zika virus GZ01 isolate from China',
'Complete genomic sequence of Zika virus isolated from a traveler to Colombia in 2016',
                                                       
                                                         
]

`unpublished_submissions` = [
'Imported cases of Zika virus disease in Russia in 2016-2017',
'Complete Genome of a Zika Virus Isolate, Amazonas, Brazil, 2016',
'Imported cases of Zika virus disease in Russia in 2016-2017',
'Zika viral sequence identified in Fortaleza, Brazil',
'Comparative analysis between the genomes of intra-host and cell cultured Zika virus obtained from a Mexican symptomatic patient',
'Zika virus from breast milk',
'Genome sequences of Zika viruses isolated in Nicaragua',              
'Emergence of Zika virus in Brazil: Detection of Asian genotype',
'Brazilian Zika virus genome sequence',
'Zika viruses in Haiti during an outbreak of Chikungunya Fever in mid 2014', #white et al sequences
'Zika virus sequence from Cuba traveler diagnosed in the United States (Miami, FL)', #nate
'Zika virus genomes from human cases in Florida, USA',
'Isolation of infectious Zika virus from sera in a traveller returning from Brazil to Italy',
'Zika virus infection in pregnant mice',
'Zika virus polyprotein sequence isolated from a fetus-brain of Cuba pregnant traveler diagnosed in Spain (Madrid)',
'Genome sequences of four Zika virus isolates from Brazil',
'Direct Submission'
                          ]

In [125]:
permission_new = pd.read_csv('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/sequences_and_permissions_03212018.txt',delimiter='\t')
permission_new

| Unnamed: 0 | strain_name | accession_number | lead_author | permission_type | publication_pubmed_id | permission_to_use | preliminarily_include | Unnamed: 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | PHE_Guadeloupe | KX673530 | Atkinson | previously_published | 27738033.0 | not_necessary | yes | |
| 1 | MEX_I_7 | KX247632 | Barrows | previously_published | 27476412.0 | not_necessary | yes | |
| 2 | Dominican_Republic/2016/PD1 | KU853012 | Barzon | previously_published | 26987769.0 | not_necessary | yes | |
| 3 | Haiti/2016/PD | KX269878 | Barzon | previously_published | 27542178.0 | not_necessary | yes | |
| 4 | VEN/UF_1/2016 | KX702400 | Blohm | previously_published | 28450510.0 | not_necessary | yes | |
| 5 | VEN/UF_2/2016 | KX893855 | Blohm | previously_published | 28450510.0 | not_necessary | yes | |
| 6 | Rio_U1 | KU926309 | Bonaldo | previously_published | 27341420.0 | not_necessary | yes | |
| 7 | Rio_S1 | KU926310 | Bonaldo | previously_published | 27341420.0 | not_necessary | yes | |
| 8 | MEX_CIENI551 | KY120348 | Boukadida | previously_published | 28336600.0 | not_necessary | yes | |
| 9 | Brazil_ZKV2015 | KU497555 | Calvet | previously_published | 26897108.0 | not_necessary | yes | |


In [167]:
strains_with_permission = []

for i in range(len(permission_new)):
    record = permission_new.iloc[i]
    if record['permission_to_use'] != 'permission_not_received' and record['preliminarily_include'] == 'yes':
        strains_with_permission.append(record['strain_name'])

print(len(strains_with_permission))

324
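
The two-condition filter can be checked on a tiny table without pandas. In this sketch only the first data row is taken from the table shown above; the `EXAMPLE/...` rows are hypothetical and exist solely to exercise each branch of the condition.

```python
import csv
import io

# Tab-delimited toy table; EXAMPLE rows are invented for illustration.
toy_table = io.StringIO(
    "strain_name\tpermission_to_use\tpreliminarily_include\n"
    "PHE_Guadeloupe\tnot_necessary\tyes\n"
    "EXAMPLE/1\tpermission_not_received\tyes\n"
    "EXAMPLE/2\tnot_necessary\tno\n"
)

strains_ok = [
    row["strain_name"]
    for row in csv.DictReader(toy_table, delimiter="\t")
    if row["permission_to_use"] != "permission_not_received"
    and row["preliminarily_include"] == "yes"
]
print(strains_ok)
```

A strain is kept only when both conditions hold, so a `yes` in `preliminarily_include` does not override a missing permission.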


In [168]:
useable_genomes_dict_faunaHeader = {}
useable_genomes_dict_augurHeader = {}
#making a dict for each kind of header here because it's really nice to have both when troubleshooting.
for strain in strains_with_permission:
    for key in geoPruned_fauna_dict.keys():
        if key.startswith(strain):
            useable_genomes_dict_faunaHeader[key] = geoPruned_fauna_dict[key]
            useable_genomes_dict_augurHeader[key.split('|')[0]] = geoPruned_fauna_dict[key]

print(len(useable_genomes_dict_faunaHeader))

324


In [169]:
print(len(useable_genomes_dict_augurHeader[reference_strain_name]))

10769


Step 4: Now that we have our set of genomes that we can use, it's time to do some preliminary alignment quality control. I outline genome exclusion criteria in the README of the `data/fastas` directory. Genomes that should be excluded from my analysis are hardcoded into Augur as `dropped_strains` in `zika.prepare.py`. Most of the process of determining outlier clades is done iteratively based on Augur builds of the tree; however, flagging strains with fewer than about 5000 informative sites is something that I do in this notebook (below, approximated as fewer than 50% informative bases).

In [170]:
#write out files that Augur will take in
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/augur-publishedAmericanSeqs-and-ref-03302018.fasta','w') as file:
    for key in useable_genomes_dict_faunaHeader.keys():
        file.write(str('>' + key + '\n' + useable_genomes_dict_faunaHeader[key] + '\n'))

In [151]:
informative_bases = ['A','G','C','T']

fraction_informative_bases = {}

for key in fauna_dict.keys():
    inform_base_count = 0
    for base in fauna_dict[key].upper():
        if base in informative_bases:
            inform_base_count +=1
        else:
            continue
    fraction_informative_bases[key.split('|')[0]] = float(inform_base_count)/len(fauna_dict[key])

poor_qual_genomes = {key:value for key,value in fraction_informative_bases.items() if value < 0.5}

print(len(poor_qual_genomes))
print(list(poor_qual_genomes.keys()))
#wanting to get a sense of how many genomes we could consider poor quality (>50% N's)
#we may want to exclude these from the analysis

22
['DOM/2016/MA_WGS16_009', 'DOM/2016/MA_WGS16_020', 'HND/2016/Honduras_1138', 'HND/2016/HU_ME137', 'MEX/2016/mex27', 'BRA/2016/FC_DQ60D1', 'GTM/2016/Guatemala_3294', 'Brazil/2016/ZBRY12', 'BRA/2016/FC_DQ75D1', 'DOM/2016/BB_0085', 'USA/2016/FL019', 'BRA/2016/FC_DQ12D1', 'USA/2016/FL035', 'MEX/2016/mex31', 'MEX/2016/mex50', 'Bahia05', 'COL/FH01/2016', 'Bahia04', 'DOM/2016/BB_0428', 'Bahia15', 'DOM/2016/MA_WGS16_031', 'MEX/2016/Mexico_4227']


... Interlude here to do all the exploration in Augur ...

STEPS: 

* `cd` in to augur/builds/zika
* run `zika.prepare.py` with subsampling off, pulling from the specific fasta that we wrote out earlier, which excludes all non-American isolates except for the French Polynesian outgroup: `python zika.prepare.py -v 0 --sequences /Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/augur-publishedAmericanSeqs-and-ref-03302018.fasta`
* look at which sequences got excluded (dropped strains) in the `prepared` directory.
* iterate through excluding any additional strains by adding them to the `dropped_strains` list in `zika.prepare.py`.

Prepare clock filter dropped the following strains:
* Bahia05
* PF13/251013_18 ... this is the outgroup. What's going on?
* Brazil/2016/ZBRC18
* DOM/2016/MA_WGS16_020
* USA/2016/FLUR001
* USA/2016/FLUR057
* BRA/2016/FC_DQ12D1

Done :)

Step 5: Now that I have the alignment that I want to use, including a good sequence set that is aligned and stripped to reference, I want to make fasta files that I can read into BEAST. These should use sequences from the multiple sequence alignment but carry fully informative headers (which I'll need to grab from Fauna).

Basically, I'll need to use these two dictionaries to do this:

* `strain_header_dict` maps strain name in augur form (key) to the full fauna header (value).

* `zika_msa_dict` maps strain name in augur form (key) to the aligned and stripped-to-reference sequence (value). All sequences in this dictionary should therefore be 10769 nucleotides long, and are in frame.
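
The join between the two dictionaries can be sketched as follows. This is a toy illustration with made-up strains and short placeholder sequences. `attach_headers` is a hypothetical helper, and falling back to the bare strain name for strains missing from fauna is an assumption about the desired behavior.

```python
def attach_headers(msa_by_strain, header_by_strain):
    """Re-key aligned sequences by full header; report strains with no header."""
    output, missing = {}, []
    for strain, seq in msa_by_strain.items():
        header = header_by_strain.get(strain)
        if header is None:
            missing.append(strain)
            header = strain  # fall back to the augur-style strain name
        output[header] = seq
    return output, missing

# Toy inputs (not real records).
toy_msa = {"EX/1": "ACGT", "EX/2": "ACGA"}
toy_headers = {"EX/1": ">EX/1|f1|acc|2016-01-01|north_america|usa"}

output, missing = attach_headers(toy_msa, toy_headers)
print(sorted(output))
print(missing)
```

Returning the list of missing strains makes it easy to see which sequences entered the alignment from somewhere other than fauna.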

In [174]:
#make strain_header_dict
with open(fauna_file,'r') as file: #the 'rU' mode was removed in Python 3.11
    strain_header_dict={line.split('|')[0].replace('>',''):line.strip() for line in file if line.startswith('>')}

#make msa dict
zika_msa = AlignIO.read(zika_msa_stripped,'fasta')
zika_msa_dict = {record.id:record.seq for record in zika_msa}

print('{} sequences pulled in from multiple sequence alignment.'.format(len(zika_msa_dict)))

320 sequences pulled in from multiple sequence alignment.


In [201]:
output_dict = {}
non_fauna_seqs = []
for key in zika_msa_dict.keys():
    assert len(zika_msa_dict[key]) == 10769 #ensure that everything was trimmed to reference
    seq = zika_msa_dict[key] #get sequence from msa
    if key in strain_header_dict:
        header = strain_header_dict[key] #grab full fauna header from strain_header_dict
    else:
        non_fauna_seqs.append(key)
        header = key #no fauna header available, so fall back to the augur strain name
    output_dict[header] = seq

print(len(output_dict))
print(non_fauna_seqs)
#print(zika_msa_dict['USVI/15/2016'])

320
['USVI/15/2016', 'USVI/35/2016', 'USVI/45/2016', 'USVI/46/2016']


In [261]:
## MAKE BDMM dataset (sample dates need to be in same window as USVI sampling 2016 to 2017)
## Have all Brazil, downsample USA, Mexico not needed.

from Bio import SeqIO
analysis_seq_dict  = SeqIO.to_dict(SeqIO.parse("/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/beast-americasZika-04012018.fasta", "fasta"))
#print analysis_seq_dict.keys()

In [264]:
from collections import defaultdict
country_binning_dict = defaultdict(dict)

for key in analysis_seq_dict.keys():
    country = key.split('|')[3]
    country_binning_dict[country].update({key:analysis_seq_dict[key].seq})
    
#print country_binning_dict['brazil']
for key in country_binning_dict.keys():
    print("{0}: has {1} sequences.".format(key, len(country_binning_dict[key])))
    

{'Brazil/2016/ZBRX100|2016-05-19|south_america|brazil': Seq('NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNN', SingleLetterAlphabet()), 'BRA/2016/FC_6863|2016-04-12|south_america|brazil': Seq('NNNNNNNNNNNNNNNNNNNNNAGTTTGAAGCGAAAGCTAGCAACAGTATCAACA...NNN', SingleLetterAlphabet()), 'BeH819015|2015-XX-XX|south_america|brazil': Seq('GAATCAGACTGCGACAGTTCGAGTTTGAAGCGAAAGCTAGCAACAGTATCAACA...NNN', SingleLetterAlphabet()), 'Brazil/2016/ZBRY4|2016-01-15|south_america|brazil': Seq('NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNN', SingleLetterAlphabet()), 'Brazil/2016/ZBRY14|2016-01-XX|south_america|brazil': Seq('NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTAGCAACAGTATCAACA...NNN', SingleLetterAlphabet()), 'Brazil_ZKV2015|2015-11-30|south_america|brazil': Seq('GAATCAGACTGCGACAGTTCGAGTTTGAAGCGAAAGCTAGCAACAGTATCAACA...CGG', SingleLetterAlphabet()), 'Brazil/2016/ZBRX2|2016-04-18|south_america|brazil': Seq('NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNN', SingleLette

In [299]:
country_binned_dict2016 = defaultdict(dict)

for key in analysis_seq_dict.keys():
    country = key.split('|')[3]
    sampling_date = key.split('|')[1] #not named `date`, which holds today's date for the output file names
    if sampling_date.startswith('2016') and 'X' not in sampling_date:
        country_binned_dict2016[country].update({key:analysis_seq_dict[key].seq})

for key in country_binned_dict2016.keys():
    print("{0}: has {1} sequences.".format(key, len(country_binned_dict2016[key])))

brazil: has 59 sequences.
jamaica: has 4 sequences.
suriname: has 1 sequences.
honduras: has 22 sequences.
dominican_republic: has 31 sequences.
usa: has 40 sequences.
mexico: has 7 sequences.
nicaragua: has 8 sequences.
guatemala: has 1 sequences.
el_salvador: has 1 sequences.
puerto_rico: has 5 sequences.
guadeloupe: has 1 sequences.
colombia: has 10 sequences.
peru: has 2 sequences.
martinique: has 2 sequences.
haiti: has 3 sequences.
usvi: has 34 sequences.
venezuela: has 5 sequences.
saint_barthelemy: has 1 sequences.


### BDMM genomes subset - criteria:

* All samples need to fall in a similar time window to USVI (i.e. sampling date in 2016).
* The full sampling date (year, month, and day) must be known.
* Want to downsample USA (outbreak density is going to confuse BD rates and multi-type designation).
* Remove the highly sequenced Colombian Barranquilla outbreak.
* USVI samples show no interdigitation with samples from Honduras or Mexico.

####  POSSIBLE DATASETS
* What about just a USVI-Brazil dataset?
* USVI complete, but everything else subsampled down?
* USVI + Caribbean + Brazil + Venezuela + Suriname + Puerto Rico + Subsampled USA + Dominican Republic (close interdigitating countries).
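
The selection criteria above can be sketched as a two-step procedure: bin 2016 headers with fully known dates by country, then exclude or downsample chosen countries. Everything here is illustrative: the headers are toy examples in the `strain|date|region|country` layout, the function names are hypothetical, and the fixed seed is an assumption added so the draw is repeatable.

```python
import random
from collections import defaultdict

def bin_2016_by_country(headers):
    """Group headers by country, keeping only fully dated 2016 samples."""
    bins = defaultdict(list)
    for header in headers:
        strain, date, region, country = header.split("|")
        if date.startswith("2016") and "X" not in date.upper():
            bins[country].append(header)
    return bins

def subsample(bins, exclude, downsample, n, seed=0):
    """Drop excluded countries and cap downsampled ones at n headers."""
    rng = random.Random(seed)  # seeded so the same subset is drawn each run
    chosen = {}
    for country, headers in bins.items():
        if country in exclude:
            continue
        if country in downsample and len(headers) > n:
            chosen[country] = rng.sample(headers, n)
        else:
            chosen[country] = list(headers)
    return chosen

toy_headers = [
    "A/1|2016-03-01|north_america|usa", "A/2|2016-04-01|north_america|usa",
    "A/3|2016-05-01|north_america|usa", "A/4|2016-XX-XX|north_america|usa",
    "B/1|2016-06-01|north_america|honduras",
    "C/1|2016-07-01|north_america|usvi", "C/2|2016-08-03|north_america|usvi",
    "D/1|2015-11-30|south_america|brazil",
]
chosen = subsample(bin_2016_by_country(toy_headers),
                   exclude={"honduras"}, downsample={"usa"}, n=2)
print({country: len(headers) for country, headers in chosen.items()})
```

The `len(headers) > n` guard keeps every sequence from a country that has fewer than `n` eligible samples, instead of raising an error.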



In [300]:
def is_sampling_date_known(taxon_name, date_index_position, delimiter = '|'):
    """ returns bool True if year, month, and day are known for a sample given the taxon name, else False """
    date_string = taxon_name.split(delimiter)[date_index_position].upper()
    return 'X' not in date_string

In [312]:
print(country_binned_dict2016['martinique'])

{'Martinique/2016/FL001|2016-03-22|north_america|martinique': Seq('NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACAGTATCAACA...NNN', SingleLetterAlphabet()), 'Martinique/2016/FL001Sa|2016-03-22|north_america|martinique': Seq('NNNNNNNNNNNNGACAGTTCGAGTTTGAAGCGAAAGCTAGCAACAGTATCAACA...NNN', SingleLetterAlphabet())}


In [316]:
#print country_seq_dict['brazil'].items()
import random
countries_to_exclude = ['honduras', 'mexico', 'el_salvador', 'nicaragua', 'guatemala' ,'panama']
countries_to_subsample = ['usa', 'dominican_republic', 'brazil']

bdmm_alignment = {}

for key in country_binned_dict2016.keys():
    if key in countries_to_exclude:
        continue
    elif key in countries_to_subsample:
        sample = random.sample(list(country_binned_dict2016[key].items()), 10) #list() because random.sample needs a sequence; produces a list of (name,sequence) tuples that were randomly sampled
        bdmm_alignment[key] = sample
    else:
        bdmm_alignment[key] = [(strain,country_binned_dict2016[key][strain]) for strain in country_binned_dict2016[key].keys()]

print(len(bdmm_alignment))

14


In [317]:
from datetime import datetime as dt

def decimalDate(date,date_format="%Y-%m-%d"):
    """ Converts calendar dates in specified format to decimal date. """
    sampling_date = dt.strptime(date,date_format)
    year =  sampling_date.year
    beginning_of_year = dt(year,1,1) #year, month, day, so year,1,1 == beginning of year
    end_of_year = dt(year+1,1,1) #next year 1,1 is a full year away from beginning of sampling year
    #return fraction of the full year (in seconds) that occurs between jan 1st and your sampling date
    return year + ((sampling_date - beginning_of_year).total_seconds() / ((end_of_year - beginning_of_year).total_seconds()))

with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/bdmm_alignment_04052018.fasta','w') as file:
    for key in bdmm_alignment.keys():
        if key == 'usvi':
            deme = 'usvi'
        else:
            deme = 'nonusvi'
        for i in range(len(bdmm_alignment[key])): #have a list of tuples (taxon,sequence)
            header = bdmm_alignment[key][i][0] #first element in the tuple is the taxon name
            decimal_date = decimalDate(header.split('|')[1]) #not named `date`, to avoid overwriting the global used in file names
            assert len(bdmm_alignment[key][i][1]) == 10769
            file.write(str('>'+ header.split('|')[0] +'|'+ str(decimal_date) + '|'+ header.split('|')[3] + '|' +deme+ '\n' + str(bdmm_alignment[key][i][1]) + '\n'))
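
As a worked example of the decimal-date conversion, here is a standalone sketch of the same arithmetic using 2016-08-03, the sampling date printed earlier for USVI/45/2016. Because 2016 is a leap year, 215 of its 366 days have elapsed at the start of 3 August. The name `decimal_date` is used only in this example.

```python
from datetime import datetime

def decimal_date(date_string, date_format="%Y-%m-%d"):
    """Year plus the fraction of that year elapsed at the start of the given day."""
    sampled = datetime.strptime(date_string, date_format)
    start = datetime(sampled.year, 1, 1)
    end = datetime(sampled.year + 1, 1, 1)
    elapsed = (sampled - start).total_seconds()
    return sampled.year + elapsed / (end - start).total_seconds()

print(decimal_date("2016-08-03"))  # equals 2016 + 215/366
print(decimal_date("2016-01-01"))  # start of the year, so no fractional part
```

Dividing by the actual length of the year, rather than a fixed 365, keeps leap years consistent.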
    

In [199]:
#write output dict to fasta

with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/beast-americasZika-04012018.fasta','w') as out_file:
    for key in output_dict.keys():
        
        if key in non_fauna_seqs:
            print key
            out_file.write(str('>' + key + '\n' + output_dict[key] + '\n'))
        else:
            split_name = key.split('|')
            header = split_name[0] +'|'+ split_name[3] + '|'+ split_name[4] + '|'+ split_name[5]
            out_file.write(str(header + '\n' + output_dict[key] + '\n'))
        

USVI/35/2016
USVI/15/2016
USVI/45/2016
USVI/46/2016


In [103]:
#as inferred with subtree parsing in Baltic
usvi_primary_clade = ['USVI/3/2016', 'USVI/5/2016', 'USVI/34/2016', 'USVI/13/2016', 'USVI/38/2016', 'USVI/43/2016', 
                      'USVI/6/2016', 'USVI/21/2016', 'USVI/25/2016', 'USVI/27/2016', 'USVI/35/2016', 'USVI/2/2016',
                      'USVI/1/2016', 'USVI/37/2016', 'USVI/12/2016', 'USVI/39/2016', 'USVI/30/2016', 'USVI/32/2016', 
                      'USVI/23/2016']

with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/usvi_primary_clade.fasta','w') as out_file:
    for key in output_dict.keys():
        if key.split('|')[0].replace('>','') in usvi_primary_clade:
            split_name = key.split('|')
            header = split_name[0] +'|'+ split_name[3] + '|'+ split_name[4] + '|'+ split_name[5]
            out_file.write(str(header + '\n' + output_dict[key] + '\n'))
            
usvi_secondary_clade = ['USVI/41/2016', 'USVI/19/2016', 'USVI/20/2016', 'USVI/42/2016', 'USVI/28/2016']

with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/usvi_secondary_clade.fasta','w') as out_file:
    for key in output_dict.keys():
        if key.split('|')[0].replace('>','') in usvi_secondary_clade:
            split_name = key.split('|')
            header = split_name[0] +'|'+ split_name[3] + '|'+ split_name[4] + '|'+ split_name[5]
            out_file.write(str(header + '\n' + output_dict[key] + '\n'))

NameError: name 'output_dict' is not defined
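
A hand-typed clade list is easy to get subtly wrong, and a strain name that matches no header is skipped silently by loops like the ones above. The sketch below shows one possible way to surface that. `split_by_membership` is a hypothetical helper, not part of this notebook, and the example headers use invented accessions and dates in the same pipe-delimited layout.

```python
def split_by_membership(headers, clade_strains):
    """Return (matching_headers, missing_strains) for a clade list.

    matching_headers: headers whose strain name is in clade_strains.
    missing_strains:  listed strains that matched no header at all.
    """
    wanted = set(clade_strains)
    found = set()
    matching = []
    for header in headers:
        strain = header.split('|')[0].lstrip('>')  # strain name is the first field
        if strain in wanted:
            matching.append(header)
            found.add(strain)
    return matching, sorted(wanted - found)


# Invented headers for illustration only.
headers = [
    '>USVI/41/2016|zika|VI41|2016-09-01|north_america|usvi',
    '>USVI/19/2016|zika|VI19|2016-09-02|north_america|usvi',
    '>BRA/2016/FC_DQ62D1|zika|X|2016-03-30|south_america|brazil',
]
matching, missing = split_by_membership(
    headers, ['USVI/41/2016', 'USVI/19/2016', 'USVI/99/2016'])
print(matching)
print(missing)  # strains to double-check for typos or absent sequences
```

Printing or asserting on `missing` before writing a clade fasta gives an early warning when a listed name is misspelled or the sequence was filtered out upstream.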

In [28]:
from datetime import datetime as dt

def decimalDate(date,date_format="%Y-%m-%d"):
    """ Converts calendar dates in specified format to decimal date. """
    sampling_date = dt.strptime(date,date_format)
    year = sampling_date.year
    beginning_of_year = dt(year,1,1) #January 1st of the sampling year
    end_of_year = dt(year+1,1,1) #January 1st of the following year, exactly one full year later
    #return the year plus the fraction of that year (measured in seconds) elapsed between January 1st and the sampling date
    return year + ((sampling_date - beginning_of_year).total_seconds() / ((end_of_year - beginning_of_year).total_seconds()))

with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/usvi_primary_clade_decimalDate.fasta','w') as out_file:
    for key in output_dict.keys():
        if key.split('|')[0].replace('>','') in usvi_primary_clade:
            split_name = key.split('|')
            date = decimalDate(split_name[3])
            header = split_name[0] +'|'+ str(date) + '|'+ split_name[4] + '|'+ split_name[5]
            out_file.write(str(header + '\n' + output_dict[key] + '\n'))

with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/beast-americasZika-outgrouped_decimalDate.fasta','w') as out_file:
    for key in output_dict.keys():
        split_name = key.split('|')
        calendarDate = split_name[3]
        if 'XX' in calendarDate.split('-'): #month or day is unknown, so fall back to the bare year
            date = float(calendarDate[:4])
        else:
            date = decimalDate(calendarDate)
        header = split_name[0] +'|'+ str(date) + '|'+ split_name[4] + '|'+ split_name[5]
        print(header)
        out_file.write(str(header + '\n' + output_dict[key] + '\n'))

>BRA/2016/FC_DQ107D1|2016.28142077|south_america|brazil
>BRA/2016/FC_DQ62D1|2016.2431694|south_america|brazil
>Colombia/2016/ZC204Se|2016.0136612|south_america|colombia
>Z16019|2016.15300546|south_america|venezuela
>BRA/2016/FC_DQ5D1|2016.19945355|south_america|brazil
>Brazil/2016/ZBRC319|2016.52185792|south_america|brazil
>GZ02/2016|2016.15027322|south_america|venezuela
>BRA/2016/FC_DQ28D1|2016.21857923|south_america|brazil
>USVI/36/2016|2016.69945355|north_america|usvi
>USA/2016/FLUR008|2016.59016393|north_america|usa
>Nica2_16|2016.03278689|north_america|nicaragua
>USA/2016/FLUR022|2016.66393443|north_america|usa
>PAN/CDC_259359_V1_V3/2015|2015.96164384|north_america|panama
>GZ01|2016.12021858|south_america|venezuela
>USVI/38/2016|2016.81420765|north_america|usvi
>HND/2016/HU_ME167|2016.43715847|north_america|honduras
>DOM/2016/BB_0180|2016.29508197|north_america|dominican_republic
>FHT1175/HON/2016|2016.65027322|north_america|honduras
>Brazil/2016/ZBRX6|2016.29781421|south_america|

## Make fasta files that represent more equitably-subsampled demes

The number of Zika genomes sampled varies widely between countries. To check that this sampling bias is not driving the phylogeographic dynamics we infer, I subsample the alignment so that sampling is more even across countries, aiming for at most 10 genomes per country.
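
The idea can be sketched as a small standalone function. This is an illustration under stated assumptions rather than the exact code used in the cells that follow: `subsample_by_country` and its parameters are hypothetical names, the country is assumed to be the sixth `|`-delimited header field, and the toy records are invented. It is written for Python 3.

```python
import random
from collections import defaultdict


def subsample_by_country(seqs, max_per_country=10, seed=None):
    """Return {country: [(header, sequence), ...]} with at most
    max_per_country randomly chosen entries per country."""
    rng = random.Random(seed)  # local generator: a fixed seed makes the draw repeatable
    by_country = defaultdict(list)
    for header, sequence in seqs.items():
        country = header.split('|')[5]  # assumes country is the sixth field
        by_country[country].append((header, sequence))

    subsampled = {}
    for country, records in by_country.items():
        if len(records) > max_per_country:
            subsampled[country] = rng.sample(records, max_per_country)
        else:
            subsampled[country] = list(records)  # small demes are kept whole
    return subsampled


# Toy input: 25 invented Brazilian records and 3 invented Jamaican ones.
toy = {}
for i in range(25):
    toy['>BRA/{}|zika|acc|2016-01-01|south_america|brazil'.format(i)] = 'ACGT'
for i in range(3):
    toy['>JAM/{}|zika|acc|2016-01-01|north_america|jamaica'.format(i)] = 'ACGT'

result = subsample_by_country(toy, max_per_country=10, seed=1)
print({country: len(records) for country, records in result.items()})
```

Passing a seed is optional, but it means the same subsample can be regenerated later, which is useful if a downstream analysis needs to be rerun on identical input.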

In [14]:
from collections import defaultdict

nested_dict = defaultdict(dict) #nested dict e.g. {brazil: {seq1:ACCT, seq2:ACGGGTT}}
for key in output_dict.keys():
    split_name = key.split('|')
    nested_dict[split_name[5]].update({key:output_dict[key]})

country_seq_dict = dict(nested_dict) #avoid continually adding to default dict accidentally.

In [16]:
#print country_seq_dict['brazil'].items()
import random

samp = random.sample(country_seq_dict['brazil'].items(), 10) #produces a list of tuples of (name,sequence) that were randomly sampled
#print samp[0][1]
#print 'There are {} countries have at least one genome.'.format(len(subsampled_demes.keys()))
#print subsampled_demes['french_polynesia']
#print subsampled_demes['honduras'][0]


In [53]:

print new_glm_set_subsampled.keys()
print new_glm_set_all_samples.keys()
print country_seq_dict.keys()

['brazil', 'jamaica', 'honduras', 'dominican_republic', 'usa', 'mexico', 'venezuela', 'panama', 'puerto_rico', 'guatemala', 'colombia', 'martinique', 'haiti', 'usvi', 'nicaragua']
['brazil', 'jamaica', 'honduras', 'dominican_republic', 'usa', 'mexico', 'venezuela', 'panama', 'puerto_rico', 'guatemala', 'colombia', 'martinique', 'haiti', 'usvi', 'nicaragua']
['jamaica', 'honduras', 'dominican_republic', 'mexico', 'french_guiana', 'guatemala', 'colombia', 'martinique', 'haiti', 'nicaragua', 'brazil', 'suriname', 'usa', 'panama', 'cuba', 'el_salvador', 'puerto_rico', 'guadeloupe', 'peru', 'french_polynesia', 'usvi', 'venezuela', 'ecuador']


In [61]:
new_glm_set_all_samples = {}
new_glm_set_subsampled = {}
new_glm_set_subsampled_except_usvi= {}

for country in country_seq_dict.keys():
    if len(country_seq_dict[country]) > 2: #to be included in analysis a country needs at least 3 genomes
        new_glm_set_all_samples[country] = country_seq_dict[country]
        all_seqs = list(country_seq_dict[country].items()) #(name, sequence) tuples for this country only

        if country == 'usvi':
            #keep every USVI genome in the "except usvi" set; this cell does not add usvi to new_glm_set_subsampled
            new_glm_set_subsampled_except_usvi[country] = all_seqs
        elif len(all_seqs) > 10:
            sample = random.sample(all_seqs, 10)
            new_glm_set_subsampled[country] = sample
            new_glm_set_subsampled_except_usvi[country] = sample
        else:
            new_glm_set_subsampled[country] = all_seqs
            new_glm_set_subsampled_except_usvi[country] = list(all_seqs)


In [54]:
print len(new_glm_set_all_samples['usvi'])
print len(new_glm_set_subsampled['usvi'])
print new_glm_set_all_samples['usvi']
print new_glm_set_subsampled['usvi']

31
10
{'>USVI/43/2016|zika|VI43|2016-07-19|north_america|usvi|saint_thomas|saint_thomas|fh|genome|Black et al|https://github.com/blab/zika-usvi/tree/master/data|Genetic characterization of the Zika virus epidemic in the US Virgin Islands|?|?': Seq('NNNNNNNNNNNNGACAGTTCGAGTTTGAAGCGAAAGCTAGCAACAGTATCAACA...NNN', SingleLetterAlphabet()), '>USVI/7/2016|zika|VI7|2016-10-27|north_america|usvi|saint_thomas|saint_thomas|fh|genome|Black et al|https://github.com/blab/zika-usvi/tree/master/data|Genetic characterization of the Zika virus epidemic in the US Virgin Islands|?|?': Seq('NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAGCTAGCAACAGTATCAACA...NNN', SingleLetterAlphabet()), '>USVI/28/2016|zika|VI28|2016-11-28|north_america|usvi|saint_croix|saint_croix|fh|genome|Black et al|https://github.com/blab/zika-usvi/|Genetic characterization of the Zika virus epidemic in the US Virgin Islands|?|?': Seq('NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTAGCAACAGTATCAACA...NNN', SingleLetterAlphabet()), '>USVI/12/2016|zika|VI1

0
>BRA/2016/FC_DQ62D2|zika|KY785455|2016-04-06|south_america|brazil|brazil|brazil|genbank|genome|Metsky et al|https://www.ncbi.nlm.nih.gov/nuccore/KY785455|Zika virus evolution and spread in the Americas|Nature 546 (7658), 411-415 (2017)|https://www.ncbi.nlm.nih.gov/pubmed/28538734
1
>BRA/2016/FC_DQ28D1|zika|KY014317|2016-03-21|south_america|brazil|brazil|brazil|genbank|genome|Metsky et al|https://www.ncbi.nlm.nih.gov/nuccore/KY014317|Zika virus evolution and spread in the Americas|Nature 546 (7658), 411-415 (2017)|https://www.ncbi.nlm.nih.gov/pubmed/28538734
2
>Brazil/2016/ZBRX102|zika|KY559018|2016-02-25|south_america|brazil|brazil|brazil|genbank|genome|Faria et al|https://www.ncbi.nlm.nih.gov/nuccore/KY559018|Epidemic establishment and cryptic transmission of Zika virus in Brazil and the Americas|Unpublished|https://www.ncbi.nlm.nih.gov/pubmed/
3
>Brazil/2016/ZBRX127|zika|KY559021|2016-03-10|south_america|brazil|brazil|brazil|genbank|genome|Faria et al|https://www.ncbi.nlm.nih.gov/n

In [59]:
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/min3_max10_seqs_augurFormat.fasta','w') as file:
    for country in new_glm_set_subsampled.keys():
        for header, sequence in new_glm_set_subsampled[country]: #each entry is a (taxon name, sequence) tuple
            file.write(header + '\n' + str(sequence) + '\n')

In [None]:
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/min3_max10_seqs_augurFormat.fasta','w') as file:
    for country in new_glm_set_subsampled.keys():
        for header, sequence in new_glm_set_subsampled[country]: #each entry is a (taxon name, sequence) tuple
            file.write(header + '\n' + str(sequence) + '\n')

# with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/subsampled_seqs_max10.fasta','w') as file:
#     for country in subsampled_demes.keys():
#     #print len(subsampled_demes[country])
#         for i in range(len(subsampled_demes[country])): #have a list of tuples (taxon,sequence)
#             #print subsampled_demes[country][i][0]
#             split_name = subsampled_demes[country][i][0].split('|') #first element in the tuple ins the taxon name
#             #print split_name
#             header = split_name[0] +'|'+ split_name[3] + '|'+ split_name[4] + '|'+ split_name[5]
#             file.write(str(header + '\n' + subsampled_demes[country][i][1] + '\n'))

In [118]:
subsampled_demes = {}
for country in country_seq_dict.keys():
    if len(country_seq_dict[country]) > 10:
        subsampled_demes[country] = random.sample(country_seq_dict[country].items(), 10)  
    else:
        subsampled_demes[country] = [(key,country_seq_dict[country][key]) for key in country_seq_dict[country].keys()]
        
print subsampled_demes['usvi'][0][0]
print subsampled_demes.keys()
print len(subsampled_demes.keys())

#for key in subsampled_demes.keys():
    #print len(subsampled_demes[key])

>USVI/30/2016|zika|VI30|2016-08-07|north_america|usvi|saint_croix|saint_croix|fh|genome|Black et al|https://github.com/blab/zika-usvi/|Genetic characterization of the Zika virus epidemic in the US Virgin Islands|?|?
['jamaica', 'honduras', 'mexico', 'french_guiana', 'colombia', 'nicaragua', 'brazil', 'peru', 'usa', 'panama', 'suriname', 'french_polynesia', 'venezuela', 'ecuador', 'dominican_republic', 'cuba', 'guatemala', 'martinique', 'haiti', 'guadeloupe', 'el_salvador', 'puerto_rico', 'usvi']
23


In [119]:
#with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/subsampled_seqs_augurFormat.fasta','w') as file:
for country in subsampled_demes.keys():
    print country
    for i in range(len(subsampled_demes[country])):
        print i
        taxon_name = subsampled_demes[country][i][0]
        print taxon_name
        sequence = subsampled_demes[country][i][1]
        #print sequence
            #file.write(str(taxon_name) + '\n' + str(sequence) + '\n')

jamaica
0
>JAM/2016/MA_WGS16_039|zika|KY785430|2016-09-28|north_america|jamaica|jamaica|jamaica|genbank|genome|Metsky et al|https://www.ncbi.nlm.nih.gov/nuccore/KY785430|Zika virus evolution and spread in the Americas|Nature 546 (7658), 411-415 (2017)|https://www.ncbi.nlm.nih.gov/pubmed/28538734
1
>JAM/2016/MA_WGS16_025|zika|KY785424|2016-07-10|north_america|jamaica|jamaica|jamaica|genbank|genome|Metsky et al|https://www.ncbi.nlm.nih.gov/nuccore/KY785424|Zika virus evolution and spread in the Americas|Nature 546 (7658), 411-415 (2017)|https://www.ncbi.nlm.nih.gov/pubmed/28538734
2
>JAM/2016/MA_WGS16_041|zika|KY785432|2016-10-12|north_america|jamaica|jamaica|jamaica|genbank|genome|Metsky et al|https://www.ncbi.nlm.nih.gov/nuccore/KY785432|Zika virus evolution and spread in the Americas|Nature 546 (7658), 411-415 (2017)|https://www.ncbi.nlm.nih.gov/pubmed/28538734
honduras
0
>HND/2016/HU_ME136|zika|KY785461|2016-05-31|north_america|honduras|honduras|honduras|genbank|genome|Metsky et al|h

In [120]:
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/subsampled_seqs_augurFormat.fasta','w') as file:
    for country in subsampled_demes.keys():
        for header, sequence in subsampled_demes[country]: #each entry is a (taxon name, sequence) tuple
            file.write(header + '\n' + str(sequence) + '\n')

with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/subsampled_seqs_max10.fasta','w') as file:
    for country in subsampled_demes.keys():
        for taxon_name, sequence in subsampled_demes[country]: #each entry is a (taxon name, sequence) tuple
            split_name = taxon_name.split('|')
            header = split_name[0] +'|'+ split_name[3] + '|'+ split_name[4] + '|'+ split_name[5] #strain|date|region|country
            file.write(header + '\n' + str(sequence) + '\n')
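
The two files written above differ only in their headers: the augur-format file keeps the full fauna header, while the other keeps just strain, date, region and country. A minimal sketch of that header reduction as a reusable function follows. `beast_header` and `fasta_record` are hypothetical names introduced for illustration, and the field positions (0, 3, 4, 5) are taken from the cells above.

```python
def beast_header(augur_header):
    """Reduce a full pipe-delimited header to >strain|date|region|country."""
    fields = augur_header.lstrip('>').split('|')
    if len(fields) < 6:
        raise ValueError('expected at least 6 pipe-delimited fields: ' + augur_header)
    return '>' + '|'.join([fields[0], fields[3], fields[4], fields[5]])


def fasta_record(header, sequence):
    """Format one FASTA record as text, ending with a newline."""
    return '{}\n{}\n'.format(header, str(sequence))


# Header truncated to its first ten fields; the sequence is a placeholder.
full = '>USVI/30/2016|zika|VI30|2016-08-07|north_america|usvi|saint_croix|saint_croix|fh|genome'
print(fasta_record(beast_header(full), 'ACGT'))
```

Stripping and re-adding the leading `>` in one place avoids both missing and doubled `>` characters, and raising on short headers makes records that lack the expected fields fail loudly instead of producing a malformed file.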

----------

Aside:
Below is the code I initially used to look through accessions and publications and determine which genomes were freely available to include in the analysis.

In [None]:
# want dict where {publication: [accession, accession]}
publication_accessions_dict = {}
for pub in list_of_pubs:
    accessions = []
    for key in fauna_dict.keys():
        taxa = fauna_dict[key].description
        if taxa.split('|')[12] == pub:
            accessions.append(taxa.split('|')[2])
    publication_accessions_dict[pub] = accessions

In [None]:
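pub_seq_count = 0
for pub in published_submissions:
    print(len(publication_accessions_dict[pub]))
    pub_seq_count += len(publication_accessions_dict[pub]) #running total of sequences from published submissions
print(pub_seq_count)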

In [None]:
from collections import defaultdict
published_seqs = defaultdict(int)

list_of_pubs = [strain_header_dict[key].split('|')[12] for key in zika_msa_dict.keys()]

for pub in list_of_pubs:
    published_seqs[pub] += 1
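
The grouping and counting in these cells can also be done in a single pass over the headers. The sketch below is one way to do that, assuming (as the cells above do) that field 12 of the pipe-delimited description is the publication title and field 2 is the accession. `accessions_by_publication` and `make_description` are hypothetical helpers; the strain names, accessions and titles in the example are taken from headers printed earlier in this notebook, with the remaining fields replaced by placeholders.

```python
from collections import defaultdict


def accessions_by_publication(descriptions):
    """Map publication title -> list of accessions, in one pass."""
    grouped = defaultdict(list)
    for description in descriptions:
        fields = description.split('|')
        grouped[fields[12]].append(fields[2])  # field 12: title, field 2: accession
    return dict(grouped)  # plain dict, so later lookups cannot add empty keys


def make_description(strain, accession, title):
    """Build a toy 13-field header; the nine middle fields are placeholders."""
    return '|'.join([strain, 'zika', accession] + ['?'] * 9 + [title])


metsky = 'Zika virus evolution and spread in the Americas'
faria = ('Epidemic establishment and cryptic transmission of Zika virus '
         'in Brazil and the Americas')
descriptions = [
    make_description('BRA/2016/FC_DQ62D2', 'KY785455', metsky),
    make_description('BRA/2016/FC_DQ28D1', 'KY014317', metsky),
    make_description('Brazil/2016/ZBRX102', 'KY559018', faria),
]

by_pub = accessions_by_publication(descriptions)
for title, accessions in by_pub.items():
    print(len(accessions), title)
```

The per-publication counts then fall out as `len()` of each list, so a separate counting dictionary is not needed.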


In [None]:
published_submissions= ['First Complete Genome Sequence of Zika Virus (Flaviviridae, Flavivirus) from an Autochthonous Transmission in Brazil',
'Genetic characterization of the Zika virus epidemic in the US Virgin Islands',
'Isolation of Infective Zika Virus from Urine and Saliva of Patients in Brazil',
'Full-length infectious cDNA clone of Zika virus from 2015 epidemic in Brazil: development and characterization of recombinant viruses in cell lines from human placenta, testis, and brain',
'Zika virus complete genome from Salvador, Bahia, Brazil',
'Complete Genome Sequences of Three Historically Important, Spatiotemporally Distinct, and Genetically Divergent Strains of Zika Virus: MR-766, P6-740, and PRVABC-59',
'Isolation of infectious Zika virus from saliva and prolonged viral RNA shedding in a traveller returning from the Dominican Republic to Italy, January 2016',
'Fatal Zika Virus Infection with Secondary Nonsexual Transmission',
'Quasispecies composition and evolution of a typical Zika virus clinical isolate from Suriname',
'First Complete Genome Sequences of Zika Virus Isolated from Febrile Patient Sera in Ecuador',
'Epidemic establishment and cryptic transmission of Zika virus in Brazil and the Americas',
'Genomic epidemiology reveals multiple introductions of Zika virus into the United States',
'Coinfection With Zika and Dengue-2 Viruses in a Traveler Returning From Haiti, 2016: Clinical Presentation and Genetic Analysis',
'Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples',
'Large returning population of oversea Chinese of Guangdong province combined with tourists increase the risks of ZIKV transmission in Guangdong, China',
'Zika Virus Targets Different Primary Human Placental Cells, Suggesting Two Routes for Vertical Transmission',
'Establishment and cryptic transmission of Zika virus in Brazil and the Americas',
'Zika Virus Outbreak in Haiti in 2014: Molecular and Clinical Data',
'Prolonged Detection of Zika Virus in Vaginal Secretions and Whole Blood',
'Complete Genome Sequences of Identical Zika virus Isolates in a Nursing Mother and Her Infant',
'?',
'Zika virus in the Americas: early epidemiological and genetic findings',
'Full Genome Sequence and sfRNA Interferon Antagonist Activity of Zika Virus from Recife, Brazil',
'Distinct Zika Virus Lineage in Salvador, Bahia, Brazil',
'Genome sequence of a candidate World Health Organization reference strain for Zika virus for nucleic acid testing',
'Zika virus evolution and spread in the Americas',
'Zika virus genome from the Americas',
'Complete coding sequence of Zika virus from Martinique outbreak in 2015',
'Zika Virus Associated with Microcephaly',
'Characterization of a Zika Virus Isolate from Colombia',
'Detection and sequencing of Zika virus from amniotic fluid of fetuses with microcephaly in Brazil: a case study',
'Phylogeny of Zika Virus in Western Hemisphere, 2015',
'Complete genomic sequence of Zika virus isolated from a clinical semen sample',
'Infection dynamics in a traveller with persistent shedding of Zika virus RNA in semen for six months after returning from Haiti to Italy, January 2016',                        
                  ]

published_submissions_with_incorrect_publication_name = [ 'Complete Zika Virus Genome Sequences from a Serum Sample and after Isolation in Vero Cells',
'Full coding sequence of the Zika Virus isolate HS-2015-BA-01',
'Large returning population of oversea Chinese of Guangdong province combined with tourists increase the risks of ZIKV transmission in Guangdong, China',
'The isolation and characterization of Zika virus imported into Guangdong, China',
'Full genomic characterization of a Zika virus isolate from Colombia',
'Molecular characterization of Zika virus in Mexico',
'Repurposed drug candidates to treat ZIKV infection in pregnancy',
'Full-genome amplification and sequencing of Zika viruses using a targeted amplification approach',
'Zika virus infection with prolonged maternal viremia and fetal brain abnormalities',
'Complete genome of Zika virus GZ01 isolate from China',
'Complete genomic sequence of Zika virus isolated from a traveler to Colombia in 2016',
                                                       
                                                         
]

unpublished_submissions = [
'Imported cases of Zika virus disease in Russia in 2016-2017',
'Complete Genome of a Zika Virus Isolate, Amazonas, Brazil, 2016',
'Imported cases of Zika virus disease in Russia in 2016-2017',
'Zika viral sequence identified in Fortaleza, Brazil',
'Comparative analysis between the genomes of intra-host and cell cultured Zika virus obtained from a Mexican symptomatic patient',
'Zika virus from breast milk',
'Genome sequences of Zika viruses isolated in Nicaragua',              
'Emergence of Zika virus in Brazil: Detection of Asian genotype',
'Brazilian Zika virus genome sequence',
'Zika viruses in Haiti during an outbreak of Chikungunya Fever in mid 2014', #white et al sequences
'Zika virus sequence from Cuba traveler diagnosed in the United States (Miami, FL)', #nate
'Zika virus genomes from human cases in Florida, USA',
'Isolation of infectious Zika virus from sera in a traveller returning from Brazil to Italy',
'Zika virus infection in pregnant mice',
'Zika virus polyprotein sequence isolated from a fetus-brain of Cuba pregnant traveler diagnosed in Spain (Madrid)',
'Genome sequences of four Zika virus isolates from Brazil',
'Direct Submission'
                          ]

unpublished_seqs_to_pursue = [
                              'Genome sequences of Zika viruses isolated in Nicaragua',
                              'Imported cases of Zika virus disease in Russia in 2016-2017', #dominican republic and mexico
                              'Zika viruses in Haiti during an outbreak of Chikungunya Fever in mid 2014',
                              'Zika virus polyprotein sequence isolated from a fetus-brain of Cuba pregnant traveler diagnosed in Spain (Madrid)', #cuba
                              'Genome sequences of four Zika virus isolates from Brazil' #brazil early...published on, but not by the submitting authors... 
]

In [None]:
published_fauna_seqs = {key:value for key,value in fauna_dict.items() if fauna_dict[key].description.split('|')[12] in published_submissions}

In [None]:
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/published-sequences-in-analysis.tsv','w') as file:
    file.write('{}\t{}\t{}\n'.format('strain_name', 'accession_number', 'lead_author'))
    for key in published_fauna_seqs.keys():
        file.write('{}\t{}\t{}\n'.format(key.split('|')[0], key.split('|')[2], key.split('|')[-1]))
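
For tab-separated output, the standard-library `csv` module (already imported at the top of this notebook) is an alternative to formatting each row by hand. The sketch below is an illustration only: `write_strain_table` is a hypothetical helper, and it reads the author from field 10 of the full description, which is an assumption based on the header layout printed earlier; the cell above instead takes the last field of the dictionary key. It writes to an in-memory buffer so it can be run without touching any files.

```python
import csv
import io


def write_strain_table(descriptions, handle):
    """Write strain name, accession and authors as tab-separated rows."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(['strain_name', 'accession_number', 'lead_author'])
    for description in descriptions:
        fields = description.lstrip('>').split('|')
        # field 0: strain, field 2: accession, field 10: authors (assumed layout)
        writer.writerow([fields[0], fields[2], fields[10]])


example = ('>BRA/2016/FC_DQ62D2|zika|KY785455|2016-04-06|south_america|brazil|'
           'brazil|brazil|genbank|genome|Metsky et al')
buffer = io.StringIO()
write_strain_table([example], buffer)
print(buffer.getvalue())
```

To write a real file, an open file handle can be passed in place of the buffer.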

In [1]:
print len(published_fauna_seqs)

reference_key = 'PF13/251013_18|zika|KX369547|2013-10-25|oceania|french_polynesia|french_polynesia|french_polynesia|genbank|genome|Troesemeier'

with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/augur-published-seqs-only.fasta','w') as file:
    for key in published_fauna_seqs.keys():
        if published_fauna_seqs[key].description.split('|')[4] in regions_to_exclude1: #don't write out anything except American seqs
            continue
        file.write('>' + published_fauna_seqs[key].description + '\n' + str(published_fauna_seqs[key].seq) + '\n')
    #always write the French Polynesian reference strain, using its own sequence rather than the last one from the loop
    reference = published_fauna_seqs[reference_key]
    file.write('>' + reference.description + '\n' + str(reference.seq) + '\n')

NameError: name 'published_fauna_seqs' is not defined
