## README: Making input fasta files

This script wrangles multiple sequence alignments and fasta files from `Nextstrain/fauna` to build fasta files containing the sequences we want to include in our analysis of Zika in the Americas. It writes several fasta files, including ones that can be read into BEAST and ones that can be read into `Nextstrain/augur`.

In [1]:
#### import libraries ####
from Bio import SeqIO
from Bio import AlignIO
import datetime
import pandas as pd
date = datetime.datetime.now().strftime("%Y-%m-%d") # date stamp used in the outfile names


In [55]:
#### infile paths #### 
zika_msa_stripped = "/Users/alliblk/Desktop/gitrepos/augur-alliblk/zika/processed/zika_aligned_stripped.mfa"
fauna_file = "/Users/alliblk/Desktop/gitrepos/fauna/data/zika.fasta"

#### outfile paths #### 
americas_file = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/american-zika-{}.fasta'.format(date)
americas_frenchpol_file = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/american-frenchPolyn-zika-{}.fasta'.format(date)

usvi_file = "/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/usvi-{}.fasta".format(date)
usvi_primary_clade_file = '/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/usvi-primary-clade-{}.fasta'.format(date)


Step 1: Load all available sequences from Fauna. This should be all Zika genomes that have been made publicly available on GenBank, ViPR, and GitHub. Note that public availability does NOT mean that all of the genomes have appeared in a publication.

Note that you will need to clone the `nextstrain/fauna` git repo and run fauna to generate this file.

In [16]:
fauna_file = "/Users/alliblk/Desktop/gitrepos/fauna/data/zika.fasta"
fauna_dict = SeqIO.to_dict(SeqIO.parse(fauna_file, 'fasta'))
print '{} sequences available from Fauna'.format(len(fauna_dict))

542 sequences available from Fauna


Step 2: Subset the fauna sequences to the geographic regions of interest. Here, I want only Zika sequences from the Americas. However, I need an outgroup sequence to root the tree properly, so I'm also including a single genome from French Polynesia.

In [47]:
regions_to_exclude = ['southeast_asia', 'oceania', 'japan_korea', 'china', 'europe']
# Outgroup: the WHO reference strain for Zika. The msa will also be stripped to this seq.
reference_strain_name = 'PF13/251013_18'

# Keep a record if its region (field 4 of the header) is not excluded, or if it is the outgroup.
geoPruned_fauna_dict = {
    fauna_dict[key].description: fauna_dict[key].seq
    for key in fauna_dict.keys()
    if key.split('|')[4] not in regions_to_exclude
    or key.split('|')[0] == reference_strain_name
}
print '{} sequences meet the geographic and outgroup criteria'.format(len(geoPruned_fauna_dict))

374 sequences meet the geographic and outgroup criteria
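
The filter above depends on the position of fields in the pipe-delimited fauna header. As a minimal, Biopython-free sketch of the same rule: the constant names `STRAIN_FIELD` and `REGION_FIELD` and the two `EXAMPLE/...` headers are made up for illustration, while the outgroup header is the one that appears later in this notebook.

```python
# Field positions follow the indices used in this notebook (0 = strain, 4 = region).
# The constant names are illustrative, not part of fauna.
STRAIN_FIELD = 0
REGION_FIELD = 4


def keep_header(header, excluded_regions, outgroup_strain):
    """Return True if the record is outside the excluded regions or is the outgroup."""
    fields = header.split('|')
    return (fields[REGION_FIELD] not in excluded_regions
            or fields[STRAIN_FIELD] == outgroup_strain)


excluded = {'southeast_asia', 'oceania', 'japan_korea', 'china', 'europe'}
headers = [
    # the outgroup header as written later in this notebook
    'PF13/251013_18|zika|KX369547|2013-10-25|oceania|french_polynesia|french_polynesia|french_polynesia|genbank|genome|Troesemeier',
    # hypothetical headers, invented for this sketch only
    'EXAMPLE/A/2016|zika|XX000001|2016-01-01|caribbean|example_country|example_country|example_country|genbank|genome|Doe',
    'EXAMPLE/B/2016|zika|XX000002|2016-01-01|oceania|example_country|example_country|example_country|genbank|genome|Doe',
]

kept = [h for h in headers if keep_header(h, excluded, 'PF13/251013_18')]
print(len(kept))
```

The outgroup is kept even though its region is excluded, which is the point of the `or` clause.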


Step 3: Filter the sequences so that we only include ones we are allowed to use in a publication. These are sequences that were previously released as part of a publication, or sequences whose authors have given us permission to use them in our analysis.

This information is collated in a table that I import below. Note that the table is NOT exhaustive for all genomes available from Fauna; it only includes sequences from the Americas. Sequence accessions have been manually checked and matched with their publications where those exist (with PubMed ID information). Sequences without a publication carry a flag that denotes whether author permission was received.

The table also has an indicator column labeled `preliminarily_include`. If a record says `no` in this column, it is because using it would have required requesting permission, but the sequence would likely not add much to the analysis, so permission was not pursued.

Below are lists of all the publications as they appear in the Fauna headers. These are the titles given in GenBank, which are _not necessarily the titles of the actual published manuscripts_. They are divided into publications that are published, ones that are not published, and ones that are in fact published under a manuscript title that differs from the GenBank title, and therefore appear to be unpublished in GenBank.

---------------

`published_submissions` = ['First Complete Genome Sequence of Zika Virus (Flaviviridae, Flavivirus) from an Autochthonous Transmission in Brazil',
'Genetic characterization of the Zika virus epidemic in the US Virgin Islands',
'Isolation of Infective Zika Virus from Urine and Saliva of Patients in Brazil',
'Full-length infectious cDNA clone of Zika virus from 2015 epidemic in Brazil: development and characterization of recombinant viruses in cell lines from human placenta, testis, and brain',
'Zika virus complete genome from Salvador, Bahia, Brazil',
'Complete Genome Sequences of Three Historically Important, Spatiotemporally Distinct, and Genetically Divergent Strains of Zika Virus: MR-766, P6-740, and PRVABC-59',
'Isolation of infectious Zika virus from saliva and prolonged viral RNA shedding in a traveller returning from the Dominican Republic to Italy, January 2016',
'Fatal Zika Virus Infection with Secondary Nonsexual Transmission',
'Quasispecies composition and evolution of a typical Zika virus clinical isolate from Suriname',
'First Complete Genome Sequences of Zika Virus Isolated from Febrile Patient Sera in Ecuador',
'Epidemic establishment and cryptic transmission of Zika virus in Brazil and the Americas',
'Genomic epidemiology reveals multiple introductions of Zika virus into the United States',
'Coinfection With Zika and Dengue-2 Viruses in a Traveler Returning From Haiti, 2016: Clinical Presentation and Genetic Analysis',
'Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples',
'Large returning population of oversea Chinese of Guangdong province combined with tourists increase the risks of ZIKV transmission in Guangdong, China',
'Zika Virus Targets Different Primary Human Placental Cells, Suggesting Two Routes for Vertical Transmission',
'Establishment and cryptic transmission of Zika virus in Brazil and the Americas',
'Zika Virus Outbreak in Haiti in 2014: Molecular and Clinical Data',
'Prolonged Detection of Zika Virus in Vaginal Secretions and Whole Blood',
'Complete Genome Sequences of Identical Zika virus Isolates in a Nursing Mother and Her Infant',
'?',
'Zika virus in the Americas: early epidemiological and genetic findings',
'Full Genome Sequence and sfRNA Interferon Antagonist Activity of Zika Virus from Recife, Brazil',
'Distinct Zika Virus Lineage in Salvador, Bahia, Brazil',
'Genome sequence of a candidate World Health Organization reference strain for Zika virus for nucleic acid testing',
'Zika virus evolution and spread in the Americas',
'Zika virus genome from the Americas',
'Complete coding sequence of Zika virus from Martinique outbreak in 2015',
'Zika Virus Associated with Microcephaly',
'Characterization of a Zika Virus Isolate from Colombia',
'Detection and sequencing of Zika virus from amniotic fluid of fetuses with microcephaly in Brazil: a case study',
'Phylogeny of Zika Virus in Western Hemisphere, 2015',
'Complete genomic sequence of Zika virus isolated from a clinical semen sample',
'Infection dynamics in a traveller with persistent shedding of Zika virus RNA in semen for six months after returning from Haiti to Italy, January 2016',                        
                  ]

`published_submissions_with_incorrect_publication_name` = [ 'Complete Zika Virus Genome Sequences from a Serum Sample and after Isolation in Vero Cells',
'Full coding sequence of the Zika Virus isolate HS-2015-BA-01',
'Large returning population of oversea Chinese of Guangdong province combined with tourists increase the risks of ZIKV transmission in Guangdong, China',
'The isolation and characterization of Zika virus imported into Guangdong, China',
'Full genomic characterization of a Zika virus isolate from Colombia',
'Molecular characterization of Zika virus in Mexico',
'Repurposed drug candidates to treat ZIKV infection in pregnancy',
'Full-genome amplification and sequencing of Zika viruses using a targeted amplification approach',
'Zika virus infection with prolonged maternal viremia and fetal brain abnormalities',
'Complete genome of Zika virus GZ01 isolate from China',
'Complete genomic sequence of Zika virus isolated from a traveler to Colombia in 2016',
                                                       
                                                         
]

`unpublished_submissions` = [
'Imported cases of Zika virus disease in Russia in 2016-2017',
'Complete Genome of a Zika Virus Isolate, Amazonas, Brazil, 2016',
'Imported cases of Zika virus disease in Russia in 2016-2017',
'Zika viral sequence identified in Fortaleza, Brazil',
'Comparative analysis between the genomes of intra-host and cell cultured Zika virus obtained from a Mexican symptomatic patient',
'Zika virus from breast milk',
'Genome sequences of Zika viruses isolated in Nicaragua',              
'Emergence of Zika virus in Brazil: Detection of Asian genotype',
'Brazilian Zika virus genome sequence',
'Zika viruses in Haiti during an outbreak of Chikungunya Fever in mid 2014', #white et al sequences
'Zika virus sequence from Cuba traveler diagnosed in the United States (Miami, FL)', #nate
'Zika virus genomes from human cases in Florida, USA',
'Isolation of infectious Zika virus from sera in a traveller returning from Brazil to Italy',
'Zika virus infection in pregnant mice',
'Zika virus polyprotein sequence isolated from a fetus-brain of Cuba pregnant traveler diagnosed in Spain (Madrid)',
'Genome sequences of four Zika virus isolates from Brazil',
'Direct Submission'
                          ]

In [25]:
permission_info = pd.read_excel('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/sequences_and_permissions.xlsx')
permission_info

| Unnamed: 0 | strain_name | accession_number | lead_author | permission_type | publication_pubmed_id | permission_to_use | preliminarily_include |
|---|---|---|---|---|---|---|---|
| 0 | PHE_Guadeloupe | KX673530 | Atkinson | previously_published | 27738033.0 | not_necessary | yes |
| 1 | MEX_I_7 | KX247632 | Barrows | previously_published | 27476412.0 | not_necessary | yes |
| 2 | Dominican_Republic/2016/PD1 | KU853012 | Barzon | previously_published | 26987769.0 | not_necessary | yes |
| 3 | Haiti/2016/PD | KX269878 | Barzon | previously_published | 27542178.0 | not_necessary | yes |
| 4 | USVI/1/2016 | VI1 | Black | published_here | | not_necessary | yes |
| 5 | USVI/11/2016 | VI11 | Black | published_here | | not_necessary | yes |
| 6 | USVI/12/2016 | VI12 | Black | published_here | | not_necessary | yes |
| 7 | USVI/13/2016 | VI13 | Black | published_here | | not_necessary | yes |
| 8 | USVI/19/2016 | VI19 | Black | published_here | | not_necessary | yes |
| 9 | USVI/2/2016 | VI2 | Black | published_here | | not_necessary | yes |


In [26]:
strains_with_permission = []

for i in range(len(permission_info)):
    record = permission_info.iloc[i]
    if record['permission_to_use'] != 'permission_not_received' and record['preliminarily_include'] == 'yes':
        strains_with_permission.append(record['strain_name'])

print len(strains_with_permission)

262
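
To make the inclusion rule concrete, here is a minimal standalone sketch (Python 3, standard library only) that applies the same two conditions to a small inline table. The first two rows are copied from the table above. The two `HYPOTHETICAL/...` rows, and the `permission_received` and `unpublished` values in them, are invented here purely to show a record being dropped, and the function name `strains_allowed` is likewise an assumption.

```python
import csv
import io

# Two real rows from the permissions table, plus two hypothetical rows
# (HYPOTHETICAL/1 and HYPOTHETICAL/2) that exist only to exercise the filter.
table = """strain_name,accession_number,lead_author,permission_type,publication_pubmed_id,permission_to_use,preliminarily_include
PHE_Guadeloupe,KX673530,Atkinson,previously_published,27738033,not_necessary,yes
USVI/1/2016,VI1,Black,published_here,,not_necessary,yes
HYPOTHETICAL/1,XX000001,Doe,unpublished,,permission_not_received,yes
HYPOTHETICAL/2,XX000002,Doe,unpublished,,permission_received,no
"""


def strains_allowed(rows):
    """Keep strains whose permission was not refused and that are flagged for inclusion."""
    return [
        row['strain_name']
        for row in rows
        if row['permission_to_use'] != 'permission_not_received'
        and row['preliminarily_include'] == 'yes'
    ]


rows = list(csv.DictReader(io.StringIO(table)))
allowed = strains_allowed(rows)
print(allowed)
```

Both conditions have to hold: the third row fails the permission check and the fourth fails the `preliminarily_include` check.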


In [48]:
useable_genomes_dict_faunaHeader = {}
useable_genomes_dict_augurHeader = {}
#making a dict for each kind of header here because it's really nice to have both when troubleshooting.
for strain in strains_with_permission:
    for key in geoPruned_fauna_dict.keys():
        if key.startswith(strain):
            useable_genomes_dict_faunaHeader[key] = geoPruned_fauna_dict[key]
            useable_genomes_dict_augurHeader[key.split('|')[0]] = geoPruned_fauna_dict[key]

print len(useable_genomes_dict_faunaHeader)

262
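
One caveat with matching on `key.startswith(strain)`: a strain name that is a prefix of another strain name would match both headers. The counts above agree (262 and 262), so this does not seem to happen with the current data. Still, a sketch of an exact match on the first header field may be useful. The helper name `select_by_strain`, the shortened toy headers, and the `MEX_I_70` strain are assumptions made for illustration.

```python
def select_by_strain(sequences_by_header, allowed_strains):
    """Keep entries whose first pipe-delimited field exactly equals an allowed strain."""
    allowed = set(allowed_strains)  # set lookup avoids the nested loop
    return {
        header: seq
        for header, seq in sequences_by_header.items()
        if header.split('|')[0] in allowed
    }


toy = {
    'MEX_I_7|zika|KX247632': 'ACGT',   # strain and accession from the permissions table, toy sequence
    'MEX_I_70|zika|XX000001': 'ACGA',  # hypothetical strain that shares the same prefix
}

prefix_hits = [header for header in toy if header.startswith('MEX_I_7')]
exact_hits = select_by_strain(toy, ['MEX_I_7'])
print(len(prefix_hits))
print(len(exact_hits))
```

Prefix matching picks up both toy headers, while the exact match keeps only the intended one.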


Step 4: Now that we have the set of genomes that we can use, it's time to do some preliminary alignment quality control. I outline genome exclusion criteria in the README of the `data/fastas` directory. Genomes that should be excluded from my analysis are hardcoded into Augur as `dropped_strains` in `zika.prepare.py`. Most of the process of determining outlier clades is done iteratively, based on Augur builds of the tree. Filtering out strains with fewer than 5000 informative sites, however, is something that I do in this notebook (below).

In [51]:
#write out files that Augur will take in
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/augur-publishedAmericanSeqs-and-ref.fasta','w') as file:
    for key in useable_genomes_dict_faunaHeader.keys():
        file.write(str('>' + key + '\n' + useable_genomes_dict_faunaHeader[key] + '\n'))

In [34]:
informative_bases = ['A','G','C','T']
num_informative_bases = {}

for key in fauna_dict.keys():
    inform_base_count = 0
    for base in fauna_dict[key].upper():
        if base in informative_bases:
            inform_base_count +=1
        else:
            continue
    num_informative_bases[key.split('|')[0]] = inform_base_count

low_information_genomes = {key:value for key,value in num_informative_bases.items() if value < 5000}
print len(low_information_genomes)
print low_information_genomes.keys() #these genomes will get added to the dropped_strains section of augur zika.prepare.py

15
['USA/2016/FL019', 'DOM/2016/MA_WGS16_009', 'DOM/2016/MA_WGS16_020', 'BRA/2016/FC_DQ12D1', 'USA/2016/FL035', 'HTI/2016/MA_WGS16_022', 'DOM/2016/BB_0428', 'HND/2016/HU_ME137', 'Bahia04', 'Bahia05', 'Bahia15', 'DOM/2016/MA_WGS16_031', 'JAM/2016/MA_WGS16_038', 'BRA/2016/FC_DQ68D1', 'Brazil/2016/ZBRY12']
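
The informative-site count can also be written as a small reusable function. This is a sketch on plain strings rather than Biopython records. The function name `count_informative`, the toy sequences, and the toy threshold of 5 are assumptions for illustration (the notebook's actual threshold is 5000).

```python
def count_informative(seq, informative='ACGT'):
    """Count unambiguous nucleotides, ignoring case, gaps, and ambiguity codes such as N."""
    seq = str(seq).upper()
    return sum(seq.count(base) for base in informative)


# Toy sequences, invented for this sketch.
toy_seqs = {
    'toy_full': 'ACGTACGTAC',
    'toy_gappy': 'ACNN--nnac',
}

counts = {name: count_informative(seq) for name, seq in toy_seqs.items()}
low = sorted(name for name, n in counts.items() if n < 5)
print(counts)
print(low)
```

Using `str.count` once per base avoids the explicit per-character loop, and uppercasing first means lowercase bases are counted too.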


... Interlude here to do all the exploration in Augur ...

Done :)

Step 5: Now that I have the alignment that I want to use, with a good sequence set that is aligned and stripped to the reference, I want to make fasta files that I can read into BEAST. These should use the sequences from the multiple sequence alignment, but carry fully informative headers (which I'll need to grab from Fauna).

Basically, I'll need these three dictionaries to do this:

* `strain_header_dict` maps strain name in augur form (key) to the full fauna header (value).
* `fauna_dict` (which I made earlier in the script) maps fauna taxon name (key) to the unaligned, unstripped sequence as it's stored in fauna (value).
* `zika_msa_dict` maps strain name in augur form (key) to the aligned and stripped-to-reference sequence (value). All sequences in this dictionary should therefore be 10769 nucleotides long and in frame.

In [57]:
#make strain_header_dict
with open(fauna_file,'rU') as file:
    strain_header_dict={line.split('|')[0].replace('>',''):line.strip() for line in file if line.startswith('>')}

#make msa dict
zika_msa = AlignIO.read(open(zika_msa_stripped),'fasta')
zika_msa_dict = {record.id:record.seq for record in zika_msa}

#make sure the alignment is the same length as the reference sequence that it should be mapped to.
for key in zika_msa_dict.keys():
    assert len(zika_msa_dict[key]) == 10769
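
A bare `assert` stops at the first sequence with the wrong length without saying which one it was. As a hedged alternative, the sketch below collects every offending name first. The helper name `wrong_length` and the toy alignment (expected length 5 rather than 10769) are assumptions for illustration.

```python
def wrong_length(msa, expected_length):
    """Return {name: length} for every sequence whose length differs from expected_length."""
    return {
        name: len(seq)
        for name, seq in msa.items()
        if len(seq) != expected_length
    }


# Toy alignment, invented for this sketch; 'toy_c' is deliberately too short.
toy_msa = {'toy_a': 'ACGT-', 'toy_b': 'ACGTA', 'toy_c': 'ACG'}

bad = wrong_length(toy_msa, 5)
print(bad)
```

An empty result means the whole alignment has the expected length, so `assert not bad, bad` would fail with the offending names in the message.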

In [58]:
output_dict = {}
for key in zika_msa_dict.keys():
    header = strain_header_dict[key] #grab full fauna header from strain_header_dict
    seq = zika_msa_dict[key] #get sequence from msa
    output_dict[header] = seq

print len(output_dict.keys())


240


In [59]:
#write output dict to fasta
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/beast-americasZika-outgrouped.fasta','w') as out_file:
    for key in output_dict.keys():
        # key is the full fauna header, which still starts with '>'
        split_name = key.split('|')
        header = split_name[0] + '|' + split_name[3] + '|' + split_name[4] + '|' + split_name[5]
        out_file.write(str(header + '\n' + output_dict[key] + '\n'))
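
To show what the trimmed BEAST header looks like, here is a small sketch applied to the outgroup header used elsewhere in this notebook. The function name `beast_header` is an assumption. Reading fields 3, 4, and 5 as date, region, and country is inferred from that example header rather than from fauna documentation.

```python
def beast_header(fauna_header):
    """Reduce a full fauna header to fields 0, 3, 4 and 5, keeping a single leading '>'."""
    fields = fauna_header.lstrip('>').split('|')
    return '>' + '|'.join([fields[0], fields[3], fields[4], fields[5]])


full = ('>PF13/251013_18|zika|KX369547|2013-10-25|oceania|french_polynesia|'
        'french_polynesia|french_polynesia|genbank|genome|Troesemeier')
short = beast_header(full)
print(short)  # >PF13/251013_18|2013-10-25|oceania|french_polynesia
```

Stripping and re-adding the `>` makes the function work whether or not the incoming header still carries it.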
        

----------

Aside:
Below is the code I initially used to look at accessions and publications to determine which genomes were freely available to include in the analysis.

In [None]:
# want dict where {publication: [accession, accession]}
# note: list_of_pubs is defined in a later cell of this aside, so that cell has to be run first
publication_accessions_dict = {}
for pub in list_of_pubs:
    accessions = []
    for key in fauna_dict.keys():
        taxa = fauna_dict[key].description
        if taxa.split('|')[12] == pub:
            accessions.append(taxa.split('|')[2])
    publication_accessions_dict[pub] = accessions
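
The same grouping can be done in a single pass over the headers with a `defaultdict`, instead of rescanning every record for each publication. This is a sketch on plain header strings. The function names and the toy headers are assumptions, and the field positions (2 for accession, 12 for publication title) are the ones used in the cell above. Unlike the loop above, it only produces keys for publications that actually occur in the headers.

```python
from collections import defaultdict


def accessions_by_publication(headers, accession_field=2, publication_field=12):
    """Group accession numbers by publication title in one pass over the headers."""
    grouped = defaultdict(list)
    for header in headers:
        fields = header.split('|')
        grouped[fields[publication_field]].append(fields[accession_field])
    return dict(grouped)


def toy_header(accession, title):
    """Build a 13-field placeholder header; only fields 2 and 12 carry meaning here."""
    fields = ['x'] * 13
    fields[2] = accession
    fields[12] = title
    return '|'.join(fields)


# Hypothetical accessions and titles, invented for this sketch.
headers = [
    toy_header('XX000001', 'Paper A'),
    toy_header('XX000002', 'Paper A'),
    toy_header('XX000003', 'Paper B'),
]

grouped = accessions_by_publication(headers)
print(grouped)
```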

In [None]:
pub_seq_count = 0
for pub in published_submissions:
    print len(publication_accessions_dict[pub])

In [None]:
from collections import defaultdict
published_seqs = defaultdict(int)

list_of_pubs = [strain_header_dict[key].split('|')[12] for key in zika_msa_dict.keys()]

for pub in list_of_pubs:
    published_seqs[pub] += 1


In [None]:
published_submissions= ['First Complete Genome Sequence of Zika Virus (Flaviviridae, Flavivirus) from an Autochthonous Transmission in Brazil',
'Genetic characterization of the Zika virus epidemic in the US Virgin Islands',
'Isolation of Infective Zika Virus from Urine and Saliva of Patients in Brazil',
'Full-length infectious cDNA clone of Zika virus from 2015 epidemic in Brazil: development and characterization of recombinant viruses in cell lines from human placenta, testis, and brain',
'Zika virus complete genome from Salvador, Bahia, Brazil',
'Complete Genome Sequences of Three Historically Important, Spatiotemporally Distinct, and Genetically Divergent Strains of Zika Virus: MR-766, P6-740, and PRVABC-59',
'Isolation of infectious Zika virus from saliva and prolonged viral RNA shedding in a traveller returning from the Dominican Republic to Italy, January 2016',
'Fatal Zika Virus Infection with Secondary Nonsexual Transmission',
'Quasispecies composition and evolution of a typical Zika virus clinical isolate from Suriname',
'First Complete Genome Sequences of Zika Virus Isolated from Febrile Patient Sera in Ecuador',
'Epidemic establishment and cryptic transmission of Zika virus in Brazil and the Americas',
'Genomic epidemiology reveals multiple introductions of Zika virus into the United States',
'Coinfection With Zika and Dengue-2 Viruses in a Traveler Returning From Haiti, 2016: Clinical Presentation and Genetic Analysis',
'Multiplex PCR method for MinION and Illumina sequencing of Zika and other virus genomes directly from clinical samples',
'Large returning population of oversea Chinese of Guangdong province combined with tourists increase the risks of ZIKV transmission in Guangdong, China',
'Zika Virus Targets Different Primary Human Placental Cells, Suggesting Two Routes for Vertical Transmission',
'Establishment and cryptic transmission of Zika virus in Brazil and the Americas',
'Zika Virus Outbreak in Haiti in 2014: Molecular and Clinical Data',
'Prolonged Detection of Zika Virus in Vaginal Secretions and Whole Blood',
'Complete Genome Sequences of Identical Zika virus Isolates in a Nursing Mother and Her Infant',
'?',
'Zika virus in the Americas: early epidemiological and genetic findings',
'Full Genome Sequence and sfRNA Interferon Antagonist Activity of Zika Virus from Recife, Brazil',
'Distinct Zika Virus Lineage in Salvador, Bahia, Brazil',
'Genome sequence of a candidate World Health Organization reference strain for Zika virus for nucleic acid testing',
'Zika virus evolution and spread in the Americas',
'Zika virus genome from the Americas',
'Complete coding sequence of Zika virus from Martinique outbreak in 2015',
'Zika Virus Associated with Microcephaly',
'Characterization of a Zika Virus Isolate from Colombia',
'Detection and sequencing of Zika virus from amniotic fluid of fetuses with microcephaly in Brazil: a case study',
'Phylogeny of Zika Virus in Western Hemisphere, 2015',
'Complete genomic sequence of Zika virus isolated from a clinical semen sample',
'Infection dynamics in a traveller with persistent shedding of Zika virus RNA in semen for six months after returning from Haiti to Italy, January 2016',                        
                  ]

published_submissions_with_incorrect_publication_name = [ 'Complete Zika Virus Genome Sequences from a Serum Sample and after Isolation in Vero Cells',
'Full coding sequence of the Zika Virus isolate HS-2015-BA-01',
'Large returning population of oversea Chinese of Guangdong province combined with tourists increase the risks of ZIKV transmission in Guangdong, China',
'The isolation and characterization of Zika virus imported into Guangdong, China',
'Full genomic characterization of a Zika virus isolate from Colombia',
'Molecular characterization of Zika virus in Mexico',
'Repurposed drug candidates to treat ZIKV infection in pregnancy',
'Full-genome amplification and sequencing of Zika viruses using a targeted amplification approach',
'Zika virus infection with prolonged maternal viremia and fetal brain abnormalities',
'Complete genome of Zika virus GZ01 isolate from China',
'Complete genomic sequence of Zika virus isolated from a traveler to Colombia in 2016',
                                                       
                                                         
]

unpublished_submissions = [
'Imported cases of Zika virus disease in Russia in 2016-2017',
'Complete Genome of a Zika Virus Isolate, Amazonas, Brazil, 2016',
'Imported cases of Zika virus disease in Russia in 2016-2017',
'Zika viral sequence identified in Fortaleza, Brazil',
'Comparative analysis between the genomes of intra-host and cell cultured Zika virus obtained from a Mexican symptomatic patient',
'Zika virus from breast milk',
'Genome sequences of Zika viruses isolated in Nicaragua',              
'Emergence of Zika virus in Brazil: Detection of Asian genotype',
'Brazilian Zika virus genome sequence',
'Zika viruses in Haiti during an outbreak of Chikungunya Fever in mid 2014', #white et al sequences
'Zika virus sequence from Cuba traveler diagnosed in the United States (Miami, FL)', #nate
'Zika virus genomes from human cases in Florida, USA',
'Isolation of infectious Zika virus from sera in a traveller returning from Brazil to Italy',
'Zika virus infection in pregnant mice',
'Zika virus polyprotein sequence isolated from a fetus-brain of Cuba pregnant traveler diagnosed in Spain (Madrid)',
'Genome sequences of four Zika virus isolates from Brazil',
'Direct Submission'
                          ]

unpublished_seqs_to_pursue = [
                              'Genome sequences of Zika viruses isolated in Nicaragua',
                              'Imported cases of Zika virus disease in Russia in 2016-2017', #dominican republic and mexico
                              'Zika viruses in Haiti during an outbreak of Chikungunya Fever in mid 2014',
                              'Zika virus polyprotein sequence isolated from a fetus-brain of Cuba pregnant traveler diagnosed in Spain (Madrid)', #cuba
                              'Genome sequences of four Zika virus isolates from Brazil' #brazil early...published on, but not by the submitting authors... 
]

In [None]:
published_fauna_seqs = {key:value for key,value in fauna_dict.items() if fauna_dict[key].description.split('|')[12] in published_submissions}

In [None]:
with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/published-sequences-in-analysis.tsv','w') as file:
    file.write('{}\t{}\t{}\n'.format('strain_name', 'accession_number', 'lead_author'))
    for key in published_fauna_seqs.keys():
        file.write('{}\t{}\t{}\n'.format(key.split('|')[0], key.split('|')[2], key.split('|')[-1]))

In [None]:
print len(published_fauna_seqs)

with open('/Users/alliblk/Desktop/gitrepos/zika-usvi/data/fastas/augur-published-seqs-only.fasta','w') as file:
    for key in published_fauna_seqs.keys():
        if published_fauna_seqs[key].description.split('|')[4] in regions_to_exclude: #don't write out anything except American seqs (was regions_to_exclude1, which is not defined in this notebook)
            continue
        else: 
            file.write(str('>' + published_fauna_seqs[key].description + '\n' + published_fauna_seqs[key].seq + '\n'))
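    #write the outgroup last, using its own sequence rather than the sequence of the last key from the loop
    reference_key = 'PF13/251013_18|zika|KX369547|2013-10-25|oceania|french_polynesia|french_polynesia|french_polynesia|genbank|genome|Troesemeier'
    file.write(str('>' + published_fauna_seqs[reference_key].description + '\n' + published_fauna_seqs[reference_key].seq + '\n'))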