  ### Dataset curation: making the zika-colombia fasta files for main build and supplemental analyses
  
This notebook contains code needed to go from a raw download of all Zika genomes in `nextstrain/fauna` to the input fasta file for the zika-colombia specific analyses (which are custom `nextstrain/augur` builds). 

In [1]:
#import libraries
from Bio import SeqIO
import pandas as pd
import numpy as np
import random
import os

The first thing I am going to do is curate the dataset for the primary analysis. The curation steps that I would like to perform are:

1. Removing sequences from any geographic areas that I do not want included in the analysis (e.g. Singapore)

2. Removing any sequences that are available on GenBank but have not been described in a publication, and for which we did not receive the authors' permission to include them in our analysis.
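
Both curation steps rely on the pipe-delimited fauna header. As a rough sketch, this is how I read the fields that the code in this notebook indexes (0 = strain, 2 = accession, 4 = region, 5 = country, 10 = authors). The `parse_header` helper, the `FIELD_INDEX` mapping, and the example header are all made up for illustration and are not part of fauna or augur.

```python
# Field positions are inferred from the indexing used elsewhere in this notebook.
# parse_header, FIELD_INDEX and the example header are hypothetical.
FIELD_INDEX = {'strain': 0, 'accession': 2, 'region': 4, 'country': 5, 'authors': 10}

def parse_header(header):
    """Return the header fields used in this notebook as a dictionary."""
    fields = header.split('|')
    return {name: fields[position] for name, position in FIELD_INDEX.items()}

# A made-up header with 11 pipe-delimited fields, not a real record.
example_header = 'EXAMPLE01|zika|XX000000|2016-01-01|south_america|colombia|colombia|colombia|genbank|genome|Doe et al'
parsed = parse_header(example_header)

# Geographic pruning keeps a record only if its region is not on the exclusion list.
regions_to_exclude = ['southeast_asia', 'japan_korea', 'china', 'europe', 'africa']
keep = parsed['region'] not in regions_to_exclude
print(parsed['strain'], parsed['region'], keep)
```

Naming the fields once like this would avoid repeating bare indices such as `split('|')[5]` throughout the notebook.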

In [2]:
#write any functions that I want to have for this work.

## count n's in sequences (alignment checks)
def count_n(sequence):
    "counts the lowercase n's in a sequence, for QC, to see how many non-informative sites exist"
    counter = 0
    for base in sequence:
        if base == 'n':
            counter +=1
    return counter

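def sample_fasta_dict_without_replacement(dictionary,n_samples_to_draw):
    """randomly samples (key, record) pairs from a SeqIO dictionary without replacement."""
    # random.sample needs a sequence; passing dict.items() directly raises a TypeError on Python 3.11+
    samples = random.sample(list(dictionary.items()), k=n_samples_to_draw)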
    return samples

In [5]:
# Paths to files are relative, so they should work if someone downloads the repo as is.
fauna_seqs_dict = SeqIO.to_dict(SeqIO.parse('../data/zika-fauna-2018-09-06.fasta', 'fasta'))
print('There are {} sequences downloaded from Fauna.'.format(len(fauna_seqs_dict)))

#Geographic pruning
regions_to_exclude = ['southeast_asia', 'japan_korea', 'china', 'europe', 'africa']
print('Genomes from the following regions will be excluded: {}.'.format(', '.join(regions_to_exclude)))

geoPruned_seqs_dict = {fauna_seqs_dict[key].description:fauna_seqs_dict[key].seq for key in fauna_seqs_dict.keys() if key.split('|')[4] not in regions_to_exclude}
print('There are {} sequences that meet the geographic criteria.'.format(len(geoPruned_seqs_dict)))



There are 694 sequences downloaded from Fauna.
Genomes from the following regions will be excluded: southeast_asia, japan_korea, china, europe, africa.
There are 504 sequences that meet the geographic criteria.


In [6]:
#at this point, list all the genomes that are left, so that we can check that permissions are available.
accessions_to_check = [key for key in geoPruned_seqs_dict.keys()]

strain_names = [accession.split('|')[0] for accession in accessions_to_check]
ncbi_id = [accession.split('|')[2] for accession in accessions_to_check]
lead_author =  [accession.split('|')[10].replace('et al','') for accession in accessions_to_check]


accessions_df = pd.DataFrame(np.column_stack([strain_names, ncbi_id, lead_author]), columns= ['strain_name', 'accession_number','lead_author'])
accessions_df.to_csv('../data/all_included_accessions.csv', index=False)

In [8]:
# read in the dataframe that has the permissions information,
# then parse that to select out all strains that can be included in a publishable analysis
# these are the strains that should be used, and form the fauna subset we want.

genome_permissions = pd.read_csv('../data/genome-permissions-2018-09-06.txt', delimiter ='\t')

# keep strains whose permission was received (or not needed) and that are flagged for inclusion
publishable_mask = (genome_permissions['permission_to_use'] != 'permission_not_received') & (genome_permissions['preliminarily_include'] == 'yes')
publishable_strains = genome_permissions.loc[publishable_mask, 'strain_name'].tolist()

print("There are {} genomes that we can include in published analyses.".format(len(publishable_strains)))

There are 431 genomes that we can include in published analyses.


In [9]:
# using the strains in the publishable_strains list, pull out the full fauna headers (and sequences)
# for each strain that can be published on.
# then make a new fauna-formatted fasta file that can be read in to Augur for analysis.

publishable_seqs_dict = {}
for strain in publishable_strains:
    for key in geoPruned_seqs_dict.keys():
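        if key.split('|')[0] == strain: #exact match on the strain field; startswith() would also match longer strain names that share the prefix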
            publishable_seqs_dict[key] = geoPruned_seqs_dict[key]

In [11]:
#write sequences to file
with open('../data/publishable-zika-fauna-2018-09-06.fasta','w') as file:
    for key in publishable_seqs_dict.keys():
        file.write(str('>' + key + '\n' + publishable_seqs_dict[key] + '\n'))

Next I want to check the quality of the sequences that I will use. Specifically, I want to know, for each sample, what proportion of the genome has informative base calls. The following code blocks count the number of N's in each sequence, and separate high quality sequences from medium and low quality sequences using different thresholds for the acceptable proportion of N's.
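
To make the thresholds concrete, here is a minimal sketch of the quality check on toy sequences. The `n_fraction` and `quality_tier` helpers are hypothetical names, not part of this notebook, and unlike `count_n` they count both `n` and `N`. The 0.2 and 0.5 cutoffs are the ones used in the cells below. Two differences from the cells below: the notebook always keeps Colombian sequences regardless of N content, and its medium quality set also contains every high quality sequence, whereas this sketch assigns each sequence a single tier.

```python
def n_fraction(sequence, genome_length):
    """Fraction of genome positions that are called as n or N."""
    n_count = sum(1 for base in str(sequence) if base in 'nN')
    return n_count / genome_length

def quality_tier(sequence, genome_length, high_cutoff=0.2, medium_cutoff=0.5):
    """Assign a single quality tier based on the fraction of N's."""
    fraction = n_fraction(sequence, genome_length)
    if fraction < high_cutoff:
        return 'high'
    if fraction < medium_cutoff:
        return 'medium'
    return 'low'

# Toy 10-base sequences; the real analysis divides by a genome length of 10769.
toy_sequences = {
    'one_n': 'acgtnacgta',     # 1 of 10 sites is n
    'three_n': 'acgtnnnacg',   # 3 of 10 sites are n
    'six_n': 'nnnnnnacgt',     # 6 of 10 sites are n
}
tiers = {name: quality_tier(seq, 10) for name, seq in toy_sequences.items()}
print(tiers)
```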

In [12]:
sequences = SeqIO.to_dict(SeqIO.parse('../data/publishable-zika-fauna-2018-09-06.fasta', 'fasta'))

In [13]:
n_counts_dict = {}
for key in sequences.keys():
    n_count = count_n(sequences[key].seq)
    n_counts_dict[key] = n_count

gaps_not_n_seqs = {}
for key in sequences.keys():
    n_seq = str(sequences[key].seq)
    gap_seq = n_seq.replace('n','-')
    gaps_not_n_seqs[key] = gap_seq
    
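# Colombian sequences are always kept, whatever their N content.
# All other sequences are kept only if N's make up less than 20% (high quality)
# or less than 50% (medium quality) of the 10769 sites in the genome.
genome_length = 10769

high_qual_seqs = {}
for key in n_counts_dict.keys():
    if key.split('|')[5] == 'colombia':
        high_qual_seqs[key] = sequences[key].seq
    else:
        if float(n_counts_dict[key])/genome_length < 0.2:
            high_qual_seqs[key] = sequences[key].seq
            
medium_qual_seqs = {}
for key in n_counts_dict.keys():
    if key.split('|')[5] == 'colombia':
        medium_qual_seqs[key] = sequences[key].seq
    else:
        if float(n_counts_dict[key])/genome_length < 0.5:
            medium_qual_seqs[key] = sequences[key].seq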

In [15]:
with open('../data/publishable-zika-fauna-2018-09-06-high-quality.fasta','w') as file:
    for key in high_qual_seqs.keys():
        file.write(str('>' + key + '\n' + high_qual_seqs[key] + '\n'))
        
with open('../data/publishable-zika-fauna-2018-09-06-medium-quality.fasta','w') as file:
    for key in medium_qual_seqs.keys():
        file.write(str('>' + key + '\n' + medium_qual_seqs[key] + '\n'))

In [16]:
with open('../data/publishable-zika-fauna-2018-09-06-gapped.fasta','w') as file:
    for key in gaps_not_n_seqs.keys():
        file.write(str('>' + key + '\n' + gaps_not_n_seqs[key] + '\n'))

# Make input data files for rarefaction curve supplemental analysis

Next up, I want to make an additional dataset for a supplemental analysis: rarefaction curves that investigate how many introductions one observes given the number of sequences sampled. I'm going to do this analysis for sequences from Colombia and for sequences from Mexico, and it involves subsampling them as well.

In addition to this, I will need to make a "background sequences" file. This fasta will contain all of the sequences from the Americas that are used in the main analysis build EXCEPT those from the country for which the subsampling is occurring (e.g. the background file for the Mexican subsampling analysis will contain all other American sequences in the main build, including all Colombian sequences, but won't have any Mexican sequences). Later on in the build, each subsampled fasta will be concatenated with the background sequences fasta to make the Augur build input files.
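
The concatenation itself happens later in the build, not in this notebook. As a rough sketch of what that step amounts to, assuming plain text fasta files, the following joins a subsampled fasta and a background fasta into one input file. The `concatenate_fastas` helper and the toy file names are hypothetical, and the demo writes to a temporary directory rather than to the repository.

```python
import os
import tempfile

def concatenate_fastas(input_paths, output_path):
    """Write the contents of each input fasta, in order, to output_path."""
    with open(output_path, 'w') as out:
        for path in input_paths:
            with open(path) as handle:
                contents = handle.read()
            out.write(contents)
            # make sure the next file's first header starts on its own line
            if contents and not contents.endswith('\n'):
                out.write('\n')

# Toy files in a temporary directory, standing in for one subsample trial
# and the matching background file.
workdir = tempfile.mkdtemp()
subsample_path = os.path.join(workdir, 'toy_subsample.fasta')
background_path = os.path.join(workdir, 'toy_background.fasta')
combined_path = os.path.join(workdir, 'toy_combined.fasta')

with open(subsample_path, 'w') as handle:
    handle.write('>toy_mex_1\nacgt')  # deliberately no trailing newline
with open(background_path, 'w') as handle:
    handle.write('>toy_col_1\nacgg\n>toy_col_2\nacga\n')

concatenate_fastas([subsample_path, background_path], combined_path)

with open(combined_path) as handle:
    n_records = sum(1 for line in handle if line.startswith('>'))
print(n_records)
```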

In [27]:
all_main_build_sequences = SeqIO.to_dict(SeqIO.parse('../data/publishable-zika-fauna-2018-10-15.fasta', 'fasta'))

background_seqs_no_mexico = {all_main_build_sequences[key].description:all_main_build_sequences[key].seq for key in all_main_build_sequences.keys() if key.split('|')[5] != 'mexico'}

background_seqs_no_colombia = {all_main_build_sequences[key].description:all_main_build_sequences[key].seq for key in all_main_build_sequences.keys() if key.split('|')[5]!= 'colombia'}


In [33]:
with open("../supplemental-analysis/rarefaction-curves/mexico/data/background_seqs_no_mexico.fasta","w") as file:
    for key in background_seqs_no_mexico.keys():
        file.write(str(">" + key + "\n" + background_seqs_no_mexico[key] + "\n" ))
        
with open("../supplemental-analysis/rarefaction-curves/colombia/data/background_seqs_no_colombia.fasta","w") as file:
    for key in background_seqs_no_colombia.keys():
        file.write(str(">"+ key + "\n" + background_seqs_no_colombia[key] + "\n"))

In [5]:
rarefaction_seqs_dict = SeqIO.to_dict(SeqIO.parse('/Users/alliblk/Desktop/gitrepos/fauna/data/zika.fasta', 'fasta'))
print ('There are {} sequences downloaded from Fauna in the rarefaction import.'.format(len(rarefaction_seqs_dict)))

mexico_seqs_dict = {key:rarefaction_seqs_dict[key] for key in rarefaction_seqs_dict.keys() if key.split('|')[5] == 'mexico'}
    
#notably, not all of these sequences can be published on, so sequences from the following authors should be dropped from the analysis
drop_authors = ['Sevilla-Reyes','Balaraman','Izquierdo', 'Valdespino-Vazquez']

mexican_seqs_for_use = {key:mexico_seqs_dict[key] for key in mexico_seqs_dict.keys() if key.split('|')[10] not in drop_authors}

#I also only want high quality sequences, so checking quality.
# High quality means at least 80% informative bases, i.e. N's at no more than 20% of the 10769 sites.
samples_to_exclude_due_to_quality = []
for key in mexican_seqs_for_use.keys():
    n_count = count_n(mexican_seqs_for_use[key].seq)
    if n_count > (10769*0.2):
        samples_to_exclude_due_to_quality.append(key)
        
high_qual_mexican_seqs_for_use = {key:mexican_seqs_for_use[key] for key in mexican_seqs_for_use.keys() if key not in samples_to_exclude_due_to_quality}
print(len(high_qual_mexican_seqs_for_use))

#Okay, high_qual_mexican_seqs_for_use is the set that should be used/subsampled for rarefaction analyses.

There are 745 sequences downloaded from Fauna in the rarefaction import.
51


Now I'm subsampling both my Colombian and my Mexican sequence dictionaries to make the input datasets for the rarefaction curve analysis, which looks at how many introductions into a country (either Mexico or Colombia) are observed given the number of sequences sampled from that country. The idea is to see where this relationship reaches an asymptote, i.e. how many sequences you need in order to observe most of the introductions to a country that occurred.

To get the datasets for this analysis, I need to subsample one country's sequences, re-run the augur pipeline with the rest of the build kept the same, and look at how the number of introductions to that country changes with the number of sequences obtained from that country. I'll do this for both Colombia and Mexico, but separately (i.e. any subsampled Mexican sequences will be run with all Colombian sequences and vice versa).

The subsampling scheme is as follows. Try 1 sequence, 2 sequences, 3 sequences ... all the sequences, and look at numbers of introductions observed. For each subsample amount (e.g. 4 sequences) there will be 5 trials. 
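
As a minimal sketch of this scheme on a toy dictionary, the code below draws every subsample size from 1 up to one less than the full set, with 5 trials per size. The `plan_subsamples` helper is a hypothetical name. It uses its own `random.Random` instance rather than the global `random.seed` call in the cell below, so it will not reproduce the draws made there.

```python
import random

def plan_subsamples(records, n_trials, seed):
    """Return {(n_seqs, trial): [sampled names]} for every subsample size and trial."""
    rng = random.Random(seed)
    items = list(records.items())  # random.sample needs a sequence
    plan = {}
    # sizes run from 1 to len - 1, because drawing every sequence is not subsampling
    for n_seqs in range(1, len(items)):
        for trial in range(1, n_trials + 1):
            drawn = rng.sample(items, k=n_seqs)  # sampling without replacement
            plan[(n_seqs, trial)] = [name for name, _ in drawn]
    return plan

# Toy records standing in for the high quality Mexican sequences.
toy_records = {'seq_a': 'acgt', 'seq_b': 'acga', 'seq_c': 'acgg', 'seq_d': 'acgc'}
plan = plan_subsamples(toy_records, n_trials=5, seed=123456)
print(len(plan))  # 3 subsample sizes x 5 trials
```

With the 51 high quality Mexican sequences, the same scheme gives 50 subsample sizes and 250 fasta files in total.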

In [100]:
random.seed(123456) #setting a seed so that sampling process is reproducible
n_trials_per_subsampling = 5
#randomly sample Mexican sequences, where the number of sequences drawn increments by 1 until one less than the whole dataset is grabbed
for i in range(1,len(high_qual_mexican_seqs_for_use)): #start at 1 seq, not zero, but doesn't need to do 51 because that would be no sampling.
    print("\nsubsampling ", i, " genomes.\n")
    os.makedirs("../supplemental-analysis/rarefaction-curves/mexico/mexico_{}_seqs".format(i), exist_ok=True) #make a directory for each subsample number that will hold the 5 trial data; exist_ok lets the cell be re-run.
    for k in range(1,n_trials_per_subsampling+1): #doing this just for readability, so that trial number is 1 through 5, rather than 0 through 4
        print("working on trial number: ", k)
        subsample = sample_fasta_dict_without_replacement(high_qual_mexican_seqs_for_use,i)
        with open("../supplemental-analysis/rarefaction-curves/mexico/mexico_{}_seqs/mex_{}_seqs_trial_{}.fasta".format(i,i,k),'w') as file:
            for sequence in subsample: #each sample is a (header, SeqRecord) pair
                file.write(str(">"+sequence[0] + '\n' + sequence[1].seq + '\n'))


subsampling  1  genomes.

working on trial number:  1
working on trial number:  2
working on trial number:  3
working on trial number:  4
working on trial number:  5

subsampling  2  genomes.

working on trial number:  1
working on trial number:  2
working on trial number:  3
working on trial number:  4
working on trial number:  5

subsampling  3  genomes.

working on trial number:  1
working on trial number:  2
working on trial number:  3
working on trial number:  4
working on trial number:  5

subsampling  4  genomes.

working on trial number:  1
working on trial number:  2
working on trial number:  3
working on trial number:  4
working on trial number:  5

subsampling  5  genomes.

working on trial number:  1
working on trial number:  2
working on trial number:  3
working on trial number:  4
working on trial number:  5

subsampling  6  genomes.

working on trial number:  1
working on trial number:  2
working on trial number:  3
working on trial number:  4
working on trial number:  