Skip to content
Permalink
Browse files

WIP on rarefaction curves, looking at Mexican sequences

  • Loading branch information...
alliblk committed Jul 3, 2019
1 parent d2c0a79 commit ab8bdd088f051e43093eb7f93ac5278ca4c61f8e
Showing with 79 additions and 9 deletions.
  1. +79 −9 scripts/curate-dataset.ipynb
@@ -4,11 +4,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
" ### Dataset curation: making the zika-colombia input fasta dataset\n",
" ### Dataset curation: making the zika-colombia fasta files for main build and supplemental analyses\n",
" \n",
"This notebook contains code needed to go from a raw download of all Zika genomes in `nextstrain/fauna` to the input fasta file for the zika-colombia specific analysis (which is done as a custom `nextstrain/augur` build. \n",
"\n",
"Here I am removing sequences from geographic areas that I don't want included in this analysis (e.g. Singapore), as well as ensuring that I only keep genomes that I have permission to include in a published analysis."
"This notebook contains code needed to go from a raw download of all Zika genomes in `nextstrain/fauna` to the input fasta file for the zika-colombia specific analyses (which are custom `nextstrain/augur` builds). "
]
},
{
@@ -23,6 +21,17 @@
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First thing that I am going to do is curate the dataset for the primary analysis. Curations steps that I would like to perform include:\n",
"\n",
"1. Removing sequences from any geographic areas that I do not want included in the analysis (e.g. Singapore)\n",
"\n",
"2. Removing any sequences that were up on GenBank, but were not actually published on, and for which we didn't receive the author's permission to include in our analysis."
]
},
{
"cell_type": "code",
"execution_count": 36,
@@ -39,7 +48,7 @@
}
],
"source": [
"# Paths to files, keeping relational show that paths should work if someone downloads the repo as is.\n",
"# Paths to files, keeping relational so that paths should work if someone downloads the repo as is.\n",
"fauna_seqs_dict = SeqIO.to_dict(SeqIO.parse('../data/zika-fauna-2018-09-06.fasta', 'fasta'))\n",
"print 'There are {} sequences downloaded from Fauna.'.format(len(fauna_seqs_dict))\n",
"\n",
@@ -120,9 +129,16 @@
" file.write(str('>' + key + '\\n' + publishable_seqs_dict[key] + '\\n'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next I want to check the quality of the sequences that I will use. Specifically, I want to know what proportion of the genome we have informative sequence calls for each sample. The following code blocks do things like count numbers of N's in a sequence, and separate out high quality sequences from low/medium quality sequences given different thresholds for the numbers of N's that are acceptable."
]
},
{
"cell_type": "code",
"execution_count": 32,
"execution_count": 33,
"metadata": {},
"outputs": [],
"source": [
@@ -215,13 +231,67 @@
" for key in gaps_not_n_seqs.keys():\n",
" file.write(str('>' + key + '\\n' + gaps_not_n_seqs[key] + '\\n'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next up, I want to make a little additional dataset for a supplemental analysis I'm doing, making rarefaction curves to investigate how many introductions one observes given the numbers of sequences sampled. I'm going to do this analysis for sequences from Colombia and sequences from Mexico, and it involves subsampling them down as well. "
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"There are 745 sequences downloaded from Fauna in the rarefaction import.\n",
"51\n"
]
}
],
"source": [
"rarefaction_seqs_dict = SeqIO.to_dict(SeqIO.parse('/Users/alliblk/Desktop/gitrepos/fauna/data/zika.fasta', 'fasta'))\n",
"print ('There are {} sequences downloaded from Fauna in the rarefaction import.'.format(len(rarefaction_seqs_dict)))\n",
"\n",
"#mexico_seqs_dict = {key:rarefaction_seqs_dict[key] for key in rarefaction_seqs_dict.keys if key.split('|')[5] == 'mexico')}\n",
"mexico_seqs_dict = {}\n",
"for key in rarefaction_seqs_dict.keys():\n",
" if key.split('|')[5] == 'mexico':\n",
" mexico_seqs_dict[key] = rarefaction_seqs_dict[key]\n",
"\n",
"#for key in mexico_seqs_dict.keys():\n",
" #print(key.split('|')[10] + ' ' + key.split('|')[2]) \n",
"\n",
" \n",
" #notably not all of these sequences can be published on, so the following author's sequences should be dropped from the analysis\n",
"drop_authors = ['Sevilla-Reyes','Balaraman','Izquierdo', 'Valdespino-Vazquez']\n",
"\n",
"mexican_seqs_for_use = {key:mexico_seqs_dict[key] for key in mexico_seqs_dict.keys() if key.split('|')[10] not in drop_authors}\n",
"\n",
"#I also don't want to include sequences that are 50% N, so checking quality. \n",
"# I'm going to say that the sequences need to be high quality: they need to have at least 80% informative bases.\n",
"samples_to_exclude_due_to_quality = []\n",
"for key in mexican_seqs_for_use.keys():\n",
" n_count = count_n(mexican_seqs_for_use[key])\n",
" if n_count > (10769*0.2):\n",
" samples_to_exclude_due_to_quality.append(key)\n",
" \n",
"high_qual_mexican_seqs_for_use = {key:mexican_seqs_for_use[key] for key in mexican_seqs_for_use.keys() if key not in samples_to_exclude_due_to_quality}\n",
"print(len(high_qual_mexican_seqs_for_use))\n",
"\n",
"#Okay, high_qual_mexican_seqs is the set that should be used/subsampled for rarefaction analyses."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
@@ -233,7 +303,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
"version": "2.7.15"
}
},
"nbformat": 4,

0 comments on commit ab8bdd0

Please sign in to comment.
You can’t perform that action at this time.