Skip to content
This repository has been archived by the owner on Jul 20, 2021. It is now read-only.

Commit

Permalink
Merge 5ee3394 into e9d34ad
Browse files Browse the repository at this point in the history
  • Loading branch information
2grep committed Oct 29, 2014
2 parents e9d34ad + 5ee3394 commit 3ba3f7c
Show file tree
Hide file tree
Showing 3 changed files with 53 additions and 46 deletions.
4 changes: 2 additions & 2 deletions Index.ipynb
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
{
"metadata": {
"name": "",
"signature": "sha256:62c6618d5908cea56f04b909c1a6fb35ca5582bb105a3453a26769c28eca29bb"
"signature": "sha256:d437bbcfe323460025114f4e677669ba63db29d1e089b04aff4572c28c793d9a"
},
"nbformat": 3,
"nbformat_minor": 0,
Expand Down Expand Up @@ -39,7 +39,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**This project is in very early development stage.** It's not ready for prime-time by any means, but I fall firmly into the \"publish early, publish often\" mindset, hence its public availability. I am very interested in feedback. The best way to get feedback to me is through [the IAB issue tracker](https://github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/issues), and the best way to get me contributions is through [pull requests](https://help.github.com/articles/using-pull-requests).\n",
"**This project is in very early development stage.** It's not ready for prime-time by any means, but I fall firmly into the \"publish early, publish often\" mindset, hence its public availability. I am very interested in feedback. The best way to get feedback to me is through [the IAB issue tracker](https://github.com/gregcaporaso/An-Introduction-To-Applied-Bioinformatics/issues), and the best way to contribute is through [pull requests](https://help.github.com/articles/using-pull-requests).\n",
"\n",
"The code in the iab module is **not sufficiently tested, documented, or optimized for production use**. As code reaches those quality standards it will be ported to [scikit-bio](http://www.scikit-bio.org). I do not recommend using the code in the iab module outside of these notebooks. In other words, don't `import iab` outside of the notebooks - if you want access to the functionality in your own code, you should `import skbio`.\n",
"\n",
Expand Down
87 changes: 47 additions & 40 deletions algorithms/1-pairwise-alignment.ipynb
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
{
"metadata": {
"name": "",
"signature": "sha256:dc3eddd8daa8cf7da1bbb568625d52d5ba5f557347006cd85faffd1cac6e944a"
"signature": "sha256:caa1c89042d2242d1f07d7ae70c1b36815953648caca1be7a381c9e86e83aa87"
},
"nbformat": 3,
"nbformat_minor": 0,
Expand Down Expand Up @@ -41,7 +41,7 @@
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
"prompt_number": 4
},
{
"cell_type": "code",
Expand Down Expand Up @@ -96,7 +96,7 @@
]
}
],
"prompt_number": 2
"prompt_number": 5
},
{
"cell_type": "code",
Expand All @@ -113,17 +113,24 @@
"stream": "stdout",
"text": [
"0.0508474576271\n",
"0.254237288136\n"
"0.254237288136"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 3
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, ``q1`` is clearly more similar to ``r1`` than ``q2`` is. But it's not always that simple. Here we're assuming that only substitution events have occured. Let's define ``q3``, which is the same as ``q1`` except that a single base deletion (with respect to ``r1``) is present toward the beginning of the sequence, and a single base addition at the end of the sequence. (Note: it's required that if we have a deletion we also have an insertion, because ``hamming`` is only defined for sequences of equal length.)"
"In this case, ``q1`` is clearly more similar to ``r1`` than ``q2`` is. But it's not always that simple. Here we're assuming that only substitution events have occurred. Let's define ``q3``, which is the same as ``q1`` except that a single base deletion (with respect to ``r1``) is present toward the beginning of the sequence, and a single base addition at the end of the sequence. (Note: it's required that if we have a deletion we also have an insertion, because ``hamming`` is only defined for sequences of equal length.)"
]
},
{
Expand All @@ -144,13 +151,13 @@
]
}
],
"prompt_number": 4
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This one base change had a big effect on the distance between the two sequences. If this is a protein coding sequence, maybe that's reasonable, but given what we know about how biological sequences evolve this doesn't seem biologically justified. In this case, it seems that an insertion or deletion (i.e., an **indel**) event has shifted one sequence relative to the other, which resulted in many of the bases \"downstream\" of the indel being different. What we'd really want to do is have a way to indicate that an indel seems to have occured in one of the sequences. For example, let's define ``q4``, where we use a ``-`` character to indicate a deletion with respect to ``r1``. This results in what seems like a more reasonable distance between the two sequences (though given what you know about how indel events can disrupt protein coding sequences, you may feel that this distance is too close to the distance between ``r1`` and ``q1``):"
"This one base change had a big effect on the distance between the two sequences. If this is a protein coding sequence, maybe that's reasonable, but given what we know about how biological sequences evolve this doesn't seem biologically justified. In this case, it seems that an insertion or deletion (i.e., an **indel**) event has shifted one sequence relative to the other, which resulted in many of the bases \"downstream\" of the indel being different. What we'd really want to do is have a way to indicate that an indel seems to have occurred in one of the sequences. For example, let's define ``q4``, where we use a ``-`` character to indicate a deletion with respect to ``r1``. This results in what seems like a more reasonable distance between the two sequences (though given what you know about how indel events can disrupt protein coding sequences, you may feel that this distance is too close to the distance between ``r1`` and ``q1``):"
]
},
{
Expand All @@ -171,7 +178,7 @@
]
}
],
"prompt_number": 5
"prompt_number": 8
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -199,7 +206,7 @@
]
}
],
"prompt_number": 6
"prompt_number": 9
},
{
"cell_type": "markdown",
Expand All @@ -212,7 +219,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The problem of **pairwise sequence alignment** is, **given two sequences, generate a hypothesis about which bases were derived from a common ancestor**. In other words, align one on top of the other, inserting gaps as necessary, in a way that maximizes their similarity. Sequence alignment is tricky, in part, because of insertion/deletion mutations, and in part because particularly as sequences get long, there may be many possible ways to align them and we need to figure out which of those alignments is the best hypothesis in light of what we know about the (very messy) underlying biological systems. (When the sequences get very long, sequence alignment also becomes a very computationally expensive problem - we'll come back to this part.)\n",
"The problem of **pairwise sequence alignment** is, **given two sequences, generate a hypothesis about which bases were derived from a common ancestor**. In other words, align one on top of the other, inserting gaps as necessary, in a way that maximizes their similarity. Sequence alignment is tricky, in part, because of insertion/deletion mutations, and in part because, particularly as sequences get long, there may be many possible ways to align them and we need to figure out which of those alignments is the best hypothesis in light of what we know about the (very messy) underlying biological systems. When the sequences get very long, sequence alignment also becomes a very computationally expensive problem - we'll come back to this part.\n",
"\n",
"In the next section we'll work through one algorithm for aligning a pair of sequences. As you work through this exercise, try to make a list of the assumptions that we're making that violate what you know about how sequences evolve. "
]
Expand Down Expand Up @@ -242,7 +249,7 @@
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
"prompt_number": 10
},
{
"cell_type": "markdown",
Expand All @@ -262,7 +269,7 @@
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 8
"prompt_number": 11
},
{
"cell_type": "code",
Expand Down Expand Up @@ -303,7 +310,7 @@
]
}
],
"prompt_number": 9
"prompt_number": 12
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -358,7 +365,7 @@
]
}
],
"prompt_number": 10
"prompt_number": 13
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -417,7 +424,7 @@
]
}
],
"prompt_number": 11
"prompt_number": 14
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -480,7 +487,7 @@
]
}
],
"prompt_number": 12
"prompt_number": 15
},
{
"cell_type": "markdown",
Expand All @@ -507,7 +514,7 @@
"\n",
"**Remember that an alignment represents a hypothesis about the evolutionary history of a sequence. Which of these hypotheses do you think is more likely to be true based on what you know about sequence evolution?**\n",
"\n",
"**As an exercise**, now scroll back to where we defined `seq1` and `seq2` and redefine one or both of those as other sequences. Execute that cell, and the one up to the previous cell, and transcribe the highest scoring alignment. "
"**As an exercise**, scroll back to where we defined `seq1` and `seq2` and redefine one or both of those as other sequences. Execute that cell, and the one up to the previous cell, and transcribe the highest scoring alignment. "
]
},
{
Expand All @@ -522,9 +529,9 @@
"\n",
"3. When searching a novel sequence against a database, you may have billions of bases to search against (which would correspond to billions of columns in these matrices). How can this be done efficiently? How can you determine if a hit is statistically meaningful or the result of chance?\n",
"\n",
"All scoring schemes have limitations, and you should consider alignments that come back from systems such as BLAST as hypotheses. You still need to do your due diligence to decide if you agree with the result that a computational systems gives you. They are there to help you do your work, but their answers are based on models and the models are not perfect. Be skeptical!\n",
"All scoring schemes have limitations, and you should consider alignments that come back from systems such as BLAST as hypotheses. You still need to do your due diligence to decide if you agree with the result that a computational system gives you. They are there to help you do your work, but their answers are based on models and the models are not perfect. Be skeptical!\n",
"\n",
"Over the next several sections we'll explore ways of addressing each of these complexities. This notebook covers solutions to address the first and second. We'll introduce the problem of the third in this notebook, but begin exploring solutions in the next chapter."
"Over the next several sections we'll explore ways of addressing each of these complexities. This notebook covers solutions to address the first and second. We'll introduce the problem of the third in this notebook, but save exploring solutions for the next chapter."
]
},
{
Expand Down Expand Up @@ -561,7 +568,7 @@
]
}
],
"prompt_number": 13
"prompt_number": 16
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -604,7 +611,7 @@
]
}
],
"prompt_number": 14
"prompt_number": 17
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -659,7 +666,7 @@
]
}
],
"prompt_number": 15
"prompt_number": 18
},
{
"cell_type": "heading",
Expand Down Expand Up @@ -690,7 +697,7 @@
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 16
"prompt_number": 19
},
{
"cell_type": "code",
Expand Down Expand Up @@ -720,7 +727,7 @@
]
}
],
"prompt_number": 17
"prompt_number": 20
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -764,7 +771,7 @@
]
}
],
"prompt_number": 18
"prompt_number": 21
},
{
"cell_type": "code",
Expand Down Expand Up @@ -794,7 +801,7 @@
]
}
],
"prompt_number": 19
"prompt_number": 22
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -854,7 +861,7 @@
]
}
],
"prompt_number": 20
"prompt_number": 23
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -899,7 +906,7 @@
]
}
],
"prompt_number": 21
"prompt_number": 24
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -1040,7 +1047,7 @@
]
}
],
"prompt_number": 22
"prompt_number": 25
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -1083,7 +1090,7 @@
]
}
],
"prompt_number": 23
"prompt_number": 26
},
{
"cell_type": "code",
Expand All @@ -1110,7 +1117,7 @@
]
}
],
"prompt_number": 24
"prompt_number": 28
},
{
"cell_type": "markdown",
Expand All @@ -1121,7 +1128,7 @@
"We can now read the dynamic programming and traceback matrices to transcribe and score the alignment of sequences 1 and 2. To do this, we start at the bottom right of the matrices and traceback the arrows to cell $(0, 0)$. \n",
"\n",
"* Every time we hit a vertical arrow (represented by `|`), we consume a character from sequence 2 (the vertical sequence) and add a gap to sequence 1. \n",
"* Every time we hit a horizontal arrow (represented by `-`), we consume a character from sequence 1 (the vertical sequence) and add a gap to sequence 2.\n",
"* Every time we hit a horizontal arrow (represented by `-`), we consume a character from sequence 1 (the horizontal sequence) and add a gap to sequence 2.\n",
"* Every time we hit a diagonal arrow (represented by `\\`), we consume a character from sequence 1 and sequence 2.\n",
"\n",
"As you transcribe the alignment, write sequence 1 on top of sequence 2, and work from right to left (since you are working backwards through the matrix).\n",
Expand Down Expand Up @@ -1210,7 +1217,7 @@
]
}
],
"prompt_number": 25
"prompt_number": 29
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -1242,7 +1249,7 @@
]
}
],
"prompt_number": 26
"prompt_number": 30
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -1361,7 +1368,7 @@
]
}
],
"prompt_number": 27
"prompt_number": 31
},
{
"cell_type": "code",
Expand Down Expand Up @@ -1409,7 +1416,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The alignment we just constructed is what's known as a *global alignment*, meaning we align both sequences from the beginning through the end. This has some important specific applications: for example, if we have two full-length protein sequences, and we have a crystal structure for one of them, we may want to have a direct mapping between all positions in both sequences. Probably more common however is local alignment: where we have a pair of sequences that we suspect may partially overlap each other, and we want to know what the best possible alignment of all or part of one sequence is with all or part of the other sequences. Perhaps the most widely used application of this is in the BLAST search algorithm, where we a query sequence and we want to know what the closest match (or matches) are in a reference database containing many different gene sequences. In this case, the whole reference database could be represented as a single sequence, as we could perform a local alignment against it to find the region that contains the highest scoring match. "
"The alignment we just constructed is what's known as a *global alignment*, meaning we align both sequences from the beginning through the end. This has some important specific applications: for example, if we have two full-length protein sequences, and we have a crystal structure for one of them, we may want to have a direct mapping between all positions in both sequences. Probably more common however, is local alignment: where we have a pair of sequences that we suspect may partially overlap each other, and we want to know what the best possible alignment of all or part of one sequence is with all or part of the other sequences. Perhaps the most widely used application of this is in the BLAST search algorithm, where we a query sequence and we want to know what the closest match (or matches) are in a reference database containing many different gene sequences. In this case, the whole reference database could be represented as a single sequence, as we could perform a local alignment against it to find the region that contains the highest scoring match. "
]
},
{
Expand Down Expand Up @@ -1470,7 +1477,7 @@
]
}
],
"prompt_number": 29
"prompt_number": 32
},
{
"cell_type": "code",
Expand Down Expand Up @@ -1504,7 +1511,7 @@
]
}
],
"prompt_number": 30
"prompt_number": 33
},
{
"cell_type": "markdown",
Expand Down Expand Up @@ -2438,7 +2445,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's define a loop where we align randomly pairs of sequences of increasing length, and compile the time it took to align the sequences. Here we're going to use a faster version of pairwise alignment that's implemented in scikit-bio, to faciliate testing with more alignments."
"Next, let's define a loop where we align, randomly, pairs of sequences of increasing length, and compile the time it took to align the sequences. Here we're going to use a faster version of pairwise alignment that's implemented in scikit-bio, to faciliate testing with more alignments."
]
},
{
Expand Down
Loading

0 comments on commit 3ba3f7c

Please sign in to comment.