Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
178 lines (160 sloc) 12 KB
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<title>Tutorial: Smith-Waterman</title>
<body><style>
.codehilite .hll { background-color: #ffffcc }
.codehilite { background: #f0f0f0; }
.codehilite .c { color: #60a0b0; font-style: italic } /* Comment */
.codehilite .err { border: 1px solid #FF0000 } /* Error */
.codehilite .k { color: #007020; font-weight: bold } /* Keyword */
.codehilite .o { color: #666666 } /* Operator */
.codehilite .cm { color: #60a0b0; font-style: italic } /* Comment.Multiline */
.codehilite .cp { color: #007020 } /* Comment.Preproc */
.codehilite .c1 { color: #60a0b0; font-style: italic } /* Comment.Single */
.codehilite .cs { color: #60a0b0; background-color: #fff0f0 } /* Comment.Special */
.codehilite .gd { color: #A00000 } /* Generic.Deleted */
.codehilite .ge { font-style: italic } /* Generic.Emph */
.codehilite .gr { color: #FF0000 } /* Generic.Error */
.codehilite .gh { color: #000080; font-weight: bold } /* Generic.Heading */
.codehilite .gi { color: #00A000 } /* Generic.Inserted */
.codehilite .go { color: #808080 } /* Generic.Output */
.codehilite .gp { color: #c65d09; font-weight: bold } /* Generic.Prompt */
.codehilite .gs { font-weight: bold } /* Generic.Strong */
.codehilite .gu { color: #800080; font-weight: bold } /* Generic.Subheading */
.codehilite .gt { color: #0040D0 } /* Generic.Traceback */
.codehilite .kc { color: #007020; font-weight: bold } /* Keyword.Constant */
.codehilite .kd { color: #007020; font-weight: bold } /* Keyword.Declaration */
.codehilite .kn { color: #007020; font-weight: bold } /* Keyword.Namespace */
.codehilite .kp { color: #007020 } /* Keyword.Pseudo */
.codehilite .kr { color: #007020; font-weight: bold } /* Keyword.Reserved */
.codehilite .kt { color: #902000 } /* Keyword.Type */
.codehilite .m { color: #40a070 } /* Literal.Number */
.codehilite .s { color: #4070a0 } /* Literal.String */
.codehilite .na { color: #4070a0 } /* Name.Attribute */
.codehilite .nb { color: #007020 } /* Name.Builtin */
.codehilite .nc { color: #0e84b5; font-weight: bold } /* Name.Class */
.codehilite .no { color: #60add5 } /* Name.Constant */
.codehilite .nd { color: #555555; font-weight: bold } /* Name.Decorator */
.codehilite .ni { color: #d55537; font-weight: bold } /* Name.Entity */
.codehilite .ne { color: #007020 } /* Name.Exception */
.codehilite .nf { color: #06287e } /* Name.Function */
.codehilite .nl { color: #002070; font-weight: bold } /* Name.Label */
.codehilite .nn { color: #0e84b5; font-weight: bold } /* Name.Namespace */
.codehilite .nt { color: #062873; font-weight: bold } /* Name.Tag */
.codehilite .nv { color: #bb60d5 } /* Name.Variable */
.codehilite .ow { color: #007020; font-weight: bold } /* Operator.Word */
.codehilite .w { color: #bbbbbb } /* Text.Whitespace */
.codehilite .mf { color: #40a070 } /* Literal.Number.Float */
.codehilite .mh { color: #40a070 } /* Literal.Number.Hex */
.codehilite .mi { color: #40a070 } /* Literal.Number.Integer */
.codehilite .mo { color: #40a070 } /* Literal.Number.Oct */
.codehilite .sb { color: #4070a0 } /* Literal.String.Backtick */
.codehilite .sc { color: #4070a0 } /* Literal.String.Char */
.codehilite .sd { color: #4070a0; font-style: italic } /* Literal.String.Doc */
.codehilite .s2 { color: #4070a0 } /* Literal.String.Double */
.codehilite .se { color: #4070a0; font-weight: bold } /* Literal.String.Escape */
.codehilite .sh { color: #4070a0 } /* Literal.String.Heredoc */
.codehilite .si { color: #70a0d0; font-style: italic } /* Literal.String.Interpol */
.codehilite .sx { color: #c65d09 } /* Literal.String.Other */
.codehilite .sr { color: #235388 } /* Literal.String.Regex */
.codehilite .s1 { color: #4070a0 } /* Literal.String.Single */
.codehilite .ss { color: #517918 } /* Literal.String.Symbol */
.codehilite .bp { color: #007020 } /* Name.Builtin.Pseudo */
.codehilite .vc { color: #bb60d5 } /* Name.Variable.Class */
.codehilite .vg { color: #bb60d5 } /* Name.Variable.Global */
.codehilite .vi { color: #bb60d5 } /* Name.Variable.Instance */
.codehilite .il { color: #40a070 } /* Literal.Number.Integer.Long */
</style>
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-377063-3']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
<h1>Tutorial: Smith-Waterman</h1>
<p><em>This is a work-in-progress page and probably not ready for general consumption yet</em></p>
<p>Skywriting is a task-parallel language and execution engine for programming clusters of computing resources.
It has a familiar Javascript-like syntax, but with a few changes to adapt it to a parallel environment.</p>
<p>In this intermediate tutorial, we describe how to implement the <a href="http://en.wikipedia.org/wiki/Smith–Waterman_algorithm">Smith-Waterman</a> string matching algorithm.</p>
<h2>How it works</h2>
<p>Smith-Waterman uses dynamic programming to obtain the optimal alignment between two strings (useful for example in gene splicing in bioinformatics).
However for inputs of length <em>m</em> and <em>n</em>, it requires <em>O(mn)</em> time, which limits its usefulness for long DNA sequences.</p>
<p>The algorithm computes a cost matrix <em>H</em> of which each element <em>H<sub>i,j</sub></em> depends on <em>H<sub>i-1,j-1</sub></em>, <em>H<sub>i-1,j</sub></em> and <em>H<sub>i,j-1</sub></em>.
At the start of the algorithm, no parallelism is possible, but as the algorithm progresses, a growing "wavefront" of independent values becomes calculable.
The cost matrix can be divided into sub-matrices, which exhibit the same data dependency pattern.</p>
<p>You can see this visualised as a live animation with a varying task size and number of parallel workers running <em>(uses HTML5, so a modern Safari, Firefox or Chrome needed)</em>:</p>
<table cellspacing="5">
<tr><th>Grid Size</th><th colspan="3">Number of Workers</th></tr>
<tr><td>10</td><td><a href="data/skylight.html?grid=10&workers=10" target="_blank">10</a></td><td><a href="data/skylight.html?grid=10&workers=20" target="_blank">20</a></td><td><a href="data/skylight.html?grid=10&workers=50" target="_blank">50</a></td></tr>
<tr><td>12</td><td><a href="data/skylight.html?grid=12&workers=10" target="_blank">10</a></td><td><a href="data/skylight.html?grid=12&workers=20" target="_blank">20</a></td><td><a href="data/skylight.html?grid=12&workers=50" target="_blank">50</a></td></tr>
<tr><td>15</td><td><a href="data/skylight.html?grid=15&workers=10" target="_blank">10</a></td><td><a href="data/skylight.html?grid=15&workers=20" target="_blank">20</a></td><td><a href="data/skylight.html?grid=15&workers=50" target="_blank">50</a></td></tr>
<tr><td>17</td><td><a href="data/skylight.html?grid=17&workers=10" target="_blank">10</a></td><td><a href="data/skylight.html?grid=17&workers=20" target="_blank">20</a></td><td><a href="data/skylight.html?grid=17&workers=50" target="_blank">50</a></td></tr>
<tr><td>20</td><td><a href="data/skylight.html?grid=20&workers=10" target="_blank">10</a></td><td><a href="data/skylight.html?grid=20&workers=20" target="_blank">20</a></td><td><a href="data/skylight.html?grid=20&workers=50" target="_blank">50</a></td></tr>
</table>
<p>Each node in the graph represents an execution of the serial Smith-Waterman algorithm on portions of the two strings.
A task progresses through being constructed (red node), becoming runnable (green node), to completing (grayscale node). The colour of the completed nodes represents the <em>efficiency</em> of the cluster at the time the task was completed, with white meaning fully utilised and black unused.</p>
<p>Look at the <a href="data/50x50-50w-result.jpg">completed 50x50 graph</a> which shows how the utilisation is at its best at the diagonal of the grid for Smith Waterman.
This is a nice way of visualising how well your algorithms are using the resources you give it.</p>
<h2>Initializing</h2>
<p>We start off the algorithm by defining the size of the grid, and the location of the source data.</p>
<div class="codehilite"><pre>num_rows = 10;
num_cols = 10;
horiz_source = ref(&quot;http://www.cl.cam.ac.uk/~dgm36/horizontal_string_random&quot;);
vert_source = ref(&quot;http://www.cl.cam.ac.uk/~dgm36/vertical_string_random&quot;);
java_lib = [ref(&quot;http://www.cl.cam.ac.uk/~dgm36/dp.jar&quot;)];</pre></div>
<p>The <code>ref()</code> function constructs a reference, which Skywriting uses to name large amounts of data without loading them.
Passing a reference as a parameter to a function copies the reference but does not copy the referenced data.
A Skywriting reference and the data to which it refers are immutable. Once constructed, a reference and its underlying data may be cached across the cluster, to improve efficiency.</p>
<div class="codehilite"><pre>horiz_chunks = spawn_exec(&quot;java&quot;, { &quot;inputs&quot; : [horiz_source], &quot;lib&quot; : java_lib,
&quot;class&quot; : &quot;tests.dp.PartitionInputString&quot;, &quot;argv&quot;:[] }, num_cols);
vert_chunks = spawn_exec(&quot;java&quot;, {&quot;inputs&quot;:[vert_source], &quot;lib&quot;:java_lib,
&quot;class&quot; : &quot;tests.dp.PartitionInputString&quot;, &quot;argv&quot;:[] }, num_rows);</pre></div>
<p>In a Skywriting program, data- and CPU-intensive tasks are delegated to external code.
The <code>exec()</code> function synchronously executes the code and the <em>spawn()</em> constructs a new parallel task.
The convenience function <code>spawn_exec()</code> used above performs both, and returns a reference to the newly formed task.</p>
<p>All calls to <code>exec()</code> specify the name of an executor plugin to define how to run the code, and the arguments typically include one or more references.
In the above code, the Java plugin executes a method call that partitions the input strings into 10 chunks.
Meanwhile, the script continues executing, and <code>horiz_chunks</code> and <code>vert_chunks</code> are references to the output for when it completes.</p>
<div class="codehilite"><pre>blocks = [];
blocks[0] = [];
blocks[0][0] = spawn_exec(&quot;java&quot;,
{ &quot;argv&quot; : [&quot;tl&quot;, -1, -1, -1, 2], &quot;lib&quot; : java_lib,
&quot;class&quot; : &quot;tests.dp.SmithWaterman&quot;, &quot;inputs&quot; : [horiz_chunks[0], vert_chunks[0] ]
}, 3);
for (j in range(1, num_cols)) {
blocks[0][j] = spawn_exec(&quot;java&quot;,
{ &quot;argv&quot;:[&quot;t&quot;, -1, -1, -1, 2], &quot;lib&quot; : java_lib,
&quot;class&quot; : &quot;tests.dp.SmithWaterman&quot;,
&quot;inputs&quot; : [horiz_chunks[j], vert_chunks[0], blocks[0][j-1][2]]
}, 3);
}</pre></div>
<p>explain for loop</p>
<div class="codehilite"><pre>i = 0;
j = 0;
for (i in range(1, num_rows)) {
blocks[i] = [];
blocks[i][0] = spawn_exec(&quot;java&quot;,
{ &quot;argv&quot; : [&quot;l&quot;, -1, -1, -1, 2], &quot;lib&quot;:java_lib,
&quot;class&quot; : &quot;tests.dp.SmithWaterman&quot;,
&quot;inputs&quot; : [ horiz_chunks[0], vert_chunks[i], blocks[i-1][0][1] ]
}, 3);
for (j in range(1, num_cols)) {
blocks[i][j] = spawn_exec(&quot;java&quot;,
{ &quot;argv&quot; : [&quot;i&quot;, -1, -1, -1, 2], &quot;lib&quot; : java_lib,
&quot;class&quot; : &quot;tests.dp.SmithWaterman&quot;,
&quot;inputs&quot; : [ horiz_chunks[j], vert_chunks[i], blocks[i-1][j-1][0],
blocks[i-1][j][1], blocks[i][j-1][2] ]
}, 3);
}
}</pre></div>
<p>more</p>
<div class="codehilite"><pre>ignore = exec(&quot;stdinout&quot;,
{ &quot;inputs&quot; : [blocks[i][j][0]],
&quot;command_line&quot; : [&quot;cat&quot;]
}, 1);
return ignore;</pre></div>
<p>Finally, we gather the results and pipe them through the UNIX <code>stdinout</code> executor, which passes the references to a shell script.
In this case, the <code>cat</code> command concatenates all the outputs into a single file.</p></body>