Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
528 lines (473 sloc) 15.4 KB
<html>
<head>
<title>Propose Changes to clojure.contrib.str-utils</title>
<style>
.code-block{
background-color:#FFFFCC;
border-color:#CCCCCC;
border-width:2px;
border-style:solid;
width:35em;
margin-left:auto;
margin-right:auto;
margin-top:10px;
margin-bottom:10px;
padding:5px;
}
body{
width:40em;
margin-left:auto;
margin-right:auto;
background-color:#3333FF;
margin-top:0px;
margin-bottom:0px;
}
#page-content{
border-color:#990000;
border-width:0px 2px 0px 2px;
border-style:solid;
background-color:#FFFFFF;
padding:5px;
margin:0px;
}
.ns{
color:green;
font-weight:bold;
}
</style>
</head>
<body>
<div id="page-content">
<h1 style="text-align:center">My Proposed changes to str-utils </h1>
<h4 style="text-align:center">Sean Devlin</h4>
<h4 style="text-align:center">May 13, 2009</h4>
<p>
I've been reviewing the str-utils package, and I'd like to propose a few changes to the library. I've included the code at the bottom.
</p>
<h2>Use Multi-Methods</h2>
<p>
I'd like to propose re-writing the following methods to used multi-methods. Every single method will take an input called input-string,
and a variable set of inputs called remaining-inputs. The mutli-dispatch will make decide what to do based on the remaining inputs.
Specifically, I've used
</p>
<div class="code-block">
<code>
(class (first remaining-inputs))<br>
</code>
</div>
<p>
repeatedly. The two most interesting classes are <code class="ns">java.util.regex.Pattern</code>, and <code class="ns">clojure.lang.PersistentList</code>. I deliberately decided to <b>not</b>
use sequences, because I believed order was important. One method takes a map as an input, but this is so that a tuple could be passed as an options
hash.
</p>
<h2><code>re-partion[input-string & remaining-inputs](...)</code></h2>
<div class="function-description">
This methods behaves like the original re-partition method, with the remaining-inputs being able to a list or a pattern. It returns a lazy sequence, and
is used as a basis for for several other methods.
</div>
<h2><code>re-split[input-string & remaining-inputs](...)</code></h2>
<div class="function-description">
The remaining inputs can be dispatched based on a regex pattern, a list of patterns, or a map. The regex method is the basis, and does the actual work.<br>
<h4>Regex</h4>
This method splits a string into a list based on a regex. It depends on the re-partition method, and returns a lazy sequence.<br>
<div class="code-block">
<code>
(re-split "1 2 3\n4 5 6" #"\n") => ("1 2 3" "4 5 6")<br>
</code>
</div>
<h4>Map</h4>
This splits each element based on the inputs of the map. It is how options are passed to the method.<br>
<div class="code-block">
<code>
(re-split "1 2 3" {:pattern #"\s+" :offset 1}) => (2.0 3.0)<br>
(re-split "1 2 3" {:pattern #"\s+" :length 2}) => (1.0 2.0)<br>
(re-split "1 2 3" {:pattern #"\s+" :marshal-fn #(java.lang.Double/parseDouble %)}) => (1.0 2.0 3.0)<br>
</code>
</div>
The <code>:pattern</code>, <code>:offset</code>, and <code>:length</code> options are relatively straightforward. The :marshal-fn is mapped after the string is split.<br>
<h4>List</h4>
This splits each element either like a map (datatype) or a regex. The map operator is applied recursively to each element<br>
<div class="code-block">
<code>
(re-split "1 2 3\n4 5 6" (list #"\n" #"\s+")) => (("1" "2" "3") ("4" "5" "6"))<br>
</code>
</div>
<h4>Chaining</h4>
These items can be chained together, as the following example shows<br>
<div class="code-block">
<code>
(re-split "1 2 3\n4 5 6" <br>
(list #"\n" {:pattern #"\s+" <br>
:length 2 <br>
:marshal-fn #(java.lang.Double/parseDouble %)}))<br>
=> ((1.0 2.0) (4.0 5.0))<br>
</code>
</div>
In my opinion, the <code>:marshal-fn</code> is best used at the end of the list. However, it could be used earlier in the list, but a exception will most likely be thrown.
</div>
<h2><code>re-gsub[input-string & remaining-inputs](...)</code></h2>
<div class="function-description">
This method can take a list or two atoms as the remaining inputs.<br>
Two atoms<br>
<div class="code-block">
<code>
(re-gsub "1 2 3 4 5 6" #"\s" "") => "123456"<br>
</code>
</div>
A paired list<br>
<div class="code-block">
<code>
(re-gsub "1 2 3 4 5 6" '((#"\s" " ) (#"\d" "D"))) => "DDDDDD"<br>
</code>
</div>
</div>
<h2><code>re-sub[input-string & remaining-inputs](...)</code></h2>
<div class="function-description">
Again, this method can take a list or two atoms as the remaining inputs.<br>
Two atoms<br>
<div class="code-block">
<code>
(re-sub "1 2 3 4 5 6" #"\d" "D") => "D 2 3 4 5 6"<br>
</code>
</div>
A paired list<br>
<div class="code-block">
<code>
(re-sub "1 2 3 4 5 6" '((#"\d" "D") (#"\d" "E"))) => "D E 3 4 5 6"<br>
</code>
</div>
</div>
<h2>The <code>nearby</code> Function</h2>
<p>
The nearby function is designed to assist with a spell checker, inspired by the example from Peter Norvig.
<h3>Signatures</h3>
<code>
<ul>
<li><b>nearby</b> input-string</li>
<li><b>nearby</b> input-string seq</li>
</ul>
</code>
Here's an example.
<div class="code-block">
<code>
(nearby "cat" (seq "abc")) => <br>("act" "atc"<br>
"acat" "aat" "caat" "cat" "caat" "caa" "cata"<br>
"bcat" "bat" "cbat" "cbt" "cabt" "cab" "catb"<br>
"ccat" "cat" "ccat" "cct" "cact" "cac" "catc")
</code>
</div>
The resulting sequence is lazy. In order to use it in a spellchecker, try using it like this:
<div class="code-block">
<code>
(set (take <i>number</i> (nearby "cat" (seq "abc"))))
</code>
</div>
If the function is called with only one argument, it behaves like this.
<div class="code-block">
<code>
(nearby "cat") =><br> (nearby "cat" (cons "" "etaoinshrdlcumwfgypbvkjxqz"))
</code>
</div>
The strange order was chosen because that is the english alphabet sorted by frequency. This way the earliest entries will have the
highest chance of being a valid word.
</p>
<h2>String Seq Utils</h2>
<div class="function-description">
The contrib version of str-utils contains the <code>str-join</code> function. This is a string specific version of the more general <cod>interpose</code> function. It inspired the creation of
four other functions, <code>str-take, str-drop, str-rest & str-reverse</code>. The mimic the behavior of the regular sequence operations, with the exception that they return strings instead of
a sequence. Also, some of them can alternately take a regex as an input.<br>
<h2><code>str-take</code></h2>
<p>
This function is designed to be similar to the <code>take</code> function from the core. It specifically applies the <code>str</code> function to the resulting sequence. Also, it can take a regex
instead of an integer, and will take everything before the regex. Be careful not to combine a regex and a sequence, as this will cause an error. Finally, an optional <code>:include</code> parameter
can be passed to include the matched regex.
</p>
<div class="code-block">
<code>
<table>
<tr>
<td>(str-take 7 "Clojure Is Awesome")</td>
<td>=></td>
<td>"Clojure"</td>
</tr>
<tr>
<td>(str-take 2 ["Clojure" "Is" "Awesome"])</td>
<td>=></td>
<td>"ClojureIs"</td>
</tr>
<tr>
<td>(str-take #"\s+" "Clojure Is Awesome")</td>
<td>=></td>
<td>"Clojure"</td>
</tr>
<tr>
<td>(str-take #"\s+" "Clojure Is Awesome" <br>&nbsp&nbsp {:include true})</td>
<td>=></td>
<td>"Clojure "</td>
</tr>
<tr>
<td>(str-take #"\s+" ["Clojure" "Is" "Awesome"])</td>
<td>=></td>
<td style="color:red;"><b>error</b></td>
</tr>
</table>
</code>
</div>
<h2><code>str-drop</code></h2>
<p>
This function is designed to be similar to the <code>drop</code> function from the core. It specifically applies the <code>str</code> function to the resulting sequence. Also, it can take a regex
instead of an integer, and will take everything after the regex. Be careful not to combine a regex and a sequence, as this will cause an error. Finally, an optional <code>:include</code> parameter
can be passed to include the matched regex.
</p>
<div class="code-block">
<code>
<table>
<tr>
<td>(str-drop 8 "Clojure Is Awesome")</td>
<td>=></td>
<td>"Is Awesome"</td>
</tr>
<tr>
<td>(str-drop 1 ["Clojure" "Is" "Awesome"])</td>
<td>=></td>
<td>"IsAwesome"</td>
</tr>
<tr>
<td>(str-drop #"\s+" "Clojure Is Awesome")</td>
<td>=></td>
<td>"Is Awesome"</td>
</tr>
<tr>
<td>(str-drop #"\s+" "Clojure Is Awesome" <br>&nbsp&nbsp {:include true})</td>
<td>=></td>
<td>" Is Awesome"</td>
</tr>
<tr>
<td>(str-drop #"\s+" ["Clojure" "Is" "Awesome"])</td>
<td>=></td>
<td style="color:red;"><b>error</b></td>
</tr>
</table>
</code>
</div>
<h2><code>str-rest</code></h2>
<p>This function applies <code>str</code> to the <code>rest</code> of the input. It is equivalent to <code>(str-drop 1 <i>input</i>)</code></p>
<div class="code-block">
<code>
<table>
<tr>
<td>(str-rest (str :Clojure))</td>
<td>=></td>
<td>"Clojure"</td>
</tr>
</table>
</code>
</div>
<!--
<table>
<tr>
<td></td>
<td>=></td>
<td></td>
</tr>
</table> -->
<h2><code>str-reverse</code></h2>
<div class="brief-function-description">
This methods reverses a string<br>
<div class="code-block">
<code>
(str-reverse "Clojure") => "erujolC"
</code>
</div>
</div>
<h3>An Example</h3>
These methods can be used to help parse strings, such as below.<br>
<div class="code-block">
<code>
(str-take "&gt" (str-drop #"&lt h4" "&lt h4 ... &gt")) <br>=> ;the stuff in the middle
</code>
</div>
</div>
<h2>
New Inflectors
</h2>
I've added a few inflectors that I am familiar with from Rails. My apologies if their origin is anther language. I'd be interested in knowing where the method originated
<h4>trim</h4>
<div class="brief-function-description">
This is a convenience wrapper for the trim method java supplies<br>
<div class="code-block">
<code>
(trim " Clojure ") => "Clojure"
</code>
</div>
</div>
<h4>strip</h4>
<div class="brief-function-description">
This is an alias for trim. I accidently switch between <code>trim</code> and <code>strip</code> all the time.<br>
<div class="code-block">
<code>
(strip " Clojure ") => "Clojure"
</code>
</div>
</div>
<h4>ltrim</h4>
<div class="brief-function-description">
This method removes the leading whitespace<br>
<div class="code-block">
<code>
(ltrim " Cloure ") => "Clojure "
</code>
</div>
</div>
<h4>rtrim</h4>
<div class="brief-function-description">
This method removes the trailing whitespace<br>
<div class="code-block">
<code>
(ltrim " Cloure ") => " Clojure"
</code>
</div>
</div>
<h4>downcase</h4>
<div class="brief-function-description">
This is a convenience wrapper for the toLowerCase method java supplies<br>
<div class="code-block">
<code>
(downcase "Clojure") => "clojure"
</code>
</div>
</div>
<h4>upcase</h4>
<div class="brief-function-description">
This is a convenience wrapper for the toUpperCase method java supplies<br>
<div class="code-block">
<code>
(upcase "Clojure") => "CLOJURE"
</code>
</div>
</div>
<h4>capitalize</h4>
<div class="brief-function-description">
This method capitalizes a string<br>
<div class="code-block">
<code>
(capitalize "clojure") => "Clojure"
</code>
</div>
</div>
<h4>titleize, camelize, dasherize, underscore</h4>
<div class="brief-function-description">
These methods manipulate "sentences", producing a consistent output. Check the unit tests for more examples<br>
<div class="code-block">
<code>
<table>
<tr>
<td>(titleize "clojure iS Awesome")</td>
<td>=></td>
<td>"Clojure Is Awesome"</td>
</tr>
<tr>
<td>(camleize "clojure iS Awesome")</td>
<td>=></td>
<td>"clojureIsAwesome"</td>
</tr>
<tr>
<td>(dasherize "clojure iS Awesome")</td>
<td>=></td>
<td>"clojure-is-awesome"</td>
</tr>
<tr>
<td>(underscore "clojure iS Awesome")</td>
<td>=></td>
<td>"clojure_is_awesome"</td>
</tr>
</table>
</code>
</div>
</div>
<h4>pluralize</h4>
<div>
This is an early attempt at Rails' <code>pluralaize</code> function. The code for the <code>pluralize</code>
function was based on functions contributed by Brian Doyle.
<div class="code-block">
<code>
<table>
<tr>
<td>(pluralize "foo")</td>
<td>=></td>
<td>"foos"</td>
</tr>
<tr>
<td>(pluralize "beach")</td>
<td>=></td>
<td>"beaches"</td>
</tr>
<tr>
<td>(pluralize "baby")</td>
<td>=></td>
<td>"babies"</td>
</tr>
<tr>
<td>(pluralize "bus")</td>
<td>=></td>
<td>"buses"</td>
</tr>
</table>
</code>
</div>
</div>
<h4>singularize</h4>
<div>
This is an early attempt at Rails' <code>singularize</code> function. The code for the <code>singulaize</code>
function was based on functions contributed by Brian Doyle.
<div class="code-block">
<code>
<table>
<tr>
<td>(singularize "foos")</td>
<td>=></td>
<td>"foo"</td>
</tr>
<tr>
<td>(singularize "beaches")</td>
<td>=></td>
<td>"beach"</td>
</tr>
<tr>
<td>(singularize "babies")</td>
<td>=></td>
<td>"baby"</td>
</tr>
<tr>
<td>(singularize "stops")</td>
<td>=></td>
<td>"stop"</td>
</tr>
</table>
</code>
</div>
</div>
<h2>Closing thoughts</h2>
<p>
There are three more methods, <code>str-join, chop, & chomp</code> that were already in str-utils. I changed the implementation of the methods, but the behavior should be the same.
</p>
<p>
There is a big catch with my proposed change. The signature of re-split, re-partition, re-gsub and re-sub changes. They will not be backwards compatible, and will break code. However, I think the flexibility is worth it.
</p>
<h2>TO-DOs</h2>
There are a few more things I'd like to add, but that could done at a later date.
<ul>
<li>Add more inflectors</li>
</ul>
The following additions become pretty easy if the propsed re-gsub is included:
<ul>
<li>Add HTML-escape function (like Rails' <code>h</code> method)</li>
<li>Add Javascript-escape function (like Rails' <code>escape_javascript</code> method)</li>
<li>Add SQL-escape function</li>
</ul>
<p>
Okay, that's everything I can think of for now. I'd like to thank the Stuart Sierra, and all of the contributors to this library. This is possible because I'm standing on their shoulders.
</p>
</div>
</body>
</html>