Skip to content
A simple wrapper of the Lucene text tokenizer
Clojure
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
src/clj_tokenizer
test/clj_tokenizer/test
.gitignore
README.md
project.clj

README.md

clj-tokenizer

A simple Clojure wrapper around the Lucene text tokenizer. A wrapper for the Lucene StandardAnalyzer and Lucene StandardTokenizer are provided.

For a proper Clojure library for NLP see clojure-nlp.

The project can run from the command line and will tokenize each line of stdin, remove stopwords and write to stdout.

Usage

First clone the project. Then set up your

  lein deps
  lein compile; lein uberjar

For example, to use the tokenizer from the command line use java -jar

  curl http://www.gutenberg.org/cache/epub/2701/pg2701.txt | java -jar clj-tokenizer-0.1.0-SNAPSHOT-standalone.jar | head -100

will tokenize Herman Melville's Moby Dick.

To use the tokenizer within Clojure first add the dependency to project.clj

  [clj-tokenizer "0.1.0"]

To create a token stream:

  (token-seq (token-stream "This is a string."))
  ;; ("This" "is" "a" "string")

To convert to lowercase and remove stopwords:

  (token-seq (token-stream-without-stopwords "This is a string, without the stopwords."))
  ;; ("string" "without" "stopwords")

To stem the words using the Snowball stemmer:

  (token-seq (stemmed (token-stream-without-stopwords "Going to be Stemming some lemmings.")))
  ;; ("go" "stem" "some" "lem")

License

Copyright (C) 2010 Erik Andrejko

Distributed under the Eclipse Public License, the same as Clojure.

You can’t perform that action at this time.