# andyroberts/jTokenizer

A Java library for splitting text into constituent words. This can be tricky for non-trivial examples, therefore the jTokenizer package was designed to combine a set of tokenizers that range from basic whitespace tokenizers to more complex ones that deal intuitively with natural language.

jTokenizer - v.2.0 - README

Andrew Roberts (16-Jul-2006)

http://www.andy-roberts.net/coding/jtokenizer/

Overview
========

Tokenizing a string into its constituent words/tokens can prove tricky for
non-trivial examples. In particular, when dealing with natural language, you
must also take punctuation into account in order to isolate the words. The
jTokenizer package was designed to combine a set of tokenizers that range from
basic whitespace tokenizers to more complex ones that deal intuitively with
natural language.

Each of the tokenizers adopts a similar structure to java.util.StringTokenizer
in terms of how to instantiate the classes and extract the tokens. This means
they are simple to use.
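
For reference, the java.util.StringTokenizer idiom that the tokenizers are said
to mirror looks like the sketch below. It uses only the standard JDK class, not
jTokenizer itself; the class name TokenizerIdiomDemo and the helper tokens()
are made up for illustration, and the exact jTokenizer class and method names
should be checked against the library's own documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class: shows the JDK idiom only, not the jTokenizer API.
public class TokenizerIdiomDemo {

    // Collects every token using the hasMoreTokens()/nextToken() loop.
    static List<String> tokens(String text) {
        // With no delimiter argument, java.util.StringTokenizer splits on
        // space, tab, newline, carriage return and form feed.
        StringTokenizer tokenizer = new StringTokenizer(text);
        List<String> result = new ArrayList<String>();
        while (tokenizer.hasMoreTokens()) {
            result.add(tokenizer.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String token : tokens("The quick brown fox.")) {
            System.out.println(token);
        }
    }
}
```

Note that the last token still carries its full stop ("fox."), which is the
kind of problem the more advanced tokenizers are meant to address.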

What's new in 2.0?
==================

* A new GUI front-end to the jTokenizer library. You can type in, copy and
paste, or even load a text file into the application. Select your tokenizer
of choice (and any options of interest) and then hit the Tokenize button.
Your results will be displayed as soon as they are processed, and you can
save them to a file.

The GUI is particularly useful for experimenting with tokenization methods
in a teaching environment (such as an NLP course). It will also be of
interest to those who wish to use the jTokenizer library but do not have the
Java programming experience to utilise the code directly.

NB There have been no changes to the core tokenizer libraries and the API
remains fully compatible with prior versions.

Features
========

jTokenizer comprises six tokenizers that all extend from an abstract
Tokenizer class:

* WhiteSpaceTokenizer - this splits a string on all occurrences of whitespace,
which includes spaces, newlines, tabs and linefeeds.

* StringTokenizer - this is basically the same as java.util.StringTokenizer
with some extra methods (and extends from Tokenizer). Its default behaviour
is to act as a WhiteSpaceTokenizer; however, you can specify a set of
characters to be used as word delimiters.

* RegexTokenizer - this tokenizer is much more flexible as you can use regular
expressions to define what a token is. So, "\\w+" means that whenever it
matches one or more word characters (letters, digits or underscores), it will
consider that a word. By default, it uses a regular expression equivalent to
a whitespace tokenizer.

* RegexSeparatorTokenizer - this can be thought of as an advanced
StringTokenizer. Whereas StringTokenizer is limited to defining delimiters
as a set of individual characters, RegexSeparatorTokenizer can utilise
regular expressions for a richer and more flexible approach.

* BreakIteratorTokenizer - the most sophisticated of the word tokenizers,
although it should only be used on natural language strings to isolate words.
It comes with built-in rules about how to find words, knowing how to disregard
punctuation, etc.

* SentenceTokenizer - this also uses a BreakIterator like the above, but tuned
towards finding sentence boundaries. The "tokens" in this tokenizer are in
fact individual sentences.
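
To make the differences between these strategies concrete, here is a rough
sketch of the underlying ideas using only standard JDK classes
(java.util.regex and java.text.BreakIterator). It is not the jTokenizer
implementation; the class TokenizingStrategiesDemo and its method names are
hypothetical and exist only for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class built on the JDK only, not on jTokenizer.
public class TokenizingStrategiesDemo {

    // The regex describes the tokens themselves (the RegexTokenizer idea).
    static List<String> matchTokens(String text, String tokenRegex) {
        List<String> tokens = new ArrayList<String>();
        Matcher matcher = Pattern.compile(tokenRegex).matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    // The regex describes the separators (the RegexSeparatorTokenizer idea).
    static List<String> splitOnSeparators(String text, String separatorRegex) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : text.split(separatorRegex)) {
            if (piece.length() > 0) {   // skip empty pieces between separators
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Word boundaries from BreakIterator; punctuation and spaces are dropped.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> tokens = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Sentence boundaries from BreakIterator; each "token" is a sentence.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> tokens = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end).trim();
            if (piece.length() > 0) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello, world! It works.";
        System.out.println(matchTokens(text, "\\w+"));
        System.out.println(splitOnSeparators(text, "[\\s,!.]+"));
        System.out.println(words(text));
        System.out.println(sentences(text));
    }
}
```

The first two methods differ only in what the regular expression describes:
the tokens, or the gaps between them. The BreakIterator methods need no
pattern at all because the rules are built in.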

Installation
============

The jTokenizer package doesn't need installing as such. You simply have to
uncompress the downloaded file and make sure that the Java virtual machine can
"find" the JAR file it contains.

To uncompress the file, there are many utilities. On Windows, a popular utility
is WinZip. On most platforms, there are command-line tools, such as 'unzip'
that can also be used.

It contains the following:

./jTokenizer-2.0.jar
./lib/swing-layout-1.0.jar  (additional library required for the GUI if Java
version is less than v6.0)

Important note:
In order to use jTokenizer, you need to have the Java Runtime Environment
installed. It requires Java 5.0 or above.

To obtain Java (or update to the latest version), go to http://www.java.com and
it will automatically detect the version that you need to download and install.

Running the jTokenizer GUI
==========================

On Windows:
When you install the Java Runtime, it normally associates .jar files
with a jar-runner program. Therefore, just double-click the
jTokenizer-2.0.jar file and the GUI should load promptly.

On all platforms:
At the command line, change to the directory with the jar file and type:
java -jar jTokenizer-2.0.jar

Using the jTokenizer library in your programs
=============================================

The package is bundled together as a JAR file, which is a Java archive
containing all the classes. A JAR is actually compressed using the well-known
zip algorithms. The advantage of using JARs is that you can keep lots of
related classes together in a single file, without having to uncompress them.

All Java needs to know is where the JAR file is, and there are a couple of
ways of achieving this. Imagine you have a class called
ClassThatTokenizes.java that uses a tokenizer from this package. To compile
and run:

1. Specifying at the command-line

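javac -classpath /path/to/jTokenizer-2.0.jar ClassThatTokenizes.java
java -classpath /path/to/jTokenizer-2.0.jar:. ClassThatTokenizes

NB the ":." adds the current directory to the classpath, so that Java can also
find the newly compiled ClassThatTokenizes class. In Windows, the path would
be more like c:\path\to\jTokenizer-2.0.jar and the separator is ";" rather
than ":".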

2. Setting the CLASSPATH environment variable.

In Linux:

export CLASSPATH=$CLASSPATH:/path/to/jTokenizer-2.0.jar   (for bash)
setenv CLASSPATH ${CLASSPATH}:/path/to/jTokenizer-2.0.jar  (for csh)

javac ClassThatTokenizes.java
java ClassThatTokenizes

In Windows:

set CLASSPATH=%CLASSPATH%;c:\path\to\jTokenizer-2.0.jar

NB you can set the CLASSPATH via Control Panel/System/Advanced/Environment
Variables
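
If Java still reports that it cannot find the tokenizer classes, it can help to
check what classpath the virtual machine actually sees. The following is a
small diagnostic sketch, not part of jTokenizer; the class name ClasspathCheck
and the helper containsJar() are invented for this example. It reads the
standard "java.class.path" system property and looks for the JAR by file name.

```java
import java.io.File;
import java.util.regex.Pattern;

// Hypothetical diagnostic class, not shipped with jTokenizer.
public class ClasspathCheck {

    // Returns true if any classpath entry has the given file name.
    // The separator is ":" on Linux and ";" on Windows.
    static boolean containsJar(String classpath, String jarName, String separator) {
        for (String entry : classpath.split(Pattern.quote(separator))) {
            // Strip the directory part, accepting both "/" and "\" in paths.
            int cut = Math.max(entry.lastIndexOf('/'), entry.lastIndexOf('\\'));
            if (entry.substring(cut + 1).equals(jarName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String classpath = System.getProperty("java.class.path");
        System.out.println("Classpath: " + classpath);
        boolean found = containsJar(classpath, "jTokenizer-2.0.jar", File.pathSeparator);
        System.out.println(found ? "jTokenizer JAR is on the classpath."
                                 : "jTokenizer JAR is NOT on the classpath.");
    }
}
```

Compile and run it in the same shell where you set the classpath, e.g.
"javac ClasspathCheck.java" followed by "java ClasspathCheck".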