Home

Victor edited this page Aug 30, 2013 · 3 revisions
Clone this wiki locally

The Xerox Part-of-Speech Tagger was developed by Doug Cutting and Jan Pedersen of the Xerox Palo Alto Research Center. It was released somewhere in the early 1990-ies, probably 1993.

The tagger is written in ANSI Common Lisp. Development was done in Franz Allegro Common Lisp version 4.1, though the code should port more or less unchanged to other ANSI Common Lisp implementations. The sources in this repository were proven to work on modern SBCL and Clozure CL implementations.

Distribution includes (in the src/common-lisp/ subdirectory) a compatibility package intended to bring non-ANSI Common Lisp implementations into conformance.

Installation and Loading

Until this project is added to the Quicklisp repository installation must be performed manually in several steps (considering that you've got Quicklisp installed already):

  1. Download project sources (either by cloning the repository or downloading it in an archive).
  2. Unpack them to some directory, remember it.
  3. cd to the ~/quicklisp/local-projects directory.
  4. Create symbolic links to the .asd files in the directory from step 2.

Now it is possible to download the application either in parts or entirely:

(ql:quickload "tagger")

Running the Tagger

In addition to the source code, distribution includes:

  • a tokenizer for plain ASCII english;
  • a lexicon for English induced from the Brown corpus;
  • a table mapping word suffixes to likely ambiguity classes (to handle words not in the lexicon) inferred from the Brown corpus; and
  • an HMM (hidden Markov model) trained on the odd numbered sentences in the Brown corpus.

These are provided in the system :tag-english, used in the examples in this section. With this lexicon and HMM, correct tags are assigned to 96% of the words in the even numbered sentences of the Brown corpus. Note however that, when training this HMM, no unknown forms were encountered, providing no training data for forms assigned the open class (i.e., unknown words with unknown suffixes). One should consider retraining on representative text to defeat this problem. An HMM trained on the full text of the novel Moby Dick is included in the subdirectory data/moby-dick.

The functions tag-analysis:tag-file and tag-analysis:tag-string provide convenient diagnostic interfaces to the tagger. Both functions print an annotated version of the text passed to them. Examples of their use follows.

? (tag-analysis:tag-file "copy.txt")
A copying machine is a device that produces copies of original documents.
at/2 vbg nn bez at nn wps/3 vbz nns in jj/2 nns
? (tag-analysis:tag-string "I saw the man on the hill with the telescope.")
I saw the man on the hill with the telescope.
ppss/2 vbd/3 at nn in at nn in/2 at nn/2

(The number following the tag is the arity of the ambiguity class assigned by the lexicon. Words without a number are unambiguous.)

Programmatic Tagging

To use the tagger in a program, create a tagging-ts and use the values of calls to the generic function next-token. Note that reinitialize-instance redirects tagging to a new text with minimal initialization overhead.

For example, the following function, my-tag-files, calls my-process-token-and-tag on each token/tag pair generated by tagging each file in the argument files:

(use-package :tdb)
(use-package :tag-analysis)

(defun my-tag-files (files)
  (let ((token-stream (make-instance 'tagging-ts)))
    (dolist (file files)
      (with-open-file (char-stream file)
	(reinitialize-instance token-stream :char-stream char-stream)
	(loop (multiple-value-bind (token tag)
		  (next-token token-stream)
		(unless token (return))
		(my-process-token-and-tag token tag)))))))

Modifying the Tagger

In this section we describe how to retrain the tagger on different text, and how to change the tokenizer, lexicon, and various other parameters of the tagger.

All these parameters are specified by setting the variable *tokenizer-class* to a class of token stream which, when given text, returns tokens paired with ambiguity classes. This is typically done by defining a subclass of basic-tag-ts which specifies values for its parameters as follows:

(use-package :tag-basics)
(defclass my-tag-tokenizer (basic-tag-ts) ()
  (:default-initargs
   :tagger-pathname #p"/users/me/my-tagger/my.hmm"
   :tokenizer-fsa *my-tokenizer-fsa*
   :lexicon (or *my-lexicon*
		(setq *my-lexicon* (make-instance 'my-lexicon)))
   :transition-biases *my-transition-biases*
   :symbol-biases *my-symbol-biases*))

(setq *tokenizer-class* (find-class 'my-tag-tokenizer))

These parameters are described further in the following sections.

Training

The parameter :tagger-pathname indicates where the HMM will be read from and written to.

To train a tagger, use the function tag-trainer:train-on-files, for example:

? (train-on-files (directory "*.txt"))

The Tokenizer FSA

The parameter :tokenizer-fsa specifies the finite state automaton to be used to break up the text to be tagged into tokens.

These are usually defined with the macro fsa-tokenizer:def-tokenizer-fsa, for example:

(use-package :fsa-tokenizer)
(def-tokenizer-fsa *my-tokenizer-fsa*
    (let* ((digit `(/ #\0 #\1 #\2 #\3 #\4 #\5 #\6 #\7 #\8 #\9))
	   (small `(/ #\a #\b #\c #\d #\e #\f #\g #\h #\i #\j #\k #\l #\m #\n
		      #\o #\p #\q #\r #\s #\t #\u #\v #\w #\x #\y #\z))
	   (cap `(/ #\A #\B #\C #\D #\E #\F #\G #\H #\I #\J #\K #\L #\M #\N
		    #\O #\P #\Q #\R #\S #\T #\U #\V #\W #\X #\Y #\Z))
	   (alpha `(/ ,small ,cap)))
      `((:word (+ ,alpha))
	(:number (! (+ ,digit) (* ,alpha)))
	(:sent (! (/ #\. #\? #\!)
		  (+ (/ #\newline #\space))
		  ,cap)
	       1))))

A tokenizer is specified as a list of rules. Each rule consists of three parts: a type (named by a keyword), a regular expression and, optionally a backoff.

The types :word and :sent are treated specially by the tagger. Tokens of type :word will be looked up in the lexicon. All others will not. Tokens of type :sent identify sentence boundaries. The tagger processes text one sentence at a time.

All other types identify tokens as members of singleton ambiguity classes, i.e., unambiguously tag a token with that type.

Regular expressions are recursively composed of strings and characters with the following operators:

  • * kleene star, matches zero or more occurrences of its argument;
  • + matches one or more occurrences of its argument;
  • ? matches zero or one occurence of its argument;
  • / matches any of its arguments;
  • ! matches the sequence of its arguments; and
  • - matches those strings which match its first argument, but not its second argument.

Starting at a some point in the input, the tokenizer finds the longest string which matches a rule, returning the string and the type of the rule. The input is advanced to the next character after this match, and the tokenizer restarts. If no rule matches the input is advanced one character and the tokenizer restarts. If two rules match the same string, the one earlier in the definition is used.

If the matching rule has a backup specified, then the input is backed off that number of characters when that rule fires. This allows rules to specify right context.

The function tag-analysis:tokenize-file tokenizes a file, producing output in a format similar to that of tag-analysis:tag-file.

Specifying a lexicon

A lexicon may be specified by defining a class subclassing vector-lexicon and suffix-lexicon. The class vector-lexicon allows the specification of lexica by exhaustive enumeration, while suffix-lexicon infers part-of-speech ambiguity classes based upon word suffixes. The ordering of these superclasses determines which lexicon is consulted first.

For example:

(use-package :vector-lexicon)
(use-package :class-guesser)

(defvar *my-open-class* '(:noun :proper-noun :adjective :verb))
(defclass my-lexicon (vector-lexicon suffix-lexicon) ()

(:default-initargs
 :lexicon-open-class *my-open-class*
  :lexicon-pathname #p"/users/me/my-tagger/lexicon.txt"
  :lexicon-size 15000
  :suffix-pathname #p"/users/me/my-tagger/suffix.trie")

The lexicon parameter :lexicon-open-class lists the open class tags. This is used when all other lookups fail.

The lexicon parameter :lexicon-pathname specifies a file containing the entries for the vector-lexicon. This file should consist of lines containing three forms, separated by spaces: surface, tag and lexical. The file must be lexicographically sorted, as by the Unix program sort. The surface forms are used to match input, while lexical forms are returned (with tag) by the tagger.

The lexicon parameter :lexicon-size provides a hint to vector-lexicon of the number of distinct surface forms in the file. Accurate specification of this improves efficiency, but is not required.

The lexicon parameter :suffix-pathname names the file where suffix-lexicon will store its suffix table. To create such a file, based on a lexicon and some training text, use the function class-guesser:train-guesser-on-files.

The function tdb:lexicon-lookup, given a surface form and a lexicon, returns as multiple values two parallel vectors containing lexical forms and corresponding tags. Note that if all of the lexical forms are identical, the first value is simply that string.

Specifying Biases

The parameters :transition-biases and :symbol-biases enable tuning of the tagger by influencing initial probabilities when training. For example:

(setq *my-transition-biases*
      '((:valid :to-infinitive :verb)
	(:invalid :determiner :verb :preposition)))
(setq *my-symbol-biases*
      `((:valid (:determiner :noun) :determiner)
	(:valid ,*my-open-class* :noun :proper-noun)))

Here *my-transition-biases* indicates that the infinitive marker is likely to be followed by a verb. Similarly, determiners are not likely to be followed by verbs or prepositions. A :valid clause is equivalent to an :invalid clause whose tail is the complement of the tags in the :valid clause, and vice versa.

And *my-symbol-biases* indicates that words in the ambiguity class (:determiner :noun) are likely to be determiners, and words which are assigned the open class (i.e., words otherwise unknown to the lexicon) are more likely to be nouns or proper nouns than anything else.

References

  1. D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. A practical part-of-speech tagger. In Proceedings of the Third Conference on Applied Natural Language Processing, Trento, Italy, April 1992. ACL. Also available as Xerox PARC technical report SSL-92-01.
  2. D.R. Cutting, J. Pedersen, and P.-K. Halvorsen. An object-oriented architecture for text retrieval. In Conference Proceedings of RIAO'91, Intelligent Text and Image Handling, Barcelona, Spain, pages 285-298, April 1991. Also available as Xerox PARC technical report SSL-90-83.
  3. W. N. Francis and F. Ku^cera. Frequency Analysis of English Usage. Houghton Mifflin, 1982.