A simple Rocchio-tfidf text categorizer.
- Python 3 interpreter (not Python 2!)
- pip package manager (recommended)
To get textcat up and running, the following code snippet should suffice on a
UNIX terminal (note that depending on your Python distribution, you may need to
use pip3 instead of pip, and python3 instead of python):
$ git clone https://github.com/eigenfoo/textcat.git
$ cd textcat
$ pip install -r requirements.txt
$ python nltk_download.py
This clones textcat from my GitHub repository, installs all required Python
packages using pip, and downloads all required nltk packages.
Note that, for the Python packages installed by pip, I have specified the
package versions I had on the machine that I developed this program on. It is
likely that the program will still work with more recent versions, but I have
not tested this.
To train:
python train.py TRAIN_LABEL_FILENAME MODEL_FILENAME
where the arguments are, in order:
- the filename of the list of labelled training documents, and
- the filename where you wish the classifier to be saved.
To test:
python test.py TEST_LABEL_FILENAME MODEL_FILENAME OUTPUT_FILENAME
where the arguments are, in order:
- the filename of the list of documents to be categorized,
- the filename of the saved classifier, and
- the filename where you wish the results to be written.
Since Rocchio-tfidf is a simple centroid-based categorization technique, it has no parameters to tune, and no smoothing is required. As such, the manner in which the text is preprocessed is of primary importance.
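As a rough illustration of the centroid-based approach, the sketch below builds one tf-idf centroid per category and assigns a document to the category whose centroid has the highest cosine similarity. It is not the code in train.py or test.py: the function names, the unsmoothed log(N/df) idf, and the unit-length normalization are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Turn token lists into sparse tf-idf dicts; also return the idf table."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: plain log(N / df), with no smoothing.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def normalize(vec):
    """Scale a sparse vector to unit length (zero vectors are left alone)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def train_centroids(docs, labels):
    """Sum the unit-length tf-idf vectors of each category into a centroid."""
    vectors, idf = tfidf_vectors(docs)
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in normalize(vec).items():
            sums[label][term] += weight
    centroids = {label: normalize(total) for label, total in sums.items()}
    return centroids, idf


def classify(tokens, centroids, idf):
    """Return the label of the centroid most similar to the document."""
    counts = Counter(tokens)
    # Terms never seen in training have no idf and are ignored.
    query = normalize({t: c * idf[t] for t, c in counts.items() if t in idf})

    def cosine(centroid):
        # Both vectors are unit length, so the dot product is the cosine.
        return sum(w * centroid.get(t, 0.0) for t, w in query.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Toy data, invented for this example.
docs = [
    ["goal", "match", "team"],
    ["election", "vote", "party"],
    ["team", "score", "goal"],
    ["vote", "senate", "party"],
]
labels = ["sports", "politics", "sports", "politics"]
centroids, idf = train_centroids(docs, labels)
print(classify(["team", "goal"], centroids, idf))  # sports
```

Because the only inputs are token lists, everything that distinguishes one categorizer of this kind from another happens before this step, in the preprocessing.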
The linguistic preprocessing in the final categorizer is as follows:
- All text is lowercased.
- The text is tokenized with the Punkt tokenizer.
- The tokens are part-of-speech tagged with the Averaged Perceptron tagger.
- The tokens are lemmatized with the WordNet lemmatizer.
- Any stopwords (using the nltk builtin stopword list) are then stripped.
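The steps above can be sketched with standard nltk calls as follows. This is a minimal sketch rather than the repository's actual code: the function names are hypothetical, nltk.word_tokenize is assumed as the tokenizer entry point (it relies on the Punkt sentence models), and passing the part-of-speech tag to the lemmatizer is an assumption about how the tags are used.

```python
def wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to the single-letter POS code WordNet uses."""
    if penn_tag.startswith("J"):
        return "a"  # adjective
    if penn_tag.startswith("V"):
        return "v"  # verb
    if penn_tag.startswith("R"):
        return "r"  # adverb
    return "n"  # treat everything else as a noun


def preprocess(text):
    """Lowercase, tokenize, tag, lemmatize, then strip stopwords."""
    # Imported here so the helper above can be used without nltk installed.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    stop_set = set(stopwords.words("english"))

    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)  # averaged perceptron tagger by default
    lemmas = [lemmatizer.lemmatize(tok, wordnet_pos(tag)) for tok, tag in tagged]
    return [lemma for lemma in lemmas if lemma not in stop_set]
```

Calling preprocess requires the nltk data packages fetched by nltk_download.py to be present.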
Lowercasing was done more as an act of habit than as a well thought-out decision.
The tokenizer used is the Punkt tokenizer. Several other tokenizers were
considered (e.g. the Stanford Tokenizer and Penn Treebank Tokenizer). The
Stanford Tokenizer was found to be an improvement on the Penn Treebank
Tokenizer, so the latter was not tested. The Stanford Tokenizer appeared to give
similar performance to the Punkt tokenizer, but has a significantly more
complicated interface (the Punkt tokenizer, by contrast, is the default in
nltk), so the Punkt tokenizer was used.
The part-of-speech tagger used is the Averaged Perceptron Tagger. Tagging
provided small boosts in performance, so it was adopted. The Averaged Perceptron
Tagger is a state-of-the-art tagger, and is also the default part-of-speech
tagger in nltk.
Lemmatization was chosen instead of stemming, as several unrelated words may be
stemmed to the same token: lemmatization avoids this problem, at the cost of
a greater computational load (which for the present application is not a
problem). nltk provides only one lemmatizer, the WordNet lemmatizer, so it was
used.
Stopwords were filtered using the built-in nltk stopword list, again more out
of habit than as a well thought-out decision. In any case, stopwords usually
have low tf-idf values, and would not contribute much to the cosine similarity
metric.
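To see why stopwords carry little weight, consider a small worked example. It assumes the unsmoothed idf log(N/df), which may not be the exact variant used in this program, and the three toy documents are invented for illustration.

```python
import math

# Three toy documents, already tokenized.
docs = [
    ["the", "senate", "passed", "the", "bill"],
    ["the", "team", "won", "the", "match"],
    ["the", "market", "fell"],
]


def idf(term, docs):
    """Unsmoothed inverse document frequency: log(N / df)."""
    doc_freq = sum(term in doc for doc in docs)
    return math.log(len(docs) / doc_freq)


def tfidf(term, doc, docs):
    """Raw term count in the document multiplied by the term's idf."""
    return doc.count(term) * idf(term, docs)


# "the" occurs in every document, so its idf is log(3/3) = 0 and its
# tf-idf weight vanishes no matter how often it is repeated.
weight_the = tfidf("the", docs[0], docs)

# "senate" occurs in one document only, so its idf is log(3/1) > 0.
weight_senate = tfidf("senate", docs[0], docs)
```

A word that appears in most but not all documents receives a small positive weight instead of exactly zero, which is why stripping stopwords changes the similarity scores only slightly.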
For the second corpus, it was found that stripping proper nouns and cardinal numbers led to minor improvements in categorization accuracy. This makes sense, as whether an image is taken indoors or outdoors does not depend on who is indoors or outdoors. Thus, any article with fewer than 100 words was assumed to come from the second corpus, and had its proper nouns (both singular and plural) and cardinal numbers stripped. This is admittedly a clunky implementation, as there are some articles in the other two corpora with fewer than 100 words, but this error seems to cause no grievous problems.
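The length-based heuristic can be sketched as a filter over part-of-speech tagged tokens. The function and constant names below are hypothetical, and counting tokens as "words" is an assumption; the tags NNP, NNPS, and CD are the Penn Treebank tags for singular proper nouns, plural proper nouns, and cardinal numbers.

```python
# Penn Treebank tags to strip from short articles.
STRIPPED_TAGS = {"NNP", "NNPS", "CD"}
SHORT_ARTICLE_THRESHOLD = 100


def strip_for_short_articles(tagged_tokens, threshold=SHORT_ARTICLE_THRESHOLD):
    """Drop proper nouns and cardinal numbers from articles under the threshold.

    tagged_tokens is a list of (token, tag) pairs, such as nltk.pos_tag returns.
    Articles with `threshold` tokens or more are returned unchanged.
    """
    if len(tagged_tokens) >= threshold:
        return list(tagged_tokens)
    return [(tok, tag) for tok, tag in tagged_tokens if tag not in STRIPPED_TAGS]


short_article = [("John", "NNP"), ("saw", "VBD"), ("3", "CD"), ("dogs", "NNS")]
filtered = strip_for_short_articles(short_article)
# filtered == [("saw", "VBD"), ("dogs", "NNS")]
```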
In order to evaluate the performance of the text categorizer on the second and
third corpora (for which the test sets were not provided), a shell script
(split_corpora.sh) was written to split off a test set from the training set
(with about 1/3 of the articles, following the pattern of the first corpus).
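For readers who prefer Python to shell, a split of this kind might look like the sketch below. It is not a port of split_corpora.sh, whose details are not described here: the function name, the random shuffling, the fixed seed, and the format of the label lines are all assumptions.

```python
import random


def split_labels(lines, test_fraction=1 / 3, seed=0):
    """Shuffle the label lines and hold out roughly `test_fraction` for testing."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical label lines: a document path followed by its category.
lines = [f"doc{i}.txt label" for i in range(9)]
train_lines, test_lines = split_labels(lines)
```

The two lists can then be written to separate files and passed to train.py and test.py respectively.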