MalletWrapper

A Python wrapper for MALLET. Topic modeling only, for now.

Mallet Installation Instructions (Mac)

Download the Mallet directory
Download JDK
Install Homebrew
Install Ant via brew install ant

cd /dir/containing/mallet-2.x.x
ant
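
Once the build finishes, it can be useful to confirm that the directory looks like a built MALLET distribution before passing it to the wrapper. The sketch below assumes the 2.0.x layout, with a launcher script at `bin/mallet` and compiled classes in a `class` directory after `ant` has run. `check_mallet_dir` is a hypothetical helper, not part of MalletWrapper.

```python
from pathlib import Path


def check_mallet_dir(mallet_dir):
    """Return a list of problems found in a MALLET directory (empty if none).

    Assumes the 2.0.x layout: a launcher at bin/mallet and compiled
    classes under class/ once `ant` has been run.
    """
    root = Path(mallet_dir).expanduser()
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "bin" / "mallet").is_file():
        problems.append("missing launcher script bin/mallet")
    if not (root / "class").is_dir():
        problems.append("missing class/ directory (has `ant` been run?)")
    return problems
```

An empty list means both checks passed; anything else is a human-readable hint about what to fix.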

MalletWrapper Usage

from MalletWrapper import Mallet

model = Mallet('/Users/mikeronayne/mallet-2.0.8/')
model.import_dir(input='/Users/mikeronayne/mallet-2.0.8/sample-data/web/en')
model.train_topics()

print(model.topic_keys) # see output_topic_keys parameter in Train Topics documentation
print(model.doc_topics) # see output_doc_topics parameter in Train Topics documentation
print(model.word_weights) # see topic_word_weights_file parameter in Train Topics documentation

# {topic # (int): {dirichlet parameter: float, words: list}, ... }
{0: {'dirichlet': 0.5, 'words': ['rings', 'are', 'were', 'norway', 'ring', 'dust', 'only', 'number', 'may', 'moons', 'narrow', 'uranian', 'particles', 'discovered', 'uranus', 'survived', 'some', 'saga', 'including', 'system']}, ... }

# {document # (int): {document name: e.g. file path (str), topics: {topic # (int): weight (float), ... }}, ... }
{0: {'name': 'file:/Users/mikeronayne/mallet-2.0.8/sample-data/web/en/elizabeth_needham.txt', 'topics': {7: 0.3105263157894737, 6: 0.3, 1: 0.19473684210526315, 8: 0.07894736842105263, 0: 0.03684210526315789, 9: 0.02631578947368421, 2: 0.02631578947368421, 3: 0.015789473684210527, 5: 0.005263157894736842, 4: 0.005263157894736842}}, ... }

# {topic # (int): {word (str): weight (float)}, ... }
{0: {'elizabeth': 0.01, 'needham': 0.01, 'died': 0.01, 'may': 3.01, 'also': 0.01, 'known': 0.01, 'mother': 0.01, 'was': 0.01, 'english': 0.01, 'procuress': 0.01, 'and': 0.01, 'brothel-keeper': 1.01, 'th-century': 0.01, 'london': 0.01, ... }, ... }
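
The three attributes printed above are plain nested dictionaries, so they can be post-processed with ordinary Python. As an illustrative sketch, the helpers below pick the highest-weight topic per document and rescale the unnormalized word weights so that each topic sums to 1. The helper names are hypothetical (not part of MalletWrapper), and the inputs are small hand-made values in the documented shapes rather than real model output.

```python
def dominant_topics(doc_topics):
    """Map each document number to its highest-weight topic number."""
    return {
        doc_id: max(doc["topics"], key=doc["topics"].get)
        for doc_id, doc in doc_topics.items()
    }


def normalize_word_weights(word_weights):
    """Rescale unnormalized per-topic word weights so each topic sums to 1."""
    normalized = {}
    for topic, weights in word_weights.items():
        total = sum(weights.values())
        normalized[topic] = {word: w / total for word, w in weights.items()}
    return normalized


# Hand-made inputs in the same shape as model.doc_topics and model.word_weights.
doc_topics = {
    0: {"name": "file:/path/to/a.txt", "topics": {7: 0.31, 6: 0.30, 1: 0.19}},
    1: {"name": "file:/path/to/b.txt", "topics": {2: 0.75, 7: 0.25}},
}
word_weights = {0: {"may": 3.0, "brothel-keeper": 1.0}}

print(dominant_topics(doc_topics))           # {0: 7, 1: 2}
print(normalize_word_weights(word_weights))  # {0: {'may': 0.75, 'brothel-keeper': 0.25}}
```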

MalletWrapper Documentation

Constructor

Mallet(mallet_dir, memory=1)

Parameter	Type	Description	Default
mallet_dir	str	File path of Mallet-2.x.x directory
memory	int, float	Maximum gigabytes of memory to allocate to Mallet	1
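
The `memory` argument is given in gigabytes and may be fractional. MALLET runs on the JVM, where the maximum heap size is set with the standard `-Xmx` option, so one plausible way to pass the limit along is to convert it to whole megabytes. How MalletWrapper handles this internally is not documented here; `heap_flag` is a hypothetical helper used only to illustrate the conversion.

```python
def heap_flag(memory_gb):
    """Convert gigabytes (int or float) to a JVM max-heap option, e.g. -Xmx1536m."""
    if memory_gb <= 0:
        raise ValueError("memory must be positive")
    megabytes = int(memory_gb * 1024)  # the JVM expects a whole number
    return f"-Xmx{megabytes}m"


print(heap_flag(1))    # -Xmx1024m
print(heap_flag(1.5))  # -Xmx1536m
```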

Importing Data

Via Directory

import_dir(**kwargs)

Parameter	Type	Description	Default
input	str, list	The directories containing text files to be classified, one directory per class	null
preserve_case	bool	If true, do not force all strings to lowercase.	False
replacement_files	str, list	Files containing string replacements, one per line: 'A B [tab] C' replaces A B with C; 'A B' replaces A B with A_B	null
deletion_files	str, list	Files containing strings to delete after replacements but before tokenization (i.e., multiword stop terms)	null
remove_stopwords	bool	If true, remove a default list of common English "stop words" from the text.	False
stoplist_file	str	Instead of the default list, read stop words from a file, one per line. Implies `remove_stopwords`	null
extra_stopwords	str	Read whitespace-separated words from this file, and add them to either the default English stoplist or the list specified by `stoplist_file`.	null
stop_pattern_file	str	Read regular expressions from a file, one per line. Tokens matching these regexps will be removed.	null
skip_header	bool	If true, in each document, remove text occurring before a blank line. This is useful for removing email or Usenet headers.	False
skip_html	bool	If true, remove text occurring inside <...>, as in HTML or SGML.	False
gram_sizes	int, str	Include among the features all n-grams of sizes specified. For example, to get all unigrams and bigrams, use `gram_sizes='1,2'`. This option occurs after the removal of stop words, if removed.	1
encoding	str	Character encoding for input file	UTF-8
token_regex	str	Regular expression used for tokenization. Example: `[\p{L}\p{N}_]+|[\p{P}]+` (Unicode letters, numbers and underscore OR all punctuation)	`\p{L}[\p{L}\p{P}]+\p{L}`
print_output	bool	If true, print a representation of the processed data to standard output. This option is intended for debugging.	False
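
The parameter names in this table mirror MALLET's command-line options, which are spelled with hyphens (for example `--remove-stopwords` and `--gram-sizes`). A minimal sketch of how keyword arguments could be translated into such an argument list follows. It is an assumption about the general approach, not MalletWrapper's actual implementation, and `to_mallet_args` is a hypothetical name.

```python
def to_mallet_args(**kwargs):
    """Translate Python keyword arguments into MALLET-style command-line arguments."""
    args = []
    for key, value in kwargs.items():
        if value is None:
            continue  # leave MALLET's own default in place
        args.append("--" + key.replace("_", "-"))
        if isinstance(value, bool):  # check bool before other types
            args.append("true" if value else "false")
        elif isinstance(value, (list, tuple)):
            args.extend(str(v) for v in value)
        else:
            args.append(str(value))
    return args


print(to_mallet_args(input=["docs/en", "docs/de"], remove_stopwords=True, gram_sizes="1,2"))
# ['--input', 'docs/en', 'docs/de', '--remove-stopwords', 'true', '--gram-sizes', '1,2']
```

Building a list rather than a single string keeps paths with spaces intact when the list is later handed to something like `subprocess.run`.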

Via File

import_file(**kwargs)

Parameter	Type	Description	Default
input	str	The file containing data to be classified, one instance per line	null
line_regex	str	Regular expression containing regex-groups for label, name and data.	`^(\S*)[\s,]*(\S*)[\s,]*(.*)$`
name	int	The index of the group containing the instance name. Use 0 to indicate that the name field is not used.	1
data	int	The index of the group containing the data.	3
remove_stopwords	bool	If true, remove a default list of common English "stop words" from the text.	False
replacement_files	str, list	Files containing string replacements, one per line: 'A B [tab] C' replaces A B with C; 'A B' replaces A B with A_B	null
deletion_files	str, list	Files containing strings to delete after replacements but before tokenization (i.e., multiword stop terms)	null
stoplist_file	str	Instead of the default list, read stop words from a file, one per line. Implies `remove_stopwords`	null
extra_stopwords	str	Read whitespace-separated words from this file, and add them to either the default English stoplist or the list specified by `stoplist_file`.	null
stop_pattern_file	str	Read regular expressions from a file, one per line. Tokens matching these regexps will be removed.	null
preserve_case	bool	If true, do not force all strings to lowercase.	False
encoding	str	Character encoding for input file	UTF-8
token_regex	str	Regular expression used for tokenization. Example: `[\p{L}\p{N}_]+|[\p{P}]+` (Unicode letters, numbers and underscore OR all punctuation)	`\p{L}[\p{L}\p{P}]+\p{L}`
print_output	bool	If true, print a representation of the processed data to standard output. This option is intended for debugging.	False
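
To see how `line_regex`, `name` and `data` fit together, the sketch below applies a three-group pattern of the form `^(\S*)[\s,]*(\S*)[\s,]*(.*)$` to single lines and reads the name and data from groups 1 and 3. It uses Python's `re` module purely for illustration; MALLET itself uses Java regular expressions, which behave the same for this simple pattern but differ for constructs such as `\p{L}`. `parse_instance` is a hypothetical helper.

```python
import re

# Three groups; by default group 1 is the instance name and group 3 the data.
LINE_REGEX = r"^(\S*)[\s,]*(\S*)[\s,]*(.*)$"


def parse_instance(line, line_regex=LINE_REGEX, name=1, data=3):
    """Split one input line into (name, data) using 1-based group indices.

    A name index of 0 means the name field is not used, so None is returned for it.
    """
    match = re.match(line_regex, line)
    if match is None:
        raise ValueError(f"line does not match pattern: {line!r}")
    instance_name = match.group(name) if name else None
    return instance_name, match.group(data)


print(parse_instance("doc1 en The quick brown fox"))
# ('doc1', 'The quick brown fox')
print(parse_instance("doc2\ten\tUranus has narrow rings", name=0))
# (None, 'Uranus has narrow rings')
```

Note that `\S*` is greedy and commas are not whitespace, so fields separated only by commas (no spaces) would all land in the first group; whitespace between fields avoids that.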

Train Topics

train_topics(**kwargs)

Parameter	Type	Description	Default
input_model	str	The filename from which to read the binary topic model. The `input` option is ignored. By default this is null, indicating that no file will be read.	null
input_state	str	The filename from which to read the gzipped Gibbs sampling state created by `output_state`. The original input file must be included, using `input`. By default this is null, indicating that no file will be read.	null
output_model	str	The filename in which to write the binary topic model at the end of the iterations. By default this is null, indicating that no file will be written.	null
output_state	str	The filename in which to write the Gibbs sampling state at the end of the iterations. By default this is null, indicating that no file will be written.	null
output_model_interval	int	The number of iterations between writing the model (and its Gibbs sampling state) to a binary file. You must also set `output_model` to use this option; its argument will be used as the prefix of the filenames.	0
output_state_interval	int	The number of iterations between writing the sampling state to a text file. You must also set `output_state` to use this option; its argument will be used as the prefix of the filenames.	0
inferencer_filename	str	The filename in which to write a topic inferencer, which applies a previously trained topic model to new documents. By default this is null, indicating that no file will be written.	null
evaluator_filename	str	The filename in which to write a held-out likelihood evaluator for new documents. By default this is null, indicating that no file will be written.	null
output_topic_keys	str	The filename in which to write the top words for each topic and any Dirichlet parameters. By default this is null, indicating that no file will be written.	null
num_top_words	int	The number of most probable words to print for each topic after model estimation.	20
show_topics_interval	int	The number of iterations between printing a brief summary of the topics so far.	50
topic_word_weights_file	str	The filename in which to write unnormalized weights for every topic and word type. By default this is null, indicating that no file will be written.	null
word_topic_counts_file	str	The filename in which to write a sparse representation of topic-word assignments. By default this is null, indicating that no file will be written.	null
diagnostics_file	str	The filename in which to write measures of topic quality, in XML format. By default this is null, indicating that no file will be written.	null
xml_topic_report	str	The filename in which to write the top words for each topic and any Dirichlet parameters in XML format. By default this is null, indicating that no file will be written.	null
xml_topic_phrase_report	str	The filename in which to write the top words and phrases for each topic and any Dirichlet parameters in XML format. By default this is null, indicating that no file will be written.	null
output_topic_docs	str	The filename in which to write the most prominent documents for each topic, at the end of the iterations. By default this is null, indicating that no file will be written.	null
num_top_docs	int	When writing topic documents with `output_topic_docs`, report this number of top documents.	100
output_doc_topics	str	The filename in which to write the topic proportions per document, at the end of the iterations. By default this is null, indicating that no file will be written.	null
doc_topics_threshold	float	When writing topic proportions per document with `output_doc_topics`, do not print topics with proportions less than this threshold value.	0.0
doc_topics_max	int	When writing topic proportions per document with `output_doc_topics`, do not print more than this number of topics. A negative value indicates that all topics should be printed.	-1
num_topics	int	The number of topics to fit.	10
num_threads	int	The number of threads for parallel training.	1
num_iterations	int	The number of iterations of Gibbs sampling.	1000
num_icm_iterations	int	The number of iterations of iterated conditional modes (topic maximization).	0
no_inference	bool	Do not perform inference, just load a saved model and create a report. Equivalent to setting `num_iterations` to 0.	False
random_seed	int	The random seed for the Gibbs sampler. Default is 0, which will use the clock.	0
optimize_interval	int	The number of iterations between re-estimating Dirichlet hyperparameters. A value of 0 disables optimization.	0
optimize_burn_in	int	The number of iterations to run before first estimating Dirichlet hyperparameters.	200
use_symmetric_alpha	bool	Only optimize the concentration parameter of the prior over document-topic distributions. This may reduce the number of very small, poorly estimated topics, but may disperse common words over several topics.	False
alpha	float	SumAlpha parameter: sum over topics of smoothing over doc-topic distributions. alpha_k = [this value] / [num topics]	5.0
beta	float	Beta parameter: smoothing parameter for each topic-word. beta_w = [this value]	0.01
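
Two of the parameters above involve simple arithmetic that is easy to check by hand. With the defaults `alpha=5.0` and `num_topics=10`, each topic starts with alpha_k = 5.0 / 10 = 0.5, which matches the `'dirichlet': 0.5` value in the usage output earlier. The sketch below also shows one reading of how `doc_topics_threshold` and `doc_topics_max` limit the reported topics. Both functions are hypothetical illustrations, not code from MalletWrapper or MALLET.

```python
def per_topic_alpha(alpha_sum, num_topics):
    """alpha_k = alpha_sum / num_topics, the initial prior weight of each topic."""
    return alpha_sum / num_topics


def filter_doc_topics(topics, threshold=0.0, max_topics=-1):
    """Keep topics with proportion >= threshold, largest first, at most max_topics.

    A negative max_topics keeps every topic that passes the threshold.
    """
    kept = sorted(
        ((t, p) for t, p in topics.items() if p >= threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    if max_topics >= 0:
        kept = kept[:max_topics]
    return dict(kept)


print(per_topic_alpha(5.0, 10))  # 0.5
print(filter_doc_topics({1: 0.19, 7: 0.31, 6: 0.30, 4: 0.005}, threshold=0.1, max_topics=2))
# {7: 0.31, 6: 0.3}
```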

Future Improvements

Provide an interface that does not rely on reading files (e.g. pass extra stopwords directly instead of through a file)
Better error handling, especially validation of bad inputs
Support for classification
