# AMC (Chen and Liu, KDD 2014)

AMC is an open-source Java package, created by Zhiyuan (Brett) Chen, that implements the algorithm proposed in the paper (Chen and Liu, KDD 2014). For more details, please refer to the paper.

If you use this package, please cite the paper: Zhiyuan Chen and Bing Liu. Mining Topics in Documents: Standing on the Shoulders of Big Data. In Proceedings of KDD 2014, pages 1116-1125.

If you have any questions or bug reports, please send them to Zhiyuan (Brett) Chen (czyuanacm@gmail.com).

The program accepts the following command-line arguments:

* -ismall: the path to the input domains directory containing the small dataset (e.g., 100 reviews per domain).
* -ibig: the path to the input domains directory containing the big dataset (e.g., 1000 reviews per domain).
* -o: the path to the output model directory.
* -nthreads: the number of threads used by the program. The program supports multithreading and runs in parallel.
* -nTopics: the number of topics used in the topic model for each domain.
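
To illustrate how these flags fit together, here is a minimal sketch of a flag/value parser. It is not the package's own option handling, which may differ. The class name `ArgsSketch`, the `parse` method, and the example paths and values are all hypothetical; only the flag names come from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this class is not part of the AMC package.
class ArgsSketch {
    // Reads "-flag value" pairs into a map keyed by the flag name without the dash.
    static Map<String, String> parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("Every flag needs a value");
        }
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            if (!args[i].startsWith("-")) {
                throw new IllegalArgumentException("Expected a flag but found: " + args[i]);
            }
            options.put(args[i].substring(1), args[i + 1]);
        }
        return options;
    }

    public static void main(String[] args) {
        // Example values and paths are made up for this sketch.
        String[] example = {
            "-ismall", "path/to/small", "-ibig", "path/to/big",
            "-o", "path/to/output", "-nthreads", "4", "-nTopics", "15"
        };
        Map<String, String> options = parse(example);
        int nThreads = Integer.parseInt(options.get("nthreads"));
        int nTopics = Integer.parseInt(options.get("nTopics"));
        System.out.println("threads=" + nThreads + ", topics=" + nTopics);
    }
}
```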

## Input and Output

### Input

The input directory should contain two datasets, one small and one big. Each dataset contains domain files. For each domain, there should be 2 files (both can be opened with a text editor):

* domain.docs: each line (representing a document) contains a list of word ids. Here, a document is a sentence in a review after preprocessing.
* domain.vocab: mapping from word id (starting from 0) to word, separated by ":".
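
The two input formats can be read with a few lines of Java. The following is a minimal sketch, not the package's own reader. It assumes that word ids within a document are separated by whitespace and that each vocabulary line looks like `0:battery` (id first, then the word). The class `DomainFiles`, its methods, and the sample words are hypothetical. `List.of` requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the AMC package.
class DomainFiles {
    // One document per line; word ids are ASSUMED to be whitespace-separated.
    static List<int[]> parseDocs(List<String> lines) {
        List<int[]> docs = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                docs.add(new int[0]); // keep empty documents so line numbers stay aligned
                continue;
            }
            String[] tokens = trimmed.split("\\s+");
            int[] ids = new int[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                ids[i] = Integer.parseInt(tokens[i]);
            }
            docs.add(ids);
        }
        return docs;
    }

    // One "id:word" entry per line (ASSUMED order: id, then ":", then word).
    static Map<Integer, String> parseVocab(List<String> lines) {
        Map<Integer, String> vocab = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(':');
            if (sep < 0) {
                continue; // ignore lines without a separator
            }
            int id = Integer.parseInt(line.substring(0, sep).trim());
            vocab.put(id, line.substring(sep + 1).trim());
        }
        return vocab;
    }

    public static void main(String[] args) {
        // In practice the lines would come from java.nio.file.Files.readAllLines(...).
        List<int[]> docs = parseDocs(List.of("0 1 2", "2 0"));
        Map<Integer, String> vocab = parseVocab(List.of("0:battery", "1:life", "2:screen"));
        for (int[] doc : docs) {
            StringBuilder sb = new StringBuilder();
            for (int id : doc) {
                sb.append(vocab.get(id)).append(' ');
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```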

### Output

The output directory contains topic model results for LDA and AMC.

Under each model directory (LDA or AMC), there are results for the different datasets (small or big). The sub-folder "DomainModels" contains one folder per domain, and each domain folder holds the topic model results for that domain. Each domain folder contains 9 files (all can be opened with a text editor):

* domain.docs: each line (representing a document) contains a list of word ids.
* domain.dtopicdist: document-topic distribution.
* domain.knowl_cannotlinks: records the cannot-links used in the model (for AMC only).
* domain.knowl_mustlinks: records the must-links used in the model (for AMC only).
* domain.param: parameter settings.
* domain.tassign: topic assignment for each word in each document.
* domain.twdist: topic-word distribution.
* domain.twords: top words under each topic. The columns are separated by '\t', and each column corresponds to one topic.
* domain.vocab: mapping from word id (starting from 0) to word.
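
Because domain.twords stores one topic per column, reading the words of a single topic means transposing the rows. Here is a minimal sketch, assuming each row holds the word of the same rank for every topic and that the file has no header row (if it does, the header would need to be skipped first). The class `TopWords`, its method, and the sample words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the AMC package.
class TopWords {
    // Turns tab-separated rows into one word list per column (topic).
    static List<List<String>> columnsToTopics(List<String> rows) {
        List<List<String>> topics = new ArrayList<>();
        for (String row : rows) {
            if (row.isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = row.split("\t", -1); // -1 keeps trailing empty cells
            for (int t = 0; t < cells.length; t++) {
                if (topics.size() <= t) {
                    topics.add(new ArrayList<>());
                }
                String word = cells[t].trim();
                if (!word.isEmpty()) {
                    topics.get(t).add(word);
                }
            }
        }
        return topics;
    }

    public static void main(String[] args) {
        // Sample rows are made up; real rows would be read from domain.twords.
        List<String> rows = List.of("price\tscreen", "cheap\tdisplay");
        List<List<String>> topics = columnsToTopics(rows);
        for (int t = 0; t < topics.size(); t++) {
            System.out.println("Topic " + t + ": " + String.join(", ", topics.get(t)));
        }
    }
}
```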

## Efficiency

The program and its parameters are set to achieve the best topic coherence quality rather than the best efficiency. There are several ways to improve efficiency (listed from the simplest to the hardest):

* Increase the number of threads in the program (specified by -nthreads in the file "global/CmdOption.java"). The topic models for each domain are executed in parallel using multithreading.
* Use a better implementation of the Apriori algorithm with multiple supports, or use a faster frequent itemset mining algorithm such as FP-growth.
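
The effect of the thread count can be pictured with a fixed-size thread pool that handles one task per domain. This is a sketch of the general pattern using the standard `java.util.concurrent` API, not the package's actual scheduling code. It assumes one task per domain; the class `ParallelDomains`, the placeholder `fitDomain`, and the domain names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; not part of the AMC package.
class ParallelDomains {
    // Placeholder for the per-domain work (a real run would fit a topic model here).
    static String fitDomain(String domain, int nTopics) {
        return domain + ": fitted " + nTopics + " topics";
    }

    // Submits one task per domain to a pool of nThreads workers and
    // returns the results in the same order as the input domains.
    static List<String> runAll(List<String> domains, int nThreads, int nTopics) {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String domain : domains) {
                Callable<String> task = () -> fitDomain(domain, nTopics);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that domain is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for a domain", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A domain task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Domain names and parameter values are made up for this sketch.
        for (String line : runAll(List.of("Camera", "Phone", "Tablet"), 2, 15)) {
            System.out.println(line);
        }
    }
}
```

With more threads than domains, the extra threads stay idle, so raising the thread count helps only up to the number of domains under this one-task-per-domain assumption.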

## Contact Information

* Author: Zhiyuan (Brett) Chen
* Affiliation: University of Illinois at Chicago
* Research Area: Text Mining, Machine Learning, Statistical Natural Language Processing, and Data Mining
* Email: czyuanacm@gmail.com
* Homepage: http://www.cs.uic.edu/~zchen/
