GCTM

This is an implementation of the Graph Convolutional Topic Model (GCTM), which exploits an external knowledge graph in a streaming environment. In this work, we use graph convolutional networks (GCN) to embed the knowledge graph into the topic space, and we develop a method to learn GCTM from data streams. Some benefits of GCTM are as follows:

  • GCTM exploits a knowledge graph, derived from human knowledge or from a pre-trained model, to enrich a topic model for data streams, especially when data are sparse or noisy. To our knowledge, our work is the first to provide a way to model prior knowledge in graph form in a streaming environment.
  • We also propose an automatic mechanism to balance the prior knowledge against the old knowledge learnt in previous minibatches. This mechanism controls the impact of the prior knowledge in each minibatch: when concept drift happens, it automatically decreases the influence of the old knowledge and increases the influence of the prior knowledge, helping GCTM deal well with the drift. (A schematic sketch of this mechanism follows this list.)
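Schematically, a minimal sketch of this balancing mechanism (an illustration under our reading, not the paper's exact formulation) is a convex combination of the two knowledge sources serving as the prior for minibatch t:

	\beta_t^{\mathrm{prior}} = \lambda_t \,\mathrm{GCN}(G) + (1 - \lambda_t)\, \beta_{t-1}, \qquad \lambda_t \in [0, 1]

where GCN(G) is the graph-convolutional embedding of the knowledge graph in topic space, \beta_{t-1} is the knowledge learnt up to the previous minibatch, and the balancing weight \lambda_t is inferred from the data at each minibatch, growing when concept drift occurs.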

Installation

  1. Clone the repository
		git clone https://github.com/bachtranxuan/GCTM.git
  2. Set up the required environment
		Python 3.7
		PyTorch 1.2.0
		Numpy, Scipy
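For example, with pip inside a Python 3.7 environment (a typical setup; the exact command for installing PyTorch 1.2.0 may differ depending on your platform and CUDA version):

	pip install torch==1.2.0 numpy scipy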

Training

You can run training with the following command (the code will be updated very soon):

	python runGCTM.py \
		--folder='data' \
		--iteration=100 \
		--num_topics=50 \
		--batch_size=500 \
		--opt='adam' \
		--lr=0.01 \
		--alpha=0.01 \
		--sigma=1 \
		--num_tests=1 \
		--top=20 \
		--hidden=200 \
		--dropout=0

Data descriptions

Data for training consists of an external knowledge graph and a set of documents.

  • For the knowledge graph, we experiment with both WordNet and a pre-trained graph (Word2vec). For WordNet, we use both synonym and antonym relationships between words to create edges, and the weight of each edge is the Wu-Palmer similarity of the corresponding pair of words. For the pre-trained graph, we create a 200-nearest-neighbour graph based on cosine similarity between the Word2vec representations of words. The two graphs are saved in data/edgesw.txt and data/edges_knn200.txt respectively. Each line represents an edge and is of the form:
	vertex_id1 <tab> vertex_id2 <tab> weight
  • We use the bag-of-words model to represent documents. Each document is represented by a sparse vector of word counts. Data is saved in a file (data/train.txt) in which each row represents a document in the form:
	[M] [term_id1]:[count] [term_id2]:[count] ... [term_idN]:[count]
	where [M] is the number of unique terms and [count] is the number of occurrences of the corresponding term in the document. Note that both vertex_id and term_id refer to word_id in the vocabulary (data/vocab.txt). A parsing sketch for both formats follows this list.
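Both formats are easy to read with standard tooling. Below is a minimal parsing sketch (the function names load_edges and load_documents are ours, not part of the repository, and symmetrising the adjacency matrix is an assumption about how the graph is consumed):

	import numpy as np
	import scipy.sparse as sp

	def load_edges(path, num_words):
	    # Each line: vertex_id1 <tab> vertex_id2 <tab> weight
	    rows, cols, vals = [], [], []
	    with open(path) as f:
	        for line in f:
	            u, v, w = line.split()
	            rows.append(int(u)); cols.append(int(v)); vals.append(float(w))
	    adj = sp.coo_matrix((vals, (rows, cols)), shape=(num_words, num_words))
	    return (adj + adj.T).tocsr()  # assumption: treat the graph as undirected

	def load_documents(path):
	    # Each line: [M] [term_id1]:[count] ... [term_idM]:[count]
	    docs = []
	    with open(path) as f:
	        for line in f:
	            tokens = line.split()
	            m = int(tokens[0])  # number of unique terms in this document
	            ids, cts = zip(*(t.split(':') for t in tokens[1:m + 1]))
	            docs.append((np.array(ids, dtype=int), np.array(cts, dtype=int)))
	    return docs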

Each document in the test set is divided randomly into two disjoint parts (part_1 and part_2) with a ratio of 4:1. We compute the predictive probability of part_2 given part_1. The two parts are saved in data/data_test_1_part_1.txt and data/data_test_1_part_2.txt respectively; their format is the same as that of the training data file.

Performance measures

We use log predictive probability (LPP) and normalized pointwise mutual information (NPMI) to measure performance. LPP is computed on the test set after training the model on each minibatch, while NPMI is calculated on the whole training set after the training process finishes.
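For reference, the standard definitions of these measures (a sketch; the paper's exact normalisation may differ slightly): with each test document split into an observed part (part_1) and a held-out part (part_2),

	\mathrm{LPP} = \frac{1}{|\mathbf{w}^{ho}|} \sum_{w \in \mathbf{w}^{ho}} \log p\left(w \mid \mathbf{w}^{obs}, \mathcal{M}\right)

and, for a topic with top words w_1, ..., w_T (T = 20 here, matching --top=20),

	\mathrm{NPMI} = \frac{2}{T(T - 1)} \sum_{i < j} \frac{\log \frac{P(w_i, w_j)}{P(w_i)\, P(w_j)}}{-\log P(w_i, w_j)}

where P(.) denotes empirical (co-)occurrence probabilities estimated on the training corpus, and the final score is averaged over topics.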

Result

We compare our model with three state-of-the-art baselines: SVB (Broderick et al., 2013), PVB (McInerney et al., 2015) and SVP-PP (Masegosa et al., 2017). We conduct intensive experiments with several scenarios that are described in detail in the paper. Here are some results:
	[Figure: Log predictive probability]
	[Figure: Normalized pointwise mutual information]

Citation

If you find GCTM useful for your research, please cite:

@article{van2022graph,
  title={A graph convolutional topic model for short and noisy text streams},
  author={Van Linh, Ngo and Bach, Tran Xuan and Than, Khoat},
  journal={Neurocomputing},
  volume={468},
  pages={345--359},
  year={2022},
  publisher={Elsevier}
}
