Dynamic Topic Model of Reuters News Articles from 2007 to 2013
We have implemented a fast version of the Dynamic Topic Model (DTM) proposed by David Blei and John Lafferty in 2006.
This version takes advantage of recent advances in LDA inference. The LDA part of DTM is implemented using SCVB0, proposed by Foulds et al. in 2013, and the SCVB0 implementation is parallelized with OpenMP.
In our evaluation, even the serial version gives a 36X speedup, and the parallel version gives a 53X speedup when run on a 2 GHz Core 2 Duo machine with 2 GB of memory.
Reuters News Dataset Details
Timestamped news articles published by Reuters between 2007 and 2013. The corpus contains 161,989 documents with a vocabulary size of 32,468 after preprocessing. The following preprocessing steps were performed (scripts are available in the Scrapper folder):
- We removed all documents shorter than 100 words from the Reuters data.
- We scraped a random 10% of the data from each day, purely to reduce the corpus size. The assumption is that random selection will not cause problems when finding the long-running, major topics.
- We removed all punctuation marks and performed stemming using the Porter2 stemmer.
- We also removed words with a frequency of less than 25 or more than 100,000.
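The filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the code in the Scrapper folder: the function names, the lowercasing step, and the choice to measure document length on the raw whitespace-split text are assumptions. Stemming is left out to keep the sketch dependency-free; a Porter2 stemmer is available, for example, as NLTK's `SnowballStemmer("english")`.

```python
import string
from collections import Counter

# Thresholds taken from the preprocessing description above.
MIN_DOC_WORDS = 100
MIN_WORD_FREQ = 25
MAX_WORD_FREQ = 100_000

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text):
    """Lowercase (an assumption), strip ASCII punctuation, split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def preprocess(docs, min_doc_words=MIN_DOC_WORDS,
               min_freq=MIN_WORD_FREQ, max_freq=MAX_WORD_FREQ):
    """Return one token list per surviving document."""
    # 1. Drop short documents (length measured on the raw text).
    long_enough = [d for d in docs if len(d.split()) >= min_doc_words]
    # 2. Remove punctuation and tokenize.
    tokenized = [tokenize(d) for d in long_enough]
    # 3. Keep only words whose corpus frequency lies within the bounds.
    freq = Counter(w for tokens in tokenized for w in tokens)
    keep = {w for w, c in freq.items() if min_freq <= c <= max_freq}
    return [[w for w in tokens if w in keep] for tokens in tokenized]
```

Calling `preprocess(list_of_article_strings)` returns the filtered token lists; the thresholds can be lowered for experiments on a small sample.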
We have investigated Topic Chains, a solution to the topic birth-death problem in Dynamic LDA proposed by Kim et al. in 2013.
- We use the same Reuters dataset and use the Jensen-Shannon (JS) divergence to measure the similarity between topics.
- We evaluate performance at different similarity thresholds and window sizes and find results similar to those reported in the original paper.
- We identify some issues in the method and propose solutions to them (please refer to the report for more details).
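For reference, the JS divergence between two topic-word distributions can be computed as in the sketch below. It assumes each topic is a list of probabilities over the same vocabulary that sums to 1; the function names are illustrative and not taken from the repository. With base-2 logarithms the value lies in [0, 1], where 0 means identical topics.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the mean KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m_i > 0 whenever p_i > 0 or q_i > 0, so the KL terms never divide by zero.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike the plain KL divergence, this measure is symmetric and stays finite when one topic assigns zero probability to a word.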
- Scrape data from the Reuters archive website for num_of_months months, starting at startMonth
python init.py startMonth num_of_months
- Get Stopwords
- Convert the text data to the ldac format used by Blei's implementation
python multitext2ldac.py data_folder --stopwords stopwords_file
- Convert data to UCI format
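To illustrate how the two formats relate, here is a small sketch of an ldac-to-UCI conversion. It assumes the standard layouts: each ldac line is `N id:count id:count ...` with 0-based term ids, and the UCI bag-of-words file starts with three header lines (number of documents, vocabulary size, number of non-zero entries) followed by `docID wordID count` lines with 1-based ids. Whether the converter in this repository and fastLDA expect exactly this layout is an assumption, and `ldac_to_uci` is a hypothetical name.

```python
def ldac_to_uci(ldac_lines, vocab_size):
    """Convert ldac-format lines into the lines of a UCI bag-of-words file."""
    docs = [line.split() for line in ldac_lines if line.strip()]
    triples = []
    for doc_id, fields in enumerate(docs, start=1):
        # fields[0] is the number of unique terms; the rest are id:count pairs.
        for pair in fields[1:]:
            term_id, count = pair.split(":")
            # Shift the 0-based ldac term id to a 1-based UCI word id.
            triples.append((doc_id, int(term_id) + 1, int(count)))
    header = [str(len(docs)), str(vocab_size), str(len(triples))]
    return header + [f"{d} {w} {c}" for d, w, c in triples]
```

For example, the two ldac lines `2 0:3 4:1` and `1 2:2` with a vocabulary of 5 words become a header of `2`, `5`, `3` followed by `1 1 3`, `1 5 1` and `2 3 2`.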
- Compile Dynamic LDA
- Execute Dynamic Topic Modeling on the UCI-format dataset
./fastLDA UCIFormat_data_file iterations NumOfTopics MiniBatchSize Vocab_file GeneratePi
- Get the trend of a word within a topic over time
python getWordVariation.py TopicId WordId PiFolderPath StartYear EndYear
- Compile the Topic Chains GetData program, which extracts all topics in the dataset for every time slice
- Execute GetData for Topic Chains
./GetData UCIFormat_data_file iterations NumOfTopics MiniBatchSize Vocab_file GeneratePi
- Compile GenerateChains for Topic Chains
- Execute GenerateChains
./GenerateChains Pi_folder num_topics WindowSize SimilarityThreshold
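The roles of WindowSize and SimilarityThreshold can be illustrated with the simplified sketch below. It is not the GenerateChains source and may differ from the exact procedure of Kim et al.: here every pair of topics that lie at most `window_size` time slices apart and whose similarity reaches `threshold` is linked, and each connected group of topics forms one chain. The function name and the `similarity` callable (for instance one minus the JS divergence, which is an assumption) are hypothetical.

```python
def generate_chains(slices, window_size, threshold, similarity):
    """Group topics into chains.

    slices[t][k] is the word distribution of topic k in time slice t.
    similarity(p, q) returns a score where higher means more similar.
    Returns a sorted list of chains, each a sorted list of (t, k) nodes.
    """
    # Union-find structure: every topic starts as its own chain.
    parent = {(t, k): (t, k)
              for t, topics in enumerate(slices)
              for k in range(len(topics))}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for t, topics in enumerate(slices):
        # Look back at most window_size slices.
        for back in range(1, min(window_size, t) + 1):
            for k, topic in enumerate(topics):
                for j, earlier in enumerate(slices[t - back]):
                    if similarity(topic, earlier) >= threshold:
                        parent[find((t, k))] = find((t - back, j))

    chains = {}
    for node in parent:
        chains.setdefault(find(node), []).append(node)
    return sorted(sorted(chain) for chain in chains.values())
```

In this sketch a larger window lets a topic that disappears for a few slices rejoin its earlier chain, while a higher threshold splits chains into shorter, more focused ones. A topic that links to nothing is returned as a chain of length one.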