quynhneo/detm-arxiv

Extracting topic trends from paper abstracts with DETM

Quynh M. Nguyen a, b and Kyle Cranmer a, c

a Physics Department, New York University, New York 10003

b Applied Math Lab, Courant Institute, New York University, New York 10012

c Center for Data Science, New York University, New York 10011

Project description

This project runs dynamic embedded topic modeling on the abstracts of arXiv articles to discover how topics in STEM change over time. It is an implementation of Dynamic Embedded Topic Modeling (DETM), the model introduced by Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei of Columbia University.

Get the abstracts

Visit https://www.kaggle.com/Cornell-University/arxiv to get arxiv-metadata-oai-snapshot.json, which contains about 2 million records. Each record has about a dozen fields; we are interested in abstract, categories, and update_date.
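For orientation, here is a minimal sketch of how the three fields could be pulled out of the snapshot for one category. It is not the code in arxivtools; the function name iter_abstracts is hypothetical, and the sketch assumes the file stores one JSON record per line, that categories is a space-separated string, and that update_date starts with a four-digit year.

```python
import json
from typing import Iterator, Tuple


def iter_abstracts(path: str, category: str) -> Iterator[Tuple[str, int]]:
    """Yield (abstract, year) for every record tagged with `category`."""
    with open(path, encoding="utf-8") as handle:
        # Assumption: one JSON object per line, so the 2 million records
        # never have to be held in memory at once.
        for line in handle:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Assumption: `categories` is a space-separated string,
            # for example "hep-ph hep-ex".
            if category not in record["categories"].split():
                continue
            # Assumption: `update_date` looks like "2016-01-05".
            year = int(record["update_date"][:4])
            # Collapse line breaks and repeated spaces inside the abstract.
            abstract = " ".join(record["abstract"].split())
            yield abstract, year
```

Calling list(iter_abstracts("arxiv-metadata-oai-snapshot.json", "hep-ph")) would collect the hep-ph abstracts with their years, provided the assumptions above hold for the downloaded file.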

Generate embedding with word2vec

Modify the path to arxiv-metadata-oai-snapshot.json in arxivtools/word2vec.py and run:

python arxivtools/word2vec.py

This reads in the abstracts, removes punctuation, removes the stop words listed in arxivtools/stops.txt, removes rare words that appear in fewer than 30 abstracts as well as words that appear in more than 70% of abstracts, and produces vector representations of all remaining words (default embedding dimension = 300) using the original settings from the Mikolov et al. 2013 NIPS paper. The results are saved as embeddings.txt, where each line is a word followed by 300 numbers. The process takes about an hour per 150,000 abstracts on a laptop.
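The frequency filter described above can be sketched in a few lines. This is an illustration only, not the code in arxivtools/word2vec.py: the function name filter_vocabulary is hypothetical, the input is assumed to be already tokenized, and punctuation and stop-word removal are left out.

```python
from collections import Counter
from typing import Iterable, List, Set


def filter_vocabulary(
    docs: Iterable[List[str]], min_df: int = 30, max_df: float = 0.7
) -> Set[str]:
    """Keep words found in at least `min_df` abstracts and at most a
    fraction `max_df` of all abstracts."""
    docs = list(docs)
    doc_freq: Counter = Counter()
    for tokens in docs:
        # set() counts a word once per abstract, however often it repeats.
        doc_freq.update(set(tokens))
    max_count = max_df * len(docs)
    return {
        word
        for word, count in doc_freq.items()
        if min_df <= count <= max_count
    }


# Toy usage with a lowered threshold so the effect is visible on 4 abstracts.
toy_docs = [
    ["the", "higgs", "boson"],
    ["the", "higgs", "mass"],
    ["the", "quark"],
    ["the", "axion"],
]
kept = filter_vocabulary(toy_docs, min_df=2, max_df=0.7)
```

In the toy example, "the" is dropped because it appears in every abstract, and the words that appear in only one abstract are dropped as too rare.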

Clone our fork of the original DETM repository

This is the main repo for DETM. We have made some changes to fix runtime errors, match the settings in the paper, and adapt the code to the arxiv metadata file, but we have not changed the model:

git clone https://github.com/quynhneo/DETM

The environment can be set up with pip or conda. For example, using conda:

conda create --name detm --file requirements.txt 
conda activate detm

Preprocess text data

This step converts each abstract to a bag of words (a bag of integer tokens, to be exact), attaches a timestamp to each abstract, and splits the data into train, validation, and test sets. These are stored in .mat files. It also creates the vocabulary of all the abstracts, stored in vocab.txt. This is just a list of words, without vectors; the vectors are taken from embeddings.txt. Ideally the two lists contain the same words, or most of the vocabulary is covered by the embeddings. Modify the path to arxiv-metadata-oai-snapshot.json in scripts/data_undebates.py and run:

python scripts/data_undebates.py

This takes about 5 minutes per 150,000 abstracts on a laptop. With default settings, the output is saved in script/split_paragraph_False/min_df_30.
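Since the vectors for vocab.txt come from embeddings.txt, it can be worth checking how many vocabulary words actually have a vector before training. The sketch below is not part of either repository. The helper names are hypothetical, embeddings.txt is read as "word followed by numbers" per line as described earlier, and vocab.txt is assumed to hold one word per line, which this README does not state.

```python
from typing import List, Set, Tuple


def read_embedding_words(path: str) -> Set[str]:
    """Collect the first token of every non-empty line of embeddings.txt."""
    words = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if parts:
                words.add(parts[0])
    return words


def read_vocab(path: str) -> List[str]:
    """Read vocab.txt, assuming one word per line (an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def vocab_coverage(vocab_path: str, embeddings_path: str) -> Tuple[float, List[str]]:
    """Return the fraction of vocabulary words that have an embedding,
    plus the sorted list of words that do not."""
    vocab = read_vocab(vocab_path)
    embedded = read_embedding_words(embeddings_path)
    missing = sorted(word for word in vocab if word not in embedded)
    coverage = 1.0 - len(missing) / len(vocab) if vocab else 0.0
    return coverage, missing
```

A coverage close to 1.0 means the two preprocessing steps produced nearly the same word list; a long list of missing words suggests the two scripts were run on different subsets or with different thresholds.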

Run Dynamic Embedded Topic Modeling

To run with all default settings, change two lines: https://github.com/quynhneo/DETM/blob/master/main.py#L34 should point to the parent folder of the preprocessed data folder min_df_30, and https://github.com/quynhneo/DETM/blob/master/main.py#L35 should point to the prefit embedding file embeddings.txt. Then run:

python main.py

This stage takes much longer and should be run on a GPU (CPU mode is too slow, even on a 16-core node).

More instructions for running on a cluster using CUDA are here

The output is three .mat files in results.

Plot the results

Edit beta_file in plot_word_evolution.py to be the path to the file ending in _beta in results and run:

python plot_word_evolution.py 

Results

The plot below shows results for DETM trained on the hep-ph (high energy physics phenomenology) category, containing 150,000 abstracts. Six out of 50 topics are shown here. For each topic, the probabilities of some selected words (in most cases, words with high probability) are plotted against time (2007-2020).

[Figure: probabilities of selected words over time (2007-2020) for six of the 50 topics]

In topics #33 and #34, the peak probability of the word 750 coincides with the flurry of papers on a possible discovery of new physics at 750 GeV around 2015-2016, which turned out to be just a statistical fluke. Topic #38 shows the increase in the probability of higgs around the time of the discovery of the Higgs boson in 2012.

The above plots are from running 400 epochs on 150,000 abstracts from hep-ph. We used one Nvidia RTX8000 GPU and the runtime was 13 hours.
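Peaks like the one for the word 750 can also be located numerically rather than read off the plot. The sketch below is an illustration with made-up numbers, not output from the trained model. It assumes the topic-word probabilities are available as a nested array indexed as beta[topic][time][word], which is an assumption about the layout and not something this README states; the function names are hypothetical.

```python
from typing import List, Sequence


def word_trajectory(beta: Sequence, topic: int, word_index: int) -> List[float]:
    """Probability of one word within one topic at every time step.

    Assumption: `beta` is indexed as beta[topic][time][word].
    """
    return [step[word_index] for step in beta[topic]]


def peak_year(trajectory: Sequence[float], years: Sequence[int]) -> int:
    """Return the year at which the trajectory reaches its maximum."""
    if len(trajectory) != len(years):
        raise ValueError("trajectory and years must have the same length")
    best = max(range(len(trajectory)), key=lambda i: trajectory[i])
    return years[best]


# Synthetic example: 1 topic, 3 time steps, 2 words (values are invented).
toy_beta = [[[0.1, 0.9], [0.6, 0.4], [0.3, 0.7]]]
toy_years = [2014, 2015, 2016]
toy_peak = peak_year(word_trajectory(toy_beta, topic=0, word_index=0), toy_years)
```

The word index would come from the position of the word in the vocabulary list, and the list of years from the time range of the training data.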
