PVTM

Paragraph Vector Topic Model

PVTM represents documents in a semantic space via Doc2Vec to cluster them into meaningful topics using Gaussian mixture models (GMM). Doc2Vec has been shown to capture latent variables of documents, e.g., the underlying topics of a document. Clusters of documents in the vector space can be interpreted and transformed into meaningful topics by means of Gaussian mixture modeling.

Highlights

💬 Easily identify latent topics in large text corpora
📈 Detect trends and measure topic importance over time
📊 Identify topics in unseen documents
🔭 Built-in text preprocessing

Install

git clone https://github.com/davidlenz/pvtm
cd pvtm
pip install -r requirements.txt
python setup.py install

Getting Started

Importing & Preprocessing documents

Once PVTM is installed, you can start analyzing text documents. The example below uses texts from several online news sources, which can be loaded as follows:

from pvtm import pvtm, example_texts
p = pvtm.PVTM(example_texts)
_ = p.preprocess(lemmatize = False, lang = 'en', min_df = 0.005)

The PVTM object takes a list of strings as input. The .preprocess method cleans these strings (e.g. removal of special characters, numbers, currency symbols etc.) and optionally lemmatizes them. Set the parameter lemmatize to True to lemmatize the documents' texts. This can improve results but may take some time, depending on the size of the corpus. Lemmatization requires the corresponding language model to be downloaded first, and the language parameter must be set accordingly, e.g. lang='en'. The parameters min_df and max_df set the thresholds for very rare and very common words, which are excluded from the corpus-specific vocabulary. In addition, language-specific stopwords can be excluded by importing custom stopwords.
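
To make the effect of min_df and max_df concrete, here is a minimal sketch of document-frequency filtering in plain Python. It is not PVTM's own code: the helper name filter_by_document_frequency and the toy corpus are made up for illustration, and the sketch assumes both thresholds are given as fractions of the number of documents.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_df=0.0, max_df=1.0):
    """Keep tokens whose document frequency lies within [min_df, max_df].

    Hypothetical helper: thresholds are fractions of the number of documents.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Count in how many documents each token appears (not total occurrences).
    df = Counter(token for tokens in tokenized for token in set(tokens))
    vocab = {t for t, c in df.items() if min_df <= c / n_docs <= max_df}
    filtered = [[t for t in tokens if t in vocab] for tokens in tokenized]
    return filtered, vocab


docs = [
    "the market rallied today",
    "the deal closed today",
    "the market fell",
    "the new phone launched",
]
# "the" appears in every document (too common), most other words in only one
# document (too rare); only words present in half of the documents survive.
filtered, vocab = filter_by_document_frequency(docs, min_df=0.5, max_df=0.9)
print(sorted(vocab))
print(filtered)
```

With a real corpus, a small min_df such as the 0.005 used above would only drop words that occur in fewer than 0.5% of the documents.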

Training

The next step trains the Doc2Vec model and clusters the resulting document vectors with a Gaussian mixture model. Call the .fit() method and pass the parameters needed for Doc2Vec training and GMM clustering. For a more detailed description of the parameters, see the gensim documentation for the Doc2Vec model and the sklearn documentation for the Gaussian mixture model (GMM).

p.fit(vector_size=50,   # dimensionality of the feature vectors (Doc2Vec)
      n_components=20,  # number of Gaussian mixture components, i.e. topics (GMM)
      epochs=30)        # training epochs (Doc2Vec)
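
Once a Gaussian mixture is fitted, every document vector receives a probability for each component, i.e. each topic. The sketch below computes these posterior probabilities for a diagonal-covariance mixture in plain Python, in line with the 'diag' covariance type listed under Parameters. The weights, means and variances are toy values, and diag_gmm_posterior is an illustrative name, not part of PVTM.

```python
import math


def diag_gmm_posterior(x, weights, means, variances):
    """Posterior p(component | x) for a diagonal-covariance Gaussian mixture."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of a Gaussian with diagonal covariance: the dimensions
        # are independent, so the per-dimension log densities simply add up.
        log_density = sum(
            -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_density)
    # Normalize with the log-sum-exp trick for numerical stability.
    top = max(log_terms)
    exps = [math.exp(t - top) for t in log_terms]
    total = sum(exps)
    return [e / total for e in exps]


# Toy mixture with two "topics" in a 2-dimensional vector space.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

posterior = diag_gmm_posterior([0.5, 0.2], weights, means, variances)
midpoint = diag_gmm_posterior([2.0, 2.0], weights, means, variances)
print(posterior)  # almost all mass on the first component
print(midpoint)   # equally likely under both components
```

A vector close to one component mean is assigned almost entirely to that topic, while a vector halfway between two means is split evenly.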

Visualize topics

The words closest to a topic center vector are considered as topic words. You can visualize topic words with a wordcloud:

p.wordcloud_by_topic(0)
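
The idea of topic words as nearest neighbours of a topic center can be sketched with cosine similarity in plain Python. The word vectors and the center below are toy values, the choice of cosine similarity is an assumption (the text does not state which distance PVTM uses), and closest_words is an illustrative name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_words(center, word_vectors, topn=3):
    """Return the topn words whose vectors are most similar to the center."""
    ranked = sorted(
        word_vectors.items(),
        key=lambda item: cosine(center, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:topn]]


# Toy 2-dimensional word vectors; real Doc2Vec vectors have many more dimensions.
word_vectors = {
    "market": [0.9, 0.1],
    "stocks": [0.8, 0.2],
    "deal": [0.7, 0.4],
    "goal": [0.1, 0.9],
    "match": [0.2, 0.8],
}
topic_center = [1.0, 0.1]
top_words = closest_words(topic_center, word_vectors, topn=2)
print(top_words)
```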

Parameters

param	default	description
vector_size	300	dimensionality of the feature vectors (Doc2Vec)
n_components	15	number of Gaussian mixture components, i.e. Topics (GMM)
hs	0	negative sampling will be used for model training (Doc2Vec)
dbow_words	1	simultaneous training of word vectors and document vectors (Doc2Vec)
dm	0	Distributed bag of words (word2vec-Skip-Gram) (dm=0) OR distributed memory (dm=1)
epochs	1	training epochs (Doc2Vec)
window	1	window size (Doc2Vec)
seed	123	seed for the random number generator (Doc2Vec)
min_count	5	minimum number of occurrences for a word to be considered (Doc2Vec)
workers	1	number of worker threads (Doc2Vec)
alpha	0.025	initial learning rate (Doc2Vec)
min_alpha	0.025	final learning rate; the learning rate drops linearly to min_alpha as training progresses (Doc2Vec)
random_state	123	random seed (GMM)
covariance_type	'diag'	covariance type (GMM)
save	True	save the trained model
filename	'pvtm_model'	name of the model to be saved

p.topic_words contains 100 frequent words from the texts that were assigned to the individual topics. p.wordcloud_df contains all texts that were assigned to the individual topics.

Best matching topics

The search_topic_by_term method searches for the topics that best describe one or more given terms. For example,

p.search_topic_by_term(['deal'])
p.search_topic_by_term(['chance','market'])

returns:

best_matching_topic 2
best_matching_topic 14

PVTM Web Viewer

PVTM includes a web app built on Dash that lets you explore the topics interactively in the browser.

First, save a trained model:

p.save(path="./pvtm_model")

Then start the webapp from shell:

python webapp/webapp.py -m ./pvtm_model

The topics can then be viewed in the browser. By default, the PVTM webapp runs on port 8050.

Inference (experimental)

PVTM allows you to estimate the topic distribution for unseen documents using .infer_topics(). This method calls .get_string_vector (which computes a vector from the input text) and then .get_topic_weights (which returns the probability distribution over all topics).

topics = p.infer_topics(new_text)

which returns:

array([1.56368593e-06, 6.37091895e-10, 3.80703376e-04, 5.03966331e-06,
       1.42747313e-06, 1.67904347e-06, 4.88286876e-03, 2.65966754e-04,
       2.36464245e-05, 1.11277397e-02, 1.75574895e-05, 1.65568283e-04,
       1.86956832e-08, 5.60976912e-07, 2.58802897e-02, 2.47131308e-05,
       7.21725620e-08, 1.10484111e-02, 9.46138567e-01, 3.36056592e-05])
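
Assuming the result is a probability vector like the one above, the most likely topics can be read off by sorting the indices. The snippet below copies the printed values into a plain list so that it runs without a trained model; with a real model, topics would come from p.infer_topics(new_text).

```python
# Values copied from the example output above (one probability per topic).
topics = [
    1.56368593e-06, 6.37091895e-10, 3.80703376e-04, 5.03966331e-06,
    1.42747313e-06, 1.67904347e-06, 4.88286876e-03, 2.65966754e-04,
    2.36464245e-05, 1.11277397e-02, 1.75574895e-05, 1.65568283e-04,
    1.86956832e-08, 5.60976912e-07, 2.58802897e-02, 2.47131308e-05,
    7.21725620e-08, 1.10484111e-02, 9.46138567e-01, 3.36056592e-05,
]

# Topic indices ordered from most to least probable.
ranked = sorted(range(len(topics)), key=lambda k: topics[k], reverse=True)
best_topic = ranked[0]
top3 = [(k, round(topics[k], 4)) for k in ranked[:3]]

print(best_topic)  # 18
print(top3)        # [(18, 0.9461), (14, 0.0259), (9, 0.0111)]
```

In this example the document is assigned almost entirely to topic 18, with small contributions from topics 14 and 9.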

References

If you use PVTM, please cite the following paper:

Lenz D, Winker P (2020) Measuring the diffusion of innovations with paragraph vector topic models.

@article{10.1371/journal.pone.0226685,
    author = {Lenz, David AND Winker, Peter},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {Measuring the diffusion of innovations with paragraph vector topic models},
    year = {2020},
    month = {01},
    volume = {15},
    url = {https://doi.org/10.1371/journal.pone.0226685},
    pages = {1-18},
    abstract = {Measuring the diffusion of innovations from textual data sources besides patent data has not been studied extensively. However, early and accurate indicators of innovation and the recognition of trends in innovation are mandatory to successfully promote economic growth through technological progress via evidence-based policy making. In this study, we propose Paragraph Vector Topic Model (PVTM) and apply it to technology-related news articles to analyze innovation-related topics over time and gain insights regarding their diffusion process. PVTM represents documents in a semantic space, which has been shown to capture latent variables of the underlying documents, e.g., the latent topics. Clusters of documents in the semantic space can then be interpreted and transformed into meaningful topics by means of Gaussian mixture modeling. In using PVTM, we identify innovation-related topics from 170, 000 technology news articles published over a span of 20 years and gather insights about their diffusion state by measuring the topic importance in the corpus over time. Our results suggest that PVTM is a credible alternative to widely used topic models for the discovery of latent topics in (technology-related) news articles. An examination of three exemplary topics shows that innovation diffusion could be assessed using topic importance measures derived from PVTM. Thereby, we find that PVTM diffusion indicators for certain topics are Granger causal to Google Trend indices with matching search terms.},
    number = {1},
    doi = {10.1371/journal.pone.0226685}
}

License

MIT License https://en.wikipedia.org/wiki/MIT_License
