
PVTM

Paragraph Vector Topic Model

PVTM represents documents in a semantic space via Doc2Vec and clusters them into meaningful topics using Gaussian mixture models (GMM). Doc2Vec has been shown to capture latent variables of documents, e.g., their underlying topics. Clusters of documents in the vector space can then be interpreted and transformed into meaningful topics by means of Gaussian mixture modeling.

Highlights

  • 💬 Easily identify latent topics in large text corpora
  • 📈 Detect trends and measure topic importance over time
  • 📊 Identify topics in unseen documents
  • 🔭 Built-in text preprocessing

Install

git clone https://github.com/davidlenz/pvtm
cd pvtm
pip install -r requirements.txt
python setup.py install

Getting Started

Importing & Preprocessing documents

Once PVTM is installed, you can analyze text documents. The example below uses texts from various online news sources; the data can be loaded as follows:

from pvtm import pvtm, example_texts
p = pvtm.PVTM(example_texts)
_ = p.preprocess(lemmatize = False, lang = 'en', min_df = 0.005)

The PVTM object takes a list of strings as input. The .preprocess method can clean the strings (e.g., remove special characters, numbers, and currency symbols) and lemmatize them. Set lemmatize=True to lemmatize the texts; this can improve results but may take some time depending on the size of the corpus. If the texts should be lemmatized, the corresponding language model must be downloaded first and the language parameter set accordingly, e.g. lang='en'. The min_df and max_df parameters set thresholds for very rare/common words, which are then excluded from the corpus-specific vocabulary. Further, language-specific stopwords can be excluded by importing custom stopwords.
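The min_df/max_df semantics can be illustrated with scikit-learn's CountVectorizer. This is a sketch of document-frequency pruning in general, not PVTM's internal code, and the toy corpus is made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy corpus: 'the' and 'market' appear in two documents, every other term in one
docs = ["the market rises", "the market falls", "one rare word appears"]

# min_df=2 drops terms occurring in fewer than two documents,
# mirroring the rare-word threshold used during preprocessing
vec = CountVectorizer(min_df=2)
vec.fit(docs)
vocab = sorted(vec.vocabulary_)  # ['market', 'the']
```

With an integer min_df the threshold is an absolute document count; a float (as in min_df=0.005 above) is interpreted as a fraction of the corpus.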

Training

The next step trains the Doc2Vec model and clusters the resulting document vectors by means of Gaussian mixture modeling. Call the p.fit() method with the parameters needed for Doc2Vec training and GMM clustering. For a more detailed description of the parameters, see the gensim documentation (Doc2Vec model) and the scikit-learn documentation (GMM).

p.fit(vector_size=50,   # dimensionality of the feature vectors (Doc2Vec)
      n_components=20,  # number of Gaussian mixture components, i.e. topics (GMM)
      epochs=30)
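The clustering step can be sketched in isolation with scikit-learn; here synthetic random vectors stand in for the Doc2Vec document vectors (the data and dimensions are made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(123)
# two well-separated synthetic clusters standing in for Doc2Vec vectors
vecs = np.vstack([rng.normal(0.0, 0.1, (50, 5)),
                  rng.normal(3.0, 0.1, (50, 5))])

# diagonal-covariance GMM, matching PVTM's default covariance_type='diag'
gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=123).fit(vecs)
labels = gmm.predict(vecs)       # hard topic assignment per document
probs = gmm.predict_proba(vecs)  # soft topic distribution per document
```

Each mixture component plays the role of one topic; the soft assignments are what makes topic distributions over documents possible.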

Visualize topics

The words closest to a topic's center vector are considered topic words. You can visualize them with a word cloud:

p.wordcloud_by_topic(0)

Parameters

| param | default | description |
|---|---|---|
| vector_size | 300 | dimensionality of the feature vectors (Doc2Vec) |
| n_components | 15 | number of Gaussian mixture components, i.e. topics (GMM) |
| hs | 0 | if 0, negative sampling is used for model training (Doc2Vec) |
| dbow_words | 1 | simultaneous training of word vectors and document vectors (Doc2Vec) |
| dm | 0 | distributed bag of words (dm=0) or distributed memory (dm=1) (Doc2Vec) |
| epochs | 1 | number of training epochs (Doc2Vec) |
| window | 1 | window size (Doc2Vec) |
| seed | 123 | seed for the random number generator (Doc2Vec) |
| min_count | 5 | minimum number of appearances for a word to be considered (Doc2Vec) |
| workers | 1 | number of worker threads (Doc2Vec) |
| alpha | 0.025 | initial learning rate (Doc2Vec) |
| min_alpha | 0.025 | final learning rate; the learning rate drops linearly to min_alpha as training progresses (Doc2Vec) |
| random_state | 123 | random seed (GMM) |
| covariance_type | 'diag' | covariance type (GMM) |
| save | True | whether to save the trained model |
| filename | 'pvtm_model' | filename of the saved model |

p.topic_words contains the 100 most frequent words from the texts assigned to each topic. p.wordcloud_df contains all texts assigned to each topic.

Best matching topics

The search_topic_by_term method searches for the topics that best match the given term(s). For example,

p.search_topic_by_term(['deal'])
p.search_topic_by_term(['chance','market'])

which return, respectively:

best_matching_topic 2
best_matching_topic 14

PVTM Web Viewer

For visualizing your results, PVTM includes a web app built on Dash that lets you interactively explore topics in the browser.

First, save a trained model:

p.save(path="./pvtm_model")

Then start the webapp from shell:

python webapp/webapp.py -m ./pvtm_model

Topics can then be viewed in the browser. The PVTM web app runs on port 8050 by default.

Inference (*experimental*)

PVTM allows you to easily estimate the topic distribution of unseen documents using .infer_topics(). This method calls .get_string_vector() (which maps the input text to a vector) and .get_topic_weights() (which returns a probability distribution over all topics) consecutively.

topics = p.infer_topics(new_text)

which returns:

array([1.56368593e-06, 6.37091895e-10, 3.80703376e-04, 5.03966331e-06,
       1.42747313e-06, 1.67904347e-06, 4.88286876e-03, 2.65966754e-04,
       2.36464245e-05, 1.11277397e-02, 1.75574895e-05, 1.65568283e-04,
       1.86956832e-08, 5.60976912e-07, 2.58802897e-02, 2.47131308e-05,
       7.21725620e-08, 1.10484111e-02, 9.46138567e-01, 3.36056592e-05])
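Since the result is a probability distribution over the topics, the most likely topic for the new document is simply the argmax of the array; using the values above, where index 18 holds roughly 0.95 of the mass:

```python
import numpy as np

# topic distribution as returned by p.infer_topics(new_text) above
topics = np.array([1.56368593e-06, 6.37091895e-10, 3.80703376e-04, 5.03966331e-06,
                   1.42747313e-06, 1.67904347e-06, 4.88286876e-03, 2.65966754e-04,
                   2.36464245e-05, 1.11277397e-02, 1.75574895e-05, 1.65568283e-04,
                   1.86956832e-08, 5.60976912e-07, 2.58802897e-02, 2.47131308e-05,
                   7.21725620e-08, 1.10484111e-02, 9.46138567e-01, 3.36056592e-05])

best_topic = int(topics.argmax())  # index of the most probable topic -> 18
```

The index can then be passed to, e.g., p.wordcloud_by_topic() to inspect the matching topic.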

References

If you use PVTM, please cite the following paper:

Lenz D, Winker P (2020) Measuring the diffusion of innovations with paragraph vector topic models. PLOS ONE 15(1): 1-18. https://doi.org/10.1371/journal.pone.0226685

@article{10.1371/journal.pone.0226685,
    author = {Lenz, David AND Winker, Peter},
    journal = {PLOS ONE},
    publisher = {Public Library of Science},
    title = {Measuring the diffusion of innovations with paragraph vector topic models},
    year = {2020},
    month = {01},
    volume = {15},
    url = {https://doi.org/10.1371/journal.pone.0226685},
    pages = {1-18},
    abstract = {Measuring the diffusion of innovations from textual data sources besides patent data has not been studied extensively. However, early and accurate indicators of innovation and the recognition of trends in innovation are mandatory to successfully promote economic growth through technological progress via evidence-based policy making. In this study, we propose Paragraph Vector Topic Model (PVTM) and apply it to technology-related news articles to analyze innovation-related topics over time and gain insights regarding their diffusion process. PVTM represents documents in a semantic space, which has been shown to capture latent variables of the underlying documents, e.g., the latent topics. Clusters of documents in the semantic space can then be interpreted and transformed into meaningful topics by means of Gaussian mixture modeling. In using PVTM, we identify innovation-related topics from 170,000 technology news articles published over a span of 20 years and gather insights about their diffusion state by measuring the topic importance in the corpus over time. Our results suggest that PVTM is a credible alternative to widely used topic models for the discovery of latent topics in (technology-related) news articles. An examination of three exemplary topics shows that innovation diffusion could be assessed using topic importance measures derived from PVTM. Thereby, we find that PVTM diffusion indicators for certain topics are Granger causal to Google Trend indices with matching search terms.},
    number = {1},
    doi = {10.1371/journal.pone.0226685}
}

License

MIT License https://en.wikipedia.org/wiki/MIT_License
