
# `pyLDAvis.sklearn`

pyLDAvis also supports LDA models fitted with scikit-learn. This notebook walks through that workflow using the 20 newsgroups dataset provided by scikit-learn.

In [None]:
from __future__ import print_function

In [None]:
# Install pyLDAvis before importing it; the original run pulled in pyLDAvis 2.1.2 and funcy 1.14
%pip install pyldavis

In [None]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

## Load 20 newsgroups dataset

First, the 20 newsgroups dataset available in sklearn is loaded. The headers, footers and quoted replies are stripped so that only the message bodies remain.

In [None]:
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
docs_raw = newsgroups.data
print(len(docs_raw))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


11314


## Convert to document-term matrix

Next, the raw documents are converted into a document-term matrix, either as raw term counts or in TF-IDF form.

In [None]:
tf_vectorizer = CountVectorizer(strip_accents='unicode',
                                stop_words='english',
                                lowercase=True,
                                token_pattern=r'\b[a-zA-Z]{3,}\b',
                                max_df=0.5,
                                min_df=10)
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)

(11314, 9144)


In [None]:
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)



(11314, 9144)
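
To make the difference between the two matrices concrete, here is a minimal sketch on a made-up three-document corpus (the `toy_docs` list and all variable names are illustrative assumptions, not part of the notebook's data). It reuses the count vectorizer's settings for the TF-IDF vectorizer, as done above, but drops `dtype`, since the count vectorizer's integer dtype does not suit floating-point TF-IDF weights.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the 20 newsgroups data
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Alphabetic tokens of three or more letters, English stop words removed
count_vec = CountVectorizer(stop_words="english",
                            token_pattern=r"\b[a-zA-Z]{3,}\b")
counts = count_vec.fit_transform(toy_docs)

# Reuse the same settings so both matrices share one vocabulary;
# drop the integer dtype because TF-IDF weights are floats
params = count_vec.get_params()
params.pop("dtype")
tfidf_vec = TfidfVectorizer(**params)
tfidf = tfidf_vec.fit_transform(toy_docs)

print(sorted(count_vec.vocabulary_))   # the shared vocabulary
print(counts.toarray())                # raw term counts per document
print(np.round(tfidf.toarray(), 2))    # weighted, L2-normalized rows
```

Both matrices have the same shape and column order; only the cell values differ. The TF-IDF rows are rescaled so that terms appearing in many documents carry less weight.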


## Fit Latent Dirichlet Allocation models

Finally, the LDA models are fitted, one on the raw-count matrix and one on the TF-IDF matrix.

In [None]:
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tf.fit(dtm_tf)
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=20, random_state=0)
lda_tfidf.fit(dtm_tfidf)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=20, n_jobs=None,
                          perp_tol=0.1, random_state=0, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In [None]:
lda_tf.components_  # topic-word weights, one row per topic; also a handy thing to inspect

array([[0.05      , 0.05      , 0.08500886, ..., 2.90861539, 0.05      ,
        1.90846925],
       [0.05      , 0.05      , 0.05      , ..., 2.11472034, 0.05      ,
        0.05      ],
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       ...,
       [0.05      , 0.05      , 0.05      , ..., 0.05      , 0.05      ,
        0.05      ],
       [0.05      , 0.05      , 0.05      , ..., 7.39787718, 0.05      ,
        8.05801007],
       [0.05      , 0.31202149, 0.05      , ..., 4.92074253, 0.05      ,
        5.67092789]])

In [None]:
lda_tf.transform(dtm_tf)  # take the argmax of each row to get the class (dominant topic) for each document

array([[1.51515154e-03, 1.51515154e-03, 1.51515155e-03, ...,
        1.51515154e-03, 7.46240384e-01, 1.51515153e-03],
       [9.95189200e-02, 1.31134780e-01, 1.11111113e-03, ...,
        1.11111114e-03, 1.11111114e-03, 4.55751517e-01],
       [3.84615392e-04, 1.74595318e-02, 3.84615397e-04, ...,
        3.84615395e-04, 4.28973140e-01, 1.71498085e-01],
       ...,
       [1.21951221e-03, 1.21951221e-03, 1.21951222e-03, ...,
        1.21951223e-03, 3.68732706e-01, 4.75973660e-01],
       [2.17391307e-03, 2.17391307e-03, 2.17391312e-03, ...,
        2.17391307e-03, 2.17391307e-03, 2.17391306e-03],
       [1.85185187e-03, 2.24614659e-01, 1.85185189e-03, ...,
        3.45809632e-02, 1.55685795e-01, 1.85185188e-03]])
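
The two arrays a fitted model exposes, `components_` and the result of `transform`, can be turned into something readable. The sketch below is one way to do that on a small made-up corpus (the `toy_*` names and documents are illustrative assumptions): it lists the highest-weighted words per topic and assigns each document its dominant topic by taking the argmax of its row.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the 20 newsgroups documents
toy_docs = [
    "the rocket launch reached orbit",
    "orbit and rocket engines for space launch",
    "the goalie stopped the hockey puck",
    "hockey teams trade the puck and the goalie",
]
toy_vectorizer = CountVectorizer(stop_words="english")
toy_dtm = toy_vectorizer.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = toy_lda.fit_transform(toy_dtm)  # one row per document

# Normalize each row of components_ into a topic-word distribution
topic_word = toy_lda.components_ / toy_lda.components_.sum(axis=1, keepdims=True)

# Terms in column order, recovered from the fitted vocabulary mapping
vocab = toy_vectorizer.vocabulary_
terms = np.array(sorted(vocab, key=vocab.get))

for k, weights in enumerate(topic_word):
    top_terms = terms[np.argsort(weights)[::-1][:3]]
    print(f"topic {k}: {', '.join(top_terms)}")

# Dominant topic per document: index of the largest entry in each row
dominant_topic = doc_topic.argmax(axis=1)
print(dominant_topic)
```

With only four documents the topics themselves are not meaningful; the point is the shape of the computation, which carries over unchanged to `lda_tf` and `dtm_tf`.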

## Visualizing the models with pyLDAvis

In [None]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)

In [None]:
# Prepare the TF-IDF visualization and save it as a standalone HTML file
p = pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)
pyLDAvis.save_html(p, 'lda.html')
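
It can help to see what kind of data the visualization is built from. The general `pyLDAvis.prepare` function takes topic-term distributions, document-topic distributions, document lengths, a vocabulary and term frequencies. The sketch below computes those five quantities by hand for a made-up corpus. It is an assumption that the sklearn helper gathers them in roughly this way, and the toy names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; any fitted model, matrix and vectorizer would do
toy_docs = [
    "apples and oranges are fruit",
    "oranges make orange juice",
    "cars and trucks need fuel",
    "trucks carry fruit and fuel",
]
vec = CountVectorizer()
dtm = vec.fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# P(term | topic): normalize the rows of components_
topic_term_dists = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
# P(topic | document): transform already returns normalized rows
doc_topic_dists = lda.transform(dtm)
# Number of counted tokens in each document
doc_lengths = np.asarray(dtm.sum(axis=1)).ravel()
# Corpus-wide count of each term
term_frequency = np.asarray(dtm.sum(axis=0)).ravel()
# Terms in column order
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

print(topic_term_dists.shape, doc_topic_dists.shape)
print(doc_lengths, term_frequency[:5], vocab[:5])
```

Note that document lengths and term frequencies are counts, so they are most naturally interpreted when the matrix holds raw counts rather than TF-IDF weights.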

### Using different MDS functions

With `sklearn` installed, other MDS functions, such as MMDS and t-SNE, can be used for plotting if the default PCoA is not satisfactory.

In [None]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='mmds')

In [None]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='tsne')
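
For intuition about what the `mds` argument changes, the following sketch lays out topics in two dimensions with PCoA (classical MDS) using only NumPy. It assumes that the dissimilarity between two topics is the Jensen-Shannon divergence of their term distributions. The toy `topic_term` matrix and the helper names are hypothetical, and this is an illustration of the technique rather than pyLDAvis's own code.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with zero probability contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pcoa(dist, n_dims=2):
    """Classical MDS: embed a symmetric dissimilarity matrix in n_dims."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]  # largest eigenvalues first
    # Negative eigenvalues (the dissimilarity need not be Euclidean)
    # are clipped to zero before taking the square root
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))

# Hypothetical topic-term distributions: topics 0 and 1 are similar
topic_term = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
])
n_topics = topic_term.shape[0]
dist = np.zeros((n_topics, n_topics))
for i in range(n_topics):
    for j in range(n_topics):
        dist[i, j] = js_divergence(topic_term[i], topic_term[j])

coords = pcoa(dist)
print(np.round(dist, 4))
print(np.round(coords, 4))
```

Similar topics end up close together and dissimilar ones far apart. Metric MDS and t-SNE start from the same kind of dissimilarity matrix but optimize different layout criteria, which is why they can separate crowded topics differently.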