pyLDAvis now also supports LDA models from scikit-learn. Let's look at this in more detail. We will use the 20 Newsgroups dataset as provided by scikit-learn.

In [2]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [None]:
%matplotlib inline
import pyLDAvis
pyLDAvis.enable_notebook()

Next, we fetch only three categories from the newsgroups. If we set the number of topics in LDA to 3, we should get an intuitive result: an intertopic distance map that separates these three categories.

In [24]:
cats = ['talk.religion.misc','comp.graphics', 'sci.med']
train = fetch_20newsgroups(subset='train',categories=cats,
                           shuffle=True,random_state=42)

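# The vectorizer is passed to prepare() below, so it must be defined here.
vect = CountVectorizer(stop_words='english')
# Note: n_topics was renamed to n_components in scikit-learn 0.19.
lda = LatentDirichletAllocation(n_topics=3, random_state=123, max_iter=1500)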

Finally, we use the `prepare` method to visualize the model inside the notebook. The method lives in a small helper module, which the next cell writes to disk.

In [38]:
%%file ../pyLDAvis/sklearn.py
import pandas as pd
import funcy as fp
from . import prepare as vis_prepare

def _extract_data(docs, vect, lda):
  
    # scikit-learn's LDA does not return normalized distributions:
    # the rows of topic_term_dists and doc_topic_dists do not sum to 1.
    # The norm function therefore rescales each row so that it sums to 1.
    norm = lambda data: pd.DataFrame(data).div(data.sum(1), axis=0).values
    vected = vect.fit_transform(docs)
    doc_topic_dists = norm(lda.fit_transform(vected))
  
    return lda,vect, dict(
                      # pyLDAvis expects term counts per document, not character counts
                      doc_lengths = vected.sum(axis=1).getA1(),
                      vocab = vect.get_feature_names(),
                      term_frequency = vected.sum(axis=0).tolist()[0],
                      topic_term_dists = norm(lda.components_),
                      doc_topic_dists = doc_topic_dists)

def prepare(docs, vect, lda, **kwargs):
    """Create PreparedData from sklearn's vectorizer and Latent Dirichlet
    Allocation model.

    Parameters
    ----------
    docs : pandas Series.
        Documents to be passed as input.
    vect : scikit-learn vectorizer (CountVectorizer, TfidfVectorizer).
        Vectorizer that converts the documents into a sparse matrix.
    lda  : sklearn.decomposition.LatentDirichletAllocation.
        Latent Dirichlet Allocation model.

    **kwargs: Keyword arguments to be passed to pyLDAvis.prepare()


    Returns
    -------
    prepared_data : PreparedData
          the data structures used in the visualization


    Example
    --------
    For example usage please see this notebook:
    http://nbviewer.ipython.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb

    See
    ------
    See `pyLDAvis.prepare` for **kwargs.
    """
  
    opts = fp.merge(_extract_data(docs, vect, lda)[2], kwargs)

    return vis_prepare(**opts)

Writing ../pyLDAvis/sklearn.py
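
The row normalization done by `norm` inside `_extract_data` can be checked in isolation. The snippet below is a minimal, dependency-free sketch of the same idea; the `normalize_rows` name and the toy matrix are illustrative assumptions and are not part of pyLDAvis or scikit-learn.

```python
def normalize_rows(matrix):
    """Scale each row of a list-of-lists matrix so that it sums to 1."""
    normalized = []
    for row in matrix:
        total = float(sum(row))
        if total == 0:
            raise ValueError("cannot normalize a row that sums to zero")
        normalized.append([value / total for value in row])
    return normalized

# Hypothetical unnormalized topic-term weights (2 topics, 3 terms).
topic_term_counts = [[2.0, 1.0, 1.0],
                     [0.5, 0.5, 4.0]]

topic_term_dists = normalize_rows(topic_term_counts)
row_sums = [sum(row) for row in topic_term_dists]
print(topic_term_dists[0])  # [0.5, 0.25, 0.25]
```

After this step every row sums to 1 (up to floating-point error), which is the property the comment in `_extract_data` says the raw scikit-learn output lacks.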


## Visualizing the model with pyLDAvis

In [14]:
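import pandas as pd
# Import the helpers from the module file written above.
from pyLDAvis.sklearn import prepare, _extract_data

prepare(pd.Series(train['data']), vect, lda)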

Alternatively, you can use any LDA model that follows scikit-learn's API, for example https://github.com/ariddell/lda.

In [15]:
from lda import LDA

lda_b = LDA(n_topics=len(cats),n_iter=1500,random_state=123)

In [17]:
prepare(pd.Series(train['data']),vect,lda_b)

Note that you can get the vectorizer and the LDA model fitted on the data by calling the `_extract_data` function.

In [31]:
lda_fit,vect_fit, prepared = _extract_data(pd.Series(train['data']),vect,lda_b)

In [32]:
vect_fit

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [35]:
lda_fit

<lda.lda.LDA at 0x7f735378bd30>
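
One possible use of the fitted objects is to list the highest-weighted terms of each topic. The sketch below assumes a topic-term matrix shaped like `lda_fit.components_` and a vocabulary list like the one returned by `vect_fit.get_feature_names()`; the `top_terms` helper and the toy data are hypothetical and only illustrate the idea.

```python
def top_terms(topic_term, vocab, n=3):
    """Return the n highest-weighted vocabulary terms for each topic row."""
    result = []
    for row in topic_term:
        # Rank term indices by their weight in this topic, largest first.
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        result.append([vocab[j] for j in ranked[:n]])
    return result

# Hypothetical stand-ins for lda_fit.components_ and the fitted vocabulary.
toy_components = [[0.1, 5.0, 0.2, 3.0],
                  [4.0, 0.3, 2.5, 0.1]]
toy_vocab = ["doctor", "image", "patient", "pixel"]

topics = top_terms(toy_components, toy_vocab, n=2)
print(topics)  # [['image', 'pixel'], ['doctor', 'patient']]
```

Because the helper only indexes into each row, it should work the same way on a NumPy array of topic weights as on the plain lists used here.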