# `pyLDAvis.sklearn`

pyLDAvis also supports topic models fitted with scikit-learn. Let's look at this in more detail. Although `fetch_20newsgroups` is imported below, the data used in this notebook is loaded from a local JSON file (`../data/all.json`).

In [3]:
from __future__ import print_function

In [5]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [6]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import NMF


## Load the dataset

First, the data is loaded from `../data/all.json`. Only records whose normalized likes, replies, or retweets exceed 0.05 are kept. Links, stop words, and non-alphabetic tokens are then removed from the text.

In [11]:
import pandas as pd
import re
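import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Needs the NLTK stop-word list and tokenizer data,
# e.g. nltk.download('stopwords') and nltk.download('punkt')
stop_words = set(stopwords.words('english'))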

def strip_links(x):
    stripped = re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", x)
    return stripped

def remove_stopwords(x):
    # Tokenize, then keep only alphabetic tokens that are not English stop words
    word_tokens = word_tokenize(x)
    filtered = [w for w in word_tokens if w.isalpha() and w.lower() not in stop_words]
    return " ".join(filtered)

df = pd.read_json('../data/all.json')
# Scale each engagement count to [0, 1] by dividing by the column maximum
df["likes_normalized"] = df["likes"] / df["likes"].max()
df["replies_normalized"] = df["replies"] / df["replies"].max()
df["retweets_normalized"] = df["retweets"] / df["retweets"].max()
# Keep records that exceed 5% of the maximum on at least one measure
df = df[(df["likes_normalized"] > 0.05) | (df["replies_normalized"] > 0.05) | (df["retweets_normalized"] > 0.05)]
# df = df[(df["likes_normalized"] < 0.0001) | (df["replies_normalized"] < 0.0001) | (df["retweets_normalized"] < 0.0001)]
# Put a space before every 'http' so links glued to a word can be matched, then clean the text
df['text'] = df['text'].apply(lambda x: x.replace('http', ' http'))
df['text'] = df['text'].apply(strip_links)
df['text'] = df['text'].apply(remove_stopwords)
print(len(df))

3898
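
To make the filtering and cleaning steps easier to follow, here is a minimal standard-library sketch of the same idea on toy data. It is not the code used above: `TOY_STOP_WORDS`, `URL_RE`, `clean_text`, and `keep_engaging` are hypothetical names, the URL pattern is much looser than the one in `strip_links`, and a simple regex stands in for `word_tokenize`.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's English list).
TOY_STOP_WORDS = {"the", "is", "a", "at", "this", "and"}

# Simplified URL pattern, an assumption for illustration only.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def clean_text(text, stop_words=TOY_STOP_WORDS):
    """Drop links, then keep alphabetic tokens that are not stop words."""
    text = text.replace("http", " http")        # separate links glued to a word
    text = URL_RE.sub(" ", text)
    tokens = re.findall(r"\w+|[^\w\s]", text)   # crude stand-in for word_tokenize
    kept = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
    return " ".join(kept)


def keep_engaging(rows, keys=("likes", "replies", "retweets"), threshold=0.05):
    """Keep rows where any count divided by its column maximum exceeds threshold.

    Assumes every column has a non-zero maximum.
    """
    maxima = {k: max(r[k] for r in rows) for k in keys}
    return [r for r in rows if any(r[k] / maxima[k] > threshold for k in keys)]


toy_rows = [
    {"likes": 100, "replies": 10, "retweets": 50},
    {"likes": 1, "replies": 0, "retweets": 1},    # below 5% on every measure
    {"likes": 2, "replies": 5, "retweets": 0},    # replies are 50% of the maximum
]
kept_rows = keep_engaging(toy_rows)
cleaned = clean_text("Look at thishttps://example.com/x great news!")
print(len(kept_rows), cleaned)
```

Because each count is divided by the column maximum, a single very popular record raises the bar for all the others, so the 0.05 threshold is relative to the most popular record rather than an absolute count.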


In [12]:
docs_raw = df['text'].tolist()


## Convert to document-term matrix

Next, the raw documents are converted into a document-term matrix, either as raw term counts or in TF-IDF form.

In [13]:
tf_vectorizer = CountVectorizer(strip_accents = 'unicode',
                                stop_words = 'english',
                                lowercase = True,
#                                 ngram_range=[1,3],
                                max_df = 0.5, 
                                min_df = 10)
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
print(dtm_tf.shape)

(3898, 708)
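
The small vocabulary (708 terms) is largely a result of the `min_df` and `max_df` cut-offs. The sketch below shows, in plain Python, how such document-frequency cut-offs shape a count matrix. It is an illustration under simplifying assumptions, not scikit-learn's implementation: tokens are split on whitespace, no stop-word list is applied, and `build_vocabulary` and `count_matrix` are hypothetical helper names.

```python
from collections import Counter


def build_vocabulary(docs, min_df=2, max_df=0.5):
    """Return sorted terms whose document frequency passes both cut-offs.

    min_df is an absolute number of documents; max_df is a fraction of all
    documents, mirroring how an int and a float are passed in the cell above.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))   # count each term once per document
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )


def count_matrix(docs, vocabulary):
    """Build a dense document-term matrix of raw counts as a list of rows."""
    index = {term: j for j, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.lower().split():
            if token in index:
                row[index[token]] += 1
        rows.append(row)
    return rows


toy_docs = [
    "the topic models find topics",
    "the topic models need counts",
    "the counts of words",
    "the words words everywhere",
    "a rare token",
    "another plain sentence",
]
vocab = build_vocabulary(toy_docs, min_df=2, max_df=0.5)
dtm = count_matrix(toy_docs, vocab)
print(vocab)
print(dtm)
```

In this toy corpus "the" is dropped for appearing in more than half of the documents, while terms seen in a single document fall below `min_df`. Documents made up only of dropped terms end up as all-zero rows.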


In [14]:
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(docs_raw)
print(dtm_tfidf.shape)

(3898, 708)
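
The TF-IDF matrix has the same shape as the count matrix; only the values differ. As a rough sketch of the weighting, assuming scikit-learn's default settings (smoothed IDF, L2 row normalization, no sublinear term-frequency scaling), the transformation can be written in plain Python as follows. `tfidf_rows` is a hypothetical helper, not part of any library.

```python
import math


def tfidf_rows(counts):
    """Weight raw counts by smoothed IDF and L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term occurs
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted_rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        weighted_rows.append([v / norm if norm else 0.0 for v in weighted])
    return weighted_rows


toy_counts = [
    [1, 1, 0],   # term 0 occurs in every document, term 1 only here
    [1, 0, 2],
    [1, 0, 0],
]
weights = tfidf_rows(toy_counts)
for row in weights:
    print([round(v, 3) for v in row])
```

In the first toy document both terms occur once, but the term that appears in every document receives the smaller weight.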




## Fit Latent Dirichlet Allocation models

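Finally, the topic models are fitted: an LDA model on the raw counts and, in place of a second LDA model, an NMF model on the TF-IDF matrix.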

In [15]:
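# LDA on the raw term counts (TF document-term matrix)
# `n_components` replaces the older `n_topics` argument
lda_tf = LatentDirichletAllocation(n_components=10, random_state=0)
lda_tf.fit(dtm_tf)
# LDA on the TF-IDF matrix, left disabled in favour of the NMF model below
# lda_tfidf = LatentDirichletAllocation(n_components=10, random_state=0)
# lda_tfidf.fit(dtm_tfidf)
# NMF on the TF-IDF matrix; despite its name, lda_tfidf holds an NMF model.
# Newer scikit-learn releases replace `alpha` with `alpha_W` / `alpha_H`.
lda_tfidf = NMF(n_components=30, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(dtm_tfidf)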
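
Both fitted models expose their topics through `components_`, a matrix of non-negative weights with one row per topic. These rows are not probability distributions, so they have to be normalized before they can be read as topic-term distributions. The following is a minimal sketch of that row normalization in plain Python; `row_normalize` is a hypothetical helper and is not pyLDAvis's own code.

```python
def row_normalize(matrix):
    """Divide each row by its sum so it can be read as a probability distribution."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            # An all-zero row has no meaningful distribution
            raise ValueError("row sums to zero and cannot be normalized")
        normalized.append([value / total for value in row])
    return normalized


# Toy stand-in for a `components_` matrix: 2 topics over 3 terms
toy_components = [
    [2.0, 1.0, 1.0],
    [0.0, 3.0, 1.0],
]
topic_term = row_normalize(toy_components)
print(topic_term)
```

The zero-sum case matters in practice: a regularized factorization such as the NMF above may produce all-zero rows, and dividing by a zero sum with NumPy yields warnings and NaN values rather than an exception.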



## Visualizing the models with pyLDAvis

In [16]:
pyLDAvis.sklearn.prepare(lda_tfidf, dtm_tfidf, tfidf_vectorizer)
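
The term bars in the visualization are ranked by relevance, which blends a term's probability within a topic with its lift (that probability divided by the term's overall probability). As a small worked example, assuming the usual form `lambda_ * log p(w|t) + (1 - lambda_) * log(p(w|t) / p(w))`, the effect of the weight `lambda_` can be checked by hand. The function name and the toy probabilities are made up for illustration.

```python
import math


def relevance(p_term_given_topic, p_term, lambda_):
    """Blend log-probability and log-lift with weight lambda_ in [0, 1]."""
    log_ttd = math.log(p_term_given_topic)
    log_lift = math.log(p_term_given_topic / p_term)
    return lambda_ * log_ttd + (1 - lambda_) * log_lift


# A frequent but generic term versus a rarer, topic-specific one (toy numbers)
generic = {"p_term_given_topic": 0.10, "p_term": 0.10}    # lift = 1
specific = {"p_term_given_topic": 0.05, "p_term": 0.005}  # lift = 10

for lam in (1.0, 0.5, 0.0):
    print(lam, relevance(lambda_=lam, **generic), relevance(lambda_=lam, **specific))
```

With `lambda_ = 1` the ranking follows probability alone, so the generic term wins. With `lambda_ = 0` it follows lift alone, so the topic-specific term wins.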



### Using different MDS functions

With `sklearn` installed, other MDS functions such as metric MDS (`mds='mmds'`) and t-SNE (`mds='tsne'`) can be used for plotting if the default PCoA layout is not satisfactory.

In [17]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='mmds')

In [18]:
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer, mds='tsne')
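
Whichever MDS function is chosen, it only decides how distances between topics are projected onto two dimensions. Assuming those distances are Jensen-Shannon divergences between the topic-term distributions (which is my understanding of what pyLDAvis computes), the input to the projection can be sketched in plain Python. The helper names and toy topics are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Symmetric divergence: average KL of p and q from their midpoint."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


def distance_matrix(topics):
    """Pairwise Jensen-Shannon divergences between topic-term distributions."""
    return [[jensen_shannon(p, q) for q in topics] for p in topics]


toy_topics = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],   # close to the first topic
    [0.1, 0.1, 0.8],   # dominated by a different term
]
distances = distance_matrix(toy_topics)
for row in distances:
    print([round(d, 4) for d in row])
```

A good layout places the first two toy topics near each other and the third one far away. PCoA, metric MDS, and t-SNE differ in how faithfully they preserve such distances.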