## Topic Extraction: 20 Newsgroups Data Set

11.08.19

(Data set: http://qwone.com/~jason/20Newsgroups/)

Apply several topic-modeling methods to the 20 newsgroups dataset. The goal is to determine which method, if any, best reproduces the topics represented by the newsgroups.

Methods:
- Latent Semantic Analysis (LSA)
- Latent Dirichlet Allocation (LDA)
- Non-Negative Matrix Factorization (NNMF)

In [1]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import NMF
from sklearn.decomposition import LatentDirichletAllocation as LDA

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Load data.
dataset = fetch_20newsgroups(shuffle=True, random_state=1,
                            remove=('headers', 'footers', 'quotes'))
news = dataset.data

In [3]:
# Prepare data.
# Create the tf-idf matrix.
vectorizer = TfidfVectorizer(max_df=0.95,
                             min_df=2,
                             max_features=1000,
                             stop_words='english')
news_tfidf=vectorizer.fit_transform(news)

# LDA is a probabilistic graphical model of word counts,
# so it is fit on raw term counts rather than tf-idf weights.
vectorizer2 = CountVectorizer(max_df=0.95,
                             min_df=2,
                             max_features=1000,
                             stop_words='english')
news_tf=vectorizer2.fit_transform(news)

In [4]:
# Get the word list.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0.
terms = vectorizer.get_feature_names_out()

# Number of topics.
ntopics=20

# Link words to topics.
def word_topic(tfidf, solution, wordlist):

    # Loading scores for each word on each topic/component:
    # (terms x documents) @ (documents x topics) -> (terms x topics).
    words_by_topic = tfidf.T @ solution

    # Linking the loadings to the words in an easy-to-read way.
    components = pd.DataFrame(words_by_topic, index=wordlist)

    return components

# Extract the top N words and their loadings for each topic.
def top_words(components, n_top_words):
    n_topics = range(components.shape[1])
    # Repeat each topic number once per word so every topic gets n_top_words rows.
    index = np.repeat(n_topics, n_top_words, axis=0)
    # dtype=object because the values are "word loading" strings.
    topwords = pd.Series(index=index, dtype=object)
    for column in range(components.shape[1]):
        # Sort the column so that highest loadings are at the top.
        sortedwords = components.iloc[:, column].sort_values(ascending=False)
        # Choose the N highest loadings.
        chosen = sortedwords[:n_top_words]
        # Combine loading and index into a string.
        chosenlist = chosen.index + " " + chosen.round(2).map(str)
        topwords.loc[column] = list(chosenlist)
    return topwords

# Number of words to look at for each topic.
n_top_words = 10

### Latent Semantic Analysis (LSA)
LSA applies a truncated singular value decomposition (SVD), a close relative of PCA, to the tf-idf matrix: the tf-idf-weighted term-document matrix is reduced to a lower-dimensional space, and each component groups terms that should reflect a topic. Shortcoming: the link between words and topics is not always clear. For example, some words may have large negative loadings on a component, which are hard to interpret.
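
To make the sign issue concrete, here is a minimal sketch on a made-up document-term matrix (the counts and the seven terms are hypothetical, not taken from the notebook). It uses `np.linalg.svd` instead of `TruncatedSVD` to stay self-contained. Because the second component must be orthogonal to the first, it ends up mixing positive and negative loadings.

```python
import numpy as np

# Toy document-term count matrix (hypothetical data, not from the notebook).
terms = ["space", "nasa", "launch", "card", "video", "driver", "new"]
X = np.array([
    [2, 1, 1, 0, 0, 0, 1],  # space-themed document
    [1, 2, 0, 0, 0, 0, 1],  # space-themed document
    [0, 0, 0, 2, 1, 1, 1],  # hardware-themed document
    [0, 0, 0, 1, 2, 0, 1],  # hardware-themed document
], dtype=float)

# SVD: X = U @ diag(s) @ Vt. Each row of Vt holds the term loadings of one component.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the two strongest components, as a truncated SVD would.
k = 2
term_loadings = Vt[:k]

for i, row in enumerate(term_loadings):
    pairs = ", ".join(f"{t} {v:+.2f}" for t, v in zip(terms, row))
    print(f"Component {i}: {pairs}")
```

A word with a strongly negative loading is not "absent" from the topic in any simple sense, which is why LSA components are harder to read than the non-negative weights produced by LDA or NNMF.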

### Probabilistic LSA
(Note: scikit-learn does not implement pLSA.)

pLSA (also called pLSI, Probabilistic Latent Semantic Indexing) assumes that a set of topics exists, although that set is unknown at the start. This is the opposite of LSA, where we start with the data and solve for a set of component-topics.
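
The assumption is commonly written as a mixture over latent topics. The symbols here are introduced for illustration and do not appear elsewhere in this notebook: $d$ is a document, $w$ a word, and $z$ a latent topic.

$$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$$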

### Latent Dirichlet Allocation (LDA)
LDA is a Bayesian version of pLSA. It places sparse Dirichlet priors on:
- the probability that a topic will be in a document,
- the probability that a word will be in a topic.
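
The effect of a sparse prior can be sketched with the standard library alone. The helper names `sample_dirichlet` and `mean_largest_weight` are hypothetical, and the large `alpha` value is chosen only for contrast. With a small `alpha`, most of the probability mass of each draw lands on a few topics. With a large `alpha`, the mass is spread almost evenly.

```python
import random

def sample_dirichlet(alpha, n_topics, rng):
    """Draw one topic distribution from a symmetric Dirichlet(alpha)."""
    # A Dirichlet draw is a vector of independent Gamma(alpha, 1) draws, normalized.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_weight(alpha, n_topics=20, n_docs=500, seed=0):
    """Average weight of the most probable topic across simulated documents."""
    rng = random.Random(seed)
    largest = [max(sample_dirichlet(alpha, n_topics, rng)) for _ in range(n_docs)]
    return sum(largest) / n_docs

sparse = mean_largest_weight(alpha=0.05)  # 1/20, the topic_word_prior value used in this notebook
dense = mean_largest_weight(alpha=10.0)   # hypothetical large value for contrast

print(f"alpha=0.05: largest topic weight averages {sparse:.2f}")
print(f"alpha=10.0: largest topic weight averages {dense:.2f}")
```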

### Topic modeling with Non-Negative Matrix Factorization
- Like PCA, NNMF searches for two matrices whose product approximates the tf-idf matrix.
- Unlike PCA, it applies the constraint that all three matrices (the tf-idf matrix and both factors) must contain no negative values.
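
As a rough illustration of the non-negativity constraint, the sketch below factorizes a made-up matrix with multiplicative updates. This is a different solver from the coordinate descent (`solver='cd'`) used later in this notebook. It is chosen only because it fits in a few lines of NumPy; the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative document-term matrix (hypothetical data).
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

n_topics = 2
eps = 1e-10  # Guards against division by zero.

# Random non-negative starting factors: W is documents x topics, H is topics x terms.
W = rng.random((X.shape[0], n_topics))
H = rng.random((n_topics, X.shape[1]))
initial_error = np.linalg.norm(X - W @ H)

# Each update multiplies by a ratio of non-negative quantities,
# so W and H can never become negative.
for _ in range(200):
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)

final_error = np.linalg.norm(X - W @ H)
print(f"Reconstruction error: {initial_error:.3f} -> {final_error:.3f}")
```

Because every entry of `H` is a non-negative weight, each row can be read directly as "how much each term contributes to this topic", with no negative loadings to interpret.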

In [5]:
topwords=pd.DataFrame()

In [6]:
# LSA.
svd= TruncatedSVD(ntopics)
lsa = make_pipeline(svd, Normalizer(copy=False))
news_lsa = lsa.fit_transform(news_tfidf)

components_lsa = word_topic(news_tfidf, news_lsa, terms)
topwords['LSA']=top_words(components_lsa, n_top_words) 

In [7]:
# LDA.
lda = LDA(n_components=ntopics, 
          doc_topic_prior=None, # None defaults to 1/n_components.
          topic_word_prior=1/ntopics,
          learning_decay=0.7, # Learning-rate decay; only used by the online learning method.
          learning_offset=10.0, # Downweights early iterations; only used by the online learning method.
          max_iter=10, # When to stop even if the model is not converging (to prevent running forever).
          evaluate_every=-1, # Do not evaluate perplexity, as it slows training time.
          mean_change_tol=0.001, # Stop updating the document topic distribution in the E-step when mean change is < tol.
          max_doc_update_iter=100, # When to stop updating the document topic distribution in the E-step even if tol is not reached.
          n_jobs=-1, # Use all available CPUs to speed up processing time.
          verbose=0, # amount of output to give while iterating.
          random_state=0
         )
news_lda = lda.fit_transform(news_tf) 

components_lda = word_topic(news_tf, news_lda, terms)
topwords['LDA']=top_words(components_lda, n_top_words)

In [8]:
# NMF.
nmf = NMF(alpha_W=0.0, # No regularization. Replaces `alpha`, which was removed in scikit-learn 1.2.
          init='nndsvdar', # How starting values are calculated.
          l1_ratio=0.0, # Sets whether regularization is L2 (0), L1 (1), or a combination (values between 0 and 1).
          max_iter=200, # When to stop even if the model is not converging (to prevent running forever).
          n_components=ntopics, 
          random_state=0, 
          solver='cd', # Use Coordinate Descent to solve.
          tol=0.0001, # Tolerance of the stopping condition.
          verbose=0 # Amount of output to give while iterating.
         )
news_nmf = nmf.fit_transform(news_tfidf) 

components_nmf = word_topic(news_tfidf, news_nmf, terms)
topwords['NNMF']=top_words(components_nmf, n_top_words)

In [9]:
for topic in range(ntopics):
    print('Topic {}:'.format(topic))
    print(topwords.loc[topic])

Topic 0:
             LSA             LDA        NNMF
0     don 161.96    space 541.44   good 5.92
0    just 161.61       00 456.26     ve 5.12
0    like 160.79     nasa 187.38   time 4.96
0     know 150.2      new 180.46   just 3.97
0  people 144.89       10 173.54    like 3.6
0   think 134.99   launch 164.58    don 3.57
0    time 118.36     1993 160.14    did 3.26
0     does 115.0       20 155.88    got 3.26
0     use 114.24  program 141.86   know 3.23
0    good 113.05     data 139.16  think 3.07
Topic 1:
              LSA           LDA          NNMF
1    thanks 64.82    use 299.92     card 8.24
1   windows 57.77   scsi 299.23    video 4.09
1        use 38.3    like 258.5  monitor 2.69
1      mail 34.52   just 242.63  drivers 2.17
1       card 34.1    don 203.27    cards 2.11
1      file 33.22  power 201.21      bus 2.04
1  software 30.19   time 200.03   driver 1.87
1       dos 29.23     car 192.6      vga 1.85
1   program 29.22  speed 189.58  windows 1.78
1     drive 29.09   good 18

Ground truth (20 news groups):
- comp.graphics
- comp.os.ms-windows.misc
- comp.sys.ibm.pc.hardware
- comp.sys.mac.hardware
- comp.windows.x	
- rec.autos
- rec.motorcycles
- rec.sport.baseball
- rec.sport.hockey	
- sci.crypt
- sci.electronics
- sci.med
- sci.space
- misc.forsale	
- talk.politics.misc
- talk.politics.guns
- talk.politics.mideast	
- talk.religion.misc
- alt.atheism
- soc.religion.christian


### Summary
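In the results above, some topics are shared across methods, though the order of topics varies. The content of some topics also varies considerably across methods. In the output shown, LSA's topic 0 is dominated by generic words (don, just, like), LDA's topic 0 is about space (space, nasa, launch), and NNMF's topic 1 is about computer hardware (card, video, monitor). It's best to use multiple methods when exploring topics.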
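
One possible way to put a number on "best reproduces the newsgroups" is cluster purity: assign each document to its highest-scoring topic, then measure how often documents in the same topic share a true newsgroup. The function below is a hypothetical helper, not part of the analysis above, and purity is only one of several reasonable metrics.

```python
from collections import Counter, defaultdict

def purity(assigned_topics, true_labels):
    """Fraction of documents whose true label is the majority label of their assigned topic."""
    labels_by_topic = defaultdict(Counter)
    for topic, label in zip(assigned_topics, true_labels):
        labels_by_topic[topic][label] += 1
    # For each topic, count the documents that carry its most common label.
    majority_total = sum(counts.most_common(1)[0][1] for counts in labels_by_topic.values())
    return majority_total / len(true_labels)

# Hypothetical toy example: six documents, two extracted topics, two true newsgroups.
toy_topics = [0, 0, 0, 1, 1, 1]
toy_labels = ["sci.space", "sci.space", "rec.autos", "rec.autos", "rec.autos", "rec.autos"]
print(purity(toy_topics, toy_labels))
```

In this notebook, the inputs could plausibly be something like `news_nmf.argmax(axis=1)` and `dataset.target`. That pairing is an assumption rather than something the analysis above ran, and LSA scores can be negative, so taking the largest score is less meaningful there.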