# **Bonus Notebook: Topic Modeling with NMF & Visualization**

## *IS 5150*

Due to time constraints, we won't cover this notebook in class, but I encourage you to work through it on your own time if you're interested. We will walk through non-negative matrix factorization (NMF), another topic modeling method that is available in `sklearn`, and then produce an interactive topic model visualization.

We will use the NeurIPS corpus again for this exercise.
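As a rough intuition for what NMF does, the sketch below factors a tiny, made-up document-term matrix `V` into a document-topic matrix `W` and a topic-term matrix `H`, both non-negative, so that `W @ H` approximates `V`. The matrix values and the names `V`, `W`, `H`, and `toy_nmf` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term count matrix (4 documents x 5 terms); values are made up.
V = np.array([
    [3, 2, 0, 0, 0],
    [2, 3, 1, 0, 0],
    [0, 0, 0, 4, 3],
    [0, 1, 0, 3, 4],
], dtype=float)

# Factor V into W (documents x topics) and H (topics x terms).
toy_nmf = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=500)
W = toy_nmf.fit_transform(V)   # document-topic weights
H = toy_nmf.components_        # topic-term weights

print(W.shape, H.shape)
print(np.round(W @ H, 1))      # approximate reconstruction of V
```

Each row of `H` can be read as a topic (a weighting over terms), and each row of `W` as how strongly a document loads on each topic.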

### **Retrieve and Extract Data**

In [None]:
# load all dependencies

import nltk
#nltk.download()  #stopwords, wordnet, omw-1.4
import gensim

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

#!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.sklearn  # note: in pyLDAvis >= 3.4 this module is named pyLDAvis.lda_model
import dill
import warnings

warnings.filterwarnings('ignore')
pyLDAvis.enable_notebook()

In [None]:
!wget https://cs.nyu.edu/~roweis/data/nips12raw_str602.tgz
!tar -xzf nips12raw_str602.tgz

DATA_PATH = '/content/nipstxt'
print(os.listdir(DATA_PATH))

In [None]:
folders = ["nips{0:02}".format(i) for i in range(0,13)]
# Read all texts into a list.
papers = []
for folder in folders:
    file_names = os.listdir(DATA_PATH + '/' + folder )
    for file_name in file_names:
        with open(DATA_PATH + '/' + folder + '/' + file_name, encoding='utf-8', errors='ignore', mode='r') as f:  # read-only is enough here
            data = f.read()
        papers.append(data)

### **Preprocess & Normalize Corpus**

In [None]:
%%time

stop_words = nltk.corpus.stopwords.words('english')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

def normalize_corpus(papers):
    norm_papers = []
    for paper in papers:
        paper = paper.lower()
        paper_tokens = [token.strip() for token in wtk.tokenize(paper)]
        paper_tokens = [wnl.lemmatize(token) for token in paper_tokens if not token.isnumeric()]
        paper_tokens = [token for token in paper_tokens if len(token) > 1]
        paper_tokens = [token for token in paper_tokens if token not in stop_words]
        paper_tokens = list(filter(None, paper_tokens))
        if paper_tokens:
            norm_papers.append(paper_tokens)
            
    return norm_papers
    
norm_papers = normalize_corpus(papers)

CPU times: user 35.7 s, sys: 420 ms, total: 36.1 s
Wall time: 36.4 s


### **Feature Engineering**

In [None]:
cv = CountVectorizer(min_df=20, max_df=0.6, ngram_range=(1,2),  # bag of words with unigrams and bigrams
                     token_pattern=None, tokenizer=lambda doc: doc,
                     preprocessor=lambda doc: doc)
cv_features = cv.fit_transform(norm_papers)
cv_features.shape

(1740, 14408)

In [None]:
vocabulary = np.array(cv.get_feature_names_out())
print('Total Vocabulary Size:', len(vocabulary))

Total Vocabulary Size: 14408


### **Topic Models with Non-Negative Matrix Factorization (NMF)**

Read more about the parameters that can be tuned in the `sklearn.decomposition.NMF` [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html#sklearn.decomposition.NMF).

In [None]:
%%time

TOTAL_TOPICS = 20                                                                                                 # set topics to 20

nmf_model = NMF(n_components=TOTAL_TOPICS, solver='cd', max_iter=1000,                                            # set nmf model parameters
                random_state=42, alpha_H=.1, l1_ratio=.85)
document_topics = nmf_model.fit_transform(cv_features)                                                            # fit model to bow features

In [None]:
topic_terms = nmf_model.components_
topic_key_term_idxs = np.argsort(-np.absolute(topic_terms), axis=1)[:, :20]
topic_keyterms = vocabulary[topic_key_term_idxs]
topics = [', '.join(topic) for topic in topic_keyterms]
pd.set_option('display.max_colwidth', None)
topics_df = pd.DataFrame(topics,
                         columns = ['Terms per Topic'],
                         index=['Topic'+str(t) for t in range(1, TOTAL_TOPICS+1)])
topics_df

,Terms per Topic
Topic1,"bound, generalization, size, optimal, let, solution, equation, approximation, theorem, gradient, class, xi, rate, loss, matrix, convergence, theory, dimension, sample, minimum"
Topic2,"neuron, synaptic, connection, potential, dynamic, activity, synapsis, excitatory, layer, simulation, synapse, inhibitory, delay, biological, equation, state, et, et al, fig, activation"
Topic3,"state, action, policy, step, optimal, reinforcement, transition, reinforcement learning, probability, reward, value function, dynamic, markov, machine, task, agent, finite, iteration, sequence, decision"
Topic4,"image, face, pixel, recognition, local, distance, scale, digit, texture, filter, scene, vision, facial, pca, edge, transformation, representation, visual, surface, database"
Topic5,"hidden, layer, net, hidden unit, task, hidden layer, architecture, back, trained, propagation, connection, back propagation, activation, representation, output unit, generalization, neural net, training set, learn, test"
Topic6,"cell, firing, head, direction, response, rat, layer, cortex, activity, spatial, synaptic, inhibitory, synapsis, simulation, cue, property, complex, active, lot, cortical"
Topic7,"word, recognition, speech, context, hmm, speaker, speech recognition, character, phoneme, probability, frame, sequence, rate, level, test, acoustic, experiment, letter, segmentation, state"
Topic8,"signal, noise, source, filter, component, frequency, channel, speech, matrix, independent, separation, sound, ica, phase, eeg, blind, auditory, dynamic, delay, fig"
Topic9,"control, controller, trajectory, dynamic, motor, movement, task, forward, feedback, arm, inverse, position, robot, architecture, hand, force, adaptive, change, command, target"
Topic10,"circuit, chip, current, analog, voltage, vlsi, gate, threshold, transistor, pulse, design, implementation, synapse, bit, digital, device, analog vlsi, pp, cmos, element"


### **Produce Document-Topic Matrix**

In [None]:
pd.options.display.float_format = '{:,.3f}'.format
dt_df = pd.DataFrame(document_topics, 
                     columns=['T'+str(i) for i in range(1, TOTAL_TOPICS+1)])
dt_df.head(10)

,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,T16,T17,T18,T19,T20
0,0.0,12.776,0.0,0.0,639.88,0.0,0.0,0.0,0.0,0.0,0.0,0.0,550.682,188.026,954.312,0.0,0.0,2.901,320.169,0.0
1,607.484,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.707,53.24,0.0,0.0,0.0,0.0,0.0,1038.811
2,241.364,0.0,0.0,15.56,334.323,0.205,31.155,68.091,191.836,23.796,17.661,87.579,29.07,327.783,0.708,13.89,19.448,56.399,90.52,107.91
3,181.122,247.821,0.0,9.309,0.0,0.0,92.493,260.344,97.541,67.238,0.0,247.365,27.248,0.479,17.492,0.0,0.0,0.0,10.121,46.843
4,1375.297,63.936,69.425,0.874,0.0,0.0,0.0,119.457,23.659,125.153,0.0,0.533,0.0,0.0,0.0,0.0,7.833,1.696,6.116,49.991
5,368.685,65.162,13.38,0.0,0.0,0.0,0.0,0.0,51.199,18.11,0.0,0.0,0.0,0.0,0.0,0.0,27.653,0.0,4.738,333.448
6,113.141,66.287,35.285,0.0,13.685,341.223,0.0,47.721,0.0,0.0,0.0,0.0,0.0,0.0,0.0,13.388,190.274,0.0,0.0,103.878
7,0.0,21.919,128.753,0.0,970.255,0.0,57.912,62.241,33.981,80.387,0.0,106.429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,185.427
8,40.128,52.09,0.0,0.0,0.0,234.903,0.472,123.762,54.22,44.068,44.494,13.885,0.0,18.469,34.054,40.225,0.0,0.0,7.493,40.416
9,31.261,16.017,33.366,0.0,0.0,45.335,0.0,0.0,0.0,5.196,13.175,0.0,0.0,105.629,56.487,90.603,0.0,0.0,86.977,455.645
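Unlike LDA's document-topic probabilities, the NMF weights above are unnormalized, so their magnitudes are not directly comparable across documents. One possible way to make rows comparable, sketched here on made-up numbers, is to divide each row by its sum so it reads as a topic share. The array `toy_doc_topics` and the name `toy_shares` are illustrative assumptions; this step is not part of the original workflow.

```python
import numpy as np

# Illustrative document-topic weights (3 documents x 3 topics); not real output.
toy_doc_topics = np.array([
    [0.0, 3.0, 1.0],
    [2.0, 2.0, 0.0],
    [0.0, 0.0, 0.0],   # a document with no weight on any topic
])

row_sums = toy_doc_topics.sum(axis=1, keepdims=True)

# Divide each row by its sum; all-zero rows stay zero to avoid dividing by zero.
toy_shares = np.divide(toy_doc_topics, row_sums,
                       out=np.zeros_like(toy_doc_topics), where=row_sums > 0)
print(toy_shares)
```

The same operation could be applied to `document_topics` if relative topic shares are easier to interpret than raw weights.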


### **Display Most Prototypical Paper for Each Topic**

In [None]:
pd.options.display.float_format = '{:,.5f}'.format
pd.set_option('display.max_colwidth', 200)

max_score_topics = dt_df.max(axis=0)
dominant_topics = max_score_topics.index
term_score = max_score_topics.values
document_numbers = [dt_df[dt_df[t] == max_score_topics.loc[t]].index[0]
                       for t in dominant_topics]
documents = [papers[i] for i in document_numbers]

results_df = pd.DataFrame({'Dominant Topic': dominant_topics, 'Max Score': term_score,
                          'Paper Num': document_numbers, 'Topic': topics_df['Terms per Topic'], 
                          'Paper Name': documents})
results_df

,Dominant Topic,Max Score,Paper Num,Topic,Paper Name
Topic1,T1,2225.18929,1026,"bound, generalization, size, optimal, let, solution, equation, approximation, theorem, gradient, class, xi, rate, loss, matrix, convergence, theory, dimension, sample, minimum","A Bound on the Error of Cross Validation Using \nthe Approximation and Estimation Rates, with \nConsequences for the Training-Test Split \nMichael Kearns \nAT&T Research \nABSTRACT\n1 INTRODUCTION..."
Topic2,T2,580.02299,346,"neuron, synaptic, connection, potential, dynamic, activity, synapsis, excitatory, layer, simulation, synapse, inhibitory, delay, biological, equation, state, et, et al, fig, activation","Signal Processing by Multiplexing and \nDemultiplexing in Neurons \nDavid C. Tam \nDivision of Neuroscience \nBaylor College of Medicine \nHouston, TX 77030 \ndtamCnext-cns.neusc.bcm.tmc.edu \nAb..."
Topic3,T3,731.6719,1186,"state, action, policy, step, optimal, reinforcement, transition, reinforcement learning, probability, reward, value function, dynamic, markov, machine, task, agent, finite, iteration, sequence, de...","Reinforcement Learning for Mixed \nOpen-loop and Closed-loop Control \nEric A. Hansen, Andrew G. Barto, and Shlomo Zilbersteln \nDepartment of Computer Science \nUniversity of Massachusetts \nAmhe..."
Topic4,T4,758.89359,1707,"image, face, pixel, recognition, local, distance, scale, digit, texture, filter, scene, vision, facial, pca, edge, transformation, representation, visual, surface, database",Image representations for facial expression \ncoding \nMarian Stewart Bartlett* \nU.C. San Diego \nmarnisalk. edu \nJavier R. Movellan \nU.C. San Diego \nmovellancogsc. ucsd. edu \nPaul Ekman \n...
Topic5,T5,970.25529,7,"hidden, layer, net, hidden unit, task, hidden layer, architecture, back, trained, propagation, connection, back propagation, activation, representation, output unit, generalization, neural net, tr...","5O5 \nCONNECTING TO THE PAST \nBruce A. MacDonald, Assistant Professor \nKnowledge Sciences Laboratory, Computer Science Department \nThe University of Calgary, 2500 University Drive NW \nCalgary,..."
Topic6,T6,1249.35793,56,"cell, firing, head, direction, response, rat, layer, cortex, activity, spatial, synaptic, inhibitory, synapsis, simulation, cue, property, complex, active, lot, cortical","317 \nPARTITIONING OF SENSORY DATA BY A COPTICAI, NETWOPK  \nRichard Granger, Jos Ambros-Ingerson, Howard Henry, Gary Lynch \nCenter for the Neurobiology of Learning and Memory \nUniversity of..."
Topic7,T7,1391.51985,1410,"word, recognition, speech, context, hmm, speaker, speech recognition, character, phoneme, probability, frame, sequence, rate, level, test, acoustic, experiment, letter, segmentation, state","Comparison of Human and Machine Word \nRecognition \nM. Schenkel \nDept of Electrical Eng. \nUniversity of Sydney \nSydney, NSW 2006, Australia \nschenkel@sedal.usyd.edu.au \nC. Latimer \nDept of ..."
Topic8,T8,1488.39833,251,"signal, noise, source, filter, component, frequency, channel, speech, matrix, independent, separation, sound, ica, phase, eeg, blind, auditory, dynamic, delay, fig","232 Sejnowski, Yuhas, Goldstein and Jenkins \nCombining Visual and \nwith a Neural Network \nAcoustic Speech Signals \nImproves Intelligibility \nT.J. Sejnowski \nThe Salk Institute \nand \nDepart..."
Topic9,T9,1895.13379,911,"control, controller, trajectory, dynamic, motor, movement, task, forward, feedback, arm, inverse, position, robot, architecture, hand, force, adaptive, change, command, target","An Integrated Architecture of Adaptive Neural Network \nControl for Dynamic Systems \nLiu Ke '2 Robert L. Tokaf Brian D.McVey z \nCenter for Nonlinear Studies, 2Applied Theoretical Physics Divis..."
Topic10,T10,1057.24839,1644,"circuit, chip, current, analog, voltage, vlsi, gate, threshold, transistor, pulse, design, implementation, synapse, bit, digital, device, analog vlsi, pp, cmos, element","Kirchoff Law Markov Fields for Analog \nCircuit Design \nRichard M. Golden * \nRMG Consulting Inc. \n2000 Fresno Road, Plano, Texas 75074 \nRMG CONS UL T@A OL. COM, \nwww. neural-network. corn \nA..."
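The lookup of the highest-scoring paper per topic can also be written with `DataFrame.idxmax`, which returns the row label of the first maximum in each column. The sketch below uses a small made-up frame; `toy_dt`, `top_docs`, `top_scores`, and `summary` are hypothetical names, and this is an alternative formulation rather than the code used above.

```python
import pandas as pd

# Made-up document-topic weights: 3 documents (rows) x 3 topics (columns).
toy_dt = pd.DataFrame(
    {'T1': [0.0, 5.0, 1.0], 'T2': [7.0, 0.0, 2.0], 'T3': [0.5, 0.0, 9.0]}
)

top_docs = toy_dt.idxmax(axis=0)   # row label of the maximum in each topic column
top_scores = toy_dt.max(axis=0)    # the maximum weight itself

summary = pd.DataFrame({'Paper Num': top_docs, 'Max Score': top_scores})
print(summary)
```

On ties, `idxmax` keeps the first matching row, which matches the `.index[0]` behavior of the list comprehension in the cell above.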


### **Visualizing Topic Models**

In [None]:
import dill

with open('nmf_model.pkl', 'wb') as f:
    dill.dump(nmf_model, f)
with open('cv_features.pkl', 'wb') as f:
    dill.dump(cv_features, f)
with open('cv.pkl', 'wb') as f:
    dill.dump(cv, f)

In [None]:
pyLDAvis.sklearn.prepare(nmf_model, cv_features, cv, mds='mmds')

### **Predict Topics for New Research Papers**

**You'll need to download the `test_data` from Canvas, or download your own papers from the NeurIPS website under the conference proceedings and create text documents from each paper's metadata tab.**

In [None]:
import glob

new_paper_files = glob.glob('/content/nips16*.txt')
new_papers = []
for fn in new_paper_files:
    with open(fn, encoding='utf-8', errors='ignore', mode='r') as f:  # read-only is enough here
        data = f.read()
        new_papers.append(data)
              
print('Total New Papers:', len(new_papers))

Total New Papers: 3


In [None]:
norm_new_papers = normalize_corpus(new_papers)
cv_new_features = cv.transform(norm_new_papers)
cv_new_features.shape

(3, 14408)

In [None]:
topic_predictions = nmf_model.transform(cv_new_features)
best_topics = [[(topic, round(sc, 3)) 
                    for topic, sc in sorted(enumerate(topic_predictions[i]), 
                                            key=lambda row: -row[1])[:2]] 
                        for i in range(len(topic_predictions))]
best_topics

[[(4, 225.477), (17, 199.988)],
 [(15, 412.403), (0, 234.238)],
 [(0, 807.584), (15, 303.516)]]
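Note that the topic numbers in this output are 0-based positions in the model, while the labels in `topics_df` start at `Topic1`. The sketch below shows one way to pick the top two topics per document with `np.argsort` and convert them to 1-based labels. The array `toy_predictions` is made up, and `top_k`, `top_idx`, and `top_labels` are hypothetical names.

```python
import numpy as np

# Made-up topic weights for 2 documents over 4 topics.
toy_predictions = np.array([
    [0.2, 5.0, 0.0, 3.1],
    [4.0, 0.0, 0.5, 0.1],
])

top_k = 2
# Sort each row by descending weight and keep the first top_k column positions.
top_idx = np.argsort(-toy_predictions, axis=1)[:, :top_k]

# Shift 0-based positions to the 1-based labels used in topics_df.
top_labels = [['Topic' + str(i + 1) for i in row] for row in top_idx]
print(top_labels)
```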

In [None]:
results_df = pd.DataFrame()
results_df['Papers'] = range(1, len(new_papers)+1)
results_df['Dominant Topics'] = [[topic_num+1 for topic_num, sc in item] for item in best_topics]
res = results_df.set_index(['Papers'])['Dominant Topics'].apply(pd.Series).stack().reset_index(level=1, drop=True)
results_df = pd.DataFrame({'Dominant Topics': res.values}, index=res.index)
results_df['Topic Score'] = [topic_sc for topic_list in 
                                        [[round(sc*100, 2) 
                                              for topic_num, sc in item] 
                                                 for item in best_topics] 
                                    for topic_sc in topic_list]

results_df['Topic Desc'] = [topics_df.iloc[t-1]['Terms per Topic'] for t in results_df['Dominant Topics'].values]
results_df['Paper Desc'] = [new_papers[i-1][:200] for i in results_df.index.values]

results_df

Papers,Dominant Topics,Topic Score,Topic Desc,Paper Desc
1,5,22547.7,"hidden, layer, net, hidden unit, task, hidden layer, architecture, back, trained, propagation, connection, back propagation, activation, representation, output unit, generalization, neural net, tr...","{""title"": ""Generating Videos with Scene Dynamics"", ""book"": ""Advances in Neural Information Processing Systems"", ""page_first"": 613, ""page_last"": 621, ""abstract"": ""We capitalize on large amounts of ..."
1,18,19998.8,"object, view, recognition, representation, layer, visual, 3d, 2d, part, human, object recognition, position, transformation, scheme, image, aspect, frame, shape, viewpoint, rotation","{""title"": ""Generating Videos with Scene Dynamics"", ""book"": ""Advances in Neural Information Processing Systems"", ""page_first"": 613, ""page_last"": 621, ""abstract"": ""We capitalize on large amounts of ..."
2,16,41240.3,"distribution, probability, gaussian, mixture, variable, density, likelihood, prior, bayesian, component, posterior, em, log, estimate, sample, approximation, estimation, matrix, conditional, maximum","{""title"": ""Online Bayesian Moment Matching for Topic Modeling with Unknown Number of Topics"", ""book"": ""Advances in Neural Information Processing Systems"", ""page_first"": 4536, ""page_last"": 4544, ""a..."
2,1,23423.8,"bound, generalization, size, optimal, let, solution, equation, approximation, theorem, gradient, class, xi, rate, loss, matrix, convergence, theory, dimension, sample, minimum","{""title"": ""Online Bayesian Moment Matching for Topic Modeling with Unknown Number of Topics"", ""book"": ""Advances in Neural Information Processing Systems"", ""page_first"": 4536, ""page_last"": 4544, ""a..."
3,1,80758.4,"bound, generalization, size, optimal, let, solution, equation, approximation, theorem, gradient, class, xi, rate, loss, matrix, convergence, theory, dimension, sample, minimum","{""title"": ""Eliciting Categorical Data for Optimal Aggregation"", ""book"": ""Advances in Neural Information Processing Systems"", ""page_first"": 2450, ""page_last"": 2458, ""abstract"": ""Models for collecti..."
3,16,30351.6,"distribution, probability, gaussian, mixture, variable, density, likelihood, prior, bayesian, component, posterior, em, log, estimate, sample, approximation, estimation, matrix, conditional, maximum","{""title"": ""Eliciting Categorical Data for Optimal Aggregation"", ""book"": ""Advances in Neural Information Processing Systems"", ""page_first"": 2450, ""page_last"": 2458, ""abstract"": ""Models for collecti..."
