# Topic Modeling with Scikit-Learn

Welcome to a new NLP project!

In this project, we use the Scikit-Learn framework to explore topic modeling.
- Topic models are designed to extract distinguishing concepts, or topics, from a large corpus of varied documents, each of which may discuss one or more concepts. The main aim of topic modeling is to use mathematical and statistical techniques to discover hidden, latent semantic structures in a corpus.
- Topic modeling extracts features from document terms and applies mathematical frameworks such as matrix factorization and SVD to generate clusters of terms that are distinguishable from each other. These clusters of words form topics or concepts, which can be used to interpret the main themes of a corpus and to make semantic connections among words that co-occur frequently across documents.

We build some topic models using the following methods:
- Latent Semantic Indexing (LSI)
- Latent Dirichlet Allocation (LDA) - Default
- Non-negative Matrix Factorization (NMF)

## Set up the working directory & Import packages ##

In [1]:
# Move to the working directory on Google Drive when running on Google Colab
import os
if 'google.colab' in str(get_ipython()):
  print('Running on CoLab')
  PROJECT_ROOT ="/content/drive/MyDrive/GitHub/NLP-Topic-Modeling"
else:
  PROJECT_ROOT ="."
os.chdir(PROJECT_ROOT)
!pwd

Running on CoLab
/content/drive/MyDrive/GitHub/NLP-Topic-Modeling
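The check above relies on `get_ipython()`, which only exists inside an IPython session. A minimal sketch of an equivalent check that also works in a plain Python script, assuming that Colab has already loaded the `google.colab` module (the helper name `running_in_colab` is made up for illustration):

```python
import sys

def running_in_colab() -> bool:
    """Return True when the google.colab module is already loaded (assumed Colab-only)."""
    return "google.colab" in sys.modules

# Same project-root logic as the cell above, without calling get_ipython().
PROJECT_ROOT = (
    "/content/drive/MyDrive/GitHub/NLP-Topic-Modeling" if running_in_colab() else "."
)
print(PROJECT_ROOT)
```

Outside Colab this falls back to the current directory, just like the original `else` branch.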


In [2]:
# Get the running time of each cell
#  (similar to the ExecuteTime extension for Jupyter Notebook)
!pip install ipython-autotime
%load_ext autotime

Collecting ipython-autotime
  Downloading ipython_autotime-0.3.1-py2.py3-none-any.whl (6.8 kB)
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.3.1
time: 180 µs (started: 2021-09-21 21:31:55 +00:00)


In [3]:
!pip install pyLDAvis==2.1.2

Collecting pyLDAvis==2.1.2
  Downloading pyLDAvis-2.1.2.tar.gz (1.6 MB)
Collecting funcy
  Downloading funcy-1.16-py2.py3-none-any.whl (32 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97738 sha256=7c54dbb0e2f77e3c341b391a2fe67928221229e4030a39665a882e0bddd22e0f
  Stored in directory: /root/.cache/pip/wheels/3b/fb/41/e32e5312da9f440d34c4eff0d2207b46dc9332a7b931ef1e89
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.16 pyLDAvis-2.1.2
time: 5.05 s (started: 2021-09-21 21:31:55 +00:00)


In [4]:
import pyLDAvis
import pyLDAvis.sklearn
import warnings

warnings.filterwarnings('ignore')
pyLDAvis.enable_notebook()



time: 1.38 s (started: 2021-09-21 21:32:00 +00:00)


In [5]:
import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from tqdm import tqdm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.decomposition import NMF
import dill

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
time: 1.33 s (started: 2021-09-21 21:32:01 +00:00)


In [6]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['figure.facecolor'] = 'white'
%matplotlib inline

time: 13.6 ms (started: 2021-09-21 21:32:03 +00:00)


## Load and View Dataset

We load all the research papers into the `papers` list. Each paper is stored in its own text file.

In [7]:
DATA_PATH = 'datasets/nipstxt/'
print(os.listdir(DATA_PATH))

# Read all texts into a list.
folders = ["nips{0:02}".format(i) for i in range(0, 13)]
papers = []
for folder in folders:
  folder_path = os.path.join(DATA_PATH, folder)
  for file_name in os.listdir(folder_path):
    # Read-only mode is sufficient; undecodable bytes are skipped.
    with open(os.path.join(folder_path, file_name),
              encoding='utf-8', errors='ignore', mode='r') as f:
      papers.append(f.read())
len(papers)

['README_yann', 'nips08', 'nips04', 'nips06', 'nips12', 'nips05', 'nips11', 'nips09', 'nips07', 'nips03', 'nips02', 'nips00', 'nips01', 'nips10', 'idx', 'orig', 'MATLAB_NOTES', 'RAW_DATA_NOTES']


1740

time: 5min 55s (started: 2021-09-21 21:32:03 +00:00)


There are a total of 1,740 research papers, which is not a small number! Let’s take a look at a fragment of text from one of the research papers to get an idea.

In [8]:
print(papers[0][:1000])

1 
CONNECTIVITY VERSUS ENTROPY 
Yaser S. Abu-Mostafa 
California Institute of Technology 
Pasadena, CA 91125 
ABSTRACT 
How does the connectivity of a neural network (number of synapses per 
neuron) relate to the complexity of the problems it can handle (measured by 
the entropy)? Switching theory would suggest no relation at all, since all Boolean 
functions can be implemented using a circuit with very low connectivity (e.g., 
using two-input NAND gates). However, for a network that learns a problem 
from examples using a local learning rule, we prove that the entropy of the 
problem becomes a lower bound for the connectivity of the network. 
INTRODUCTION 
The most distinguishing feature of neural networks is their ability to spon- 
taneously learn the desired function from 'training' samples, i.e., their ability 
to program themselves. Clearly, a given neural network cannot just learn any 
function, there must be some restrictions on which networks can learn which 
functions. One obv

## Basic Text Wrangling
Before diving into topic modeling, we perform some basic text wrangling (preprocessing). We keep things simple: lowercase and tokenize the text, drop purely numeric tokens, lemmatize the remaining tokens as nouns, and remove stop words and single-character terms.

In [9]:
# A set makes the stop-word membership test O(1) per token.
stop_words = set(nltk.corpus.stopwords.words('english'))
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

def normalize_corpus(papers):
    norm_papers = []
    for paper in papers:
        paper = paper.lower()
        paper_tokens = [token.strip() for token in wtk.tokenize(paper)]
        # Drop purely numeric tokens, then lemmatize (as nouns by default).
        paper_tokens = [wnl.lemmatize(token) for token in paper_tokens if not token.isnumeric()]
        # Remove single-character tokens and stop words.
        paper_tokens = [token for token in paper_tokens if len(token) > 1]
        paper_tokens = [token for token in paper_tokens if token not in stop_words]
        paper_tokens = list(filter(None, paper_tokens))
        # Keep only papers that still contain tokens.
        if paper_tokens:
            norm_papers.append(paper_tokens)

    return norm_papers
    
norm_papers = normalize_corpus(papers)
print(len(norm_papers))

1740
time: 37.1 s (started: 2021-09-21 21:37:58 +00:00)


In [10]:
# viewing a processed paper
print(norm_papers[0][:50])

['connectivity', 'versus', 'entropy', 'yaser', 'abu', 'mostafa', 'california', 'institute', 'technology', 'pasadena', 'ca', 'abstract', 'doe', 'connectivity', 'neural', 'network', 'number', 'synapsis', 'per', 'neuron', 'relate', 'complexity', 'problem', 'handle', 'measured', 'entropy', 'switching', 'theory', 'would', 'suggest', 'relation', 'since', 'boolean', 'function', 'implemented', 'using', 'circuit', 'low', 'connectivity', 'using', 'two', 'input', 'nand', 'gate', 'however', 'network', 'learns', 'problem', 'example', 'using']
time: 1.39 ms (started: 2021-09-21 21:38:35 +00:00)
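The token `'doe'` in the output above comes from the order of the steps: `'does'` is lemmatized as a noun to `'doe'` before stop words are removed, so it no longer matches the stop-word list. The sketch below reproduces the effect with stand-ins, a four-word stop list and a toy lemma table (both hypothetical, not NLTK's data), and shows that filtering stop words first avoids it.

```python
# Hypothetical stand-ins for NLTK's stop-word list and WordNet lemmatizer.
STOP_WORDS = {"how", "does", "the", "of"}
TOY_LEMMAS = {"does": "doe", "networks": "network"}

def lemmatize(token):
    # Look the token up in the toy table; unknown tokens pass through unchanged.
    return TOY_LEMMAS.get(token, token)

def lemmatize_then_filter(tokens):
    # Order used in normalize_corpus: lemmatize first, remove stop words afterwards.
    lemmas = [lemmatize(t) for t in tokens]
    return [t for t in lemmas if t not in STOP_WORDS]

def filter_then_lemmatize(tokens):
    # Alternative order: remove stop words first, then lemmatize what is left.
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [lemmatize(t) for t in kept]

tokens = ["how", "does", "the", "connectivity", "of", "networks", "scale"]
print(lemmatize_then_filter(tokens))
print(filter_then_lemmatize(tokens))
```

Whether the reordering is worthwhile depends on the corpus; the outputs shown in this notebook were produced with the original order.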


## Text Representation with Feature Engineering

In [11]:
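# Documents are already token lists, so the tokenizer and preprocessor are identity functions.
# min_df=20 drops terms found in fewer than 20 papers; max_df=0.6 drops terms found in more than 60% of papers.
cv = CountVectorizer(min_df=20, max_df=0.6, ngram_range=(1,2),
                     token_pattern=None, tokenizer=lambda doc: doc,
                     preprocessor=lambda doc: doc)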
cv_features = cv.fit_transform(norm_papers)
cv_features.shape

(1740, 14408)

time: 12.4 s (started: 2021-09-21 21:38:35 +00:00)


In [12]:
# Validate dictionary representation of the documents.
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0+.
vocabulary = np.array(cv.get_feature_names())
print('Total Vocabulary Size:', len(vocabulary))

Total Vocabulary Size: 14408
time: 33.9 ms (started: 2021-09-21 21:38:48 +00:00)
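To make the `min_df` and `max_df` arguments concrete: an integer `min_df` is a minimum number of documents, and a float `max_df` is a maximum fraction of documents in which a term may appear. A rough pure-Python sketch of that filter on a toy corpus (the corpus and the helper `filter_by_document_frequency` are invented for illustration; this is not how `CountVectorizer` is implemented):

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep terms found in at least `min_df` docs and at most `max_df` (a fraction) of docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    max_docs = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_docs)

toy_docs = [
    ["network", "neuron", "spike"],
    ["network", "image", "feature"],
    ["network", "neuron", "cell"],
    ["network", "image", "markov"],
    ["network", "markov", "state"],
]
vocab = filter_by_document_frequency(toy_docs, min_df=2, max_df=0.6)
print(vocab)
```

Here `network` is dropped because it occurs in every document, and terms that occur only once are dropped as too rare.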


## Topic Models with Latent Semantic Indexing (LSI)
- LSI is a statistical technique that uncovers latent terms that correlate semantically to form topics. Its main principle is that similar terms tend to be used in the same context and hence tend to co-occur more often.
- LSI uses the popular Singular Value Decomposition (SVD) technique.
- LSI is used not only for text summarization but also for information retrieval and search.
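As a small illustration of the SVD step, the sketch below factorizes a toy document-term count matrix with `numpy.linalg.svd` and keeps the top `k` singular vectors. The matrix and term list are invented; `TruncatedSVD` computes a comparable truncated factorization with a different (by default randomized) solver, so this is an analogy rather than its implementation.

```python
import numpy as np

terms = ["neuron", "spike", "cell", "markov", "state", "policy"]
# Rows are documents, columns are term counts (toy data).
X = np.array([
    [3, 2, 2, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 2, 3, 2],
    [0, 0, 0, 3, 2, 3],
], dtype=float)

k = 2  # number of latent topics to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)
document_topics = U[:, :k] * s[:k]   # shape (n_docs, k), analogous to fit_transform output
topic_terms = Vt[:k]                 # shape (k, n_terms), analogous to components_

for i, row in enumerate(topic_terms):
    top = np.argsort(-np.abs(row))[:3]  # strongest terms by absolute weight
    print(f"Topic {i + 1}:", [terms[j] for j in top])

# Rank-k reconstruction error: small when k topics capture most of the structure.
print(np.linalg.norm(X - document_topics @ topic_terms))
```

Because the two groups of documents share no terms, each of the two retained components picks up one group of co-occurring terms.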

### Build model

In [13]:
TOTAL_TOPICS = 20    # optimal number of topics, found in the previous project
lsi_model = TruncatedSVD(n_components=TOTAL_TOPICS, n_iter=500, random_state=42)
document_topics = lsi_model.fit_transform(cv_features)

time: 1min 21s (started: 2021-09-21 21:38:48 +00:00)


In [14]:
topic_terms = lsi_model.components_
topic_terms.shape

(20, 14408)

time: 6.14 ms (started: 2021-09-21 21:40:10 +00:00)


### View the major topics/themes

In [15]:
top_terms = 20
# Rank terms by absolute weight, since LSI weights can be negative.
topic_key_term_idxs = np.argsort(-np.absolute(topic_terms), axis=1)[:, :top_terms]
topic_keyterm_weights = np.array([topic_terms[row, columns] 
                             for row, columns in list(zip(np.arange(TOTAL_TOPICS), topic_key_term_idxs))])
topic_keyterms = vocabulary[topic_key_term_idxs]
topic_keyterms_weights = list(zip(topic_keyterms, topic_keyterm_weights))
for n in range(TOTAL_TOPICS):
    print('Topic #'+str(n+1)+':')
    print('='*50)
    d1 = []
    d2 = []
    terms, weights = topic_keyterms_weights[n]
    term_weights = sorted([(t, w) for t, w in zip(terms, weights)], 
                          key=lambda row: -abs(row[1]))
    for term, wt in term_weights:
        if wt >= 0:
            d1.append((term, round(wt, 3)))
        else:
            d2.append((term, round(wt, 3)))

    print('Direction 1:', d1)
    print('-'*50)
    print('Direction 2:', d2)
    print('-'*50)
    print()


Topic #1:
Direction 1: [('state', 0.221), ('neuron', 0.169), ('image', 0.138), ('cell', 0.13), ('layer', 0.13), ('feature', 0.127), ('probability', 0.121), ('hidden', 0.114), ('distribution', 0.105), ('rate', 0.098), ('signal', 0.095), ('task', 0.093), ('class', 0.092), ('noise', 0.09), ('net', 0.089), ('recognition', 0.089), ('representation', 0.088), ('field', 0.082), ('rule', 0.082), ('step', 0.08)]
--------------------------------------------------
Direction 2: []
--------------------------------------------------

Topic #2:
Direction 1: [('cell', 0.417), ('neuron', 0.39), ('response', 0.175), ('stimulus', 0.155), ('visual', 0.131), ('spike', 0.13), ('firing', 0.117), ('synaptic', 0.11), ('activity', 0.104), ('cortex', 0.097), ('field', 0.085), ('frequency', 0.085), ('direction', 0.082), ('circuit', 0.082), ('motion', 0.082)]
--------------------------------------------------
Direction 2: [('state', -0.289), ('probability', -0.109), ('hidden', -0.098), ('class', -0.091), ('policy',

**Note that**:
- The higher the weight, the more important the contribution.
- The sign on each term indicates a sense of direction or orientation in the vector space for a particular topic. So similar correlated terms have the same sign or direction.

This is why the output above separates the terms of each topic by sign: each direction can then be interpreted as its own theme.
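One caveat when reading the two directions: the overall sign of an SVD component is arbitrary. Flipping the sign of one topic's term weights together with the matching column of document scores leaves the reconstruction unchanged, so which group is labeled Direction 1 or Direction 2 carries no meaning by itself; only the opposition between the groups does. A minimal NumPy check on invented data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(6, 5)).astype(float)  # toy document-term counts

U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_topics = U * s          # document scores per component
topic_terms = Vt.copy()     # term weights per component

# Flip the sign of the second component in both factors.
flipped_docs = doc_topics.copy()
flipped_terms = topic_terms.copy()
flipped_docs[:, 1] *= -1
flipped_terms[1, :] *= -1

# Both factorizations reproduce the same matrix.
print(np.allclose(doc_topics @ topic_terms, flipped_docs @ flipped_terms))
```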

### View the proportion of each topic per document

In [16]:
# Get the document-topic matrix,
#  which shows the weight of each topic per document
#  (a larger absolute weight means the topic is more dominant in the document)
dt_df = pd.DataFrame(np.round(document_topics, 3), 
                     columns=['T'+str(i) for i in range(1, TOTAL_TOPICS+1)])
dt_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,1700,1701,1702,1703,1704,1705,1706,1707,1708,1709,1710,1711,1712,1713,1714,1715,1716,1717,1718,1719,1720,1721,1722,1723,1724,1725,1726,1727,1728,1729,1730,1731,1732,1733,1734,1735,1736,1737,1738,1739
T1,19.875,28.542,35.743,46.58,20.943,40.64,25.488,44.865,41.207,39.717,61.973,38.008,19.562,26.408,29.155,16.437,32.633,9.863,26.854,26.435,20.107,32.413,21.56,26.058,17.774,42.274,23.13,41.24,32.52,38.178,14.773,25.907,50.279,41.058,57.723,28.231,46.477,64.387,56.191,33.197,...,25.148,28.173,30.177,24.224,24.81,42.257,41.72,30.98,46.058,34.049,51.332,37.91,32.311,51.697,36.537,28.628,38.544,26.53,26.262,25.953,31.059,27.079,39.029,39.996,35.729,29.186,43.202,36.184,47.523,46.935,27.643,32.831,43.85,25.679,51.97,40.438,40.425,30.99,46.227,34.768
T2,6.621,-1.013,15.339,-3.142,-5.107,34.036,-5.832,53.061,28.411,15.803,76.753,0.648,-6.945,-3.343,21.634,2.358,18.856,-1.871,2.662,25.82,4.801,34.87,6.716,-0.357,-3.912,1.726,-4.835,-10.152,30.041,12.68,1.322,18.406,-0.977,-11.349,-0.892,-0.721,-8.457,7.883,-19.533,8.359,...,-4.536,-18.016,-3.665,7.12,-2.135,20.01,28.335,-13.963,-11.195,-8.576,-6.196,2.415,-10.548,-4.074,-20.13,-10.822,3.154,-13.144,-6.212,1.214,-8.146,-6.517,7.613,-28.144,-4.67,-9.614,-8.751,-21.41,-43.199,-31.116,-16.505,1.312,-31.794,-7.744,-47.552,-21.47,-28.742,-25.951,-34.977,-11.836
T3,6.901,10.889,8.195,9.403,-4.495,-7.725,3.284,25.235,7.661,4.083,21.601,0.832,8.693,-10.793,6.546,5.168,12.085,-1.272,17.76,5.257,-0.567,10.749,7.3,-6.436,1.502,-7.283,-7.784,-5.385,15.089,-14.701,5.769,3.165,9.098,25.269,-5.349,16.053,12.215,16.075,14.272,-6.2,...,-0.678,14.523,-19.333,-7.312,-11.548,-22.862,-9.267,-3.506,-45.472,-23.708,-49.62,-27.608,5.98,-51.893,36.419,3.483,4.987,-8.551,-18.248,-3.542,-3.255,-9.752,-11.751,34.184,-24.388,-18.183,-39.015,34.469,71.275,22.727,17.544,6.881,28.409,16.584,74.061,33.258,31.823,34.19,42.456,-2.838
T4,-12.437,-4.124,-12.896,-16.608,-9.058,11.906,-6.972,-14.257,-28.256,-13.368,2.504,-1.92,-3.513,-2.24,-7.502,-5.587,-3.427,-3.087,-3.397,-4.165,4.045,-11.371,1.774,-2.603,-5.042,-8.394,-5.905,-1.585,12.833,25.229,-4.901,2.862,-15.599,2.506,-20.905,15.456,-1.709,11.57,0.086,0.36,...,-9.012,2.661,1.408,5.33,11.615,15.496,25.008,8.717,23.541,2.464,38.539,19.978,0.55,49.427,17.599,-8.2,-13.335,-9.464,1.509,-5.38,-7.138,-3.635,2.235,22.421,7.786,-1.926,27.755,16.593,40.027,-4.266,6.431,2.634,2.059,8.284,38.714,22.846,15.627,13.784,20.365,-15.593
T5,0.486,2.986,15.392,10.623,7.615,-16.534,6.618,2.599,7.78,7.242,-11.041,13.015,-1.869,22.351,0.605,-1.485,10.644,3.455,7.543,-7.91,-2.117,-0.405,7.323,3.179,-2.194,-3.958,3.375,25.71,-8.494,-2.389,0.574,0.767,13.568,-0.431,40.715,-4.269,-10.586,-8.81,50.193,17.794,...,-6.433,-6.279,4.057,-4.47,-8.468,-29.037,-12.912,-24.297,-20.47,-28.593,-2.26,-23.086,-6.551,-21.342,-2.935,-12.866,-4.907,-5.5,-0.972,-7.368,1.579,-14.202,-16.915,-2.164,-7.153,-7.298,-18.84,-1.06,-10.952,-15.803,-10.094,-19.416,-24.891,-1.335,-9.818,-15.625,-6.168,-14.767,-21.917,-15.64
T6,-12.345,-8.386,-19.503,-12.045,3.771,28.526,1.415,-6.171,-24.211,-20.493,27.667,-5.612,-4.911,6.13,0.332,-5.476,-18.962,0.672,-12.153,18.177,1.885,10.698,7.397,-1.389,-0.094,3.388,0.267,7.709,25.439,-35.468,-10.863,2.634,-10.784,-5.521,0.16,3.513,-8.04,16.16,12.312,-14.401,...,-4.033,-1.727,4.099,-3.545,-3.801,7.663,2.365,-11.73,-19.396,-15.132,-19.988,-13.817,-0.689,-33.519,-1.299,-3.721,-16.038,3.976,1.392,0.502,4.988,-7.294,-13.568,3.936,5.102,0.723,-22.516,6.211,1.563,5.392,0.107,15.578,-0.492,-2.257,1.262,-0.151,4.616,1.075,-0.68,4.404
T7,2.338,-6.295,3.969,-13.328,-8.912,-6.555,-13.522,22.992,32.674,6.068,9.542,-18.856,-4.419,11.448,4.929,-1.637,5.825,-3.861,1.128,6.338,-6.071,8.759,0.041,-10.312,-1.985,-9.013,-5.809,-35.487,-1.665,2.072,-2.904,-3.28,-3.7,-0.76,-33.546,1.173,-12.092,-3.892,-28.829,0.655,...,-3.427,11.143,26.393,3.806,3.428,11.085,-3.273,10.282,15.697,7.893,16.872,1.791,-4.653,14.565,10.767,2.631,-3.079,12.962,2.477,0.39,-3.196,0.048,-1.495,16.651,20.456,12.992,12.851,2.442,10.86,4.562,-0.207,-1.715,6.574,-6.449,14.963,0.971,-2.428,-0.225,8.646,-11.088
T8,-9.726,-3.0,0.965,-19.17,-3.682,-5.681,2.267,-3.439,-6.19,8.278,1.975,-7.554,-9.254,5.588,0.386,0.022,-4.327,0.189,-7.526,-2.577,1.663,-6.897,0.758,0.444,-0.957,-7.26,4.941,13.459,4.501,-6.602,-0.08,-1.043,7.657,-5.87,1.034,1.066,4.081,-4.527,2.247,-10.861,...,10.635,12.502,6.971,20.049,8.843,-3.623,4.988,1.932,-16.684,5.101,-23.859,4.307,4.139,-11.503,-4.593,12.862,-1.856,-1.531,-11.032,0.343,-0.025,9.829,28.297,-3.277,-14.25,-7.423,-19.979,0.823,-16.423,-28.766,-7.084,-0.818,-4.835,8.033,-16.091,-1.173,-2.27,-8.967,-8.538,-4.322
T9,-4.675,-3.557,0.939,13.668,-1.345,-0.779,-3.886,4.34,-6.82,12.494,-18.838,13.34,2.724,1.319,9.17,2.997,17.273,0.098,-8.924,1.979,5.208,-5.522,-4.313,0.452,-3.153,5.522,-1.381,-30.652,19.309,4.216,0.54,0.694,-7.076,-2.528,-18.733,-5.479,6.762,-9.038,-18.139,25.506,...,8.427,-4.2,1.527,5.211,1.974,-9.645,-4.896,-16.402,-20.269,-12.837,-2.797,-9.16,1.924,-9.045,4.575,-4.695,38.41,1.861,-1.993,5.66,4.35,-3.041,1.486,-3.166,-2.541,-4.812,0.588,-6.278,-8.726,9.841,3.488,-1.686,-8.082,15.654,-20.378,0.036,-13.175,2.683,-10.459,-5.32
T10,-0.788,1.996,-8.685,27.952,6.169,-2.211,-1.268,-6.516,-28.475,-4.172,-17.032,8.1,7.27,0.274,4.059,-6.458,11.458,1.068,-0.433,3.387,-4.24,1.467,2.454,2.614,1.982,7.987,-1.09,11.795,-0.569,12.835,4.218,-8.663,10.953,6.132,9.663,-0.377,2.857,9.553,18.847,20.09,...,-2.919,8.417,-5.183,-0.506,3.173,13.984,-1.794,13.927,8.319,14.441,-9.846,2.917,0.357,4.377,0.552,0.123,38.086,-2.462,-15.318,8.089,-8.582,2.555,0.468,-1.845,5.946,-11.09,4.01,3.458,0.463,2.515,-4.447,4.018,0.428,-6.867,17.049,0.008,3.501,-6.856,-0.829,19.941


time: 230 ms (started: 2021-09-21 21:40:10 +00:00)


Ignoring the sign, we can try to find out the most important topics for a few sample papers and see if they make sense.

In [17]:
document_numbers = [13, 250, 500]

for document_number in document_numbers:
    top_topics = list(dt_df.columns[np.argsort(-np.absolute(dt_df.iloc[document_number].values))[:3]])
    print('Document #'+str(document_number)+':')
    print('Dominant Topics (top 3):', top_topics)
    print('Paper Summary:')
    print(papers[document_number][:500])
    print()

Document #13:
Dominant Topics (top 3): ['T1', 'T5', 'T7']
Paper Summary:
144 
SPEECH RECOGNITION EXPERIMENTS 
WITH PERCEPTRONS 
D. J. Burr 
Bell Communications Research 
Morristown, NJ 07960 
ABSTRACT 
Artificial neural networks (ANNs) are capable of accurate recognition of 
simple speech vocabularies such as isolated digits [1]. This paper looks at two 
more difficult vocabularies, the alphabetic E-set and a set of polysyllabic 
words. The E-set is difficult because it contains weak discriminants and 
polysyllables are difficult because of timing variation. Polysyll

Document #250:
Dominant Topics (top 3): ['T1', 'T7', 'T5']
Paper Summary:
642 Chauvin 
Dynamic Behavior of Constrained 
Back-Propagation Networks 
Yves Chauvin 1 
Thomson-CSF, Inc. 
630 Hansen Way, Suite 250 
Palo Alto, CA. 94304 
ABSTRACT 
The learning dynamics of the back-propagation algorithm are in- 
vestigated when complexity constraints are added to the standard 
Least Mean Square (LMS) cost function. It is shown th

## Topic Models with Latent Dirichlet Allocation (LDA)

### Build model

In [18]:
# Use the whole corpus as a single mini-batch for the online variational updates.
BATCH_SIZE = cv_features.shape[0]
lda_model = LatentDirichletAllocation(n_components=TOTAL_TOPICS, 
                                      max_iter=500, max_doc_update_iter=50,
                                      learning_method='online', 
                                      batch_size=BATCH_SIZE, learning_offset=50., 
                                      random_state=42, n_jobs=16)
document_topics = lda_model.fit_transform(cv_features)

time: 20min 32s (started: 2021-09-21 21:40:10 +00:00)


### View the major topics/themes

In [19]:
topic_terms = lda_model.components_
topic_terms

array([[5.01462650e-02, 5.00905019e-02, 3.59025944e-01, ...,
        5.01883981e-02, 5.81127851e-02, 5.03026923e-02],
       [5.00110145e-02, 5.00104880e-02, 5.00115279e-02, ...,
        5.00160397e-02, 5.00139241e-02, 5.00110167e-02],
       [5.00107536e-02, 5.00112827e-02, 5.00145413e-02, ...,
        5.00199098e-02, 5.00098675e-02, 5.00095204e-02],
       ...,
       [5.00389483e-02, 1.14275402e-01, 6.72729590e+00, ...,
        9.97546778e+01, 4.61222273e+00, 5.00137029e-02],
       [5.00100252e-02, 5.00092796e-02, 5.00170307e-02, ...,
        5.00200764e-02, 5.00127201e-02, 5.00115672e-02],
       [5.00131265e-02, 5.00125290e-02, 5.00123134e-02, ...,
        5.00190171e-02, 5.00122492e-02, 5.00109662e-02]])

time: 16.3 ms (started: 2021-09-21 22:00:43 +00:00)
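The raw `components_` values are not probabilities. According to the scikit-learn documentation they can be read as pseudo-counts of word-topic assignments, and dividing each row by its sum gives a per-topic word distribution. Values near 0.05 are consistent with the default topic-word prior of `1 / n_components` (1/20 here), that is, words with essentially no assigned mass in that topic. A sketch of the normalization on a made-up array that loosely mimics the output above:

```python
import numpy as np

# Made-up pseudo-counts: 3 topics x 4 words; 0.05 mimics the prior-only entries.
components = np.array([
    [0.05, 0.36, 0.05, 0.06],
    [0.05, 0.05, 0.05, 0.05],
    [0.05, 6.73, 99.75, 4.61],
])

# Normalize each topic (row) so its word weights sum to 1.
topic_word_dist = components / components.sum(axis=1, keepdims=True)
print(topic_word_dist.round(3))
print(topic_word_dist.sum(axis=1))
```

A row made up only of prior-level values, like the second one, normalizes to a uniform distribution, which is a hint that the topic has attracted almost no words.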


In [20]:
topic_key_term_idxs = np.argsort(-np.absolute(topic_terms), axis=1)[:, :top_terms]
topic_keyterms = vocabulary[topic_key_term_idxs]
topics = [', '.join(topic) for topic in topic_keyterms]
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
topics_df = pd.DataFrame(topics,
                         columns = ['Terms per Topic'],
                         index=['Topic'+str(t) for t in range(1, TOTAL_TOPICS+1)])
topics_df

| | Terms per Topic |
|---|---|
| Topic1 | neuron, circuit, signal, chip, current, analog, voltage, channel, noise, vlsi, pulse, implementation, synapse, fig, frequency, delay, gain, potential, synaptic, control |
| Topic2 | image, feature, structure, layer, state, neuron, local, distribution, cell, motion, recognition, object, node, net, matrix, gaussian, sequence, size, line, hidden |
| Topic3 | sound, auditory, template, frequency, motor, acoustic, syllable, song, production, harmonic, nucleus, spectrogram, representation, temporal, phase, hearing, bird, template matching, khz, control |
| Topic4 | cell, neuron, visual, response, stimulus, activity, field, spike, motion, direction, cortex, orientation, map, spatial, eye, synaptic, firing, cortical, fig, rate |
| Topic5 | image, feature, recognition, layer, hidden, task, representation, object, trained, test, net, classification, classifier, architecture, class, level, rule, hidden unit, experiment, speech |
| Topic6 | state, dynamic, equation, matrix, rule, gradient, recurrent, solution, signal, hidden, fixed, source, component, attractor, field, energy, eq, phase, fixed point, transition |
| Topic7 | sequence, markov, chain, structure, markov model, prediction, hmms, protein, hidden markov, region, state, hidden, bengio, site, gene, class, length, receptor, human, mouse |
| Topic8 | word, context, speech, language, phoneme, similarity, letter, item, state, probability, node, phone, list, vocabulary, acoustic, activation, connectionist, duration, proximity, short |
| Topic9 | activation, behavior, winner, take, active, winner take, competitive, connection, role, binding, activity, wta, self, distributed, level, activation function, body, sensor, competition, food |
| Topic10 | state, cell, distribution, neuron, probability, control, response, layer, rate, signal, task, architecture, random, test, hidden, image, change, fig, generalization, field |


time: 65.4 ms (started: 2021-09-21 22:00:43 +00:00)


In [21]:
pd.options.display.float_format = '{:,.3f}'.format
dt_df = pd.DataFrame(document_topics, 
                     columns=['T'+str(i) for i in range(1, TOTAL_TOPICS+1)])
dt_df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,1700,1701,1702,1703,1704,1705,1706,1707,1708,1709,1710,1711,1712,1713,1714,1715,1716,1717,1718,1719,1720,1721,1722,1723,1724,1725,1726,1727,1728,1729,1730,1731,1732,1733,1734,1735,1736,1737,1738,1739
T1,0.111,0.036,0.336,0.099,0.009,0.0,0.0,0.431,0.203,0.508,0.045,0.39,0.0,0.077,0.576,0.095,0.877,0.0,0.027,0.19,0.165,0.207,0.0,0.0,0.0,0.081,0.001,0.0,0.217,0.685,0.357,0.067,0.027,0.064,0.049,0.0,0.096,0.0,0.051,0.62,...,0.023,0.039,0.101,0.127,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013,0.0,0.168,0.0,0.455,0.0,0.0,0.165,0.007,0.048,0.046,0.0,0.039,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.191,0.0,0.0,0.0,0.0,0.0,0.0
T2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
T3,0.0,0.0,0.0,0.0,0.001,0.0,0.0,0.0,0.002,0.047,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002,0.001,0.0,0.0,0.0,0.0,0.045,0.0,0.0,0.0,0.001,0.0,0.0,0.0,0.0,0.0,0.019,0.0,...,0.0,0.008,0.014,0.047,0.0,0.0,0.0,0.0,0.002,0.0,0.0,0.0,0.0,0.009,0.0,0.0,0.0,0.0,0.002,0.0,0.0,0.0,0.0,0.0,0.006,0.0,0.0,0.0,0.0,0.0,0.001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
T4,0.015,0.0,0.032,0.019,0.0,0.903,0.0,0.537,0.336,0.117,0.901,0.0,0.0,0.0,0.392,0.193,0.058,0.0,0.021,0.687,0.211,0.751,0.35,0.005,0.0,0.165,0.011,0.008,0.5,0.0,0.0,0.615,0.012,0.03,0.0,0.704,0.079,0.548,0.0,0.0,...,0.0,0.0,0.029,0.378,0.104,0.395,0.681,0.131,0.0,0.185,0.0,0.508,0.0,0.098,0.0,0.004,0.0,0.0,0.0,0.125,0.0,0.038,0.361,0.0,0.044,0.0,0.0,0.0,0.0,0.0,0.0,0.135,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0
T5,0.0,0.068,0.215,0.029,0.478,0.066,0.321,0.0,0.162,0.188,0.015,0.158,0.0,0.86,0.0,0.0,0.021,0.391,0.046,0.016,0.122,0.0,0.153,0.356,0.116,0.183,0.56,0.361,0.0,0.187,0.0,0.105,0.013,0.002,0.417,0.058,0.065,0.035,0.283,0.176,...,0.0,0.08,0.414,0.349,0.625,0.083,0.164,0.107,0.515,0.137,0.815,0.215,0.122,0.729,0.0,0.015,0.0,0.174,0.398,0.084,0.356,0.295,0.099,0.336,0.436,0.203,0.478,0.0,0.0,0.0,0.0,0.0,0.0,0.128,0.034,0.0,0.08,0.0,0.0,0.0
T6,0.058,0.614,0.284,0.227,0.121,0.021,0.567,0.0,0.0,0.0,0.0,0.051,0.221,0.0,0.0,0.089,0.0,0.065,0.436,0.011,0.086,0.0,0.377,0.513,0.372,0.01,0.112,0.576,0.0,0.0,0.411,0.0,0.244,0.289,0.277,0.116,0.397,0.159,0.357,0.0,...,0.454,0.137,0.0,0.0,0.0,0.291,0.0,0.113,0.0,0.0,0.0,0.0,0.279,0.105,0.0,0.08,0.107,0.0,0.0,0.02,0.0,0.061,0.261,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012,0.03,0.003,0.0,0.04,0.014,0.0,0.146
T7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017,0.011,0.0,0.02,0.012,0.0,0.0,0.018,0.0,...,0.0,0.029,0.0,0.0,0.014,0.0,0.0,0.0,0.016,0.009,0.007,0.0,0.024,0.0,0.0,0.008,0.018,0.0,0.0,0.133,0.0,0.0,0.0,0.007,0.015,0.0,0.058,0.011,0.0,0.0,0.0,0.006,0.0,0.002,0.0,0.0,0.016,0.0,0.005,0.0
T8,0.0,0.0,0.028,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017,0.0,0.036,0.0,0.006,0.005,0.01,0.0,0.0,0.0,0.013,0.02,0.0,0.0,0.009,0.0,0.0,0.0,0.0,0.0,0.0,0.008,0.003,0.0,0.0,0.0,0.0,0.016,0.013,...,0.0,0.032,0.053,0.0,0.0,0.003,0.0,0.0,0.0,0.0,0.003,0.0,0.0,0.0,0.0,0.0,0.0,0.043,0.0,0.0,0.0,0.0,0.0,0.005,0.067,0.011,0.002,0.004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
T9,0.0,0.022,0.009,0.0,0.0,0.0,0.0,0.028,0.0,0.0,0.0,0.057,0.002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.034,0.0,0.047,0.0,0.06,0.0,0.009,0.008,0.071,0.0,0.0,0.106,0.01,0.0,0.0,0.0,0.0,0.0,0.048,0.0,...,0.0,0.0,0.0,0.0,0.019,0.0,0.0,0.0,0.0,0.0,0.011,0.0,0.0,0.0,0.0,0.0,0.0,0.002,0.0,0.0,0.007,0.0,0.0,0.006,0.0,0.0,0.031,0.0,0.0,0.0,0.0,0.0,0.0,0.008,0.0,0.0,0.0,0.0,0.0,0.0
T10,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


time: 151 ms (started: 2021-09-21 22:00:43 +00:00)


In [22]:
pd.options.display.float_format = '{:,.5f}'.format
pd.set_option('display.max_colwidth', 200)

# For each topic (column), find its largest share and the first paper that attains it.
max_contrib_topics = dt_df.max(axis=0)
dominant_topics = max_contrib_topics.index
contrib_perc = max_contrib_topics.values
document_numbers = [dt_df[dt_df[t] == max_contrib_topics.loc[t]].index[0]
                       for t in dominant_topics]
documents = [papers[i] for i in document_numbers]

results_df = pd.DataFrame({'Dominant Topic': dominant_topics, 'Contribution %': contrib_perc,
                          'Paper Num': document_numbers, 'Topic': topics_df['Terms per Topic'], 
                          'Paper Name': documents})
results_df

Unnamed: 0,Dominant Topic,Contribution %,Paper Num,Topic,Paper Name
Topic1,T1,0.99727,1076,"neuron, circuit, signal, chip, current, analog, voltage, channel, noise, vlsi, pulse, implementation, synapse, fig, frequency, delay, gain, potential, synaptic, control","Improved Silicon Cochlea \nusing \nCompatible Lateral Bipolar Transistors \nAndr6 van Schaik, Eric Fragnire, Eric Vittoz \nMANTRA Center for Neuromimetic Systems \nSwiss Federal Institute of Tech..."
Topic2,T2,0.00033,181,"image, feature, structure, layer, state, neuron, local, distribution, cell, motion, recognition, object, node, net, matrix, gaussian, sequence, size, line, hidden",794 \nNEURAL ARCHITECTURE \nValentino Braitenberg \nMax Planck Institute \nFederal Republic of Germany \nABSTRACT\nWhile we are waiting for the ultimate biophysics of cell membranes and synapses \...
Topic3,T3,0.82975,182,"sound, auditory, template, frequency, motor, acoustic, syllable, song, production, harmonic, nucleus, spectrogram, representation, temporal, phase, hearing, bird, template matching, khz, control",795 \nSONG LEARNING IN BIRDS \nM. Konishi \nDivision of Biology \nCalifornia Institute of Technology \nABSTRACT\nBirds sing to communicate. Male birds use song to advertise their territories and \...
Topic4,T4,0.99947,1609,"cell, neuron, visual, response, stimulus, activity, field, spike, motion, direction, cortex, orientation, map, spatial, eye, synaptic, firing, cortical, fig, rate",Can V1 mechanisms account for \nfigure-ground and medial axis effects? \nZhaoping Li \nGatsby Computational Neuroscience Unit \nUniversity College London \nzhaopinggat shy. ucl. ac. uk \nAbstract...
Topic5,T5,0.99949,213,"image, feature, recognition, layer, hidden, task, representation, object, trained, test, net, classification, classifier, architecture, class, level, rule, hidden unit, experiment, speech","266 Zemel, Mozer and Hinton \nTRAFFIC: Recognizing Objects Using \nHierarchical Reference Frame Transformations \nRichard S. Zemel \nComputer Science Dept. \nUniversity of Toronto \nToronto, ONT M..."
Topic6,T6,0.9812,988,"state, dynamic, equation, matrix, rule, gradient, recurrent, solution, signal, hidden, fixed, source, component, attractor, field, energy, eq, phase, fixed point, transition","Harmony Networks Do Not Work \nRen5 Gourley \nSchool of Computing Science \nSimon Fraser University \nBurnaby, B.C., V5A 1S6, Canada \ngourley@mprgate.mpr.ca \nAbstract \nHarmony networks have be..."
Topic7,T7,0.99956,271,"sequence, markov, chain, structure, markov model, prediction, hmms, protein, hidden markov, region, state, hidden, bengio, site, gene, class, length, receptor, human, mouse","A Neural Network to Detect \nHomologies in Proteins \nYoshua Bengio \nSchool of Computer Science \nMcGill University \nMontreal, Canada H3A 2A7 \nSamy Bengio \nDepartement d'Informatique \nUnivers..."
Topic8,T8,0.96324,850,"word, context, speech, language, phoneme, similarity, letter, item, state, probability, node, phone, list, vocabulary, acoustic, activation, connectionist, duration, proximity, short","A solvable connectionist model of \nimmediate recall of ordered lists \nNell Burgess \nDepartment of Anatomy, University College London \nLondon WCiE 6BT, England \n(e-mail: n .burgessucl. ac. uk..."
Topic9,T9,0.99938,466,"activation, behavior, winner, take, active, winner take, competitive, connection, role, binding, activity, wta, self, distributed, level, activation function, body, sensor, competition, food","Dynamically-Adaptive Winner-Take-All Networks \nTret E. Lange \nComputer Scienos Department \nUniversity of California, Los Angeles, CA 90024 \nAbstract \nWinner-Take-All (WTA) networks, in which..."
Topic10,T10,0.00033,181,"state, cell, distribution, neuron, probability, control, response, layer, rate, signal, task, architecture, random, test, hidden, image, change, fig, generalization, field",794 \nNEURAL ARCHITECTURE \nValentino Braitenberg \nMax Planck Institute \nFederal Republic of Germany \nABSTRACT\nWhile we are waiting for the ultimate biophysics of cell membranes and synapses \...


time: 99 ms (started: 2021-09-21 22:00:43 +00:00)


## Topic Models with Non-Negative Matrix Factorization (NMF)

### Build model

In [23]:
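# Note: `alpha` was deprecated in scikit-learn 1.0 and removed in 1.2;
# newer releases use `alpha_W` / `alpha_H` instead.
nmf_model = NMF(n_components=TOTAL_TOPICS, solver='cd', max_iter=500,
                random_state=42, alpha=.1, l1_ratio=.85)
# Rows are documents, columns are topic weights.
document_topics = nmf_model.fit_transform(cv_features)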

time: 53.5 s (started: 2021-09-21 22:00:43 +00:00)
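
To make the shapes behind `fit_transform` concrete, here is a minimal sketch on a made-up 4 x 5 count matrix. The matrix values, the `toy_nmf` name, and the `init='nndsvda'` choice are assumptions for illustration only; they are not the settings used on the NIPS corpus above.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term count matrix: 4 documents x 5 terms (illustrative values only).
V = np.array([
    [3, 2, 0, 0, 1],
    [4, 1, 0, 1, 0],
    [0, 0, 5, 3, 1],
    [0, 1, 4, 4, 0],
], dtype=float)

toy_nmf = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=42)
W = toy_nmf.fit_transform(V)   # document-topic weights, shape (4, 2)
H = toy_nmf.components_        # topic-term weights, shape (2, 5)

print(W.shape, H.shape)
print(np.round(W @ H, 2))      # approximate, non-negative reconstruction of V
```

Both factors are non-negative, which is why each row of `H` can be read as a list of weighted terms for one topic.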


In [24]:
pd.options.display.float_format = '{:,.3f}'.format
dt_df = pd.DataFrame(document_topics, 
                     columns=['T'+str(i) for i in range(1, TOTAL_TOPICS+1)])
dt_df.head(10)

Unnamed: 0,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,T16,T17,T18,T19,T20
0,0.338,0.983,0.043,0.0,0.0,0.0,0.0,0.0,0.0,0.136,0.0,0.085,0.094,0.023,0.0,0.193,0.0,0.044,0.0,0.0
1,0.393,0.601,0.464,0.015,0.186,0.021,0.0,0.224,0.126,0.025,0.0,0.247,0.0,0.0,0.0,0.0,0.0,0.105,0.12,0.204
2,0.132,1.526,0.0,0.048,0.0,0.0,0.328,0.63,0.248,0.187,0.0,0.967,0.105,0.0,0.078,0.0,0.0,0.0,0.046,0.151
3,0.732,0.259,0.332,0.0,0.453,0.0,0.0,0.0,0.0,1.598,0.0,0.28,0.225,0.0,0.072,0.0,0.0,0.0,0.153,0.944
4,0.115,0.139,0.0,0.0,0.607,0.013,0.013,0.018,0.033,0.0,0.0,0.075,0.809,0.013,0.059,0.273,0.009,0.028,0.0,0.0
5,0.189,0.0,0.0,0.028,0.037,1.175,0.018,0.0,0.0,0.0,0.099,0.086,0.009,0.788,0.143,0.0,1.92,0.0,0.386,0.0
6,0.369,0.097,0.188,0.0,0.713,0.0,0.0,0.065,0.044,0.038,0.05,0.703,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.119
7,0.0,0.962,0.176,0.0,0.0,1.063,0.0,0.0,0.023,0.959,3.054,0.0,0.072,0.0,0.0,0.0,0.0,0.054,0.046,0.025
8,0.028,1.297,0.0,0.0,0.0,0.0,0.0,0.251,0.0,0.0,3.223,0.116,0.0,0.479,0.578,0.137,0.0,0.06,0.0,0.0
9,0.134,0.827,0.008,0.047,0.032,0.023,0.0,1.633,0.092,0.505,0.567,0.162,0.074,0.111,0.359,0.0,0.0,0.146,0.017,0.062


time: 67 ms (started: 2021-09-21 22:01:37 +00:00)


In [25]:
pd.options.display.float_format = '{:,.5f}'.format
pd.set_option('display.max_colwidth', 200)

# For each topic, find the highest weight and the first paper that reaches it.
max_score_topics = dt_df.max(axis=0)
dominant_topics = max_score_topics.index
term_score = max_score_topics.values
document_numbers = [dt_df[dt_df[t] == max_score_topics.loc[t]].index[0]
                    for t in dominant_topics]
documents = [papers[i] for i in document_numbers]

# NOTE: topics_df is only rebuilt from nmf_model in the next section, so at this
# point the 'Topic' column still shows the terms of the previous model.
results_df = pd.DataFrame({'Dominant Topic': dominant_topics, 'Max Score': term_score,
                          'Paper Num': document_numbers, 'Topic': topics_df['Terms per Topic'], 
                          'Paper Name': documents})
results_df

Unnamed: 0,Dominant Topic,Max Score,Paper Num,Topic,Paper Name
Topic1,T1,1.6661,1130,"neuron, circuit, signal, chip, current, analog, voltage, channel, noise, vlsi, pulse, implementation, synapse, fig, frequency, delay, gain, potential, synaptic, control","A Bound on the Error of Cross Validation Using \nthe Approximation and Estimation Rates, with \nConsequences for the Training-Test Split \nMichael Kearns \nAT&T Research \nABSTRACT\n1 INTRODUCTION..."
Topic2,T2,3.56605,1613,"image, feature, structure, layer, state, neuron, local, distribution, cell, motion, recognition, object, node, net, matrix, gaussian, sequence, size, line, hidden","Predictive Sequence Learning in Recurrent \nNeocortical Circuits* \nR. P. N. Rao \nComputational Neurobiology Lab and \nSloan Center for Theoretical Neurobiology \nThe Salk Institute, La Jolla, CA..."
Topic3,T3,5.85117,1275,"sound, auditory, template, frequency, motor, acoustic, syllable, song, production, harmonic, nucleus, spectrogram, representation, temporal, phase, hearing, bird, template matching, khz, control","Reinforcement Learning for Mixed \nOpen-loop and Closed-loop Control \nEric A. Hansen, Andrew G. Barto, and Shlomo Zilbersteln \nDepartment of Computer Science \nUniversity of Massachusetts \nAmhe..."
Topic4,T4,3.9397,1713,"cell, neuron, visual, response, stimulus, activity, field, spike, motion, direction, cortex, orientation, map, spatial, eye, synaptic, firing, cortical, fig, rate",Image representations for facial expression \ncoding \nMarian Stewart Bartlett* \nU.C. San Diego \nmarnisalk. edu \nJavier R. Movellan \nU.C. San Diego \nmovellancogsc. ucsd. edu \nPaul Ekman \n...
Topic5,T5,3.0168,38,"image, feature, recognition, layer, hidden, task, representation, object, trained, test, net, classification, classifier, architecture, class, level, rule, hidden unit, experiment, speech","5O5 \nCONNECTING TO THE PAST \nBruce A. MacDonald, Assistant Professor \nKnowledge Sciences Laboratory, Computer Science Department \nThe University of Calgary, 2500 University Drive NW \nCalgary,..."
Topic6,T6,7.57212,73,"state, dynamic, equation, matrix, rule, gradient, recurrent, solution, signal, hidden, fixed, source, component, attractor, field, energy, eq, phase, fixed point, transition","317 \nPARTITIONING OF SENSORY DATA BY A COPTICAI, NETWOPK  \nRichard Granger, Jos Ambros-Ingerson, Howard Henry, Gary Lynch \nCenter for the Neurobiology of Learning and Memory \nUniversity of..."
Topic7,T7,4.9445,1301,"sequence, markov, chain, structure, markov model, prediction, hmms, protein, hidden markov, region, state, hidden, bengio, site, gene, class, length, receptor, human, mouse","Comparison of Human and Machine Word \nRecognition \nM. Schenkel \nDept of Electrical Eng. \nUniversity of Sydney \nSydney, NSW 2006, Australia \nschenkel@sedal.usyd.edu.au \nC. Latimer \nDept of ..."
Topic8,T8,3.61488,209,"word, context, speech, language, phoneme, similarity, letter, item, state, probability, node, phone, list, vocabulary, acoustic, activation, connectionist, duration, proximity, short","232 Sejnowski, Yuhas, Goldstein and Jenkins \nCombining Visual and \nwith a Neural Network \nAcoustic Speech Signals \nImproves Intelligibility \nT.J. Sejnowski \nThe Salk Institute \nand \nDepart..."
Topic9,T9,4.83202,970,"activation, behavior, winner, take, active, winner take, competitive, connection, role, binding, activity, wta, self, distributed, level, activation function, body, sensor, competition, food","An Integrated Architecture of Adaptive Neural Network \nControl for Dynamic Systems \nLiu Ke '2 Robert L. Tokaf Brian D.McVey z \nCenter for Nonlinear Studies, 2Applied Theoretical Physics Divis..."
Topic10,T10,2.96757,1716,"state, cell, distribution, neuron, probability, control, response, layer, rate, signal, task, architecture, random, test, hidden, image, change, fig, generalization, field","Kirchoff Law Markov Fields for Analog \nCircuit Design \nRichard M. Golden * \nRMG Consulting Inc. \n2000 Fresno Road, Plano, Texas 75074 \nRMG CONS UL T@A OL. COM, \nwww. neural-network. corn \nA..."


time: 56.3 ms (started: 2021-09-21 22:01:37 +00:00)
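
The lookup of the best-scoring paper per topic can also be sketched with `DataFrame.idxmax`, which returns the first row label that reaches each column maximum. The `toy_dt` frame below is hypothetical data, not the `dt_df` computed above.

```python
import pandas as pd

# Hypothetical document-topic weights for 3 documents and 3 topics.
toy_dt = pd.DataFrame({'T1': [0.1, 0.9, 0.3],
                       'T2': [0.7, 0.2, 0.4],
                       'T3': [0.0, 0.1, 0.5]})

best_score = toy_dt.max(axis=0)     # highest weight per topic
best_doc = toy_dt.idxmax(axis=0)    # first document reaching that weight

summary = pd.DataFrame({'Max Score': best_score, 'Paper Num': best_doc})
print(summary)
```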


### View the major topics/themes

In [26]:
topic_terms = nmf_model.components_
topic_key_term_idxs = np.argsort(-np.absolute(topic_terms), axis=1)[:, :top_terms]
topic_keyterms = vocabulary[topic_key_term_idxs]
topics = [', '.join(topic) for topic in topic_keyterms]
pd.set_option('display.max_colwidth', None)  # None (not -1) disables truncation in pandas >= 1.0
topics_df = pd.DataFrame(topics,
                         columns = ['Terms per Topic'],
                         index=['Topic'+str(t) for t in range(1, TOTAL_TOPICS+1)])
topics_df


Unnamed: 0,Terms per Topic
Topic1,"bound, generalization, size, let, optimal, solution, equation, theorem, approximation, gradient, class, xi, loss, rate, matrix, convergence, theory, dimension, sample, minimum"
Topic2,"neuron, synaptic, connection, potential, dynamic, activity, synapsis, excitatory, layer, simulation, synapse, inhibitory, delay, biological, equation, state, et, et al, fig, activation"
Topic3,"state, action, policy, step, optimal, reinforcement, transition, reinforcement learning, probability, reward, dynamic, value function, markov, machine, task, agent, finite, iteration, sequence, decision"
Topic4,"image, face, pixel, recognition, local, distance, scale, digit, texture, filter, scene, vision, facial, pca, edge, transformation, representation, visual, surface, database"
Topic5,"hidden, layer, net, hidden unit, task, hidden layer, architecture, back, trained, propagation, connection, back propagation, activation, representation, generalization, output unit, neural net, training set, test, learn"
Topic6,"cell, firing, head, direction, response, rat, layer, cortex, activity, spatial, synaptic, inhibitory, synapsis, simulation, cue, property, complex, active, lot, cortical"
Topic7,"word, recognition, speech, context, hmm, speaker, speech recognition, character, phoneme, probability, frame, sequence, rate, level, test, acoustic, experiment, letter, segmentation, state"
Topic8,"signal, noise, source, filter, component, frequency, channel, speech, matrix, independent, separation, sound, ica, phase, eeg, blind, auditory, dynamic, delay, fig"
Topic9,"control, controller, trajectory, motor, dynamic, movement, task, forward, feedback, arm, inverse, position, robot, architecture, hand, force, adaptive, change, command, plant"
Topic10,"circuit, chip, current, analog, voltage, vlsi, gate, threshold, transistor, pulse, design, implementation, synapse, bit, digital, device, analog vlsi, pp, cmos, element"


time: 40.8 ms (started: 2021-09-21 22:01:37 +00:00)


## Predicting Topics for New Research Papers
Even though topic models are unsupervised, we can estimate the likely topics of new documents based on what the model learned from the so-called "training" corpus.
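
As a minimal sketch of that idea, the example below fits a vectorizer and an NMF model once on a tiny made-up corpus and then only transforms an unseen document. The documents and the names `toy_cv` and `toy_model` are assumptions for illustration.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

train_docs = [
    "neuron spike synapse cortex",
    "synapse neuron firing cortex",
    "policy reward agent action",
    "agent action reward state",
]
new_docs = ["reward policy for a learning agent"]

toy_cv = CountVectorizer()
train_features = toy_cv.fit_transform(train_docs)   # learn the vocabulary once

toy_model = NMF(n_components=2, init='nndsvda', max_iter=500, random_state=42)
toy_model.fit(train_features)                        # learn the topics once

# New documents are only transformed, never used for fitting.
new_features = toy_cv.transform(new_docs)
new_weights = toy_model.transform(new_features)      # shape (1, 2)
print(new_weights.round(3))
```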

### Load some testing samples
To test our model, let's load a few papers from the NIPS 2016 conference proceedings.

In [27]:
TEST_DATA_PATH = 'datasets/test_data/'
file_names = os.listdir(TEST_DATA_PATH)
new_papers = []
for file_name in file_names:
  # Read-only access is enough; undecodable bytes are skipped.
  with open(os.path.join(TEST_DATA_PATH, file_name),
            encoding='utf-8', errors='ignore', mode='r') as f:
    new_papers.append(f.read())
len(new_papers)

4

time: 877 ms (started: 2021-09-21 22:01:37 +00:00)


### Build a text wrangling and feature engineering pipeline
These steps must be the same ones we followed when training our topic model, so that the new documents are mapped onto the same vocabulary.


In [28]:
norm_new_papers = normalize_corpus(new_papers)
cv_new_features = cv.transform(norm_new_papers)
cv_new_features.shape


(4, 14408)

time: 188 ms (started: 2021-09-21 22:01:38 +00:00)
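
The feature count stays fixed because a fitted `CountVectorizer` reuses its learned vocabulary and ignores unseen words. A small sketch with made-up sentences (the `toy_cv` name is hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_cv = CountVectorizer()
toy_cv.fit(["markov chain model", "hidden markov model"])

# "transformer" was never seen during fitting, so it gets no column.
row = toy_cv.transform(["markov transformer model"])

print(sorted(toy_cv.vocabulary_))   # ['chain', 'hidden', 'markov', 'model']
print(row.toarray())                # [[0 0 1 1]]
```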


### Extract the top n topics from a paper

In [29]:
topic_predictions = nmf_model.transform(cv_new_features)
# Keep the two highest-weighted topics per paper as (zero-based topic index, weight).
best_topics = [[(topic, round(sc, 3)) for topic, sc in 
                sorted(enumerate(topic_predictions[i]), key=lambda row: -row[1])[:2]] 
               for i in range(len(topic_predictions))]
best_topics

[[(0, 1.13), (15, 0.836)],
 [(2, 4.135), (0, 0.872)],
 [(3, 2.146), (1, 1.345)],
 [(3, 3.061), (6, 2.216)]]

time: 41.8 ms (started: 2021-09-21 22:01:38 +00:00)
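
The same top-n selection can be sketched in vectorized form with `np.argsort`. The `scores` array and `top_n` below are hypothetical values, not the predictions shown above.

```python
import numpy as np

# Hypothetical topic weights for 2 documents over 4 topics.
scores = np.array([[0.1, 1.2, 0.0, 0.8],
                   [2.0, 0.0, 0.5, 0.1]])
top_n = 2

top_idx = np.argsort(-scores, axis=1)[:, :top_n]        # topic indices, best first
top_val = np.take_along_axis(scores, top_idx, axis=1)   # matching weights

print(top_idx.tolist())   # [[1, 3], [0, 2]]
print(top_val.tolist())   # [[1.2, 0.8], [2.0, 0.5]]
```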


We keep the top two topics for each research paper because a document is often a mixture of several topics. Let's view the results for each paper in an easy-to-read format.

In [30]:
# Start with one row per paper (1-based paper numbers).
results_df = pd.DataFrame()
results_df['Papers'] = range(1, len(new_papers)+1)
# Convert zero-based topic indices to 1-based topic numbers.
results_df['Dominant Topics'] = [[topic_num+1 for topic_num, sc in item] for item in best_topics]
# Expand each list of topics into one row per (paper, topic) pair.
res = results_df.set_index(['Papers'])['Dominant Topics'].apply(pd.Series).stack().reset_index(level=1, drop=True)
results_df = pd.DataFrame({'Dominant Topics': res.values}, index=res.index)
# NMF weights are unnormalized, so the scaled scores are not percentages and can exceed 100.
results_df['Topic Score'] = [topic_sc for topic_list in [[round(sc*100, 2) for topic_num, sc in item] 
                                                         for item in best_topics] for topic_sc in topic_list]

results_df['Topic Desc'] = [topics_df.iloc[t-1]['Terms per Topic'] for t in results_df['Dominant Topics'].values]
results_df['Paper Desc'] = [new_papers[i-1][:200] for i in results_df.index.values]

results_df

Papers,Dominant Topics,Topic Score,Topic Desc,Paper Desc
1,1,113.0,"bound, generalization, size, let, optimal, solution, equation, theorem, approximation, gradient, class, xi, loss, rate, matrix, convergence, theory, dimension, sample, minimum","Cooperative Graphical Models\nJosip Djolonga\nDept. of Computer Science, ETH Zurich ¨\njosipd@inf.ethz.ch\nStefanie Jegelka\nCSAIL, MIT\nstefje@mit.edu\nSebastian Tschiatschek\nDept. of Computer Science, ETH"
1,16,83.6,"distribution, probability, gaussian, mixture, variable, density, likelihood, prior, bayesian, component, posterior, em, log, estimate, sample, approximation, estimation, matrix, conditional, maximum","Cooperative Graphical Models\nJosip Djolonga\nDept. of Computer Science, ETH Zurich ¨\njosipd@inf.ethz.ch\nStefanie Jegelka\nCSAIL, MIT\nstefje@mit.edu\nSebastian Tschiatschek\nDept. of Computer Science, ETH"
2,3,413.5,"state, action, policy, step, optimal, reinforcement, transition, reinforcement learning, probability, reward, dynamic, value function, markov, machine, task, agent, finite, iteration, sequence, decision","PAC Reinforcement Learning with Rich Observations\nAkshay Krishnamurthy\nUniversity of Massachusetts, Amherst\nAmherst, MA, 01003\nakshay@cs.umass.edu\nAlekh Agarwal\nMicrosoft Research\nNew York, NY 10011\na"
2,1,87.2,"bound, generalization, size, let, optimal, solution, equation, theorem, approximation, gradient, class, xi, loss, rate, matrix, convergence, theory, dimension, sample, minimum","PAC Reinforcement Learning with Rich Observations\nAkshay Krishnamurthy\nUniversity of Massachusetts, Amherst\nAmherst, MA, 01003\nakshay@cs.umass.edu\nAlekh Agarwal\nMicrosoft Research\nNew York, NY 10011\na"
3,4,214.6,"image, face, pixel, recognition, local, distance, scale, digit, texture, filter, scene, vision, facial, pca, edge, transformation, representation, visual, surface, database","Automated scalable segmentation of neurons from\nmultispectral images\nUygar Sümbül\nGrossman Center for the Statistics of Mind\nand Dept. of Statistics, Columbia University\nDouglas Roossien Jr.\nUniversit"
3,2,134.5,"neuron, synaptic, connection, potential, dynamic, activity, synapsis, excitatory, layer, simulation, synapse, inhibitory, delay, biological, equation, state, et, et al, fig, activation","Automated scalable segmentation of neurons from\nmultispectral images\nUygar Sümbül\nGrossman Center for the Statistics of Mind\nand Dept. of Statistics, Columbia University\nDouglas Roossien Jr.\nUniversit"
4,4,306.1,"image, face, pixel, recognition, local, distance, scale, digit, texture, filter, scene, vision, facial, pca, edge, transformation, representation, visual, surface, database","Unsupervised Learning of Spoken Language with\nVisual Context\nDavid Harwath, Antonio Torralba, and James R. Glass\nComputer Science and Artificial Intelligence Laboratory\nMassachusetts Institute of Tech"
4,7,221.6,"word, recognition, speech, context, hmm, speaker, speech recognition, character, phoneme, probability, frame, sequence, rate, level, test, acoustic, experiment, letter, segmentation, state","Unsupervised Learning of Spoken Language with\nVisual Context\nDavid Harwath, Antonio Torralba, and James R. Glass\nComputer Science and Artificial Intelligence Laboratory\nMassachusetts Institute of Tech"


time: 63 ms (started: 2021-09-21 22:01:38 +00:00)
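
Because NMF weights are not probabilities, scores such as 413.5 are only meaningful relative to each other. One possible way to make them comparable across papers, shown here as an assumption rather than as part of the original workflow, is to divide each row by its sum. The `weights` array is hypothetical.

```python
import numpy as np

# Hypothetical NMF weights for 2 documents over 3 topics.
weights = np.array([[1.0, 3.0, 0.0],
                    [2.0, 2.0, 4.0]])

row_sums = weights.sum(axis=1, keepdims=True)
# Guard against all-zero rows (documents with no topic weight at all).
shares = np.divide(weights, row_sums, out=np.zeros_like(weights), where=row_sums > 0)

print((shares * 100).round(1).tolist())   # [[25.0, 75.0, 0.0], [25.0, 25.0, 50.0]]
```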


## Save the model

In [31]:
with open('nmf_model.pkl', 'wb') as f:
    dill.dump(nmf_model, f)
with open('cv_features.pkl', 'wb') as f:
    dill.dump(cv_features, f)
with open('cv.pkl', 'wb') as f:
    dill.dump(cv, f)

time: 8.75 s (started: 2021-09-21 22:01:38 +00:00)


## Load the model

In [32]:
with open('nmf_model.pkl', 'rb') as f:
    nmf_model = dill.load(f)
with open('cv_features.pkl', 'rb') as f:
    cv_features = dill.load(f)
with open('cv.pkl', 'rb') as f:
    cv = dill.load(f)

time: 778 ms (started: 2021-09-21 22:01:47 +00:00)
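
A quick way to check that a saved object survives the round trip is to compare it with the restored copy. The sketch below uses the standard-library `pickle` module in place of `dill` (both expose `dump` and `load`) and a small hypothetical vectorizer named `toy_cv`.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

toy_cv = CountVectorizer().fit(["hidden markov model", "markov chain"])

# Write to a temporary directory so no files are left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_cv.pkl')
    with open(path, 'wb') as f:
        pickle.dump(toy_cv, f)
    with open(path, 'rb') as f:
        restored_cv = pickle.load(f)

# The restored vectorizer should map terms to the same columns.
print(restored_cv.vocabulary_ == toy_cv.vocabulary_)
```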


## Visualization

In [33]:
top_terms = 20
TOTAL_TOPICS = 20
# On scikit-learn >= 1.0, use cv.get_feature_names_out() (get_feature_names was removed in 1.2).
vocabulary = np.array(cv.get_feature_names())
topic_terms = nmf_model.components_
topic_key_term_idxs = np.argsort(-np.absolute(topic_terms), axis=1)[:, :top_terms]
topic_keyterms = vocabulary[topic_key_term_idxs]
topics = [', '.join(topic) for topic in topic_keyterms]
pd.set_option('display.max_colwidth', None)  # None (not -1) disables truncation in pandas >= 1.0
topics_df = pd.DataFrame(topics,
                         columns = ['Terms per Topic'],
                         index=['Topic'+str(t) for t in range(1, TOTAL_TOPICS+1)])
topics_df

Unnamed: 0,Terms per Topic
Topic1,"bound, generalization, size, let, optimal, solution, equation, theorem, approximation, gradient, class, xi, loss, rate, matrix, convergence, theory, dimension, sample, minimum"
Topic2,"neuron, synaptic, connection, potential, dynamic, activity, synapsis, excitatory, layer, simulation, synapse, inhibitory, delay, biological, equation, state, et, et al, fig, activation"
Topic3,"state, action, policy, step, optimal, reinforcement, transition, reinforcement learning, probability, reward, dynamic, value function, markov, machine, task, agent, finite, iteration, sequence, decision"
Topic4,"image, face, pixel, recognition, local, distance, scale, digit, texture, filter, scene, vision, facial, pca, edge, transformation, representation, visual, surface, database"
Topic5,"hidden, layer, net, hidden unit, task, hidden layer, architecture, back, trained, propagation, connection, back propagation, activation, representation, generalization, output unit, neural net, training set, test, learn"
Topic6,"cell, firing, head, direction, response, rat, layer, cortex, activity, spatial, synaptic, inhibitory, synapsis, simulation, cue, property, complex, active, lot, cortical"
Topic7,"word, recognition, speech, context, hmm, speaker, speech recognition, character, phoneme, probability, frame, sequence, rate, level, test, acoustic, experiment, letter, segmentation, state"
Topic8,"signal, noise, source, filter, component, frequency, channel, speech, matrix, independent, separation, sound, ica, phase, eeg, blind, auditory, dynamic, delay, fig"
Topic9,"control, controller, trajectory, motor, dynamic, movement, task, forward, feedback, arm, inverse, position, robot, architecture, hand, force, adaptive, change, command, plant"
Topic10,"circuit, chip, current, analog, voltage, vlsi, gate, threshold, transistor, pulse, design, implementation, synapse, bit, digital, device, analog vlsi, pp, cmos, element"


time: 58.1 ms (started: 2021-09-21 22:01:48 +00:00)


In [34]:
pyLDAvis.sklearn.prepare(nmf_model, cv_features, cv, mds='mmds')

time: 8.34 s (started: 2021-09-21 22:01:48 +00:00)


In [35]:
pyLDAvis.sklearn.prepare(nmf_model, cv_features, cv, mds='tsne')

time: 5.74 s (started: 2021-09-21 22:01:56 +00:00)


In [36]:
pyLDAvis.sklearn.prepare(nmf_model, cv_features, cv, mds='pcoa')

time: 6.06 s (started: 2021-09-21 22:02:02 +00:00)
