# Building Custom Discovery for Digitized Collections Using Computational Methods

## Learning Goals

As we move through the workshop, make note of the places in the process where expert judgment is needed to make the computational methods worthwhile.

## Packages we're using

In [1]:
# semantic modeling
import gensim
from gensim.parsing.preprocessing import STOPWORDS
from nltk.tokenize import word_tokenize

# visualization
import pyLDAvis
# Newer pyLDAvis releases renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim
from sklearn.manifold import TSNE
from bokeh.io import output_file, output_notebook, save, show
from bokeh.models import ColumnDataSource
from bokeh.palettes import viridis
from bokeh.plotting import figure


# data manipulation
import pandas as pd

# general utility
import glob
import itertools
import os
from tqdm import tqdm

## Exploring the corpus and determining approaches

Let's begin by just taking a look at some of the individual OCR files to get a sense of what they might be like. We could look at the items by way of the Libraries' website, but anytime I'm doing text analysis work, I like to see the text I'll be working with directly. 

In [2]:
text_dir = "texts"

In [3]:
fns = glob.glob(f"{text_dir}/*.txt")
print(len(fns))
fns[:5]

1006


['texts/mc00456-001-bx0004-043-001.txt',
 'texts/mc00456-001-bx0004-053-001.txt',
 'texts/mc00344-001-lb0001_26-002-000.txt',
 'texts/mc00456-001-bx0007-015-001.txt',
 'texts/mc00456-001-bx0007-005-001.txt']

In [4]:
with open(fns[0], 'r') as f:
    print(f.read())

THE MORAL ASPECT
VIVISECTION.

B Y

E. JANE VVHATELY.

IT is sometimes well for the instruction and encouraga
ment of those who give serious thought to the question
of Vivisection, to recall the words of persons eminent for
high qualities of intellect and of moral character, who
have passed judgment upon it. Miss E. Jane VVhately

daughter of Archbishop VVhately—was respected, trusted,

. . . ’-
and loved in no common degree by a large olrcle of friends

and acquaintances.

In the preface to a short memoir of her, by her sister,
published in 1893, there is the following tribute to her
worth from the pen of the well-known author of “The
Schijnberg—Gotta Family” : “If I were to fix on one quality
as especially characteristic of her, it would be truth—
truth of perception, which rested on entire truthfulness
of character. She was true to the core in mind and
heart. True, because she was clear-sighted, candid to
acknowledge difﬁculties in thought or memory, and
therefore tolerant to differ

What do you see in this text? Does the OCR look good? Are there parts of the text that you think shouldn't be included in a model that helps with discovering texts?

Try picking a different file and reading through it with the same questions.

Now that we know a bit about what types of texts we have, what goals would you have in providing discovery for this collection? What aspects of the documents would you want to focus on to expose to scholars?

One of the main advantages of using any sort of machine learning process is that we can show relationships between and features of the items in a collection that we had not otherwise known or shown. The types of features could vary greatly. Maybe we want to show relationships based on the content of the documents in some cases. Maybe we want to expose something in the metadata of the documents. We might want to do both. 

Here we're going to focus on the content, and specifically one type of model that allows us to make connections across the collection: topic modeling. 



### What is topic modeling?

According to [David Blei](http://www.cs.columbia.edu/~blei/topicmodeling.html), topic models are a "suite of algorithms that uncover the hidden thematic structure in document collections." Topic models operate on the idea that for any given document collection, or corpus, there is a finite number of themes, or topics, from which the corpus draws and each document is composed of words that are associated with some number of those topics. While we don't necessarily think of an author simply dipping into buckets (topics) of words and putting them together to create a document, it's turned out to be a useful model for understanding collections of documents according to the themes that cut across the collection.

There are quite a few types of topic models, but we'll focus on one of the most common forms: latent Dirichlet allocation (LDA). LDA topic modeling is a form of unsupervised machine learning: we provide an unlabeled corpus of texts to the algorithm, which then produces the model, though we usually also provide the number of topics that the algorithm should use. While there are processes for determining the "correct" number of topics, many consider this part of topic modeling a bit of an art that is determined as much by the research questions of the person running the model as it is by the corpus and model. Other types of topic models address particular features of corpora, such as change over time and author bias.

A topic model gives us a number of data objects. We'll have a list of topics, which are distributions over terms, though we could think of topics somewhat simply as sets of regularly co-occurring terms. We'll also have a representation of each document in the corpus as a vector denoting the composition of the document according to the topics; that is, we'll have an account of how much of each document is associated with each topic.
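
To make those two data objects concrete, here is a toy sketch in plain Python. Everything in it is made up for illustration (the two topics, their term probabilities, the document ids, and the helper name `word_probability`); a real model like the one we build below has many more topics and a far larger vocabulary.

```python
# Toy illustration only: two made-up topics over a small vocabulary.
# Each topic is a probability distribution over terms (values sum to 1).
topics = {
    0: {"horse": 0.5, "street": 0.3, "cruelty": 0.2},
    1: {"experiments": 0.6, "medical": 0.3, "cruelty": 0.1},
}

# Each document is a mixture of topics (proportions sum to 1).
doc_topics = {"doc-a": {0: 0.9, 1: 0.1}, "doc-b": {0: 0.2, 1: 0.8}}


def word_probability(word, mixture, topics):
    """Probability of drawing `word` from a document with this topic mixture."""
    return sum(
        weight * topics[topic_id].get(word, 0.0)
        for topic_id, weight in mixture.items()
    )


for doc_id, mixture in doc_topics.items():
    print(doc_id, round(word_probability("cruelty", mixture, topics), 3))
```

The same word can belong to several topics, and how likely it is to show up in a document depends on how much of each topic that document contains.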

Key resource: [Probabilistic Topic Models](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf) by David Blei. 

### Other models or approaches that could be useful for discovery

- Keyword extraction
- Automated summarization
- Entity extraction, including geospatial data
- Various clustering algorithms

## Modeling the corpus

Now that we know what type of model we'll use, let's jump into building the model itself.

### Reading in and cleaning the documents

How we're going to read in and clean our texts is somewhat particular to `gensim`, the library we're using for our topic model. You could absolutely approach this part of the process in different ways, but we'll stick with an approach recommended by the author of `gensim` so that if you're looking for documentation and help later, it will be easier to find. For this part we'll hew closely to the code in the following tutorial.

[Radim Řehůřek's topic modeling tutorial](https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html)

Since corpora can be large, it's often a good idea to approach reading in data with streaming in mind. Rather than reading in all of our data at once and then processing it, we'll read in and process each item one at a time.

In [4]:
def head(stream, n=10):
    """Given a stream of data items, return just the first n as a list"""
    return list(itertools.islice(stream, n))
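
If the streaming idea is new to you, this small sketch shows why it matters. The `slow_stream` generator is a hypothetical stand-in for the text-reading function we write below, and the `processed` list only exists so that we can see how much work was actually done. The `head` helper is repeated so the snippet runs on its own.

```python
import itertools

processed = []  # records which items were actually processed


def slow_stream(items):
    """Hypothetical stand-in for a text stream: processes one item per request."""
    for item in items:
        processed.append(item)  # pretend this is expensive tokenizing
        yield item.upper()


def head(stream, n=10):
    """Same helper as above: take the first n items from a stream."""
    return list(itertools.islice(stream, n))


first_two = head(slow_stream(["a", "b", "c", "d", "e"]), n=2)
print(first_two)   # ['A', 'B']
print(processed)   # ['a', 'b']
```

Only the two requested items were ever processed, so peeking at the start of a large corpus stays cheap.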

In [5]:
# We currently have filenames that include the item id. We'll want to associate the processed texts with just item id, so we need to pull it out of the filename.
def get_item_id(fn):
    """Given a filename, return just the item id"""
    return os.path.split(fn)[1].split(".")[0]

Cleaning texts is often iterative, and how much you clean your corpus depends on the model you use. For topic modeling, I typically start without cleaning at all, get results, then piece by piece add in the minimum necessary cleaning to get sensible results. What we'll do in the function below is based on that minimal approach. 

Just due to how `gensim` builds corpora for processing, we need to break each text down into its component tokens, which in this case are just the individual words of the corpus. 

In [6]:
def tokenize(text):
    """Given a text, tokenize it while removing stopwords, non-alpha characters, and one letter words"""
    tokens = [token for token in word_tokenize(text) if token.lower() not in STOPWORDS]
    cleaned = [token for token in tokens if token.isalpha()]
    cleaned_greater_1 = [token for token in cleaned if len(token) > 1]
    return cleaned_greater_1
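
One cleaning step this function leaves out is lowercasing, which is why a term like "Vivisection" can later show up next to "vivisection" as a separate word. The sketch below is a hypothetical variant, not the function used in the rest of the workshop: `tokenize_lower` and `TOY_STOPWORDS` are made-up names, and it uses a regular expression instead of `word_tokenize`, so it splits hyphenated words rather than dropping them. Whether merging case is desirable depends on your collection, since capitalization can carry meaning for names.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the workshop code uses gensim's STOPWORDS.
TOY_STOPWORDS = {"the", "of", "is", "a", "and", "to"}


def tokenize_lower(text):
    """Hypothetical variant of tokenize: lowercase first, then apply similar filters."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return [w for w in words if w not in TOY_STOPWORDS and len(w) > 1]


sample = "THE MORAL ASPECT OF VIVISECTION. Vivisection is a moral question."
print(Counter(tokenize_lower(sample)).most_common(2))
```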

In [7]:
def text_stream(text_dir):
    """Given a directory of plain text files, return a stream of tuples with the item id from the filename and the cleaned, tokenized text"""
    for fn in glob.glob(f"{text_dir}/*.txt"):
        item_id = get_item_id(fn)
        with open(fn, 'r') as f:
            document = f.read()
            yield item_id, tokenize(document)

We've defined all the functions we need to read in our documents and process them. We'll use the `head` utility function we wrote above to look at the first file and see how well our processing worked.

In [12]:
head(text_stream(text_dir), 1)

[('mc00456-001-bx0004-043-001',
  ['MORAL',
   'ASPECT',
   'VIVISECTION',
   'JANE',
   'VVHATELY',
   'instruction',
   'encouraga',
   'ment',
   'thought',
   'question',
   'Vivisection',
   'recall',
   'words',
   'persons',
   'eminent',
   'high',
   'qualities',
   'intellect',
   'moral',
   'character',
   'passed',
   'judgment',
   'Miss',
   'Jane',
   'VVhately',
   'daughter',
   'Archbishop',
   'respected',
   'trusted',
   'loved',
   'common',
   'degree',
   'large',
   'olrcle',
   'friends',
   'acquaintances',
   'preface',
   'short',
   'memoir',
   'sister',
   'published',
   'following',
   'tribute',
   'worth',
   'pen',
   'author',
   'Got',
   'ta',
   'Family',
   'fix',
   'quality',
   'especially',
   'characteristic',
   'truth',
   'perception',
   'rested',
   'entire',
   'truthfulness',
   'character',
   'true',
   'core',
   'mind',
   'heart',
   'True',
   'candid',
   'acknowledge',
   'difﬁculties',
   'thought',
   'memory',
   'tolera

We could also just look at the first or last bunch of tokens for each text to get a sense of the processing.

In [8]:
for item_id, tokens in head(text_stream(text_dir), n=5):
    print(item_id, tokens[:10])

mc00456-001-bx0004-043-001 ['MORAL', 'ASPECT', 'VIVISECTION', 'JANE', 'VVHATELY', 'instruction', 'encouraga', 'ment', 'thought', 'question']
mc00456-001-bx0004-053-001 ['ecial', 'Repert', 'Emu', 'BM', 'OW', 'NATNNAL', 'ALTN', 'MEDHAL', 'CUMMWTEE', 'Repmft']
mc00344-001-lb0001_26-002-000 ['Sydney', 'Daily', 'Telegraph', 'August', 'Cattle', 'producers', 'want', 'meat', 'eXport', 'inquiry']
mc00456-001-bx0007-015-001 ['EDHWON', 'ABOMINABLE', 'SIN', 'Lord', 'Shaftesbury', 'VIVISECTION', 'APPEAL', 'Scientific', 'Ethical', 'Thinkers']
mc00456-001-bx0007-005-001 ['UNSOIENTIFIC', 'VIEW', 'VIVISECTION', 'LADY', 'PAGET', 'Reprinted', 'NATIONAL', 'REVIEW', 'September', 'years']


In [9]:
for item_id, tokens in head(text_stream(text_dir), n=5):
    print(item_id, tokens[-10:])

mc00456-001-bx0004-043-001 ['sought', 'things', 'added', 'Guardian', 'General', 'Printing', 'Works', 'Manchester', 'Reddish', 'London']
mc00456-001-bx0004-053-001 ['Relations', 'Janet', 'Loud', 'Price', 'net', 'lnlh', 'll', 'lo', 'Ctltt', 'iﬁ']
mc00344-001-lb0001_26-002-000 ['resulted', 'fewer', 'hijack', 'attempts', 'orderly', 'open', 'process', 'days', 'legislative', 'session']
mc00456-001-bx0007-015-001 ['sectarian', 'political', 'barriers', 'appeals', 'phase', 'thought', 'Write', 'Nixon', 'page', 'cover']
mc00456-001-bx0007-005-001 ['Cause', 'annum', 'post', 'free', 'PEWIRESS', 'PIIEte', 'LiRle', 'Queen', 'Street', 'High']


In order to build the model, we now actually need to break apart the pieces of data that we put together in our text_stream function: the item id and the processed text. 

In [10]:
# You could extract the item_ids from the full text_stream, but in order to not
# tokenize everything when we don't yet need to we'll pull them directly from the filenames
item_ids = [get_item_id(fn) for fn in fns]
head(item_ids)

['mc00456-001-bx0004-043-001',
 'mc00456-001-bx0004-053-001',
 'mc00344-001-lb0001_26-002-000',
 'mc00456-001-bx0007-015-001',
 'mc00456-001-bx0007-005-001',
 'mc00344-001-bx0001_35-003-000',
 'mc00344-001-bx0001_38-004-000',
 'mc00456-001-bx0001-020-001',
 'aspca-scrapbooks-bx0001-002-001_0_20191213_759',
 'mc00344-001-bx0001_5-001-000']

In [11]:
# This is a generator expression.
# What we get back is a generator that, as we iterate over it, yields the list of tokens for one text at a time.
doc_stream = (tokens for _, tokens in text_stream(text_dir))

### Building the model 

At this point, we have preprocessed texts that exist as lists of tokens, and a correspondingly ordered list of item ids. From here we'll build the different pieces of our model. 

First, we build a dictionary for the corpus, that is, a collection of the unique tokens (words) from the whole collection of documents. 
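
As a rough mental model of what the dictionary holds, here is a plain-Python sketch over three made-up documents. It is only an illustration: the variable names are my own, and gensim may assign ids in a different order than this sketch does.

```python
from collections import Counter

# Three tiny made-up "documents", already tokenized.
docs = [
    ["horse", "street", "horse"],
    ["horse", "cruelty"],
    ["experiments", "cruelty"],
]

token2id = {}         # token -> integer id, assigned in order of first appearance
doc_freq = Counter()  # token -> number of documents containing it

for tokens in docs:
    for token in tokens:
        token2id.setdefault(token, len(token2id))
    doc_freq.update(set(tokens))  # count each token once per document

print(token2id)
print(doc_freq["horse"], doc_freq["experiments"])
```

The document counts are what make it possible to filter rare or very common words later.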

This will take a few minutes, so this is a great time for any questions you might have. 

In [12]:
%time id2word_items = gensim.corpora.Dictionary(doc_stream)

CPU times: user 3min 6s, sys: 1.13 s, total: 3min 7s
Wall time: 3min 9s


In [13]:
print(id2word_items)

Dictionary(561820 unique tokens: ['ASPECT', 'Archbishop', 'Asiatic', 'Brain', 'Close']...)


Looking at those first few tokens, we can see we definitely have some words or abbreviations we would want to remove from a production model. 

In [14]:
# Filter words based on occurrence in docs
# https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes
# With no_below=2, the filtered dictionary only contains words that appear in at least two documents.
# Because we don't override them, the defaults for the other arguments also apply: no_above=0.5 drops
# words that appear in more than half of the documents, and keep_n=100000 keeps only the 100,000 most
# frequent of the remaining words. Sometimes this is good for a model, and sometimes not. It all depends
# on what you want your model to do.
id2word_items.filter_extremes(no_below=2)

In [15]:
print(id2word_items)

Dictionary(100000 unique tokens: ['ASPECT', 'Archbishop', 'Asiatic', 'Brain', 'Close']...)


The corpus dictionary is now substantially smaller. It makes sense to filter out words that appear in only one document because we want to understand themes or topics that exist in the collection as a whole. 
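
To make the filtering step concrete, here is a rough pure-Python sketch of document-frequency filtering. It is not gensim's actual code; the function name `filter_by_doc_freq` and the toy counts are made up, and the default argument values mirror the defaults documented for `filter_extremes` (`no_below=5`, `no_above=0.5`, `keep_n=100000`). Those defaults matter here: because the call above only sets `no_below`, the other two still apply, which is presumably why the filtered dictionary ends up at exactly 100,000 tokens.

```python
def filter_by_doc_freq(doc_freq, num_docs, no_below=5, no_above=0.5, keep_n=100000):
    """Rough sketch of document-frequency filtering; not gensim's actual code."""
    max_docs = no_above * num_docs
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Of the survivors, keep only the keep_n tokens found in the most documents.
    kept.sort(key=lambda t: doc_freq[t], reverse=True)
    return kept[:keep_n]


# Made-up document frequencies for a ten-document corpus.
doc_freq = {"the": 10, "horse": 4, "cruelty": 3, "olrcle": 1, "vvhately": 1}
print(filter_by_doc_freq(doc_freq, num_docs=10, no_below=2))
```

In this toy run, the very common word and the two one-off OCR errors are dropped, and only the mid-frequency words survive.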

In [16]:
# We're building this for use in the LDA model and so we can save it to disk for re-use
class ItemCorpus(object):
    def __init__(self, text_dir, dictionary):
        self.text_dir = text_dir
        self.dictionary = dictionary
        
    def __iter__(self):
        self.item_ids = []
        for item_id, tokens in text_stream(self.text_dir):
            self.item_ids.append(item_id)
            yield self.dictionary.doc2bow(tokens)

In [17]:
item_corpus = ItemCorpus(text_dir, id2word_items)
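
Each item the corpus yields is a bag-of-words: a list of (token id, count) pairs that records how often each dictionary word occurs in a document and ignores word order. The sketch below imitates that conversion in plain Python; `to_bow` and the toy dictionary are made-up stand-ins for the dictionary's `doc2bow` method, not its real implementation.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Sketch of a bag-of-words conversion: sorted (token_id, count) pairs.

    Tokens missing from token2id (for example, ones filtered out of the
    dictionary) are silently skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


token2id = {"horse": 0, "street": 1, "cruelty": 2}  # toy dictionary
print(to_bow(["horse", "street", "horse", "olrcle"], token2id))  # [(0, 2), (1, 1)]
```

Words that were filtered out of the dictionary simply vanish at this stage, which is how the dictionary filtering carries through to the model.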

In [34]:
# Save serialized corpus for later use
# mm here is the Matrix Market format that Gensim prefers, though there are a few different formats that Gensim can work with.
%time gensim.corpora.MmCorpus.serialize("animal_turn_bow.mm", item_corpus)

CPU times: user 3min 5s, sys: 919 ms, total: 3min 6s
Wall time: 3min 6s


Once we've saved our corpus to disk, we can easily reload it. Especially if we are working iteratively on our model, it's a great idea to save the corpora in multiple states of pre-processing in case we ever need to go back to a previous version.

In [18]:
loaded_corpus = gensim.corpora.MmCorpus("animal_turn_bow.mm")
print(loaded_corpus)

MmCorpus(1006 documents, 100000 features, 2656977 non-zero entries)


In [36]:
# DON'T run this cell during the workshop; it takes a bit too long for the live workshop.
# We'll build an LDA topic model. LdaMulticore allows us to use multiple CPU cores to build the model. 
%time lda_model = gensim.models.LdaMulticore(loaded_corpus, num_topics=40, id2word=id2word_items, passes=50, workers=4)

CPU times: user 1h 12min 19s, sys: 8min 45s, total: 1h 21min 5s
Wall time: 21min 48s


In [37]:
lda_model.save('animalturn_40.model')

### Exploring the model

Since the model takes 20ish minutes to train, we're going to just load the model that I've already trained rather than train a new one here. 

In [19]:
lda_model = gensim.models.LdaModel.load("animalturn_40.model")

In [21]:
lda_model.print_topics(-1)

[(0,
  '0.011*"ASPCA" + 0.009*"York" + 0.004*"pet" + 0.004*"Hospital" + 0.003*"City" + 0.003*"Shelter" + 0.003*"cat" + 0.003*"Manager" + 0.003*"Manhattan" + 0.003*"cats"'),
 (1,
  '0.035*"horse" + 0.017*"horses" + 0.010*"killer" + 0.007*"April" + 0.006*"tail" + 0.005*"Executive" + 0.005*"dollars" + 0.004*"pigs" + 0.004*"pavement" + 0.003*"Report"'),
 (2,
  '0.009*"experiments" + 0.005*"vivisection" + 0.005*"medical" + 0.004*"disease" + 0.004*"Vivisection" + 0.003*"scientiﬁc" + 0.003*"science" + 0.003*"knowledge" + 0.003*"Sir" + 0.003*"moral"'),
 (3,
  '0.012*"CASH" + 0.010*"SUNDRIES" + 0.010*"WW" + 0.009*"TOTALS" + 0.008*"House" + 0.007*"Ledger" + 0.007*"EXPENSE" + 0.006*"Ambulance" + 0.006*"se" + 0.006*"WWW"'),
 (4,
  '0.020*"rabbits" + 0.013*"trap" + 0.010*"traps" + 0.008*"rabbit" + 0.006*"trapping" + 0.006*"methods" + 0.006*"ULAWS" + 0.005*"ANIMAL" + 0.005*"method" + 0.005*"steel"'),
 (5,
  '0.009*"York" + 0.008*"JOHN" + 0.007*"City" + 0.006*"John" + 0.006*"organized" + 0.006*"State

That's not the easiest way to read through the topics, and we'll keep exploring different views on the topics. 

Take a minute, though, to read through the lists of words above. Topic models don't give you any sort of title for a topic or tell you how its words cohere. This is the role of the interpreter or expert. Pick out a couple of topics, and think about how you might name or describe each one.

Let's now look at some better ways to view the topics. 

In [22]:
def explore_topic(lda_model, topic_number, topn, output=True):
    """
    accepts an ldamodel, a topic number and topn terms of interest
    prints a formatted list of the topn terms
    """
    terms = []
    for term, frequency in lda_model.show_topic(topic_number, topn=topn):
        terms += [term]
        if output:
            print(u'{:20} {:.3f}'.format(term, round(frequency, 3)))
    return terms

In [29]:
topic_summaries = []
print(u'{:20} {}'.format(u'term', u'frequency') + u'\n')
for i in range(40):
    print("Topic " + str(i) + " |---------------------\n")
    tmp = explore_topic(lda_model, topic_number=i, topn=10, output=True)
    topic_summaries += [tmp[:5]]
    print()

term                 frequency

Topic 0 |---------------------

ASPCA                0.011
York                 0.009
pet                  0.004
Hospital             0.004
City                 0.003
Shelter              0.003
cat                  0.003
Manager              0.003
Manhattan            0.003
cats                 0.003

Topic 1 |---------------------

horse                0.035
horses               0.017
killer               0.010
April                0.007
tail                 0.006
Executive            0.005
dollars              0.005
pigs                 0.004
pavement             0.004
Report               0.003

Topic 2 |---------------------

experiments          0.009
vivisection          0.005
medical              0.005
disease              0.004
Vivisection          0.004
scientiﬁc            0.003
science              0.003
knowledge            0.003
Sir                  0.003
moral                0.003

Topic 3 |---------------------

CASH                 0.012


In [30]:
topic_summaries

[['ASPCA', 'York', 'pet', 'Hospital', 'City'],
 ['horse', 'horses', 'killer', 'April', 'tail'],
 ['experiments', 'vivisection', 'medical', 'disease', 'Vivisection'],
 ['CASH', 'SUNDRIES', 'WW', 'TOTALS', 'House'],
 ['rabbits', 'trap', 'traps', 'rabbit', 'trapping'],
 ['York', 'JOHN', 'City', 'John', 'organized'],
 ['Miss', 'Humane', 'horse', 'John', 'Cruelty'],
 ['ll', 'horse', 'street', 'society', 'Cruelty'],
 ['species', 'trade', 'wildlife', 'research', 'Wildlife'],
 ['York', 'slaughter', 'American', 'Humane', 'school'],
 ['person', 'cruelly', 'misdemeanor', 'shelter', 'custody'],
 ['Veterinary', 'AVMA', 'University', 'Medicine', 'American'],
 ['experiments', 'Act', 'experiment', 'anaesthetics', 'performed'],
 ['information', 'und', 'die', 'Washington', 'der'],
 ['person', 'section', 'State', 'court', 'owner'],
 ['factor', 'diet', 'milk', 'oil', 'scurvy'],
 ['Zoo', 'research', 'pet', 'species', 'zoo'],
 ['Rev', 'Mass', 'na', 'kwa', 'ya'],
 ['exhibit', 'Science', 'AAAS', 'space', 'boo

This is quite a bit better as a way to view the most significant terms for each topic. Take another look at these topics. Do any stand out to you? Does it seem like setting 40 topics was the right choice, or does it seem like the topics are too general or too granular?

Let's look at one other way of visualizing our model before we shift to thinking about visualizing the corpus. We'll take advantage of the great `pyLDAvis` library, which is a Python version of a previous R library. 

In [31]:
pyLDAvis.enable_notebook()

In [33]:
# This will likely take a few minutes to load up, so feel free to ask any questions you might have been holding on to! 
pyLDAvis.gensim.prepare(lda_model, loaded_corpus, id2word_items)

This is a great way to explore the relations between the topics, and to understand some of the thematic shape of the corpus. We can get a sense of how closely the topics relate to each other, and whether there are strands of themes that are significantly different than others or less frequent on the whole. This visualization may also just uncover particular words that surprise you and cause you to delve back into the collection, and into specific documents to understand what's happening. 

In some ways, this visualization is already a visualization of the corpus. We can think about any given collection as being presented in many ways. The original images or videos or texts are one way. A presentation of the human-created metadata in tabular format would be a different way. This type of model and its visualization might be another way. We could think about designing discovery in ways that show these different layers of presentation, a sort of thick, rich description of the collection as a phenomenon.

But, what we haven't done yet is use this model as a way to visualize the individual items in the collection as a collection. We'll turn there now. 

## Visualizing the corpus

Let's start just with looking at the topic distribution for a single document. 

In [42]:
lda_model.get_document_topics(loaded_corpus[0])

[(2, 0.78964967), (18, 0.011434091), (25, 0.15175089), (35, 0.036497198)]

We can see that for the first document in our corpus, the model estimates that roughly 79% of the document is associated with topic 2, based on the words in the document. It assigns roughly 15% to topic 25, 4% to topic 35, and 1% to topic 18. If we need a reminder of the most significant terms for topic 2, we can check them.

N.B. the topic numbers in pyLDAvis can be different from the topic numbers in the model (unless they've fixed that).
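
One likely reason for the mismatch, as far as I understand pyLDAvis's default behavior: it orders topics by their overall share of the corpus and numbers them starting from 1, whereas gensim numbers topics from 0 in the order the model stores them. The sketch below reproduces that relabeling on made-up prevalence values. Treat the ordering rule as an assumption to check against the pyLDAvis documentation for your installed version; its `prepare` function has a `sort_topics` argument that, if I recall correctly, can switch the reordering off.

```python
# Made-up share of corpus tokens for four gensim topics (ids start at 0).
prevalence = {0: 0.10, 1: 0.45, 2: 0.30, 3: 0.15}

# Assumed pyLDAvis-style labels: 1-based rank by descending prevalence.
ranked = sorted(prevalence, key=prevalence.get, reverse=True)
gensim_to_vis = {topic_id: rank for rank, topic_id in enumerate(ranked, start=1)}

print(gensim_to_vis)
```

A lookup like this makes it easier to move between a bubble in the visualization and the matching topic in the model.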

In [44]:
lda_model.show_topic(2)

[('experiments', 0.008631112),
 ('vivisection', 0.005415695),
 ('medical', 0.005117255),
 ('disease', 0.0040457817),
 ('Vivisection', 0.003984823),
 ('scientiﬁc', 0.002987461),
 ('science', 0.0029053374),
 ('knowledge', 0.0028985091),
 ('Sir', 0.0028815728),
 ('moral', 0.002684408)]
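
The two outputs above are easier to read side by side. This sketch copies the numbers for the first document and the top terms for topic 2 from the outputs in this notebook, then prints the document's topics from most to least prominent. Terms for the other topics are left as a placeholder rather than guessed; with the model loaded you would look them up with `show_topic`.

```python
# Values copied from the outputs above.
doc_topics = [(2, 0.78964967), (18, 0.011434091), (25, 0.15175089), (35, 0.036497198)]
top_terms = {2: ["experiments", "vivisection", "medical", "disease", "Vivisection"]}

# Rank the document's topics from most to least prominent.
for topic_id, share in sorted(doc_topics, key=lambda pair: pair[1], reverse=True):
    terms = ", ".join(top_terms.get(topic_id, ["(terms not looked up)"]))
    print(f"topic {topic_id:>2}: {share:6.1%}  {terms}")
```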

We could take the numbers from `get_document_topics` and generate a visualization for each document that shows its significant topics according to the model, even putting in nice tooltips or click events to show the top words for each topic.

We're going to hold off on that, though, and stay with the idea of visualizing the whole collection. To do that, we'll build up a tabular dataset where every row is a document in the collection, and every column a topic proportion. The value at any given cell of the data will be the proportion of that document's words that are associated with that topic. A row would represent the vector of that document in the vector space, or feature space, of the collection model.

We'll use `pandas` to build this data.

In [46]:
# Extract the list of item ids as a pandas Series
source_id = pd.Series(item_ids)

In [47]:
# Above, `get_document_topics` only showed significant topics, but we can set a minimum probability to get numbers for all topics
lda_model.get_document_topics(loaded_corpus[0], minimum_probability=0)

[(0, 8.876168e-05),
 (1, 8.876168e-05),
 (2, 0.78964025),
 (3, 8.876168e-05),
 (4, 8.876168e-05),
 (5, 8.876168e-05),
 (6, 8.876168e-05),
 (7, 8.876168e-05),
 (8, 8.876168e-05),
 (9, 8.876168e-05),
 (10, 8.876168e-05),
 (11, 8.876168e-05),
 (12, 8.876168e-05),
 (13, 8.876168e-05),
 (14, 8.876168e-05),
 (15, 8.876168e-05),
 (16, 8.876168e-05),
 (17, 8.876168e-05),
 (18, 0.011431555),
 (19, 8.876168e-05),
 (20, 8.876168e-05),
 (21, 8.876168e-05),
 (22, 8.876168e-05),
 (23, 8.876168e-05),
 (24, 8.876168e-05),
 (25, 0.15173051),
 (26, 8.876168e-05),
 (27, 8.876168e-05),
 (28, 8.876168e-05),
 (29, 8.876168e-05),
 (30, 8.876168e-05),
 (31, 8.876168e-05),
 (32, 8.876168e-05),
 (33, 8.876168e-05),
 (34, 8.876168e-05),
 (35, 0.03652632),
 (36, 8.876168e-05),
 (37, 8.876168e-05),
 (38, 0.007564716),
 (39, 8.876168e-05)]

In [48]:
# Create headers for the DataFrame
headers = ["source_id"]
for i in range(40):
    headers.append(f"topic-{i}")

In [49]:
# Check that our last topic header is correctly numbered
headers[-1]

'topic-39'

In [55]:
# Set up the DataFrame
df = pd.DataFrame(columns=headers)
df

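| | source_id | topic-0 | topic-1 | topic-2 | topic-3 | topic-4 | topic-5 | topic-6 | topic-7 | topic-8 | ... | topic-30 | topic-31 | topic-32 | topic-33 | topic-34 | topic-35 | topic-36 | topic-37 | topic-38 | topic-39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|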


In [56]:
# Generally, building pandas DataFrames row by row is not a best practice, but it makes sense here given the gensim function that gives us the topic distribution for a document
for i in range(len(item_ids)):
    item_id = item_ids[i]
    new_row = [item_id]
    for _, prob in lda_model.get_document_topics(loaded_corpus[i], minimum_probability=0):
        new_row.append(prob)
    df.loc[item_id] = new_row
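
If the row-by-row approach becomes slow on a larger collection, one alternative is to collect every row in an ordinary list first and create the table in a single step. The sketch below shows the idea with the standard library only: `dense_row`, the item ids, and the probabilities are all made up, and an in-memory buffer stands in for a CSV file. The same list of rows could instead be handed to pandas in one call, for example `pd.DataFrame(rows, columns=headers)`.

```python
import csv
import io

NUM_TOPICS = 4  # toy value; the workshop model has 40


def dense_row(item_id, sparse_topics, num_topics=NUM_TOPICS):
    """Expand sparse (topic_id, prob) pairs into one fixed-width row."""
    probs = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        probs[topic_id] = prob
    return [item_id] + probs


# Made-up sparse outputs for two hypothetical items.
sparse = {"item-a": [(2, 0.8), (3, 0.2)], "item-b": [(0, 1.0)]}

rows = [dense_row(item_id, pairs) for item_id, pairs in sparse.items()]

buffer = io.StringIO()  # stands in for a file on disk
writer = csv.writer(buffer)
writer.writerow(["source_id"] + [f"topic-{i}" for i in range(NUM_TOPICS)])
writer.writerows(rows)
print(buffer.getvalue())
```

Filling in zeros for missing topics also means the rows stay the same width even when the model only reports a document's significant topics.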

In [57]:
df.head()

| | source_id | topic-0 | topic-1 | topic-2 | topic-3 | topic-4 | topic-5 | topic-6 | topic-7 | topic-8 | ... | topic-30 | topic-31 | topic-32 | topic-33 | topic-34 | topic-35 | topic-36 | topic-37 | topic-38 | topic-39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mc00456-001-bx0004-043-001 | mc00456-001-bx0004-043-001 | 8.876167e-05 | 8.876167e-05 | 0.789641 | 8.876167e-05 | 8.9e-05 | 8.876167e-05 | 8.876167e-05 | 8.9e-05 | 8.9e-05 | ... | 8.876167e-05 | 8.9e-05 | 8.876167e-05 | 8.9e-05 | 8.9e-05 | 0.036521 | 8.876167e-05 | 8.876167e-05 | 0.007565 | 8.9e-05 |
| mc00456-001-bx0004-053-001 | mc00456-001-bx0004-053-001 | 1.363475e-06 | 1.363475e-06 | 1e-06 | 1.363475e-06 | 1e-06 | 1.363475e-06 | 1.363475e-06 | 1e-06 | 1e-06 | ... | 1.363475e-06 | 1e-06 | 1.363475e-06 | 1e-06 | 1e-06 | 1e-06 | 1.363475e-06 | 1.363475e-06 | 1e-06 | 1e-06 |
| mc00344-001-lb0001_26-002-000 | mc00344-001-lb0001_26-002-000 | 6.37459e-07 | 6.37459e-07 | 0.013009 | 6.37459e-07 | 0.001292 | 6.37459e-07 | 6.37459e-07 | 0.012734 | 0.505212 | ... | 6.37459e-07 | 0.00081 | 6.37459e-07 | 0.006812 | 0.059132 | 0.024353 | 6.37459e-07 | 6.37459e-07 | 0.002983 | 0.015213 |
| mc00456-001-bx0007-015-001 | mc00456-001-bx0007-015-001 | 2.434307e-05 | 2.434307e-05 | 0.974051 | 2.434307e-05 | 2.4e-05 | 2.434307e-05 | 0.02001165 | 2.4e-05 | 2.4e-05 | ... | 2.434307e-05 | 2.4e-05 | 2.434307e-05 | 2.4e-05 | 2.4e-05 | 2.4e-05 | 2.434307e-05 | 2.434307e-05 | 0.002644 | 2.4e-05 |
| mc00456-001-bx0007-005-001 | mc00456-001-bx0007-005-001 | 1.979066e-05 | 1.979066e-05 | 0.734422 | 1.979066e-05 | 2e-05 | 1.979066e-05 | 1.979066e-05 | 0.011703 | 2e-05 | ... | 0.04707528 | 2e-05 | 1.979066e-05 | 2e-05 | 2e-05 | 2e-05 | 1.979066e-05 | 1.979066e-05 | 2e-05 | 2e-05 |


In [58]:
# Save our data to a csv file 
df.to_csv("doc_topic_probs_model_40.csv")

## Critical Reflection?

IDEAS:
- top documents for each topic?
- Add top words for each top topic to the t-sne vis
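
For the first idea in the list, a possible starting point is to rank the items by their share of a given topic. The sketch below does that on made-up proportions for three hypothetical items; `top_documents` is a name I've invented, and in practice the proportions would come from the document-topic table saved earlier.

```python
# Made-up document-topic proportions for three hypothetical items.
doc_topic = {
    "item-a": [0.05, 0.90, 0.05],
    "item-b": [0.60, 0.30, 0.10],
    "item-c": [0.10, 0.70, 0.20],
}


def top_documents(doc_topic, topic_id, n=2):
    """Return the n item ids with the largest share of the given topic."""
    ranked = sorted(doc_topic, key=lambda item: doc_topic[item][topic_id], reverse=True)
    return ranked[:n]


print(top_documents(doc_topic, topic_id=1))  # ['item-a', 'item-c']
```

Reading the top few documents for a topic is also a quick way to check whether the name you gave that topic holds up.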