# Building Custom Discovery for Digitized Collections Using Computational Methods

## Learning Goals

As we move through the workshop, note the places in the process where expert judgment is needed to make the computational methods worthwhile.

## Packages we're using

In [1]:
# semantic modeling
import gensim
from gensim.parsing.preprocessing import STOPWORDS
from nltk.tokenize import word_tokenize

# visualization
import pyLDAvis
import pyLDAvis.gensim
from sklearn.manifold import TSNE
from bokeh.io import output_file, output_notebook, save, show
from bokeh.models import ColumnDataSource
from bokeh.palettes import viridis
from bokeh.plotting import figure


# data manipulation
import pandas as pd

# general utility
import glob
import itertools
import os
from tqdm import tqdm

## Exploring the corpus and determining approaches

Let's begin by looking at some of the individual OCR files to get a sense of what they are like. We could look at the items on the Libraries' website, but whenever I'm doing text analysis work, I like to see the text I'll be working with directly.

In [2]:
text_dir = "texts"

In [5]:
fns = glob.glob(f"{text_dir}/*.txt")
print(len(fns))
fns[:5]

1006


['texts/mc00456-001-bx0004-043-001.txt',
 'texts/mc00456-001-bx0004-053-001.txt',
 'texts/mc00344-001-lb0001_26-002-000.txt',
 'texts/mc00456-001-bx0007-015-001.txt',
 'texts/mc00456-001-bx0007-005-001.txt']

In [4]:
with open(fns[0], 'r') as f:
    print(f.read())

THE MORAL ASPECT
VIVISECTION.

B Y

E. JANE VVHATELY.

IT is sometimes well for the instruction and encouraga
ment of those who give serious thought to the question
of Vivisection, to recall the words of persons eminent for
high qualities of intellect and of moral character, who
have passed judgment upon it. Miss E. Jane VVhately

daughter of Archbishop VVhately—was respected, trusted,

. . . ’-
and loved in no common degree by a large olrcle of friends

and acquaintances.

In the preface to a short memoir of her, by her sister,
published in 1893, there is the following tribute to her
worth from the pen of the well-known author of “The
Schijnberg—Gotta Family” : “If I were to fix on one quality
as especially characteristic of her, it would be truth—
truth of perception, which rested on entire truthfulness
of character. She was true to the core in mind and
heart. True, because she was clear-sighted, candid to
acknowledge difﬁculties in thought or memory, and
therefore tolerant to differ

What do you see in this text? Does the OCR look good? Are there parts of the text that you think shouldn't be included in a model that helps with discovering texts?

Try picking a different file and reading through it with the same questions.

Now that we know a bit about what types of texts we have, what goals would you have in providing discovery for this collection? What aspects of the documents would you want to expose to scholars?

One of the main advantages of using any sort of machine learning process is that we can show relationships between the items in a collection, and features of those items, that we had not otherwise known or shown. The types of features could vary greatly. Maybe we want to show relationships based on the content of the documents in some cases. Maybe we want to expose something in the metadata of the documents. We might want to do both.

Here we're going to focus on the content, and specifically one type of model that allows us to make connections across the collection: topic modeling. 



### What is topic modeling?

According to [David Blei](http://www.cs.columbia.edu/~blei/topicmodeling.html), topic models are a "suite of algorithms that uncover the hidden thematic structure in document collections." Topic models operate on the idea that for any given document collection, or corpus, there is a finite number of themes, or topics, from which the corpus draws and each document is composed of words that are associated with some number of those topics. While we don't necessarily think of an author simply dipping into buckets (topics) of words and putting them together to create a document, it's turned out to be a useful model for understanding collections of documents according to the themes that cut across the collection.

There are quite a few types of topic models, but we'll focus on one of the most common forms: latent Dirichlet allocation (LDA). LDA topic modeling is a form of unsupervised machine learning: we provide an unlabeled corpus of texts to the algorithm, which then produces the model. We usually also provide the number of topics that the algorithm should use. While there are processes for determining the "correct" number of topics, many consider this part of topic modeling a bit of an art, determined as much by the research questions of the person running the model as by the corpus and the model. Other types of topic models address particular aspects and problems of certain corpora, such as change over time and author bias.

A topic model gives us a number of data objects. We'll have a list of topics, which are distributions over terms, though we could think of topics somewhat simply as sets of regularly co-occurring terms. We'll also have a representation of each document in the corpus as a vector denoting the composition of the document according to the topics. That is, we'll have an account of how much of each document is associated with each topic.
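
To make the "buckets of words" picture concrete, here is a minimal sketch of the generative story that LDA assumes, written with only the standard library. The two topics, their term weights, and the document's topic mixture are made-up values for illustration. They are not output from a model trained on this corpus.

```python
import random

# Hypothetical topics: each one is a probability distribution over terms.
topics = {
    "animal_welfare": {"vivisection": 0.4, "cruelty": 0.3, "animals": 0.3},
    "agriculture": {"cattle": 0.5, "meat": 0.3, "export": 0.2},
}

# Hypothetical document-topic vector: how much of one document
# is associated with each topic.
doc_topic_mix = {"animal_welfare": 0.8, "agriculture": 0.2}


def generate_document(topic_mix, topics, n_words, seed=0):
    """Pick a topic for each word slot, then pick a term from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        term_dist = topics[topic]
        term = rng.choices(list(term_dist), weights=list(term_dist.values()))[0]
        words.append(term)
    return words


document = generate_document(doc_topic_mix, topics, n_words=12)
print(document)
```

Fitting a topic model runs this story in reverse: given only the words in each document, the algorithm estimates the topic-term distributions and each document's topic mixture.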

Key resource: [Probabilistic Topic Models](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf) by David Blei. 

### Other models or approaches that could be useful for discovery

- Keyword extraction
- Automated summarization
- Entity extraction, including geospatial data
- Various clustering algorithms

## Modeling the corpus

Now that we know what type of model we'll use, let's jump into building the model itself.

### Reading in and cleaning the documents

How we're going to read in and clean our texts is somewhat particular to `gensim`, the library we're using for our topic model. You could absolutely approach this part of the process in different ways, but we'll stick with an approach recommended by the author of `gensim` so that if you're looking for documentation and help later, it will be easier to find. For this part we'll hew closely to the code in the following tutorial.

[Radim Řehůřek's topic modeling tutorial](https://radimrehurek.com/topic_modeling_tutorial/2%20-%20Topic%20Modeling.html)

Since corpora can be large, it's often a good idea to approach reading in data with streaming in mind. Rather than reading in all of our data at once and then processing it, we'll read in and process each item one at a time.

In [7]:
def head(stream, n=10):
    """Given a stream of data items, return just the first n as a list"""
    return list(itertools.islice(stream, n))
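
As a quick check that `head` really is lazy, the sketch below runs it against a stand-in stream. The `numbered_items` generator and the `produced` list are hypothetical helpers added only for this illustration.

```python
import itertools


def head(stream, n=10):
    """Given a stream of data items, return just the first n as a list"""
    return list(itertools.islice(stream, n))


produced = []  # records which items the generator actually produced


def numbered_items(limit):
    """Hypothetical stand-in for a stream of documents."""
    for i in range(limit):
        produced.append(i)
        yield f"item-{i}"


first_three = head(numbered_items(1_000_000), n=3)
print(first_three)
print(len(produced))
```

Only three items are ever produced, even though the stream could yield a million. That is what lets us peek at a large corpus without reading all of it.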

In [8]:
# We currently have filenames that include the item id. We'll want to associate the processed texts with just item id, so we need to pull it out of the filename.
def get_item_id(fn):
    """Given a filename, return just the item id"""
    return os.path.split(fn)[1].split(".")[0]
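
Here is a small usage sketch for `get_item_id`. The first filename comes from the listing above. The second, `item.v2.txt`, is a hypothetical name included to show that splitting on `"."` cuts at the first dot, which matters only if item ids ever contain dots.

```python
import os


def get_item_id(fn):
    """Given a filename, return just the item id"""
    return os.path.split(fn)[1].split(".")[0]


real_id = get_item_id("texts/mc00456-001-bx0004-043-001.txt")
print(real_id)

# Hypothetical filename with an extra dot: everything after the first dot is lost.
dotted_id = get_item_id("texts/item.v2.txt")
print(dotted_id)

# os.path.splitext removes only the final extension instead.
splitext_id = os.path.splitext(os.path.basename("texts/item.v2.txt"))[0]
print(splitext_id)
```

None of the filenames shown in this collection contain extra dots, so the original function is fine here.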

Cleaning texts is often iterative, and how much you clean your corpus depends on the model you use. For topic modeling, I typically start without cleaning at all, get results, then piece by piece add in the minimum necessary cleaning to get sensible results. What we'll do in the function below is based on that minimal approach. 

Because of how `gensim` builds corpora for processing, we need to break each text down into its component tokens, which in this case are just the individual words of the corpus.

In [9]:
def tokenize(text):
    """Given a text, tokenize it while removing stopwords, non-alpha characters, and one letter words"""
    tokens = [token for token in word_tokenize(text) if token.lower() not in STOPWORDS]
    cleaned = [token for token in tokens if token.isalpha()]
    cleaned_greater_1 = [token for token in cleaned if len(token) > 1]
    return cleaned_greater_1
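
To see what each filter in `tokenize` removes, the sketch below applies the same three steps one at a time. It is an approximation that runs without `nltk` or `gensim`: `rough_tokenize` is a regex stand-in for `word_tokenize`, and `STOPWORDS_SAMPLE` is a tiny hypothetical stopword list, so results on real text will differ in detail.

```python
import re

# Tiny hypothetical stopword list; gensim's STOPWORDS is much longer.
STOPWORDS_SAMPLE = {"because", "she", "was", "to", "in"}


def rough_tokenize(text):
    """Rough stand-in for word_tokenize: words (hyphens kept) and punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


# \ufb01 is the "fi" ligature that the OCR left in the sample text.
text = "B Y True, because she was clear-sighted, candid to acknowledge dif\ufb01culties in 1893."

tokens = rough_tokenize(text)
no_stopwords = [t for t in tokens if t.lower() not in STOPWORDS_SAMPLE]
alpha_only = [t for t in no_stopwords if t.isalpha()]
longer_than_one = [t for t in alpha_only if len(t) > 1]

print(no_stopwords)
print(alpha_only)
print(longer_than_one)
```

Note that the `isalpha` step drops hyphenated words such as `clear-sighted` entirely, along with punctuation and numbers, while the ligature in `difﬁculties` survives because it counts as a letter. Whether those outcomes are acceptable is one of the places where knowledge of the collection matters.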

In [11]:
def text_stream(text_dir):
    """Given a directory of plain text files, return a stream of tuples with the item id from the filename and the cleaned, tokenized text"""
    for fn in glob.glob(f"{text_dir}/*.txt"):
        item_id = get_item_id(fn)
        with open(fn, 'r') as f:
            document = f.read()
            yield(item_id, tokenize(document))

We've defined all the functions we need to read in our documents and process them. We'll use the `head` utility function we wrote above to look at the first file and see how well our processing worked.

In [12]:
head(text_stream(text_dir), 1)

[('mc00456-001-bx0004-043-001',
  ['MORAL',
   'ASPECT',
   'VIVISECTION',
   'JANE',
   'VVHATELY',
   'instruction',
   'encouraga',
   'ment',
   'thought',
   'question',
   'Vivisection',
   'recall',
   'words',
   'persons',
   'eminent',
   'high',
   'qualities',
   'intellect',
   'moral',
   'character',
   'passed',
   'judgment',
   'Miss',
   'Jane',
   'VVhately',
   'daughter',
   'Archbishop',
   'respected',
   'trusted',
   'loved',
   'common',
   'degree',
   'large',
   'olrcle',
   'friends',
   'acquaintances',
   'preface',
   'short',
   'memoir',
   'sister',
   'published',
   'following',
   'tribute',
   'worth',
   'pen',
   'author',
   'Got',
   'ta',
   'Family',
   'fix',
   'quality',
   'especially',
   'characteristic',
   'truth',
   'perception',
   'rested',
   'entire',
   'truthfulness',
   'character',
   'true',
   'core',
   'mind',
   'heart',
   'True',
   'candid',
   'acknowledge',
   'difﬁculties',
   'thought',
   'memory',
   'tolera

We could also just look at the first or last bunch of tokens for each text to get a sense of the processing.

In [14]:
for item_id, tokens in head(text_stream(text_dir), n=5):
    print(item_id, tokens[:10])

mc00456-001-bx0004-043-001 ['MORAL', 'ASPECT', 'VIVISECTION', 'JANE', 'VVHATELY', 'instruction', 'encouraga', 'ment', 'thought', 'question']
mc00456-001-bx0004-053-001 ['ecial', 'Repert', 'Emu', 'BM', 'OW', 'NATNNAL', 'ALTN', 'MEDHAL', 'CUMMWTEE', 'Repmft']
mc00344-001-lb0001_26-002-000 ['Sydney', 'Daily', 'Telegraph', 'August', 'Cattle', 'producers', 'want', 'meat', 'eXport', 'inquiry']
mc00456-001-bx0007-015-001 ['EDHWON', 'ABOMINABLE', 'SIN', 'Lord', 'Shaftesbury', 'VIVISECTION', 'APPEAL', 'Scientific', 'Ethical', 'Thinkers']
mc00456-001-bx0007-005-001 ['UNSOIENTIFIC', 'VIEW', 'VIVISECTION', 'LADY', 'PAGET', 'Reprinted', 'NATIONAL', 'REVIEW', 'September', 'years']


In [16]:
for item_id, tokens in head(text_stream(text_dir), n=5):
    print(item_id, tokens[-10:])

mc00456-001-bx0004-043-001 ['sought', 'things', 'added', 'Guardian', 'General', 'Printing', 'Works', 'Manchester', 'Reddish', 'London']
mc00456-001-bx0004-053-001 ['Relations', 'Janet', 'Loud', 'Price', 'net', 'lnlh', 'll', 'lo', 'Ctltt', 'iﬁ']
mc00344-001-lb0001_26-002-000 ['resulted', 'fewer', 'hijack', 'attempts', 'orderly', 'open', 'process', 'days', 'legislative', 'session']
mc00456-001-bx0007-015-001 ['sectarian', 'political', 'barriers', 'appeals', 'phase', 'thought', 'Write', 'Nixon', 'page', 'cover']
mc00456-001-bx0007-005-001 ['Cause', 'annum', 'post', 'free', 'PEWIRESS', 'PIIEte', 'LiRle', 'Queen', 'Street', 'High']


To build the model, we now need to separate the two pieces of data that we put together in our `text_stream` function: the item id and the processed text.

In [17]:
# You could extract the item_ids from the full text_stream, but in order to not
# tokenize everything when we don't yet need to we'll pull them directly from the filenames
item_ids = [get_item_id(fn) for fn in fns]
head(item_ids)

['mc00456-001-bx0004-043-001',
 'mc00456-001-bx0004-053-001',
 'mc00344-001-lb0001_26-002-000',
 'mc00456-001-bx0007-015-001',
 'mc00456-001-bx0007-005-001',
 'mc00344-001-bx0001_35-003-000',
 'mc00344-001-bx0001_38-004-000',
 'mc00456-001-bx0001-020-001',
 'aspca-scrapbooks-bx0001-002-001_0_20191213_759',
 'mc00344-001-bx0001_5-001-000']

In [None]:
# This is a generator expression: it yields each document's tokens lazily, one at a time.
doc_stream = (tokens for _, tokens in text_stream(text_dir))
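
One thing to keep in mind about a stream like `doc_stream` is that it can only be consumed once. The sketch below shows this with a hypothetical two-document stand-in for `text_stream`, so it runs without the corpus files.

```python
def text_stream_sample():
    """Hypothetical stand-in for text_stream: yields (item_id, tokens) pairs."""
    yield ("item-a", ["moral", "aspect"])
    yield ("item-b", ["cattle", "producers"])


doc_stream = (tokens for _, tokens in text_stream_sample())

first_pass = list(doc_stream)
second_pass = list(doc_stream)  # the generator is already exhausted

print(first_pass)
print(second_pass)
```

The second pass comes back empty. If a later step needs to go over the documents more than once, the stream has to be recreated, or wrapped in a class that defines `__iter__` so that each loop starts a fresh pass.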

### Building the model 

### Exploring the model

## Visualizing the corpus

## Critical Reflection?