# Building Custom Discovery for Digitized Collections Using Computational Methods

## Learning Goals

As we move through the workshop, note the points in the process where expert judgment is needed to make the computational methods worthwhile.

## Packages we're using

In [1]:
# semantic modeling
import gensim
from gensim.parsing.preprocessing import STOPWORDS
from nltk.tokenize import word_tokenize

# visualization
import pyLDAvis
import pyLDAvis.gensim  # pyLDAvis >= 3.3 renamed this module to pyLDAvis.gensim_models
from sklearn.manifold import TSNE
from bokeh.io import output_file, output_notebook, save, show
from bokeh.models import ColumnDataSource
from bokeh.palettes import viridis
from bokeh.plotting import figure


# data manipulation
import pandas as pd

# general utility
import glob
import itertools
import os
from tqdm import tqdm

## Exploring the corpus and determining approaches

Let's begin by looking at a few of the individual OCR files to get a sense of what they contain. We could view the items on the Libraries' website, but whenever I do text analysis, I like to look directly at the text I'll be working with.

In [2]:
text_dir = "texts"

In [3]:
fns = glob.glob(os.path.join(text_dir, "*.txt"))
fns[:5]

['texts/mc00456-001-bx0004-043-001.txt',
 'texts/mc00456-001-bx0004-053-001.txt',
 'texts/mc00344-001-lb0001_26-002-000.txt',
 'texts/mc00456-001-bx0007-015-001.txt',
 'texts/mc00456-001-bx0007-005-001.txt']

In [4]:
# read the first file; UTF-8 is an assumption about how the OCR text was saved
with open(fns[0], 'r', encoding='utf-8') as f:
    print(f.read())

THE MORAL ASPECT
VIVISECTION.

B Y

E. JANE VVHATELY.

IT is sometimes well for the instruction and encouraga
ment of those who give serious thought to the question
of Vivisection, to recall the words of persons eminent for
high qualities of intellect and of moral character, who
have passed judgment upon it. Miss E. Jane VVhately

daughter of Archbishop VVhately—was respected, trusted,

. . . ’-
and loved in no common degree by a large olrcle of friends

and acquaintances.

In the preface to a short memoir of her, by her sister,
published in 1893, there is the following tribute to her
worth from the pen of the well-known author of “The
Schijnberg—Gotta Family” : “If I were to fix on one quality
as especially characteristic of her, it would be truth—
truth of perception, which rested on entire truthfulness
of character. She was true to the core in mind and
heart. True, because she was clear-sighted, candid to
acknowledge difﬁculties in thought or memory, and
therefore tolerant to differ

What do you see in this text? Does the OCR look good? Are there parts of the text that you think shouldn't be included in a model that helps with discovering texts?

Try picking a different file and reading through it with the same questions.
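
If we want a rough number to go alongside our own reading, one option is to count how many tokens in a file look like ordinary words. The sketch below is only an illustrative heuristic: the function name `rough_ocr_score` and the rule it applies are assumptions made for this example, not a standard OCR-quality measure. It flags stray symbols but cannot tell that a string like "olrcle" is a misread word, which is one of the places where a human reader is still needed.

```python
import re

# A token counts as "ordinary" if it is alphabetic, allowing internal
# hyphens or apostrophes (straight or curly). This rule is an assumption.
WORD_LIKE = re.compile(r"^[A-Za-z]+(?:['\u2019-][A-Za-z]+)*$")

# Punctuation stripped from both ends of a token before testing it.
EDGE_PUNCTUATION = ".,;:!?\"'()[]-\u2018\u2019\u201c\u201d\u2014"


def rough_ocr_score(text):
    """Return the share of whitespace-separated tokens that look like words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    ordinary = 0
    for token in tokens:
        core = token.strip(EDGE_PUNCTUATION)
        if WORD_LIKE.match(core):
            ordinary += 1
    return ordinary / len(tokens)


# A few tokens taken from the sample page above, including OCR debris.
sample = "IT is sometimes well for the instruction and encouraga ment . . . \u2019- olrcle"
score = rough_ocr_score(sample)
print(round(score, 2))
```

A low score suggests a page dominated by debris. A high score does not guarantee clean text, so it is best treated as a prompt for where to look, not as a verdict.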

Now that we know a bit about what types of texts we have, what goals would you have in providing discovery for this collection? What aspects of the documents would you want to focus on to expose to scholars?

One of the main advantages of using any sort of machine learning process is that we can surface relationships among the items in a collection, and features of those items, that we had not previously known or shown. The types of features could vary greatly. Maybe we want to show relationships based on the content of the documents in some cases. Maybe we want to expose something in the metadata of the documents. We might want to do both.

Here we're going to focus on the content, and specifically one type of model that allows us to make connections across the collection: topic modeling. 



### What is topic modeling?

According to [David Blei](http://www.cs.columbia.edu/~blei/topicmodeling.html), topic models are a "suite of algorithms that uncover the hidden thematic structure in document collections." Topic models operate on the idea that for any given document collection, or corpus, there is a finite number of themes, or topics, from which the corpus draws and each document is composed of words that are associated with some number of those topics. While we don't necessarily think of an author simply dipping into buckets (topics) of words and putting them together to create a document, it's turned out to be a useful model for understanding collections of documents according to the themes that cut across the collection.
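
To make the "buckets of words" picture concrete, here is a minimal sketch of that generative story using only the Python standard library. Everything in it is invented for illustration: the two topics, their terms and probabilities, and the helper name `generate_document`. It also simplifies the model, since full LDA draws each document's topic mixture from a Dirichlet prior, while here we fix the mixture by hand.

```python
import random

# Two hand-written "buckets" (topics): each maps terms to probabilities.
# All terms and numbers are invented for illustration.
topics = {
    "ethics": {"moral": 0.4, "cruelty": 0.3, "duty": 0.3},
    "science": {"experiment": 0.5, "physiology": 0.3, "laboratory": 0.2},
}


def generate_document(topic_mix, n_words, rng):
    """Draw n_words by first picking a topic, then a term from that topic."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        terms = list(topics[topic])
        term_weights = [topics[topic][term] for term in terms]
        words.append(rng.choices(terms, weights=term_weights, k=1)[0])
    return words


rng = random.Random(0)  # seeded so the draw is repeatable
doc = generate_document({"ethics": 0.8, "science": 0.2}, n_words=10, rng=rng)
print(doc)
```

Fitting a topic model runs this story in reverse: given only the documents, the algorithm estimates topics and per-document mixtures that could plausibly have produced them.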

There are quite a few types of topic models, but we'll focus on one of the most common: latent Dirichlet allocation (LDA). LDA topic modeling is a form of unsupervised machine learning: we provide an unlabeled corpus of texts to the algorithm, which then produces the model. We usually also provide the number of topics the model should use. While there are processes for determining the "correct" number of topics, many consider this part of topic modeling a bit of an art, determined as much by the research questions of the person running the model as by the corpus and the model. Other types of topic models address particular aspects and problems of certain corpora, such as temporal variation and author bias.

A topic model gives us a number of data objects. We'll have a list of topics, which are distributions over terms, though we could think of topics somewhat simply as sets of regularly co-occurring terms. We'll also have a representation of each document in the corpus as a vector denoting the composition of the document according to the topics. In other words, we'll have an account of how much of each document is associated with each topic.

Key resource: [Probabilistic Topic Models](http://www.cs.columbia.edu/~blei/papers/Blei2012.pdf) by David Blei. 

### Other models or approaches that could be useful for discovery

- Keyword extraction
- Automated summarization
- Entity extraction, including geospatial data
- Various clustering algorithms
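
As a taste of the first item in the list above, here is a minimal keyword-extraction sketch that ranks each document's terms by TF-IDF, using only the standard library. The function name `tfidf_keywords`, the toy documents, and the particular TF-IDF weighting are assumptions made for this example. Libraries such as gensim and scikit-learn offer their own implementations with different defaults.

```python
import math
from collections import Counter


def tfidf_keywords(tokenized_docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))
    keywords = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by score (highest first), breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:top_n])
    return keywords


# Invented, pre-tokenized documents.
toy_docs = [
    ["animal", "moral", "cruelty", "moral"],
    ["animal", "experiment", "laboratory", "experiment"],
    ["animal", "truth", "character", "truth"],
]
keywords = tfidf_keywords(toy_docs, top_n=2)
print(keywords)
```

Because "animal" appears in every toy document, it scores zero and drops out of the keywords. That is the behavior we want from a term that does nothing to distinguish one item from another.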

## Modeling the corpus

## Visualizing the corpus

## Critical Reflection?