## Overview

This notebook offers a brief overview of the computational approach to topic modeling that I'm taking for my dissertation project. I've selected a run of *Forest & Stream*, an outdoor sports magazine from the end of the nineteenth century, to demonstrate the results of this topic modeling and show the kinds of output that come from these models. The code for this process can be read by toggling each of the sections below, but the final visualization and the topics with the words that comprise them appear below in the __Results__ section.

## Topic Modeling

### Define Parameters

We start with a list of the ID numbers for the volumes to analyze. In this case it is a series of volumes of *Forest & Stream* from the late nineteenth century. The number of topics (chosen from a preliminary mathematical analysis of the coherence of models with various numbers of topics) and the other parameters for the models are also set here.
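To illustrate the kind of coherence analysis mentioned above, here is a minimal sketch of a UMass-style coherence score written in plain Python. This is an assumption for illustration only: the notebook does not say which coherence measure was used, and the function name `umass_coherence` and the toy documents are hypothetical. In practice the score would be computed for models fit with different numbers of topics and the results compared.

```python
import math

def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic.

    top_words: the topic's top words, most probable first.
    documents: a list of token lists (one list per page or document).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # skip words that never occur in the documents
            co_occur = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occur + 1) / denom)
    return score

# Hypothetical toy documents, not drawn from the corpus
toy_docs = [
    ["trout", "rod", "stream", "fly"],
    ["trout", "stream", "fly", "cast"],
    ["yacht", "race", "sail", "club"],
    ["yacht", "sail", "race", "wind"],
]

coherent = umass_coherence(["trout", "stream", "fly"], toy_docs)
mixed = umass_coherence(["trout", "yacht", "fly"], toy_docs)
print(coherent > mixed)
```

A topic whose top words tend to appear on the same pages scores higher than one that mixes unrelated words, which is the intuition behind comparing average coherence across candidate topic counts.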

In [17]:
# Set Parameters
num_topics = 8 #number of topics for the model
extract_key = False #False means that keyword page filtering will NOT happen
keys = ['machine', 'machinery', 'factory', 'manufacture', 'industry', 'industrial']
rem_stop = True #True means that stopwords will be removed from the set

htids = ['mdp.39015006960549',
'umn.31951p01140200w',
'mdp.39015012335488',
'mdp.39015012335405',
'mdp.39015084559023',
'mdp.39015048401601',
'mdp.39015013722171',
'mdp.39015049862595',
'mdp.39015084558868',
'mdp.39015084559155',
'mdp.39015006947595',
'mdp.39015022405248',
'mdp.39015012371939',
'mdp.39015012372010',
'mdp.39015084558850',
'mdp.39015084559007',
'mdp.39015006947421',
'mdp.39015006947256',
'mdp.39015006947439',
'mdp.39015006947264',
'mdp.39015006947231',
'mdp.39015049799383',
'mdp.39015006947249',
'mdp.39015006959996',
'mdp.39015084559148',
'mdp.39015049862561',
'mdp.39015006947801',
'mdp.39015006947322',
'mdp.39015049862579',
'mdp.39015084558843',
'mdp.39015006947330',
'osu.32435062356423',
'mdp.39015006947348',
'mdp.39015006947355',
'mdp.39015006947678',
'uc1.c0000084657',
'uc1.c0000084665',
'mdp.39015047692945',
'mdp.39015030547908',
'mdp.39015079983212',
'mdp.39015079982909',
'mdp.39015079983071',
'mdp.39015047692937',
'mdp.39015030813276',
'mdp.39015079982925',
'mdp.39015079983063',
'mdp.39015079982917',
'mdp.39015012335983',
             ]

### Clean the Data

The data from the HathiTrust comes as a series of zipped files that are converted into DataFrames, which are essentially giant tables full of textual information including the specific words on each page, the count of each word on each page, and the part-of-speech tag for each word. This data has to be cleaned, organized, and processed to remove errors in the optical character recognition, to eliminate stopwords (words with little semantic value like "be" and "the"), and to de-count the words (repeat each word as many times as it was counted) so that we have a raw list of tokens for processing in the model.
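The de-counting step can be sketched on its own with plain Python. This is a simplified illustration under stated assumptions, not the `cleaner` function itself: the rows, the tiny stopword set, and the helper name `decount` are hypothetical, and the part-of-speech filtering and lemmatization are left out.

```python
import re

# Hypothetical (token, count) rows for a single page
page_counts = [("trout", 3), ("the", 5), ("st", 2), ("stream", 1), ("1887", 2)]

# Small stand-in for a full stopword list
STOPWORDS = {"the", "be", "and"}

def decount(rows, stopwords=STOPWORDS, min_len=3):
    """Expand (token, count) rows into a flat list of cleaned tokens."""
    tokens = []
    for token, count in rows:
        # Strip everything except ASCII letters
        cleaned = re.sub(r"[^a-zA-Z]", "", token.lower())
        if len(cleaned) < min_len:
            continue  # drop fragments and tokens emptied by the filter
        if cleaned in stopwords:
            continue  # drop words with little semantic value
        # Repeat the token once for every time it was counted on the page
        tokens.extend([cleaned] * count)
    return tokens

tokens = decount(page_counts)
print(tokens)
```

The result is a list in which each surviving word appears as many times as it was counted on the page, which is the form a bag-of-words model expects.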

In [18]:
#Import Libraries
import os
import pandas as pd
import numpy as np
import re
from htrc_features import FeatureReader, Volume, utils
import nltk
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
wnl = WordNetLemmatizer()

#define the cleaner function called by the base modeling script.
def cleaner(htids,keys,extract_key=False,rem_stop=True):
    #define the dictionaries and DataFrame to hold the data throughout script
    corpus_dict = {} #dictionary of token list data frames
    token_dict = {} #dictionary of modified token list data frames
    key_tlist = pd.DataFrame() #array to pass pages numbers with key words

#create a dictionary of DataFrames (key = htid, dataframe = value)
    for htid in htids:
        vol = Volume(htid+'.json.bz2')
        tlist = vol.tokenlist(section='body', case=False)
        corpus_dict[htid] = tlist

#for each token list data frame, clean and process the data
    for htid, tlist in corpus_dict.items():
        tlist=tlist.reset_index(drop=False, inplace=False)
        #filter non alphabetical characters
        def alphafilter(row):
            return re.sub('[^a-zA-Z]', '', row)
        tlist.loc[:, 'lowercase'] = tlist.loc[:, 'lowercase'].apply (lambda row: alphafilter(row))
        #filter anything that's less than three characters long
        tlist = tlist[tlist.loc[:, 'lowercase'].map(len)>=3]
        #filter based on the part of speech
        pos = ['NN', 'NNS', 'NNP', 'NNPS', 'RB', 'RBR', 'RBS', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'JJ', 'JJR', 'JJS']
        tlist = tlist[tlist.loc[:, 'pos'].isin(pos)]
        #remove stopwords from the set if selected
        if rem_stop == True:
            from nltk.corpus import stopwords
            stops = set(stopwords.words('english'))
            tlist = tlist[~tlist.loc[:, 'lowercase'].isin(stops)]
        else:
            print('stopwords were not removed')
        tlist['pos_new'] = tlist.loc[:, 'pos'].str.extract(r'(^\w{1})')
        convert_dict = {'page': int,
                        'section':object,
                        'lowercase': object,
                        'pos': object,
                        'pos_new': object} #ensures correct data type for index items
        tlist = tlist.astype(convert_dict)
        #converts the part-of-speech labels into single letter lowercase codes to conform to
        #format required by the nltk WordNetLemmatizer
        tlist.loc[:, 'pos_new'] = tlist.loc[:, 'pos_new'].str.lower()
        tlist.loc[:, 'pos_new'] = tlist.loc[:, 'pos_new'].replace(r'j','a')
        #converts the existing lowercase column into the lemmatized form of the token
        def lemma(row):
            lemma = wnl.lemmatize(row['lowercase'], row['pos_new'])
            return lemma
        tlist['lowercase'] = tlist.apply (lambda row: lemma(row), axis=1)
        #remove tokens that appear only once in the whole volume
        tcount = tlist['lowercase'].value_counts()
        to_remove=tcount[tcount < 2].index
        tlist.replace(to_remove, np.nan, inplace=True)
        #remove nan values
        tlist = tlist.dropna()
        #de-count the tokens = repeat each token as many times as its count on each page.
        def token_return(row):
            output = ' '
            i = 1
            if row['count']>1:
                while i < row['count']:
                    output += row['lowercase'] + ' '
                    i += 1
                return output
        tlist.loc[:, 'tokens'] = tlist.apply (lambda row: token_return(row), axis=1)
        tlist.loc[:, 'tokens'] = tlist['tokens'].fillna('') + tlist['lowercase']
        token_dict[htid] = tlist

    if extract_key == True:
        for htid, tlist in token_dict.items():
            tlist.set_index(['page', 'lowercase'], inplace=True, drop=True)
            tlist.drop(['section','pos','count', 'pos_new'], axis=1, inplace=True)
            topic_pages = tlist.loc[(slice(None), keys),]
            pages = topic_pages.index.get_level_values(0)
            idx = pd.IndexSlice #slicer based on pages
            topic_tlist = tlist.loc[idx[pages,:,:,:], :]
            topic_tlist = pd.concat([topic_tlist], keys=[htid], names=['volume'])
            topic_tlist.reset_index(drop=False, inplace=True)
            topic_tlist['id'] = topic_tlist['volume']+"_"+topic_tlist['page'].astype(str)
            topic_tlist.drop(['volume','page','lowercase'], axis=1, inplace=True)
            key_tlist = pd.concat([key_tlist, topic_tlist]) #DataFrame.append was removed in pandas 2.0
    else:
        for htid, tlist in token_dict.items(): #this section does the same thing without  selecting out the matching pages
            tlist.set_index(['page', 'lowercase'], inplace=True, drop=True)
            tlist =  pd.concat([tlist], keys=[htid], names=['volume'])
            tlist.reset_index(drop=False, inplace=True)
            tlist['id'] = tlist['volume']+"_"+tlist['page'].astype(str)
            tlist.drop(['volume','page','lowercase'], axis=1, inplace=True)
            key_tlist = pd.concat([key_tlist, tlist]) #DataFrame.append was removed in pandas 2.0

    page_tlist = key_tlist.groupby(['id'])['tokens'].apply(lambda x: ' '.join(x)).reset_index()
    page_tlist['tokens'] = page_tlist['tokens'].apply(word_tokenize)

    corpus_list = []
    for index, rows, in page_tlist.iterrows():
        token_list = rows.tokens
        corpus_list.append(token_list)

    return(corpus_list)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/bradykrien/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
#clean data
documents = cleaner(htids, keys, extract_key=extract_key, rem_stop=rem_stop)

### Develop the Model

Once the data is cleaned, organized, and prepared, I can run the model and analyze the results.
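Before the model is fit, each document is converted into a bag of words: a list of (word ID, count) pairs. The sketch below shows that idea in plain Python. It is an illustration only, with hypothetical helper names (`build_vocabulary`, `to_bow`) and toy documents. The notebook itself relies on gensim's `corpora.Dictionary` and `doc2bow`, which may assign IDs in a different order.

```python
from collections import Counter

def build_vocabulary(documents):
    """Give each distinct token an integer ID, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert one token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Hypothetical toy documents, not drawn from the corpus
toy_documents = [
    ["trout", "stream", "trout"],
    ["yacht", "race", "stream"],
]

vocab = build_vocabulary(toy_documents)
bow_corpus = [to_bow(doc, vocab) for doc in toy_documents]
print(vocab)
print(bow_corpus)
```

Word order within a page is discarded at this stage; the model only sees how often each word occurs in each document.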

In [14]:
#Import Libraries
import gensim
import pyLDAvis.gensim
from gensim import models, corpora, utils, parsing, similarities
from gensim.models.ldamulticore import LdaMulticore
from gensim.models.coherencemodel import CoherenceModel


#Need to figure out how to change the default value for the parameters above, but allow for adjusting individually.
def modeler(documents,
            run_report,
            filename,
            run_dir,
            num_topics,
            extract_key,
            keys,
            workers=47,
            chunksize=5000,
            passes=40,
            iterations=500,
            viz=True):
#create dictionary
    dictionary=corpora.Dictionary(documents)
    dictionary.save('dictionary')

#create corpus
    corpus = [dictionary.doc2bow(document) for document in documents]
    corpora.MmCorpus.serialize('corpus.mm', corpus)

#implement lda
#make an index to word dictionary
    temp = dictionary[0] #This is only to load the dictionary
    id2word = dictionary.id2token

    lda = LdaMulticore(corpus=corpus,
        id2word=id2word,
        workers=workers,
        num_topics=num_topics,
        chunksize=chunksize,
        passes=passes,
        eval_every=10,
        iterations=iterations)

    lda.save('lda_model.gensim')
    lda_model = 'lda_model.gensim'
    print('lda model created')

#This prints the topics with their relative weights to a .txt file
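    topics = lda.print_topics(num_words = 20)
    topic_file = f"20topics_{filename}.txt"
    with open(topic_file, "w+") as g: #the with block closes the file when writing is done
        for topic in topics:
            g.write(str(topic) + "\n") #one topic per line
    topics = lda.print_topics(num_words = 7)
    topic_file = f"7topics_{filename}.txt"
    with open(topic_file, "a") as g: #"a" appends, so repeated runs add to the existing file
        for topic in topics:
            g.write(str(topic) + "\n")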
    print('topic models created')

#This exports the topics and their relative weights to a csv file
    top_words_per_topic = []
    for t in range(lda.num_topics):
        top_words_per_topic.extend([(t, ) + x for x in lda.show_topic(t, topn = 20)])

    pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv")



In [37]:
from gensim.corpora.mmcorpus import MmCorpus
lda_model = 'lda_model.gensim'

corpus = MmCorpus('corpus.mm')
dictionary=gensim.corpora.Dictionary.load('dictionary')
viz=True


#create PyLDAvis visualization
if viz == True:
    lda = gensim.models.ldamodel.LdaModel.load(lda_model)
    lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, sort_topics=False)
    pyLDAvis.save_html(lda_display, "viz_visualization.html")
    print('data viz created')
else:
    print("no visualization created")

data viz created


In [41]:
pyLDAvis.display(lda_display)

In [None]:
#run the model
modeler(documents, run_report=run_report, filename=filename, run_dir=run_dir, num_topics=num_topics, extract_key=extract_key, keys=keys)

## Results

The model that is developed is split into a series of topics that can be visualized using the pyLDAvis library.

The topics are represented by the bubbles in the quadrant on the left. The top 30 words for each topic appear in the list on the right and can be accessed either by clicking on a bubble or by stepping through with the Previous Topic and Next Topic buttons above the quadrant.

The topics might be labeled in the following ways:
1. __Boats and Racing:__ Focuses on boats and primarily on racing and competitions
1. __Dogs:__ Focuses on breeding, competing, and hunting with dog breeds
1. __Hunting:__ Focuses on hunting animals for sport
1. __Publishing:__ Associated with the publication of the magazine and catalogues or advertising
1. __Fishing:__ Focuses on angling; note the overlap with the __Hunting__ category, as both are associated with sports of pursuit
1. __Trap Shooting:__ Focuses on competitive shooting; note the overlap with the __Boats and Racing__ category, as both deal with competition
1. __Narratives:__ This topic involves words used primarily to drive a narrative forward, and it seems, based on the overlap with the __Hunting__ and __Fishing__ categories, that these are likely the primary subjects of these narratives.

These results largely align with what we would expect from *Forest & Stream* during the late nineteenth century. This magazine offers an ideal test case because it is quite topical (meaning that we have a good sense of what to expect) and largely homogeneous in its topics (meaning that the model has to be particularly robust in order to distinguish between topics, like hunting and angling, that might share much of the same vocabulary).
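One rough way to quantify how much vocabulary two topics share is the Jaccard overlap of their top words. The sketch below is an illustration under stated assumptions: the word lists are hypothetical placeholders rather than output from this model, and `jaccard` is a helper defined here. With the fitted model, the lists could instead be taken from `lda.show_topic(t, topn=20)`.

```python
from itertools import combinations

def jaccard(words_a, words_b):
    """Share of distinct words that two word lists have in common."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical top words per topic, for illustration only
toy_topics = {
    "hunting": ["deer", "camp", "rifle", "trail", "river", "guide"],
    "fishing": ["trout", "rod", "river", "camp", "guide", "stream"],
    "boats": ["yacht", "race", "sail", "club", "wind", "crew"],
}

# Overlap for every pair of topics, keyed by the alphabetized pair of names
overlaps = {
    (a, b): jaccard(toy_topics[a], toy_topics[b])
    for a, b in combinations(sorted(toy_topics), 2)
}

for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

A score near 0 means the two topics draw on almost entirely different words, while a higher score flags pairs, such as hunting and angling, that the model has to separate despite shared vocabulary.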

In [16]:
#load visualization
from IPython.display import HTML
HTML(filename='viz_Forest&Stream_8.html')
