## Multiple File Exploration and Analysis

This notebook loads the data created in "MultiFile-Prep" and uses it to build a topic model with Gensim.

In [1]:
import pandas as pd
import gensim
from gensim import corpora
import numpy as np

### Read in data

We exported the prepared data frame as a pickle file, so we'll read it back into pandas here.

In [2]:
vol_df = pd.read_pickle('processed_data/ucsf_medical.pkl')

In [3]:
vol_df

Unnamed: 0,htid,page_number,page_tokens,title
0,uc1.31378007786000,18,"[albert, alfred, allen, alphabetically, anatom...",Announcement of the College of Dentistry.
1,uc1.31378007786000,76,"[absent, alfred, allen, alphabetically, anatom...",Announcement of the College of Dentistry.
2,uc1.31378007786000,137,"[absent, albert, albert, ardell, arthur, assis...",Announcement of the College of Dentistry.
3,uc1.31378007786000,199,"[academic, academic, academic, acceptable, adj...",Announcement of the College of Dentistry.
4,uc1.31378007786000,239,"[albert, ardell, arthur, assistant, assistant,...",Announcement of the College of Dentistry.
...,...,...,...,...
203,uc1.31378005266823,133,"[acute, adjacent, ailable, allied, almost, alm...",School of Nursing.
204,uc1.31378005266823,195,"[ambulatory, avenue, building, california, car...",School of Nursing.
205,uc1.31378005266823,198,"[academic, account, addition, adjacent, admini...",School of Nursing.
206,uc1.31378005266823,199,"[acute, adjacent, allied, almost, almost, anes...",School of Nursing.


### Create Word Vectors for Gensim

Gensim requires a specific format for term frequency vectors. We'll go through the existing dataframe row by row to create and populate the lists it needs.

For topic modeling or other natural language processing projects, you often need to decide on how broadly or narrowly you want to define a document. For example, you could define a document to be a sentence, paragraph, page, or multi-page publication. 

For this exercise, we'll consider each *page* in the corpus to be a document. 
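
If you later want to treat each volume, rather than each page, as a document, one possible approach is to concatenate the page token lists for each `htid`. The following is a minimal sketch on a toy frame: the column names match `vol_df`, but the values and the name `volume_docs` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for vol_df: same column names, made-up values
toy_df = pd.DataFrame({
    "htid": ["vol_a", "vol_a", "vol_b"],
    "page_number": [1, 2, 1],
    "page_tokens": [["dental", "clinic"], ["dental", "school"], ["nursing", "care"]],
})

# Concatenate every page's token list within a volume, in page order
volume_docs = (
    toy_df.sort_values(["htid", "page_number"])
          .groupby("htid")["page_tokens"]
          .agg(lambda pages: [tok for page in pages for tok in page])
)

print(volume_docs.to_dict())
```

Larger documents give the model more words per document but far fewer documents overall, so the resulting topics may look quite different from the page-level ones.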

In [4]:
bow_text_lists = []  # one token list per page (document)
text_len = []        # number of tokens on each page

for _, row in vol_df.iterrows():
    words = row.page_tokens
    bow_text_lists.append(words)
    text_len.append(len(words))

### Word Vectors

We have 208 documents (defined as a page) in our collection, ranging from 72 to 478 words per document. 

In [5]:
print(len(bow_text_lists))

208


In [6]:
print("max:", max(text_len), "min:", min(text_len), "mean:", np.mean(text_len))

max: 478 min: 72 mean: 217.75


In [7]:
print(text_len)

[188, 190, 191, 182, 219, 196, 201, 196, 189, 261, 191, 198, 180, 231, 239, 191, 238, 138, 217, 238, 78, 130, 233, 281, 182, 159, 183, 195, 218, 174, 168, 196, 198, 202, 201, 132, 237, 214, 105, 232, 274, 147, 131, 245, 212, 122, 253, 250, 122, 258, 416, 142, 130, 228, 287, 271, 193, 298, 249, 274, 135, 232, 237, 72, 135, 181, 135, 211, 148, 242, 92, 151, 257, 101, 146, 208, 258, 87, 143, 227, 257, 91, 144, 222, 252, 257, 143, 128, 220, 212, 217, 247, 151, 131, 220, 248, 153, 130, 308, 293, 87, 378, 163, 207, 363, 267, 478, 206, 201, 361, 386, 216, 198, 348, 275, 432, 231, 343, 183, 336, 231, 345, 182, 323, 99, 225, 284, 252, 248, 104, 227, 287, 248, 243, 211, 114, 232, 290, 251, 247, 214, 140, 244, 212, 96, 232, 304, 140, 114, 240, 156, 191, 130, 242, 217, 99, 223, 147, 135, 248, 286, 146, 248, 285, 191, 99, 227, 288, 177, 133, 241, 216, 128, 308, 291, 85, 162, 164, 204, 357, 264, 393, 208, 202, 359, 338, 213, 196, 348, 235, 341, 181, 144, 242, 284, 233, 346, 183, 187, 346, 313, 233, 

### Create Gensim dictionary and corpus

A dictionary assigns an integer id to every unique word in the corpus and provides an id-to-token lookup.

Our corpus stores a term frequency vector (a "bag of words") for each document in our collection.

In [22]:
dictionary = corpora.Dictionary(bow_text_lists)
corpus = [dictionary.doc2bow(text) for text in bow_text_lists]
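
To make these two structures concrete, here is a rough plain-Python sketch of the idea behind `Dictionary` and `doc2bow`. It is an illustration only, not Gensim's implementation: the toy documents and the names `token2id` and `to_bow` are assumptions, and Gensim may assign ids in a different order.

```python
from collections import Counter

# Made-up documents, each already tokenized
toy_docs = [["dental", "clinic", "dental"], ["nursing", "clinic"]]

# Assign an integer id to each unique token (here, in sorted order)
vocab = sorted({tok for doc in toy_docs for tok in doc})
token2id = {tok: i for i, tok in enumerate(vocab)}

def to_bow(doc):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]

print(token2id)
print(toy_corpus)
```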

Take a look at our dictionary...

In [9]:
n = 0

for k,v in dictionary.items():
    if n <= 15:
        print(f"{k} : {v}")
        n += 1

0 : albert
1 : alfred
2 : allen
3 : alphabetically
4 : anatomy
5 : ardell
6 : arranged
7 : assistant
8 : assistants
9 : bacteriology
10 : bailey
11 : becks
12 : berger
13 : bjorns
14 : bolin
15 : brassel


In [10]:
#vol_df.iloc[18]['page_tokens']

... and the term frequency for one of the documents (pages) in our corpus 

In [27]:
# taking a small subset to display
corpus[10][20:30]

[(214, 5),
 (215, 1),
 (219, 1),
 (220, 4),
 (221, 1),
 (222, 1),
 (227, 5),
 (229, 4),
 (233, 2),
 (234, 1)]

The term frequencies are stored in a sparse format: each document is a list of `(token_id, count)` pairs covering only the words that appear in it, since most words in the corpus won't show up in any individual document.
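
As a small illustration of what the sparse format saves, the sketch below expands a made-up list of `(token_id, count)` pairs into a full-length vector. The helper name `sparse_to_dense` and the toy values are assumptions for this example; Gensim ships a comparable utility, `gensim.matutils.sparse2full`.

```python
def sparse_to_dense(bow, vocab_size):
    """Expand a list of (token_id, count) pairs into a full-length count list."""
    dense = [0] * vocab_size
    for token_id, count in bow:
        dense[token_id] = count
    return dense

# Made-up sparse vector for a vocabulary of 8 tokens
toy_bow = [(1, 5), (4, 1), (6, 2)]
dense = sparse_to_dense(toy_bow, vocab_size=8)

print(dense)
print(f"stored pairs: {len(toy_bow)}, dense length: {len(dense)}")
```

With a real vocabulary of thousands of words and only a few hundred tokens per page, the dense form would be almost entirely zeros.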

In [30]:
# with the terms included
for c in corpus[10][20:30]:
    print(dictionary[c[0]], c[1])

hygiene 5
hygienist 1
instruction 1
instruments 4
introductory 1
inﬁrmary 1
laboratory 5
lectures 4
may 2
medicine 1


### Decide on a number of clusters and create a topic model

Because we split our collection into two groups, "nursing" and "dentistry", let's choose 2 topics to start. Keep in mind, our topic model may not neatly divide these pages into two different clusters, because the topics discussed may overlap substantially!

Choosing the number of topics is partly an art, though there are strategies you can use. For now, try experimenting with different numbers of topics.

In [31]:
# this could take a while to run

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

number_of_topics = 2
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = number_of_topics, id2word=dictionary, passes=50)
#ldamodel.save('model10.gensim')

### Topics and word frequency

Let's look at the words most associated with each of the two topics Gensim identified.

In [33]:
topics = ldamodel.show_topics(num_words=20, formatted=False)
for topic in topics:
    print("topic:", topic[0])
    for term in topic[1]:
        print(term)
    print()

topic: 0
('health', 0.020686977)
('medical', 0.015331215)
('center', 0.014219145)
('school', 0.013439278)
('dental', 0.012886671)
('parnassus', 0.012145497)
('sciences', 0.0119149)
('ucsf', 0.010534712)
('research', 0.01006763)
('care', 0.009978137)
('one', 0.0098702535)
('hospital', 0.009625308)
('california', 0.009512477)
('university', 0.0087313065)
('college', 0.008162143)
('building', 0.007994219)
('nursing', 0.0077070864)
('state', 0.0074101747)
('campus', 0.0073062745)
('san', 0.006985879)

topic: 1
('dentistry', 0.019595204)
('dental', 0.018449912)
('instructor', 0.016787043)
('health', 0.015214322)
('clinical', 0.0133632785)
('assistant', 0.010476853)
('sciences', 0.010434281)
('operative', 0.008627841)
('professor', 0.008539516)
('students', 0.008039751)
('university', 0.0072177066)
('loan', 0.0067823227)
('student', 0.0065705534)
('campus', 0.006421274)
('fund', 0.0058896127)
('applicants', 0.005764631)
('school', 0.0055220807)
('hygiene', 0.0052591735)
('san', 0.0051421747)

### Term-topic probabilities

Look up how strongly a single term is associated with each topic.

In [34]:
ldamodel.get_term_topics('nursing', minimum_probability=0)

[(0, 0.007688786), (1, 0.0051018307)]

### Topic visualization

pyLDAvis provides excellent interactive visualizations for exploring topic models.

In [None]:
warnings.filterwarnings("ignore", category=DeprecationWarning)

# load the dictionary, corpus, and LDA model we created earlier:
#dictionary = gensim.corpora.Dictionary.load('dictionary.gensim')
#corpus = pickle.load(open('corpus.pkl', 'rb'))

# If you saved a model to disk earlier, you can load it here instead; use the same file name you saved it under (above, model10.gensim)
#ldamodel = gensim.models.ldamodel.LdaModel.load('model10.gensim')

# import pyLDAvis and ready it for use in a notebook:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

# feed pyLDAvis the pieces generated from Gensim and create the visualization:
lda_display = gensimvis.prepare(ldamodel, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)

### Getting topic probabilities for each document

You may want to know the probability Gensim estimates for each topic in a specific document.

In [38]:
ldamodel.get_document_topics(corpus[120])

[(0, 0.20181142), (1, 0.79818857)]

Let's take a look at all documents and add the probabilities to our dataframe.

In [40]:
document_topics = []
for c in corpus:
    document_topics.append(ldamodel.get_document_topics(c))

In [41]:
vol_df['topics'] = document_topics

In [42]:
vol_df.sort_values(by='title')

Unnamed: 0,htid,page_number,page_tokens,title,topics
0,uc1.31378007786000,18,"[albert, alfred, allen, alphabetically, anatom...",Announcement of the College of Dentistry.,"[(1, 0.9971179)]"
75,uc1.31378007786018,465,"[activities, advanced, agencies, aim, analyzes...",Announcement of the College of Dentistry.,"[(0, 0.9902214)]"
74,uc1.31378007786018,427,"[activities, addition, administration, agricul...",Announcement of the College of Dentistry.,"[(0, 0.99609435)]"
73,uc1.31378007786018,396,"[anatomy, anatomy, anesthesia, bachelor, bacte...",Announcement of the College of Dentistry.,"[(1, 0.99317986)]"
72,uc1.31378007786018,391,"[act, age, age, also, also, anaesthesia, anato...",Announcement of the College of Dentistry.,"[(0, 0.9976324)]"
...,...,...,...,...,...
118,uc1.31378004834720,279,"[academic, adjacent, allied, almost, almost, a...",UCSF School of Dentistry bulletin.,"[(1, 0.99253035)]"
119,uc1.31378004834720,294,"[aadsas, aadsas, aadsas, aadsas, aadsas, aadsa...",UCSF School of Dentistry bulletin.,"[(1, 0.9971509)]"
120,uc1.31378004834720,335,"[academic, academicians, access, actively, adm...",UCSF School of Dentistry bulletin.,"[(0, 0.20181063), (1, 0.79818934)]"
114,uc1.31378004834720,228,"[accredited, accredited, addition, additional,...",UCSF School of Dentistry bulletin.,"[(0, 0.017226098), (1, 0.9827739)]"
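
Notice that many rows list only one topic. `get_document_topics` leaves out topics whose probability falls below its `minimum_probability` threshold, so the lists vary in length. The sketch below shows one way to turn such lists into fixed-length rows; the helper name `topics_to_row` is an assumption for this example.

```python
def topics_to_row(doc_topics, num_topics):
    """Turn [(topic_id, prob), ...] into a fixed-length list, using 0.0 for omitted topics."""
    row = [0.0] * num_topics
    for topic_id, prob in doc_topics:
        row[topic_id] = float(prob)
    return row

# Values copied from the table above
examples = [[(1, 0.9971179)], [(0, 0.20181063), (1, 0.79818934)]]
rows = [topics_to_row(t, num_topics=2) for t in examples]

for row in rows:
    # Index of the largest probability is the dominant topic
    dominant = max(range(len(row)), key=row.__getitem__)
    print(row, "dominant topic:", dominant)
```

Alternatively, passing `minimum_probability=0` to `get_document_topics` should return an entry for every topic, which avoids the need to fill in zeros afterwards.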


### Exercise:

* Try changing the number of topics.
* Try building a topic model by document, rather than by page (this is probably way too involved for the workshop, but might be interesting to try later)

### Analyzing a single document in the general topic model

Take a look at the results for a single document (you'll need to look up the page number for the document). For example, you can look up the document for the first record: https://babel.hathitrust.org/cgi/pt?id=uc1.31378007786000&view=1up&seq=9

In [45]:
vol_df.iloc[0]

htid                                          uc1.31378007786000
page_number                                                   18
page_tokens    [albert, alfred, allen, alphabetically, anatom...
title                  Announcement of the College of Dentistry.
topics                                          [(1, 0.9971179)]
Name: 0, dtype: object

In [49]:
vol_df.iloc[0]['page_tokens'][150:160]

['pathology',
 'pathology',
 'pauline',
 'professor',
 'professor',
 'professor',
 'professor',
 'professor',
 'professor',
 'professor']

### Exercise

Try finding a document with a more ambiguous categorization. Does the more ambiguous categorization from Gensim match your general intuition about the document?
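
One possible way to search for ambiguous documents is to rank them by the entropy of their topic probabilities: a document split evenly between topics scores high, while a document dominated by one topic scores near zero. This is a minimal sketch on made-up distributions; the helper name `topic_entropy` and the toy values are assumptions.

```python
import math

def topic_entropy(doc_topics):
    """Shannon entropy (in bits) of a document's listed topic probabilities."""
    return -sum(p * math.log2(p) for _, p in doc_topics if p > 0)

# Made-up topic distributions keyed by hypothetical row index
toy_topics = {
    0: [(1, 0.997)],
    1: [(0, 0.20), (1, 0.80)],
    2: [(0, 0.55), (1, 0.45)],
}

# Most ambiguous documents first
ranked = sorted(toy_topics, key=lambda i: topic_entropy(toy_topics[i]), reverse=True)
print(ranked)
```

Because topics below the probability threshold are omitted from the lists, the entropy of near-certain documents is slightly underestimated, which does not affect the search for the most ambiguous ones.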