
# Find the Dominant Document For Each Topic

Once we had a topic model with a reasonably even distribution and bags of words that formed fairly coherent topics, we needed to explore the specific topics in the corpus. To do this, we created a pandas DataFrame, which works much like a spreadsheet but makes the full functionality of Python available on top of it.

The first two cells import the necessary modules and load the data.


In [1]:
import pandas as pd
import json
from gensim import corpora 
from gensim.models.ldamodel import LdaModel 
from gensim.corpora.dictionary import Dictionary

In [2]:
lda_model = LdaModel.load('./models/PrelimTOpicModel2') 
corpus_dict = Dictionary.load_from_text('./models/corpus_dictionary_2')
with open('./models/corpus.json', 'r') as fp:
    corpus = json.load(fp)
with open('./models/text_list.json', 'r') as fp:
    text_list = json.load(fp)
with open('./models/corpus_list.json', 'r') as fp:
    corpus_list = json.load(fp)

The following code is the primary function that creates the dataframe. The dataframe has a row for each page in the corpus, showing which topic is dominant for the words on that page and what the distinctive words are for that topic. It also includes the PDF and page number of the page being analyzed.

This allowed us to go back and look at the page for further context, in order to better understand the topics. 

In [3]:
# Builds a pandas DataFrame with one row per document (page): its dominant topic,
# that topic's share of the page, and the topic's top keywords.
def format_topics_sent(ldamodel, corpus, texts):
    rows = []
    for i, row in enumerate(ldamodel[corpus]):
        # row[0] holds the (topic_num, proportion) pairs for this document;
        # sort them so the largest proportion comes first.
        row = sorted(row[0], key=lambda x: x[1], reverse=True)
        # The first pair after sorting is the dominant topic.
        topic_num, prop_topic = row[0]
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])
    # DataFrame.append was removed in pandas 2.0, so rows are collected in a list first.
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_topic', 'Perc_Contrib', 'Topic_Keywords'])
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    sent_topics_df.rename(columns={0: "Text"}, inplace=True)
    return sent_topics_df

## Exploring the Dominant Topic Models

To better understand what this code does, we can inspect individual rows by creating a generator that yields them one at a time.

In [4]:
def format_topics_sent_gen(ldamodel, corpus, texts):
    for i, row in enumerate(ldamodel[corpus]):
        yield row

In [5]:
row_generator = format_topics_sent_gen(lda_model, corpus, corpus_list)

In [6]:
row = next(row_generator)

In [7]:
row

([(0, 0.010000001),
  (1, 0.010000001),
  (2, 0.010000001),
  (3, 0.010000001),
  (4, 0.010000001),
  (5, 0.010000001),
  (6, 0.010000001),
  (7, 0.010000001),
  (8, 0.010000001),
  (9, 0.76),
  (10, 0.010000001),
  (11, 0.010000001),
  (12, 0.010000001),
  (13, 0.010000001),
  (14, 0.010000001),
  (15, 0.010000001),
  (16, 0.010000001),
  (17, 0.010000001),
  (18, 0.010000001),
  (19, 0.010000001),
  (20, 0.010000001),
  (21, 0.010000001),
  (22, 0.010000001),
  (23, 0.010000001),
  (24, 0.010000001)],
 [(0, [9])],
 [(0, [(9, 2.9999995)])])
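
The three-part tuple above appears to be the shape gensim returns when a model has `per_word_topics` enabled: the topic proportions for the document, then the likely topics for each word id, then the per-word phi values. The sketch below, which assumes that reading and abbreviates the printed row to four of its 25 topics, unpacks the tuple and picks out the dominant topic by hand. The variable names are illustrative and not part of the original notebook.

```python
# The row printed above, abbreviated to four of the 25 topics for readability.
row = (
    [(0, 0.010000001), (8, 0.010000001), (9, 0.76), (10, 0.010000001)],  # topic proportions
    [(0, [9])],               # word id 0 -> most likely topic(s) for that word
    [(0, [(9, 2.9999995)])],  # word id 0 -> per-topic phi values
)

# Illustrative names for the three parts of the tuple.
doc_topics, word_topics, word_phis = row

# The dominant topic is the pair with the largest proportion.
dominant_topic, proportion = max(doc_topics, key=lambda pair: pair[1])
print(dominant_topic, round(proportion, 4))
```

This is why `format_topics_sent` sorts `row[0]` rather than `row` itself: only the first element of the tuple carries the document-level proportions.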

To look at the details of a specific topic and its word distribution, you can query `lda_model` directly. The `topn` parameter sets how many words to display.

In [8]:
lda_model.show_topic(21, topn=30)

[('god', 0.20524834),
 ('people', 0.08997392),
 ('power', 0.0812579),
 ('faith', 0.057230312),
 ('christian', 0.05547319),
 ('life', 0.05419738),
 ('word', 0.049899396),
 ('world', 0.045320693),
 ('way', 0.031713385),
 ('human', 0.030553361),
 ('reality', 0.025296446),
 ('experience', 0.023166098),
 ('doe', 0.021962931),
 ('tulud', 0.019804804),
 ('need', 0.016158376),
 ('especially', 0.013886022),
 ('like', 0.01281783),
 ('sense', 0.012462943),
 ('particularly', 0.011948879),
 ('fact', 0.011376779),
 ('just', 0.01132918),
 ('make', 0.011165283),
 ('time', 0.0108135445),
 ('g', 0.010585245),
 ('relation', 0.009754408),
 ('good', 0.00956607),
 ('example', 0.009151671),
 ('culture', 0.008596516),
 ('context', 0.008572249),
 ('challenge', 0.007773363)]
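
As a small illustration of how `format_topics_sent` turns output like this into its `Topic_Keywords` column, the sketch below joins the words from the first five pairs printed above. It hard-codes those pairs so that it runs without the model; the sum of weights is an extra added here for illustration only.

```python
# First five (word, weight) pairs from the show_topic output above.
wp = [
    ('god', 0.20524834),
    ('people', 0.08997392),
    ('power', 0.0812579),
    ('faith', 0.057230312),
    ('christian', 0.05547319),
]

# Same join that format_topics_sent uses to build the keyword string.
topic_keywords = ", ".join(word for word, weight in wp)

# Share of the topic's probability mass carried by these five words.
top_weight = sum(weight for _, weight in wp)

print(topic_keywords)
print(round(top_weight, 3))
```

Because `show_topic` defaults to `topn=10`, the keyword strings in the dataframe hold ten words each unless a different `topn` is passed.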

In [9]:
sent_topics_df = format_topics_sent(lda_model, corpus, text_list)

In [10]:
sent_topics_df

|  | Dominant_topic | Perc_Contrib | Topic_Keywords | Text |
|---|---|---|---|---|
| 0 | 9.0 | 0.7600 | œ, dorottya, martha, human, order, case, g, st... | [../pdfs/Davidson 2018.pdf, 0] |
| 1 | 9.0 | 0.6080 | œ, dorottya, martha, human, order, case, g, st... | [../pdfs/Davidson 2018.pdf, 1] |
| 2 | 9.0 | 0.6800 | œ, dorottya, martha, human, order, case, g, st... | [../pdfs/Davidson 2018.pdf, 2] |
| 3 | 21.0 | 0.5200 | god, people, power, faith, christian, life, wo... | [../pdfs/Davidson 2018.pdf, 3] |
| 4 | 9.0 | 0.6040 | œ, dorottya, martha, human, order, case, g, st... | [../pdfs/Davidson 2018.pdf, 4] |
| 5 | 9.0 | 0.8629 | œ, dorottya, martha, human, order, case, g, st... | [../pdfs/Davidson 2018.pdf, 5] |
| 6 | 9.0 | 0.8080 | œ, dorottya, martha, human, order, case, g, st... | [../pdfs/Davidson 2018.pdf, 6] |
| 7 | 14.0 | 0.5200 | right, human, word, reality, state, world, tim... | [../pdfs/Davidson 2018.pdf, 7] |
| 8 | 0.0 | 0.0400 | black, experience, life, mean, like, make, poi... | [../pdfs/Davidson 2018.pdf, 8] |
| 9 | 9.0 | 0.6800 | œ, dorottya, martha, human, order, case, g, st... | [../pdfs/Davidson 2018.pdf, 9] |


The following code was used and reused to show the details of a specific topic. This allowed us to see the parallels between the different documents.

In [11]:
sent_topics_df[sent_topics_df['Dominant_topic'] == 21.0].sort_values('Perc_Contrib', ascending=False)

|  | Dominant_topic | Perc_Contrib | Topic_Keywords | Text |
|---|---|---|---|---|
| 647 | 21.0 | 0.7600 | god, people, power, faith, christian, life, wo... | [../pdfs/Thompson 2017.pdf, 3] |
| 1047 | 21.0 | 0.7127 | god, people, power, faith, christian, life, wo... | [../pdfs/Izuzquiza - 2011 - Breaking bread not... |
| 992 | 21.0 | 0.7060 | god, people, power, faith, christian, life, wo... | [../pdfs/Cruz - 2010 - Chapter Five. A Differe... |
| 1054 | 21.0 | 0.6937 | god, people, power, faith, christian, life, wo... | [../pdfs/Izuzquiza - 2011 - Breaking bread not... |
| 548 | 21.0 | 0.6903 | god, people, power, faith, christian, life, wo... | [../pdfs/Nnamani 2015.pdf, 3] |
| 520 | 21.0 | 0.6800 | god, people, power, faith, christian, life, wo... | [../pdfs/Strine 2018.pdf, 1] |
| 810 | 21.0 | 0.6260 | god, people, power, faith, christian, life, wo... | [../pdfs/Cruz - 2010 - Chapter Three. Expandin... |
| 1053 | 21.0 | 0.6158 | god, people, power, faith, christian, life, wo... | [../pdfs/Izuzquiza - 2011 - Breaking bread not... |
| 153 | 21.0 | 0.6090 | god, people, power, faith, christian, life, wo... | [../pdfs/Frederiks and Nagy - 2016 - Religion,... |
| 546 | 21.0 | 0.6017 | god, people, power, faith, christian, life, wo... | [../pdfs/Nnamani 2015.pdf, 1] |
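
Since this filter-and-sort step was rerun for many topics, it could be wrapped in a small helper. The sketch below is one possible shape for that. `top_pages_for_topic` is a hypothetical name, not something from the original notebook, and `toy_df` is a stand-in built from a few rows shown in the outputs above so the example runs without the model.

```python
import pandas as pd

# Toy stand-in for sent_topics_df, built from a few rows shown above.
toy_df = pd.DataFrame({
    "Dominant_topic": [9, 21, 14, 21],
    "Perc_Contrib": [0.7600, 0.5200, 0.5200, 0.7600],
    "Text": [
        ["../pdfs/Davidson 2018.pdf", 0],
        ["../pdfs/Davidson 2018.pdf", 3],
        ["../pdfs/Davidson 2018.pdf", 7],
        ["../pdfs/Thompson 2017.pdf", 3],
    ],
})

def top_pages_for_topic(df, topic_num, n=10):
    """Return up to n pages where topic_num is dominant, strongest first."""
    matches = df[df["Dominant_topic"] == topic_num]
    return matches.sort_values("Perc_Contrib", ascending=False).head(n)

result = top_pages_for_topic(toy_df, 21)
print(result)
```

With the real dataframe, the call would be `top_pages_for_topic(sent_topics_df, 21)`.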


Exploring each topic was helpful, but we also wanted a shorter dataframe that listed each topic alongside the document that best exemplified it. The next cell groups the dataframe by dominant topic, and the cell after that builds a new dataframe containing only the best-exemplifying page for each topic.

In [12]:
grpd_df = sent_topics_df.groupby('Dominant_topic')

In [13]:
# This code creates a pandas DataFrame that shows which document best exemplifies each topic
new_df = pd.DataFrame()

for i, grp in grpd_df:
    new_df = pd.concat([new_df, grp.sort_values(['Perc_Contrib'], ascending=False).head(1)], axis=0)

new_df.reset_index(drop=True, inplace=True)
new_df.columns = ['Topic_Num', 'Topic_Perc_Contrib', 'Keywords', 'Text']
new_df

|  | Topic_Num | Topic_Perc_Contrib | Keywords | Text |
|---|---|---|---|---|
| 0 | 0.0 | 0.52 | black, experience, life, mean, like, make, poi... | [../pdfs/Rowlands 2018.pdf, 16] |
| 1 | 1.0 | 0.5854 | identity, challenge, term, experience, context... | [../pdfs/Frederiks and Nagy - 2016 - Religion,... |
| 2 | 2.0 | 0.4688 | worker, domestic, migrant, filipina, condition... | [../pdfs/Cruz - 2010 - Preliminary Material.pd... |
| 3 | 3.0 | 0.52 | migrant, country, home, community, family, exp... | [../pdfs/Snyder 2018.pdf, 16] |
| 4 | 4.0 | 0.52 | migration, context, study, challenge, communit... | [../pdfs/Snyder 2018.pdf, 5] |
| 5 | 5.0 | 0.7828 | social, political, economic, immigrant, societ... | [../pdfs/Jimenez 2019.pdf, 7] |
| 6 | 6.0 | 0.6744 | church, christian, american, immigrant, commun... | [../pdfs/Nnamani 2015.pdf, 5] |
| 7 | 7.0 | 0.6938 | theology, experience, theological, tulud, cont... | [../pdfs/cruz2010.pdf, 34] |
| 8 | 8.0 | 0.4646 | group, community, religious, social, role, tim... | [../pdfs/Frederiks and Nagy - 2016 - Religion,... |
| 9 | 9.0 | 0.9751 | œ, dorottya, martha, human, order, case, g, st... | [../pdfs/Frederiks and Nagy - 2016 - Religion,... |
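
The same "best page per topic" table can also be built without an explicit loop. The sketch below is an alternative to the loop in the notebook, not the method used there: it uses `groupby(...).idxmax()` to find the index label of the highest-scoring row in each group, then selects those rows. It runs on a toy dataframe built from a few values shown in the outputs above.

```python
import pandas as pd

# Toy stand-in for sent_topics_df, built from a few rows shown above.
toy_df = pd.DataFrame({
    "Dominant_topic": [9, 9, 21, 21, 14],
    "Perc_Contrib": [0.7600, 0.8629, 0.5200, 0.7600, 0.5200],
    "Text": [
        ["../pdfs/Davidson 2018.pdf", 0],
        ["../pdfs/Davidson 2018.pdf", 5],
        ["../pdfs/Davidson 2018.pdf", 3],
        ["../pdfs/Thompson 2017.pdf", 3],
        ["../pdfs/Davidson 2018.pdf", 7],
    ],
})

# Index label of the row with the highest Perc_Contrib within each topic.
best_idx = toy_df.groupby("Dominant_topic")["Perc_Contrib"].idxmax()

# Select those rows; groupby sorts by topic number, so the result is ordered by topic.
best_rows = toy_df.loc[best_idx].reset_index(drop=True)
print(best_rows)
```

When two pages tie for the highest score within a topic, this version keeps the one that appears first in the dataframe.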


# Details of the Topic Model

One of the problems with topic modeling is that, because it is an unsupervised clustering method, the computer sometimes sees connections that are not obvious or, at the very least, are not _semantic_ clusters. Topic modeling is a blunt tool, but we picked six of these topics that we thought might be helpful in discovering books from the past 100 years that might build on the topic we had chosen.

These topics are:

* _topic number_: 0
   * _heading_: *Black Experience*
   * _key terms_: 'black, experience, life, mean, like, make, point, american, challenge, relation'
* _topic number_: 1 
   * _heading_: *Context of Migrant Experience* 
   * _key terms_: 'identity, challenge, term, experience, context, question, migrant, people, state, dorottya'
* _topic number_: 3
   * _heading_: *Communal Experience*
   * _key terms_: 'migrant, country, home, community, family, experience, life, economic, new, reality'
* _topic number_: 5
   * _heading_: *Social, Political, Economic Migrations*
   * _key terms_: 'social, political, economic, immigrant, society, cultural, perspective, issue, people, life'
* _topic number_: 6
   * _heading_: *Immigration and American Christianity*
   * _key terms_: 'church, christian, american, immigrant, community, role, dorottya, martha, state, faith'
* _topic number_: 11
   * _heading_: *Religion and Culture*
   * _key terms_: 'religion, religious, culture, cultural, christian, identity, faith, experience, example, time'
   
   
These topics were analysed in the context of the PDFs that generated them. These were the topics that we thought were both coherent and likely to provide interesting analysis when compared against the political theology corpus generated from HathiTrust.

These six are the only topics we looked for in the HathiTrust corpus we had identified.
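
For later filtering, the six selected topics and their headings could be kept in a small lookup table. The sketch below is only an illustration of that idea: `topic_headings` and the `Heading` column are hypothetical names, and `toy_df` is a stand-in using topic numbers and scores that appear in the outputs above.

```python
import pandas as pd

# Headings from the list above, keyed by topic number.
topic_headings = {
    0: "Black Experience",
    1: "Context of Migrant Experience",
    3: "Communal Experience",
    5: "Social, Political, Economic Migrations",
    6: "Immigration and American Christianity",
    11: "Religion and Culture",
}

# Toy stand-in for sent_topics_df; topics 9 and 21 are not among the six.
toy_df = pd.DataFrame({
    "Dominant_topic": [0, 9, 5, 21, 6],
    "Perc_Contrib": [0.52, 0.9751, 0.7828, 0.76, 0.6744],
})

# Keep only pages whose dominant topic is one of the six, then label them.
selected = toy_df[toy_df["Dominant_topic"].isin(list(topic_headings))].copy()
selected["Heading"] = selected["Dominant_topic"].map(topic_headings)
print(selected)
```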
