# Motivation

I will run various topic modeling algorithms using the chapters of the Bible as my corpus. I'm curious whether the chapters in the Bible will cluster together in some logical way, e.g. New Testament chapters, Psalms chapters, Minor Prophet chapters, etc. If not, will there be some other discernible pattern?

In order to do this I will first collapse all of the text in each chapter into a single observation. I will then process that data and add the tokenized and clean list of words from each chapter to a corpus list. I will then apply various topic modeling algorithms to this corpus and analyze the results for the best fit. Once that's done, I will determine which chapters belong to the various topics. From here, I can start to answer the questions I have about topic modeling the Bible.

I will use the gensim package for my analysis. This package offers several transformation methods that we will explore.

# Set up

This is my typical set up. I import the packages I will use, set my project directory, remove column and row limits, and allow Jupyter to display all of the output from each cell.

In [1]:
import os
import pandas as pd
import numpy as np
import sqlite3
import spacy
from datetime import datetime

# Set project folder as directory
os.chdir(r'C:/Users/david/Projects/Bible Analytics')

# Remove row and column limits
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

# Display all output from each cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Accessing data

In [2]:
database = 'Data/SQL database.db'

In [3]:
conn = sqlite3.connect(database)
 
df = pd.read_sql_query('SELECT * FROM t_web', conn)
 
conn.close()

In [4]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31102 entries, 0 to 31101
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     31102 non-null  object
 1   old_new  31102 non-null  object
 2   group    31102 non-null  int64 
 3   id       31102 non-null  int64 
 4   b        31102 non-null  int64 
 5   c        31102 non-null  int64 
 6   v        31102 non-null  int64 
 7   t        31102 non-null  object
 8   clean_t  31102 non-null  object
dtypes: int64(5), object(4)
memory usage: 2.1+ MB


Unnamed: 0,name,old_new,group,id,b,c,v,t,clean_t
0,Genesis,OT,1,1001001,1,1,1,"In the beginning God{After ""God,"" the Hebrew has the two letters ""Aleph Tav"" (the first and last letters of the Hebrew alphabet) as a grammatical marker.} created the heavens and the earth.",In the beginning God created the heavens and the earth.
1,Genesis,OT,1,1001002,1,1,2,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.
2,Genesis,OT,1,1001003,1,1,3,"God said, ""Let there be light,"" and there was light.","God said, ""Let there be light,"" and there was light."
3,Genesis,OT,1,1001004,1,1,4,"God saw the light, and saw that it was good. God divided the light from the darkness.","God saw the light, and saw that it was good. God divided the light from the darkness."
4,Genesis,OT,1,1001005,1,1,5,"God called the light Day, and the darkness he called Night. There was evening and there was morning, one day.","God called the light Day, and the darkness he called Night. There was evening and there was morning, one day."


# Begin

The first thing I will do is combine all of the text in each chapter into a single observation. This is actually quite simple. I will group our data by book and chapter and apply the sum function to our clean text column, which concatenates the verse strings. Then I will reset the index so that book and chapter show up as columns. Next, I will merge this data with our key dataset, which contains the book names. Finally, I will reorder the columns to suit my personal preferences.

In [5]:
chapter_text = pd.DataFrame(df.groupby(['b','c'])['clean_t'].sum())
chapter_text.reset_index(inplace=True)

key = pd.read_csv('Jupyter/Jupyter data/key_english.csv')

chapter_text = key.merge(chapter_text, how='inner', left_on='b', right_on='b')
chapter_text = chapter_text[['name', 'old_new', 'group', 'b', 'c', 'clean_t']]

chapter_text.head(2)

Unnamed: 0,name,old_new,group,b,c,clean_t
0,Genesis,OT,1,1,1,"In the beginning God created the heavens and the earth.Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.God said, ""Let there be light,"" and there was light.God saw the light, and saw that it was good. God divided the light from the darkness.God called the light Day, and the darkness he called Night. There was evening and there was morning, one day.God said, ""Let there be an expanse in the midst of the waters, and let it divide the waters from the waters.""God made the expanse, and divided the waters which were under the expanse from the waters which were above the expanse, and it was so.God called the expanse sky. There was evening and there was morning, a second day.God said, ""Let the waters under the sky be gathered together to one place, and let the dry land appear,"" and it was so.God called the dry land Earth, and the gathering together of the waters he called Seas. God saw that it was good.God said, ""Let the earth put forth grass, herbs yielding seed, and fruit trees bearing fruit after their kind, with its seed in it, on the earth,"" and it was so.The earth brought forth grass, herbs yielding seed after their kind, and trees bearing fruit, with its seed in it, after their kind: and God saw that it was good.There was evening and there was morning, a third day.God said, ""Let there be lights in the expanse of sky to divide the day from the night; and let them be for signs, and for seasons, and for days and years;and let them be for lights in the expanse of sky to give light on the earth,"" and it was so.God made the two great lights: the greater light to rule the day, and the lesser light to rule the night. He also made the stars.God set them in the expanse of sky to give light to the earth,and to rule over the day and over the night, and to divide the light from the darkness. 
God saw that it was good.There was evening and there was morning, a fourth day.God said, ""Let the waters swarm with swarms of living creatures, and let birds fly above the earth in the open expanse of sky.""God created the large sea creatures, and every living creature that moves, with which the waters swarmed, after their kind, and every winged bird after its kind. God saw that it was good.God blessed them, saying, ""Be fruitful, and multiply, and fill the waters in the seas, and let birds multiply on the earth.""There was evening and there was morning, a fifth day.God said, ""Let the earth bring forth living creatures after their kind, cattle, creeping things, and animals of the earth after their kind,"" and it was so.God made the animals of the earth after their kind, and the cattle after their kind, and everything that creeps on the ground after its kind. God saw that it was good.God said, ""Let us make man in our image, after our likeness: and let them have dominion over the fish of the sea, and over the birds of the sky, and over the cattle, and over all the earth, and over every creeping thing that creeps on the earth.""God created man in his own image. In God's image he created him; male and female he created them.God blessed them. God said to them, ""Be fruitful, multiply, fill the earth, and subdue it. Have dominion over the fish of the sea, over the birds of the sky, and over every living thing that moves on the earth.""God said, ""Behold, I have given you every herb yielding seed, which is on the surface of all the earth, and every tree, which bears fruit yielding seed. It will be your food.To every animal of the earth, and to every bird of the sky, and to everything that creeps on the earth, in which there is life, I have given every green herb for food."" And it was so.God saw everything that he had made, and, behold, it was very good. There was evening and there was morning, the sixth day."
1,Genesis,OT,1,1,2,"The heavens and the earth were finished, and all the host of them.On the seventh day God finished his work which he had made; and he rested on the seventh day from all his work which he had made.God blessed the seventh day, and made it holy, because he rested in it from all his work which he had created and made.This is the history of the generations of the heavens and of the earth when they were created, in the day that Yahweh God made earth and the heavens.No plant of the field was yet in the earth, and no herb of the field had yet sprung up; for Yahweh God had not caused it to rain on the earth. There was not a man to till the ground,but a mist went up from the earth, and watered the whole surface of the ground.Yahweh God formed man from the dust of the ground, and breathed into his nostrils the breath of life; and man became a living soul.Yahweh God planted a garden eastward, in Eden, and there he put the man whom he had formed.Out of the ground Yahweh God made every tree to grow that is pleasant to the sight, and good for food; the tree of life also in the midst of the garden, and the tree of the knowledge of good and evil.A river went out of Eden to water the garden; and from there it was parted, and became four heads.The name of the first is Pishon: this is the one which flows through the whole land of Havilah, where there is gold;and the gold of that land is good. There is aromatic resin and the onyx stone.The name of the second river is Gihon: the same river that flows through the whole land of Cush.The name of the third river is Hiddekel: this is the one which flows in front of Assyria. 
The fourth river is the Euphrates.Yahweh God took the man, and put him into the garden of Eden to dress it and to keep it.Yahweh God commanded the man, saying, ""Of every tree of the garden you may freely eat:but of the tree of the knowledge of good and evil, you shall not eat of it: for in the day that you eat of it you will surely die.""Yahweh God said, ""It is not good that the man should be alone; I will make him a helper suitable for him.""Out of the ground Yahweh God formed every animal of the field, and every bird of the sky, and brought them to the man to see what he would call them. Whatever the man called every living creature, that was its name.The man gave names to all cattle, and to the birds of the sky, and to every animal of the field; but for man there was not found a helper suitable for him.Yahweh God caused a deep sleep to fall on the man, and he slept; and he took one of his ribs, and closed up the flesh in its place.He made the rib, which Yahweh God had taken from the man, into a woman, and brought her to the man.The man said, ""This is now bone of my bones, and flesh of my flesh. She will be called Woman, because she was taken out of Man.""Therefore a man will leave his father and his mother, and will join with his wife, and they will be one flesh.They were both naked, the man and his wife, and were not ashamed."
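
One side effect of applying `sum` to strings is visible in the output above: verses are joined with no separator (for example "earth.Now"). A minimal sketch of one possible alternative is below. It assumes the same `b`, `c`, and `clean_t` columns and uses a small made-up `verses` frame in place of `df`; the names `verses` and `chapters` are mine, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df (hypothetical rows; the real table has 31,102 verses)
verses = pd.DataFrame({
    'b': [1, 1, 1],
    'c': [1, 1, 2],
    'clean_t': ['First verse.', 'Second verse.', 'Third verse.'],
})

# Join the verses of each chapter with a single space between them
chapters = (
    verses.groupby(['b', 'c'])['clean_t']
          .agg(' '.join)
          .reset_index()
)

print(chapters)
```

Joining with a space keeps the last word of one verse from fusing with the first word of the next, which matters later when the text is tokenized.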


# Pushing chapter data to SQL database

Before moving forward, I want to save this dataset to our SQL database.

In [6]:
conn = sqlite3.connect(database)

chapter_text.to_sql('chapter_text', conn, if_exists='replace', index=False)

conn.close()

1189

In [7]:
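# List every table in the database to confirm that chapter_text was written
 
conn = sqlite3.connect(database)
cursor = conn.cursor()
 
# SQL string literals take single quotes, so the Python string uses double quotes
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")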
 
for i in cursor.fetchall():
    print(i[0])
    
conn.close()

<sqlite3.Cursor at 0x296364af340>

t_web
people_names
gpe_name
chapter_text
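
The write-then-verify pattern above can also be sketched with `contextlib.closing`, which closes the connection even if a query raises. This is only an illustration: it uses an in-memory database and a made-up one-row table rather than the project's `Data/SQL database.db` file.

```python
import sqlite3
from contextlib import closing

# In-memory database so the sketch does not touch the project file
with closing(sqlite3.connect(':memory:')) as conn:
    conn.execute('CREATE TABLE chapter_text (name TEXT, c INTEGER, clean_t TEXT)')
    conn.execute("INSERT INTO chapter_text VALUES ('Genesis', 1, 'In the beginning')")
    conn.commit()

    # List the tables to confirm that the write succeeded
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'"
    ).fetchall()
    tables = [row[0] for row in rows]

print(tables)
```

Note that `with sqlite3.connect(...)` on its own manages the transaction but does not close the connection, which is why `closing` is used here.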


# Compiling stop words from NLP packages

Next, I will clean the data by removing what are called "stop words." These are commonly used words that tell us very little about any particular topic, e.g. he, they, are, and. Several NLP packages ship their own lists of stop words. I've examined each of these lists in the past and I'm comfortable removing the stop words contained in each one. As such, I will combine these lists into a single list that I will then use for processing. I will also add some additional words that are specific to the Bible.

In [8]:
from nltk.corpus import stopwords
nltk_stopwords = stopwords.words('english')

from gensim.parsing.preprocessing import STOPWORDS
gen_stopwords = list(STOPWORDS)

nlp = spacy.load('en_core_web_lg')
spacy_stopwords = list(nlp.Defaults.stop_words)

print('There are', len(nltk_stopwords), 'stopwords in nltk,', len(gen_stopwords), 'in gensim, and', len(spacy_stopwords), 'in spacy.')

There are 179 stopwords in nltk, 337 in gensim, and 326 in spacy.


In [9]:
stopwords = list(set(nltk_stopwords+gen_stopwords+spacy_stopwords+['shall', 'let', 'come', 'go', 'know', 'like']))
len(stopwords)

417
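
The three lists plus the six extra words total 848 entries (179 + 337 + 326 + 6), yet only 417 remain, because the lists overlap heavily and `set` keeps one copy of each word. A small sketch of the same idea, using tiny made-up lists rather than the real package lists:

```python
# Tiny stand-ins for the three package lists (hypothetical, not the real lists)
nltk_like = ['he', 'they', 'are', 'and']
gensim_like = ['he', 'and', 'say', 'thus']
spacy_like = ['they', 'say', 'whereas']
bible_extras = ['shall', 'let', 'come']

# The union of sets drops duplicates across the lists
combined = set(nltk_like) | set(gensim_like) | set(spacy_like) | set(bible_extras)

total_entries = len(nltk_like) + len(gensim_like) + len(spacy_like) + len(bible_extras)
print(total_entries, 'entries collapse to', len(combined), 'unique words')
print('say' in combined)
```

Keeping the result as a set rather than converting it back to a list also makes the later `not in` checks faster, since set membership does not scan every element.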

# Processing the text

I will also lemmatize the data after removing stopwords. Lemmatizing the text reduces each word to its base form (its lemma). For instance, "ran" will be changed to "run." By doing this, we ensure run and ran are treated as the same word.

I will begin by loading spaCy's large English pipeline into an nlp object and defining an empty corpus list that we will use for topic modeling. The calls to datetime.now() simply time the run and can be ignored. Then, within a FOR loop, I will iterate through the text of each chapter. First, I'll define an empty list called temp. Then, I will open a TRY block to handle any exceptions. Within this TRY block, I will create a document for each chapter by applying our nlp object to the chapter text. Then, I will create a nested FOR loop and iterate through each token of the chapter.

Within this nested FOR loop, I first filter out any proper nouns. I thought about this and decided I didn't want the topics to be driven by the people and places mentioned in each chapter. By removing proper nouns, I hope the topics will be more content driven. Next, I use conditional statements to remove any word whose lemmatized version is in our customized list of stop words. For instance, "say" is in our list of stop words, so "said" is also removed. I also remove punctuation. Finally, each remaining lemmatized token is added to the list of words for that chapter.

Once every token in a chapter has been processed, the entire list of lemmatized words is added to our corpus list. The EXCEPT block handles any errors and tells us which chapter caused them should we need to investigate.

In [11]:
nlp = spacy.load("en_core_web_lg")

corpus_list = []

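# Start timer (only used to report how long the loop takes)
start = datetime.now()

for index, row in chapter_text.iterrows():

    # Lemmas kept for the current chapter
    temp = []

    try:
        doc = nlp(row['clean_t'])

        for token in doc:
            # Skip proper nouns so people and places do not drive the topics
            if token.pos_ != 'PROPN':
                # Removing stopwords (compared on the lowercased lemma)
                if token.lemma_.lower() not in stopwords:
                    if not token.is_punct:
                        temp.append(token.lemma_.lower())

        corpus_list.append(temp)

    except Exception as err:
        # Report the chapter that failed so it can be investigated
        print('Check out this chapter:')
        print(row['name'], row['c'], err)
        print()

# Stop timer
stop = datetime.now()

print('This process took', stop-start)
print()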


This process took 0:02:37.198645



In [12]:
corpus_list[0]

['beginning',
 'create',
 'heaven',
 'earth',
 'earth',
 'formless',
 'darkness',
 'surface',
 'deep',
 'hover',
 'surface',
 'water',
 'light',
 'light',
 'light',
 'good',
 'divide',
 'light',
 'darkness',
 'light',
 'darkness',
 'evening',
 'morning',
 'day',
 'expanse',
 'midst',
 'water',
 'divide',
 'water',
 'water',
 'expanse',
 'divide',
 'water',
 'expanse',
 'water',
 'expanse',
 'expanse',
 'sky',
 'evening',
 'morning',
 'second',
 'day',
 'water',
 'sky',
 'gather',
 'place',
 'dry',
 'land',
 'appear',
 'dry',
 'land',
 'gathering',
 'water',
 'good',
 'earth',
 'forth',
 'grass',
 'herb',
 'yield',
 'seed',
 'fruit',
 'tree',
 'bear',
 'fruit',
 'kind',
 'seed',
 'earth',
 'earth',
 'bring',
 'forth',
 'grass',
 'herb',
 'yield',
 'seed',
 'kind',
 'tree',
 'bear',
 'fruit',
 'seed',
 'kind',
 'good',
 'evening',
 'morning',
 'day',
 'light',
 'expanse',
 'sky',
 'divide',
 'day',
 'night',
 'sign',
 'season',
 'day',
 'years;and',
 'light',
 'expanse',
 'sky',
 'light'

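This code took a little over two and a half minutes. The output looks good overall, although tokens such as 'years;and' show that some verses were joined without a separating space.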
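
The filtering rules can be tried out without loading the spaCy model. The sketch below is an illustration only: `FakeToken` is a hypothetical stand-in that exposes the three attributes the loop reads (`lemma_`, `pos_`, `is_punct`), and the part-of-speech tags are written by hand rather than produced by spaCy.

```python
from dataclasses import dataclass


@dataclass
class FakeToken:
    """Hypothetical stand-in for the spaCy token attributes used in the loop."""
    lemma_: str
    pos_: str
    is_punct: bool = False


def keep_lemmas(tokens, stop_words):
    """Return lowercased lemmas, skipping proper nouns, stop words and punctuation."""
    kept = []
    for token in tokens:
        lemma = token.lemma_.lower()
        if token.pos_ == 'PROPN' or token.is_punct or lemma in stop_words:
            continue
        kept.append(lemma)
    return kept


# Hand-tagged tokens for: God said, "Let there be light."
tokens = [
    FakeToken('God', 'PROPN'),
    FakeToken('say', 'VERB'),
    FakeToken(',', 'PUNCT', True),
    FakeToken('let', 'VERB'),
    FakeToken('there', 'PRON'),
    FakeToken('be', 'AUX'),
    FakeToken('light', 'NOUN'),
    FakeToken('.', 'PUNCT', True),
]

kept = keep_lemmas(tokens, {'say', 'let', 'there', 'be'})
print(kept)
```

Pulling the three conditions into one function makes it easy to check a rule change (for example, keeping proper nouns) on a handful of tokens before rerunning the full two-and-a-half-minute loop.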

# Converting corpus_list into a corpus

Before I get started, I will first construct a dictionary that will map the words in the Bible to integer ids. I will then use this dictionary to convert each chapter of the Bible into a "bag of words" (bow). This collection of converted chapters is the corpus that I will run topic analysis on.

In [13]:
import gensim.corpora as corpora

id2word = corpora.Dictionary(corpus_list)

# Convert each chapter into a bag of words: a list of (word id, count) pairs
chapter_corpus = [id2word.doc2bow(text) for text in corpus_list]
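
To make the bag-of-words idea concrete, here is a plain-Python sketch of the same two steps on two tiny made-up documents. It illustrates the concept only; the way ids are assigned here (order of first appearance) is my own choice and need not match the ids gensim's `Dictionary` produces.

```python
from collections import Counter

# Two tiny tokenized "chapters" (made up for illustration)
docs = [['light', 'darkness', 'light'], ['water', 'light']]

# Step 1: map each distinct word to an integer id
token2id = {}
for doc in docs:
    for word in doc:
        token2id.setdefault(word, len(token2id))


def to_bow(doc):
    """Return sorted (word_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[word] for word in doc)
    return sorted(counts.items())


# Step 2: convert every document to its bag of words
bows = [to_bow(doc) for doc in docs]

print(token2id)
print(bows)
```

Word order is discarded in this representation; only how often each word occurs in a chapter is kept.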

# Modeling

Gensim provides several transformation models. Documentation for these models is here: https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html

# Term Frequency - Inverse Document Frequency

I will start with TF-IDF. This transformation model is a little different from the rest because it does not reduce dimensionality. It keeps the same vector space as our original corpus but down-weights commonly occurring terms while up-weighting less commonly occurring terms. In this transformed vector space, commonly occurring words carry less weight when determining a topic, while less commonly occurring words carry more weight. The TF-IDF corpus is still a bag of words with the same dimensionality as the original corpus.

I am going to model both the original bag-of-words corpus and the TF-IDF corpus to see which performs better.
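
The re-weighting can be sketched by hand. The example below assumes the scheme that, as far as I know, gensim's `TfidfModel` applies by default: the raw count times log2(N / df), where N is the number of documents and df is the number of documents containing the word, followed by scaling each document vector to unit length. The toy corpus and the function name `tfidf_weights` are made up for illustration.

```python
import math

# Toy bag-of-words corpus: each document is a list of (word_id, count) pairs
corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 1)], [(2, 3)]]

n_docs = len(corpus)

# Document frequency: in how many documents does each word id appear?
doc_freq = {}
for bow in corpus:
    for word_id, _ in bow:
        doc_freq[word_id] = doc_freq.get(word_id, 0) + 1


def tfidf_weights(bow):
    """Weight counts by log2(N / df), then scale the vector to unit length."""
    raw = [(w, count * math.log2(n_docs / doc_freq[w])) for w, count in bow]
    norm = math.sqrt(sum(x * x for _, x in raw))
    if norm == 0:
        return raw
    return [(w, x / norm) for w, x in raw]


weighted = tfidf_weights(corpus[0])
print(weighted)
```

In the first toy document, word 0 occurs twice and word 1 only once, yet word 1 ends up with the larger weight because it appears in fewer documents.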

In [17]:
from gensim import models

In [21]:
tfidf = models.TfidfModel(chapter_corpus)
tfidf_corpus = tfidf[chapter_corpus]

In [22]:
tfidf_corpus[0]

[(0, 0.05579895713527547),
 (1, 0.06616092968307967),
 (2, 0.027957469513631487),
 (3, 0.04610837550812367),
 (4, 0.029583726054600343),
 (5, 0.010157023944924138),
 (6, 0.166255366939819),
 (7, 0.0346621330412532),
 (8, 0.01213402174889326),
 (9, 0.08581469612409519),
 (10, 0.1816294113464808),
 (11, 0.15972505546707227),
 (12, 0.21819218023932366),
 (13, 0.10201341718182848),
 (14, 0.04220053475039981),
 (15, 0.03150717929287379),
 (16, 0.14647336382853357),
 (17, 0.07321445308855057),
 (18, 0.057478585384035205),
 (19, 0.2095434477798778),
 (20, 0.16927016102139514),
 (21, 0.5195016933619402),
 (22, 0.04117384957813832),
 (23, 0.03454216332647295),
 (24, 0.0715694980756864),
 (25, 0.036607226544275284),
 (26, 0.04513673794667273),
 (27, 0.07470163443738201),
 (28, 0.03293473634939516),
 (29, 0.03018673503570108),
 (30, 0.08463856592282425),
 (31, 0.07627773108907832),
 (32, 0.015255815194767793),
 (33, 0.07470163443738201),
 (34, 0.07808889855007879),
 (35, 0.0686205811689194),
 (36

In [20]:
chapter_corpus[0]

[(0, 1),
 (1, 3),
 (2, 1),
 (3, 3),
 (4, 1),
 (5, 1),
 (6, 6),
 (7, 2),
 (8, 2),
 (9, 3),
 (10, 5),
 (11, 4),
 (12, 5),
 (13, 4),
 (14, 10),
 (15, 1),
 (16, 5),
 (17, 2),
 (18, 2),
 (19, 19),
 (20, 6),
 (21, 9),
 (22, 1),
 (23, 1),
 (24, 2),
 (25, 1),
 (26, 2),
 (27, 1),
 (28, 3),
 (29, 1),
 (30, 4),
 (31, 2),
 (32, 1),
 (33, 1),
 (34, 7),
 (35, 2),
 (36, 2),
 (37, 1),
 (38, 1),
 (39, 1),
 (40, 4),
 (41, 1),
 (42, 3),
 (43, 10),
 (44, 2),
 (45, 1),
 (46, 1),
 (47, 13),
 (48, 1),
 (49, 3),
 (50, 1),
 (51, 1),
 (52, 2),
 (53, 1),
 (54, 6),
 (55, 3),
 (56, 3),
 (57, 1),
 (58, 1),
 (59, 3),
 (60, 4),
 (61, 1),
 (62, 1),
 (63, 6),
 (64, 1),
 (65, 1),
 (66, 1),
 (67, 9),
 (68, 1),
 (69, 1),
 (70, 3),
 (71, 3),
 (72, 3),
 (73, 3),
 (74, 11),
 (75, 1),
 (76, 1),
 (77, 4)]

# Topic modeling

The Bible covers a lot of subjects, so I will set the number of topics to 100. We'll evaluate this later and either increase or decrease it. Then I'm going to run LDA on chapter_corpus.

In [None]:
import gensim
from pprint import pprint
# number of topics
num_topics = 100
# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=chapter_corpus,
                                       id2word=id2word,
                                       num_topics=num_topics)

In [None]:
import pyLDAvis
import pyLDAvis.gensim_models

# Visualize the topics
pyLDAvis.enable_notebook()

LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, chapter_corpus, id2word)

LDAvis_prepared
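
Once a model is fitted, the remaining step from the motivation is to decide which topic each chapter belongs to. gensim's LDA models return a list of (topic_id, probability) pairs for a bag-of-words document, for example via `lda_model.get_document_topics(bow)`. The sketch below assumes that format but uses hypothetical, hand-written distributions rather than real model output; `chapter_topics`, `dominant_topic`, and the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical per-chapter topic distributions as (topic_id, probability) pairs
chapter_topics = {
    ('Genesis', 1): [(0, 0.70), (3, 0.25)],
    ('Psalms', 23): [(2, 0.90)],
    ('Matthew', 5): [(1, 0.55), (2, 0.40)],
    ('Psalms', 1): [(2, 0.60), (0, 0.35)],
}


def dominant_topic(topic_probs):
    """Return the id of the topic with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])[0]


# Assign every chapter to its single most probable topic
assignments = {
    chapter: dominant_topic(probs) for chapter, probs in chapter_topics.items()
}

# Count how many chapters of each book land in each topic
books_per_topic = Counter(
    (topic, book) for (book, _), topic in assignments.items()
)

print(assignments)
print(books_per_topic)
```

A table like `books_per_topic` is one simple way to check whether, say, the Psalms chapters concentrate in a single topic.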