# Motivation

Now that I've run some basic count analyses on the Bible text, I want to extract all of the personal names and geopolitical entities it contains. I am interested in how many times each person or nation/city/people group is mentioned. In a later project, I am also going to use these lists to scrape the web for articles that relate Biblical archeology to a specific person or nation/city/people group. As such, I will save these lists as tables in our SQL database for later use.

NOTE: This notebook outputs a lot of full dataframes that I used to determine next steps. I am going to clear these results when I upload to GitHub because I feel they are too distracting.

# Set up

This is my typical set up. I import the modules I will use, set my project directory, remove column and row limits, and allow Jupyter to display all of the output from each cell.

In [None]:
import os
import pandas as pd
import numpy as np
import sqlite3
import spacy
from datetime import datetime

# Set project folder as directory
os.chdir(r'C:/Users/david/Projects/Bible Analytics')

# Remove row and column limits
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

# Display all output from each cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Accessing data

In [None]:
database = 'Data/SQL database.db'

In [None]:
conn = sqlite3.connect(database)
 
df = pd.read_sql_query('SELECT * FROM t_web', conn)
 
conn.close()

In [None]:
df.info()
df.head()

# Begin

First, I define an nlp object by loading spaCy's large, trained English pipeline. I'm using the large pipeline because I want the model to pick up as many personal names and geopolitical entities as possible. Next, I define two empty dictionaries, one for people and one for geopolitical entities, in which I will store the results of my search. After this, there is a bit of code used for timing the process. This is more for my personal interest and can be ignored. Then I loop through each row of the dataframe using a FOR loop.

The very first thing I apply is a TRY block. Some rows contain an empty cell for clean text, which will throw an error and stop the code. This occurs because some verses are not contained within the earliest manuscripts. In these cases, this translation places the entire verse inside a set of curly brackets. When we cleaned the data in the last project, we removed anything within curly brackets from the text, which left nothing in these cells. However, we still have these verses within the unclean (Bible joke :)) text, so I am going to use an EXCEPT block to capture any information in those verses.

Within the TRY block, I apply our nlp object to the clean text. The nlp object breaks the text into individual tokens, predicts which spans of tokens are entities, and predicts the type of each entity. Using these predictions and a nested FOR loop, I loop through each predicted entity in each line of clean text and look for people or geopolitical entities. When an entity is a person, the code checks whether that person's name is already a key in our people dictionary. If so, the value associated with that key is increased by 1. If not, the name is added to the people dictionary as a key with a value of 1. The exact same process is applied to geopolitical entities.

After this bit of code, I start the EXCEPT block, and within it I initiate another TRY block. The only difference between this block and the TRY block described above is that I apply the nlp object to the unclean text. I end with a nested EXCEPT that prints a message to check out a particular verse if it fails to execute in both TRY blocks.

I end with some timing stuff.
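As a side note, the clean-then-unclean fallback described above can also be sketched without relying on exceptions, by choosing the text before it is handed to the pipeline. This is only a minimal sketch: `pick_text` and the sample rows are hypothetical names and data invented for illustration, and it assumes that an "empty" cell is either `None` or a blank string.

```python
from typing import Optional


def pick_text(clean: Optional[str], raw: Optional[str]) -> Optional[str]:
    """Return the clean text if it is usable, else the raw text, else None."""
    for candidate in (clean, raw):
        # Treat None, non-strings, and whitespace-only strings as missing.
        if isinstance(candidate, str) and candidate.strip():
            return candidate
    return None


# Hypothetical rows standing in for dataframe rows with 'clean_t' and 't'.
rows = [
    {"clean_t": "In the beginning", "t": "In the beginning"},
    {"clean_t": None, "t": "{A verse kept only in braces}"},
    {"clean_t": "", "t": None},
]

chosen = [pick_text(r["clean_t"], r["t"]) for r in rows]
print(chosen)
```

A row for which `pick_text` returns `None` is the case where the notebook prints its "Check out this verse." message.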

In [None]:
nlp = spacy.load("en_core_web_lg")

people = {}
gpe = {}

# Ignore this
start = datetime.now()
# Stop ignoring


for index, row in df.iterrows():
    
    try:
        
        doc = nlp(row['clean_t'])
    
        for ent in doc.ents:
            
            # People        
            if ent.label_ == 'PERSON':
                
                if ent.text in people:
                    people[ent.text]+=1
                    
                else:
                    people[ent.text]=1
                    
            # GPE        
            if ent.label_ == 'GPE':
                
                if ent.text in gpe:
                    gpe[ent.text]+=1
                    
                else:
                    gpe[ent.text]=1
                    
    except Exception:
        
        try:
            
            doc = nlp(row['t'])
            
            for ent in doc.ents:
                
                # People        
                if ent.label_ == 'PERSON':
                    
                    if ent.text in people:
                        people[ent.text]+=1
                        
                    else:
                        people[ent.text]=1
                        
                # GPE        
                if ent.label_ == 'GPE':
                    
                    if ent.text in gpe:
                        gpe[ent.text]+=1
                        
                    else:
                        gpe[ent.text]=1
                    
        except Exception:
            
            print('Check out this verse.')
            print(row['name'], row['c'], ':', row['v'])
            print()
            print()
    
    
# Ignore this
stop = datetime.now()

print('This process took', stop-start)

This process took a little less than four minutes and did not kick back any errors. That's great! 
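The counting logic appears twice in the cell above, once per TRY block. One possible way to factor it out, shown here only as a sketch, is a small helper built on `collections.Counter`, which starts missing keys at zero. The `Ent` tuple is a hypothetical stand-in for a spaCy entity (only `.text` and `.label_` are used), and `tally_entities`, `people_counts`, and `gpe_counts` are names made up for this example.

```python
from collections import Counter, namedtuple

# Stand-in for a spaCy entity span: only .text and .label_ are used here.
Ent = namedtuple("Ent", ["text", "label_"])


def tally_entities(ents, people_counter, gpe_counter):
    """Add PERSON and GPE entity texts to their respective counters."""
    for ent in ents:
        if ent.label_ == "PERSON":
            people_counter[ent.text] += 1
        elif ent.label_ == "GPE":
            gpe_counter[ent.text] += 1


people_counts = Counter()
gpe_counts = Counter()

# Hypothetical entity predictions, not real model output.
sample = [
    Ent("Aaron", "PERSON"),
    Ent("Egypt", "GPE"),
    Ent("Aaron", "PERSON"),
    Ent("forty", "CARDINAL"),
]

tally_entities(sample, people_counts, gpe_counts)
print(people_counts)
print(gpe_counts)
```

With a real pipeline, the same helper could be called as `tally_entities(doc.ents, people_counts, gpe_counts)` in both branches.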

# Creating dataframes

Before I even look at these results, I know that I want to store them as a SQL table and that dictionaries are not the easiest objects to navigate. As such, I am going to convert both dictionaries into Pandas dataframes.

In [None]:
people_df = pd.DataFrame(zip(people.keys(), people.values()), columns = ['people', 'mentioned'])
people_df.sort_values(['people'], ascending=True, inplace=True)
people_df.reset_index(drop=True, inplace=True)

In [None]:
gpe_df = pd.DataFrame(zip(gpe.keys(), gpe.values()), columns = ['gpe', 'mentioned'])
gpe_df.sort_values(['gpe'], ascending=True, inplace=True)
gpe_df.reset_index(drop=True, inplace=True)

# Results

First, let's examine the data to see how good of a job our code did.

In [None]:
people_df

# Counting God, Jesus, and Moses

Right away, I noticed that the name "Jesus" was not captured as a person entity. This pipeline was trained on modern English and not Biblical text. My guess is that "Jesus" in modern English is more likely to be an expletive than a reference to the actual, historical person. The same is true of "God." To handle this, I will loop through the text again and specifically search for "Jesus" and "God." There's got to be a preacher joke in there somewhere! Maybe something about "and ye shall find." Oddly, Moses also only shows up as part of the phrase "Moses Zipporah." I can't come up with an explanation for why this model does not pick up Moses as a person entity.

I'm going to loop through the text one time and count the number of times each of these people is mentioned. The first thing I do is define an nlp object by loading the large pipeline. I then define three count variables set to 0. Then I add some code for timing this process. Next, I use a FOR loop to iterate through the Bible dataframe. For each row of data, I begin with a TRY block. Within that block, I create doc by applying the nlp object to the clean text in that row. I then use a nested FOR loop to iterate through each word (token) in the text. Within this nested FOR loop, I use the conditional statements IF and ELIF to determine whether each token is "God", "Jesus", or "Moses". I picked this order because my hypothesis is that God will show up most, then Jesus, and finally Moses. I didn't have to use ELIF here. Plain IF statements would give the same counts, because a token can match only one of the three names. ELIF just skips the remaining comparisons once one matches, which saves only a fraction of CPU in this situation. In other situations, ELIF can save quite a bit of CPU, and here it doesn't hurt.

In [None]:
nlp = spacy.load("en_core_web_lg")

jesus = 0
god = 0
moses = 0

# Ignore this
start = datetime.now()
# Stop ignoring


for index, row in df.iterrows():
    
    try:
        
        doc = nlp(row['clean_t'])
    
        for token in doc:
            
            if token.text == 'God':
                god+=1
            elif token.text == 'Jesus':
                jesus+=1
            elif token.text == 'Moses':
                moses+=1
    except Exception:
        
        try:
            
            doc = nlp(row['t'])
            
            for token in doc:
                
                if token.text == 'Jesus':
                    jesus+=1
                elif token.text == 'God':
                    god+=1
                elif token.text == 'Moses':
                    moses+=1
            
                    
        except Exception:
            
            print('Check out this verse.')
            print(row['name'], row['c'], ':', row['v'])
            print()
            print()
    
# Ignore this
stop = datetime.now()

print('This process took', stop-start)    
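Because this pass only compares token text against three fixed names, the same kind of tally can be sketched with a set lookup instead of an IF/ELIF chain. This is a rough sketch under stated assumptions: it splits words with a simple regular expression rather than the spaCy tokenizer, so its counts could differ slightly from the cell above, and `count_targets`, `TARGETS`, and the sample sentences are all invented for illustration.

```python
import re
from collections import Counter

# The three names searched for in the cell above.
TARGETS = {"God", "Jesus", "Moses"}


def count_targets(texts, targets=TARGETS):
    """Count exact, case-sensitive matches of each target word."""
    counts = Counter({name: 0 for name in targets})
    for text in texts:
        if not isinstance(text, str):
            continue  # skip missing cells instead of raising
        # Assumption: runs of letters approximate the tokenizer's words.
        for word in re.findall(r"[A-Za-z]+", text):
            if word in targets:
                counts[word] += 1
    return counts


# Hypothetical sentences, not verses from the dataframe.
sample_texts = [
    "God spoke to Moses.",
    None,
    "Moses answered God, and God listened.",
]

name_counts = count_targets(sample_texts)
print(name_counts["God"], name_counts["Jesus"], name_counts["Moses"])
```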

In [None]:
god
jesus
moses

As expected, God was mentioned most in the Bible, followed by Jesus and then Moses. I will add these to our people dictionary next.

In [None]:
update = {'God': god, 'Jesus': jesus, 'Moses': moses}
people.update(update)

In [None]:
people['God']

# Cleaning people dictionary
I also need to clean the people list I already have. For instance, "Aaron" is the first entry and the second is "Aaron eighty-three years old," which is clearly a phrase about the same guy. I want to get rid of the second entry, but I want to add its count to the first.

Many of the incorrect entries are phrases that contain names. To extract the names from these phrases, I will search for any words in these phrases that begin with an uppercase letter. This is not a perfect solution because I will also capture other proper nouns such as location names. Additionally, any first word in a sentence will be captured. However, I'm willing to accept this level of noise for now.

Below is my code. I have already initialized an nlp object earlier in this notebook, so I don't need to do that again. I use a FOR loop to iterate through the people dataframe and filter to entries that are longer than one word. I then print that entry for later review. Next, I define doc by applying our nlp object to each entry. 

After this, I create my first nested FOR loop to iterate through each token in each entry. Within this nested FOR loop, I create a variable called upper and set its initial value to False. I then create another nested FOR loop to iterate through the letters of that word (ele) and use an IF statement to determine whether or not each letter is uppercase. If any letter within a word is uppercase, the value of upper is changed to True. I then use a break statement to end the search because it doesn't matter if we find multiple uppercase letters in a word. Upper has already been set to True.

Next, we modify the people dictionary with the new information. When this code pulls an uppercase word from any of these rows, it checks the people dictionary to determine whether this uppercase word is already a key. If so, the value for that key is increased by one. If not, a new key is added to the people dictionary with a value of one.

Last, I am adding a line of code to remove each phrase from the people dictionary after its content has been processed.

In [None]:
for index, row in people_df.iterrows():
    
    if len(row['people'].split()) > 1:
        
        print(row['people'])
        
        doc = nlp(row['people'])
        
        for token in doc:
            
            upper = False
            
            for ele in token.text:
                
                if ele.isupper():
                    
                    upper = True
                    
                    break
            
            if upper:
                
                print(token.text)
                
                if token.text in people:
                    people[token.text]+=1
                    
                else:
                    people[token.text]=1
                    
        # Remove the phrase entry from the people dictionary
        people.pop(row['people'])
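The cell above adds one to a name for each phrase that contains it. A variant that instead carries over each phrase's stored count, which is closer to the goal stated at the top of this section, might look like the sketch below. It is an illustration rather than the method used in this notebook: it splits phrases on whitespace instead of using the nlp object, and `merge_phrases`, `has_upper`, and the counts in `demo` are all hypothetical.

```python
import string


def has_upper(word):
    """Return True if any character in the word is uppercase."""
    return any(ch.isupper() for ch in word)


def merge_phrases(counts):
    """Fold multi-word keys into their capitalized words, keeping phrase counts."""
    # Start from the single-word entries, which are kept unchanged.
    merged = {k: v for k, v in counts.items() if len(k.split()) == 1}
    for phrase, n in counts.items():
        words = phrase.split()
        if len(words) > 1:
            for word in words:
                word = word.strip(string.punctuation)
                if has_upper(word):
                    # Add the phrase's full count, not just 1.
                    merged[word] = merged.get(word, 0) + n
    return merged


# Hypothetical counts for illustration only.
demo = {"Aaron": 10, "Aaron eighty-three years old": 2, "Moses Zipporah": 3}
merged = merge_phrases(demo)
print(merged)
```

Returning a new dictionary also avoids modifying the dictionary while its contents are being processed.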

In [None]:
people_df2 = pd.DataFrame(zip(people.keys(), people.values()), columns = ['people', 'mentioned'])
people_df2.sort_values(['people'], ascending=True, inplace=True)
people_df2.reset_index(drop=True, inplace=True)

In [None]:
people_df2

There are now three more counts added to "Aaron" and the phrases containing Aaron's name are gone. I'm pretty happy with this. However, I cannot stress enough that this is not a perfect list. There are certainly entries and counts in this list that came from the names of locations, and there are also common words that have been included.

That said, I think this code did a really great job capturing the people in the Bible, and I'm excited to analyze this list. One last step: I'm going to remove lowercase words and any one-letter words.

In [None]:
people_df2 = people_df2[people_df2['people']!=people_df2['people'].str.lower()]

people_df2 = people_df2[people_df2['people'].str.len()>1]

# Let the analysis begin!!

Let's start with rankings. Who are the most mentioned people in the Bible?

In [None]:
people_df2.sort_values(['mentioned'], ascending=False)

In [None]:
people_df2.sort_values(['people'], ascending=True)

No surprise, God in all his forms is the most mentioned person in the Bible. We have "God" in first place with just over four thousand mentions, "Yahweh" with a little over two thousand mentions, and "Jesus" with just under one thousand mentions. This does not count all of the times that God is mentioned in the New Testament as the "Father", but I think we still get a clear picture. This book is about God!

Next, we have Moses. I'm still surprised and confused about why Moses did not show up as a PERSON entity in our search. The model almost certainly would have predicted "Moses" as a name based on the way it is used in the sentence structure, so there must be a decision rule that explicitly tells the model not to count Moses as an entity. I can't think of why this would be the case. This name is still used in common English as a given name, and I can't think of any other way it is used. Perhaps one day we'll figure this out.

The next most commonly mentioned person in the Bible is David, whom I was named after. Then comes Moses' brother, Aaron.

I was actually surprised that Joseph came in seventh with 240 mentions just ahead of his great-grandfather, Abraham, who only had 222 mentions. I may confirm this with another counts analysis. However, Abraham also gets 63 mentions as Abram, so technically Abraham comes in just ahead of Joseph with 285 combined mentions. 
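The combined figure is just the sum of the two spellings. As a small sketch of how aliases could be folded together, the counts below are the ones quoted in the paragraph above, while `merge_aliases`, `reported`, and the `aliases` mapping are hypothetical names made up for this example.

```python
def merge_aliases(counts, aliases):
    """Sum counts for names that map to the same canonical name."""
    merged = {}
    for name, n in counts.items():
        canonical = aliases.get(name, name)
        merged[canonical] = merged.get(canonical, 0) + n
    return merged


# Counts quoted in the text above.
reported = {"Joseph": 240, "Abraham": 222, "Abram": 63}

# Hypothetical alias table: alternate spelling -> canonical name.
aliases = {"Abram": "Abraham"}

combined = merge_aliases(reported, aliases)
print(combined["Abraham"])  # 285
print(combined["Joseph"])   # 240
```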

Joshua, Moses' apprentice, comes in ninth, and Joseph's brother, Benjamin, is tenth. This made me wonder why we didn't see Israel's name in the top ten. Both Israel and Jacob show up but with low counts. I don't know why this would be the case. 

# Geopolitical Entities

In [None]:
gpe_df

This list looks better to me. Of course there is still some noise, but not nearly as much as I saw in the people list. I'm going to keep this list as is and look at the ten most mentioned GPEs in the Bible.

In [None]:
gpe_df.sort_values(['mentioned'], ascending=False)

Yep, looks like a pretty solid list. The top ten are exactly what I would expect.

# Pushing Pandas dataframes to SQL tables

As mentioned earlier in this post, I would like to use the names of people and geopolitical entities in a web crawler to extract articles about Biblical artifacts directly related to specific entities within the text of the Bible. Ultimately, I'd like to compile a dynamic list of these articles and tie them directly to the relevant Biblical text.

As such, I am going to save these lists as SQL tables that I can access later.

In [None]:
conn = sqlite3.connect(database)

people_df2.to_sql('people_names', conn, if_exists='replace', index=False)

gpe_df.to_sql('gpe_name', conn, if_exists='replace', index=False)

conn.close()

In [None]:
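# List every table in the database to confirm the new tables were written.
# SQL string literals use single quotes, so the Python string uses double quotes.

conn = sqlite3.connect(database)
cursor = conn.cursor()

cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")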
 
for i in cursor.fetchall():
    print(i[0])
    
conn.close()
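The write-then-verify pattern above can be tried safely against an in-memory database. The sketch below makes a few assumptions: it uses `contextlib.closing` so the connection is closed even if a statement fails, it creates the tables by hand rather than through `to_sql`, and `list_tables` is a hypothetical helper. The table and column names mirror the ones used in this notebook, and the single inserted row reuses the Abram count quoted earlier.

```python
import sqlite3
from contextlib import closing


def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [r[0] for r in rows]


# ':memory:' creates a throwaway database, so nothing on disk is touched.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE people_names (people TEXT, mentioned INTEGER)")
    demo_conn.execute("CREATE TABLE gpe_name (gpe TEXT, mentioned INTEGER)")
    demo_conn.execute("INSERT INTO people_names VALUES (?, ?)", ("Abram", 63))
    demo_conn.commit()
    tables = list_tables(demo_conn)

print(tables)
```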