# Motivation

Now that I've run some basic count analyses on the Bible text, I want to extract all of the personal names and geopolitical entities it contains. I am interested in how many times each person or nation/city/people group is mentioned. In a later project, I am also going to use these lists to scrape the web for articles related to Biblical archeology and a specific person or nation/city/people group. As such, I will save these lists as tables in our SQL database for later use.

# Set up

This is my typical set up. I import the modules I will use, set my project directory, remove column and row limits, and allow Jupyter to display all of the output from each cell.

In [1]:
import os
import pandas as pd
import numpy as np
import sqlite3
import spacy
from datetime import datetime

# Set project folder as directory
os.chdir(r'C:/Users/david/Projects/Bible Analytics')

# Remove row and column limits
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

# Display all output from each cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Accessing data

In [2]:
database = 'Data/SQL database.db'

In [3]:
conn = sqlite3.connect(database)
 
df = pd.read_sql_query('SELECT * FROM t_web', conn)
 
conn.close()

In [4]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31102 entries, 0 to 31101
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   name     31102 non-null  object
 1   old_new  31102 non-null  object
 2   group    31102 non-null  int64 
 3   id       31102 non-null  int64 
 4   b        31102 non-null  int64 
 5   c        31102 non-null  int64 
 6   v        31102 non-null  int64 
 7   t        31102 non-null  object
 8   clean_t  31102 non-null  object
dtypes: int64(5), object(4)
memory usage: 2.1+ MB


Unnamed: 0,name,old_new,group,id,b,c,v,t,clean_t
0,Genesis,OT,1,1001001,1,1,1,"In the beginning God{After ""God,"" the Hebrew has the two letters ""Aleph Tav"" (the first and last letters of the Hebrew alphabet) as a grammatical marker.} created the heavens and the earth.",In the beginning God created the heavens and the earth.
1,Genesis,OT,1,1001002,1,1,2,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.
2,Genesis,OT,1,1001003,1,1,3,"God said, ""Let there be light,"" and there was light.","God said, ""Let there be light,"" and there was light."
3,Genesis,OT,1,1001004,1,1,4,"God saw the light, and saw that it was good. God divided the light from the darkness.","God saw the light, and saw that it was good. God divided the light from the darkness."
4,Genesis,OT,1,1001005,1,1,5,"God called the light Day, and the darkness he called Night. There was evening and there was morning, one day.","God called the light Day, and the darkness he called Night. There was evening and there was morning, one day."


# Begin

First, I define an nlp object by loading spaCy's large, trained English pipeline. I'm using the large pipeline because I want the model to pick up as many personal names and geopolitical entities as possible. Next, I define two empty dictionaries, one for people and one for geopolitical entities, in which I will store the results of my search. After this, there is a bit of code for timing the process. This is more for my personal interest and can be ignored. Then I loop through each row of the dataframe using a FOR loop.

The very first thing I apply inside the loop is a TRY block. Some rows of data contain an empty cell for clean text, which can throw an error and stop the code. This occurs because some of the verses were not contained within the earliest manuscripts. In these cases, this translation places the entire verse inside a set of curly brackets. When we cleaned the data in the last project, we removed anything within curly brackets from the text, which left nothing in these cells. However, we still have these verses within the unclean (Bible joke :)) text, so I am going to use an EXCEPT block to capture any information in those verses.
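
The same fallback can also be written as an explicit check rather than relying on an exception. This is only a minimal sketch, assuming an empty or whitespace-only `clean_t` is the only failure case; `pick_text` is a hypothetical helper and the second sample row is made up for illustration:

```python
def pick_text(clean_text, raw_text):
    """Return the cleaned verse, or the raw verse when the cleaned one is empty."""
    # Hypothetical helper: None, '' and whitespace-only strings count as empty.
    if clean_text is not None and clean_text.strip():
        return clean_text
    return raw_text


# Illustrative rows: the second mimics a verse kept entirely inside brackets.
rows = [
    {'clean_t': 'In the beginning God created the heavens and the earth.',
     't': 'In the beginning God created the heavens and the earth.'},
    {'clean_t': '', 't': '{An example verse kept entirely inside brackets.}'},
]

# Each text would then be passed to nlp() in the notebook.
texts = [pick_text(row['clean_t'], row['t']) for row in rows]
```

With a check like this, a remaining TRY block would only need to guard against genuinely unexpected errors.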

Within the TRY block, I apply our nlp object to the clean text. This nlp object breaks the text into individual tokens, predicts which tokens are entities, and predicts the type of each entity. Using these predictions and a nested FOR loop, I loop through each predicted entity in each line of clean text and look for people or geopolitical entities. When an entity is a person, the code checks whether that person's name is already a key in our people dictionary. If so, the value associated with that key is increased by 1. If not, the name is added to the people dictionary as a key with a value of 1. The exact same process is applied to geopolitical entities.

After this bit of code, I start the EXCEPT block, and within it I initiate another TRY block. The only difference between this block and the TRY block described above is that I apply the nlp object to the unclean text. I end with a nested EXCEPT that prints a message to check out a particular verse if it fails to execute in either TRY block.

I end with some timing stuff.

In [5]:
nlp = spacy.load("en_core_web_lg")

people = {}
gpe = {}

# Ignore this
start = datetime.now()
# Stop ignoring


for index, row in df.iterrows():
    
    try:
        
        doc = nlp(row['clean_t'])
    
        for ent in doc.ents:
            
            # People        
            if ent.label_ == 'PERSON':
                
                if ent.text in people:
                    people[ent.text]+=1
                    
                else:
                    people[ent.text]=1
                    
            # GPE        
            if ent.label_ == 'GPE':
                
                if ent.text in gpe:
                    gpe[ent.text]+=1
                    
                else:
                    gpe[ent.text]=1
                    
    except Exception:
        
        try:
            
            doc = nlp(row['t'])
            
            for ent in doc.ents:
                
                # People        
                if ent.label_ == 'PERSON':
                    
                    if ent.text in people:
                        people[ent.text]+=1
                        
                    else:
                        people[ent.text]=1
                        
                # GPE        
                if ent.label_ == 'GPE':
                    
                    if ent.text in gpe:
                        gpe[ent.text]+=1
                        
                    else:
                        gpe[ent.text]=1
                    
        except Exception:
            
            print('Check out this verse.')
            print(row['name'], row['c'], ':', row['v'])
            print()
            print()
    
    
# Ignore this
stop = datetime.now()

print('This process took', stop-start)

This process took 0:04:12.253051


This process took a little over four minutes and did not kick back any errors. That's great!
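
As an aside, the check-then-increment pattern used in the loop above can be condensed with `collections.Counter`. The following is only a sketch: `count_entities` is a hypothetical helper, and the sample list is made up to stand in for `[(ent.text, ent.label_) for ent in doc.ents]` so that the example runs without loading a spaCy pipeline.

```python
from collections import Counter


def count_entities(entities, counters):
    """Tally (text, label) pairs into the Counter that matches each label."""
    for text, label in entities:
        if label in counters:
            counters[label][text] += 1  # missing keys start at 0
    return counters


# One Counter per entity label of interest.
counters = {'PERSON': Counter(), 'GPE': Counter()}

# Made-up stand-in for the entities spaCy would predict for one verse.
sample = [('Aaron', 'PERSON'), ('Egypt', 'GPE'),
          ('Aaron', 'PERSON'), ('forty days', 'DATE')]
count_entities(sample, counters)

people_counts = dict(counters['PERSON'])
gpe_counts = dict(counters['GPE'])
```

Because the helper takes plain pairs, the same function could serve both the clean-text pass and the raw-text fallback, which would remove the duplicated block.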

# Creating dataframes

Before I even look at these results, I know that I want to store them as a SQL table and that dictionaries are not the easiest objects to navigate. As such, I am going to convert both dictionaries into Pandas dataframes.

In [6]:
people_df = pd.DataFrame(zip(people.keys(), people.values()), columns = ['people', 'mentioned'])
people_df.sort_values(['people'], ascending=True, inplace=True)
people_df.reset_index(drop=True, inplace=True)

In [7]:
gpe_df = pd.DataFrame(zip(gpe.keys(), gpe.values()), columns = ['gpe', 'mentioned'])
gpe_df.sort_values(['gpe'], ascending=True, inplace=True)
gpe_df.reset_index(drop=True, inplace=True)
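
For reference, the same dictionary-to-dataframe conversion can be written as one chained expression using `dict.items()`. This is a sketch on a three-entry sample taken from the counts in this notebook; `counts` and `counts_df` are illustrative names.

```python
import pandas as pd

# Small sample standing in for the full `people` dictionary.
counts = {'Abel': 7, 'Aaron': 347, 'Abdi': 2}

counts_df = (
    pd.DataFrame(list(counts.items()), columns=['people', 'mentioned'])
    .sort_values('people')       # alphabetical by name
    .reset_index(drop=True)      # renumber rows after sorting
)
```

Chaining avoids `inplace=True` and leaves no half-sorted intermediate dataframe behind.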

# Results

First, let's examine the data to see how good of a job our code did.

In [8]:
people_df

| | people | mentioned |
|---|---|---|
| 0 | Aaron | 347 |
| 1 | Aaron eighty-three years old | 1 |
| 2 | Abanah | 1 |
| 3 | Abba | 1 |
| 4 | Abdi | 2 |
| 5 | Abdon | 7 |
| 6 | Abednego | 1 |
| 7 | Abel | 7 |
| 8 | Abel Beth Maacah | 2 |
| 9 | Abel Maim | 1 |


# Counting God, Jesus, and Moses

Right away, I noticed that the name "Jesus" was not captured as a person entity. This pipeline was trained on modern English and not Biblical text. My guess is that "Jesus" in modern English is more likely to be an expletive than a reference to the actual, historical person. The same is true of "God." To handle this, I will loop through the text again and search specifically for "Jesus" and "God." There's got to be a preacher joke in there somewhere! Maybe something about "and ye shall find." Oddly, Moses also shows up only within the phrase "Moses Zipporah." I can't come up with an explanation for why this model doesn't pick up Moses as a person entity.

I'm going to loop through the text one time and count the number of times each of these people is mentioned. The first thing I do is define an nlp object by loading the large pipeline. I then define three count variables set to 0 and add some code for timing the process. Next, I use a FOR loop to iterate through the Bible dataframe. For each row of data, I begin with a TRY block. Within that block, I create doc by applying the nlp object to the clean text in that row. I then use a nested FOR loop to iterate through each word (token) in the text. Within this nested FOR loop, I use the conditional statements IF and ELIF to determine whether each token is "God", "Jesus", or "Moses". I picked this order because my hypothesis is that God will show up most, then Jesus, and finally Moses. I didn't have to use ELIF here; plain IF statements would have worked just as well, because ELIF only saves a fraction of CPU time by skipping the remaining comparisons once one matches. In other situations, ELIF can save quite a bit of CPU, and in this situation it doesn't hurt.

In [9]:
nlp = spacy.load("en_core_web_lg")

jesus = 0
god = 0
moses = 0

# Ignore this
start = datetime.now()
# Stop ignoring


for index, row in df.iterrows():
    
    try:
        
        doc = nlp(row['clean_t'])
    
        for token in doc:
            
            if token.text == 'God':
                god+=1
            elif token.text == 'Jesus':
                jesus+=1
            elif token.text == 'Moses':
                moses+=1
    except Exception:
        
        try:
            
            doc = nlp(row['t'])
            
            for token in doc:
                
                if token.text == 'Jesus':
                    jesus+=1
                elif token.text == 'God':
                    god+=1
                elif token.text == 'Moses':
                    moses+=1
            
                    
        except Exception:
            
            print('Check out this verse.')
            print(row['name'], row['c'], ':', row['v'])
            print()
            print()
    
# Ignore this
stop = datetime.now()

print('This process took', stop-start)    

This process took 0:04:10.679174


In [10]:
god
jesus
moses

4032

977

848

As expected, God was mentioned most in the Bible, followed by Jesus and then Moses. I will add these to our people dictionary next.

In [11]:
update = {'God': god, 'Jesus': jesus, 'Moses': moses}
people.update(update)

In [12]:
people['God']

4032
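
As a side note on the token-counting step above, exact word matches do not need the entity predictions at all, so a set lookup plus a `Counter` is enough. The sketch below uses a simple regular expression as a stand-in for spaCy's tokenizer (an assumption made so the example runs without a pipeline; the two tokenizers will not agree on every edge case), and `count_targets` is a hypothetical helper.

```python
import re
from collections import Counter

TARGETS = {'God', 'Jesus', 'Moses'}


def count_targets(verses, targets=TARGETS):
    """Count exact, case-sensitive matches of the target words."""
    totals = Counter()
    for verse in verses:
        # Assumption: runs of letters approximate spaCy's word tokens.
        for token in re.findall(r'[A-Za-z]+', verse):
            if token in targets:
                totals[token] += 1
    return totals


# Two verses from the sample rows shown earlier in this notebook.
verses = [
    'In the beginning God created the heavens and the earth.',
    'God said, "Let there be light," and there was light.',
]
totals = count_targets(verses)
```

A set lookup also removes the need to order the IF and ELIF branches by expected frequency.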

# Cleaning people dictionary
I also need to clean the people list I already have. For instance, "Aaron" is the first entry and the second is "Aaron eighty-three years old," which is clearly a phrase about the same guy. I want to get rid of the second entry, but I want to add its count to the first.

Many of the incorrect entries are phrases that contain names. To extract the names from these phrases, I will search for any words in them that begin with an uppercase letter. This is not a perfect solution because I will also capture other proper nouns such as location names. Additionally, any first word in a sentence will be captured. However, I'm willing to accept this level of noise for now.

Below is my code. I have already initialized an nlp object earlier in this notebook, so I don't need to do that again. I use a FOR loop to iterate through the people dataframe and filter to entries that are longer than one word. I then print that entry for later review. Next, I define doc by applying our nlp object to each entry. 

After this, I create my first nested FOR loop to iterate through each token in each entry. Within this nested FOR loop, I create a variable called upper and set its initial value to False. I then create another nested FOR loop to iterate through the letters of that word (ele) and use an IF statement to determine whether or not each letter is uppercase. If any letter within a word is uppercase, the value of upper is changed to True. I then use a break statement to end the search, because it doesn't matter if we find multiple uppercase letters in a word once upper has been set to True.

Next, we will modify the people dictionary with the new information. When this code pulls an uppercase word from any of these rows, it checks the people dictionary to determine whether this word is already a key. If so, the value for that key is increased by one. If not, a new key is added to the people dictionary with a value of one.

Last, I am adding a line of code to remove each phrase from the people dictionary after its content has been processed.

In [13]:
for index, row in people_df.iterrows():
    
    if len(row['people'].split()) > 1:
        
        print(row['people'])
        
        doc = nlp(row['people'])
        
        for token in doc:
            
            upper = False
            
            for ele in token.text:
                
                if ele.isupper():
                    
                    upper = True
                    
                    break
            
            if upper:
                
                print(token.text)
                
                if token.text in people:
                    people[token.text]+=1
                    
                else:
                    people[token.text]=1
                    
        # Remove entry for people dictionary                    
        people.pop(row['people'])

Aaron eighty-three years old
Aaron


1

Abel Beth Maacah
Abel
Beth
Maacah


2

Abel Maim
Abel
Maim


1

Abel Meholah
Abel
Meholah


2

Abel Mizraim
Abel
Mizraim


1

Abel Shittim
Abel
Shittim


1

Abraham buried
Abraham


1

Agrippa the King
Agrippa
King


1

Ahab king
Ahab


11

Allon Bacuth
Allon
Bacuth


1

Almon Diblathaim
Almon
Diblathaim


1

Amram Aaron
Amram
Aaron


1

Amraphel king
Amraphel


1

Artaxerxes king
Artaxerxes


3

Asher Beth-shean
Asher
Beth


1

Ataroth Addar
Ataroth
Addar


1

Atroth Beth Joab
Atroth
Beth
Joab


1

Aznoth Tabor
Aznoth
Tabor


1

Baal Berith
Baal
Berith


2

Baal Gad
Baal
Gad


3

Baal Hanan
Baal
Hanan


5

Baal Peor
Baal
Peor


6

Baal Perazim
Baal
Perazim


4

Baal Zebub
Baal
Zebub


4

Baal Zephon
Baal
Zephon


3

Baale Judah
Baale
Judah


1

Balak king
Balak


1

Balak the
Balak


1

Bamoth Baal
Bamoth
Baal


1

Basemath bore Reuel
Basemath
Reuel


1

Beer Lahai Roi
Beer
Lahai
Roi


1

Behold, David
Behold
David


2

Behold, Elijah [
Behold
Elijah


3

Ben Abinadab
Ben
Abinadab


1

Ben Ammi
Ben
Ammi


1

Ben Deker
Ben
Deker


1

Ben Geber
Ben
Geber


1

Ben Hadad
Ben
Hadad


20

Ben Hail
Ben
Hail


1

Ben Hanan
Ben
Hanan


1

Ben Hesed
Ben
Hesed


1

Ben Hur
Ben
Hur


1

Ben Zoheth
Ben
Zoheth


1

Bene Jaakan
Bene
Jaakan


1

Benjamin: Bela
Benjamin
Bela


1

Beth Anath
Beth
Anath


3

Beth Anoth
Beth
Anoth


1

Beth Arabah
Beth
Arabah


3

Beth Arbel
Beth
Arbel


1

Beth Aven
Beth
Aven


7

Beth Azmaveth
Beth
Azmaveth


1

Beth Baal Meon
Beth
Baal
Meon


1

Beth Barah
Beth
Barah


2

Beth Biri
Beth
Biri


1

Beth Dagon
Beth
Dagon


2

Beth Diblathaim
Beth
Diblathaim


1

Beth El
Beth
El


2

Beth El:
Beth
El


1

Beth Emek
Beth
Emek


1

Beth Ezel
Beth
Ezel


1

Beth Gader
Beth
Gader


1

Beth Gamul
Beth
Gamul


1

Beth Gilgal
Beth
Gilgal


1

Beth Haccherem
Beth
Haccherem


2

Beth Haram
Beth
Haram


1

Beth Haran
Beth
Haran


1

Beth Hoglah
Beth
Hoglah


3

Beth Horon
Beth
Horon


14

Beth Kar
Beth
Kar


1

Beth Lebaoth
Beth
Lebaoth


1

Beth Maacah
Beth
Maacah


2

Beth Marcaboth
Beth
Marcaboth


2

Beth Meon
Beth
Meon


1

Beth Merhak
Beth
Merhak


1

Beth Nimrah
Beth
Nimrah


2

Beth Ophrah
Beth
Ophrah


1

Beth Pazzez
Beth
Pazzez


1

Beth Pelet
Beth
Pelet


2

Beth Peor
Beth
Peor


4

Beth Rapha
Beth
Rapha


1

Beth Rehob
Beth
Rehob


2

Beth Tappuah
Beth
Tappuah


1

Beth Zur
Beth
Zur


4

Bildad the Shuhite
Bildad
Shuhite


3

Caesarea Philippi
Caesarea
Philippi


2

Caleb Ephrathah
Caleb
Ephrathah


1

Chephar Ammoni
Chephar
Ammoni


1

Chisloth Tabor
Chisloth
Tabor


1

Concerning Shemaiah the Nehelamite
Concerning
Shemaiah
Nehelamite


1

Dan Jaan
Dan
Jaan


1

Daniel great
Daniel


1

Daniel, Ginnethon
Daniel
Ginnethon


1

David king
David


2

David swore
David


1

David to Jonathan
David
Jonathan


1

Eglath Shelishiyah
Eglath
Shelishiyah


2

Eleazar Aaron's
Eleazar
Aaron


1

Eliphaz Amalek
Eliphaz
Amalek


1

Elon Beth Hanan
Elon
Beth
Hanan


1

Emek Keziz
Emek
Keziz


1

En Gannim
En
Gannim


1

En Gedi
En
Gedi


1

En Haddah
En
Haddah


1

En Hazor
En
Hazor


1

En Tappuah
En
Tappuah


1

Esar Haddon
Esar
Haddon


2

Esau Eliphaz
Esau
Eliphaz


1

Esau Jeush
Esau
Jeush


1

Gath Hepher
Gath
Hepher


2

Geba to Beersheba
Geba
Beersheba


1

Geba to Rimmon
Geba
Rimmon


1

God Yahweh
God
Yahweh


1

Greet Amplias
Greet
Amplias


1

Greet Apelles
Greet
Apelles


1

Greet Epaenetus
Greet
Epaenetus


1

Greet Mary
Greet
Mary


1

Greet Persis
Greet
Persis


1

Greet Prisca
Greet
Prisca


2

Greet Rufus
Greet
Rufus


1

Greet Tryphaena
Greet
Tryphaena


1

Greet Urbanus
Greet
Urbanus


1

Had Moses
Had
Moses


1

Hagar bore
Hagar


1

Hagar bore Ishmael
Hagar
Ishmael


1

Hagar the Egyptian
Hagar
Egyptian


2

Hamon Gog
Hamon
Gog


2

Hash Baz
Hash
Baz


2

Havvoth Jair
Havvoth
Jair


3

Hazar Addar
Hazar
Addar


1

Hazar Enan
Hazar
Enan


1

Hazar Enon
Hazar
Enon


1

Hazar Gaddah
Hazar
Gaddah


1

Hazar Shual
Hazar
Shual


2

Hazar Susah
Hazar
Susah


1

Hazazon Tamar
Hazazon
Tamar


2

Hazer Hatticon
Hazer
Hatticon


1

Hazor Hadattah
Hazor
Hadattah


1

Hezekiah king
Hezekiah


1

Ho Ariel
Ho
Ariel


1

Ho Assyrian
Ho
Assyrian


1

Hor Haggidgad
Hor
Haggidgad


1

Isaac loved Esau
Isaac
Esau


1

Israel Tola
Israel
Tola


1

Jabesh Gilead
Jabesh
Gilead


11

Jabin king
Jabin


2

Jacob blessed Pharaoh
Jacob
Pharaoh


1

Jacob kissed Rachel
Jacob
Rachel


1

Jacob loved Rachel
Jacob
Rachel


1

Jael Heber's
Jael
Heber


1

James the son
James


1

James the son of Alphaeus
James
Alphaeus


1

James the son of Zebedee
James
Zebedee


1

James, Joses
James
Joses


2

James; John
James
John


1

Jegar Sahadutha
Jegar
Sahadutha


1

Jehoiachin king
Jehoiachin


1

Joash king
Joash


3

Joash king of Israel
Joash
Israel


1

Joel, Shemaiah
Joel
Shemaiah


1

John the Baptizer
John
Baptizer


7

Jonathan brought David to Saul
Jonathan
David
Saul


1

Jonathan called David
Jonathan
David


1

Jonathan to David
Jonathan
David


1

Jonathan, David's
Jonathan
David


1

Jonathan, Saul's
Jonathan
Saul


2

Joram king
Joram


1

Joseph commanded
Joseph


1

Joseph of Arimathaea
Joseph
Arimathaea


2

Joseph prevailed
Joseph


1

Joseph twelve thousand
Joseph


1

Joseph, Manasseh and Ephraim
Joseph
Manasseh
Ephraim


1

Judah Jeroboam
Judah
Jeroboam


1

Judah Rezin
Judah
Rezin


1

Judas Iscariot
Judas
Iscariot


6

Kemuel the
Kemuel


1

Keren Happuch
Keren
Happuch


1

Keturah, Abraham's
Keturah
Abraham


1

King Agrippa
King
Agrippa


3

King Darius
King
Darius


1

Kir Hareseth
Kir
Hareseth


2

Kir Heres
Kir
Heres


1

Kiriath Arba
Kiriath
Arba


7

Kiriath Arim
Kiriath
Arim


1

Kiriath Baal
Kiriath
Baal


1

Kiriath Huzoth
Kiriath
Huzoth


1

Kiriath Sannah
Kiriath
Sannah


1

Kiriath Sepher
Kiriath
Sepher


4

Let Abishag the Shunammite
Let
Abishag
Shunammite


1

Levi Moses
Levi
Moses


1

Lo Debar
Lo
Debar


2

Maareh Geba
Maareh
Geba


1

Maher Shalal
Maher
Shalal


2

Mary Magdalene
Mary
Magdalene


11

Merib Baal
Merib
Baal


4

Meribath Kadesh
Meribath
Kadesh


1

Meriboth Kadesh
Meriboth
Kadesh


1

Merodach Baladan
Merodach
Baladan


1

Misrephoth Maim
Misrephoth
Maim


2

Mordecai Esther's
Mordecai
Esther


1

Mordecai the Jew
Mordecai
Jew


3

Moses Zipporah
Moses
Zipporah


1

Mount Seir
Mount
Seir


2

My shield
My


1

Nergal Sharezer
Nergal
Sharezer


2

Nun Joshua
Nun
Joshua


1

O King
O
King


1

Paddan Aram
Paddan
Aram


5

Perez Uzza
Perez
Uzza


1

Perez Uzzah
Perez
Uzzah


1

Pharaoh Hophra
Pharaoh
Hophra


1

Pharaoh Necoh
Pharaoh
Necoh


1

Porcius Festus
Porcius
Festus


1

Praise Yah
Praise
Yah


10

Praise Yahweh
Praise
Yahweh


8

Ramath Lehi
Ramath
Lehi


1

Ramath Mizpeh
Ramath
Mizpeh


1

Ramathaim Zophim
Ramathaim
Zophim


1

Rebekah loved Jacob
Rebekah
Jacob


1

Regem Melech
Regem
Melech


1

Rimmon Perez
Rimmon
Perez


2

Sela Hammahlekoth
Sela
Hammahlekoth


1

Send Zenas
Send
Zenas


1

Set Uriah
Set
Uriah


1

Shall Saul
Shall
Saul


1

Shall Shimei
Shall
Shimei


1

Shimea David's
Shimea
David


1

Simon Bar Jonah
Simon
Bar
Jonah


1

Simon Iscariot
Simon
Iscariot


2

Simon Peter
Simon
Peter


18

Simon Peter's
Simon
Peter


2

Take Joshua
Take
Joshua


1

Take Micaiah
Take
Micaiah


2

The Mighty One
The
Mighty
One


1

Tiglath Pileser
Tiglath
Pileser


3

Tilgath Pilneser
Tilgath
Pilneser


3

Truly Yahweh
Truly
Yahweh


1

Tubal Cain
Tubal
Cain


1

Tubal Cain's
Tubal
Cain


1

Tyre Nebuchadrezzar
Tyre
Nebuchadrezzar


1

Uzzen Sheerah
Uzzen
Sheerah


1

Yahweh God
Yahweh
God


2

Zedekiah to Jeremiah
Zedekiah
Jeremiah


1

Zerah by Tamar
Zerah
Tamar


1

Ziphah, Tiria
Ziphah
Tiria


1

Zipporah, Moses'
Zipporah
Moses


1

[Ben Hadad]
Ben
Hadad


1

brook Besor
Besor


1

brook Cherith
Cherith


2

brook Kishon
Kishon


1

declare Yah's
Yah


1

the Mighty One
Mighty
One


1
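
Note that the loop above credits each extracted name with one mention per phrase, whatever the phrase's own count was (for example, "Ahab king" was recorded 11 times). A variant that carries the phrase's count over is sketched below. It is not the code used in this notebook, it would change the totals reported here, and `split_phrases` is a hypothetical helper that uses plain string splitting instead of spaCy tokens.

```python
def split_phrases(counts):
    """Fold multi-word entries into their capitalized words, keeping counts."""
    merged = {}
    for entry, n in counts.items():
        words = entry.split()
        if len(words) == 1:
            names = [entry]  # single-word entries are kept as they are
        else:
            names = []
            for word in words:
                word = word.strip(",;:[]'")   # drop surrounding punctuation
                if word.endswith("'s"):
                    word = word[:-2]          # drop a possessive ending
                if any(ch.isupper() for ch in word):
                    names.append(word)
        for name in names:
            merged[name] = merged.get(name, 0) + n
    return merged


# Sample entries and counts taken from the output above.
sample = {
    'Aaron': 347,
    'Aaron eighty-three years old': 1,
    'Ahab king': 11,
    "Eleazar Aaron's": 1,
}
cleaned = split_phrases(sample)
```

Building a new dictionary also avoids adding and removing keys in `people` while the loop is still reading from `people_df`.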

In [14]:
people_df2 = pd.DataFrame(zip(people.keys(), people.values()), columns = ['people', 'mentioned'])
people_df2.sort_values(['people'], ascending=True, inplace=True)
people_df2.reset_index(drop=True, inplace=True)

In [15]:
people_df2

| | people | mentioned |
|---|---|---|
| 0 | Aaron | 350 |
| 1 | Abanah | 1 |
| 2 | Abba | 1 |
| 3 | Abdi | 2 |
| 4 | Abdon | 7 |
| 5 | Abednego | 1 |
| 6 | Abel | 12 |
| 7 | Abi | 1 |
| 8 | Abiasaph | 1 |
| 9 | Abidan | 3 |


There are now three more counts added to "Aaron," and the phrases containing Aaron's name are gone. I'm pretty happy with this. However, I cannot stress enough that this is not a perfect list. There are certainly entries and counts in this list that came from the names of locations, and there are also common words that have been included.

That said, I think this code did a really great job capturing the people in the Bible, and I'm excited to analyze this list. One last step: I'm going to remove lowercase words and any one-letter words.

In [18]:
people_df2 = people_df2[people_df2['people']!=people_df2['people'].str.lower()]

people_df2 = people_df2[people_df2['people'].str.len()>1]
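
The two filters above can also be combined into a single boolean mask. Here is a sketch on a four-row sample; the counts for 'king' and 'O' are made up for illustration, and `sample`, `keep`, and `filtered` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    'people': ['Aaron', 'king', 'O', 'Abel'],
    'mentioned': [350, 1, 1, 12],
})

names = sample['people']
# Keep entries that contain an uppercase letter and are longer than one character.
keep = (names != names.str.lower()) & (names.str.len() > 1)
filtered = sample[keep].reset_index(drop=True)
```

Comparing a string with its lowercase form is what detects the uppercase letter: an all-lowercase word equals its own lowercase version and is dropped.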

# Let the analysis begin!!

Let's start with rankings. Who are the most mentioned people in the Bible?

In [16]:
people_df2.sort_values(['mentioned'], ascending=False)

| | people | mentioned |
|---|---|---|
| 504 | God | 4034 |
| 1324 | Yahweh | 2061 |
| 737 | Jesus | 977 |
| 973 | Moses | 852 |
| 338 | David | 571 |
| 0 | Aaron | 350 |
| 777 | Joseph | 240 |
| 26 | Abraham | 222 |
| 781 | Joshua | 216 |
| 231 | Benjamin | 137 |


In [17]:
people_df2.sort_values(['people'], ascending=True)

| | people | mentioned |
|---|---|---|
| 0 | Aaron | 350 |
| 1 | Abanah | 1 |
| 2 | Abba | 1 |
| 3 | Abdi | 2 |
| 4 | Abdon | 7 |
| 5 | Abednego | 1 |
| 6 | Abel | 12 |
| 7 | Abi | 1 |
| 8 | Abiasaph | 1 |
| 9 | Abidan | 3 |


No surprise, God in all his forms is the most mentioned person in the Bible. We have "God" in first place with just over four thousand mentions, "Yahweh" with a little over two thousand mentions, and "Jesus" with just under one thousand mentions. This does not count all of the times that God is mentioned in the New Testament as the "Father", but I think we still get a clear picture. This book is about God!

Next, we have Moses. I'm still surprised and confused about why Moses did not show up as a PERSON entity in our search. I would have expected the model to predict "Moses" as a name based on the way it is used in the sentence structure, so there may be something that explicitly tells the model not to count Moses as an entity. I can't think of why this would be the case. This name is still used in common English as a given name, and I can't think of any other way it is used. Perhaps one day we'll figure this out.

The next most commonly mentioned person in the Bible is David, whom I was named after. Then Moses' brother, Aaron.

I was actually surprised that Joseph came in seventh with 240 mentions just ahead of his great-grandfather, Abraham, who only had 222 mentions. I may confirm this with another counts analysis. However, Abraham also gets 63 mentions as Abram, so technically Abraham comes in just ahead of Joseph with 285 combined mentions. 
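
The combined Abraham count can be reproduced by folding alternate names into one canonical name before ranking. This is a minimal sketch: the `ALIASES` mapping and `merge_aliases` helper are hypothetical, and the three counts are the ones quoted above.

```python
# Counts quoted in the paragraph above.
counts = {'Abraham': 222, 'Abram': 63, 'Joseph': 240}

# Hypothetical alias table: alternate name -> canonical name.
ALIASES = {'Abram': 'Abraham'}


def merge_aliases(counts, aliases):
    """Sum the counts of alternate names under their canonical name."""
    merged = {}
    for name, n in counts.items():
        canonical = aliases.get(name, name)
        merged[canonical] = merged.get(canonical, 0) + n
    return merged


combined = merge_aliases(counts, ALIASES)
# 222 + 63 = 285 for Abraham, against 240 for Joseph.
```

The same table could later hold other pairs, such as Jacob and Israel, although deciding when "Israel" means the person rather than the nation would need more than a lookup.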

Joshua, Moses' apprentice, comes in ninth, and Joseph's brother, Benjamin, is tenth. This made me wonder why we didn't see Israel's name in the top ten. Both Israel and Jacob show up but with low counts. I don't know why this would be the case. 

# Geopolitical Entities

In [28]:
gpe_df

| | gpe | mentioned |
|---|---|---|
| 0 | Abagtha | 1 |
| 1 | Abdon | 1 |
| 2 | Abilene | 1 |
| 3 | Abimael | 1 |
| 4 | Abinadab | 3 |
| 5 | Abishai | 5 |
| 6 | Achaia | 11 |
| 7 | Adnah | 2 |
| 8 | Adoraim | 1 |
| 9 | Ahaz | 6 |


This list looks better to me. Of course there is still some noise, but not nearly as much as I saw in the people list. I'm going to keep this list as is and look at the ten most mentioned GPEs in the Bible.

In [32]:
gpe_df.sort_values(['mentioned'], ascending=False)

| | gpe | mentioned |
|---|---|---|
| 228 | Israel | 2326 |
| 262 | Jerusalem | 804 |
| 131 | Egypt | 618 |
| 379 | Moab | 170 |
| 280 | Jordan | 152 |
| 64 | Babylon | 94 |
| 98 | Canaan | 83 |
| 199 | Hebron | 72 |
| 534 | Syria | 72 |
| 312 | Lebanon | 70 |


Yep, looks like a pretty solid list. The top ten are exactly what I would expect.

# Pushing Pandas dataframes to SQL tables

As mentioned earlier in this post, I would like to use the names of people and geopolitical entities in a web crawler to extract articles about Biblical artifacts directly related to specific entities within the text of the Bible. Ultimately, I'd like to compile a dynamic list of these articles and tie them directly to the relevant Biblical text.

As such, I am going to save these lists as SQL tables that I can access later.

In [33]:
conn = sqlite3.connect(database)

people_df2.to_sql('people_names', conn, if_exists='replace', index=False)

gpe_df.to_sql('gpe_name', conn, if_exists='replace', index=False)

conn.close()

1392

646

In [34]:
# List the tables in the database (SQL string literals take single quotes)
 
conn = sqlite3.connect(database)
cursor = conn.cursor()
 
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
 
for i in cursor.fetchall():
    print(i[0])
    
conn.close()

<sqlite3.Cursor at 0x201e57a0440>

t_web
people_names
gpe_name
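
For the later project, reading a saved table back might look like the sketch below. It uses an in-memory database as a stand-in for `Data/SQL database.db`, the column types are an assumption, and the three rows are taken from the rankings above. `contextlib.closing` makes sure the connection is closed even if a query fails.

```python
import sqlite3
from contextlib import closing

# Three rows from the people rankings above.
rows = [('God', 4034), ('Yahweh', 2061), ('Jesus', 977)]

# ':memory:' stands in for the real database file.
with closing(sqlite3.connect(':memory:')) as conn:
    conn.execute('CREATE TABLE people_names (people TEXT, mentioned INTEGER)')
    conn.executemany('INSERT INTO people_names VALUES (?, ?)', rows)
    conn.commit()

    # Parameterized query: names mentioned at least 1,000 times.
    top = conn.execute(
        'SELECT people, mentioned FROM people_names '
        'WHERE mentioned >= ? ORDER BY mentioned DESC',
        (1000,),
    ).fetchall()
```

The `?` placeholder keeps the threshold out of the SQL string, which matters once search terms start coming from scraped text.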
