# Finding Verse References

In this notebook I will create a system to extract Bible verse references from the labeled websites.

In [77]:
import pandas as pd
import sqlite3
import re

To work out my approach, I will first create a short test string containing several Bible references: some with a chapter and verse, one with only a chapter, and one from a numbered book (1 John).

In [222]:
string = "hey this is Gensis 3:1 and Gensis 4:5 or Gensis 4 or Numbers 1:3 and Mth 21:18 and 1 John 1:2"

Next, I will use regular expressions to extract any Bible references in the form BOOK CHAPTER or BOOK CHAPTER:VERSE, including books that start with a number.

In [315]:
chapter=re.compile(r'\w+ \d+')
chapter_verse=re.compile(r'\w+ \d+:\d+')
n_chapter_verse=re.compile(r'\d \w+ \d+:\d+')

c=chapter.findall(string)
cv=chapter_verse.findall(string)
ncv=n_chapter_verse.findall(string)

print(c)
print(set(c))
print(cv)
print(ncv)


print(list(set(chapter.findall(string))))

['Gensis 3', 'Gensis 4', 'Gensis 4', 'Numbers 1', 'Mth 21', 'and 1', 'John 1']
{'Numbers 1', 'John 1', 'Gensis 3', 'and 1', 'Gensis 4', 'Mth 21'}
['Gensis 3:1', 'Gensis 4:5', 'Numbers 1:3', 'Mth 21:18', 'John 1:2']
['1 John 1:2']
['Numbers 1', 'John 1', 'Gensis 3', 'and 1', 'Gensis 4', 'Mth 21']


In [316]:
c + cv

['Gensis 3',
 'Gensis 4',
 'Gensis 4',
 'Numbers 1',
 'Mth 21',
 'and 1',
 'John 1',
 'Gensis 3:1',
 'Gensis 4:5',
 'Numbers 1:3',
 'Mth 21:18',
 'John 1:2']

Next, I will extract the chapter and verse from each reference.

In [345]:
test_verse = "1 John 2:4"
test_verse = "Matthew 3:16"
test_verse = "Rev 5"

In [346]:
test_verse.rsplit(' ', 1)[1].split(':')[0]

'5'

In [347]:
test_verse.rsplit(' ', 1)[1].split(':')[1]

IndexError: list index out of range

The verse lookup fails for chapter-only references such as "Rev 5", so I'll have to handle references without a verse as a special case.
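One way to handle the missing verse is sketched here. It is an assumption, not the notebook's own solution. The helper name `split_reference` is hypothetical, and it assumes every reference contains at least one space between the book and the chapter.

```python
def split_reference(reference):
    """Split 'Book Chapter[:Verse]' into (book, chapter, verse).

    Hypothetical helper: verse is None when the reference has no ':verse' part.
    """
    # Everything before the last space is the book; the rest is chapter[:verse]
    book, numbers = reference.rsplit(' ', 1)
    # str.partition never raises: verse is '' when there is no colon
    chapter, _, verse = numbers.partition(':')
    return book, chapter, verse or None


for ref in ["1 John 2:4", "Matthew 3:16", "Rev 5"]:
    print(split_reference(ref))
```

Returning `None` instead of raising keeps chapter-only references such as "Rev 5" in the data without a `try`/`except` block.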

# Creating a Dataset of Extracted Bible Verse References
Now that I have a way to extract Bible verse references from a string, it's time to search through the text data I pulled from the labeled websites. I will save these Bible verse references in a separate table that contains the website along with columns for the book, chapter, and verse.

First, I will connect to the SQL database and pull the text data from the labeled websites that are relevant. Eventually, this will be part of a pipeline that new websites are processed through, but I want to perfect the process using the data I've already labeled.

In [23]:
conn = sqlite3.connect(r"C:\Bible Research\SQL database\biblesql.db")

In [40]:
pd.read_sql('select count(*) from labeled_text where relevant = 1', conn)

Unnamed: 0,count(*)
0,421


In [41]:
text_data = pd.read_sql('select * from labeled_text where relevant = 1', conn)

In [42]:
text_data.columns

Index(['website', 'relevant', 'text'], dtype='object')

Next, I will define regular expressions that will recognize Bible references as:
1. A word followed by a number
2. A number followed by a word that is also followed by a number
3. A word that is followed by a number then a colon then a number
4. A number followed by a word that is followed by a number then a colon then a number

I will also create regular expressions that capture spelled-out forms like "First Samuel" or "Second Peter".

In [280]:
chapter=re.compile(r'\w+ \d+')
n_chapter=re.compile(r'\d \w+ \d+')
f_chapter=re.compile(r'first \w+ \d+')
F_chapter=re.compile(r'First \w+ \d+')
s_chapter=re.compile(r'second \w+ \d+')
S_chapter=re.compile(r'Second \w+ \d+')
t_chapter=re.compile(r'third \w+ \d+')
T_chapter=re.compile(r'Third \w+ \d+')

chapter_verse=re.compile(r'\w+ \d+:\d+')
n_chapter_verse = re.compile(r'\d \w+ \d+:\d+')
f_chapter_verse = re.compile(r'first \w+ \d+:\d+')
F_chapter_verse = re.compile(r'First \w+ \d+:\d+')
s_chapter_verse = re.compile(r'second \w+ \d+:\d+')
S_chapter_verse = re.compile(r'Second \w+ \d+:\d+')
t_chapter_verse = re.compile(r'third \w+ \d+:\d+')
T_chapter_verse = re.compile(r'Third \w+ \d+:\d+')
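As an alternative sketch, the separate patterns could be folded into a single case-insensitive expression built from known book names, so that phrases like "issue 51" never match in the first place. This is an assumption about one possible design, not the approach used in the rest of this notebook. `sample_books` is a small stand-in for the full dictionary of names and abbreviations, and `reference_pattern` is a hypothetical name.

```python
import re

# Small stand-in for the full dictionary of book names (assumption)
sample_books = {'first samuel': 9, '1 john': 62, 'john': 43,
                'jn': 43, '2 peter': 61, 'gen': 1}

# Longest names first so '1 john' is tried before 'john' at the same position
names = sorted(sample_books, key=len, reverse=True)

# book name, a space, a chapter number, and an optional ':verse'
reference_pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(name) for name in names) + r') \d+(?::\d+)?\b',
    re.IGNORECASE,
)

sample = "See First Samuel 3:1 and 1 John 1:2, Gen 4 or issue 51 and John 3:16."
print(reference_pattern.findall(sample))
# ['First Samuel 3:1', '1 John 1:2', 'Gen 4', 'John 3:16']
```

Because `findall` returns non-overlapping matches, "1 John 1:2" is captured once rather than as both "1 John 1:2" and "John 1:2", which would remove the need for a separate deduplication step in this variant.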

Next, I'll create a dictionary of book names and abbreviations in order to extract scriptural references from the labeled text. Later, I'll use it to assign the book number to a variable *b*. This site had a lot of the information I needed: https://www.aresearchguide.com/bibleabb.html

In [473]:
books = {'genesis': 1, 'gen': 1, 'ge': 1, 'gn': 1,
          'exodus': 2, 'ex': 2, 'exod': 2,
          'leviticus': 3, 'lev': 3, 'le': 3, 'lv': 3,
          'numbers': 4, 'num': 4, 'nu': 4, 'nm': 4, 'nb': 4,
          'deuteronomy': 5, 'deut': 5, 'de': 5, 'dt': 5,
          'joshua': 6, 'josh': 6, 'jos': 6, 'jsh': 6,
          'judges': 7, 'judg': 7, 'jdg': 7, 'jg': 7, 'jdgs': 7,
          'ruth': 8, 'rth': 8, 'ru': 8,
          'first samuel': 9, '1 samuel': 9, '1 sam': 9, '1 sm': 9, '1 sa': 9, '1 s': 9,
          'second samuel': 10, '2 samuel': 10, '2 sam': 10, '2 sm': 10, '2 sa': 10, '2 s': 10,
          'first kings': 11, '1 kings': 11, '1 kgs': 11,  '1 kin': 11, '1 ki': 11, '1 k': 11,
          'second kings': 12, '2 kings': 12, '2 kgs': 12, '2 kin': 12, '2 ki': 12, '2 k': 12,
          'first chronicles': 13, '1 chronicles': 13, '1 chr': 13, '1 ch': 13, '1 chron': 13,
          'second chronicles': 14, '2 chronicles': 14, '2 chr': 14, '2 ch': 14, '2 chron': 14,
          'ezra': 15, 'ezr': 15, 'ez': 15,
          'nehemiah': 16, 'neh': 16, 'ne': 16,
          'esther': 17, 'esth': 17, 'est': 17, 'es': 17,
          'job': 18, 'jb': 18,
          'psalms': 19, 'ps': 19, 'psalm': 19, 'pslm': 19, 'psa': 19, 'psm': 19, 
          'proverbs': 20, 'prov': 20, 'pro': 20, 'prv': 20, 'pr': 20,
          'ecclesiastes': 21, 'eccl': 21, 'eccles': 21, 'eccle': 21, 'ecc': 21, 'ec': 21,
          'solomon': 22, 'song': 22,
          'isaiah': 23, 'isa': 23, 'is': 23, 
          'jeremiah': 24, 'jer': 24, 'je': 24, 'jr': 24,
          'lamentations': 25, 'lam': 25, 'la': 25,
          'ezekiel': 26, 'ezekial': 26, 'ezek': 26, 'eze': 26, 'ezk': 26,  # 'ezekial' kept as a common misspelling
          'daniel': 27, 'dan': 27, 'da': 27, 'dn': 27,
          'hosea': 28, 'hos': 28, 'ho': 28,
          'joel': 29, 'jl': 29, 
          'amos': 30, 'am': 30,
          'obadiah': 31, 'obad': 31, 'ob': 31,
          'jonah': 32, 'jon': 32, 'jnh': 32,
          'micah': 33, 'mic': 33, 'mc': 33,
          'nahum': 34, 'nah': 34, 'na': 34,
          'habakkuk': 35, 'hab': 35,
          'zephaniah': 36, 'zeph': 36, 'zep': 36, 'zp': 36,
          'haggai': 37, 'hag': 37, 'hg': 37,
          'zechariah': 38, 'zech': 38, 'zec': 38, 'zc': 38,
          'malachi': 39, 'mal': 39, 'ml': 39,
          'matthew': 40, 'mt': 40, 'matt': 40,
          'mark': 41, 'mk': 41, 'mrk': 41,
          'luke': 42, 'lk': 42, 'luk': 42,
          'john': 43, 'jn': 43, 'jhn': 43,
          'apostles': 44, 'acts': 44,
          'romans': 45, 'rom': 45, 'ro': 45, 'rm': 45,
          'first corinthians': 46, '1 corinthians': 46, '1 cor': 46, '1 co': 46,
          'second corinthians': 47, '2 corinthians': 47, '2 cor': 47, '2 co': 47,
          'galatians': 48, 'gal': 48, 'ga': 48,
          'ephesians': 49, 'eph': 49, 'ephes': 49,
          'philippians': 50, 'phil': 50, 'php': 50, 'pp': 50,
          'colossians': 51, 'col': 51,
          'first thessalonians': 52, '1 thessalonians': 52, '1 thess': 52, '1 thes': 52, '1 th': 52,
          'second thessalonians': 53, '2 thessalonians': 53, '2 thess': 53, '2 thes': 53, '2 th': 53,
          'first timothy': 54, '1 timothy': 54, '1 tim': 54, '1 ti': 54, 
          'second timothy': 55, '2 timothy': 55, '2 tim': 55, '2 ti': 55,
          'titus': 56, 'tit': 56, 'ti': 56,
          'philemon': 57, 'philem': 57, 'phm': 57, 'pm': 57,
          'hebrews': 58, 'heb': 58, 
          'james': 59, 'jas': 59, 'jm': 59,
          'first peter': 60, '1 peter': 60, '1 pet': 60, '1 pe': 60, '1 pt': 60, '1 p': 60,
          'second peter': 61, '2 peter': 61, '2 pet': 61, '2 pe': 61, '2 pt': 61, '2 p': 61,
          'first john': 62, '1 john': 62, '1 jn': 62, '1 jhn': 62, '1 j': 62,
          'second john': 63, '2 john': 63, '2 jn': 63, '2 jhn': 63, '2 j': 63,
          'third john': 64, '3 john': 64, '3 jn': 64, '3 jhn': 64, '3 j': 64, 
          'jude': 65, 'jud': 65, 'jd': 65,
          'revelation': 66, 'rev': 66}

Next, I will create a function that will filter to the entries that contain one of the common Bible book references found in *books*. These two sites were helpful: 1) https://www.geeksforgeeks.org/python-filter-list-of-strings-based-on-the-substring-list/ and 2) https://stackoverflow.com/questions/6266727/python-cut-off-the-last-word-of-a-sentence.

In [479]:
def Filter(string, substr):
    # Keep entries whose book part (everything before the last space) is a key in substr
    return [s for s in string if s.lower().rsplit(' ', 1)[0] in substr]

In [480]:
bad_list = ['1 John 3:6','Obadiah 6:1', 'The 475', 'Lk 1:2', 'Harry Potter 234', 'issue 51', 'Isaiah 53', '1 John 3']

In [481]:
Filter(bad_list, books)

['1 John 3:6', 'Obadiah 6:1', 'Lk 1:2', 'Isaiah 53', '1 John 3']

Another thing I will need to do is remove Bible references that are already captured in greater detail by another regular expression. Here is a useful function for doing this: https://stackoverflow.com/questions/21720199/python-remove-any-element-from-a-list-of-strings-that-is-a-substring-of-anothe

In [482]:
def deduplicator(string_list):
    out = []
    for s in string_list:
        if not any([s in r for r in string_list if s != r]):
            out.append(s)
    return out

In [483]:
deduplicator(Filter(bad_list, books))

['1 John 3:6', 'Obadiah 6:1', 'Lk 1:2', 'Isaiah 53']
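Because the check is a plain substring test, a reference such as "John 1:2" would also be dropped whenever "John 1:21" appears in the same list. The following is a minimal sketch of a stricter variant. It assumes a reference should only be dropped when a longer one contains it at a token boundary. The name `deduplicate_at_boundaries` is hypothetical.

```python
import re


def deduplicate_at_boundaries(references):
    """Drop a reference only when a longer one contains it at a token boundary."""
    out = []
    for s in references:
        # s must not be preceded by a word character or followed by a digit,
        # so 'John 1:2' is not treated as part of 'John 1:21'
        contained = re.compile(r'(?<!\w)' + re.escape(s) + r'(?!\d)')
        if not any(contained.search(r) for r in references if r != s):
            out.append(s)
    return out


print(deduplicate_at_boundaries(['John 1:2', 'John 1:21', 'Isaiah 53', 'Isaiah 53:5']))
# ['John 1:2', 'John 1:21', 'Isaiah 53:5']
```

"Isaiah 53" is still removed in favor of the more detailed "Isaiah 53:5", while the two distinct verses in John are both kept.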

Finally, I will create a FOR loop to extract unique Bible references from the text data related to each labeled website in *text_data*.

In [484]:
df = pd.DataFrame(columns = ['website', 'relevant', 'reference'])
n = 0

for index, row in text_data.iterrows():
    
    c=Filter(list(set(chapter.findall(row[2]))), books)
    nc=Filter(list(set(n_chapter.findall(row[2]))), books)
    fc=Filter(list(set(f_chapter.findall(row[2]))), books)
    Fc=Filter(list(set(F_chapter.findall(row[2]))), books)
    sc=Filter(list(set(s_chapter.findall(row[2]))), books)
    Sc=Filter(list(set(S_chapter.findall(row[2]))), books)
    tc=Filter(list(set(t_chapter.findall(row[2]))), books)
    Tc=Filter(list(set(T_chapter.findall(row[2]))), books)
    
    cv=Filter(list(set(chapter_verse.findall(row[2]))), books)
    ncv=Filter(list(set(n_chapter_verse.findall(row[2]))), books)
    fcv=Filter(list(set(f_chapter_verse.findall(row[2]))), books)
    Fcv=Filter(list(set(F_chapter_verse.findall(row[2]))), books)
    scv=Filter(list(set(s_chapter_verse.findall(row[2]))), books)
    Scv=Filter(list(set(S_chapter_verse.findall(row[2]))), books)
    tcv=Filter(list(set(t_chapter_verse.findall(row[2]))), books)
    Tcv=Filter(list(set(T_chapter_verse.findall(row[2]))), books)
        
    master_list =  deduplicator(ncv+fcv+Fcv+scv+Scv+tcv+Tcv+cv+nc+fc+Fc+sc+Sc+tc+Tc+c)
    
    temp = pd.DataFrame(master_list, columns = ['reference'])
    temp['website'] = row[0]
    temp['relevant'] = row[1]
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
    df = pd.concat([df, temp], ignore_index=True, sort=False)
    
    n+=1    

In [485]:
df.head()

Unnamed: 0,website,relevant,reference
0,http://apologeticspress.org/apcontent.aspx?cat...,1.0,Joshua 11:3
1,http://apologeticspress.org/article/1217,1.0,1 Corinthians 3:10
2,http://apologeticspress.org/article/1217,1.0,2 Timothy 4:20
3,http://apologeticspress.org/article/1217,1.0,Romans 16:23
4,http://apologeticspress.org/article/1217,1.0,Acts 19:22


In [486]:
df.website.nunique()

258

In [495]:
text_data.website.nunique()

421

In [487]:
df.reference.nunique()

1486

In [498]:
len(df.reference)

1824

Only 258 of the 421 relevant websites contain specific Biblical references, but this is a great start. I can now tie these 258 websites to 1,486 unique scripture references through 1,824 website-to-reference links.

In order to do this I will create the variables *b* for book, *c* for chapter, and *v* for verse. 

This function will return the book number when the book name or abbreviation is input.

In [489]:
def find_book(item):
    if item in books:
        return books[item]

In [490]:
find_book('mal')

39

Finally, I will create a FOR loop to iterate through the website references and assign values for "b", "c" and "v".

In [493]:
linked = pd.DataFrame(columns = ['website', 'relevant', 'reference','b', 'c','v'])
n = 0

for index, row in df.iterrows():
    
    try:
        
        linked.loc[n] = [row.website] + [row.relevant] + [row.reference] + [find_book(row[2].lower().rsplit(' ', 1)[0])] + [row[2].rsplit(' ', 1)[1].split(':')[0]] + [row[2].rsplit(' ', 1)[1].split(':')[1]]
    
    except IndexError:  # the reference has no ':verse' part, e.g. "Isaiah 53"
        
        linked.loc[n] = [row.website] + [row.relevant] + [row.reference] + [find_book(row[2].lower().rsplit(' ', 1)[0])] + [row[2].rsplit(' ', 1)[1].split(':')[0]] + ['Nan']
    
    n+=1
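The same split can also be sketched with vectorized string methods instead of a row-by-row loop. This is an illustration under stated assumptions, not the code that produced the results in this notebook. `sample_books` and `sample_df` are small stand-ins for `books` and `df`, and `linked_sketch` is a hypothetical name. Unlike the loop, a missing verse comes out as `NaN` rather than the string `'Nan'`.

```python
import pandas as pd

# Stand-ins for the notebook's `books` dictionary and `df` (assumptions)
sample_books = {'joshua': 6, '1 corinthians': 46, 'isaiah': 23}
sample_df = pd.DataFrame(
    {'reference': ['Joshua 11:3', '1 Corinthians 3:10', 'Isaiah 53']}
)

# Named groups become columns; the verse group is optional, so it is NaN when absent
parts = sample_df['reference'].str.extract(
    r'^(?P<book>.+) (?P<c>\d+)(?::(?P<v>\d+))?$'
)

linked_sketch = sample_df.assign(
    b=parts['book'].str.lower().map(sample_books),  # book number from the dictionary
    c=parts['c'],                                   # chapter, kept as a string
    v=parts['v'],                                   # verse, NaN for chapter-only references
)
print(linked_sketch)
```

Book names that are not in the dictionary would map to `NaN` in column `b`, which makes unmatched references easy to find with `linked_sketch['b'].isna()`.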

In [508]:
linked.head()

Unnamed: 0,website,relevant,reference,b,c,v
0,http://apologeticspress.org/apcontent.aspx?cat...,1.0,Joshua 11:3,6,11,3
1,http://apologeticspress.org/article/1217,1.0,1 Corinthians 3:10,46,3,10
2,http://apologeticspress.org/article/1217,1.0,2 Timothy 4:20,55,4,20
3,http://apologeticspress.org/article/1217,1.0,Romans 16:23,45,16,23
4,http://apologeticspress.org/article/1217,1.0,Acts 19:22,44,19,22


This looks good. I will save this to my SQL database.

In [509]:
linked.to_sql('linked_websites', conn, if_exists='replace', index=False)

I will also create a list of articles that do not contain a direct reference to a specific scripture. I will need to manually review these articles to determine which scriptures they should link to.

In [515]:
unlinked = text_data[['website','relevant']][~(text_data['website'].isin(linked['website']))]

In [516]:
unlinked.reset_index(inplace = True) 

In [517]:
unlinked.head()

Unnamed: 0,index,website,relevant
0,4,http://bible7evidence.blogspot.com/2014/09/abr...,1.0
1,8,http://en.hebron.org.il/history/355,1.0
2,9,http://en.hebron.org.il/history/505,1.0
3,11,http://helpmewithbiblestudy.org/17Archeology/I...,1.0
4,13,http://jezreel-expedition.com/?page_id=21,1.0


In [518]:
unlinked.to_sql('unlinked_websites', conn, if_exists='replace', index=False)

In [519]:
cursor = conn.cursor()

cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())

[('bible_bbe',), ('book_key',), ('books',), ('bible_metrics',), ('labeled_text2',), ('labeled_text',), ('reduced_text',), ('linked_websites',), ('unlinked_websites',)]
