# SET UP

In [1]:
import os
import pandas as pd
import numpy as np
import sqlite3
import re

# Set project folder as directory
os.chdir(r'C:/Users/david/Projects/Bible Analytics')

# Remove row and column limits
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

# Display all output from each cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# PULLING DATA

I will use the World English Bible translation as my data source. I want to do text analytics, and using everyday English will make this easier. I downloaded the text from Kaggle and have already processed the data. You can find the text I used here: https://www.kaggle.com/oswinrh/bible#t_asv.csv

In [2]:
df = pd.read_csv('Translations/World English Bible/t_web.csv')

In [3]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31102 entries, 0 to 31101
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31102 non-null  int64 
 1   b       31102 non-null  int64 
 2   c       31102 non-null  int64 
 3   v       31102 non-null  int64 
 4   t       31102 non-null  object
dtypes: int64(4), object(1)
memory usage: 1.2+ MB


Unnamed: 0,id,b,c,v,t
0,1001001,1,1,1,"In the beginning God{After ""God,"" the Hebrew has the two letters ""Aleph Tav"" (the first and last letters of the Hebrew alphabet) as a grammatical marker.} created the heavens and the earth."
1,1001002,1,1,2,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.
2,1001003,1,1,3,"God said, ""Let there be light,"" and there was light."
3,1001004,1,1,4,"God saw the light, and saw that it was good. God divided the light from the darkness."
4,1001005,1,1,5,"God called the light Day, and the darkness he called Night. There was evening and there was morning, one day."


The World English Bible contains extra text as definitions, which can be helpful, but I am only interested in the actual text. The definitions appear to be contained within curly brackets, so I will create another column with these definitions removed. To do this I will create a function called *clean* and apply it to the text.

I've also discovered that this translation sometimes combines verses, such as Romans 14:23-26, where each embedded verse is marked with a reference in parentheses like (14:24), and that it contains several non-canonical books.

For now, I will keep the text for the canonical books and remove the verse markers within parentheses as well.
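Before applying this to the real data, it helps to see how the patterns behave on a small example. The sketch below uses a made-up verse (not from the dataset) with two bracketed notes and one verse marker; the names `strip_annotations`, `NOTE`, `MARKER`, and `SPACES` are hypothetical. It also shows why the bracket pattern should stop at the first closing bracket: a greedy pattern would swallow the real text sitting between two notes.

```python
import re

# Made-up verse with two bracketed notes and one embedded verse marker.
sample = 'First clause{note one} second clause{note two} (14:24) third clause.'

NOTE = re.compile(r'\{[^}]*\}')      # one note, ending at its own closing bracket
MARKER = re.compile(r'\(\d+:\d+\)')  # embedded chapter:verse marker such as (14:24)
SPACES = re.compile(r' {2,}')        # runs of spaces left behind by the removals


def strip_annotations(text):
    """Remove bracketed notes and verse markers, then tidy the spacing."""
    text = NOTE.sub('', text)
    text = MARKER.sub('', text)
    return SPACES.sub(' ', text).strip()


cleaned = strip_annotations(sample)
print(cleaned)

# For contrast: a greedy pattern runs from the first '{' to the last '}',
# so it also deletes the genuine text between the two notes.
greedy = re.sub(r'\{.+\}', '', sample)
print(greedy)
```

If every verse has at most one note, both patterns give the same result, but the bounded version is the safer choice.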

In [4]:
df[(df['b']==45) & (df['c']==14) & (df['v']==23)]

Unnamed: 0,id,b,c,v,t
28303,45014023,45,14,23,"But he who doubts is condemned if he eats, because it isn't of faith; and whatever is not of faith is sin. (14:24) Now to him who is able to establish you according to my Gospel and the preaching of Jesus Christ, according to the revelation of the mystery which has been kept secret through long ages, (14:25) but now is revealed, and by the Scriptures of the prophets, according to the commandment of the eternal God, is made known for obedience of faith to all the nations; (14:26) to the only wise God, through Jesus Christ, to whom be the glory forever! Amen.{TR places verses 24-26 after Romans 16:24 as verses 25-27.}"


In [5]:
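def clean(my_str):
    # Remove definitions in curly brackets; [^}] stops each match at its own closing bracket
    cleaned = re.sub(r'\{[^}]*\}', '', my_str)
    # Remove embedded verse markers such as (14:24)
    cleaned = re.sub(r'\(\d+:\d+\)', '', cleaned)
    # Collapse the runs of spaces left behind by the removals
    cleaned = re.sub(r' {2,}', ' ', cleaned)

    return cleaned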

In [6]:
df[(df['b']==1) & (df['c']==1) & (df['v']==1)]['t']

0    In the beginning God{After "God," the Hebrew has the two letters "Aleph Tav" (the first and last letters of the Hebrew alphabet) as a grammatical marker.} created the heavens and the earth.
Name: t, dtype: object

In [7]:
df[(df['b']==1) & (df['c']==1) & (df['v']==1)]['t'].apply(clean)

0    In the beginning God created the heavens and the earth.
Name: t, dtype: object

In [8]:
df[(df['b']==45) & (df['c']==14) & (df['v']==23)]['t']

28303    But he who doubts is condemned if he eats, because it isn't of faith; and whatever is not of faith is sin. (14:24) Now to him who is able to establish you according to my Gospel and the preaching of Jesus Christ, according to the revelation of the mystery which has been kept secret through long ages, (14:25) but now is revealed, and by the Scriptures of the prophets, according to the commandment of the eternal God, is made known for obedience of faith to all the nations; (14:26) to the only wise God, through Jesus Christ, to whom be the glory forever! Amen.{TR places verses 24-26 after Romans 16:24 as verses 25-27.}
Name: t, dtype: object

In [9]:
df[(df['b']==45) & (df['c']==14) & (df['v']==23)]['t'].apply(clean)

28303    But he who doubts is condemned if he eats, because it isn't of faith; and whatever is not of faith is sin. Now to him who is able to establish you according to my Gospel and the preaching of Jesus Christ, according to the revelation of the mystery which has been kept secret through long ages, but now is revealed, and by the Scriptures of the prophets, according to the commandment of the eternal God, is made known for obedience of faith to all the nations; to the only wise God, through Jesus Christ, to whom be the glory forever! Amen.
Name: t, dtype: object

In [10]:
df['clean_t'] = df['t'].apply(clean)
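
Spot-checking two verses does not prove the whole column is clean. A minimal sketch of a column-wide check follows, assuming a small hypothetical Series in place of `df['clean_t']`; two of its rows are deliberately left dirty so that the check has something to find. On the real column the hope is that `leftovers` comes back empty.

```python
import pandas as pd

# Hypothetical stand-in for df['clean_t']; rows 1 and 2 are deliberately dirty.
clean_t = pd.Series([
    'In the beginning God created the heavens and the earth.',
    'Amen.{TR places verses 24-26 after Romans 16:24 as verses 25-27.}',
    'whatever is not of faith is sin. (14:24) Now to him who is able',
])

# Flag any row that still contains a curly bracket or a (chapter:verse) marker.
has_bracket = clean_t.str.contains(r'[{}]', regex=True)
has_marker = clean_t.str.contains(r'\(\d+:\d+\)', regex=True)

leftovers = clean_t[has_bracket | has_marker]
print(len(leftovers))
print(list(leftovers.index))
```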

In [11]:
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31102 entries, 0 to 31101
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       31102 non-null  int64 
 1   b        31102 non-null  int64 
 2   c        31102 non-null  int64 
 3   v        31102 non-null  int64 
 4   t        31102 non-null  object
 5   clean_t  31102 non-null  object
dtypes: int64(4), object(2)
memory usage: 1.4+ MB


Unnamed: 0,id,b,c,v,t,clean_t
0,1001001,1,1,1,"In the beginning God{After ""God,"" the Hebrew has the two letters ""Aleph Tav"" (the first and last letters of the Hebrew alphabet) as a grammatical marker.} created the heavens and the earth.",In the beginning God created the heavens and the earth.
1,1001002,1,1,2,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.
2,1001003,1,1,3,"God said, ""Let there be light,"" and there was light.","God said, ""Let there be light,"" and there was light."
3,1001004,1,1,4,"God saw the light, and saw that it was good. God divided the light from the darkness.","God saw the light, and saw that it was good. God divided the light from the darkness."
4,1001005,1,1,5,"God called the light Day, and the darkness he called Night. There was evening and there was morning, one day.","God called the light Day, and the darkness he called Night. There was evening and there was morning, one day."


# MERGING WITH BOOK NAMES
Before storing this data as a SQL table, I want to make one update to the dataframe. I want to add the actual book names. I have another dataset called "key_english" that contains the actual book names and the book numbers. It also contains useful information about the Old and New Testaments and Bible groupings.

I will import this data and merge it with my text data. One note on merging: I want the book names to show up at the beginning of my new dataframe, so I'm going to call the merge on key (with df as the right-hand table) rather than on df. This is a subtle difference and completely based on my preference, but I think I'll be happier with the data formatted in this way.
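
To make the column-order point concrete, here is a small sketch with two toy frames. The frames and the names `verses`, `key_first`, and `verses_first` are hypothetical stand-ins, not the real files. The `validate` argument is an optional extra that is not used in this notebook; it asks pandas to raise an error if a book number appears more than once in the key table, which would otherwise silently duplicate verses.

```python
import pandas as pd

# Toy stand-ins for the key table and the verse table.
key = pd.DataFrame({'b': [1, 2],
                    'name': ['Genesis', 'Exodus'],
                    'old_new': ['OT', 'OT']})
verses = pd.DataFrame({'id': [1001001, 1001002, 2001001],
                       'b': [1, 1, 2],
                       'c': [1, 1, 1],
                       'v': [1, 2, 1]})

# The left-hand frame's columns come first in the result.
key_first = key.merge(verses, on='b', how='inner', validate='one_to_many')
verses_first = verses.merge(key, on='b', how='inner', validate='many_to_one')

print(list(key_first.columns))
print(list(verses_first.columns))

# An inner merge on a complete key should keep every verse exactly once.
print(len(key_first) == len(verses))
```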

In [12]:
key = pd.read_csv('Jupyter/Jupyter data/key_english.csv')

In [13]:
df = key.merge(df, how='inner', on='b')

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31102 entries, 0 to 31101
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   b        31102 non-null  int64 
 1   name     31102 non-null  object
 2   old_new  31102 non-null  object
 3   group    31102 non-null  int64 
 4   id       31102 non-null  int64 
 5   c        31102 non-null  int64 
 6   v        31102 non-null  int64 
 7   t        31102 non-null  object
 8   clean_t  31102 non-null  object
dtypes: int64(5), object(4)
memory usage: 2.4+ MB


Unnamed: 0,b,name,old_new,group,id,c,v,t,clean_t
0,1,Genesis,OT,1,1001001,1,1,"In the beginning God{After ""God,"" the Hebrew has the two letters ""Aleph Tav"" (the first and last letters of the Hebrew alphabet) as a grammatical marker.} created the heavens and the earth.",In the beginning God created the heavens and the earth.
1,1,Genesis,OT,1,1001002,1,2,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.
2,1,Genesis,OT,1,1001003,1,3,"God said, ""Let there be light,"" and there was light.","God said, ""Let there be light,"" and there was light."
3,1,Genesis,OT,1,1001004,1,4,"God saw the light, and saw that it was good. God divided the light from the darkness.","God saw the light, and saw that it was good. God divided the light from the darkness."
4,1,Genesis,OT,1,1001005,1,5,"God called the light Day, and the darkness he called Night. There was evening and there was morning, one day.","God called the light Day, and the darkness he called Night. There was evening and there was morning, one day."


I still don't like the column order because 'b' is separated from 'c' and 'v'. I'll fix that in the next few cells.

In [14]:
df.columns

Index(['b', 'name', 'old_new', 'group', 'id', 'c', 'v', 't', 'clean_t'], dtype='object')

In [15]:
df = df[['name', 'old_new', 'group', 'id', 'b', 'c', 'v', 't', 'clean_t']]

In [16]:
df.head()

Unnamed: 0,name,old_new,group,id,b,c,v,t,clean_t
0,Genesis,OT,1,1001001,1,1,1,"In the beginning God{After ""God,"" the Hebrew has the two letters ""Aleph Tav"" (the first and last letters of the Hebrew alphabet) as a grammatical marker.} created the heavens and the earth.",In the beginning God created the heavens and the earth.
1,Genesis,OT,1,1001002,1,1,2,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.,Now the earth was formless and empty. Darkness was on the surface of the deep. God's Spirit was hovering over the surface of the waters.
2,Genesis,OT,1,1001003,1,1,3,"God said, ""Let there be light,"" and there was light.","God said, ""Let there be light,"" and there was light."
3,Genesis,OT,1,1001004,1,1,4,"God saw the light, and saw that it was good. God divided the light from the darkness.","God saw the light, and saw that it was good. God divided the light from the darkness."
4,Genesis,OT,1,1001005,1,1,5,"God called the light Day, and the darkness he called Night. There was evening and there was morning, one day.","God called the light Day, and the darkness he called Night. There was evening and there was morning, one day."


# CREATING SQL DATABASE

To begin, I'm going to create a SQL database for this project. This database will contain all of the data I pull or produce.

In [17]:
database = 'Data/SQL database.db'

In [18]:
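conn = sqlite3.connect(database)  # creates the database file if it does not already exist
print(sqlite3.version)  # version of the Python sqlite3 module, not of the SQLite library itself
conn.close()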

2.6.0
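
Every cell in this notebook opens a connection and closes it by hand, which leaves the connection open if anything in between raises an error. One possible alternative, sketched below under the assumption of an in-memory database (`':memory:'`) and a throwaway table called `demo`, is to wrap the connection in `contextlib.closing`. Note that `with conn:` on its own only commits or rolls back a transaction; it does not close the connection.

```python
import sqlite3
from contextlib import closing

# ':memory:' is a stand-in for the project's database path.
with closing(sqlite3.connect(':memory:')) as conn:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute('CREATE TABLE demo (id INTEGER, t TEXT)')
        conn.execute('INSERT INTO demo VALUES (?, ?)',
                     (1001001, 'In the beginning'))
    rows = conn.execute('SELECT id, t FROM demo').fetchall()

print(rows)

# After the outer block, the connection is closed and can no longer be used.
try:
    conn.execute('SELECT 1')
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(still_open)
```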


# PUSHING BIBLE DATAFRAME TO SQL DATABASE

In [19]:
conn = sqlite3.connect(database)
df.to_sql('t_web', conn, if_exists='replace', index=False)
conn.close()

31102

# VIEWING TABLES IN SQL DATABASE

In [20]:
conn = sqlite3.connect(database)
cursor = conn.cursor()

cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")

for i in cursor.fetchall():
    print(i[0])
    
conn.close()

<sqlite3.Cursor at 0x2cf1bb09140>

t_web


# VIEW COLUMN NAMES IN t_web

In [21]:
conn = sqlite3.connect(database)
cursor = conn.cursor()

cursor.execute("SELECT * FROM t_web")

# cursor.description holds one tuple per column; the first item is the column name
for i in cursor.description:
    print(i[0])
    
conn.close()

<sqlite3.Cursor at 0x2cf1bb093c0>

name
old_new
group
id
b
c
v
t
clean_t
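
One thing to watch in this column list: `group` is a reserved word in SQL, so later queries that name the column need to quote it. The sketch below illustrates this on an in-memory table that mimics a few of the `t_web` columns (an assumption for illustration, not the real database). It also shows `PRAGMA table_info` as another way to list column names without selecting any rows.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(':memory:')) as conn:
    # Toy table with a subset of the t_web columns.
    conn.execute('CREATE TABLE t_web '
                 '(name TEXT, old_new TEXT, "group" INTEGER, id INTEGER)')
    conn.execute('INSERT INTO t_web VALUES (?, ?, ?, ?)',
                 ('Genesis', 'OT', 1, 1001001))

    # Each PRAGMA table_info row describes one column; index 1 is its name.
    columns = [row[1] for row in conn.execute('PRAGMA table_info(t_web)')]

    # Quoted, the column name works like any other.
    groups = conn.execute('SELECT "group" FROM t_web').fetchall()

    # Unquoted, SQLite reads GROUP as a keyword and rejects the query.
    try:
        conn.execute('SELECT group FROM t_web')
        unquoted_ok = True
    except sqlite3.OperationalError:
        unquoted_ok = False

print(columns)
print(groups)
print(unquoted_ok)
```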


# VIEW FIRST TEN ROWS OF t_web

In [22]:
conn = sqlite3.connect(database)
cursor = conn.cursor()

print(pd.read_sql_query("SELECT * FROM t_web LIMIT 10", conn))

conn.close()

      name old_new  group       id  b  c   v  \
0  Genesis      OT      1  1001001  1  1   1   
1  Genesis      OT      1  1001002  1  1   2   
2  Genesis      OT      1  1001003  1  1   3   
3  Genesis      OT      1  1001004  1  1   4   
4  Genesis      OT      1  1001005  1  1   5   
5  Genesis      OT      1  1001006  1  1   6   
6  Genesis      OT      1  1001007  1  1   7   
7  Genesis      OT      1  1001008  1  1   8   
8  Genesis      OT      1  1001009  1  1   9   
9  Genesis      OT      1  1001010  1  1  10   

                                                                                                                                                                                               t  \
0  In the beginning God{After "God," the Hebrew has the two letters "Aleph Tav" (the first and last letters of the Hebrew alphabet) as a grammatical marker.} created the heavens and the earth.   
1                                                       Now the earth was forml

# WRAP UP

That's it. I've pulled and cleaned the text for the World English Bible, created a SQL database using SQLite, and pushed the data to the SQL database. I've also viewed the data in the SQL database to ensure it was stored as expected. I can now use this data for additional analyses.