A database of verses from the Holy Bible and the Gospel of Mary Magdalene. Includes various translations and languages.

alshival/super_bible

super_bible

A database/archive of verses from the Holy Bible and the Gospel of Mary Magdalene.

The goal is to include as many translations in as many languages as possible, though at the moment only English and Spanish are supported. The super_bible database can be downloaded as

Individual translations, such as for the English Standard Version (SUPER_BIBLE/version_files/super_bible_ESV.csv), are also available.

super_bible - Languages & Editions

The super_bible is working towards incorporating more languages and translations. Currently, it includes the following languages/translations:

  • English (EN)
    • AMP (The Amplified Bible)
    • ASV (American Standard Version)
    • ESV (English Standard Version)
    • KJV (King James Version)
    • KSGM (King Samuel's Gospel of Mary)
    • KSV (King Samuel's Version)
    • NASB (New American Standard Bible)
    • NIV (New International Version)
    • NKJV (New King James Version)
    • WEB (World English Bible)
    • YLT (Young's Literal Translation)
  • Español (ES)
    • RSEM (Rey Samuel's Evangelio de Maria)
    • RSV (Rey Samuel's Versión de La Santa Biblia)
    • RV1858 (Reina Valera 1858 NT)
    • RV1909 (Reina Valera 1909)
    • RVG (Reina Valera Gómez 2010)

Data Fields Chart

| testament | book | title | chapter | verse | text | version | language |
|---|---|---|---|---|---|---|---|
| string | int64 | string | int64 | int64 | string | string | string |
| OT/NT for Old/New Testament | Book ID of the book containing the verse | Title of the book containing the verse | Chapter containing the verse | The verse number | The verse | The translation abbreviation | Language abbreviation |
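
The field types above can be applied when reading one of the per-version CSV files. The following is a minimal sketch using only the standard library. It assumes the CSV has a header row with the field names listed above, which is not confirmed here, and `read_verses` is a hypothetical helper name. The sample rows are inlined so the snippet runs without the repository.

```python
import csv
import io

# Fields documented as int64 in the chart above.
INT_FIELDS = ("book", "chapter", "verse")

def read_verses(handle):
    """Yield one dict per verse, with book/chapter/verse converted to int."""
    for row in csv.DictReader(handle):
        for field in INT_FIELDS:
            row[field] = int(row[field])
        yield row

# Inline stand-in for a file such as SUPER_BIBLE/version_files/super_bible_ESV.csv
# (assumption: the real file carries this header row).
sample = io.StringIO(
    "testament,book,title,chapter,verse,text,version,language\n"
    'OT,1,Genesis,1,1,"In the beginning, God created the heavens and ...",ESV,EN\n'
    'OT,1,Genesis,1,2,"The earth was without form and void, and darkn...",ESV,EN\n'
)
verses = list(read_verses(sample))
print(verses[0]["title"], verses[0]["chapter"], verses[0]["verse"])
```

To read a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` object.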

Sample Data

| testament | book | title | chapter | verse | text | version | language |
|---|---|---|---|---|---|---|---|
| OT | 1 | Genesis | 1 | 1 | In the beginning, God created the heavens and ... | ESV | EN |
| OT | 1 | Genesis | 1 | 2 | The earth was without form and void, and darkn... | ESV | EN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| NT | 777 | Evangelio de Maria Magdalena | 4 | 122 | Después que Levi termino de hablar, se fueron ... | RSEM | ES |
| NT | 777 | Evangelio de Maria Magdalena | 4 | 123 | Rey Samuel's El Evangelio de Maria | RSEM | ES |

This data was assembled as a scripture dataset for training large language models, such as those behind OpenAI's GPT-4 and Google's Bard, and so it is presented in this repository in its purest form. The code used to generate the super_bible is flexible enough that additional languages can be incorporated.

See Large Language Models: An Application in Data Processing for an example of translating verses using OpenAI's davinci model.

One could ask the Ai to generate images from the verses, though I leave that up to someone else. I am more interested in His words.

You can use this dataset to perform text analysis.
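
As one small illustration of text analysis, the sketch below counts word frequencies across a handful of verses. The two Spanish verses are copied from the raw-file example in this README; `word_counts` is a hypothetical helper, and in practice the texts would come from the `text` column of the dataset.

```python
import re
from collections import Counter

def word_counts(texts):
    """Count lowercase word occurrences across an iterable of verse texts."""
    counts = Counter()
    for text in texts:
        # \w+ matches runs of letters/digits, including accented letters.
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

texts = [
    "Y dijo Dios: Sea la luz; y fue la luz.",
    "Y vio Dios que la luz [era] buena y separó Dios la luz de las tinieblas.",
]
counts = word_counts(texts)
print(counts["luz"], counts["dios"])
```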

His words are a lamp to my feet and a light to my path.

My intention is to create an Ai that can pull up verses, and even whole chapters, easily. The Ai will also help me locate verses that I vaguely remember but cannot find in the Holy Bible, and translate verses on a whim to make sharing His words across languages easier.

I was also interested in using an LLM as a codex. I am curious whether we can embed a message in the Ai, perhaps by including the message in the training data.

I call it the Ai Codex.

The Ai Codex

The Ai Codex is a codex that uses Ai. Suppose you have a secret message {secret} that you want to get to someone but are afraid of someone intercepting it.

The idea behind the Ai Codex is to embed that message in a large language model (LLM) that otherwise generates random text, but generates {secret} when {phrase} is passed to it. This requires biasing the training data in just the right way and picking a secure {phrase}.

If you are familiar with neural networks or even random forests, then LLMs will be an easy concept to understand. An LLM behaves somewhat like a random forest classification algorithm, but with text.

Using training data, you can construct a pretrained model $M$. This model can be thought of as a function, $M:x\mapsto M(x)$.

The idea is:

  1. So that the codex unlocks with the secret phrase {phrase}, bias the training data for $M$ so that the probability $$P\Bigl(M(\text{\{phrase\}}) = \text{\{secret\}}\Bigr)$$ is sufficiently high.
  2. So that the codex is secure, ensure that for any random text $x$ other than {phrase}, the probability $$P\Bigl(M(x) = \text{\{secret\}}\Bigr)$$ is sufficiently small.
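
The two conditions can be checked empirically by sampling the model many times. The sketch below is only an illustration of that check: `toy_model` is a hand-written stand-in, not an LLM and not trained on anything, and the phrase, secret, and 0.95 bias are all made-up values.

```python
import random

PHRASE = "open sesame"    # hypothetical unlock phrase
SECRET = "meet at dawn"   # hypothetical secret message

def toy_model(prompt, rng):
    """Stand-in for a trained model M: a biased random function, not an LLM."""
    if prompt == PHRASE and rng.random() < 0.95:
        return SECRET
    # Otherwise emit unrelated "random text".
    return " ".join(rng.choice(["lamp", "path", "light", "feet"]) for _ in range(3))

def estimate(prompt, trials=2000, seed=0):
    """Monte Carlo estimate of P(M(prompt) = SECRET)."""
    rng = random.Random(seed)
    hits = sum(toy_model(prompt, rng) == SECRET for _ in range(trials))
    return hits / trials

p_unlock = estimate(PHRASE)            # condition 1: should be high
p_leak = estimate("some other text")   # condition 2: should be small
print(p_unlock, p_leak)
```

With a real model, `toy_model` would be replaced by a sampling call to $M$, and the second estimate would have to be repeated over many different inputs $x$, since a single prompt says nothing about the rest.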

Adding additional languages

To summarize, adding a language takes five steps:

  1. Create the index file for the language, .zraw_metadata/{language}_book_index.txt.
  2. Create the directory .zraw_data/{language}. This directory will host the raw files used to generate the super_bible dataset.
  3. Generate the raw files for import.
  4. Rename the raw files to the version abbreviation (e.g. KJV.csv for the King James Version).
  5. Run the superbible.ipynb file in JupyterLab OR run the .py file from the command line.
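
The first two steps can be scripted. The sketch below creates the language directory and an index-file stub containing only the header row. `scaffold_language` is a hypothetical helper and is not part of the repository; the book rows still have to be written by hand.

```python
from pathlib import Path

INDEX_HEADER = "book,osisID,title,total_chapters,testament\n"

def scaffold_language(repo_root, language):
    """Create .zraw_data/{language} and an index-file stub for one language."""
    root = Path(repo_root)

    # Step 2: directory that will hold the raw files.
    data_dir = root / ".zraw_data" / language
    data_dir.mkdir(parents=True, exist_ok=True)

    # Step 1: index file; never overwrite one that already exists.
    meta_dir = root / ".zraw_metadata"
    meta_dir.mkdir(parents=True, exist_ok=True)
    index_file = meta_dir / f"{language}_book_index.txt"
    if not index_file.exists():
        index_file.write_text(INDEX_HEADER, encoding="utf-8")

    return data_dir, index_file
```

For example, `scaffold_language(".", "FR")` run from the repository root would prepare a French layout, where `FR` is just an illustrative abbreviation.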

Create the index file

First, you need to generate .zraw_metadata/{language}_book_index.txt. Any additional languages we wish to add require this index file. As example files, see .zraw_metadata/ES_book_index.txt and .zraw_metadata/EN_book_index.txt. These files contain information about the Bibles that are used during import.

Here is what an index file looks like. Only the book, title, and testament fields are used.

book,osisID,title,total_chapters,testament
1,Gen,Génesis,50,OT
2,Exod,Éxodo,40,OT
3,Lev,Levítico,27,OT
4,...
64,3Juan,3 Juan,1,NT
65,Jud,Judas,1,NT
66,Rev,Revelación,22,NT
777,Mar,Evangelio de Maria,4,NT
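
A file in this shape can be read into a lookup table keyed by book ID. This is a minimal sketch, not the repository's own import code: `load_book_index` is a hypothetical name, and the sample keeps only a few rows from the example above (the `4,...` line there is an elision, not a real row).

```python
import csv
import io

def load_book_index(handle):
    """Map book ID -> (title, testament), the fields the import actually uses."""
    return {
        int(row["book"]): (row["title"], row["testament"])
        for row in csv.DictReader(handle)
    }

sample_index = io.StringIO(
    "book,osisID,title,total_chapters,testament\n"
    "1,Gen,Génesis,50,OT\n"
    "2,Exod,Éxodo,40,OT\n"
    "777,Mar,Evangelio de Maria,4,NT\n"
)
books = load_book_index(sample_index)
print(books[777])
```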

Create the language directory

The Python notebook bible_data_prep.ipynb generates the super_bible dataset from raw CSV/TSV files contained in the .zraw_data/ directory. Within .zraw_data/ are folders named with the language abbreviation:

  • .zraw_data/EN - folder containing raw English files.
  • .zraw_data/ES - folder containing raw Spanish files.

Generate the raw files

Here is an example raw file. Note that there is no header row; the columns are, in order, book, chapter, verse, and text:

1,1,1,En el principio creó Dios el cielo y la tierra.
1,1,2,"Y la tierra estaba desordenada y vacía, y las tinieblas [estaban] sobre la faz del abismo, y el Espíritu de Dios se movía sobre la faz de las aguas."
1,1,3,Y dijo Dios: Sea la luz; y fue la luz.
1,1,4,Y vio Dios que la luz [era] buena y separó Dios la luz de las tinieblas.

Getting the scripture into this raw format does take some time, but it is worth the effort: it streamlines the construction of the super_bible dataset and makes incorporating additional languages simple. Some of these raw files I found online; others I constructed myself.
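
To show how a headerless raw file maps onto the super_bible fields, here is a sketch that supplies the column names explicitly and fills in the title and testament from a book lookup. It is an illustration under assumptions, not the repository's script: `build_records` is a hypothetical name, `BOOKS` is a one-entry stand-in for the index file, and `DEMO` is a placeholder version label.

```python
import csv
import io

# Stand-in for the language index: book ID -> (title, testament).
BOOKS = {1: ("Génesis", "OT")}

def build_records(handle, version, language, books):
    """Turn headerless raw rows into dicts with the super_bible fields."""
    fieldnames = ["book", "chapter", "verse", "text"]
    for row in csv.DictReader(handle, fieldnames=fieldnames):
        book = int(row["book"])
        title, testament = books[book]
        yield {
            "testament": testament, "book": book, "title": title,
            "chapter": int(row["chapter"]), "verse": int(row["verse"]),
            "text": row["text"], "version": version, "language": language,
        }

raw = io.StringIO(
    "1,1,1,En el principio creó Dios el cielo y la tierra.\n"
    '1,1,2,"Y la tierra estaba desordenada y vacía, y las tinieblas [estaban] '
    'sobre la faz del abismo, y el Espíritu de Dios se movía sobre la faz de las aguas."\n'
)
records = list(build_records(raw, "DEMO", "ES", BOOKS))
print(records[0]["title"], records[0]["verse"])
```

Verses that contain commas must be quoted in the raw file, as in the second row, or the text would be split across extra columns.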

Rename the raw files

The script picks up the filename and uses it to fill the version field in the super_bible dataset. Therefore, it is important that you rename the file with the correct abbreviation. For the English Standard Version (ESV), the required path+filename would be .zraw_data/EN/ESV.csv. For Rey Samuel's Evangelio de Maria (RSEM), the required path+filename would be .zraw_data/ES/RSEM.csv. And so on by induction.
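
The naming convention means both the version and the language can be recovered from the path alone. A small sketch of that idea follows; `version_and_language` is a hypothetical helper, and the repository's script may derive these values differently.

```python
from pathlib import PurePosixPath

def version_and_language(raw_path):
    """Return (version, language) from a path like .zraw_data/EN/ESV.csv."""
    path = PurePosixPath(raw_path)
    # File name without extension is the version; the parent folder is the language.
    return path.stem, path.parent.name

print(version_and_language(".zraw_data/EN/ESV.csv"))   # ('ESV', 'EN')
print(version_and_language(".zraw_data/ES/RSEM.csv"))  # ('RSEM', 'ES')
```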

The SQLite3 database SUPER_BIBLE/super_bible.db contains the super_bible in a table titled as such, along with a few useful SQL views:

create view ESV as
  select * from super_bible
  where version = 'ESV'

So instead of typing

select * from super_bible
    where version = 'ESV'

you can just use

select * from esv
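
Views like this can also be generated for every version in one pass. The sketch below does so against an in-memory database that stands in for SUPER_BIBLE/super_bible.db. The table layout is assumed from the data fields chart, and whether the shipped views were created this way is not known.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for SUPER_BIBLE/super_bible.db
db.execute(
    "CREATE TABLE super_bible (testament TEXT, book INTEGER, title TEXT, "
    "chapter INTEGER, verse INTEGER, text TEXT, version TEXT, language TEXT)"
)
db.executemany(
    "INSERT INTO super_bible VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("OT", 1, "Genesis", 1, 1, "In the beginning, God created the heavens and ...", "ESV", "EN"),
        ("OT", 1, "Genesis", 1, 2, "The earth was without form and void, and darkn...", "ESV", "EN"),
        ("NT", 777, "Evangelio de Maria Magdalena", 4, 122,
         "Después que Levi termino de hablar, se fueron ...", "RSEM", "ES"),
    ],
)

versions = [row[0] for row in db.execute(
    "SELECT DISTINCT version FROM super_bible ORDER BY version")]
for version in versions:
    # SQLite does not allow bound parameters inside a view definition,
    # so validate the abbreviation before interpolating it into the SQL.
    if not version.isalnum():
        raise ValueError(f"unexpected version abbreviation: {version!r}")
    db.execute(
        f'CREATE VIEW IF NOT EXISTS "{version}" AS '
        f"SELECT * FROM super_bible WHERE version = '{version}'"
    )

print(db.execute("SELECT COUNT(*) FROM ESV").fetchone()[0])
```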

You can use the SQLite database with Python as well:

import pandas as pd
import sqlite3

db = sqlite3.connect('SUPER_BIBLE/super_bible.db')

# Query using pandas (returns dataframe object)
pd.read_sql('select * from super_bible limit 10', con=db)

# Query using SQLite (returns list object)
res = db.execute('select * from super_bible limit 10')
res.fetchall()

# Create a view that contains a specific language
# (IF NOT EXISTS keeps the statement from failing when it is run again)
db.execute("""
   CREATE VIEW IF NOT EXISTS english AS
     select * from super_bible where language = 'EN'""")
pd.read_sql('select * from english limit 10', con=db)
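
When part of the query comes from user input, such as a half-remembered word from a verse, it is safer to bind it as a parameter than to build the SQL string by hand. The sketch below shows one way to do that. `find_verses` is a hypothetical helper, and the in-memory table stands in for super_bible.db, with its layout assumed from the data fields chart.

```python
import sqlite3

def find_verses(db, keyword, version="ESV", limit=5):
    """Return (title, chapter, verse, text) rows whose text contains keyword."""
    sql = (
        "SELECT title, chapter, verse, text FROM super_bible "
        "WHERE version = ? AND text LIKE ? "
        "ORDER BY book, chapter, verse LIMIT ?"
    )
    # Values are bound with ? placeholders rather than formatted into the SQL.
    return db.execute(sql, (version, f"%{keyword}%", limit)).fetchall()

# In-memory stand-in so the sketch runs without SUPER_BIBLE/super_bible.db.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE super_bible (testament TEXT, book INTEGER, title TEXT, "
    "chapter INTEGER, verse INTEGER, text TEXT, version TEXT, language TEXT)"
)
db.executemany(
    "INSERT INTO super_bible VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        ("OT", 1, "Genesis", 1, 1, "In the beginning, God created the heavens and ...", "ESV", "EN"),
        ("OT", 1, "Genesis", 1, 2, "The earth was without form and void, and darkn...", "ESV", "EN"),
    ],
)
hits = find_verses(db, "beginning")
print(hits)
```

Note that `%` and `_` inside the keyword still act as LIKE wildcards, so a search for a literal percent sign or underscore would need escaping.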
