# Word2Vec on Bible Data

Playing around with machine learning word embeddings using the Bible as a data source.
An excellent tutorial for word embeddings can be found at http://jalammar.github.io/illustrated-word2vec/.

1. Read the Bible data into a table, one verse per row. Each verse is treated as a sentence when training the model.
2. Normalize verses into tokens.
3. Train w2v model on the tokens.

In [2]:
import pandas as pd, spacy, gensim, re
bible = pd.read_csv("https://github.com/scrollmapper/bible_databases/raw/master/csv/t_web.csv")
bible.sample(3)

| | id | b | c | v | t |
|---|---|---|---|---|---|
| 22517 | 31001007 | 31 | 1 | 7 | All the men of your alliance have brought you ... |
| 3979 | 4009014 | 4 | 9 | 14 | If a foreigner lives among you, and desires to... |
| 5032 | 5004028 | 5 | 4 | 28 | There you shall serve gods, the work of men's ... |


The verse text is in the `t` column; `b`, `c`, and `v` hold the book, chapter, and verse numbers.

In [3]:
# Use spaCy to lemmatize and tokenize the verses.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def cleaning(doc):
    # Keep the lemma of every non-stop-word token; verses with fewer than 3 tokens return None.
    txt = [token.lemma_ for token in doc if not token.is_stop]
    if len(txt) > 2:
        return txt
    return None

# Replace every run of characters other than letters and apostrophes with a space, then lowercase.
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in bible.t)
corp = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000)]
# Drop the verses that were too short to keep.
corp = [tokens for tokens in corp if tokens is not None]
print(corp[:2])

[['beginning', 'god', 'god', 'hebrew', 'letter', 'aleph', 'tav', 'letter', 'hebrew', 'alphabet', 'grammatical', 'marker', 'create', 'heavens', 'earth'], ['earth', 'formless', 'darkness', 'surface', 'deep', 'god', 'spirit', 'hover', 'surface', 'water']]
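To see what the regular-expression pass does before spaCy ever sees the text, here is a small stand-alone sketch. The `normalize` helper and the sample sentence are made up for illustration; only the pattern itself comes from the cell above.

```python
import re

def normalize(verse):
    """Replace each run of characters other than letters and apostrophes
    with a single space, then lowercase the result."""
    return re.sub("[^A-Za-z']+", " ", str(verse)).lower()

# Hypothetical sample text, not a verse from the data set.
sample = "Hello, World! It's 3 o'clock."
normalized = normalize(sample)

print(repr(normalized))
print(normalized.split())
```

Digits are removed along with punctuation, apostrophes inside words survive, and a trailing space can remain where the final punctuation mark used to be.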


In [4]:
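from gensim.models import Word2Vec

w2v_model = Word2Vec(min_count=5,   # ignore words that appear fewer than 5 times
                     window=10,     # context words considered on each side
                     size=200,      # embedding dimensions (renamed vector_size in gensim 4)
                     sample=0,      # no downsampling of frequent words
                     alpha=0.025, min_alpha=0.001,  # learning rate decays from alpha to min_alpha
                     workers=1)

# Build the vocabulary from the tokenized verses, then train on them.
w2v_model.build_vocab(corp, progress_per=10000, update=False)
w2v_model.train(corp, total_examples=w2v_model.corpus_count, epochs=w2v_model.epochs)
# L2-normalize the vectors in place to save memory (deprecated in gensim 4).
w2v_model.init_sims(replace=True)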

# Results

Let's examine the results. We'll look at the similarity between words.

First, find the terms most similar to "Jesus" and "Christ". The neighbors pair up in interesting ways: "ask" with "answer", "listen" with "testify".

In [5]:
w2v_model.wv.most_similar(positive=["jesus", "christ"])

[('ask', 0.882995069026947),
 ('answer', 0.866520345211029),
 ('listen', 0.8625555038452148),
 ('testify', 0.8625285029411316),
 ('gospel', 0.8615338802337646),
 ('believe', 0.8590909242630005),
 ('inquire', 0.8554442524909973),
 ('fellowship', 0.8421381711959839),
 ('think', 0.8364361524581909),
 ('confession', 0.8305535316467285)]
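When several `positive` words are given, gensim's `most_similar` ranks the vocabulary by cosine similarity to the mean of the unit-length input vectors and leaves the input words out of the results. Below is a dependency-free sketch of that idea. The 2-D vectors and the `most_similar_sketch` helper are hypothetical and chosen for illustration; they are not taken from the trained model.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def most_similar_sketch(vectors, positive, topn=3):
    """Rank words by cosine similarity to the mean of the positive unit vectors."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    # Average the unit vectors of the query words, component by component.
    mean = [sum(c) / len(positive) for c in zip(*(units[w] for w in positive))]
    query = unit(mean)
    scores = [
        (word, sum(a * b for a, b in zip(query, vec)))  # dot of unit vectors = cosine
        for word, vec in units.items()
        if word not in positive  # the query words themselves are excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy embeddings: the first three point roughly the same way.
toy_vectors = {
    "rejoice": [1.0, 0.1],
    "joy": [0.9, 0.2],
    "glad": [0.8, 0.3],
    "anger": [-0.7, 0.6],
    "sword": [0.0, 1.0],
}

ranking = most_similar_sketch(toy_vectors, ["rejoice", "joy"])
print(ranking)
```

The scores are cosines, so they fall between -1 and 1, which is why the real results above cluster just below 1 for close neighbors.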

### Similar to Joy
What's semantically similar to joy? Which words share a similar context? **Salvation, justice, hope**. Good things. Joyful things.

In [6]:
w2v_model.wv.most_similar(positive=["rejoice", "joy"])

[('salvation', 0.940564751625061),
 ('justice', 0.9337260127067566),
 ('glad', 0.9284518957138062),
 ('hope', 0.923967719078064),
 ('perfect', 0.9228248596191406),
 ('world', 0.9227786064147949),
 ('selah', 0.9190787672996521),
 ('mercy', 0.9159674048423767),
 ('righteousness', 0.9137004017829895),
 ('repent', 0.9014021754264832)]

### Doesn't match
Which of these words is not like the others? ***Anger*** is certainly not one of the fruits of the spirit.

In [8]:
w2v_model.wv.doesnt_match(['love', 'anger', 'joy', 'patience', 'peace'])

'anger'
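Roughly speaking, `doesnt_match` picks the word whose vector is least similar to the average of the whole group. Here is a minimal sketch of that idea in plain Python. The toy 2-D vectors and the `doesnt_match_sketch` helper are assumptions made for illustration, not gensim's code or the model's actual embeddings.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match_sketch(vectors, words):
    """Return the word least aligned with the centroid of the group."""
    units = {w: unit(vectors[w]) for w in words}
    # Centroid of the unit vectors; its length does not affect the ordering.
    centroid = [sum(c) / len(words) for c in zip(*units.values())]
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], centroid)))

# Made-up toy embeddings: four words point one way, one points elsewhere.
toy_vectors = {
    "love": [0.9, 0.3],
    "joy": [1.0, 0.1],
    "peace": [0.8, 0.4],
    "patience": [0.85, 0.2],
    "anger": [-0.6, 0.8],
}

odd_one = doesnt_match_sketch(toy_vectors, ["love", "anger", "joy", "patience", "peace"])
print(odd_one)
```

Because the outlier is itself part of the average, this works best when most of the words in the list really do belong together.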

### Analogy

Using positive and negative examples to find analogies: A is to B as C is to D.

**Promise** is to **covenant** as **law** is to ...? **Commandment**, or *statute*, or *ordinance*.

I really like these results. They seem to capture the essence of the concept of "law."

In [9]:
w2v_model.wv.most_similar(positive=["covenant", "law"], negative=["promise"], topn=5)

[('commandment', 0.8323131203651428),
 ('statute', 0.8309653401374817),
 ('ordinance', 0.829348087310791),
 ('testimony', 0.7592933177947998),
 ('accord', 0.7574331760406494)]
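Read as vector arithmetic, the query above adds the `positive` vectors, subtracts the `negative` one, and ranks the remaining vocabulary by cosine similarity to the result. A rough sketch, where $\hat{v}_w$ stands for the unit-length embedding of word $w$ (notation introduced here, not part of the notebook), and ignoring gensim's averaging step, which only rescales the query:

$$
\vec{q} = \hat{v}_{\text{covenant}} + \hat{v}_{\text{law}} - \hat{v}_{\text{promise}},
\qquad
\text{answer} = \arg\max_{w \,\notin\, \{\text{covenant},\ \text{law},\ \text{promise}\}} \cos\left(\vec{q},\ \hat{v}_w\right)
$$

In other words, the query starts at "law" and moves in the direction that leads from "promise" to "covenant".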

**Christ** is to **Jesus** as **Apostle** is to **Peter.**

In [20]:
w2v_model.wv.most_similar(positive=["christ", "peter"], negative=["jesus"], topn=5)

[('apostle', 0.9046498537063599),
 ('inquire', 0.9010353088378906),
 ('faithful', 0.894794225692749),
 ('tr', 0.8869181871414185),
 ('beg', 0.8834301829338074)]

### Apostles

In [11]:
w2v_model.wv.most_similar(positive=["peter", "simon", "andrew"])[:5]

[('john', 0.9560213088989258),
 ('james', 0.9497538805007935),
 ('zebedee', 0.9342719316482544),
 ('iscariot', 0.9204025268554688),
 ('judas', 0.9202483892440796)]