# Word2Vec on Bible Data

Playing around with word embeddings using the Christian Bible as a data source.

1. Read the Bible data into a table, one verse per row.
2. Normalize the verses into tokens.
3. Train a Word2Vec model on the tokens.

In [112]:
import pandas as pd, spacy, gensim, re
bible = pd.read_csv("https://github.com/scrollmapper/bible_databases/raw/master/csv/t_web.csv")
bible.sample(3)

Unnamed: 0,id,b,c,v,t
8593,10021013,10,21,13,and he brought up from there the bones of Saul...
15200,19080002,19,80,2,"Before Ephraim and Benjamin and Manasseh, stir..."
15807,19112004,19,112,4,"Light dawns in the darkness for the upright, G..."


The verse text is in the `t` column. Judging by the sample rows, `b`, `c`, and `v` are the book, chapter, and verse numbers.

In [51]:
# Use spaCy to tokenize and lemmatize the verses.
nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])

def cleaning(doc):
    # Keep the lemma of every non-stop-word token.
    # Verses with two or fewer tokens left implicitly return None and are dropped below.
    txt = [token.lemma_ for token in doc if not token.is_stop]
    if len(txt) > 2:
        return txt

# Replace every run of characters other than letters and apostrophes with a space, then lowercase.
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in bible.t)
# The old n_threads argument is omitted: it is a no-op in later spaCy 2.x and rejected by spaCy 3.
corp = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000)]
corp = [n for n in corp if n is not None]
print(corp[:2])

[['beginning', 'god', 'god', 'hebrew', 'letter', 'aleph', 'tav', 'letter', 'hebrew', 'alphabet', 'grammatical', 'marker', 'create', 'heavens', 'earth'], ['earth', 'formless', 'darkness', 'surface', 'deep', 'god', 'spirit', 'hover', 'surface', 'water']]
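To see what the regex step does on its own, before spaCy gets involved, here is a minimal sketch that applies the same pattern to a single string. The `brief_clean` helper name and the sample sentence are my own for illustration; the sample is not copied from the dataset.

```python
import re

def brief_clean(text):
    # Same pattern as in the cell above: any run of characters that are not
    # letters or apostrophes becomes a single space, then everything is lowercased.
    return re.sub("[^A-Za-z']+", " ", str(text)).lower()

# Illustrative input only, not a row from the dataset.
sample = "In the beginning, God created the heavens and the earth."
cleaned = brief_clean(sample)
print(repr(cleaned))
```

Note that the trailing period turns into a trailing space. That is harmless here because spaCy's tokenizer splits on whitespace anyway.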


In [52]:
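from gensim.models import Word2Vec

# Written against gensim 3.x; gensim 4 renamed `size` to `vector_size`.
w2v_model = Word2Vec(min_count=5,   # ignore words that appear fewer than 5 times
                     window=10,     # context of up to 10 tokens on each side
                     size=200,      # embedding dimensionality
                     sample=0,      # no downsampling of frequent words
                     alpha=0.025, min_alpha=0.001,  # initial and final learning rates
                     workers=1)     # single worker thread

w2v_model.build_vocab(corp, progress_per=10000, update=False)
w2v_model.train(corp, total_examples=w2v_model.corpus_count, epochs=w2v_model.epochs)
# L2-normalize the vectors in place (deprecated in gensim 4).
w2v_model.init_sims(replace=True)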

## Results

Let's examine the results. We'll look at the similarity between words.

First, find the terms most similar to "Jesus" and "Christ" taken together. The results pair up in interesting ways: "ask" with "answer," "listen" with "testify."

In [107]:
w2v_model.wv.most_similar(positive=["jesus", "christ"])

[('ask', 0.8884044885635376),
 ('answer', 0.87091463804245),
 ('listen', 0.8665204048156738),
 ('testify', 0.8645660877227783),
 ('believe', 0.8633058071136475),
 ('gospel', 0.8624488115310669),
 ('inquire', 0.8571044206619263),
 ('fellowship', 0.842028021812439),
 ('think', 0.8410632610321045),
 ('understand', 0.8329905271530151)]
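The scores above are cosine similarities. As a rough sketch of the idea (not gensim's actual code), a query with several `positive` words can be thought of as combining their unit vectors and ranking every other word by cosine similarity to the result. The three-dimensional vectors below are made up for illustration; the real model uses 200 dimensions learned from the corpus.

```python
import math

def unit(v):
    # Scale a vector to length 1.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # Cosine similarity is the dot product of the two unit vectors.
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def most_similar(vectors, positive, topn=3):
    # Sum the unit vectors of the query words. Summing vs. averaging
    # does not change the ranking, since cosine ignores vector length.
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    # Score every word that is not part of the query itself.
    scores = [(w, cosine(query, v)) for w, v in vectors.items() if w not in positive]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy embeddings, NOT taken from the trained model.
toy = {
    "jesus":  [1.0, 0.2, 0.0],
    "christ": [0.9, 0.3, 0.1],
    "gospel": [0.8, 0.4, 0.1],
    "sword":  [0.0, 0.1, 1.0],
}
result = most_similar(toy, ["jesus", "christ"])
print(result)
```

With these toy vectors, "gospel" points in nearly the same direction as the query and ranks first, while "sword" points elsewhere and ranks last.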

### Similar to Joy
What's semantically similar to joy? **Salvation, justice, hope**. Good things. Joyful things.

In [118]:
w2v_model.wv.most_similar(positive=["rejoice", "joy"])

[('salvation', 0.9415661096572876),
 ('justice', 0.934103786945343),
 ('glad', 0.9299027919769287),
 ('hope', 0.9257656335830688),
 ('world', 0.9226891994476318),
 ('selah', 0.9220409393310547),
 ('perfect', 0.9212071895599365),
 ('mercy', 0.9183230996131897),
 ('righteousness', 0.9138761758804321),
 ('truth', 0.899572491645813)]

### Doesn't match
Which of these words is not like the others? ***Anger*** is certainly not one of the fruits of the spirit.

In [116]:
w2v_model.wv.doesnt_match(['love', 'anger', 'joy', 'patience', 'peace'])

'anger'
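My understanding is that `doesnt_match` picks the word whose vector is least similar to the average of the whole group. Here is a minimal sketch of that idea with made-up two-dimensional vectors; the numbers and the `doesnt_match` helper below are illustrative and are not gensim's implementation.

```python
import math

def unit(v):
    # Scale a vector to length 1.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match(vectors, words):
    # Normalize each word vector, then average them to get the group's "center".
    units = {w: unit(vectors[w]) for w in words}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    # The odd one out is the word with the lowest cosine similarity to the center.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Hypothetical toy embeddings, NOT taken from the trained model.
toy = {
    "love":     [0.9, 0.1],
    "joy":      [0.8, 0.2],
    "peace":    [0.85, 0.15],
    "patience": [0.7, 0.3],
    "anger":    [0.1, 0.9],
}
odd_one = doesnt_match(toy, ["love", "anger", "joy", "patience", "peace"])
print(odd_one)
```

Four of the toy vectors cluster along one axis, so the center lands near them and "anger" ends up farthest away.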

### Analogy

**Promise** is to **covenant** as **law** is to... **commandment** or *statute* or *ordinance*.

I really like these results. They seem to capture the essence of the concept of "law."

In [135]:
w2v_model.wv.most_similar(positive=["covenant", "law"], negative=["promise"], topn=5)

[('commandment', 0.8345969915390015),
 ('statute', 0.8313448429107666),
 ('ordinance', 0.83086758852005),
 ('accord', 0.7626932263374329),
 ('keep', 0.7570266723632812)]
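The analogy query is vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word. Here is a minimal sketch of that arithmetic, assuming made-up three-dimensional vectors and a hypothetical `analogy` helper; it illustrates the idea rather than reproducing gensim's code.

```python
import math

def unit(v):
    # Scale a vector to length 1.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=1):
    # Build the query: add unit vectors of positive words, subtract negative ones.
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # Rank all words that were not part of the query by cosine similarity.
    used = set(positive) | set(negative)
    scores = [
        (w, sum(a * b for a, b in zip(query, unit(v))))
        for w, v in vectors.items() if w not in used
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy embeddings, NOT taken from the trained model.
toy = {
    "promise":     [1.0, 0.0, 0.2],
    "covenant":    [1.0, 1.0, 0.2],
    "law":         [0.0, 0.1, 1.0],
    "commandment": [0.0, 1.0, 1.0],
    "harvest":     [0.3, -0.2, -0.9],
}
best = analogy(toy, positive=["covenant", "law"], negative=["promise"])
print(best)
```

In the toy data, "covenant" differs from "promise" only along the second axis, so subtracting one from the other isolates that direction. Adding it to "law" lands closest to "commandment."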

### Apostles

In [134]:
w2v_model.wv.most_similar(positive=["peter", "simon", "andrew"])[:5]

[('john', 0.9550307989120483),
 ('james', 0.9499552249908447),
 ('zebedee', 0.9353220462799072),
 ('judas', 0.9187829494476318),
 ('iscariot', 0.9185448288917542)]