<a href="https://colab.research.google.com/github/chi-yan/notebooks/blob/master/NLP_(Bible_Text_Encoding_with_the_Universal_Text_Encoder).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Google Colab notebook uses Natural Language Processing (NLP) techniques
to find the Bible verses that are most similar to a given block of text. Each verse is modeled numerically as a high-dimensional vector (an embedding) obtained from the Universal Sentence Encoder.

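Cosine similarity is used to quantify the similarity (in meaning) between sentences. A cosine similarity close to 1 indicates that two sentences are similar in meaning, while a value close to 0 indicates that they are unrelated. The scores printed at the end of this notebook are far smaller than 1 because the `cosine_similarity` function below divides by the norm of the entire verse-embedding matrix rather than by each verse's own norm, so only their relative order should be read as meaningful.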
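As an illustration of the measure itself, here is a minimal sketch of a row-wise cosine similarity, which normalizes each row by its own norm and therefore yields values on the usual scale from -1 to 1. The function name `rowwise_cosine_similarity` and the toy 2-D vectors are assumptions made up for this example; they are not part of the notebook and are not real encoder output.

```python
import numpy as np

def rowwise_cosine_similarity(matrix, query):
    """Cosine similarity between each row of `matrix` (n, d) and `query` (d,)."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float).ravel()
    row_norms = np.linalg.norm(matrix, axis=1)  # one norm per row
    query_norm = np.linalg.norm(query)
    return (matrix @ query) / (row_norms * query_norm)

# Toy 2-D "embeddings" (hypothetical values, for illustration only).
toy_embeddings = np.array([
    [1.0, 0.0],   # same direction as the query
    [1.0, 1.0],   # 45 degrees away from the query
    [0.0, 2.0],   # orthogonal to the query
    [-3.0, 0.0],  # opposite direction
])
toy_query = np.array([2.0, 0.0])

scores = rowwise_cosine_similarity(toy_embeddings, toy_query)
ranking = np.argsort(scores)[::-1]  # row indices, most similar first
print(scores)
print(ranking)
```

The same function should accept the notebook's arrays, for example `rowwise_cosine_similarity(embeddings.numpy(), embedded_text.numpy())`. Note that this sketch does not guard against all-zero rows, which would cause a division by zero.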

References:

http://www.ijstr.org/final-print/dec2019/Lexical-And-Semantic-Analysis-Of-Sacred-Texts-Using-Machine-Learning-And-Natural-Language-Processing.pdf (see Section 3.4)

https://tfhub.dev/google/universal-sentence-encoder/1

In [2]:
import csv
from operator import itemgetter

import numpy as np  # linear algebra
import tensorflow_hub as hub  # loads the Universal Sentence Encoder
from tabulate import tabulate  # formats the results table

# Query text to compare against every verse (Colab form field).
text = 'The world is created ' #@param {type:"string", run:"auto"}
 
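def cosine_similarity(A, B):
    '''
    Input:
        A: a numpy array of sentence embeddings, one row per verse, shape (n, d)
        B: a numpy array holding the query embedding as a column, shape (d, 1)
    Output:
        cos: numpy array of shape (n, 1) with one similarity score per row of A.
    Note:
        np.linalg.norm(A) on a 2-D array is the norm of the whole matrix, not
        of each row, so every score is scaled down relative to the true cosine
        similarity. The ranking is unaffected only if all rows of A have the
        same norm.
    '''
    dot = np.dot(A, B)
    norma = np.linalg.norm(A)
    normb = np.linalg.norm(B)
    cos = dot / (norma * normb)
 
    return cos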
 
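# Run the expensive setup (download, model load, verse embedding) only once per session.
if 'init' not in locals():
  init = True
  !curl https://raw.githubusercontent.com/EswarGitHub/BibleSearch/master/bible_data_set.csv -o bible_data_set.csv
  d = {}  # verse reference -> verse text
  with open('bible_data_set.csv', 'r') as f:
    for row in csv.reader(f):
      k,_,_,_,v = row
      d[k] = v[:-2]  # drop the last two characters of the text field
  embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
  embeddings = embed(list(d.values()))  # one embedding per verse
  versedict = {}  # row index -> verse reference
  for i, verse in enumerate(d):
    versedict[i] = verse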
 
 
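# Embed the query and score it against every verse.
embedded_text = embed([text])
similarities = cosine_similarity(embeddings.numpy(), embedded_text.numpy().T).flatten()
# Indices of the five highest scores (argpartition returns them unordered).
idx = list(np.argpartition(similarities, -5)[-5:])
# Keep the verse reference, the first 100 characters of its text, and its score.
verses = [[versedict[i],list(d.values())[i][0:100], similarities[i]] for i in idx]
# Sort ascending by score, so the best match is printed last.
sorted_verses = sorted(verses, key=itemgetter(2))
print(tabulate(sorted_verses, headers=["Verse","Text", "Cosine similarity"]))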

Verse          Text                                                                                                    Cosine similarity
-------------  ----------------------------------------------------------------------------------------------------  -------------------
Genesis 2:4    These are the generations of the heavens and of the earth when they were created, in the day that th           0.00203045
Genesis 1:1    In the beginning God created the heaven and the earth.                                                         0.002063
Proverbs 8:23  I was set up from everlasting, from the beginning, or ever the earth was.                                      0.00208032
John 1:10      He was in the world, and the world was made by him, and the world knew him not.                                0.00218441
John 9:5       As long as I am in the world, I am the light of the world.                                                     0.00220383
