# 0. Introduction

*Christiaan M. Erwich*

This notebook works with the matrices produced by the 'levenshtein_pgn_shifts_part1' notebook (whose code is still being cleaned up; publication on GitHub will follow). The basic idea is to find, for every verse in the poetry of the Psalms, parallel verses in the Hebrew Bible on the basis of participant information, and to calculate their similarity ratio (i.e. an evaluation of sameness) with the [Longest Common Subsequence](https://en.wikipedia.org/wiki/Longest_common_subsequence_problem) method. The similarity ratio was calculated with the [Levenshtein Python module](http://www.coli.uni-saarland.de/courses/LT1/2011/slides/Python-Levenshtein.html).
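
As a rough illustration of what a similarity ratio based on the longest common subsequence looks like, here is a minimal pure-Python sketch. It is not the code of the Levenshtein module: the helper names `lcs_length` and `lcs_ratio`, the 0-100 scaling and the two pattern strings are assumptions made for this example only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for (j, y) in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a, b):
    """Similarity on a 0-100 scale: 200 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0  # two empty sequences are treated as identical
    return 200.0 * lcs_length(a, b) / total


# hypothetical participant patterns, invented for illustration
pattern_a = '3sgm 2sgm 3plm'
pattern_b = '3sgm 2sgf 3plm'
print(round(lcs_ratio(pattern_a, pattern_b)))
```

With a threshold of 75, as used later in this notebook, two patterns would count as parallel when this kind of ratio is 75 or higher.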

The matrices (pickle files) produced by part1 can be found in this folder. The three notebooks are part of my research project 'Who is Who in the Psalms: A Computational Analysis of Participants and Their Networks', funded by NWO. Part of the project is to find patterns of participant shifts (i.e. shifts in person, gender and number, PGN for short) in the Psalms, in order to enable the identification of those participants. The PGN patterns generated with this notebook are an experimental starting point for the identification of participants in the Psalms.

The notebooks were used for a presentation that I gave at the "Plotting Poetry: On Mechanically Enhanced Reading" conference in Basel on October 6th 2017. The presentation can be downloaded from this folder as well. 

Most of the code in this notebook is borrowed from two notebooks written by Dirk Roorda:


* [Kings and parallels](https://github.com/ETCBC/parallels/blob/master/programs/kings_ii.ipynb/)
* [Parallels](https://github.com/ETCBC/parallels/blob/master/programs/parallels.ipynb/)

# 1. The program
The program starts in the next cell, with the loading of several modules that are important for the analysis of the Psalms and their participant patterns. 

In [3]:
import sys, os, re, pickle
import collections, difflib
from Levenshtein import ratio

# (sudo -H) pip(3) install matplotlib

from IPython.display import HTML, display_pretty, display_html
import networkx as nx
import numpy as np
from pandas import DataFrame, read_csv
import pandas as pd
import matplotlib.pyplot as plt  # plt is used for the graph in section 2.4
%matplotlib inline
from random import random

from pprint import pprint
from tf.fabric import Fabric
from tf.transcription import Transcription

## 1.1 Data source
The ETCBC database is used in version 4c. It can be downloaded from the GitHub repo [text-fabric-data](https://github.com/ETCBC/bhsa). The data obtained through GitHub is in a format that is immediately ready to be used by Text-Fabric, and hence by this notebook as well.

In [11]:
source = 'etcbc'
version = '4c'
ETCBC = 'hebrew/{}{}'.format(source, version)

DATABASE = '~/github/etcbc'
BHSA = 'bhsa/tf/2016'
TF = Fabric(locations=[DATABASE], modules=[BHSA], silent=False )

This is Text-Fabric 3.0.3
Api reference : https://github.com/Dans-labs/text-fabric/wiki/Api
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

107 features found and 0 ignored


## 1.2 Data features
Some features from the ETCBC database are used. You see them in the code below. Their documentation can be found through the SHEBANQ help function or via this direct link: [insert link]

In [8]:
api = TF.load('''
    otype
    lex lex_utf8 g_word_utf8 trailer_utf8
    book chapter verse label number
    nu ps gn vs vt prs ls lex g_cons
    function txt domain rela code gloss
    sp kind typ pdp language
''')

api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.00s Feature overview: 102 for nodes; 4 for edges; 1 configs; 7 computed
  0.07s All features loaded/computed - for details use loadLog()


## 1.3 Configuration information
Here we define the constants that refer to the files we read and write. The files we read are the results of part1, so the notebook has to be told where to find them.

CHUNK_GREP, MATRIX_GREP and SIM_THRESHOLD_GREP define the similarity method by which we grab the verses that are similar to verses in the Psalms.

Throughout this notebook English names for the books of the Bible are used.

In [12]:
# the language of the book names
LANG = 'en'

# the book and chapters that are central to our study
REFBOOKS = {'Psalms'} # Psalms
REFCHAPTERS = set(range(89,90)) # Psalms 89


TF_OUTPUT = os.path.expanduser('~/text-fabric-output')
# the results of the parallel notebook.
CROSSREF_APP = 'parallels'
# directory of computed intermediary results of parallel.
PRECOMP_DIR = '{}/{}{}/{}/{}'.format(TF_OUTPUT, source, version, CROSSREF_APP, 'stored')
# precomputed list of verse chunks
CHUNK_GREP = '{}/chunks/chunk_{}_{}'.format(PRECOMP_DIR, 'O', 'verse') 
# precomputed matrix of similarities based on verse chunks and the LCS method
MATRIX_GREP = '{}/matrices/matrix_{}_{}_{}_{}'.format(PRECOMP_DIR, 'O', 'verse', 'LCS', 75)

MATRIX_GREP_PATTERNS_VERSE = '{}/matrices/matrix_{}_{}_{}_{}_withpatterns'.format(PRECOMP_DIR, 'O', 'verse', 'LCS', 75)


# the similarity threshold above which we consider verses similar
SIM_THRESHOLD_GREP = 75

# output files
NCOL_FILE = 'psalms_crossrefs.ncol'           # graph of similar verses to Psalms 
SIMILAR_FILE = 'psalms_similarities.tsv'      # refined similarities based on sentences

## 1.4 Book name index

The book of interest is the Psalms, which is analysed separately from the other books. Therefore a virtual book 'Psalmsr' is introduced for the reference chapters that we want to study.
The ETCBC database uses Latin names for the Bible books. For ease of reference, they are translated to conventional English names.
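
To make the role of the reference chapters concrete, the following sketch shows how a sort key of this kind puts the focus chapter in front of all other passages. The node numbers in `book_node` and the passages are invented; in the notebook the nodes come from Text-Fabric.

```python
# hypothetical node numbers, standing in for the book nodes from Text-Fabric
book_node = {'Genesis': 1001, 'Psalms': 1027, 'Job': 1028}
REFBOOKS = {'Psalms'}
REFCHAPTERS = {89}

PASSAGE_FMT = '{}~{}:{}'
PASSAGER_FMT = '{}r~{}:{}'  # pseudo-book Psalmsr for the focus chapter


def passage_key(p):
    (bk, ch, vs) = p
    if bk in REFBOOKS and ch in REFCHAPTERS:
        return (-1, ch, vs)  # -1 sorts before every real book node
    return (book_node[bk], ch, vs)


passages = [('Job', 3, 1), ('Psalms', 89, 2), ('Genesis', 1, 1), ('Psalms', 18, 5)]
ordered = sorted(passages, key=passage_key)
print(ordered)
print(PASSAGER_FMT.format('Psalms', 89, 2), PASSAGE_FMT.format('Psalms', 18, 5))
```

The focus verse comes first, followed by the other passages in the order of their book nodes.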

In [14]:
book_node = dict()
for b in F.otype.s('book'):
    book_name = T.bookName(b, lang=LANG)
    book_node[book_name] = b
    if book_name == 'Psalms': 
        book_node[book_name+'r'] = b

def passage_key(p):
    (bk, ch, vs) = p
    return (-1, ch, vs) if bk in REFBOOKS and ch in REFCHAPTERS else (book_node[bk], ch, vs)

# the format of verse references
PASSAGE_FMT = '{}~{}:{}'
PASSAGER_FMT = '{}r~{}:{}' # used for the pseudo-book Psalmsr

# 2. Parallels within the Masoretic Text
## 2.1 Grep all parallel verses

The aim is to find the verses that are similar to any verse in the Psalms. To do so, we use one of the similarity matrices computed by the parallels notebook: the matrix computed with the LCS method applied to verses. From it we extract the similarities of 75 or higher. These specifications are stored in the variables CHUNK_GREP, MATRIX_GREP and SIM_THRESHOLD_GREP. Every similarity that is found is a pair of verse references, at least one of which is in the reference chapter of the Psalms. That reference is rendered as being in the book Psalmsr, because in a later visualisation the focus chapter is placed in a separate column.
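
The selection step can be sketched on a toy matrix as follows. The verse references and ratios are invented, and verse tuples are used directly as keys, whereas the real matrix is keyed by chunk numbers that are first resolved to verses.

```python
SIM_THRESHOLD = 75
REFBOOKS = {'Psalms'}
REFCHAPTERS = {89}

# toy similarity matrix: (verse, verse) -> ratio; all values are invented
toy_dist = {
    (('Psalms', 89, 4), ('2_Samuel', 7, 12)): 80.0,
    (('Isaiah', 55, 3), ('Psalms', 89, 29)): 76.5,
    (('Genesis', 1, 1), ('Exodus', 1, 1)): 90.0,   # neither verse is in focus
    (('Psalms', 89, 2), ('Psalms', 18, 3)): 60.0,  # below the threshold
}


def in_focus(ref):
    (bk, ch, vs) = ref
    return bk in REFBOOKS and ch in REFCHAPTERS


crossrefs = set()
for ((ref1, ref2), r) in toy_dist.items():
    if r < SIM_THRESHOLD:
        continue
    # orient each pair so that the focus verse comes first
    if in_focus(ref1):
        crossrefs.add((ref1, ref2, r))
    elif in_focus(ref2):
        crossrefs.add((ref2, ref1, r))

for (x, y, r) in sorted(crossrefs):
    print('{}r~{}:{}'.format(*x), '{}~{}:{}'.format(*y), round(r))
```

Only the two pairs that involve the focus chapter and reach the threshold survive, each with the focus verse in first position.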

If you don't want to visualise the data, skip 2.3-2.4 and run all cells from 3. onwards. If you have already run 2.3-2.4, make sure you save the notebook, restart it and then run the cells from 3. onwards.

## 2.2 Run the cell below for the network visualisation

In [None]:
with open(CHUNK_GREP, 'rb') as f: chunks = pickle.load(f)
with open(MATRIX_GREP, 'rb') as f: grep_dist = pickle.load(f)

def get_verse_ref(chunk):
    sec = T.sectionFromNode(chunks[chunk][0], lang=LANG)
    vn = T.nodeFromSection(sec, lang=LANG)
    return (vn, sec)

all_verse_nodes = set()
n_internal = 0
x = 0
crossrefs = set()
for ((c1, c2), r) in grep_dist.items():
    if r < SIM_THRESHOLD_GREP: continue
    (v1, (bk1, ch1, vs1)) = get_verse_ref(c1)
    (v2, (bk2, ch2, vs2)) = get_verse_ref(c2)
    # uncomment the next three lines to exclude internal references
    
    #if bk1 in REFBOOKS and ch1 in REFCHAPTERS and bk2 in REFBOOKS and ch2 in REFCHAPTERS:
        #n_internal += 1
        #continue
    if bk1 in REFBOOKS and ch1 in REFCHAPTERS:
        bkx = bk1
        chx = ch1
        vsx = vs1
        bky = bk2
        chy = ch2
        vsy = vs2
    elif bk2 in REFBOOKS and ch2 in REFCHAPTERS:
        bkx = bk2
        chx = ch2
        vsx = vs2
        bky = bk1
        chy = ch1
        vsy = vs1
    else:
        continue
    crossrefs.add(((bkx, chx, vsx), (bky, chy, vsy), r))
    all_verse_nodes |= {v1, v2}

info('{} external crossrefs saved; {} internal crossrefs skipped; from total {} crossrefs'.format(
    len(crossrefs), n_internal, len(grep_dist),
))

print('\n'.join('{}r\t{}\t{}\t{}\t{}\t{}\t{}'.format(*x[0], *x[1], round(x[2])) for x in sorted(crossrefs)[0:20]))

## 2.3 Store similarities as a graph

The similarities found so far are now visualised as a graph, where the verses are nodes and the similarities are edges. To that end the similarities are stored in a format such that graph software can read it.

We write out the graph data as a file in '.ncol' format, and we will use the python package networkx to read and process that file.

We also produce:
* a set of all verses encountered
* a set of all chapters encountered
* a set of all books encountered
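
The following sketch shows what lines in the '.ncol' format look like and how the sets of verses and books follow from them. It writes to an in-memory buffer instead of a file and parses the lines by hand, as a stand-in for what networkx does when it reads the file; the edges are invented.

```python
import io

PASSAGE_FMT = '{}~{}:{}'
PASSAGER_FMT = '{}r~{}:{}'

# invented edges: (focus verse, parallel verse, ratio)
edges = [
    (('Psalms', 89, 2), ('Psalms', 18, 3), 80.4),
    (('Psalms', 89, 4), ('Isaiah', 55, 3), 76.0),
]

buf = io.StringIO()  # stands in for the .ncol file on disk
for (x, y, r) in edges:
    buf.write('{} {} {}\n'.format(PASSAGER_FMT.format(*x), PASSAGE_FMT.format(*y), round(r)))

# each line is "node node weight", separated by whitespace
parsed = []
for line in buf.getvalue().splitlines():
    (a, b, w) = line.split()
    parsed.append((a, b, float(w)))

verses = {a for (a, b, w) in parsed} | {b for (a, b, w) in parsed}
books = {v.split('~')[0] for v in verses}
print(parsed)
print(sorted(books))
```

Because verse labels contain no spaces, a plain whitespace split is enough to recover both nodes and the weight.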

In [18]:
info('Exporting graph info, assembling sets')
with open(NCOL_FILE, 'w') as ncolfile:
    for (x, y, r) in sorted(crossrefs, key=lambda z: (
            book_node[z[0][0]], z[0][1], z[0][2],
            book_node[z[1][0]], z[1][1], z[1][2],
    )):
        ncolfile.write('{} {} {}\n'.format(PASSAGER_FMT.format(*x), PASSAGE_FMT.format(*y), round(r)))

all_verses = {(x[0][0]+'r', x[0][1], x[0][2]) for x in crossrefs} | {x[1] for x in crossrefs}
all_chapters = {(x[0], x[1]) for x in all_verses}
all_books = {x[0] for x in all_chapters}
info('{} edges, {} verses, {} chapters, {} books'.format(
    len(crossrefs), len(all_verses), len(all_chapters), len(all_books),
))
print(' '.join(sorted(all_books)))


## 2.4 Graph visualization

The similarities are visualised in a graph using networkx. The layout is done manually, not following any of the methods provided by networkx.
The verses are put into columns by the book they occur in, and the focus chapters occupy a separate column, thanks to the pseudo-book Psalmsr. Psalms stands for the other chapters of Psalms. The rows of verses are ordered textually.
Finally, the rows of verses are shifted up and down in order to align them with their parallel stretches. The thickness of the edges corresponds to the degree of similarity, and likewise, the blacker the edge, the more similar the pair of verses.
In the cell below the graph data file is read, the layout is computed, and the graph is plotted and saved as PDF.
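
The manual layout can be sketched without any plotting: every book gets a column, verses are stacked in textual order within their column, and a per-book offset shifts the column up or down. The verses, the column order and the offsets below are invented.

```python
import collections

# invented verses; Psalmsr is the pseudo-book holding the focus chapter
all_verses = [('Psalmsr', 89, 2), ('Psalmsr', 89, 4), ('Isaiah', 55, 3), ('Psalms', 18, 3)]
book_order = ['Isaiah', 'Psalmsr', 'Psalms']         # left-to-right column order
offset_y = {'Isaiah': 1, 'Psalmsr': 0, 'Psalms': 2}  # hand-tuned vertical shifts

pos_x = {bk: i for (i, bk) in enumerate(book_order)}

# group the verse labels per book, in textual order
verse_lists = collections.defaultdict(list)
for (bk, ch, vs) in sorted(all_verses):
    verse_lists[bk].append('{}:{}'.format(ch, vs))

# node label -> (column, row), the shape of positions that networkx drawing expects
pos = {}
for (bk, chvs_list) in verse_lists.items():
    for (i, chvs) in enumerate(chvs_list):
        pos['{}~{}'.format(bk, chvs)] = (pos_x[bk], i + offset_y[bk])

print(pos)
```

Tuning the offsets by hand is what lets parallel stretches line up horizontally in the final figure.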

In [None]:
# read the graph data
g = nx.read_weighted_edgelist(NCOL_FILE)

# order the books for handy layout in columns
all_books_cust = '''
    Job 1_Kings 1_Samuel 2_Chronicles 2_Kings Proverbs Amos Daniel Deuteronomy Ecclesiastes Exodus 
    Ezekiel Hosea Genesis Habakkuk Ezra 1_Chronicles Psalmsr Numbers Joshua Isaiah Joel Jeremiah Judges Lamentations 
    Leviticus Malachi Micah Nahum Nehemiah Zechariah 2_Samuel Song_of_songs Psalms
'''.strip().split()

# Colors used for the graph

#b: blue
#g: green
#r: red
#c: cyan
#m: magenta
#y: yellow
#k: black


gcolors = {'1_Chronicles':'b', '1_Kings':'g', '1_Samuel':'r', '2_Chronicles':'c', '2_Kings':'m', '2_Samuel':'y', 
            'Amos':'k', 'Daniel':'b', 'Deuteronomy':'g','Ecclesiastes':'r', 'Exodus':'c', 'Ezekiel':'m', 
            'Ezra':'y', 'Genesis':'k', 'Habakkuk':'b', 'Hosea':'g', 'Isaiah':'r', 'Psalmsr':'c', 'Psalms':'m', 
            'Jeremiah':'y', 'Job':'k', 'Joel':'b', 'Joshua':'g', 'Judges':'r', 'Lamentations':'c', 'Leviticus':'m', 
            'Malachi':'y', 'Micah':'k', 'Nahum':'b', 'Nehemiah':'g', 'Numbers':'r', 'Proverbs':'c', 'Song_of_songs':'c', 
            'Zechariah':'k'}

#  specify vertical positions of passages
offset_y = {'1_Chronicles': 1, '1_Kings': 30, '1_Samuel': 20, '2_Chronicles': 12, '2_Kings': 3, '2_Samuel': 15, 
            'Amos': 30, 'Daniel': 40, 'Deuteronomy': 20,'Ecclesiastes': 3, 'Exodus': 50, 'Ezekiel': 140, 
            'Ezra': 1, 'Genesis': 140, 'Habakkuk': 17, 'Hosea': 10, 'Isaiah': 120, 'Psalmsr': 60, 'Psalms': 1, 
            'Jeremiah': 5, 'Job': 40, 'Joel': 25, 'Joshua': 1, 'Judges': 18, 'Lamentations': 120, 'Leviticus': 5, 
            'Malachi': 145, 'Micah': 78, 'Nahum': 5, 'Nehemiah': 6, 'Numbers': 1, 'Proverbs': 110, 'Song_of_songs': 1, 
            'Zechariah': 150}

# compute positions of verses
ncolors = [gcolors[x.split('~')[0]] for x in g.nodes()]
nlabels = dict((x, x.split('~')[1]) for x in g.nodes())
ncols = len(all_books)
pos_x = dict((x, i) for (i,x) in enumerate(all_books_cust))
verse_lists = collections.defaultdict(lambda: [])
for (bk, ch, vs) in sorted(all_verses):
    verse_lists[bk].append('{}:{}'.format(ch, vs))
nrows = max(len(verse_lists[bk]) for bk in all_books_cust)
pos = {}
for bk in verse_lists:
    for (i, chvs) in enumerate(verse_lists[bk]):
        pos['{}~{}'.format(bk, chvs)] = (pos_x[bk], i+offset_y[bk])

# start plotting
plt.figure(figsize=(50,30))

nx.draw_networkx(g, pos,
    width=[g.get_edge_data(*x)['weight']/40 for x in g.edges()],
    edge_color=[g.get_edge_data(*x)['weight'] for x in g.edges()],
    edge_cmap=plt.cm.Greys,
    edge_vmin=50,
    edge_vmax=100,
    node_color=ncolors,
    node_size=200,
    labels=nlabels,
    alpha=0.4,
    linewidths=0,
)
plt.ylim(-2, 160) #-2, 310
book_font_size = 12 # 25
plt.grid(True, which='both', axis='x') # flag passed positionally: newer Matplotlib versions renamed the keyword b to visible
plt.title('Parallel Patterns of Participant Shifts in Psalm 89', fontsize=40)
plt.text(-1,120, '''
Parallel patterns of participant shifts 
(i.e. shifts in person, number and gender) 
in all verses of Ps 89 compared 
to other books in the Hebrew Bible.

Parallel verses shown in this graph have 
a similarity of 75% or higher. 

''', #bbox=dict(width=145, height=200, facecolor='yellow', alpha=0.4), fontsize=12)
     # suddenly the width and height keyword args are no longer accepted.
     # bbox performs an auto fit # plt.text(-1,26, 
     bbox=dict(facecolor='yellow', alpha=0.4), fontsize=15)

# add additional book labels
for (ypos, books) in (
    (-1, all_books_cust),
    (50, ['1_Chronicles']),
    (25, ['1_Kings']),
    (18, ['1_Samuel']),
    (9, ['2_Chronicles']),
    (8, ['2_Kings']),
    (12, ['2_Samuel']),
    (25, ['Amos']),
    (37, ['Daniel']),
    (17, ['Deuteronomy']),
    (10, ['Ecclesiastes']),
    (45, ['Exodus']),
    (137, ['Ezekiel']),
    (30, ['Ezra']),
    (137, ['Genesis']),
    (15, ['Habakkuk']),
    (18, ['Hosea']),
    (135, ['Isaiah']),
    (106, ['Psalmsr']),
    (158, ['Psalms']),
    (10, ['Jeremiah']),
    (103, ['Job']),
    (27, ['Joel']),
    (38, ['Joshua']),
    (15, ['Judges']),
    (130, ['Lamentations']),
    (10, ['Leviticus']),
    (141, ['Malachi']),
    (81, ['Micah']),
    (8, ['Nahum']),
    (23, ['Nehemiah']),
    (52, ['Numbers']),
    (105, ['Proverbs']),
    (4, ['Song_of_songs']),
    (145, ['Zechariah']),
    (158, all_books_cust),
):
    for bk in books:
        plt.text(pos_x[bk], ypos, bk, fontsize=book_font_size, horizontalalignment='center')

# save the plot as .pdf
plt.savefig('psalms_parallels-ps89.pdf')

# 3. Create a dataset with extra information

Now that we have had a quick glance at the parallel verses with the help of the network visualisation, we create a dataset to which we add some extra information.

That extra information consists of two parts:

1. A categorisation of the books of the Hebrew Bible into three main genres: prose, prophecy and poetry. This tells us the genre of the parallels that are found for the book or chapter we want to study.
2. The Psalm collections to which the individual Psalms belong. The network shows that Psalm 89 has a large number of parallels within the Psalms, so the collection is added as a second piece of information.

The added information of genre and collection enables future analyses of the genre of the Psalm that is studied.

If you have already run the cells of 2.3-2.4, make sure you save the notebook, restart it and then run the cells from 3. onwards.

## 3.1 Genre dictionary for all books in the Hebrew Bible

In [1]:
prose = ['Genesis', 'Exodus', 'Leviticus', 'Numbers', 'Deuteronomy', 'Joshua', 'Judges', '1_Samuel', '2_Samuel', 
         '1_Kings', '2_Kings', 'Jonah', 'Ruth', 'Esther', 'Daniel', 'Ezra', 'Nehemiah', '1_Chronicles', '2_Chronicles']
prophecy = ['Isaiah', 'Jeremiah', 'Ezekiel', 'Hosea', 'Joel', 'Obadiah', 'Micah', 'Zephaniah', 'Haggai', 'Zechariah', 
            'Malachi', 'Amos', 'Nahum', 'Habakkuk']
poetry = ['Song_of_songs','Proverbs','Ecclesiastes', 'Lamentations', 'Psalms', 'Job']
genre_dict = {}

# map every book to the name of the genre list it belongs to
for (genre_name, books) in (('prose', prose), ('prophecy', prophecy), ('poetry', poetry)):
    for book in books:
        genre_dict[book] = genre_name

## 3.2 Organize all 150 Psalms in collections in a dictionary

In [2]:
introduction = [1,2]

davidic = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
           29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 138, 139, 140, 141, 142, 143, 144, 145]

elohistic_korahite = [42, 43, 44, 45, 46, 47, 48, 49]

elohistic_davidic = [51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72]

elohistic_asaphite = [50, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83]

korahite = [84, 85, 87, 88]

song_of_ascents = [120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134]

doxology = [146, 147, 148, 149, 150]

unclassified = [86, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109,
                110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 135, 136, 137]


collection_dict = {}

collections_by_name = {
    'introduction': introduction,
    'davidic': davidic,
    'elohistic_korahite': elohistic_korahite,
    'elohistic_davidic': elohistic_davidic,
    'elohistic_asaphite': elohistic_asaphite,
    'korahite': korahite,
    'song_of_ascents': song_of_ascents,
    'doxology': doxology,
    'unclassified': unclassified,
}

# map every psalm number to the name of its collection
for (collection_name, psalm_numbers) in collections_by_name.items():
    for psalm_number in psalm_numbers:
        collection_dict[psalm_number] = collection_name
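
Since the collections are typed in by hand, a quick sanity check can confirm that every Psalm from 1 to 150 is assigned to exactly one collection. The sketch below restates the lists above as ranges so that it runs on its own; the names `collections_by_name`, `missing` and `duplicated` are chosen for this example.

```python
from collections import Counter

# the same collections as above, written as ranges for brevity
collections_by_name = {
    'introduction': [1, 2],
    'davidic': list(range(3, 42)) + list(range(138, 146)),
    'elohistic_korahite': list(range(42, 50)),
    'elohistic_davidic': list(range(51, 73)),
    'elohistic_asaphite': [50] + list(range(73, 84)),
    'korahite': [84, 85, 87, 88],
    'song_of_ascents': list(range(120, 135)),
    'doxology': list(range(146, 151)),
    'unclassified': [86] + list(range(89, 120)) + list(range(135, 138)),
}

# count how often each psalm number occurs across all collections
counts = Counter(n for numbers in collections_by_name.values() for n in numbers)
missing = sorted(set(range(1, 151)) - set(counts))
duplicated = sorted(n for (n, c) in counts.items() if c > 1)
print('missing:', missing, 'duplicated:', duplicated)
```

Both lists should come out empty; a non-empty list points at a typing error in one of the collections.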

## 3.3 Create dataset from matrix 

In [None]:
with open(CHUNK_GREP, 'rb') as f: chunks = pickle.load(f)
with open(MATRIX_GREP_PATTERNS_VERSE, 'rb') as f: grep_dist_patterns = pickle.load(f)

def get_verse_ref(chunk):
    sec = T.sectionFromNode(chunks[chunk][0], lang=LANG)
    vn = T.nodeFromSection(sec, lang=LANG)
    return(vn, sec)

all_verse_nodes = set()
crossrefs_patterns_test = set()
all_verse_nodes_2 = []

n_internal = 0
x = 0
crossrefs_patterns = set()
for (c1, c2, r, c4, c5) in grep_dist_patterns:
    (v1, (bk1, ch1, vs1)) = get_verse_ref(c1)
    (v2, (bk2, ch2, vs2)) = get_verse_ref(c2)
    if r < SIM_THRESHOLD_GREP: continue
    if bk1 in REFBOOKS and ch1 in REFCHAPTERS:
        bkx = bk1
        chx = ch1
        vsx = vs1
        bky = bk2
        chy = ch2
        vsy = vs2
        pattern1 = c4
        pattern2 = c5
    elif bk2 in REFBOOKS and ch2 in REFCHAPTERS:
        bkx = bk2
        chx = ch2
        vsx = vs2
        bky = bk1
        chy = ch1
        vsy = vs1
        pattern1 = c5
        pattern2 = c4
    else:
        continue
    book_genre = genre_dict[bky]
    psalm_collection = ''
    if bky == 'Psalms' and chy in collection_dict:
        psalm_collection = collection_dict[chy]
    else:
        psalm_collection = 'other'
    crossrefs_patterns.add((bkx, str(chx), str(vsx), bky, str(chy), str(vsy), book_genre, psalm_collection, str(round(r)), pattern1, pattern2))
    crossrefs_patterns_test.add(((bkx, chx, vsx), (bky, chy, vsy), book_genre, str(round(r)), c4, c5))
    all_verse_nodes |= {v1, v2}
    all_verse_nodes_2.append((v1, v2))

info('{} external crossrefs saved; from total {} crossrefs'.format(
    len(crossrefs_patterns), len(grep_dist_patterns),
))

# quick glance at the data
for i in sorted(crossrefs_patterns)[:20]:
    pprint(i)

## 3.4 Write CSV to disk

The CSV file is analysed with Pandas in the levenshtein_png_shifts_visualisation_part3 notebook. 

In [None]:
PATTERNS_OUTPUT = "participant-patterns-ratio-s75-verses-psalm-89-with-collection.csv"

In [None]:
info('Writing csv file')

with open(PATTERNS_OUTPUT, 'w') as f:
    header = ['book', 'chapter', 'verse', 'parallel_book', 'parallel_chapter', 'parallel_verse', 'book_genre_parallel', 
              'psalm_collection', 'ratio_pattern', 'pattern_psalm', 'pattern_parallel']
    f.write('{}\n'.format(','.join(header)))

    # every row is a tuple of eleven strings, in the order of the header
    for x in sorted(crossrefs_patterns):
        f.write('{}\n'.format(','.join(x)))
        
info('Done writing')
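
To check the result, the CSV can be read back and summarised. The sketch below uses the standard library `csv` module on an in-memory sample rather than Pandas, which is what the part3 notebook uses; the two data rows and the pattern values `P1` to `P4` are invented placeholders.

```python
import collections
import csv
import io

# in-memory sample with the same column names as the header written above
sample = io.StringIO(
    'book,chapter,verse,parallel_book,parallel_chapter,parallel_verse,'
    'book_genre_parallel,psalm_collection,ratio_pattern,pattern_psalm,pattern_parallel\n'
    'Psalms,89,2,Psalms,18,3,poetry,davidic,80,P1,P2\n'
    'Psalms,89,4,Isaiah,55,3,prophecy,other,76,P3,P4\n'
)

rows = list(csv.DictReader(sample))

# count the parallels per genre of the parallel book
genre_counts = collections.Counter(row['book_genre_parallel'] for row in rows)
print(genre_counts.most_common())
```

Note that the file is written by joining fields with commas, so this only reads back correctly as long as no pattern itself contains a comma.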