# Synthesizing Hebrew Dictionaries

The following notebook iterates over a series of dictionaries found in open-source repositories to create a single, cohesive reference for words found in the Torah, including the following information:

* A variety of word definitions from different sources such as:
    1. Strong's Concordance of the KJV
        1. Unformatted Strong's JSON -> https://raw.githubusercontent.com/openscriptures/strongs/master/hebrew/strongs-hebrew-dictionary.js
        2. Strong's Concordance XML with extra definitions -> https://raw.githubusercontent.com/openscriptures/strongs/master/hebrew/StrongHebrewG.xml
    2. 
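
The first processing cell below stores each lemma twice, once with its niqqud (vowel points) and once without. A minimal sketch of that stripping step, using only the standard library, might look like this. The helper name `strip_niqqud` and the sample word are illustrative and do not come from the source files:

```python
import unicodedata

def strip_niqqud(word):
    """Return `word` without combining marks (niqqud, cantillation)."""
    # NFKD splits precomposed characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", word)
    # Combining marks have a non-zero canonical combining class.
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Alef + qamats + bet, written with escapes so the vowel point stays visible.
pointed = "\u05d0\u05b8\u05d1"
unpointed = strip_niqqud(pointed)
```

Text that carries no combining marks, such as plain ASCII, passes through unchanged.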

In [1]:
#Import the libraries used throughout this notebook.
import json
import unicodedata
import re
import pprint
from statistics import median
import xml.etree.ElementTree as ET
from bidi.algorithm import get_display
import sys
from dynamodb_json import json_util as dynamo
from boto3.dynamodb.types import TypeSerializer

In [None]:
#Add a niqqud-free copy of each lemma to every Strong's entry.
with open("definitions/unformatted_strongs.json", "r", encoding="utf-8") as fileIn, \
     open("definitions/added_non_nikud_defs.json", "w", encoding="utf-8") as fileOut:
    for line in fileIn:
        #The Strong's id is everything before the first colon.
        hebID = line[:line.index(':')]
        nikud_word = line[line.index('"lemma"') + 9 : line.index('","xlit"')]
        #Decompose the word, then drop the combining marks to strip the niqqud.
        non_nikud_word = "".join(c for c in unicodedata.normalize('NFKD', nikud_word) if not unicodedata.combining(c))
        new_def = f'{{ "hebID" : {hebID}, "no_nikud" : "{non_nikud_word}", '
        #Drop the line's last two characters before re-terminating it.
        new_def += line[line.index('"lemma"'):len(line) - 2] + "\n"
        fileOut.write(new_def)

We want each word's definition to include its morphology (i.e. the structure of the word: gender, plurality, etc.). Unfortunately, that information lives in a different file, so we will have to splice it into the main definition.

In [None]:
fileIn = open('definitions/xml_defs.xml', 'r', encoding="utf-8")
morph_list = []
for line in fileIn:
    if('morph=' in line):
        morph = line[line.index('morph=') + 6 : line.index(' POS=')]
        morph_list.append(morph)
fileIn.close()

While the default definitions in the Strong's reference are good, they can be verbose and obscure. There are better definitions in the file "xml_defs.xml", so we will take those and add them to our default definitions. We will start by adding all the definitions to a list to be used later.
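
As a point of comparison with the line-by-line scan used here, the same `<list>`/`<item>` extraction can be sketched with `xml.etree.ElementTree` and `json.dumps`, which takes care of escaping quotation marks. The sample markup is a made-up fragment, and the helper name `collect_definition_lists` is hypothetical. The real layout of `xml_defs.xml` (nesting, attributes, namespaces) is assumed here rather than confirmed:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical fragment: two entries, each with one <list> of <item> definitions.
sample = """
<entries>
  <entry><list><item>1) father</item><item>1a) "head" of a household</item></list></entry>
  <entry><list><item>1) mother</item></list></entry>
</entries>
"""

def collect_definition_lists(xml_text):
    """Return one JSON array string per <list> element, in document order."""
    root = ET.fromstring(xml_text)
    results = []
    for def_list in root.iter("list"):
        items = [item.text or "" for item in def_list.findall("item")]
        # json.dumps escapes embedded quotation marks.
        results.append(json.dumps(items, ensure_ascii=False))
    return results

all_defs_sketch = collect_definition_lists(sample)
```

Each element of `all_defs_sketch` is a JSON array string, the same shape the cell below builds by hand.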

In [None]:
all_defs = []
fileIn = open("definitions/xml_defs.xml", 'r', encoding="utf-8")
line = fileIn.readline()
index = 1
while line:
    #If we come across a list of definitions
    if '<list>' in line:
        defs = "["
        line = fileIn.readline()
        #While there are still more definitions to be read
        while '</list>' not in line:
            line = line.strip()
            startTagIndex = line.index('<item>')
            endTagIndex = line.index('</item>')
            #Escape the quotation marks within the line.
            line = line[startTagIndex + 6 : endTagIndex].replace('"', '\\"')
            defs += f'"{line}", '
            line = fileIn.readline()
        defs = defs[:len(defs) - 2] + "]"
        all_defs.append(defs)
        defs = []
        index += 1
    line = fileIn.readline()
fileIn.close()

We will now take the definitions we have acquired, along with the morphology of each word, and insert them into the entry alongside all the previous definition information.
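
The cell below splices the new fields into each line by string position. For reference, the same merge can be sketched by parsing the line as JSON, which avoids depending on where a key sits in the text. This is an alternative that assumes every line is a complete JSON object. The sample line, the helper `merge_fields`, and the values passed to it are invented for illustration:

```python
import json

def merge_fields(main_line, morph, defs):
    """Add "morph" and "all_defs" to one JSON-encoded entry."""
    record = json.loads(main_line)
    record["morph"] = morph
    record["all_defs"] = defs
    # ensure_ascii=False keeps the Hebrew text readable in the output.
    return json.dumps(record, ensure_ascii=False)

# Hypothetical entry; the field values are placeholders.
sample_line = '{"hebID": "H1", "no_niqqud": "\u05d0\u05d1", "xlit": "ab"}'
merged = merge_fields(sample_line, "n-m", ["1) father"])
```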

In [None]:
#Used to insert the morphology into the definitions line.
def insertMorph(main_line, morph):
    xlit_index = main_line.index('"xlit":')
    return main_line[:xlit_index] + f' "morph":{morph}, ' + main_line[xlit_index:]

#Used to insert the additional defs into the definitions line.
def insertDefs(main_line, defs):
    defs = f', "all_defs" : {defs}}}\n'
    return main_line[:len(main_line) - 2] + defs

fileIn = open("definitions/added_non_nikud_defs.json", "r", encoding = "utf-8")
fileOut = open("definitions/added_more_defs.json", "w", encoding = "utf-8")
for i in range(8674):
    main_line = fileIn.readline()
    main_line = insertMorph(main_line,morph_list[i])
    main_line = insertDefs(main_line, all_defs[i])
    main_line = main_line.replace("nikud", "niqqud", 1)
    main_line = main_line.replace("lemma", "niqqud", 1)
    fileOut.write(main_line)
fileIn.close()
fileOut.close()

Though quite inefficient in hindsight, we will now load all the new definitions from the file added_more_defs into a list of dictionaries so that they can be processed into a more manageable format. Many entries contain sub-definitions labeled with an awkward system of numbers and letters rather than one of depths (i.e. top definition, then sub-definition, then another sub-definition, and so on).

In [None]:
dictionary_defs = []
fileIn = open("definitions/added_more_defs.json", 'r', encoding = 'utf-8')
for word in fileIn:
    all_defs_index = word.index('"all_defs"')
    dictionary_defs.append(json.loads("{" + word[all_defs_index:]))
fileIn.close()

The following then takes all the Hebrew words and their dictionary definitions and processes them.

The code below does several things to clean up the extra definitions that we will be adding to our words.
First, for each word's definitions, it replaces the obsolete, single-depth ordering scheme with one that uses JSON objects and lists to distinguish definitions from sub-definitions.

This first part is not entirely perfect and leaves many empty definitions alongside the proper ones, so the second part uses a flattening algorithm that removes the empty definitions and leaves only the original ones. We use this approach instead of fixing the main organization algorithm because the code only runs once and the organization algorithm does not corrupt the original definitions, which is the most important part.

The third part removes the numbering that still remains, and the fourth part converts the nested lists of definitions into dictionaries so that they can be stored easily as JSON objects.
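
To make the target shape concrete, here is a simplified, stack-based sketch of depth-based nesting. It is not the recursive routine used in the next cell. It assumes every label has the form `1)`, `1a)`, `1a1)`, so that the label length gives the depth, and the sample definitions and the name `nest_by_depth` are invented for illustration:

```python
def nest_by_depth(numbered_defs):
    """Nest 'label) text' strings by label length ('1' -> 1, '1a' -> 2)."""
    root = []
    # Each stack frame is (depth, list collecting the entries at that depth).
    stack = [(1, root)]
    for entry in numbered_defs:
        label, sep, text = entry.partition(")")
        if not sep:
            # No label found: keep the entry at the current level.
            stack[-1][1].append({"definition": entry.strip()})
            continue
        depth = len(label.strip())
        # Climb back up when the new entry is shallower.
        while len(stack) > 1 and stack[-1][0] > depth:
            stack.pop()
        # Open one "senses" list per level when the new entry is deeper.
        while stack[-1][0] < depth:
            child = []
            stack[-1][1].append({"senses": child})
            stack.append((stack[-1][0] + 1, child))
        stack[-1][1].append({"definition": text.strip()})
    return root

sample = ["1) father", "1a) of an individual", "1b) of God", "2) head of a household"]
nested = nest_by_depth(sample)
```

Here the two `1a`/`1b` entries end up inside a single `senses` list placed between the two top-level definitions.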

In [None]:
#Organize the defs of each word according to their depth
#relative to a parent/ancestor.
def organize_defs(defs):
    results = []
    #if a parentheses could not be found, make the
    #base depth 1 (i.e. Abi = "my father" (H21))
    try:
        cur_depth = len(defs[0][:defs[0].index(")")])
    except (ValueError, IndexError):
        cur_depth = 1
    index = 0
    #for each definition in the list
    while(index != len(defs)):
        #if a parenthesis could not be found in the def
        #simply append it to the results list and continue.
        try:
            def_depth = len(defs[index][:defs[index].index(")")])
        except ValueError:
            results.append(defs[index])
            index += 1
            continue
        #Prevents words with non-essential parentheses from having a
        #greater depth than they should. 5 was chosen because this
        #occurred more often when a word's definitions had 5 or more depths.
        if(def_depth > 5 and '=' in defs[index]):
            results.append(defs[index])
            index += 1
            continue
        #if on the same level, simply append.
        if(def_depth == cur_depth):
            results.append(defs[index])
            index += 1
        elif(def_depth > cur_depth):
            #if the depth has increased, call the function again on
            #every def after that one, inclusive. Append the resulting list.
            sub_list = organize_defs(defs[index:])
            results.append(sub_list[1])
            #if the functions have reached the end of all defs.
            if(sub_list[0] == -1):
                return (-1, results)
            #move up to where the recursive call left off.
            index += sub_list[0]
        else:
            #we return the index of where the recursive call ended
            #and the sublist to be appended.
            return (index, results)
    #return this if reached the end of the defs.
    return (-1, results)

#Unwraps the (index, results) tuple returned by organize_defs
#so that only the list of definitions remains.
def flatten(defs):
    if(type(defs[0]) is str):
        return defs
    if(type(defs[0]) is int):
        return flatten(defs[1])
    return defs

#Removes the numbering from the definitions
def removeNumbering(defs):
    index = 0
    while(index != len(defs)):
        if(type(defs[index]) == list):
            defs[index] = removeNumbering(defs[index])
        else:
            try:
                p_index = defs[index].index(")")
            except ValueError:
                index += 1
                continue
            defs[index] = defs[index][p_index + 1:].strip()
        index += 1
    return defs

#Transform the nesting of the definitions from lists into dictionaries.
def changeNestedTypes(defs):
    results = []
    for definition in defs:
        if(type(definition) is str):
            results.append({"definition" : definition})
        elif(type(definition) is list):
            sub_results = changeNestedTypes(definition)
            results.append({"senses" : sub_results})
    return results

#Organize all the definitions, flatten and remove
#numbering from the results
for i, word in enumerate(dictionary_defs):
    organized_defs = organize_defs(word['all_defs'])
    flattened_defs = flatten(organized_defs)
    unnumbered_defs = removeNumbering(flattened_defs)
    word['all_defs'] = changeNestedTypes(unnumbered_defs)

This last part will now replace the old definitions format with the new and improved one.

In [None]:
fileIn = open("definitions/added_more_defs.json", 'r', encoding = 'utf-8')
fileOut = open("definitions/final_defs.json", "w", encoding = 'utf-8')
for index, word in enumerate(fileIn):
    all_defs_index = word.index('"all_defs" : ')
    word_defs = json.dumps(dictionary_defs[index], ensure_ascii=False)
    fileOut.write(word[:all_defs_index] + word_defs[1:len(word_defs) - 1] + "}\n")
fileIn.close()
fileOut.close()

1/7/2021: A problem with the current definitions is that when they are in the database and a call is made to retrieve the definitions matching a word such as "mother", there is no ranking system to give more weight to certain words. Definitions that match but occur rarely in the Bible can therefore appear well before the most frequent match.

We will use a new file containing the words along with the number of references each word has to calculate word frequencies and improve the results returned by calls to the Hebrew database.
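
One way to picture the intended ranking: once every entry carries a frequency, the matches for a search term can be sorted so that common words come first. A minimal sketch with invented entries and counts follows. The function `rank_matches` and the field name `gloss` are assumptions for illustration, not the schema of the Hebrew database:

```python
def rank_matches(entries, term):
    """Return the entries whose gloss contains `term`, most frequent first."""
    matches = [entry for entry in entries if term in entry["gloss"]]
    return sorted(matches, key=lambda entry: entry["frequency"], reverse=True)

# Made-up ids, glosses, and counts.
entries = [
    {"hebID": 101, "gloss": "mother-city", "frequency": 3},
    {"hebID": 102, "gloss": "mother", "frequency": 220},
    {"hebID": 103, "gloss": "father", "frequency": 1200},
]
ranked = rank_matches(entries, "mother")
```

In this sample the rare match no longer precedes the common one, even though it comes first in the input.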

The first task is to reorder the words, since they are in Hebrew alphabetical order rather than ordered by Strong's Concordance id. Ordering by id makes it easy to compare this data against the output of the last cell and to check that nothing is off.

In [None]:
fileIn = open('definitions/words_refs.xml', 'r', encoding='utf-8')
line = fileIn.readline()
sorted_words = []
while line:
    if "<w" in line:
        #Corresponds to strongs concordance number.
        key_value = line[line.index('a="') + 3:line.rindex("\"")]
        key_value = int(re.sub("[^0-9]", "", key_value))
        complete_line = ""
        while "</w>" not in line:
            complete_line += line
            line = fileIn.readline()
        sorted_words.append((key_value, complete_line))
    line = fileIn.readline()
sorted_words = sorted(sorted_words)
fileIn.close()

A problem with the words we have organized is that the same word may be registered as many different forms, each with its own references, when we only need them grouped under their primary form, as is done in Strong's Concordance and therefore in our previous outputs. We will thus consolidate the words into their main forms to make it easy to transfer the information into the document we completed earlier.
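
The consolidation described above amounts to summing reference counts per Strong's number, ignoring any letter suffix on the id. A small sketch with made-up variants follows. The tuple layout and the helper name `total_refs_by_id` are assumptions, not the structure of `words_refs.xml`:

```python
import re
from collections import Counter

def total_refs_by_id(variants):
    """Sum reference counts per numeric Strong's id, sorted by id."""
    totals = Counter()
    for strongs_id, ref_count in variants:
        # Keep only the digits, so "H1" and "H1a" both map to 1.
        numeric_id = int(re.sub(r"[^0-9]", "", strongs_id))
        totals[numeric_id] += ref_count
    return dict(sorted(totals.items()))

# Hypothetical (id, number of references) pairs for several word forms.
variants = [("H2", 4), ("H1", 10), ("H1a", 2), ("H2", 1)]
totals = total_refs_by_id(variants)
```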

In [None]:
aggregated_words = {}
for word in sorted_words:
    if word[0] in aggregated_words:
        cur_refs = aggregated_words[word[0]]
        cur_refs += word[1]
        aggregated_words[word[0]] = cur_refs
    else:
        aggregated_words[word[0]] = word[1]
#Inefficiency exists here in that we repeat a step two times. Probably can be improved, but computes fast, so will stay.
for word in sorted(aggregated_words.items()):
    aggregated_words[word[0]] = word[1].count('<r')
aggregated_words = sorted(aggregated_words.items())

We will now take the words and insert the frequency into the main document.

In [None]:
fileIn = open('definitions/final_defs.json', 'r', encoding='utf-8')
fileOut = open('definitions/final_defs_v2.json', 'w', encoding='utf-8')
#The median frequency of all the words
median_frequency = int(median([num[1] for num in aggregated_words]))
#Look frequencies up by Strong's number rather than by list position,
#so that a missing number does not shift every later word.
frequency_by_id = dict(aggregated_words)
for index, line in enumerate(fileIn):
    #Some words have no frequency apparently. If that happens, use the median frequency instead to give them some weight.
    frequency = frequency_by_id.get(index + 1, median_frequency)
    line = line[:len(line) - 2] + f', "frequency" : {frequency}}}\n'
    fileOut.write(line)
fileIn.close()
fileOut.close()

The next cells count how many references each word has in the document called "sorted_words_refs.xml".
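
The tally built in the following cells nests counts by book, chapter, and verse. A compact sketch of that structure is shown here, assuming references of the form `Book.chapter.verse` as the split on `.` in the cells below suggests. The sample references and the name `tally_references` are made up for illustration:

```python
def tally_references(refs):
    """Count occurrences per book -> chapter -> verse."""
    tally = {}
    for ref in refs:
        book, chapter, verse = ref.split(".")
        # setdefault creates the nested dictionaries on first use.
        verses = tally.setdefault(book, {}).setdefault(chapter, {})
        verses[verse] = verses.get(verse, 0) + 1
    return tally

refs = ["Gen.1.1", "Gen.1.1", "Gen.2.4", "Exod.3.14"]
tally = tally_references(refs)
# The total frequency is the sum of every verse count.
total = sum(
    count
    for chapters in tally.values()
    for verses in chapters.values()
    for count in verses.values()
)
```

Chapter and verse stay as strings here, which keeps them usable as JSON object keys.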

In [16]:
incomplete_defs = {}
book_mapping = json.loads('{"Gen":1,"Exod":2,"Lev":3,"Num":4,"Deut":5,"Josh":6,"Judg":7,"1Sam":8,"2Sam":9,"1Kgs":10,"2Kgs":11,"Isa":12,"Jer":13,"Ezek":14,"Hos":15,"Joel":16,"Amos":17,"Obad":18,"Jonah":19,"Mic":20,"Nah":21,"Hab":22,"Zeph":23,"Hag":24,"Zech":25,"Mal":26,"Ps":27,"Prov":28,"Job":29,"Song":30,"Ruth":31,"Lam":32,"Eccl":33,"Esth":34,"Dan":35,"Ezra":36,"Neh":37,"1Chr":38,"2Chr":39}')
with open("./definitions/final_defs_v2.json", "r", encoding="utf-8") as previous_defs:
    for index, definition in enumerate(previous_defs):
        try:
            cur_def = json.loads(definition)
            cur_def['deprecated_freq'] = cur_def['frequency']
            del cur_def['frequency']
            cur_def['total_freq'] = 0
            cur_def['variants_with_refs'] = []
            incomplete_defs[str(index + 1)] = cur_def
        except ValueError:
            print(definition)

In [17]:
with open("./definitions/sorted_words_refs.xml", "r", encoding="utf-8") as sorted_words:
    cur_hid = 0
    cur_niqqud = ''
    cur_non_niqqud = ''
    cur_ref_dict = {}
    cur_total_freq = 0
    for i, line in enumerate(sorted_words):
        trim_line = line.strip()
        if trim_line[1] == 'w':
            if cur_hid in incomplete_defs:
                definition = incomplete_defs[cur_hid]
                def_to_add = {'niqqud' : cur_niqqud,
                              'non_niqqud' : cur_non_niqqud,
                              'references' : cur_ref_dict}
                definition['variants_with_refs'].append(def_to_add)
                definition['total_freq'] += cur_total_freq
                cur_ref_dict = {}
                cur_niqqud = ''
                cur_non_niqqud = ''
                cur_total_freq = 0
            trim_line = trim_line[1: len(trim_line) - 1]
            word_parts = [part[3:len(part) - 1] for part in trim_line.split()[1:]]
            cur_non_niqqud, cur_niqqud, cur_hid = word_parts
            if not cur_hid[-1].isdigit():
                cur_hid = cur_hid[:len(cur_hid) - 1]
        else:
            lst_rght_crt = trim_line[:-1].rindex('>')
            lst_lft_crt = trim_line[:-1].rindex('<')
            book, chapter, verse = trim_line[:-1][lst_rght_crt + 1:lst_lft_crt].split('.')
            if book not in cur_ref_dict:
                cur_ref_dict[book] = {chapter : {verse : 1}}
            elif chapter not in cur_ref_dict[book]:
                cur_ref_dict[book][chapter] = {verse : 1}
            elif verse not in cur_ref_dict[book][chapter]:
                cur_ref_dict[book][chapter][verse] = 1
            else:
                cur_ref_dict[book][chapter][verse] += 1
            cur_total_freq += 1
#Handles the last word
if cur_hid in incomplete_defs:
    definition = incomplete_defs[cur_hid]
    def_to_add = {'niqqud' : cur_niqqud,
                  'non_niqqud' : cur_non_niqqud,
                  'references' : cur_ref_dict}
    definition['variants_with_refs'].append(def_to_add)
    definition['total_freq'] += cur_total_freq

In [18]:
with open('./definitions/newest.json', 'w', encoding='utf-8') as variants:
    for item in incomplete_defs.values():
        item["hebID"] = int(item['hebID'][1:])
        del item["deprecated_freq"]
        #variants.write(json.dumps({k: serializer.serialize(v) for k, v in item.items()}, ensure_ascii=False) + '\n')
        variants.write(json.dumps(item, ensure_ascii=False) + "\n")

In [70]:
import boto3
from botocore.config import Config
my_config = Config(
    region_name = '',
    signature_version = 'v4',
    retries = {
        'max_attempts': 10,
        'mode': 'standard'
    }
)

client = boto3.client('dynamodb',
                      config=my_config,
                      aws_access_key_id='',
                      aws_secret_access_key='')
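#TypeSerializer converts Python values into DynamoDB's attribute-value format.
serializer = TypeSerializer()
for item in incomplete_defs.values():
    item_to_put = {k: serializer.serialize(v) for k, v in item.items()}
    client.put_item(TableName='hebrew_dict',
                   Item=item_to_put
    )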

In [None]:
client