# Mapping Wordlists from BDB to ETCBC

The goal of this notebook is to map BDB lexemes, which have already been sorted into categories useful for valency research, onto the ETCBC database. Each lexeme is converted to the ETCBC transliterated [`lex` feature](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/lex.html) so that it can be used with Text-Fabric.

This notebook uses the part-of-speech list generated in [bdb2etcbc_pos_sorting.ipynb](https://github.com/codykingham/textfabric_notebooks/blob/master/valency_wordlists/bdb2etcbc_pos_sorting.ipynb).

The source for the BDB resource is openscriptures' [BrownDriverBriggs.xml](https://github.com/openscriptures/HebrewLexicon).

In [28]:
from tf.fabric import Fabric
TF = Fabric(modules='Hebrew/etcbc4c')
print()
api = TF.load('otype lex g_lex_utf8 voc_utf8 g_word_utf8 g_cons_utf8 gloss')
api.makeAvailableIn(globals())

This is Text-Fabric 2.0.0
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_overview.html
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
106 features found and 0 ignored

  0.00s loading features ...
   |     0.04s B otype                from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.23s B g_cons_utf8          from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.24s B g_lex_utf8           from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.26s B g_word_utf8          from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.20s B lex                  from /Users/Cody/github/text-

In [29]:
from lxml import etree
import csv
import collections as col

tree = etree.parse("BrownDriverBriggs.xml")
root = tree.getroot()
namespace = {'None':'http://openscriptures.github.com/morphhb/namespace'}

with open('BDB_pos_tags.csv','r') as file:
    reader = csv.DictReader(file)
    pos_tags = list(reader)  # one dict per CSV row
    
pos_tags[:5]

[{'of kind': 'person', 'pos': 'n.pr.gent', 'type': 'agent'},
 {'of kind': 'abstract', 'pos': 'n.pl.m', 'type': 'object'},
 {'of kind': 'name', 'pos': 'n.pr.font', 'type': 'place'},
 {'of kind': 'person', 'pos': 'n.pr.pl.gent.', 'type': 'agent'},
 {'of kind': 'person', 'pos': 'n.pr.pers.m', 'type': 'agent'}]

In [30]:
# Gather all BDB lemmas whose part-of-speech tag is in pos_tags, to test against ETCBC

test_group = list()

# Query the XML once rather than once per tag
pos_elements = root.findall('None:part/None:section/None:entry/None:pos', namespace)

for tag in pos_tags:
    pos = tag['pos']
    for entry in pos_elements:
        cur_pos = entry.text
        if cur_pos == pos:
            parent = entry.getparent()
            # the first <w> child of the entry is taken as its lemma
            text = parent.findall('None:w', namespace)[0]
            test_group.append(text.text)
len(test_group)

2389
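
The lookup above depends on mapping a prefix to the BDB XML namespace and then walking `part/section/entry`. The sketch below shows the same idea on a tiny inline document, using the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without the BDB file. The sample XML, the root element name `lexicon`, the prefix `bdb`, the lemma strings, and the helper `lemmas_for_tags` are all made up for illustration. Unlike the cell above, it starts from each entry rather than from each `pos` element, so an entry is collected at most once.

```python
import xml.etree.ElementTree as ET

# Prefix -> namespace URI; the prefix name 'bdb' is an arbitrary choice
NS = {'bdb': 'http://openscriptures.github.com/morphhb/namespace'}

# Hypothetical miniature of the lexicon layout (not real BDB content)
SAMPLE_XML = '''<lexicon xmlns="http://openscriptures.github.com/morphhb/namespace">
  <part><section>
    <entry><w>lemma-one</w><pos>n.pr.m</pos></entry>
    <entry><w>lemma-two</w><pos>vb</pos></entry>
    <entry><w>lemma-three</w></entry>
  </section></part>
</lexicon>'''

def lemmas_for_tags(root, wanted_tags):
    '''
    return the first <w> text of every entry that has
    at least one <pos> child whose text is in wanted_tags
    '''
    lemmas = []
    for entry in root.findall('bdb:part/bdb:section/bdb:entry', NS):
        headword = entry.find('bdb:w', NS)
        pos_texts = {pos.text for pos in entry.findall('bdb:pos', NS)}
        if headword is not None and pos_texts & wanted_tags:
            lemmas.append(headword.text)
    return lemmas

sample_root = ET.fromstring(SAMPLE_XML)
print(lemmas_for_tags(sample_root, {'n.pr.m'}))
```

Passing the wanted tags as a set means each entry is inspected once, however many tags are in the list.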

In [31]:
def collect_letters():
    '''
    returns all consonants/vowels from etcbc
    omits diacritical marks
    '''
    consonants = set()
    vowels = set()
    sample_words = F.otype.s('word')
    for word in sample_words:
        for letter in F.g_cons_utf8.v(word):
            if letter not in {' ','ׁ','ׂ'}:
                consonants.add(letter)
    for word in sample_words:
        for letter in F.g_lex_utf8.v(word):
            if letter not in consonants and letter not in {' '}:
                vowels.add(letter)
    return {'consonants' : consonants, 'vowels' : vowels}
            
def strip_diacritic(word, consonants, vowels):
    '''
    strip diacritical markings and return clean word
    '''
    new_word = ''
    for letter in word:
        if letter in consonants or letter in vowels:
            new_word += letter
            
    return new_word

def fix_holem(word):
    '''
    fixes a vocalisation error on etcbc4c words:
    ex: 'גֹּויִם into גּוֹיִם'
    '''
    if 'ֹו' in word:
        return word.replace('ֹ', '').replace('ו','וֹ')
    else:
        return word

def strip_dagesh(word):
    '''
    remove a dagesh attached to the first letter of the word
    (i.e. at index 1); other dageshes are left in place
    '''
    if len(word) > 1 and word[1] == 'ּ': #dagesh
        return word.replace('ּ','',1)
    else:
        return word
    
def fix_word(word):
    '''
    apply diacritical stripping and other corrections
    return clean word
    '''
    clean_word = strip_diacritic(word, letters['consonants'], letters['vowels'])
    clean_word = fix_holem(clean_word)
    clean_word = strip_dagesh(clean_word)
    return clean_word
    
def text_to_lex():
    '''
    creates mapping from a cleaned etcbc word
    to the corresponding etcbc lex features
    '''
    text_dict = col.defaultdict(set)
    for word in F.otype.s('word'):
        clean_word = fix_word(F.g_word_utf8.v(word))
        lex = F.lex.v(word)
        text_dict[F.g_lex_utf8.v(word)].add(lex)
        text_dict[clean_word].add(lex)
    return text_dict
    
def match_etcbc(bdb_lex, bib_lex):
    '''
    matches a bdb lexeme with etcbc lexemes
    requires the text_to_lex dict (bib_lex)
    returns the set of matching lex values, or None if there is no match
    '''
    clean_bdb_lex = fix_word(bdb_lex) # the bdb lexs occasionally have diacriticals too!
    if clean_bdb_lex in bib_lex:
        return bib_lex[clean_bdb_lex]
    
letters = collect_letters() # required for fix_word()

In [32]:
# complete mapping from an etcbc clean word to its corresponding ascii lemmas
bib_lex = text_to_lex()

In [33]:
match_group = set()

for lex in test_group:
    if match_etcbc(lex, bib_lex):
        match_group.add(lex)

In [34]:
print('Not yet mapped to ETCBC: ', len(set(test_group) - match_group))

Not yet mapped to ETCBC:  247


In [35]:
# sorted(set(test_group) - match_group)[:50]

Problems in the mapping so far:
* Many issues seem to be caused by proper nouns
* Differences in vocalization

Fixed problems:
* √ removed diacritical markers from both etcbc and bdb lemmas (`strip_diacritic()`, applied via `fix_word()`)
    * brought the unmapped lemmas down from ~700 to ~350
* √ removed first-position dageshes to solve pointing discrepancies (`strip_dagesh()`)
    * brought the unmapped lemmas down from ~350 to 247
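
The two fixes can be seen in isolation on single words. The sketch below is not the notebook's `fix_word()`: rather than deriving the allowed letters from the ETCBC corpus, it assumes that the marks to drop are the Hebrew accents in the Unicode range U+0591 to U+05AF plus meteg (U+05BD). The names `strip_accents` and `strip_first_dagesh` and the two sample words are illustrative only. Characters are written as escapes so that the code points are unambiguous.

```python
import re

# Assumption: "diacritical markers" here means Hebrew accents and meteg
ACCENTS = re.compile('[\u0591-\u05AF\u05BD]')
DAGESH = '\u05BC'

def strip_accents(word):
    '''drop accent marks, keep consonants and vowel points'''
    return ACCENTS.sub('', word)

def strip_first_dagesh(word):
    '''drop a dagesh that directly follows the first letter'''
    if len(word) > 1 and word[1] == DAGESH:
        return word[0] + word[2:]
    return word

# mem, segol, ole (accent), lamed, segol, final kaf
accented = '\u05DE\u05B6\u05AB\u05DC\u05B6\u05DA'
plain = '\u05DE\u05B6\u05DC\u05B6\u05DA'

# gimel, dagesh, vav, holem, yod, hiriq, final mem
with_dagesh = '\u05D2\u05BC\u05D5\u05B9\u05D9\u05B4\u05DD'
without_dagesh = '\u05D2\u05D5\u05B9\u05D9\u05B4\u05DD'

print(strip_accents(accented) == plain)
print(strip_first_dagesh(with_dagesh) == without_dagesh)
```

Checking only index 1 matches the notebook's behaviour: a dagesh later in the word is left alone.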
    
Any given lemma may map to more than one ETCBC `lex` value. This is undesirable when the word form matches exactly but the sense differs from the one intended by the part-of-speech tag we have chosen to keep. Because the vocalised text is more specific than the bare consonants, the vowels have been kept. To measure how widely the matches spread across lex values, let's compute the average number of lex values matched per lemma:

In [36]:
total_lemma = 0
len_lex = 0

for lemma in match_group:
    total_lemma += 1
    len_lex += len(match_etcbc(lemma, bib_lex))
    
print('total lexemes: ', total_lemma)
print('Average len of ascii lemma mapping: ', round(len_lex/total_lemma, 2))

total lexemes:  2079
Average len of ascii lemma mapping:  1.28


It is good that the number is close to 1.0, since that means there are fewer lemmas with two or more matches.
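
An average does not show how the ambiguity is distributed, so it can help to count how many lemmas have one match, two matches, and so on. The sketch below does this on a small hand-made dictionary. The keys are Latin stand-ins for Hebrew lemmas, the lex values are copied from the sample output in this notebook, and `match_size_distribution` is a hypothetical helper. With the real data, the dictionary would be built from `match_group` and `match_etcbc()`.

```python
from collections import Counter

# Stand-in data: lemma -> set of matched ETCBC lex values
sample_matches = {
    'edom': {'>DWM/'},
    'elihu': {'>LJHW>/', '>LJHW=/'},
    'enosh': {'>NWC==/', '>NWC/'},
    'elul': {'>LWL/'},
}

def match_size_distribution(matches):
    '''count lemmas by the number of lex values they map to'''
    return Counter(len(lexemes) for lexemes in matches.values())

distribution = match_size_distribution(sample_matches)

# Lemmas that need manual disambiguation
ambiguous = sorted(
    lemma for lemma, lexemes in sample_matches.items() if len(lexemes) > 1
)

for size in sorted(distribution):
    print(size, 'match(es):', distribution[size], 'lemma(s)')
print('ambiguous:', ambiguous)
```

Listing the ambiguous lemmas directly gives a worklist for checking which lex value fits the intended part of speech.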

In [39]:
for lemma in sorted(match_group)[:150]:
    print(lemma, '--', match_etcbc(lemma, bib_lex))

אֱדוֹם -- {'>DWM/'}
אֱוִי -- {'>WJ/'}
אֱוִיל מְרֹדַךְ -- {'>WJL_MRDK/'}
אֱלִיאֵל -- {'>LJ>L/'}
אֱלִיאָב -- {'>LJ>B/'}
אֱלִיהוּ -- {'>LJHW>/', '>LJHW=/'}
אֱלִיחֹ֫רֶף -- {'>LJXRP/'}
אֱלִימֶ֫לֶךְ -- {'>LJMLK/'}
אֱלִיעֶ֫זֶר -- {'>LJ<ZR/'}
אֱלִיעָם -- {'>LJ<M/'}
אֱלִיפְלֵ֫הוּ -- {'>LJPLHW/'}
אֱלִיפַז -- {'>LJPZ/'}
אֱלִיצָפָן -- {'>LJYPN/'}
אֱלִיצוּר -- {'>LJYWR/'}
אֱלִיקָא -- {'>LJQ>/'}
אֱלִישֶׁ֫בַע -- {'>LJCB</'}
אֱלִישָׁע -- {'>LJC</'}
אֱלִישָׁפָט -- {'>LJCPV/'}
אֱלִישׁוּעַ -- {'>LJCW</'}
אֱלוּל -- {'>LWL/'}
אֱמֹרִי -- {'>MRJ/'}
אֱנוֹשׁ -- {'>NWC==/', '>NWC/'}
אֲבִיָּ֫הוּ -- {'>BJHW/'}
אֲבִי־עַלְבוֹן -- {'>BJ_<LBWN/'}
אֲבִיאֵל -- {'>BJ>L/'}
אֲבִיאָסָף -- {'>BJ>SP/'}
אֲבִיגַ֫יִל -- {'>BJGJL/'}
אֲבִידָן -- {'>BJDN/'}
אֲבִידָע -- {'>BJD</'}
אֲבִיהוּא -- {'>BJHW>/'}
אֲבִיהוּד -- {'>BJHWD/'}
אֲבִיחַ֫יִל -- {'>BJXJL/'}
אֲבִיטָ֑ל -- {'>BJVL/'}
אֲבִיטוּב -- {'>BJVWB/'}
אֲבִימֶ֫לֶךְ -- {'>BJMLK/'}
אֲבִימָאֵל -- {'>BJM>L/'}
אֲבִינֵר -- {'>BNR/'}
אֲבִינָדָב -- {'>BJNDB/'}
אֲבִינֹ֫עַם -- {'>BJN<M/'}
