# Mapping Wordlists from BDB to ETCBC

The goal of this notebook is to map lexemes from BDB, already sorted into categories useful for valency research, onto the ETCBC database. Each lexeme is converted to the ETCBC transliterated [`lex` feature](https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/lex.html) so that it can be used with Text-Fabric.

This notebook uses the part-of-speech list generated in [BDB_pos_categorization.ipynb](https://github.com/codykingham/textfabric_notebooks/blob/master/valency_wordlists/BDB_pos_categorization.ipynb).

The source for the BDB resource is openscriptures' [BrownDriverBriggs.xml](https://github.com/openscriptures/HebrewLexicon).

In [71]:
from tf.fabric import Fabric
TF = Fabric(modules='Hebrew/etcbc4c')
print()
api = TF.load('otype lex g_lex_utf8 voc_utf8 g_word_utf8 g_cons_utf8 gloss')
api.makeAvailableIn(globals())

This is Text-Fabric 2.0.0
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_overview.html
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
106 features found and 0 ignored

  0.00s loading features ...
   |     0.04s B otype                from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.23s B g_cons_utf8          from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.25s B g_lex_utf8           from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.26s B g_word_utf8          from /Users/Cody/github/text-fabric-data/Hebrew/etcbc4c
   |     0.18s B lex                  from /Users/Cody/github/text-

In [72]:
from lxml import etree
import csv

tree = etree.parse("BrownDriverBriggs.xml")
root = tree.getroot()
namespace = {'None':'http://openscriptures.github.com/morphhb/namespace'}

with open('BDB_pos_tags.csv', 'r', newline='') as infile:
    reader = csv.DictReader(infile)
    pos_tags = list(reader)  # one dict per CSV row
    
pos_tags[:5]

[{'of kind': 'person', 'pos': 'n.pr.gent', 'type': 'agent'},
 {'of kind': 'abstract', 'pos': 'n.pl.m', 'type': 'object'},
 {'of kind': 'name', 'pos': 'n.pr.font', 'type': 'place'},
 {'of kind': 'person', 'pos': 'n.pr.pl.gent.', 'type': 'agent'},
 {'of kind': 'person', 'pos': 'n.pr.pers.m', 'type': 'agent'}]

In [73]:
# Gather all of the lemmas to test against ETCBC

test_group = list()

for tag in pos_tags:
    pos = tag['pos']
    for entry in root.findall('None:part/None:section/None:entry/None:pos', namespace):
        cur_pos = entry.text
        if cur_pos == pos:
            parent = entry.getparent()
            text = parent.findall('None:w', namespace)[0]
            test_group.append(text.text)
len(test_group)

2389
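
The gathering loop above rescans every `pos` element once per tag. A minimal single-pass sketch follows, using the standard library's `xml.etree.ElementTree` on a toy fragment. The root element name, the placeholder lemmas, the `bdb` prefix and the `gather_lemmas` helper are all assumptions made for illustration; only the `part/section/entry` nesting and the namespace URI come from the cell above. Unlike the loop above, it adds each entry at most once.

```python
import xml.etree.ElementTree as ET

NS = 'http://openscriptures.github.com/morphhb/namespace'

# Toy fragment: the nesting mirrors the path queried above; lemmas are placeholders.
SAMPLE = f'''<lexicon xmlns="{NS}">
  <part><section>
    <entry><w>lemma_a</w><pos>n.m</pos></entry>
    <entry><w>lemma_b</w><pos>vb</pos></entry>
    <entry><w>lemma_c</w><pos>n.pr.m</pos></entry>
  </section></part>
</lexicon>'''

def gather_lemmas(root, wanted_pos):
    """Return the first <w> text of each entry that has a <pos> in wanted_pos."""
    ns = {'bdb': NS}  # 'bdb' is an arbitrary prefix chosen for this sketch
    lemmas = []
    for entry in root.iterfind('bdb:part/bdb:section/bdb:entry', ns):
        tags = {p.text for p in entry.findall('bdb:pos', ns)}
        headword = entry.find('bdb:w', ns)
        # Keep the entry if any of its pos tags is wanted
        if headword is not None and tags & wanted_pos:
            lemmas.append(headword.text)
    return lemmas

sample_root = ET.fromstring(SAMPLE)
print(gather_lemmas(sample_root, {'n.m', 'n.pr.m'}))
```

With the real data, `wanted_pos` could be built as `{tag['pos'] for tag in pos_tags}`.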

In [270]:
import collections as col

def collect_letters():
    consonants = set()
    vowels = set()
    sample_words = F.otype.s('word')
    for word in sample_words:
        for letter in F.g_cons_utf8.v(word):
            if letter not in {' ','ׁ','ׂ'}:
                consonants.add(letter)
    for word in sample_words:
        for letter in F.g_lex_utf8.v(word):
            if letter not in consonants and letter not in {' '}:
                vowels.add(letter)
    return {'consonants' : consonants, 'vowels' : vowels}
            
def strip_diacritic(word, consonants, vowels):
    new_word = ''
    for letter in word:
        if letter in consonants or letter in vowels:
            new_word += letter
            
    return new_word

def fix_holem(word):
    '''
    fixes a vocalisation error on etcbc4c words:
    ex: 'גֹּויִם into גּוֹיִם'
    '''
    holam, vav = '\u05b9', '\u05d5'
    # Swap only the misordered pair; other holams and vavs are left untouched
    return word.replace(holam + vav, vav + holam)

def fix_word(word):
    clean_word = strip_diacritic(word, letters['consonants'], letters['vowels'])
    clean_word = fix_holem(clean_word)
    # Remove a first-position dagesh to avoid pointing discrepancies
    if len(word) > 1 and word[1] == '\u05bc':  # dagesh
        clean_word = clean_word.replace('\u05bc', '', 1)
    return clean_word
    
def text_to_lex():
    text_dict = col.defaultdict(set)
    for word in F.otype.s('word'):
        no_diacritics = fix_word(F.g_word_utf8.v(word))
        lex = F.lex.v(word)
        text_dict[F.g_lex_utf8.v(word)].add(lex)
        text_dict[no_diacritics].add(lex)
    return text_dict
    
def match_etcbc(bdb_lex, bib_lex):     
    clean_bdb_lex = fix_word(bdb_lex)
    if clean_bdb_lex in bib_lex:
        return bib_lex[clean_bdb_lex]

In [271]:
# fix_word() reads the global `letters`, so build it before calling text_to_lex()
letters = collect_letters()
bib_lex = text_to_lex()

In [272]:
match_group = set()

for lex in test_group:
    if match_etcbc(lex, bib_lex):
        match_group.add(lex)

In [273]:
print('Not yet mapped to ETCBC: ', len(set(test_group) - match_group))

Not yet mapped to ETCBC:  249


In [274]:
sorted(set(test_group) - match_group)[:50]

['(ו)יעשׂו',
 '(וְ)יַעֲזִיאֵל',
 ']עֵת] קָצִין',
 'אֱלִיפָל',
 'אֲזַנְיָ֫הוּ',
 'אֲחַשְׁוֵרוֹשׁ',
 'אֲלִיפֶ֫לֶט',
 'אֲלִישָׁה',
 'אֲנָֽחֲרָ֑ת',
 'אֲפָֽרְסַתְּכָיֵא',
 'אֲרְיוֹךְ',
 'אֲרִידַי',
 'אֲרַק',
 'אֲרָא',
 'אֲרָב',
 'אֲרוּמָה',
 'אֳלִיאָ֫תָה',
 'אִישׁ הוֹד',
 'אִישׁ־בֹ֫שֶׁת',
 'אֵסַרְחַדֹּן',
 'אֶ֫בֶץ',
 'אֶדְרַע',
 'אֶשְׁתְּמוֹעַ',
 'אַחְבָן',
 'אַחֲרַח',
 'אַלַּמֶּלֶךְ',
 'אַרְוָד',
 'אַרְכְּוָיֵ',
 'אַשְׁכְּנַז',
 'אָזֵן',
 'אׇֽסְנַפַּר',
 'בַ֫עַל גַּד',
 'בְּאֵר לַחַי רֹאִי',
 'בְּאֵר שֶׁ֫בַע',
 'בֵּֽיתְאֵל',
 'בֵּית אַֽרְבֵֿאל',
 'בֵּית הַיְשִׁימוֹת',
 'בֵּית הַמֶּרְחָק',
 'בֵּית חוֹרֹן',
 'בֵּית לְעַפְרָה',
 'בֵּית מַעֲכָה',
 'בֵּית רָפָא',
 'בֶּן־אֲבִינָדָב',
 'בֶּן־גֶּ֫בֶר',
 'בֶּן־דֶּ֫קֶר',
 'בֶּן־חֶ֫סֶד',
 'בֶּן־חוּר',
 'בַּעַלְיָה',
 'בַּקָּרָה',
 'בָּרַכְאֵל']

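Problems in the mapping so far:
* Most of the unmapped lemmas appear to be proper nouns, several of them multi-word names or entries with bracketed elements.
* Some lemmas differ in vocalization between BDB and ETCBC; the cause is not yet known.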
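
One way to probe the first point is to bucket the unmapped lemmas by surface shape. The sketch below is only illustrative: `surface_shape` and `shape_counts` are hypothetical helpers, and the three categories are an assumption based on the forms visible in the list above (parentheses or brackets, spaces, maqaf).

```python
import re
from collections import Counter

MAQAF = '\u05be'  # Hebrew hyphen joining compound names

def surface_shape(lemma):
    """Label a lemma by the surface feature most likely to block an exact match."""
    if re.search(r'[()\[\]]', lemma):
        return 'bracketed'
    if ' ' in lemma or MAQAF in lemma:
        return 'multi-word'
    return 'single word'

def shape_counts(lemmas):
    """Count how many lemmas fall into each surface shape."""
    return Counter(surface_shape(lemma) for lemma in lemmas)

# Three entries taken from the unmapped list above
print(shape_counts(['(ו)יעשׂו', 'אִישׁ הוֹד', 'אֲרַק']))
```

In the notebook this could be run as `shape_counts(set(test_group) - match_group)` to see how much of the remainder is explained by compound or bracketed entries.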

Fixed problems:
* √ removed diacritical markers from both ETCBC and BDB lemmas (`strip_diacritic()`)
    * brought the unmapped count down from ~700 to ~350
* √ removed first-position dageshes to resolve pointing discrepancies (`fix_word()`)
    * brought the unmapped count down from ~350 to 249

Fortunately, most of the remaining lemmas appear to be relatively rare.
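
The diacritic stripping described above builds its set of allowed characters from the corpus itself. An alternative, which is not what this notebook does, is to filter by Unicode code point: the Hebrew cantillation marks occupy U+0591 to U+05AF and meteg is U+05BD. The sketch below assumes that these are the only marks to drop; `strip_accents` is a hypothetical helper.

```python
ACCENTS = range(0x0591, 0x05B0)  # Hebrew cantillation marks, U+0591..U+05AF
METEG = 0x05BD

def strip_accents(word):
    """Drop cantillation marks and meteg; keep consonants, vowel points and dagesh."""
    return ''.join(
        ch for ch in word
        if ord(ch) not in ACCENTS and ord(ch) != METEG
    )

# A BDB headword from the unmapped list that carries a stress mark
print(strip_accents('אֶ֫בֶץ'))
```

Because the filter does not depend on which characters happen to occur in the corpus, it behaves the same on BDB and ETCBC strings, at the cost of hard-coding the ranges.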

## Testing Field:
Scratch space for investigating mismatches between the ETCBC and BDB forms.

In [242]:
test = T.nodeFromSection(('Habakkuk',3,7))
test_words = [w for w in L.d(test, otype='word')]

for w in test_words:
    word = F.g_word_utf8.v(w)
    print(fix_word(word))

תַּחַת
אָוֶן
רָאִיתִי
אָהֳלֵי
כוּשָׁן
יִרְגְּזוּן
יְרִיעוֹת
אֶרֶץ
מִדְיָן


In [263]:
# The presence of a first position dagesh causes problems!
# a new function could remove the dagesh and try again?

bdb = 'כּוּשָׁן'
etcbc = 'כוּשָׁן'

bdb == etcbc

False

In [264]:
bdb[1] == 'ּ' # dagesh

True

In [265]:
etcbc[1] == 'ּ' # dagesh

False

In [266]:
bdb.replace('ּ','',1)

'כוּשָׁן'
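
The "remove the dagesh and try again" idea from the comments above can be sketched as a lookup with a fallback. This is a minimal illustration, not the notebook's `fix_word()`, which always strips the dagesh: `match_with_fallback` is a hypothetical helper, the toy index stands in for the output of `text_to_lex()`, and `'LEX_PLACEHOLDER'` is not a real ETCBC `lex` value.

```python
DAGESH = '\u05bc'

def match_with_fallback(lemma, index):
    """Look up lemma as given, then without a dagesh on its first letter."""
    candidates = [lemma]
    if len(lemma) > 1 and lemma[1] == DAGESH:
        candidates.append(lemma[0] + lemma[2:])
    for form in candidates:
        if form in index:
            return index[form]
    return None

# Toy index keyed by the ETCBC form; the value is a placeholder
toy_index = {'כוּשָׁן': {'LEX_PLACEHOLDER'}}
print(match_with_fallback('כּוּשָׁן', toy_index))
```

Trying the exact form first keeps any entry whose first-letter dagesh already agrees in both sources. Like the checks above, the test on `lemma[1]` assumes the dagesh is the first combining mark after the initial consonant.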