# WordNet to ETCBC

This is an attempt to map categories to substantive lex objects in ETCBC for valency research. Senses in Princeton Wordnet contain hypernym and hyponym relations, allowing for grouping senses into general categories. The ETCBC contains a gloss feature for every Hebrew term in the database. In order to accomplish some mappings, we need to follow several steps.

1. retrieve all noun, lexeme objects in etcbc with a mapping to 10 (max) random sample passages
2. pass each etcbc gloss from the lexeme to Morphy to get the lexical form in English.
    * remap the cleaned etcbc gloss to the 10 sample passages
3. extract the sample passages from English translations with [XML Bible API](https://www.4-14.org.uk/xml-bible-web-service-api)
    * tokenize and parse each word in the sample passages
    * filter out any sample passages that do not contain the lexeme
    * map each lexeme to its verified sample passage
4. feed lexeme and sample passsages to the [Word Sense Disambiguation algorithm](http://www.nltk.org/howto/wsd.html)
    * map lexeme to wordnet sense
5. test the hypernyms for each sense to create the categories
6. export the new list

Along the way, keep track of how many lexemes we account for from ETCBC so that we can weed out solvable problems in the mapping process.

In [1]:
from collections import *
from tf.fabric import Fabric

TF = Fabric(modules='hebrew/etcbc4c')
api = TF.load('book chapter verse gloss lex g_word_utf8 sp')
api.makeAvailableIn(globals())

This is Text-Fabric 2.1.3
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data/features/hebrew/etcbc4c/0_overview.html
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
108 features found and 0 ignored
  0.00s loading features ...
   |     0.01s B book                 from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B chapter              from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.01s B verse                from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.20s B g_word_utf8          from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s B gloss                from /Users/Cody/github/text-f

# 1. retrieve noun, lexeme objects in etcbc with a mapping to 10 (max) random sample passages

In [2]:
etcbcLex = defaultdict(set) # mapping from etcbc word nodes to 10 max sample passages

indent(reset = True)
info('beginning search...')
for lex in F.otype.s('lex'):
    if F.sp.v(lex) != 'subs': continue
    samples = set(L.d(lex, otype = 'word')) # randomize with set
    samples = list(samples)[:10] if len(samples) > 10 else samples # pull only 10
    samples = set(T.sectionFromNode(n) for n in samples) # pull passage info
    etcbcLex[lex] = set(samples)
    if len(etcbcLex) % 500 == 0: info('{} words processed...'.format(len(etcbcLex)))
        
avgSampleLen = round(sum( len(passages) for word in etcbcLex for passages in etcbcLex[word] ) / len(etcbcLex),2)
        
info('Complete with {} words and average {} sample passages'.format(len(etcbcLex),avgSampleLen))

  0.00s beginning search...
  0.21s 500 words processed...
  0.35s 1000 words processed...
  0.47s 1500 words processed...
  0.57s 2000 words processed...
  0.63s 2500 words processed...
  0.69s 3000 words processed...
  0.72s 3500 words processed...
  0.79s 4000 words processed...
  0.80s Complete with 4077 words and average 13.35 sample passages


In [3]:
list(etcbcLex.items())[:2]

[(1436896,
  {('Deuteronomy', 11, 12),
   ('Ezekiel', 44, 30),
   ('Genesis', 1, 1),
   ('Jeremiah', 2, 3),
   ('Jeremiah', 49, 34),
   ('Jeremiah', 49, 35),
   ('Nehemiah', 12, 44),
   ('Numbers', 15, 20),
   ('Numbers', 15, 21),
   ('Proverbs', 3, 9)}),
 (1436898,
  {('1_Chronicles', 19, 13),
   ('2_Kings', 4, 7),
   ('Ezekiel', 28, 13),
   ('Genesis', 1, 1),
   ('Genesis', 1, 2),
   ('Genesis', 43, 23),
   ('Joshua', 18, 3),
   ('Psalms', 7, 4),
   ('Psalms', 55, 15)})]

In [4]:
# a set to track which lexemes are weeded out
lexTracker = set(etcbcLex.keys())
len(lexTracker)

4077

# 2. pass each etcbc gloss from the lexeme to Morphy to get the lexical form in English

In [5]:
from nltk.corpus import wordnet as wn

In [6]:
cleanGlosses = {}
changes = set()

for lexNode, samples in etcbcLex.items():
    lex = F.gloss.v(lexNode)
    morph = wn.morphy(lex, wn.NOUN)
    if morph:
        cleanGlosses[lexNode] = {'WNgloss':morph, 'samples': samples}
        if morph != lex:
            changes.add('{} --> {}'.format(lex, morph))
        
print('glosses found: {}'.format(len(cleanGlosses)))
print('etcbc words lost: {}'.format(len(lexTracker)-len(cleanGlosses)))
print('total changes: {}'.format(len(changes)))
print('sample changes: {}'.format('\n\t\t'.join(list(changes)[:10])))

glosses found: 3172
etcbc words lost: 905
total changes: 84
sample changes: pans --> pan
		captives --> captive
		seniors --> senior
		generations --> generation
		thistles --> thistle
		presents --> present
		antlers --> antler
		arguments --> argument
		pipes --> pipe
		members --> member


In [7]:
etcbcLex = cleanGlosses
lexTracker = set(etcbcLex.keys()) & lexTracker
print('remaining words:', len(lexTracker))

remaining words: 3172


# 3. extract the sample passages from English translations with XML Bible API

In [8]:
import requests
import time
from lxml import etree
from spacy.en import English

verseCache = defaultdict(dict)

In [9]:
# first retrieve all verses 

passageText = {}
passagesToGet = set(sect for data in etcbcLex for sect in etcbcLex[data]['samples'])
translationPreferences = ('nasb','ylt','akjv','web','kjv') # see https://www.4-14.org.uk/xml-bible-web-service-api
apiURL = 'http://api.preachingcentral.com/bible.php?passage={passage}&version={version}'

def fix_book(bookString):
    if '_' in bookString:
        return ' '.join(bookString.split('_'))
    else:
        return bookString

def getRequest(passages, version):
    passages = set('{} {}:{}'.format(fix_book(passa[0]), passa[1], passa[2]) for passa in passages)
    passages = ', '.join(passages)
    requestURL = apiURL.format(passage=passages, version=version)
    rawData = requests.get(requestURL).content
    bibleRoot = etree.fromstring(rawData)
    for result in bibleRoot.findall('range/result'):
        passage = result.text
        try:
            text = result.findall('../item/text')[0].text
        except IndexError:
            info('not found: {}'.format(result.text))
        verseCache[version].update({passage: text})

def getPassages(passageSet, version):
    assembledRequest = set()
    for count, passage in enumerate(passageSet):
        if passage in verseCache[version]:
            continue
        assembledRequest.add(passage)
        if any([count != 0 and count % 400 == 0, # create requests of max 400 verses at a time 
                count == len(passageSet)-1  
               ]):
            getRequest(assembledRequest, version) # stores the verse in the cache every 400 verses
            time.sleep(3)
            assembledRequest = set()
            info('verses logged at count: {}'.format(count))
            info('total verses: {}'.format(len(verseCache[version])))
            
for translation in translationPreferences:
    break # the results of this query have been pushed to a json file (cf. below)
    indent(reset=True)
    info('preparing {}'.format(translation))
    getPassages(passagesToGet, translation)

In [10]:
import json 

with open('lexTranslations.json', 'r') as file:
    samples = json.load(file)
    
print(len(samples['nasb']))

8876


## tokenize and parse each word in the sample passages

In [11]:
parser = English()

**First a test/demonstration of Spacy. We link אֶרֶץ (earth) with the parsed Spacy text from the NASB. The output results in True**

In [12]:
test = samples['nasb']['Genesis 1:1']
parsedTest = parser(test)
erets = list(lex for lex in F.otype.s('lex') if F.lex.v(lex) == '>RY/')[0]
testSentence = list(parsedTest.sents)

print(testSentence)
print('lemmatized:\n', list(token.lemma_ for token in testSentence[0]))
print('partOfSpeech:\n', list(token.pos_ for token in testSentence[0]))

# true or false? Now we demonstrate the link from ETCBC to the translation: 
(F.gloss.v(erets),'NOUN') in set((token.lemma_, token.pos_) for token in testSentence[0])

[In the beginning God created the heavens and the earth.]
lemmatized:
 ['in', 'the', 'begin', 'god', 'create', 'the', 'heaven', 'and', 'the', 'earth', '.']
partOfSpeech:
 ['ADP', 'DET', 'VERB', 'PROPN', 'VERB', 'DET', 'NOUN', 'CONJ', 'DET', 'NOUN', 'PUNCT']


True

**tokenize the verses and keep a part of speech**

In [13]:
lemmatizedSamps = defaultdict(dict)
passageCount = 0

indent(reset = True)
for version, passages in samples.items():
    info('starting {}'.format(version))
    for passage, text in passages.items():
        passageCount += 1
        pText = list(parser(text))
        lemmatizedSamps[version][passage]= {'tokens' : tuple(w.orth_ for w in pText), # for sense disambiguation
                                            'posLem' : set((w.lemma_,w.pos_) for w in pText) # to check for lemma
                                            }
        if passageCount != 0 and passageCount % 1000 == 0:
            info('{} passages processed...'.format(passageCount))
info('\nCOMPLETE! {} passages processed'.format(passageCount))

  0.00s starting nasb
  5.01s 1000 passages processed...
  8.54s 2000 passages processed...
    12s 3000 passages processed...
    14s 4000 passages processed...
    18s 5000 passages processed...
    20s 6000 passages processed...
    23s 7000 passages processed...
    25s 8000 passages processed...
    27s starting ylt
    28s 9000 passages processed...
    30s 10000 passages processed...
    33s 11000 passages processed...
    35s 12000 passages processed...
    38s 13000 passages processed...
    40s 14000 passages processed...
    43s 15000 passages processed...
    46s 16000 passages processed...
    49s 17000 passages processed...
    52s starting akjv
    53s 18000 passages processed...
    56s 19000 passages processed...
    58s 20000 passages processed...
 1m 00s 21000 passages processed...
 1m 02s 22000 passages processed...
 1m 05s 23000 passages processed...
 1m 08s 24000 passages processed...
 1m 11s 25000 passages processed...
 1m 14s 26000 passages processed...
 1m 16s 

In [14]:
from pprint import pprint

pprint(lemmatizedSamps['nasb']['Genesis 1:1'])

{'posLem': {('.', 'PUNCT'),
            ('and', 'CONJ'),
            ('begin', 'VERB'),
            ('create', 'VERB'),
            ('earth', 'NOUN'),
            ('god', 'PROPN'),
            ('heaven', 'NOUN'),
            ('in', 'ADP'),
            ('the', 'DET')},
 'tokens': ('In',
            'the',
            'beginning',
            'God',
            'created',
            'the',
            'heavens',
            'and',
            'the',
            'earth',
            '.')}


### filter out any sample passages that do not contain the lexeme; map each lexeme to its verified sample passage

In [15]:
# if the sample passage does not contain the exact lexeme,
# move on to the next preferred translation in the preferredTranslations list 

pprint(list(etcbcLex.items())[:2])

[(1436896,
  {'WNgloss': 'beginning',
   'samples': {('Deuteronomy', 11, 12),
               ('Ezekiel', 44, 30),
               ('Genesis', 1, 1),
               ('Jeremiah', 2, 3),
               ('Jeremiah', 49, 34),
               ('Jeremiah', 49, 35),
               ('Nehemiah', 12, 44),
               ('Numbers', 15, 20),
               ('Numbers', 15, 21),
               ('Proverbs', 3, 9)}}),
 (1436901,
  {'WNgloss': 'heavens',
   'samples': {('2_Chronicles', 33, 3),
               ('Deuteronomy', 2, 25),
               ('Deuteronomy', 28, 24),
               ('Deuteronomy', 30, 19),
               ('Deuteronomy', 33, 28),
               ('Genesis', 1, 1),
               ('Genesis', 27, 39),
               ('Isaiah', 49, 13),
               ('Psalms', 33, 6),
               ('Zephaniah', 1, 3)}})]


In [16]:
translationPreferences = ('nasb','akjv','web','kjv') # see https://www.4-14.org.uk/xml-bible-web-service-api

def formatSection(sectionTuple):
    book = fix_book(sectionTuple[0])
    chapter = sectionTuple[1]
    verse = sectionTuple[2]
    return '{} {}:{}'.format(book, chapter, verse)

def getPreferred(lexNode, sectionSet):
    preferredSamples = list()
    lexGloss = etcbcLex[lexNode]['WNgloss']
    for sect in sectionSet:
        sect = formatSection(sect)
        for prefTrans in translationPreferences:
            try:
                if (lexGloss, 'NOUN') in lemmatizedSamps[prefTrans][sect]['posLem']:
                    tokens = lemmatizedSamps[prefTrans][sect]['tokens']
                    preferredSamples.append({'translation':prefTrans, 'reference' : sect, 'tokens' : tokens})
                    break
            except KeyError:
                continue
    return preferredSamples

# create the new mapping:

lexToGoodSamples = defaultdict(dict)
noSamplesRemaining = set()

indent(reset=True)
info('processing samples...')
for lex, lexDat in etcbcLex.items():
    samples = lexDat['samples']
    processedSamples = getPreferred(lex, samples)
    if not processedSamples:
        noSamplesRemaining.add(lex)
        continue
    lexToGoodSamples[lex] = {'WNgloss': lexDat['WNgloss'], 
                             'samples' : processedSamples}
    
avgSampleLen = round(sum( len(passages) for word in lexToGoodSamples \
                         for passages in lexToGoodSamples[word] ) / len(lexToGoodSamples),2)

info('COMPLETE! Average remaining samples per lex: {}'.format(avgSampleLen))
info('Culled lex objects: {}'.format(len(noSamplesRemaining)))
info('Remaining lex objects: {}'.format(len(lexToGoodSamples)))

  0.00s processing samples...
  0.09s COMPLETE! Average remaining samples per lex: 14.0
  0.09s Culled lex objects: 1127
  0.09s Remaining lex objects: 2045


In [17]:
# sample 
etcbcLex = lexToGoodSamples
pprint(list(etcbcLex.items())[:5])

[(1436896,
  {'WNgloss': 'beginning',
   'samples': [{'reference': 'Deuteronomy 11:12',
                'tokens': ('a',
                           'land',
                           'for',
                           'which',
                           'the',
                           'LORD',
                           'your',
                           'God',
                           'cares',
                           ';',
                           'the',
                           'eyes',
                           'of',
                           'the',
                           'LORD',
                           'your',
                           'God',
                           'are',
                           'always',
                           'on',
                           'it',
                           ',',
                           'from',
                           'the',
                           'beginning',
                           'even',
                

## 4. feed lexeme and sample passsages to the [Word Sense Disambiguation algorithm](http://www.nltk.org/howto/wsd.html)
### map lexeme to wordnet sense



In [21]:
from nltk.wsd import lesk # disambiguation algorithm

threshold = 2

def disambiguate(gloss, samples):
    scores = Counter()
    for sample in samples:
        proposedSynset = lesk(sample, gloss, 'n')
        if proposedSynset:
            scores[proposedSynset] += 1
    if not scores or sum(scores.values()) < threshold: return None
    return max(scores) # return the simple majority

def pullSamples(lexNode):
    samples = tuple(text['tokens'] for text in etcbcLex[lexNode]['samples'])
    return samples

# create the mapping

mappedToSynset = defaultdict(dict)
culledLexs = set()

indent(reset=True)
info('looking for synsets...')

count = 0

for lexNode, lexDat in list(etcbcLex.items()):
    count += 1    
    samples = pullSamples(lexNode)
    synset = disambiguate(lexDat['WNgloss'], samples)
    if not synset:
        culledLexs.add(lexNode)
        continue
    mappedToSynset[lexNode] = lexDat
    mappedToSynset[lexNode].update({'synset': synset})

info('Culled lex objects: {}'.format(len(culledLexs)))
info('Remaining lex objects: {}'.format(len(mappedToSynset)))

  0.00s looking for synsets...
  0.31s Culled lex objects: 739
  0.31s Remaining lex objects: 1306


In [23]:
for lex, lexDat in list(mappedToSynset.items())[:150]:
    print('{:>5} --> {:>5}\n{}\n'.format(lexDat['WNgloss'], lexDat['synset'].name(), lexDat['synset'].definition()))

beginning --> beginning.n.02
the time at which something is supposed to begin

earth --> worldly_concern.n.01
the concerns of this life as distinguished from heaven and the afterlife

darkness --> iniquity.n.01
absence of moral or spiritual values

 face --> face.n.07
the part of an animal corresponding to the human face

water --> water_system.n.02
a facility that provides a source of water

light --> sparkle.n.01
merriment expressed by a brightness or gleam or animation of countenance

  day --> sidereal_day.n.01
the time for one complete rotation of the earth relative to a particular star, about 4 minutes shorter than a mean solar day

night --> night.n.07
the time between sunset and midnight

evening --> evening.n.03
the early part of night (from dinner until bedtime) spent in a special way

morning --> morning.n.01
the time period between dawn and noon

  one --> one.n.02
a single person or thing

firmament --> celestial_sphere.n.01
the apparent surface of the imaginary sphere on 