In [32]:
from datetime import datetime
last_modified = datetime.now()
print('Notebook last modified on {}'.format(last_modified.__str__()))

Notebook last modified on 2017-01-15 21:52:27.788876


# WordNet to ETCBC

This is an attempt to map categories to substantive lex objects in ETCBC for valency research. Senses in Princeton Wordnet contain hypernym and hyponym relations, allowing for grouping senses into general categories. The ETCBC contains a gloss feature for every Hebrew term in the database. In order to accomplish some mappings, we need to follow several steps.

1. retrieve all noun, lexeme objects in etcbc with a mapping to 10 (max) random sample passages
2. pass each etcbc gloss from the lexeme to Morphy to get the lexical form in English.
    * remap the cleaned etcbc gloss to the 10 sample passages
3. extract the sample passages from English translations with [XML Bible API](https://www.4-14.org.uk/xml-bible-web-service-api)
    * tokenize and parse each word in the sample passages
    * filter out any sample passages that do not contain the lexeme
    * map each lexeme to its verified sample passage
4. feed lexeme and sample passsages to the [Word Sense Disambiguation algorithm](http://www.nltk.org/howto/wsd.html)
    * map lexeme to wordnet sense
5. test the hypernyms for each sense to create the categories
6. export the new list

Along the way, keep track of how many lexemes we account for from ETCBC so that we can weed out solvable problems in the mapping process.

## Updates:

√ 15.01.17, removed some spurious lexemes. See [Special Adjustments](#Special-Adjustments:)

In [1]:
from collections import *
from tf.fabric import Fabric

TF = Fabric(modules='hebrew/etcbc4c')
api = TF.load('book chapter verse gloss lex g_word_utf8 sp')
api.makeAvailableIn(globals())

This is Text-Fabric 2.3.0
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
109 features found and 0 ignored
  0.00s loading features ...
   |     0.01s B book                 from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.01s B chapter              from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.01s B verse                from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.20s B g_word_utf8          from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.01s B gloss                from /Users/Cody/github/text-fabric-data/hebrew/etcbc4c
   |     0.15s

# 1. retrieve noun, lexeme objects in etcbc with a mapping to 10 (max) random sample passages

In [2]:
etcbcLex = defaultdict(set) # mapping from etcbc word nodes to 10 max sample passages

indent(reset = True)
info('beginning search...')
for lex in F.otype.s('lex'):
    if F.sp.v(lex) != 'subs': continue
    samples = set(L.d(lex, otype = 'word')) # randomize with set
    samples = list(samples)[:10] if len(samples) > 10 else samples # pull only 10
    samples = set(T.sectionFromNode(n) for n in samples) # pull passage info
    etcbcLex[lex] = set(samples)
    if len(etcbcLex) % 500 == 0: info('{} words processed...'.format(len(etcbcLex)))
        
avgSampleLen = round(sum( len(passages) for word in etcbcLex for passages in etcbcLex[word] ) / len(etcbcLex),2)
        
info('Complete with {} words and average {} sample passages'.format(len(etcbcLex),avgSampleLen))

  0.00s beginning search...
  0.21s 500 words processed...
  0.35s 1000 words processed...
  0.47s 1500 words processed...
  0.57s 2000 words processed...
  0.64s 2500 words processed...
  0.70s 3000 words processed...
  0.74s 3500 words processed...
  0.84s 4000 words processed...
  0.86s Complete with 4077 words and average 13.35 sample passages


In [3]:
list(etcbcLex.items())[:2]

[(1436896,
  {('Deuteronomy', 11, 12),
   ('Ezekiel', 44, 30),
   ('Genesis', 1, 1),
   ('Jeremiah', 2, 3),
   ('Jeremiah', 49, 34),
   ('Jeremiah', 49, 35),
   ('Nehemiah', 12, 44),
   ('Numbers', 15, 20),
   ('Numbers', 15, 21),
   ('Proverbs', 3, 9)}),
 (1436898,
  {('1_Chronicles', 19, 13),
   ('2_Kings', 4, 7),
   ('Ezekiel', 28, 13),
   ('Genesis', 1, 1),
   ('Genesis', 1, 2),
   ('Genesis', 43, 23),
   ('Joshua', 18, 3),
   ('Psalms', 7, 4),
   ('Psalms', 55, 15)})]

In [4]:
# a set to track which lexemes are weeded out
lexTracker = set(etcbcLex.keys())
len(lexTracker)

4077

# 2. pass each etcbc gloss from the lexeme to Morphy to get the lexical form in English

In [5]:
from nltk.corpus import wordnet as wn

In [6]:
cleanGlosses = {}
changes = set()

for lexNode, samples in etcbcLex.items():
    lex = F.gloss.v(lexNode)
    morph = wn.morphy(lex, wn.NOUN)
    if morph:
        cleanGlosses[lexNode] = {'WNgloss':morph, 'samples': samples}
        if morph != lex:
            changes.add('{} --> {}'.format(lex, morph))
        
print('glosses found: {}'.format(len(cleanGlosses)))
print('etcbc words lost: {}'.format(len(lexTracker)-len(cleanGlosses)))
print('total changes: {}'.format(len(changes)))
print('sample changes: {}'.format('\n\t\t'.join(list(changes)[:10])))

glosses found: 3172
etcbc words lost: 905
total changes: 84
sample changes: members --> member
		giants --> giant
		adders --> adder
		loops --> loop
		generations --> generation
		robes --> robe
		coverlets --> coverlet
		handshakes --> handshake
		charmers --> charmer
		hostages --> hostage


In [7]:
etcbcLex = cleanGlosses
lexTracker = set(etcbcLex.keys()) & lexTracker
print('remaining words:', len(lexTracker))

remaining words: 3172


# 3. extract the sample passages from English translations with XML Bible API

In [8]:
import requests
import time
from lxml import etree
from spacy.en import English

verseCache = defaultdict(dict)

In [9]:
# first retrieve all verses 

passageText = {}
passagesToGet = set(sect for data in etcbcLex for sect in etcbcLex[data]['samples'])
translationPreferences = ('nasb','ylt','akjv','web','kjv') # see https://www.4-14.org.uk/xml-bible-web-service-api
apiURL = 'http://api.preachingcentral.com/bible.php?passage={passage}&version={version}'

def fix_book(bookString):
    if '_' in bookString:
        return ' '.join(bookString.split('_'))
    else:
        return bookString

def getRequest(passages, version):
    passages = set('{} {}:{}'.format(fix_book(passa[0]), passa[1], passa[2]) for passa in passages)
    passages = ', '.join(passages)
    requestURL = apiURL.format(passage=passages, version=version)
    rawData = requests.get(requestURL).content
    bibleRoot = etree.fromstring(rawData)
    for result in bibleRoot.findall('range/result'):
        passage = result.text
        try:
            text = result.findall('../item/text')[0].text
        except IndexError:
            info('not found: {}'.format(result.text))
        verseCache[version].update({passage: text})

def getPassages(passageSet, version):
    assembledRequest = set()
    for count, passage in enumerate(passageSet):
        if passage in verseCache[version]:
            continue
        assembledRequest.add(passage)
        if any([count != 0 and count % 400 == 0, # create requests of max 400 verses at a time 
                count == len(passageSet)-1  
               ]):
            getRequest(assembledRequest, version) # stores the verse in the cache every 400 verses
            time.sleep(3)
            assembledRequest = set()
            info('verses logged at count: {}'.format(count))
            info('total verses: {}'.format(len(verseCache[version])))
            
for translation in translationPreferences:
    break # the results of this query have been pushed to a json file (cf. below)
    indent(reset=True)
    info('preparing {}'.format(translation))
    getPassages(passagesToGet, translation)

In [10]:
import json 

with open('lexTranslations.json', 'r') as file:
    samples = json.load(file)
    
print(len(samples['nasb']))

8876


## tokenize and parse each word in the sample passages

In [11]:
parser = English()

**First a test/demonstration of Spacy. We link אֶרֶץ (earth) with the parsed Spacy text from the NASB. The output results in True**

In [12]:
test = samples['nasb']['Genesis 1:1']
parsedTest = parser(test)
erets = list(lex for lex in F.otype.s('lex') if F.lex.v(lex) == '>RY/')[0]
testSentence = list(parsedTest.sents)

print(testSentence)
print('lemmatized:\n', list(token.lemma_ for token in testSentence[0]))
print('partOfSpeech:\n', list(token.pos_ for token in testSentence[0]))

# true or false? Now we demonstrate the link from ETCBC to the translation: 
(F.gloss.v(erets),'NOUN') in set((token.lemma_, token.pos_) for token in testSentence[0])

[In the beginning God created the heavens and the earth.]
lemmatized:
 ['in', 'the', 'begin', 'god', 'create', 'the', 'heaven', 'and', 'the', 'earth', '.']
partOfSpeech:
 ['ADP', 'DET', 'VERB', 'PROPN', 'VERB', 'DET', 'NOUN', 'CONJ', 'DET', 'NOUN', 'PUNCT']


True

**tokenize the verses and keep a part of speech**

In [13]:
lemmatizedSamps = defaultdict(dict)
passageCount = 0

indent(reset = True)
for version, passages in samples.items():
    info('starting {}'.format(version))
    for passage, text in passages.items():
        passageCount += 1
        pText = list(parser(text))
        lemmatizedSamps[version][passage]= {'tokens' : tuple(w.orth_ for w in pText), # for sense disambiguation
                                            'posLem' : set((w.lemma_,w.pos_) for w in pText) # to check for lemma
                                            }
        if passageCount != 0 and passageCount % 1000 == 0:
            info('{} passages processed...'.format(passageCount))
info('\nCOMPLETE! {} passages processed'.format(passageCount))

  0.00s starting nasb
  3.56s 1000 passages processed...
  6.27s 2000 passages processed...
  9.62s 3000 passages processed...
    12s 4000 passages processed...
    16s 5000 passages processed...
    19s 6000 passages processed...
    21s 7000 passages processed...
    24s 8000 passages processed...
    27s starting ylt
    27s 9000 passages processed...
    30s 10000 passages processed...
    32s 11000 passages processed...
    35s 12000 passages processed...
    38s 13000 passages processed...
    41s 14000 passages processed...
    43s 15000 passages processed...
    45s 16000 passages processed...
    48s 17000 passages processed...
    50s starting akjv
    51s 18000 passages processed...
    54s 19000 passages processed...
    56s 20000 passages processed...
    59s 21000 passages processed...
 1m 01s 22000 passages processed...
 1m 03s 23000 passages processed...
 1m 06s 24000 passages processed...
 1m 09s 25000 passages processed...
 1m 12s 26000 passages processed...
 1m 13s 

In [14]:
from pprint import pprint

pprint(lemmatizedSamps['nasb']['Genesis 1:1'])

{'posLem': {('.', 'PUNCT'),
            ('and', 'CONJ'),
            ('begin', 'VERB'),
            ('create', 'VERB'),
            ('earth', 'NOUN'),
            ('god', 'PROPN'),
            ('heaven', 'NOUN'),
            ('in', 'ADP'),
            ('the', 'DET')},
 'tokens': ('In',
            'the',
            'beginning',
            'God',
            'created',
            'the',
            'heavens',
            'and',
            'the',
            'earth',
            '.')}


### filter out any sample passages that do not contain the lexeme; map each lexeme to its verified sample passage

In [15]:
# if the sample passage does not contain the exact lexeme,
# move on to the next preferred translation in the preferredTranslations list 

pprint(list(etcbcLex.items())[:2])

[(1436896,
  {'WNgloss': 'beginning',
   'samples': {('Deuteronomy', 11, 12),
               ('Ezekiel', 44, 30),
               ('Genesis', 1, 1),
               ('Jeremiah', 2, 3),
               ('Jeremiah', 49, 34),
               ('Jeremiah', 49, 35),
               ('Nehemiah', 12, 44),
               ('Numbers', 15, 20),
               ('Numbers', 15, 21),
               ('Proverbs', 3, 9)}}),
 (1436901,
  {'WNgloss': 'heavens',
   'samples': {('2_Chronicles', 33, 3),
               ('Deuteronomy', 2, 25),
               ('Deuteronomy', 28, 24),
               ('Deuteronomy', 30, 19),
               ('Deuteronomy', 33, 28),
               ('Genesis', 1, 1),
               ('Genesis', 27, 39),
               ('Isaiah', 49, 13),
               ('Psalms', 33, 6),
               ('Zephaniah', 1, 3)}})]


In [16]:
translationPreferences = ('nasb','akjv','web','kjv') # see https://www.4-14.org.uk/xml-bible-web-service-api

def formatSection(sectionTuple):
    book = fix_book(sectionTuple[0])
    chapter = sectionTuple[1]
    verse = sectionTuple[2]
    return '{} {}:{}'.format(book, chapter, verse)

def getPreferred(lexNode, sectionSet):
    preferredSamples = list()
    lexGloss = etcbcLex[lexNode]['WNgloss']
    for sect in sectionSet:
        sect = formatSection(sect)
        for prefTrans in translationPreferences:
            try:
                if (lexGloss, 'NOUN') in lemmatizedSamps[prefTrans][sect]['posLem']:
                    tokens = lemmatizedSamps[prefTrans][sect]['tokens']
                    preferredSamples.append({'translation':prefTrans, 'reference' : sect, 'tokens' : tokens})
                    break
            except KeyError:
                continue
    return preferredSamples

# create the new mapping:

lexToGoodSamples = defaultdict(dict)
noSamplesRemaining = set()

indent(reset=True)
info('processing samples...')
for lex, lexDat in etcbcLex.items():
    samples = lexDat['samples']
    processedSamples = getPreferred(lex, samples)
    if not processedSamples:
        noSamplesRemaining.add(lex)
        continue
    lexToGoodSamples[lex] = {'WNgloss': lexDat['WNgloss'], 
                             'samples' : processedSamples}
    
avgSampleLen = round(sum( len(passages) for word in lexToGoodSamples \
                         for passages in lexToGoodSamples[word] ) / len(lexToGoodSamples),2)

info('COMPLETE! Average remaining samples per lex: {}'.format(avgSampleLen))
info('Culled lex objects: {}'.format(len(noSamplesRemaining)))
info('Remaining lex objects: {}'.format(len(lexToGoodSamples)))

  0.00s processing samples...
  0.08s COMPLETE! Average remaining samples per lex: 14.0
  0.08s Culled lex objects: 1127
  0.08s Remaining lex objects: 2045


In [17]:
# sample 
etcbcLex = lexToGoodSamples
pprint(list(etcbcLex.items())[:1])

[(1436896,
  {'WNgloss': 'beginning',
   'samples': [{'reference': 'Deuteronomy 11:12',
                'tokens': ('a',
                           'land',
                           'for',
                           'which',
                           'the',
                           'LORD',
                           'your',
                           'God',
                           'cares',
                           ';',
                           'the',
                           'eyes',
                           'of',
                           'the',
                           'LORD',
                           'your',
                           'God',
                           'are',
                           'always',
                           'on',
                           'it',
                           ',',
                           'from',
                           'the',
                           'beginning',
                           'even',
                

## 4. feed lexeme and sample passsages to the [Word Sense Disambiguation algorithm](http://www.nltk.org/howto/wsd.html)
### map lexeme to wordnet sense



In [18]:
from nltk.wsd import lesk # disambiguation algorithm

threshold = 2

def disambiguate(gloss, samples):
    scores = Counter()
    for sample in samples:
        proposedSynset = lesk(sample, gloss, 'n')
        if proposedSynset:
            scores[proposedSynset] += 1
    if not scores or sum(scores.values()) < threshold: return None
    return max(scores) # return the simple majority

def pullSamples(lexNode):
    samples = tuple(text['tokens'] for text in etcbcLex[lexNode]['samples'])
    return samples

# create the mapping

mappedToSynset = defaultdict(dict)
culledLexs = set()

indent(reset=True)
info('looking for synsets...')

count = 0

for lexNode, lexDat in list(etcbcLex.items()):
    count += 1    
    samples = pullSamples(lexNode)
    synset = disambiguate(lexDat['WNgloss'], samples)
    if not synset:
        culledLexs.add(lexNode)
        continue
    mappedToSynset[lexNode] = lexDat
    mappedToSynset[lexNode].update({'synset': synset})

info('Culled lex objects: {}'.format(len(culledLexs)))
info('Remaining lex objects: {}'.format(len(mappedToSynset)))

  0.00s looking for synsets...
  1.49s Culled lex objects: 739
  1.49s Remaining lex objects: 1306


In [19]:
hyper = lambda s: s.hypernyms()

for lex, lexDat in list(mappedToSynset.items())[:150]:
    print('{} --> {}\n{}\nHYPERS: {}\n'.format(lexDat['WNgloss'], 
                                               lexDat['synset'].name(), 
                                               lexDat['synset'].definition(),
                                               list(x for x in lexDat['synset'].closure(hyper))
                                  )
         )

beginning --> beginning.n.02
the time at which something is supposed to begin
HYPERS: [Synset('point.n.06'), Synset('measure.n.02'), Synset('abstraction.n.06'), Synset('entity.n.01')]

earth --> worldly_concern.n.01
the concerns of this life as distinguished from heaven and the afterlife
HYPERS: [Synset('concern.n.01'), Synset('interest.n.01'), Synset('curiosity.n.01'), Synset('cognitive_state.n.01'), Synset('psychological_state.n.01'), Synset('condition.n.01'), Synset('state.n.02'), Synset('attribute.n.02'), Synset('abstraction.n.06'), Synset('entity.n.01')]

darkness --> iniquity.n.01
absence of moral or spiritual values
HYPERS: [Synset('condition.n.01'), Synset('state.n.02'), Synset('attribute.n.02'), Synset('abstraction.n.06'), Synset('entity.n.01')]

face --> face.n.07
the part of an animal corresponding to the human face
HYPERS: [Synset('external_body_part.n.01'), Synset('body_part.n.01'), Synset('part.n.03'), Synset('thing.n.12'), Synset('physical_entity.n.01'), Synset('entity.n

In [20]:
wn.synset('living_thing.n.01').definition()

'a living (or once living) entity'

# 5. test the hypernyms for each sense to create the categories

Look up all hypernyms in the path to the terminal node of a given sense. Then look for certain qualities in the path that would lend one of 4 categories:
1. object 
2. agent (living)
3. location (place)
4. time* 

The categories are created through a process of elimination, as applied below...

In [21]:
lexToSynset = dict( (lex, mappedToSynset[lex]['synset']) for lex in mappedToSynset )

In [22]:
accountedFor = set()
catRules = {('time','point'): {'point.n.06','measure.'},
            ('time','period') : {'time_period.n.01'},
            ('time','unit') : {'time_unit.n.01'},
            ('time','tense') : {'tense.n.01'},
            ('place','boundary') : {'region.n.01', 'boundary.n.01'},
            ('place','point'): {'location.n.01','point.n.02'},
            ('place','location') : {'region.n.01','location.n.01'},
            ('place','location','general') : {'location.n.01'},
            ('place','region') : {'region.n.03'},
            ('place','biome') : {'biome.n.01'},
            ('place','geological formation') : {'geological_formation.n.01'},
            ('place','body of water') : {'body_of_water.n.01'},
            ('place','position') : {'position.n.07'},
            ('agency','unit') : {'unit.n.03'},
            ('agency','group') : {'people.n.01'},
            ('agency','living') : {'living_thing.n.01'},
            ('agency','social group') : {'social_group.n.01'},
            ('agency','biological group') : {'biological_group.n.01'},
            ('object','body part') : {'body_part.n.01'},
            ('object','abstract', 'state') : {'state.n.02','condition.n.01'},
            ('object','abstract','state of mind') : {'cognitive_state.n.01'},
            ('object','abstract','mental feature') : {'psychological_feature.n.01'},
            ('object','abstract','attribute') : {'property.n.02'},
            ('object','abstract','expression') : {'expression.n.01'},
            ('object','physical','man-made') : {'artifact.n.01'},
            ('object','abstract','unit') : {'unit.n.04'},
            ('object','abstract','unit2') : {'unit_of_measurement.n.01'},
            ('object', 'abstract', 'measure') : {'measure.n.02'},
            ('object','abstract','idea') : {'idea.n.01'},
            ('object','abstract','structure') : {'structure.n.03'},
            ('object','physical','substance') : {'substance.n.01'},
            ('object','physical','substance2') : {'substance.n.07'},
            ('object','physical','solid') : {'solid.n.01'},
            ('object','physical','plant') : {'plant.n.02'},
            ('object','physical','plant in locale'): {'vegetation.n.01'},
            ('object','abstract','phenomenon in atmosphere') : {'atmospheric_phenomenon.n.01'},
            ('object','physical','nature') : {'natural_object.n.01'},
            ('object','physical','possession') : {'possession.n.02'},
            ('object','abstract','communication') : {'communication.n.02'},
            ('object','abstract','quantity') : {'indefinite_quantity.n.01'},
            ('object','abstract','motion') : {'mechanical_phenomenon.n.01'},
            ('object','abstract','phenomenon') : {'phenomenon.n.01'},
            ('object','abstract','process'):{'process.n.06'},
            ('object','physical','part') : {'part.n.02'},
           } 
for lex, synset in lexToSynset.items():
    hypernyms = set(h.name() for h in synset.closure(hyper))
    for cat, rule in catRules.items():
        if len(hypernyms & rule) == len(rule):
            accountedFor.add(lex)

In [23]:
print('accounted for: {}/{}'.format( 
      len(accountedFor & set(lexToSynset.keys())),
      len(lexToSynset.keys())
     ))

accounted for: 1130/1306


In [24]:
for lex in set(lexToSynset.keys()) - accountedFor:
    print(lexToSynset[lex])
    print(mappedToSynset[lex]['synset'].definition())
    print(list((x.name(), x.definition()) for x in mappedToSynset[lex]['synset'].closure(hyper)))
    print()

Synset('pile.n.01')
a collection of objects laid on top of each other
[('collection.n.01', 'several things grouped together or considered as a whole'), ('group.n.01', 'any number of entities (members) considered as a unit'), ('abstraction.n.06', 'a general concept formed by extracting common features from specific examples'), ('entity.n.01', 'that which is perceived or known or inferred to have its own distinct existence (living or nonliving)')]

Synset('sine.n.01')
ratio of the length of the side opposite the given angle to the length of the hypotenuse of a right-angled triangle
[('trigonometric_function.n.01', 'function of an angle expressed as a ratio of the length of the sides of right-angled triangle containing the angle'), ('function.n.01', '(mathematics) a mathematical relation such that each element of a given set (the domain of the function) is associated with an element of another set (the range of the function)'), ('mathematical_relation.n.01', 'a relation between mathematic

In [25]:
# Quality control: weed out superfluous rules

ruleMatches = defaultdict(set)

indent(reset = True)
info('beginning rule checks...')
for rule, ruleSet in catRules.items():
    if rule in ruleMatches:
        raise Exception('duplicate rule found: {} '.format(rule))
    for lex, synset in lexToSynset.items():
        hypernyms = set(h.name() for h in synset.closure(hyper))
        if len(hypernyms & ruleSet) == len(ruleSet):
            ruleMatches[rule].add(lex)
info('DONE!')
info('rules found: {}'.format(len(ruleMatches)))
info('terms found: {}'.format(len(set(synset for rule in ruleMatches for synset in ruleMatches[rule]))))
print()
info('comparing matches...')

ruleOverlaps = {}

for rule, matches in ruleMatches.items():
    overlaps = list()
    for compareRule, compareMatches in ruleMatches.items():
        if rule != compareRule and matches & compareMatches:
            if rule[0] != compareRule[0]:
                overlaps.append({
                                   'overlapper':compareRule,
                                   '#overlaps':len(matches&compareMatches),
                                    #'overlaps':list(F.gloss.v(n) for n in matches&compareMatches)
                                   '#my matches':len(matches),
                                   '#their matches':len(compareMatches),
                                   'my ruleSet': catRules[rule],
                                   'their ruleSet': catRules[compareRule]
                               })
            
    if overlaps:
        ruleOverlaps[rule] = overlaps
            
info('comparisons COMPLETE!')

  0.00s beginning rule checks...
  2.10s DONE!
  2.10s rules found: 43
  2.10s terms found: 1130

  2.11s comparing matches...
  2.11s comparisons COMPLETE!


In [26]:
print(len(ruleOverlaps))
pprint(ruleOverlaps)

5
{('agency', 'living'): [{'#my matches': 178,
                         '#overlaps': 25,
                         '#their matches': 25,
                         'my ruleSet': {'living_thing.n.01'},
                         'overlapper': ('object', 'physical', 'plant'),
                         'their ruleSet': {'plant.n.02'}}],
 ('object', 'abstract', 'measure'): [{'#my matches': 55,
                                      '#overlaps': 20,
                                      '#their matches': 20,
                                      'my ruleSet': {'measure.n.02'},
                                      'overlapper': ('time', 'period'),
                                      'their ruleSet': {'time_period.n.01'}},
                                     {'#my matches': 55,
                                      '#overlaps': 4,
                                      '#their matches': 4,
                                      'my ruleSet': {'measure.n.02'},
                                      

# 6. export the new list

For conflicts, the node will be grouped with the smaller (the more specific, in theory) group. 

*[future improvement: track the max depth of each path for a given sense and take the deeper path]*

Now we process the remaining lemmas and group them into their smaller categories. The file is exported.

In [27]:
lexToCat = defaultdict(dict)

for lexNode in accountedFor:
    for rule, matches in ruleMatches.items():
        if lexNode in matches:
            weight = len(matches) 
            if weight < lexToCat[lexNode].setdefault('weight', weight+1):
                lexToCat[lexNode]={'weight':weight,'cat':rule}
                
len(lexToCat)

1130

In [28]:
fieldnames = ['pos','type','of kind']

exportList = dict( (F.lex.v(lex), {'cat': lexToCat[lex]['cat'][0], 
                                   'subcategory': lexToCat[lex]['cat'][1],
                                   'subsubcategory':lexToCat[lex]['cat'][2] if len(lexToCat[lex]['cat']) > 2 else ''
                                  }
                   ) for lex in lexToCat)

## Special Adjustments:

In [30]:
del exportList['KNP/']
del exportList['ML>KT/']

In [31]:
with open('wordnetCategories.json','w') as file:
    json.dump(exportList, file)