<img align="right" src="tf-small.png"/>

# Subjects

Ultimate goal: determine the *number* (singular/plural) of every subject phrase.

First step: an inventory of all subjects.

Below is a categorisation of cases.
The outcome of this crude analysis is that we can compute the number for most cases, but
2777 cases remain that require attention.
This notebook writes those cases to an
[Excel Sheet](subjectCases.xlsx).

In [66]:
import sys, collections
import xlsxwriter
from tf.fabric import Fabric

# Call Text-Fabric

Everything starts by setting up Text-Fabric.
It needs to know where to look for data.

In [3]:
ETCBC = 'hebrew/etcbc4c'
TF = Fabric(modules=[ETCBC])

This is Text-Fabric 2.3.4
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
108 features found and 0 ignored


# Load Features

In [5]:
api = TF.load('''
    sp st
    function
''')
api.makeAvailableIn(globals())

  0.00s loading features ...
   |     0.18s B sp                   from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.13s B st                   from /Users/dirk/github/text-fabric-data/hebrew/etcbc4c
   |     0.00s Feature overview: 102 nodes; 5 edges; 1 configs; 7 computeds
  0.36s All features loaded/computed - for details use loadLog()


# Counting

We count the number of all subject phrases.

Then we refine. We want to know how many subject phrases there are with 1, 2, 3, etc. nouns in the
absolute state.

As a preliminary, we make sure that every node that has `function=Subj` is in fact a phrase node.

In [8]:
indent(reset=True)
info('Checking Subj nodes')
e = 0
for n in F.function.s('Subj'):
    if F.otype.v(n) != 'phrase': e += 1
info('Done. {} Subj nodes are not phrases'.format(e))

  0.00s Checking Subj nodes
  0.09s Done. 0 Subj nodes are not phrases


In [16]:
subjects = collections.Counter()
indent(reset=True)
nounCodes = {'subs', 'nmpr'} # do not forget the proper nouns

info('Counting subject phrases ...')
for p in F.function.s('Subj'):
    words = tuple(w for w in L.d(p, 'word') if F.sp.v(w) in nounCodes and F.st.v(w) == 'a')
    nwords = len(words)
    subjects[nwords] += 1

info('Done: found {} subject phrases'.format(sum(subjects.values())))

  0.00s Counting subject phrases ...
  0.32s Done: found 31930 subject phrases


Now let us see how they are distributed with respect to the number of nouns in absolute state that they contain.

In [18]:
moreThanOne = 0
for (nAbs, nSubjects) in sorted(subjects.items(), key=lambda x: (x[0], -x[1])):
    print('{:>2} absolutes: {:>5} subjects'.format(nAbs, nSubjects))
    if nAbs >= 2: moreThanOne += nSubjects
print('>1 absolutes: {:>5} subjects'.format(moreThanOne))

 0 absolutes:  6890 subjects
 1 absolutes: 20812 subjects
 2 absolutes:  3098 subjects
 3 absolutes:   677 subjects
 4 absolutes:   221 subjects
 5 absolutes:    88 subjects
 6 absolutes:    52 subjects
 7 absolutes:    37 subjects
 8 absolutes:    36 subjects
 9 absolutes:     8 subjects
10 absolutes:     7 subjects
11 absolutes:     1 subjects
12 absolutes:     1 subjects
13 absolutes:     2 subjects
>1 absolutes:  4228 subjects


# Next steps:

* Inspect the subjects with zero nouns in the absolute state. Are these phrases pronominal? Do they
  consist of `asher`? We can assign a '??' value to `asher` cases, unless we want to go as far as determining
  the antecedent phrases to obtain their number.
* Inspect the phrases with one absolute noun. Can we safely identify the number of the phrase with the number of
  the noun?
* For the 4228 subjects with more than one absolute: inspect, classify, distinguish between easy classes and difficult
  classes,
  show difficult cases in a spreadsheet, sort them, and see whether we can
  assign a number to these semi-automatically.
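
The three classes in this list can be sketched on plain Python data, without Text-Fabric. The example below is only an illustration: the helper names `countAbsoluteNouns` and `bucket` and the sample phrases are made up, and words are represented as `(sp, st)` pairs, with `'NA'` assumed as the state value of words that have no state.

```python
from collections import Counter

nounCodes = {'subs', 'nmpr'}  # substantives and proper nouns, as in this notebook

def countAbsoluteNouns(words):
    """Count the nouns in the absolute state among (sp, st) pairs."""
    return sum(1 for (sp, st) in words if sp in nounCodes and st == 'a')

def bucket(words):
    """Assign a phrase to one of the three classes: zero, one, many."""
    n = countAbsoluteNouns(words)
    return 'zero' if n == 0 else 'one' if n == 1 else 'many'

# Hypothetical sample phrases, not taken from the data
examples = {
    'pronoun only': [('prps', 'NA')],
    'article + noun': [('art', 'NA'), ('subs', 'a')],
    'construct chain': [('subs', 'c'), ('subs', 'a')],
    'noun and noun': [('subs', 'a'), ('conj', 'NA'), ('subs', 'a')],
}

buckets = Counter(bucket(words) for words in examples.values())
for (label, words) in examples.items():
    print('{:<16} -> {}'.format(label, bucket(words)))
```

Note that a construct chain ending in one absolute noun lands in the class `one`, because only nouns in the absolute state are counted.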

# Zero nouns

Let us see how the zero-noun subjects are shaped. We want to know how many words they have, and how many of each part-of-speech.

In [19]:
zeroSubjects = collections.Counter()
indent(reset=True)

info('Counting zero-noun subject phrases ...')
for p in F.function.s('Subj'):
    allWords = L.d(p, 'word')
    words = tuple(w for w in allWords if F.sp.v(w) in nounCodes and F.st.v(w) == 'a')
    nwords = len(words)
    if nwords > 0: continue
    profile = collections.Counter()
    for w in allWords:
        profile[F.sp.v(w)] += 1
    fixedProfile = tuple(sorted(profile.items(), key=lambda x: (x[0], -x[1])))
    zeroSubjects[fixedProfile] += 1
info('Done. Found {} profiles'.format(len(zeroSubjects)))

  0.00s Counting zero-noun subject phrases ...
  0.36s Done. Found 111 profiles


Now show the profiles.

In [20]:
for (profile, nSubjects) in sorted(zeroSubjects.items(), key=lambda x: (-x[1], x[0])):
    print('{:>5} subjects have profile {}'.format(nSubjects, profile))

 3766 subjects have profile (('prps', 1),)
  610 subjects have profile (('prde', 1),)
  495 subjects have profile (('adjv', 1),)
  349 subjects have profile (('verb', 1),)
  342 subjects have profile (('prin', 1),)
  225 subjects have profile (('adjv', 1), ('art', 1))
  180 subjects have profile (('adjv', 1), ('subs', 1))
  157 subjects have profile (('art', 1), ('verb', 1))
  156 subjects have profile (('advb', 1), ('prps', 1))
  146 subjects have profile (('subs', 1),)
   73 subjects have profile (('subs', 1), ('verb', 1))
   51 subjects have profile (('prde', 1), ('subs', 1))
   45 subjects have profile (('adjv', 1), ('art', 1), ('subs', 1))
   22 subjects have profile (('prep', 1),)
   21 subjects have profile (('advb', 1), ('prde', 1))
   18 subjects have profile (('art', 1), ('subs', 1), ('verb', 1))
   15 subjects have profile (('subs', 2),)
   13 subjects have profile (('adjv', 2), ('art', 2), ('conj', 1))
   12 subjects have profile (('adjv', 2), ('conj', 1))
   12 subjects ha

## Observation
More than half of the zero-noun subjects consist of a single personal pronoun.
In those cases we can identify the number of the phrase with the number of the personal pronoun.
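
A minimal sketch of that rule, on plain Python records rather than Text-Fabric nodes. It assumes that each word carries a grammatical number under the key `nu`, with values such as `sg` and `pl`; this notebook does not load such a feature, so the key, its values and the helper `numberOfSinglePronoun` are assumptions for illustration only.

```python
def numberOfSinglePronoun(words):
    """Return the number of a phrase that consists of exactly one
    personal pronoun (sp == 'prps'); return None for any other phrase.

    words: list of dicts with the keys 'sp' and 'nu' (assumed layout).
    """
    if len(words) == 1 and words[0]['sp'] == 'prps':
        return words[0]['nu']
    return None

# Hypothetical sample phrases
he = [{'sp': 'prps', 'nu': 'sg'}]
they = [{'sp': 'prps', 'nu': 'pl'}]
theGood = [{'sp': 'art', 'nu': 'NA'}, {'sp': 'adjv', 'nu': 'sg'}]

for phrase in (he, they, theGood):
    print(numberOfSinglePronoun(phrase))
```

Phrases that are not a single personal pronoun yield `None`, so they remain available for the other rules.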

That leaves roughly 3000 cases that merit closer inspection.

It is probably not necessary to go over them individually, because other profiles also allow us
to determine the phrase number, e.g. the profiles with a single demonstrative pronoun, adjective, verb or interrogative pronoun. That is a reduction of 1800 cases.

The remaining 1200 can be reduced further. Where we have an adjective or verb with article, or an adjective with substantive, or an adverb with personal pronoun,
we can also determine the phrase number.
This gives a reduction of 700 cases.

Let's inspect the profiles of the remaining 500 cases.

Cases with 1 noun (not in absolute state), or combinations of noun, article, adjective, verb, adverb, pronoun, preposition, but never more than one of each, are probably easy as well.
That adds up to roughly 460 cases.

### Result
We are left with **40 cases for individual inspection**
in the subclass of subjects with zero nouns in the absolute state.
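
The rounded figures in this section can be retraced from the counts printed above. The sketch below only redoes that arithmetic; the variable names are made up and the numbers are copied from the distribution and the profile listing. Because the profile listing is cut off, the last two steps (the easy remainder and the cases for individual inspection) cannot be retraced this way.

```python
zeroNounTotal = 6890   # subjects with 0 absolute nouns
singlePronoun = 3766   # profile (('prps', 1),)

# Single-word profiles: demonstrative, adjective, verb, interrogative
singleWord = {'prde': 610, 'adjv': 495, 'verb': 349, 'prin': 342}

# Two-word profiles: adjective/verb with article, adjective with
# substantive, adverb with personal pronoun
twoWord = {
    ('adjv', 'art'): 225,
    ('adjv', 'subs'): 180,
    ('art', 'verb'): 157,
    ('advb', 'prps'): 156,
}

afterPronouns = zeroNounTotal - singlePronoun
singleWordTotal = sum(singleWord.values())
afterSingleWord = afterPronouns - singleWordTotal
twoWordTotal = sum(twoWord.values())
afterTwoWord = afterSingleWord - twoWordTotal

print(afterPronouns)    # 3124, "roughly 3000"
print(singleWordTotal)  # 1796, "a reduction of 1800"
print(afterSingleWord)  # 1328
print(twoWordTotal)     # 718, "a reduction of 700"
print(afterTwoWord)     # 610
```

With exact counts the remainder after these steps is 610, somewhat more than the rounded 500 used in the text.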

# Two nouns

This is a rather big class. Let us make profiles here, like in the zero-noun class, and identify useful patterns.

In [21]:
twoSubjects = collections.Counter()
indent(reset=True)

info('Counting two-noun subject phrases ...')
for p in F.function.s('Subj'):
    allWords = L.d(p, 'word')
    words = tuple(w for w in allWords if F.sp.v(w) in nounCodes and F.st.v(w) == 'a')
    nwords = len(words)
    if nwords != 2: continue
    profile = collections.Counter()
    for w in allWords:
        profile[F.sp.v(w)] += 1
    fixedProfile = tuple(sorted(profile.items(), key=lambda x: (x[0], -x[1])))
    twoSubjects[fixedProfile] += 1
info('Done. Found {} profiles'.format(len(twoSubjects)))

  0.00s Counting two-noun subject phrases ...
  0.38s Done. Found 265 profiles


Now show the profiles.

In [22]:
for (profile, nSubjects) in sorted(twoSubjects.items(), key=lambda x: (-x[1], x[0])):
    print('{:>5} subjects have profile {}'.format(nSubjects, profile))

  514 subjects have profile (('nmpr', 1), ('subs', 1))
  448 subjects have profile (('nmpr', 2), ('subs', 1))
  277 subjects have profile (('conj', 1), ('subs', 2))
  213 subjects have profile (('subs', 2),)
  183 subjects have profile (('art', 1), ('nmpr', 1), ('subs', 1))
  161 subjects have profile (('nmpr', 2),)
   94 subjects have profile (('conj', 1), ('nmpr', 2))
   67 subjects have profile (('nmpr', 1), ('subs', 2))
   54 subjects have profile (('conj', 1), ('nmpr', 1), ('subs', 1))
   43 subjects have profile (('subs', 3),)
   41 subjects have profile (('prep', 1), ('subs', 2))
   37 subjects have profile (('art', 1), ('subs', 2))
   37 subjects have profile (('conj', 1), ('subs', 3))
   35 subjects have profile (('conj', 1), ('nmpr', 2), ('subs', 1))
   34 subjects have profile (('art', 1), ('prep', 1), ('subs', 2))
   34 subjects have profile (('nmpr', 1), ('prep', 1), ('subs', 2))
   30 subjects have profile (('art', 2), ('conj', 1), ('subs', 2))
   30 subjects have profile

## Observation

There are roughly 3100 cases in total.
We quickly spot the usual suspect: noun + conjunction + noun.
But this is not the overwhelming majority (as I had hoped), only 277 cases.
If we add up all cases where we have a conjunction between two nouns (proper or otherwise), we see
277 + 94 + 54 = 425 cases.

Maybe it is better to make our profiles a bit coarser: let us only count nouns (including proper ones) and
conjunctions, and see what profiles we get.

In [23]:
twoSubjects = collections.Counter()
indent(reset=True)

relevantSp = nounCodes | {'conj'}

info('Counting two-noun subject phrases ...')
for p in F.function.s('Subj'):
    allWords = L.d(p, 'word')
    words = tuple(w for w in allWords if F.sp.v(w) in nounCodes and F.st.v(w) == 'a')
    nwords = len(words)
    if nwords != 2: continue
    profile = collections.Counter()
    for w in allWords:
        sp = F.sp.v(w)
        if sp not in relevantSp: continue
        profile[F.sp.v(w)] += 1
    fixedProfile = tuple(sorted(profile.items(), key=lambda x: (x[0], -x[1])))
    twoSubjects[fixedProfile] += 1
info('Done. Found {} profiles'.format(len(twoSubjects)))

  0.00s Counting two-noun subject phrases ...
  0.36s Done. Found 44 profiles
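
The effect of coarsening can be seen on a single made-up phrase. The sketch below builds a profile from a plain list of part-of-speech tags, once in full and once restricted to nouns and conjunctions; the helper `posProfile` and its `keep` parameter are illustrative names, not part of Text-Fabric.

```python
from collections import Counter

def posProfile(tags, keep=None):
    """Count the part-of-speech tags and return them as a sorted tuple,
    which is hashable and can serve as a key in a Counter of profiles.
    If keep is given, only the tags in that set are counted."""
    counts = Counter(t for t in tags if keep is None or t in keep)
    return tuple(sorted(counts.items()))

# Hypothetical phrase: article noun conjunction article noun
tags = ['art', 'subs', 'conj', 'art', 'subs']

fullProfile = posProfile(tags)
coarseProfile = posProfile(tags, keep={'subs', 'nmpr', 'conj'})

print(fullProfile)    # (('art', 2), ('conj', 1), ('subs', 2))
print(coarseProfile)  # (('conj', 1), ('subs', 2))
```

Phrases that differ only in articles, prepositions and the like collapse into the same coarse profile, which is why the number of profiles drops.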


Now show the simplified profiles.

In [24]:
for (profile, nSubjects) in sorted(twoSubjects.items(), key=lambda x: (-x[1], x[0])):
    print('{:>5} subjects have profile {}'.format(nSubjects, profile))

  764 subjects have profile (('nmpr', 1), ('subs', 1))
  477 subjects have profile (('nmpr', 2), ('subs', 1))
  406 subjects have profile (('subs', 2),)
  374 subjects have profile (('conj', 1), ('subs', 2))
  170 subjects have profile (('nmpr', 2),)
  149 subjects have profile (('nmpr', 1), ('subs', 2))
  130 subjects have profile (('subs', 3),)
  108 subjects have profile (('conj', 1), ('nmpr', 2))
   95 subjects have profile (('conj', 1), ('nmpr', 1), ('subs', 1))
   82 subjects have profile (('conj', 1), ('subs', 3))
   59 subjects have profile (('conj', 1), ('nmpr', 2), ('subs', 1))
   43 subjects have profile (('nmpr', 2), ('subs', 2))
   42 subjects have profile (('conj', 1), ('subs', 4))
   36 subjects have profile (('conj', 1), ('nmpr', 1), ('subs', 2))
   36 subjects have profile (('subs', 4),)
   27 subjects have profile (('conj', 1), ('nmpr', 2), ('subs', 2))
   27 subjects have profile (('nmpr', 1), ('subs', 3))
   11 subjects have profile (('conj', 2), ('subs', 2))
    7 

# More than one noun

It seems plausible that phrases with multiple nouns and a conjunction are plural.
So let us consider all those cases solved, and list the simplified profiles of the other cases.

In [60]:
onePlusSubjects = collections.Counter()
indent(reset=True)

info('Counting >1-noun subject phrases without conjunction ...')
for p in F.function.s('Subj'):
    allWords = L.d(p, 'word')
    words = tuple(w for w in allWords if F.sp.v(w) in nounCodes and F.st.v(w) == 'a')
    nwords = len(words)
    if nwords <= 1: continue
    hasConj = any(F.sp.v(w) == 'conj' for w in allWords)
    if hasConj: continue
    profile = collections.Counter()
    for w in allWords:
        sp = F.sp.v(w)
        if sp not in nounCodes: continue
        profile[F.sp.v(w)] += 1

    fixedProfile = tuple(sorted(profile.items(), key=lambda x: (x[0], -x[1])))
    onePlusSubjects[fixedProfile] += 1
info('Done. Found {} profiles with {} cases'.format(len(onePlusSubjects), sum(onePlusSubjects.values())))

  0.00s Counting >1-noun subject phrases without conjunction ...
  0.35s Done. Found 40 profiles with 2633 cases


Now show the simplified profiles.

In [61]:
for (profile, nSubjects) in sorted(onePlusSubjects.items(), key=lambda x: (-x[1], x[0])):
    print('{:>5} subjects have profile {}'.format(nSubjects, profile))

  764 subjects have profile (('nmpr', 1), ('subs', 1))
  485 subjects have profile (('nmpr', 2), ('subs', 1))
  406 subjects have profile (('subs', 2),)
  179 subjects have profile (('subs', 3),)
  170 subjects have profile (('nmpr', 2),)
  162 subjects have profile (('nmpr', 1), ('subs', 2))
  112 subjects have profile (('nmpr', 2), ('subs', 2))
   62 subjects have profile (('subs', 4),)
   54 subjects have profile (('nmpr', 3), ('subs', 2))
   41 subjects have profile (('nmpr', 3), ('subs', 3))
   40 subjects have profile (('nmpr', 1), ('subs', 3))
   40 subjects have profile (('nmpr', 3),)
   16 subjects have profile (('nmpr', 1), ('subs', 4))
   16 subjects have profile (('nmpr', 2), ('subs', 3))
   15 subjects have profile (('subs', 5),)
   10 subjects have profile (('nmpr', 3), ('subs', 1))
    8 subjects have profile (('nmpr', 4), ('subs', 4))
    6 subjects have profile (('nmpr', 4), ('subs', 3))
    5 subjects have profile (('nmpr', 3), ('subs', 4))
    5 subjects have profile

## Conclusion
We have **2633 cases to inspect** in the class of more than one noun.
Maybe it is possible to make big strides here as well.

As a final exercise, I will collect them in an Excel file, together with a passage indicator and the full clause in which the subject occurs.
The **40 problematic cases** identified among the zero-noun subjects will also be included.

For clarity, we'll make a fresh collection of the cases we want to list.

# Zero noun cases
Main condition: no nouns in the absolute state.

Subset: only those with a profile in which some part-of-speech occurs more than once.

# Multiple noun cases
Main condition: more than one noun in the absolute state.

Subset: no conjunction present.
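
The selection can be written as a single predicate on plain data. This is a sketch that follows the logic of the collection cell in this notebook: phrases with one absolute noun are skipped, zero-noun phrases are kept when some part-of-speech occurs more than once, and multiple-noun phrases are kept when they contain no conjunction. The helper `needsInspection`, the `(sp, st)` word pairs and the sample phrases are made up for illustration.

```python
from collections import Counter

nounCodes = {'subs', 'nmpr'}

def needsInspection(words):
    """Decide whether a subject phrase goes into the inspection list.

    words: list of (sp, st) pairs, one per word in the phrase.
    """
    nAbs = sum(1 for (sp, st) in words if sp in nounCodes and st == 'a')
    profile = Counter(sp for (sp, st) in words)
    if nAbs == 1:
        return False  # one absolute noun: not listed
    if nAbs == 0:
        # listed only if some part-of-speech occurs more than once
        return max(profile.values(), default=0) > 1
    # more than one absolute noun: listed only without a conjunction
    return 'conj' not in profile

# Hypothetical sample phrases
print(needsInspection([('prps', 'NA')]))                                 # False
print(needsInspection([('adjv', 'a'), ('conj', 'NA'), ('adjv', 'a')]))   # True
print(needsInspection([('nmpr', 'a'), ('subs', 'a')]))                   # True
print(needsInspection([('subs', 'a'), ('conj', 'NA'), ('subs', 'a')]))   # False
```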

In [68]:
inspectSubjects = collections.Counter()
indent(reset=True)

info('Collecting subject phrases for inspection...')

fields = '''book chapter verse node subject clause'''.split()
result = []

for p in F.function.s('Subj'):
    allWords = L.d(p, 'word')
    nouns = tuple(w for w in allWords if F.sp.v(w) in nounCodes and F.st.v(w) == 'a')
    nwords = len(nouns)
    if nwords == 1:
        continue
    profile = collections.Counter()
    for w in allWords: profile[F.sp.v(w)] += 1
    hasConj = 'conj' in profile
    
    if nwords == 0 and max(profile.values()) <= 1:
        continue
    if nwords > 1 and hasConj:
        continue

    clause = L.u(p, 'clause')[0]
    clauseText = T.text(L.d(clause, 'word'))
    phraseText = T.text(L.d(p, 'word'))
    (book, chapter, verse) = T.sectionFromNode(p)
    result.append((book, chapter, verse, p, phraseText, clauseText))
info('Done. Found {} cases'.format(len(result)))

  0.00s Collecting subject phrases for inspection...
  0.65s Done. Found 2777 cases


We are going to write this to disk as an Excel file.
Let us just check the first 10 results.

In [69]:
for r in result[0:10]:
    print(' '.join(str(f) for f in r))

Genesis 1 9 605233 הַמַּ֜יִם מִתַּ֤חַת הַשָּׁמַ֨יִם֙  יִקָּו֨וּ הַמַּ֜יִם מִתַּ֤חַת הַשָּׁמַ֨יִם֙ אֶל־מָקֹ֣ום אֶחָ֔ד 
Genesis 2 4 605560 יְהוָ֥ה אֱלֹהִ֖ים  עֲשֹׂ֛ות יְהוָ֥ה אֱלֹהִ֖ים אֶ֥רֶץ וְשָׁמָֽיִם׃ 
Genesis 2 5 605574 יְהוָ֤ה אֱלֹהִים֙  כִּי֩ לֹ֨א הִמְטִ֜יר יְהוָ֤ה אֱלֹהִים֙ עַל־הָאָ֔רֶץ 
Genesis 2 7 605590 יְהוָ֨ה אֱלֹהִ֜ים  וַיִּיצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים אֶת־הָֽאָדָ֗ם עָפָר֙ מִן־הָ֣אֲדָמָ֔ה 
Genesis 2 8 605603 יְהוָ֧ה אֱלֹהִ֛ים  וַיִּטַּ֞ע יְהוָ֧ה אֱלֹהִ֛ים גַּן־בְּעֵ֖דֶן מִקֶּ֑דֶם 
Genesis 2 9 605615 יְהוָ֤ה אֱלֹהִים֙  וַיַּצְמַ֞ח יְהוָ֤ה אֱלֹהִים֙ מִן־הָ֣אֲדָמָ֔ה כָּל־עֵ֛ץ 
Genesis 2 15 605674 יְהוָ֥ה אֱלֹהִ֖ים  וַיִּקַּ֛ח יְהוָ֥ה אֱלֹהִ֖ים אֶת־הָֽאָדָ֑ם 
Genesis 2 16 605684 יְהוָ֣ה אֱלֹהִ֔ים  וַיְצַו֙ יְהוָ֣ה אֱלֹהִ֔ים עַל־הָֽאָדָ֖ם 
Genesis 2 18 605703 יְהוָ֣ה אֱלֹהִ֔ים  וַיֹּ֨אמֶר֙ יְהוָ֣ה אֱלֹהִ֔ים 
Genesis 2 19 605714 יְהוָ֨ה אֱלֹהִ֜ים  וַיִּצֶר֩ יְהוָ֨ה אֱלֹהִ֜ים מִן־הָֽאֲדָמָ֗ה כָּל־חַיַּ֤ת הַשָּׂדֶה֙ וְאֵת֙ כָּל־עֹ֣וף הַשָּׁמַ֔יִם 


In [75]:
workbook = xlsxwriter.Workbook('subjectCases.xlsx', {'strings_to_urls': False})
worksheet = workbook.add_worksheet('subjects')
hebrew = workbook.add_format({'font_name': 'Ezra SIL', 'font_size': 14, 'align': 'right'})
worksheet.set_column(4, 4, 20, hebrew)
worksheet.set_column(5, 5, 100, hebrew)
for (f, field) in enumerate(fields):
    worksheet.write(0, f, field)
for (r, row) in enumerate(result):
    for (f, val) in enumerate(row):
        worksheet.write(r+1, f, val)
workbook.close()

Download the [Excel Sheet](subjectCases.xlsx)