# Chapters with only "frequent" words

Task: find the chapters without more than 20 rare words, where a rare word has a frequency (as lexeme) of less than 70.

A question posed by Oliver Glanz.

<img align="right" src="images/tf-small.png" width="128"/>
<img align="right" src="images/etcbc.png"/>
<img align="right" src="images/dans-small.png"/>

# Sets and queries

You can pass custom sets to the search function, as we have seen in [advanced](searchAdvanced.ipynb).
Now we want to give a real-world example of that, and also show how you can prepare sets for use
in the TF browser.

## Chapters with only "frequent" words

The following task comes from the department of education:

*Find the chapters without more than 20 rare words, where a rare word has a frequency (as lexeme) of less than 70.*

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os

In [3]:
from tf.fabric import Fabric
from tf.app import use
from tf.lib import writeSets, readSets

By way of variation, we do not use the *minimal incantation* but a *fast* incantation: we load only the features we need.

In [4]:
TF = Fabric(modules='etcbc/bhsa/tf/c')

This is Text-Fabric 7.11.1
Api reference : https://annotation.github.io/text-fabric/Api/Fabric/

114 features found and 0 ignored


In [5]:
api = TF.load('book freq_lex', silent=True)

   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


In [6]:
A = use('bhsa:clone', checkout="clone", api=api, hoist=globals())

Using TF-app in /Users/dirk/github/annotation/app-bhsa/code:
	repo clone offline under ~/github (local github)


In [7]:
FREQ = 70
AMOUNT = 20

## Query

A straightforward query is:

In [8]:
query = f'''
chapter
/without/
  word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
  < word freq_lex<{FREQ}
/-/
'''

This query has several problems:

* it is very inelegant;
* it does not perform: in practice you cannot wait for it to finish;
* the logic is wasteful: the `/without/` template that expresses what should be left out
  denotes all possible combinations of 20 infrequent words, an astronomical number.

So it is better not to search with this one.
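
To get a feeling for "astronomical", here is a rough back-of-the-envelope sketch. The figure of 100 rare words in a chapter is a made-up assumption, not a number taken from the BHSA; the point is only the order of magnitude.

```python
from math import comb

# Hypothetical number for illustration: a chapter with 100 rare words.
rareWordsInChapter = 100
templateSize = 20  # the template lists 20 rare words, each before the next

# Every match of the template is an increasing sequence of 20 distinct words,
# that is, a 20-element subset of the rare words in the chapter.
matches = comb(rareWordsInChapter, templateSize)
print(f"{matches:.2e} possible matches in this one chapter")
```

Whether a search engine has to enumerate all of them depends on its strategy, but the template as written describes that many combinations.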

In [9]:
# indent(reset=True)
# info('start query')
# results = S.search(query, limit=1)
# info('end query')
# len(results)

# By hand

With a bit of hand coding, on the other hand, it is very easy and almost instantaneous.
Like the query above, the code keeps a chapter if it contains fewer than `AMOUNT` rare words:

In [10]:
results = []
allChapters = F.otype.s('chapter')

for chapter in allChapters:
    if len([
        word for word in L.d(chapter, otype='word') if F.freq_lex.v(word) < FREQ
    ]) < AMOUNT:
        results.append(chapter)
        
print(f'{len(results)} chapters out of {len(allChapters)}')

60 chapters out of 929
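
The counting logic does not depend on TF, so it can be tried out on toy data. The sketch below uses made-up frequencies and a hypothetical helper `isOrdinary`; it also stops scanning a chapter as soon as the limit is reached, which the cell above does not do.

```python
from itertools import islice

# Toy data (made up): chapter name -> lexeme frequency of each word in it.
chapterFreqs = {
    "A": [500, 12, 900, 30, 45],
    "B": [800, 700, 650],
    "C": [10, 20, 30, 40],
}


def isOrdinary(freqs, freq, amount):
    """True if fewer than `amount` words have a lexeme frequency below `freq`."""
    rare = (f for f in freqs if f < freq)
    # Pull at most `amount` rare words: scanning stops once the limit is hit.
    return sum(1 for _ in islice(rare, amount)) < amount


ordinary = [ch for (ch, freqs) in chapterFreqs.items() if isOrdinary(freqs, 70, 3)]
print(ordinary)
```

With a limit of 3, chapter `A` (exactly 3 rare words) is not ordinary: the comparison is strict, just as in the cell above.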


In [11]:
resultsByBook = dict()

for chapter in results:
    (bk, ch) = T.sectionFromNode(chapter)
    resultsByBook.setdefault(bk, []).append(ch)
    
for (bk, chps) in resultsByBook.items():
    print('{} {}'.format(bk, ', '.join(str(c) for c in chps)))

Exodus 11, 24
Leviticus 17
Deuteronomy 30
Joshua 23
Isaiah 12, 39
Jeremiah 45
Ezekiel 15
Hosea 3
Joel 3
Psalms 1, 3, 4, 13, 14, 15, 20, 23, 24, 26, 43, 47, 53, 54, 61, 67, 70, 82, 86, 87, 93, 97, 99, 100, 101, 110, 113, 114, 115, 117, 120, 121, 122, 123, 124, 125, 126, 127, 128, 130, 131, 133, 134, 136, 138, 150
Job 25
Esther 10
2_Chronicles 27


# Custom sets

Once you have these chapters, you can put them in a set and use them in queries.

We call the chapters found above "ordinary" chapters,
and we show how to restrict query results to those that occur in an ordinary chapter.

First we search for a phenomenon in all chapters. The phenomenon is a clause with a subject consisting of a single noun in
the plural and a verb in the singular.

In [12]:
sets = dict(ochapter=set(results))

In [13]:
query1 = '''
verse
  clause
    phrase function=Pred
      word pdp=verb nu=sg
    phrase function=Subj
      =: word pdp=subs nu=pl
      :=
'''

In [14]:
results1 = A.search(query1)

  1.87s 263 results


In [15]:
A.table(results1, start=1, end=10)

n,p,verse,clause,phrase,word,phrase.1,word.1
1,Genesis 1:1,Genesis 1:1בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃,בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃,בָּרָ֣א,בָּרָ֣א,אֱלֹהִ֑ים,אֱלֹהִ֑ים
2,Genesis 1:3,Genesis 1:3וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י אֹ֑ור וַֽיְהִי־אֹֽור׃,וַיֹּ֥אמֶר אֱלֹהִ֖ים,יֹּ֥אמֶר,יֹּ֥אמֶר,אֱלֹהִ֖ים,אֱלֹהִ֖ים
3,Genesis 1:4,Genesis 1:4וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃,וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור,יַּ֧רְא,יַּ֧רְא,אֱלֹהִ֛ים,אֱלֹהִ֛ים
4,Genesis 1:4,Genesis 1:4וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָאֹ֖ור כִּי־טֹ֑וב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃,וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָאֹ֖ור וּבֵ֥ין הַחֹֽשֶׁךְ׃,יַּבְדֵּ֣ל,יַּבְדֵּ֣ל,אֱלֹהִ֔ים,אֱלֹהִ֔ים
5,Genesis 1:5,Genesis 1:5וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום אֶחָֽד׃ פ,וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לָאֹור֙ יֹ֔ום,יִּקְרָ֨א,יִּקְרָ֨א,אֱלֹהִ֤ים׀,אֱלֹהִ֤ים׀
6,Genesis 1:6,Genesis 1:6וַיֹּ֣אמֶר אֱלֹהִ֔ים יְהִ֥י רָקִ֖יעַ בְּתֹ֣וךְ הַמָּ֑יִם וִיהִ֣י מַבְדִּ֔יל בֵּ֥ין מַ֖יִם לָמָֽיִם׃,וַיֹּ֣אמֶר אֱלֹהִ֔ים,יֹּ֣אמֶר,יֹּ֣אמֶר,אֱלֹהִ֔ים,אֱלֹהִ֔ים
7,Genesis 1:7,Genesis 1:7וַיַּ֣עַשׂ אֱלֹהִים֮ אֶת־הָרָקִיעַ֒ וַיַּבְדֵּ֗ל בֵּ֤ין הַמַּ֨יִם֙ אֲשֶׁר֙ מִתַּ֣חַת לָרָקִ֔יעַ וּבֵ֣ין הַמַּ֔יִם אֲשֶׁ֖ר מֵעַ֣ל לָרָקִ֑יעַ וַֽיְהִי־כֵֽן׃,וַיַּ֣עַשׂ אֱלֹהִים֮ אֶת־הָרָקִיעַ֒,יַּ֣עַשׂ,יַּ֣עַשׂ,אֱלֹהִים֮,אֱלֹהִים֮
8,Genesis 1:8,Genesis 1:8וַיִּקְרָ֧א אֱלֹהִ֛ים לָֽרָקִ֖יעַ שָׁמָ֑יִם וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר יֹ֥ום שֵׁנִֽי׃ פ,וַיִּקְרָ֧א אֱלֹהִ֛ים לָֽרָקִ֖יעַ שָׁמָ֑יִם,יִּקְרָ֧א,יִּקְרָ֧א,אֱלֹהִ֛ים,אֱלֹהִ֛ים
9,Genesis 1:9,Genesis 1:9וַיֹּ֣אמֶר אֱלֹהִ֗ים יִקָּו֨וּ הַמַּ֜יִם מִתַּ֤חַת הַשָּׁמַ֨יִם֙ אֶל־מָקֹ֣ום אֶחָ֔ד וְתֵרָאֶ֖ה הַיַּבָּשָׁ֑ה וַֽיְהִי־כֵֽן׃,וַיֹּ֣אמֶר אֱלֹהִ֗ים,יֹּ֣אמֶר,יֹּ֣אמֶר,אֱלֹהִ֗ים,אֱלֹהִ֗ים
10,Genesis 1:10,Genesis 1:10וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לַיַּבָּשָׁה֙ אֶ֔רֶץ וּלְמִקְוֵ֥ה הַמַּ֖יִם קָרָ֣א יַמִּ֑ים וַיַּ֥רְא אֱלֹהִ֖ים כִּי־טֹֽוב׃,וַיִּקְרָ֨א אֱלֹהִ֤ים׀ לַיַּבָּשָׁה֙ אֶ֔רֶץ,יִּקְרָ֨א,יִּקְרָ֨א,אֱלֹהִ֤ים׀,אֱלֹהִ֤ים׀


Now we want to restrict results to ordinary chapters:

In [16]:
query2 = '''
ochapter
  verse
    clause
      phrase function=Pred
        word pdp=verb nu=sg
      phrase function=Subj
        =: word pdp=subs nu=pl
        :=
'''

Note that we use the name of a set here: `ochapter`.
It is not a known node type in the BHSA, so we have to tell the search function what it means.
We do that by passing a dictionary of custom sets.
The keys are the names of the sets; the values are the sets themselves, i.e. sets of nodes.

Then we may use those keys in queries, wherever a node type is expected.

In [17]:
results2 = A.search(query2, sets=sets)

  1.74s 7 results


In [18]:
A.table(results2)

n,p,chapter,verse,clause,phrase,word,phrase.1,word.1
1,Psalms 47:6,Psalms 47,Psalms 47:6עָלָ֣ה אֱ֭לֹהִים בִּתְרוּעָ֑ה יְ֝הֹוָ֗ה בְּקֹ֣ול שֹׁופָֽר׃,עָלָ֣ה אֱ֭לֹהִים בִּתְרוּעָ֑ה,עָלָ֣ה,עָלָ֣ה,אֱ֭לֹהִים,אֱ֭לֹהִים
2,Psalms 47:9,Psalms 47,Psalms 47:9מָלַ֣ךְ אֱ֭לֹהִים עַל־גֹּויִ֑ם אֱ֝לֹהִ֗ים יָשַׁ֤ב׀ עַל־כִּסֵּ֬א קָדְשֹֽׁו׃,מָלַ֣ךְ אֱ֭לֹהִים עַל־גֹּויִ֑ם,מָלַ֣ךְ,מָלַ֣ךְ,אֱ֭לֹהִים,אֱ֭לֹהִים
3,Psalms 47:9,Psalms 47,Psalms 47:9מָלַ֣ךְ אֱ֭לֹהִים עַל־גֹּויִ֑ם אֱ֝לֹהִ֗ים יָשַׁ֤ב׀ עַל־כִּסֵּ֬א קָדְשֹֽׁו׃,אֱ֝לֹהִ֗ים יָשַׁ֤ב׀ עַל־כִּסֵּ֬א קָדְשֹֽׁו׃,יָשַׁ֤ב׀,יָשַׁ֤ב׀,אֱ֝לֹהִ֗ים,אֱ֝לֹהִ֗ים
4,Psalms 53:3,Psalms 53,Psalms 53:3אֱֽלֹהִ֗ים מִשָּׁמַיִם֮ הִשְׁקִ֪יף עַֽל־בְּנֵ֫י אָדָ֥ם לִ֭רְאֹות הֲיֵ֣שׁ מַשְׂכִּ֑יל דֹּ֝רֵ֗שׁ אֶת־אֱלֹהִֽים׃,אֱֽלֹהִ֗ים מִשָּׁמַיִם֮ הִשְׁקִ֪יף עַֽל־בְּנֵ֫י אָדָ֥ם,הִשְׁקִ֪יף,הִשְׁקִ֪יף,אֱֽלֹהִ֗ים,אֱֽלֹהִ֗ים
5,Psalms 53:6,Psalms 53,Psalms 53:6שָׁ֤ם׀ פָּֽחֲדוּ־פַחַד֮ לֹא־הָ֪יָה֫ פָ֥חַד כִּֽי־אֱלֹהִ֗ים פִּ֭זַּר עַצְמֹ֣ות חֹנָ֑ךְ הֱ֝בִשֹׁ֗תָה כִּֽי־אֱלֹהִ֥ים מְאָסָֽם׃,כִּֽי־אֱלֹהִ֗ים פִּ֭זַּר עַצְמֹ֣ות חֹנָ֑ךְ,פִּ֭זַּר,פִּ֭זַּר,אֱלֹהִ֗ים,אֱלֹהִ֗ים
6,Psalms 70:5,Psalms 70,Psalms 70:5יָ֘שִׂ֤ישׂוּ וְיִשְׂמְח֨וּ׀ בְּךָ֗ כָּֽל־מְבַ֫קְשֶׁ֥יךָ וְיֹאמְר֣וּ תָ֭מִיד יִגְדַּ֣ל אֱלֹהִ֑ים אֹ֝הֲבֵ֗י יְשׁוּעָתֶֽךָ׃,יִגְדַּ֣ל אֱלֹהִ֑ים,יִגְדַּ֣ל,יִגְדַּ֣ל,אֱלֹהִ֑ים,אֱלֹהִ֑ים
7,2_Chronicles 27:6,2_Chronicles 27,2_Chronicles 27:6וַיִּתְחַזֵּ֖ק יֹותָ֑ם כִּ֚י הֵכִ֣ין דְּרָכָ֔יו לִפְנֵ֖י יְהוָ֥ה אֱלֹהָֽיו׃,כִּ֚י הֵכִ֣ין דְּרָכָ֔יו לִפְנֵ֖י יְהוָ֥ה אֱלֹהָֽיו׃,הֵכִ֣ין,הֵכִ֣ין,דְּרָכָ֔יו,דְּרָכָ֔יו


## Custom sets in the browser

We save the sets in a file.
But before we do so, we add two more sets: the ordinary verses (verses without any rare word)
and the ordinary words (words whose lexeme frequency is at least `FREQ`).

In [19]:
queryV = f'''
verse
/without/
  word freq_lex<{FREQ}
/-/
'''
resultsV = A.search(queryV, shallow=True)
sets['overse'] = resultsV

  0.56s 2757 results


In [20]:
sets['oword'] = {w for w in F.otype.s('word') if F.freq_lex.v(w) >= FREQ}

In [21]:
SETS_FILE = os.path.expanduser('~/Downloads/ordinary.set')
writeSets(sets, SETS_FILE)

True

As a test, we read back the sets from disk and compare the number of
elements with those in the original sets, which we still have in memory.

In [22]:
testSets = readSets(SETS_FILE)
for s in sorted(testSets):
    elems = len(testSets[s])
    oelems = len(sets[s])
    print(f'{s} with {elems} nb {elems - oelems}')

ochapter with 60 nb 0
overse with 2757 nb 0
oword with 361485 nb 0
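
Comparing sizes would not notice a set that came back with the right number of elements but different members. A stricter check compares the sets themselves. The sketch below uses a hypothetical helper `compareSets` and made-up toy data; it assumes that both arguments are dictionaries of sets.

```python
def compareSets(original, restored):
    """Return {name: (missing, extra)} for every set that differs.

    `missing` holds elements lost in the round trip,
    `extra` holds elements that were not in the original.
    An empty result means a faithful round trip.
    """
    report = {}
    for name in sorted(set(original) | set(restored)):
        before = set(original.get(name, ()))
        after = set(restored.get(name, ()))
        if before != after:
            report[name] = (before - after, after - before)
    return report


# Toy data (made up): same sizes, but one element differs in `overse`.
original = {"ochapter": {1, 2}, "overse": {10, 11}}
restored = {"ochapter": {1, 2}, "overse": {10, 12}}
print(compareSets(original, restored))
```

In this notebook the call would be `compareSets(sets, testSets)`, and an empty dictionary is the hoped-for outcome.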


Now you can start your TF browser as follows:

```sh
text-fabric bhsa --sets=~/Downloads/ordinary.set
```

and then you can run the same queries over there!

# Appendix: investigation

Let's investigate how the number of ordinary chapters changes when we shift the definition of ordinary.

In [23]:
allChapters = F.otype.s('chapter')
longestChapter = max(len(L.d(chapter, otype='word')) for chapter in allChapters)

print(f'There are {len(allChapters)} chapters, the longest is {longestChapter} words')

There are 929 chapters, the longest is 1603 words


In [24]:
def getOrdinary(freq, amount):
    results = []

    for chapter in allChapters:
        if len([
            word for word in L.d(chapter, otype='word') if F.freq_lex.v(word) < freq
        ]) < amount:
            results.append(chapter)
    return results

In [25]:
def overview(freq):
    for amount in range(20, 1700, 50):
        results = getOrdinary(freq, amount)
        print(f'for freq={freq:>3} and amount={amount:>4}: {len(results):>4} ordinary chapters')
        if len(results) >= len(allChapters):
            break

In [26]:
for freq in (40, 70, 100):
    overview(freq)

for freq= 40 and amount=  20:  140 ordinary chapters
for freq= 40 and amount=  70:  758 ordinary chapters
for freq= 40 and amount= 120:  885 ordinary chapters
for freq= 40 and amount= 170:  908 ordinary chapters
for freq= 40 and amount= 220:  919 ordinary chapters
for freq= 40 and amount= 270:  923 ordinary chapters
for freq= 40 and amount= 320:  924 ordinary chapters
for freq= 40 and amount= 370:  925 ordinary chapters
for freq= 40 and amount= 420:  926 ordinary chapters
for freq= 40 and amount= 470:  928 ordinary chapters
for freq= 40 and amount= 520:  929 ordinary chapters
for freq= 70 and amount=  20:   60 ordinary chapters
for freq= 70 and amount=  70:  551 ordinary chapters
for freq= 70 and amount= 120:  842 ordinary chapters
for freq= 70 and amount= 170:  890 ordinary chapters
for freq= 70 and amount= 220:  915 ordinary chapters
for freq= 70 and amount= 270:  922 ordinary chapters
for freq= 70 and amount= 320:  923 ordinary chapters
for freq= 70 and amount= 370:  923 ordinary ch
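
A possible refinement, sketched here on made-up numbers: the number of rare words in a chapter depends only on `freq`, so it could be counted once per `freq` and then reused for every `amount`. The list `rareCounts` and the helper `nOrdinary` are hypothetical; in the notebook the counts would have to be computed from `L.d` and `F.freq_lex`, as `getOrdinary` does.

```python
from bisect import bisect_left

# Toy data (made up): number of rare words in each chapter, for one `freq`.
rareCounts = [0, 3, 3, 7, 12, 25, 40]

# Sort once; every threshold lookup is then a binary search.
sortedCounts = sorted(rareCounts)


def nOrdinary(amount):
    """Number of chapters with fewer than `amount` rare words."""
    return bisect_left(sortedCounts, amount)


for amount in (1, 4, 20, 50):
    print(f"amount={amount:>2}: {nOrdinary(amount)} ordinary chapters")
```

`bisect_left` returns the position of the first count that is not smaller than `amount`, which equals the number of chapters that stay strictly below the limit.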

# Next

You have seen how to mingle sets with queries.

Time to enter the race for space:
[relations](searchRelations.ipynb)

---

[basic](search.ipynb)
[advanced](searchAdvanced.ipynb)
sets
[relations](searchRelations.ipynb)
[quantifiers](searchQuantifiers.ipynb)
[rough](searchRough.ipynb)
[gaps](searchGaps.ipynb)