# Word sets by accents

We define classes of words by the accents they contain and save them as sets that can be used in queries.

In [1]:
import re

In [2]:
from tf.app import use
from tf.lib import writeSets

In [3]:
A = use("bhsa:clone", hoist=globals())

We define the accents and create a regular expression out of them.

In [4]:
A_ACCENTS = set("04 24 33 63 70 71 72 73 74 93 94".split())

In [5]:
A_PAT = "|".join(A_ACCENTS)
A_RE = re.compile(f"(?:{A_PAT})")
A_RE

re.compile(r'(?:93|33|24|74|94|72|70|63|71|73|04)', re.UNICODE)
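
To see what the pattern does outside the corpus, here is a minimal standalone sketch that applies the same expression to a few transliterated words copied from the result tables in this notebook. It assumes, as the classification cell does, that an accent shows up as a two-digit code inside the transliteration. The helper name `has_a_accent` is ours and not part of Text-Fabric, and the alternatives are sorted only so that the pattern is reproducible (set order varies between runs).

```python
import re

A_ACCENTS = set("04 24 33 63 70 71 72 73 74 93 94".split())
# Sorting is our addition: it only fixes the order of the alternatives.
A_RE = re.compile("(?:" + "|".join(sorted(A_ACCENTS)) + ")")


def has_a_accent(translit):
    """Return True if the transliteration contains one of the codes in A_ACCENTS."""
    return A_RE.search(translit) is not None


# Transliterations taken from the tables in this notebook.
samples = ["R;>CI73JT", "B.@R@74>", ">;71T", ">:ELOHI92JM", ">@75REY00", "BO80HW."]
for s in samples:
    print(f"{s:<12} {has_a_accent(s)}")
```

Because the search looks at raw digit pairs, it would also match a pair that straddles two adjacent codes (for example `0335` contains `33`). Whether such sequences occur in the corpus is not checked here.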

We make two sets of words: words that contain one or more accents in `A_ACCENTS` and words that don't.

The first set we call `word_a` and the other set `word_non_a`.

We go through all words of the whole corpus.

In [6]:
wordA = set()
wordNonA = set()

A.indent(reset=True)
A.info("Classifying words")

for w in F.otype.s("word"):
    # g_word is the transliterated word form; accents appear in it as two-digit codes
    translit = F.g_word.v(w)
    if A_RE.search(translit):
        wordA.add(w)
    else:
        wordNonA.add(w)

A.info(f"word_a has {len(wordA):>6} members")
A.info(f"word_non_a has {len(wordNonA):>6} members")

  0.00s Classifying words
  0.37s word_a has 140225 members
  0.37s word_non_a has 286359 members
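
Since every word lands in exactly one branch of the `if`, the two sets should be disjoint and together cover all words. The sketch below shows that property on a toy example; the function `classify` and the node numbers in `toy` are hypothetical and only serve the illustration.

```python
import re

A_RE = re.compile("(?:04|24|33|63|70|71|72|73|74|93|94)")


def classify(translits):
    """Split node numbers into (with an A accent, without an A accent)."""
    with_a, without_a = set(), set()
    for node, translit in translits.items():
        (with_a if A_RE.search(translit) else without_a).add(node)
    return with_a, without_a


# Hypothetical node numbers mapped to transliterations seen in this notebook.
toy = {1: "R;>CI73JT", 2: "B.@R@74>", 3: ">:ELOHI92JM", 4: ">@75REY00"}
word_a, word_non_a = classify(toy)

# The two sets form a partition of the input.
assert word_a.isdisjoint(word_non_a)
assert word_a | word_non_a == set(toy)
print(sorted(word_a), sorted(word_non_a))
```

In the notebook itself the counts reported above add up to 140225 + 286359 = 426584 words.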


Collect the sets in a dictionary that assigns names to them:

In [7]:
accents = dict(
    word_a=wordA,
    word_non_a=wordNonA,
)

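Test the sets in a query. Besides membership of the set, the query requires that `g_cons` is not exactly `K` or `L` and that `trailer` contains at least one character other than `&`: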

In [8]:
query = """
book book=Genesis
  word_a
    g_cons~^(?![KL]$)
    trailer~[^&]
"""
results = A.search(query, sets=accents)
A.table(results, end=5)
A.table(results, end=5, fmt="text-trans-full")

  0.66s 9603 results


n,p,book,word
1,Genesis 1:1,Genesis,רֵאשִׁ֖ית
2,Genesis 1:1,Genesis,בָּרָ֣א
3,Genesis 1:1,Genesis,אֵ֥ת
4,Genesis 1:1,Genesis,שָּׁמַ֖יִם
5,Genesis 1:1,Genesis,אֵ֥ת


n,p,book,word
1,Genesis 1:1,Genesis,R;>CI73JT
2,Genesis 1:1,Genesis,B.@R@74>
3,Genesis 1:1,Genesis,>;71T
4,Genesis 1:1,Genesis,C.@MA73JIM
5,Genesis 1:1,Genesis,>;71T
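
The two regular expressions in the query are terse, so here is a small sketch that tries them on plain strings. It assumes that a `feature~regex` condition behaves like Python's `re.search` on the feature value, and that `&` in `trailer` stands for a maqaf. Both are our assumptions, and the helper `passes` is hypothetical.

```python
import re

# Negative lookahead: reject a value that is exactly "K" or exactly "L".
G_CONS_RE = re.compile(r"^(?![KL]$)")
# Require at least one character that is not "&".
TRAILER_RE = re.compile(r"[^&]")


def passes(g_cons, trailer):
    """Return True if both feature conditions of the query would hold."""
    return bool(G_CONS_RE.search(g_cons)) and bool(TRAILER_RE.search(trailer))


cases = [("K", " "), ("L", " "), ("KL", " "), ("BR>", " "), ("BR>", "&"), ("BR>", "")]
for g_cons, trailer in cases:
    print(f"g_cons={g_cons!r:<6} trailer={trailer!r:<4} {passes(g_cons, trailer)}")
```

Under these assumptions, a word whose trailer is empty or consists only of `&` is excluded, as are the one-letter forms `K` and `L`.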


In [9]:
query = """
book book=Genesis
  word_non_a
    g_cons~^(?![KL]$)
    trailer~[^&]
"""
results = A.search(query, sets=accents)
A.table(results, end=5)
A.table(results, end=5, fmt="text-trans-full")

  0.85s 8034 results


n,p,book,word
1,Genesis 1:1,Genesis,אֱלֹהִ֑ים
2,Genesis 1:1,Genesis,אָֽרֶץ׃
3,Genesis 1:2,Genesis,אָ֗רֶץ
4,Genesis 1:2,Genesis,בֹ֔הוּ
5,Genesis 1:2,Genesis,תְהֹ֑ום


n,p,book,word
1,Genesis 1:1,Genesis,>:ELOHI92JM
2,Genesis 1:1,Genesis,>@75REY00
3,Genesis 1:2,Genesis,>@81REY
4,Genesis 1:2,Genesis,BO80HW.
5,Genesis 1:2,Genesis,T:HO92WM


Now save the sets as a TF file in your Downloads folder (if you want it in another place,
tweak the variable `SET_DIR` below).

We use the TF helper function
[`writeSets`](https://annotation.github.io/text-fabric/tf/lib.html#tf.lib.writeSets)
to do the work.

In [10]:
SET_DIR = "~/Downloads"

writeSets(accents, f"{SET_DIR}/accents")

True
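
A saved file is only useful if it can be loaded back unchanged. The sketch below illustrates such a round trip with the standard library only, assuming a gzip-compressed pickle as the storage format. That format and the helper names `save_sets` and `load_sets` are our assumptions for illustration, not a description of what `writeSets` does internally. It also shows that a leading `~` has to be expanded explicitly when you open such a path yourself.

```python
import gzip
import os
import pickle
import tempfile


def save_sets(sets, dest):
    """Write a dict of named sets to dest (assumed format: gzipped pickle)."""
    path = os.path.expanduser(dest)  # turn a leading "~" into the home directory
    with gzip.open(path, "wb") as fh:
        pickle.dump(sets, fh)
    return path


def load_sets(source):
    """Read back a dict of named sets written by save_sets."""
    with gzip.open(os.path.expanduser(source), "rb") as fh:
        return pickle.load(fh)


# Toy data standing in for the real sets of word nodes.
toy = dict(word_a={1, 2}, word_non_a={3, 4})

with tempfile.TemporaryDirectory() as tmp:
    path = save_sets(toy, os.path.join(tmp, "accents"))
    restored = load_sets(path)

assert restored == toy
print(sorted(restored))
```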

Check:

In [11]:
!ls -l ~/Downloads/accents

-rw-r--r--@ 1 dirk  staff  730590 Jun 10 19:08 /Users/dirk/Downloads/accents


Now you can use these sets in the Text-Fabric browser by running:

```sh
text-fabric bhsa --sets=~/Downloads/accents
```

![tfbrowser](accentsScreenshot.png)