# Word sets by accents

We divide words into classes according to the accents they contain, and save those classes as sets that can be used in queries.

In [8]:
import re

In [9]:
from tf.app import use
from tf.lib import writeSets

In [3]:
A = use('bhsa', hoist=globals(), silent='deep')

We define the accents by their two-digit codes in the transliteration, and build a regular expression that matches any one of them.

In [4]:
A_ACCENTS = set('04 24 33 63 70 71 72 73 74 93 94'.split())

In [5]:
A_PAT = '|'.join(A_ACCENTS)
A_RE = re.compile(f'(?:{A_PAT})')
A_RE

re.compile(r'(?:24|72|70|04|71|33|63|74|73|93|94)', re.UNICODE)
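To see what the pattern does, here is a small sketch that applies it to a few transliterated words taken from the query results shown later in this notebook. The `samples` list and the `has_accent` dictionary are additions for illustration only, and `sorted` is used merely to make the order of the alternatives reproducible; it does not change what matches.

```python
import re

A_ACCENTS = set('04 24 33 63 70 71 72 73 74 93 94'.split())
# sorted() only fixes the order of the alternatives; the matches stay the same
A_RE = re.compile('(?:' + '|'.join(sorted(A_ACCENTS)) + ')')

# transliterations of a few words from Genesis 1:1-2 (illustrative sample)
samples = ['R;>CI73JT', 'B.@R@74>', '>;71T', '>:ELOHI92JM', '>@75REY00', 'BO80HW.']

# True if the word contains at least one of the accent codes
has_accent = {s: bool(A_RE.search(s)) for s in samples}

for (s, hit) in has_accent.items():
    print(f'{s:<12} {"word_a" if hit else "word_non_a"}')
```

Words carrying `73`, `74` or `71` match the pattern; words whose only numeric codes are `92`, `75`, `00` or `80` do not.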

We make two sets of words: words that contain one or more accents in `A_ACCENTS` and words that don't.

The first set we call `word_a` and the other `word_non_a`; in the code they are held in the variables `wordA` and `wordNonA`.

We go through all words of the whole corpus.

In [7]:
wordA = set()
wordNonA = set()

silentOff()
indent(reset=True)
info('Classifying words')

for w in F.otype.s('word'):
  translit = F.g_word.v(w)
  if A_RE.search(translit):
    wordA.add(w)
  else:
    wordNonA.add(w)
    
info(f'word_a has {len(wordA):>6} members')
info(f'word_non_a has {len(wordNonA):>6} members')

  0.00s Classifying words
  0.38s word_a has 140225 members
  0.39s word_non_a has 286359 members
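Because every word lands in exactly one of the two sets, they should be disjoint and together cover all words; the two counts above add up to 426584. The sketch below shows that sanity check on a toy example. The function `classify` and the node numbers in `toy` are hypothetical stand-ins for the loop over `F.otype.s('word')`, so that the snippet runs without loading the corpus.

```python
import re

A_RE = re.compile('(?:04|24|33|63|70|71|72|73|74|93|94)')

def classify(translits, pattern):
    """Split node numbers into (matching, non-matching) sets."""
    with_a, without_a = set(), set()
    for (node, translit) in translits.items():
        (with_a if pattern.search(translit) else without_a).add(node)
    return (with_a, without_a)

# hypothetical node numbers mapped to transliterations
toy = {1: 'R;>CI73JT', 2: 'B.@R@74>', 3: '>:ELOHI92JM', 4: '>;71T', 5: '>@75REY00'}
(toy_a, toy_non_a) = classify(toy, A_RE)

assert toy_a.isdisjoint(toy_non_a)       # no word is in both sets
assert toy_a | toy_non_a == set(toy)     # every word is in one of them
print(sorted(toy_a), sorted(toy_non_a))
```

In the notebook itself the same check would amount to `wordA.isdisjoint(wordNonA)` and comparing `len(wordA) + len(wordNonA)` with the number of word nodes.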


Collect the sets in a dictionary that assigns names to them:

In [14]:
accents = dict(
    word_a=wordA,
    word_non_a=wordNonA,
)

Test the set in a query:

In [18]:
query = '''
book book=Genesis
  word_a
    g_cons~^(?![KL]$)
    trailer~[^&]  
'''
results = A.search(query, sets=accents)
A.table(results, end=5)
A.table(results, end=5, fmt='text-trans-full')

  0.51s 9603 results


n,p,book,word
1,Genesis 1:1,,רֵאשִׁ֖ית
2,Genesis 1:1,,בָּרָ֣א
3,Genesis 1:1,,אֵ֥ת
4,Genesis 1:1,,שָּׁמַ֖יִם
5,Genesis 1:1,,אֵ֥ת


n,p,book,word
1,Genesis 1:1,,R;>CI73JT
2,Genesis 1:1,,B.@R@74>
3,Genesis 1:1,,>;71T
4,Genesis 1:1,,C.@MA73JIM
5,Genesis 1:1,,>;71T


In [19]:
query = '''
book book=Genesis
  word_non_a
    g_cons~^(?![KL]$)
    trailer~[^&]  
'''
results = A.search(query, sets=accents)
A.table(results, end=5)
A.table(results, end=5, fmt='text-trans-full')

  0.80s 8034 results


n,p,book,word
1,Genesis 1:1,,אֱלֹהִ֑ים
2,Genesis 1:1,,אָֽרֶץ׃
3,Genesis 1:2,,אָ֗רֶץ
4,Genesis 1:2,,בֹ֔הוּ
5,Genesis 1:2,,תְהֹ֑ום


n,p,book,word
1,Genesis 1:1,,>:ELOHI92JM
2,Genesis 1:1,,>@75REY00
3,Genesis 1:2,,>@81REY
4,Genesis 1:2,,BO80HW.
5,Genesis 1:2,,T:HO92WM


Now save the sets as a TF file in your Downloads folder (if you want it in another place,
tweak the variable `SET_DIR` below).

We use the TF helper function
[`writeSets`](https://annotation.github.io/text-fabric/Api/Lib/#sets)
to do the work.

In [20]:
SET_DIR = '~/Downloads'

writeSets(
  dict(
    word_a=wordA,
    word_non_a=wordNonA,
  ),
  f'{SET_DIR}/accents',
)

True
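To give an idea of what saving and reloading named sets involves, here is a standard-library sketch of a round trip. It assumes that the set file is a gzip-compressed pickle of the dictionary; that is an assumption about the file format, so check the TF documentation before relying on it. The helpers `save_sets` and `load_sets` are hypothetical and are not TF functions; in practice you would read the file back with `readSets` from `tf.lib`.

```python
import gzip
import os
import pickle
import tempfile

def save_sets(sets, dest):
    """Hypothetical stand-in for writeSets: gzip-compressed pickle (assumed format)."""
    path = os.path.expanduser(dest)  # resolves a leading ~ to the home directory
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, 'wb') as fh:
        pickle.dump(sets, fh)
    return path

def load_sets(source):
    """Hypothetical counterpart that reads the dictionary of sets back."""
    with gzip.open(os.path.expanduser(source), 'rb') as fh:
        return pickle.load(fh)

# toy sets with made-up node numbers, written to a temporary directory
toy_sets = dict(word_a={1, 2, 4}, word_non_a={3, 5})
with tempfile.TemporaryDirectory() as tmp:
    path = save_sets(toy_sets, os.path.join(tmp, 'accents'))
    restored = load_sets(path)

assert restored == toy_sets  # names and members survive the round trip
```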

Check:

In [11]:
!ls -l ~/Downloads/accents

-rw-r--r--  1 dirk  staff  730590 Jul 21 10:22 /Users/dirk/Downloads/accents


Now you can use these sets in the Text-Fabric browser by saying:

```sh
text-fabric bhsa --sets=~/Downloads/accents
```

![tfbrowser](accentsScreenshot.png)