<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/etcbc.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

To explore the DSS in the Text-Fabric browser, run this on the command line:

```
text-fabric dss
```

In [1]:
import sys, os, collections
from tf.app import use

# Incantation

In [2]:
A = use('dss', hoist=globals())
silentOff()

	connecting to online GitHub repo annotation/app-dss ... connected
Using TF-app in /Users/dirk/text-fabric-data/annotation/app-dss/code:
	rv0.6=#304d66fd7eab50bbe4de8505c24d8b3eca30b1f1 (latest release)
	connecting to online GitHub repo etcbc/dss ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/dss/tf/0.6:
	rv0.6=#9b52e40a8a36391b60807357fa94343c510bdee0 (latest release)
	connecting to online GitHub repo etcbc/dss ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/dss/parallels/tf/0.6:
	rv0.6=#9b52e40a8a36391b60807357fa94343c510bdee0 (latest release)
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


# Counting

In [3]:
indent(reset=True)
info('Counting nodes ...')

i = 0
for n in N(): i += 1

info('{} nodes'.format(i))

  0.00s Counting nodes ...
  0.29s 2107863 nodes


# Node types

In [4]:
F.otype.slotType

'sign'

In [5]:
F.otype.all

('scroll', 'lex', 'fragment', 'line', 'cluster', 'word', 'sign')

In [6]:
C.levels.data

(('scroll', 1428.8121878121879, 1605868, 1606868),
 ('lex', 129.1396172248804, 1542523, 1552972),
 ('fragment', 127.90565194061885, 1531341, 1542522),
 ('line', 27.03924756593251, 1552973, 1605867),
 ('cluster', 6.678582379647672, 1430242, 1531340),
 ('word', 2.814359424744758, 1606869, 2107863),
 ('sign', 1, 1, 1430241))

In [7]:
for (typ, av, start, end) in C.levels.data:
  print(f'{end - start + 1:>7} {typ}s')

   1001 scrolls
  10450 lexs
  11182 fragments
  52895 lines
 101099 clusters
 500995 words
1430241 signs
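As a quick consistency check, the per-type counts should add up to the total number of nodes counted earlier. The sketch below copies the level tuples from the `C.levels.data` output above into a plain Python tuple, so it runs without Text-Fabric; the names `levels`, `counts` and `total` are illustrative stand-ins, not part of the TF API.

```python
# Level data copied from the C.levels.data output above:
# (node type, average number of slots per node, first node, last node)
levels = (
    ('scroll', 1428.8121878121879, 1605868, 1606868),
    ('lex', 129.1396172248804, 1542523, 1552972),
    ('fragment', 127.90565194061885, 1531341, 1542522),
    ('line', 27.03924756593251, 1552973, 1605867),
    ('cluster', 6.678582379647672, 1430242, 1531340),
    ('word', 2.814359424744758, 1606869, 2107863),
    ('sign', 1, 1, 1430241),
)

# Each type occupies one contiguous node range, so its size is end - start + 1.
counts = {typ: end - start + 1 for (typ, av, start, end) in levels}
total = sum(counts.values())

for (typ, n) in counts.items():
    print(f'{n:>7} {typ}')
print(f'{total:>7} nodes in total')
```

The total equals the 2107863 nodes found by walking `N()`, which confirms that the node ranges of the seven types cover all nodes without overlap.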


# Feature statistics

## part of speech

In [None]:
F.sp.freqList()

In [None]:
w = F.sp.s('unknown')[0]
w

In [None]:
l = L.u(w, otype='line')[0]

In [None]:
T.formats

In [None]:
A.pretty(l, withNodes=True, fmt='layout-orig-full')

## type (cluster)

In [8]:
F.type.freqList('cluster')

(('rec', 93733),
 ('vac', 3522),
 ('cor3', 1582),
 ('unc2', 906),
 ('rem2', 706),
 ('alt', 333),
 ('cor2', 147),
 ('cor', 95),
 ('rem', 75))

## type (word)

In [9]:
F.type.freqList('word')

(('glyph', 470605), ('punct', 29927), ('numr', 463))

## type (sign)

In [None]:
F.type.freqList('sign')

# Word matters

## Top 20 most frequent words

In [10]:
for (w, amount) in F.glyph.freqList('word')[0:20]:
  print(f'{amount:>5} {w}')

45393 ו
20491 ה
19378 ל
18225 ב
 6389 את
 5863 מ
 4894 אשר
 4789 יהוה
 4355 א
 4236 כול
 4185 על
 4172 אל
 3262 כי
 3091 כ
 3005 לא
 2841 כל
 2424 לוא
 1938 ארץ
 1829 ישראל
 1653 יום


## Hapaxes

In [11]:
hapaxes1 = sorted(lx for (lx, amount) in F.lex.freqList('word') if amount == 1)
len(hapaxes1)

3813

In [12]:
for lx in hapaxes1[0:20]:
  print(lx)

 #  #  #  #  # 
 #  #  #  #  #  #  #  #  # 
 #  #  #  #  # ות
 #  #  #  #  # ל #  #  # 
 #  #  #  #  # ם
 #  #  #  # ב
 #  #  #  # ה
 #  #  #  # ו # 
 #  #  #  # ך
 #  #  #  # ל #  # 
 #  #  #  # תא
 #  #  # ד
 #  #  # דב
 #  #  # דה
 #  #  # ה #  # 
 #  #  # הו
 #  #  # הם
 #  #  # ות
 #  #  # ט
 #  #  # כת


The feature `lex` contains lexemes that may include uncertain characters.

The feature `glex` has all those characters stripped.
Let's use `glex` instead.

In [13]:
hapaxes1g = sorted(lx for (lx, amount) in F.glex.freqList('word') if amount == 1)
len(hapaxes1g)

In [14]:
for lx in hapaxes1g[0:20]:
  print(lx)

100
115
126
150
32
350
536
54
61
65
66
67
71
83
92
99
 ידה
 לוט
 נַחַל
 שֵׂעָר


If we are not interested in the numerals:

In [15]:
for lx in [x for x in hapaxes1g if not x.isdigit()][0:20]:
  print(lx)

 ידה
 לוט
 נַחַל
 שֵׂעָר
אֱגֹוז
אֱלִידָד
אֱלִיעָם
אֱלִישֶׁבַע
אֲבִיטַל
אֲבִיעֶזְרִי
אֲבִיעֶזֶר
אֲבַטִּיחַ
אֲגֹורָה
אֲדַמְדַּם
אֲדָר
אֲדֹנִי
אֲדֹנִיָּה
אֲדֹרָם
אֲהָלִים
אֲחִיטוּב


### Small occurrence base

The occurrence base of a word is the set of scrolls in which the word occurs.

In [16]:
occurrenceBase = collections.defaultdict(set)

indent(reset=True)
info('compiling occurrence base ...')
for s in F.otype.s('scroll'):
  scroll = F.scroll.v(s)
  for w in L.d(s, otype='word'):
    occurrenceBase[F.glex.v(w)].add(scroll)
info('done')
info(f'{len(occurrenceBase)} entries')

  0.00s compiling occurrence base ...
  0.56s done
  0.56s 8264 entries


An overview of how many words have an occurrence base of each size:

In [17]:
occurrenceSize = collections.Counter()

for (w, scrolls) in occurrenceBase.items():
  occurrenceSize[len(scrolls)] += 1
  
occurrenceSize = sorted(
  occurrenceSize.items(),
  key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
  print(f'scrolls {size:>4} : {amount:>5} words')
print('...')
for (size, amount) in occurrenceSize[-10:]:
  print(f'scrolls {size:>4} : {amount:>5} words')

scrolls    1 :  2763 words
scrolls    2 :  1123 words
scrolls    3 :   698 words
scrolls    4 :   460 words
scrolls    5 :   336 words
scrolls    6 :   252 words
scrolls    7 :   225 words
scrolls    8 :   182 words
scrolls    9 :   175 words
scrolls   10 :   124 words
...
scrolls  459 :     1 words
scrolls  480 :     1 words
scrolls  539 :     1 words
scrolls  600 :     1 words
scrolls  605 :     1 words
scrolls  633 :     1 words
scrolls  745 :     1 words
scrolls  761 :     1 words
scrolls  846 :     1 words
scrolls  987 :     1 words


Let's call a word *private* if its occurrence base consists of a single scroll.

In [18]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)

2763
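To see the three steps (occurrence base, size overview, private words) in isolation, here is a small sketch on a made-up corpus. The dictionary `scrolls` and the word labels `w1` to `w4` are hypothetical; in the notebook the same information comes from `F.otype.s('scroll')`, `L.d` and `F.glex.v`.

```python
import collections

# Hypothetical toy corpus: scroll name -> lexemes of the words in that scroll.
scrolls = {
    'A': ['w1', 'w2', 'w3'],
    'B': ['w1', 'w2'],
    'C': ['w1', 'w4'],
}

# Occurrence base: for each word, the set of scrolls in which it occurs.
occurrence_base = collections.defaultdict(set)
for (scroll, words) in scrolls.items():
    for w in words:
        occurrence_base[w].add(scroll)

# Size overview: how many words have an occurrence base of a given size.
occurrence_size = collections.Counter(
    len(base) for base in occurrence_base.values()
)
for (size, amount) in sorted(
    occurrence_size.items(), key=lambda x: (-x[1], x[0])
):
    print(f'scrolls {size:>4} : {amount:>5} words')

# Private words: those that occur in exactly one scroll.
privates = {w for (w, base) in occurrence_base.items() if len(base) == 1}
print(sorted(privates))
```

Here `w3` and `w4` are private, `w2` occurs in two scrolls and `w1` in all three.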

### Peculiarity of scrolls

As a final exercise with scrolls, let's make a list of all scrolls and show for each of them:

* the number of distinct words (lexemes as given by `glex`)
* the number of private words
* the percentage of private words: a measure of the peculiarity of the scroll

In [19]:
scrollList = []

empty = set()
ordinary = set()

for d in F.otype.s('scroll'):
  scroll = T.scrollName(d)
  words = {F.glex.v(w) for w in L.d(d, otype='word')}
  a = len(words)
  if not a:
    empty.add(scroll)
    continue
  o = len({w for w in words if w in privates})
  if not o:
    ordinary.add(scroll)
    continue
  p = 100 * o / a
  scrollList.append((scroll, a, o, p))

scrollList = sorted(scrollList, key=lambda e: (-e[3], -e[1], e[0]))

print(f'Found {len(empty):>4} empty scrolls')
print(f'Found {len(ordinary):>4} ordinary scrolls (i.e. without private words)')

Found    0 empty scrolls
Found  517 ordinary scrolls (i.e. without private words)


In [20]:
print('{:<20}{:>5}{:>5}{:>5}\n{}'.format(
    'scroll', '#all', '#own', '%own',
    '-'*35,
))

for x in scrollList[0:20]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))
print('...')
for x in scrollList[-20:]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))

scroll               #all #own %own
-----------------------------------
4Q341                  32   20 62.5%
4Q340                  15    5 33.3%
11Q26                   7    2 28.6%
4Q124                  86   24 27.9%
1Q70bis                12    3 25.0%
4Q282d                  8    2 25.0%
4Q313a                  4    1 25.0%
4Q346a                  4    1 25.0%
1Q70                   25    6 24.0%
3Q15                  268   57 21.3%
4Q561                  74   15 20.3%
4Q559                 130   26 20.0%
4Q250b                  5    1 20.0%
4Q360a                 21    4 19.0%
KhQ1                   32    6 18.8%
4Q347                  11    2 18.2%
4Q575a                 12    2 16.7%
1Q58                    6    1 16.7%
4Q468bb                 6    1 16.7%
11Q10                 635  102 16.1%
...
4Q367                 173    1  0.6%
4Q2                   174    1  0.6%
4Q366                 186    1  0.5%
4Q98                  192    1  0.5%
4Q56                  963    5  0.5%
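The percentage and the sort order can be checked by hand on a few rows. The sketch below takes the `#all` and `#own` values of five scrolls from the table above, recomputes `%own` as `100 * own / all`, and applies the same sort key as the cell that builds `scrollList`; the variable names `rows` and `ranked` are illustrative only.

```python
# (scroll, #all, #own) copied from the table above, in arbitrary order.
rows = [
    ('4Q313a', 4, 1),
    ('4Q340', 15, 5),
    ('1Q70bis', 12, 3),
    ('4Q341', 32, 20),
    ('4Q282d', 8, 2),
]

# Add the percentage of private words to each row.
ranked = [(scroll, a, o, 100 * o / a) for (scroll, a, o) in rows]

# Highest percentage first; ties broken by more words first, then by name.
ranked = sorted(ranked, key=lambda e: (-e[3], -e[1], e[0]))

for x in ranked:
    print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))
```

The three scrolls at 25.0% come out as 1Q70bis, 4Q282d, 4Q313a, the same relative order as in the table, because among equal percentages the scroll with more distinct words ranks first.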

---

next: [search](search.ipynb)

---

full tutorial: [annotation/tutorials/dss](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/dss/start.ipynb)