<img align="right" src="images/tf.png" width="128"/>
<img align="right" src="images/etcbc.png" width="128"/>
<img align="right" src="images/dans.png" width="128"/>

To explore the DSS in the Text-Fabric browser, run:

```
text-fabric dss
```

In [None]:
import sys, os, collections
from tf.app import use

# Incantation

In [None]:
A = use('dss', hoist=globals())
silentOff()

# Counting

In [None]:
indent(reset=True)
info('Counting nodes ...')

i = 0
for n in N(): i += 1

info('{} nodes'.format(i))

# Node types

In [None]:
F.otype.slotType

In [None]:
F.otype.all

In [None]:
C.levels.data

In [None]:
for (typ, av, start, end) in C.levels.data:
  print(f'{end - start + 1:>7} {typ}s')

# Feature statistics

## part of speech

In [None]:
F.sp.freqList()

## type (cluster)

In [None]:
F.type.freqList('cluster')

## type (word)

In [None]:
F.type.freqList('word')

## type (sign)

In [None]:
F.type.freqList('sign')

# Word matters

## Top 20 frequent words

In [None]:
for (w, amount) in F.glyph.freqList('word')[0:20]:
  print(f'{amount:>5} {w}')

## Hapaxes

In [None]:
hapaxes1 = sorted(lx for (lx, amount) in F.lex.freqList('word') if amount == 1)
len(hapaxes1)

In [None]:
for lx in hapaxes1[0:20]:
  print(lx)
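The hapax selection above comes down to a frequency count filtered on the value 1. The following is a minimal sketch on made-up data, independent of Text-Fabric: the names `toy_lexemes`, `toy_freq_list` and `toy_hapaxes` are hypothetical, and the sorting only imitates a frequency list ordered by descending frequency.

```python
import collections

# Hypothetical toy data: one lexeme per word occurrence (not taken from the corpus).
toy_lexemes = ["MLK", "DBR", "MLK", "JWM", "DBR", "MLK", "BJT"]

freq = collections.Counter(toy_lexemes)

# Order by descending frequency, then by value, to imitate a frequency list.
toy_freq_list = sorted(freq.items(), key=lambda x: (-x[1], x[0]))

# Hapaxes: values that occur exactly once.
toy_hapaxes = sorted(lx for (lx, amount) in toy_freq_list if amount == 1)
print(toy_hapaxes)
```

The filter `amount == 1` is the same one used on the real frequency list; only the source of the counts differs.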

The feature `lex` contains lexemes that may include uncertain characters.

The feature `glex` has all those characters stripped.
Let's use `glex` instead.

In [None]:
hapaxes1g = sorted(lx for (lx, amount) in F.glex.freqList('word') if amount == 1)
len(hapaxes1g)

In [None]:
for lx in hapaxes1g[0:20]:
  print(lx)

If we are not interested in the numerals:

In [None]:
for lx in [x for x in hapaxes1g if not x.isdigit()][0:20]:
  print(lx)

### Small occurrence base

The occurrence base of a word is the set of scrolls in which the word occurs.

In [None]:
occurrenceBase = collections.defaultdict(set)

indent(reset=True)
info('compiling occurrence base ...')
for s in F.otype.s('scroll'):
  scroll = F.scroll.v(s)
  for w in L.d(s, otype='word'):
    occurrenceBase[F.glex.v(w)].add(scroll)
info('done')
info(f'{len(occurrenceBase)} entries')

An overview of how many words have an occurrence base of each size:

In [None]:
occurrenceSize = collections.Counter()

for (w, scrolls) in occurrenceBase.items():
  occurrenceSize[len(scrolls)] += 1
  
occurrenceSize = sorted(
  occurrenceSize.items(),
  key=lambda x: (-x[1], x[0]),
)

for (size, amount) in occurrenceSize[0:10]:
  print(f'scrolls {size:>4} : {amount:>5} words')
print('...')
for (size, amount) in occurrenceSize[-10:]:
  print(f'scrolls {size:>4} : {amount:>5} words')

Let's call a word *private* if its occurrence base is a single scroll.

In [None]:
privates = {w for (w, base) in occurrenceBase.items() if len(base) == 1}
len(privates)
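To make the notions of occurrence base and private word concrete, here is a small sketch on made-up data. It does not use Text-Fabric; `toy_scrolls`, `toy_base` and `toy_privates` are hypothetical names, and the scroll names and lexemes are invented for illustration.

```python
import collections

# Hypothetical toy data: scroll name -> lexemes of the words in that scroll.
toy_scrolls = {
    "A": ["MLK", "DBR", "JWM"],
    "B": ["MLK", "BJT"],
    "C": ["MLK", "DBR"],
}

# Occurrence base: for each lexeme, the set of scrolls in which it occurs.
toy_base = collections.defaultdict(set)
for (scroll, lexemes) in toy_scrolls.items():
    for lx in lexemes:
        toy_base[lx].add(scroll)

# Private lexemes: those whose occurrence base is a single scroll.
toy_privates = {lx for (lx, base) in toy_base.items() if len(base) == 1}
print(sorted(toy_privates))
```

Note that a private word need not be a hapax: it may occur many times, as long as all its occurrences are in the same scroll.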

### Peculiarity of scrolls

As a final exercise with scrolls, let's make a list of all scrolls, and show their

* number of distinct words (counted by `glex` value)
* number of private words
* percentage of private words: a measure of the peculiarity of the scroll

In [None]:
scrollList = []

empty = set()
ordinary = set()

for d in F.otype.s('scroll'):
  scroll = T.scrollName(d)
  words = {F.glex.v(w) for w in L.d(d, otype='word')}
  a = len(words)
  if not a:
    empty.add(scroll)
    continue
  o = len({w for w in words if w in privates})
  if not o:
    ordinary.add(scroll)
    continue
  p = 100 * o / a
  scrollList.append((scroll, a, o, p))

scrollList = sorted(scrollList, key=lambda e: (-e[3], -e[1], e[0]))

print(f'Found {len(empty):>4} empty scrolls')
print(f'Found {len(ordinary):>4} ordinary scrolls (i.e. without private words)')

In [None]:
# Header widths match the row format below, so the columns line up
print('{:<20} {:>4} {:>4} {:>5}\n{}'.format(
    'scroll', '#all', '#own', '%own',
    '-'*36,
))

for x in scrollList[0:20]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))
print('...')
for x in scrollList[-20:]:
  print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*x))
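The peculiarity measure can be checked by hand on a tiny example. This is a sketch on invented data, not on the corpus: `toy_scrolls`, `toy_privates` and `rows` are hypothetical names, and unlike the cell above it keeps scrolls without private words in the list, so that the 0% case is visible.

```python
# Hypothetical toy data: scroll name -> lexemes of the words in that scroll.
toy_scrolls = {
    "A": ["MLK", "DBR", "JWM"],
    "B": ["MLK", "BJT"],
    "C": ["MLK", "DBR"],
}
# Lexemes that occur in exactly one of the toy scrolls.
toy_privates = {"JWM", "BJT"}

rows = []
for (scroll, lexemes) in toy_scrolls.items():
    distinct = set(lexemes)            # distinct lexemes, not word tokens
    a = len(distinct)
    o = len(distinct & toy_privates)   # private lexemes in this scroll
    rows.append((scroll, a, o, 100 * o / a))

# Same ordering as above: most peculiar first, then largest, then by name.
rows.sort(key=lambda e: (-e[3], -e[1], e[0]))

for row in rows:
    print('{:<20} {:>4} {:>4} {:>4.1f}%'.format(*row))
```

Because the percentage is taken over distinct lexemes, a short scroll with a single private word can rank above a long scroll with many.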

---

next: [search](search.ipynb)

---

full tutorial: [annotation/tutorials/dss](https://nbviewer.jupyter.org/github/annotation/tutorials/blob/master/dss/start.ipynb)