# Getting started

It is assumed that you have read
[start](start.ipynb)
and followed the installation instructions there.

# Corpus

This is:

* `dss` Dead Sea Scrolls

# First acquaintance

We just want to grasp what the corpus is about and how we can find our way in the data.

Open a terminal or command prompt and run the following command:

```text-fabric dss```

Wait and see a lot happening before your browser starts up and shows you an interface on the corpus:

Text-Fabric needs an app to deal with the corpus-specific things.
It downloads/finds/caches the latest version of the **app**:

```
Using TF-app in /Users/dirk/text-fabric-data/annotation/app-dss/code:
	rv0.6=#304d66fd7eab50bbe4de8505c24d8b3eca30b1f1 (latest release)
```

It downloads/finds/caches the latest version of the **data**:

```
Using data in /Users/dirk/text-fabric-data/etcbc/dss/tf/0.6:
	rv0.6=#9b52e40a8a36391b60807357fa94343c510bdee0 (latest release)
```

The data is preprocessed in order to speed up typical Text-Fabric operations.
The result is cached on your computer.
Preprocessing costs time. Next time you use this corpus on this machine, the startup time is much quicker.

```
TF setup done.
```

Then the app acts as a local web server for the corpus that has just been downloaded,
opens your browser for you, and loads the corpus page:

```
 * Running on http://localhost:8107/ (Press CTRL+C to quit)
Opening dss in browser
Listening at port 18987
```

<img src="images/dss-bare.png" width="600">

# Help!

Indeed, that is what you need. Click the vertical `Help` tab.

From there, click around a little bit. Don't read closely, just note the kinds of information that are presented to you.

Later on, it will make more sense!

# Browsing

First we browse our data. Click the browse button.

<img src="images/dss-browse.png" width="800">

and then, in the table of *documents* (scrolls), click on a fragment of scroll `1QSb`: 

<img src="images/dss-documents.png" width="200">


Now you're looking at a fragment of a scroll: the writing in Hebrew characters without vowel signs.

<img src="images/dss-fragment.png" width="800">

Now click the *Options* tab and select the `layout-orig-unicode` format to see the same fragment in a layout that indicates the status
of the pieces of writing.

<img src="images/dss-layout.png" width="1000">

You can click a triangle to see how a line is broken down:

<img src="images/dss-drill.png" width="800">

# Searching

This corpus pays a lot of attention to the uncertainty of signs and to whether they have been corrected, either in antiquity or
in more modern times.

Also, the corpus is marked up with part-of-speech for each word.

So we can, for example, search for *verbs* that have an uncertain or corrected or removed consonant in them.

```
word sp=verb
  sign type=cons
  /with/
  .. unc=1|2|3|4
  /or/
  .. cor=1|2|3
  /or/
  .. rem=1|2
  /-/
```

<img src="images/dss-search.png" width="1200">

In English:

search all `word`s with feature `sp` having value `verb` that contain a `sign` with feature `type`
having value `cons` (consonant) where at least one of the following holds for
that sign:

* the feature `unc` has value `1` or `2` or `3`  or `4`
* the feature `cor` has value `1` or `2` or `3`
* the feature `rem` has value `1` or `2`
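
The conditions above can also be written as ordinary Python. The sketch below is not how Text-Fabric evaluates the query; it only mirrors the logic on plain dictionaries. The example words and the value `subs` are made up for illustration, and the feature values are assumed to be integers.

```python
def sign_matches(sign):
    """Mirror the query conditions for a single sign (a plain dict here)."""
    if sign.get("type") != "cons":
        return False
    return (
        sign.get("unc") in {1, 2, 3, 4}
        or sign.get("cor") in {1, 2, 3}
        or sign.get("rem") in {1, 2}
    )


def word_matches(word):
    """A word matches if it is a verb and at least one of its signs matches."""
    return word.get("sp") == "verb" and any(sign_matches(s) for s in word["signs"])


# Hypothetical example words, made up for illustration only.
certain_verb = {"sp": "verb", "signs": [{"type": "cons"}, {"type": "cons"}]}
uncertain_verb = {"sp": "verb", "signs": [{"type": "cons", "unc": 2}, {"type": "cons"}]}
uncertain_noun = {"sp": "subs", "signs": [{"type": "cons", "cor": 1}]}

for example in (certain_verb, uncertain_verb, uncertain_noun):
    print(word_matches(example))
```

Only the second word satisfies both the part-of-speech condition and the sign condition. Note that the real search yields one result per matching word and sign pair, so a verb with several uncertain signs shows up several times.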


You can expand results by clicking the triangle. 

You can see the result in context by clicking the browse icon.

You can go back to the result list by clicking the results icon.

<img src="images/dss-back.png" width="1000">

# Computing

This triggers another question.

How is uncertainty distributed over the verbs?
I.e. how many verbs have how many uncertain/corrected/removed signs?

*This is a typical question where you want to leave the search mode and enter computing mode*.

Let's find out.

Extra information:

* we will filter out *reconstructed* signs from the equation;
* the features `unc`, `cor`, `rem` have values that indicate the kind of uncertainty, correction, removal.
  We just use those values as the seriousness of the uncertainty.
  Essentially, we just sum up all values of these features for each sign.

Open your terminal and say

``` sh
jupyter notebook
```

Your browser starts up and presents you with a local computing environment where you can run Python programs.

You see cells like the one below, where you can type programming statements and execute them by pressing `Shift Enter`.

First we load the Text-Fabric module, as follows:

In [1]:
from tf.app import use

Now we load the TF-app for the corpus `dss` and that app loads the corpus data.

We give a name to the result of all that loading: `A`.

In [2]:
A = use('dss', hoist=globals())

	connecting to online GitHub repo annotation/app-dss ... connected
Using TF-app in /Users/dirk/text-fabric-data/annotation/app-dss/code:
	rv0.6=#304d66fd7eab50bbe4de8505c24d8b3eca30b1f1 (latest release)
	connecting to online GitHub repo etcbc/dss ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/dss/tf/0.6:
	rv0.6=#9b52e40a8a36391b60807357fa94343c510bdee0 (latest release)
	connecting to online GitHub repo etcbc/dss ... connected
Using data in /Users/dirk/text-fabric-data/etcbc/dss/parallels/tf/0.6:
	rv0.6=#9b52e40a8a36391b60807357fa94343c510bdee0 (latest release)
   |     0.00s No structure info in otext, the structure part of the T-API cannot be used


Some bits are familiar from above, when you ran the `text-fabric` command in the terminal.

Other bits are links to the documentation; they point to the same places as the links in the Text-Fabric browser.

You see a list of all the data features that have been loaded.

And a list of references to the API documentation, which tells you how you can use this data in your program statements.

# Searching (revisited)

We do the same search again, but now inside our program.

That means that we can capture the results in a list for further processing. 

In [3]:
results = A.search('''
word sp=verb
  sign type=cons
  /with/
  .. unc=1|2|3|4
  /or/
  .. cor=1|2|3
  /or/
  .. rem=1|2
  /-/
''')

  2.81s 18236 results


In less than three seconds, we have all the results!

Let's look at the first one:

In [4]:
results[0]

(1607456, 1742)

Each result is a tuple of node numbers, one each for a

1. word
1. sign

Here is the second one:

In [5]:
results[1]

(1607456, 1743)

And here the last one:

In [6]:
results[-1]

(2107831, 1430148)

Now we are only interested in the words that we have encountered.
We collect them in a set:

In [7]:
verbs = {result[0] for result in results}
len(verbs)

11663
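
There are fewer verbs than results because a verb with several matching signs occurs in several result tuples. A small sketch using only the three result tuples shown above illustrates the deduplication; the variable names are chosen for this example.

```python
# The three result tuples shown above: (word node, sign node).
sample_results = [
    (1607456, 1742),
    (1607456, 1743),
    (2107831, 1430148),
]

# A set keeps each word node once, however many of its signs matched.
sample_verbs = {word for (word, sign) in sample_results}

print(len(sample_results))  # 3 results
print(len(sample_verbs))    # 2 distinct words
```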

We have nearly twelve thousand verbs with uncertain signs in them.

Now we want to find out something for each result verb: what is the accumulated uncertainty of that verb?
Some verbs have more consonants than others, so we divide by the number of consonants.
And in the whole calculation we leave out reconstructed signs.

We define a function that collects the uncertainty of a single sign:

In [11]:
def getUncertainty(sign):
  return sum((
    (F.unc.v(sign) or 0),
    (F.cor.v(sign) or 0),
    (F.rem.v(sign) or 0),
  ))

Let's see what this gives for the sign in the 1000th result:

In [13]:
sign = results[999][1]
print(sign)
unc  =  getUncertainty(sign)
print(unc)

97708
4


Now we define a function that gives us the uncertainty of a word.
We collect the consonants of the word and leave out the reconstructed ones.
If all were reconstructed, we assign the value 1000 to the word.
Otherwise, we sum the uncertainty of the non-reconstructed consonants and divide it by the number of
non-reconstructed consonants.

In [14]:
def uncertainty(word):
  signs = L.d(word, otype='sign')  # go a Level down to signs and collect them in a list
  nonrec = [sign for sign in signs if F.type.v(sign) == 'cons' and not F.rec.v(sign)]
  if not nonrec:
    return 1000
  return sum(getUncertainty(sign) for sign in nonrec) / len(nonrec)

We compute the uncertainty of the word in the 1000th result.
Here is that word:

In [17]:
A.pretty(results[999][0])

Now the computation:

In [18]:
word = results[999][0]
print(word)
unc  =  uncertainty(word)
print(unc)

1639914
1.0


Right: 4 consonants, one of which has uncertainty 4, so `4/4 = 1`.
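
The arithmetic can be checked without the corpus. The sketch below repeats the same calculation on made-up sign data: the dictionary `SIGNS`, the sign numbers, and the type value `sep` are assumptions for illustration, and a missing key plays the role of a feature without a value.

```python
# Hypothetical sign data standing in for the corpus features.
# Keys are made-up sign numbers; a missing key means "no value".
SIGNS = {
    1: {"type": "cons", "unc": 4},
    2: {"type": "cons"},
    3: {"type": "cons"},
    4: {"type": "cons"},
    5: {"type": "cons", "rec": 1, "unc": 2},  # reconstructed: left out
    6: {"type": "sep"},                       # not a consonant: left out
}


def sign_uncertainty(sign):
    """Sum unc, cor and rem, counting a missing value as 0."""
    features = SIGNS[sign]
    return sum(features.get(name) or 0 for name in ("unc", "cor", "rem"))


def word_uncertainty(signs):
    """Average uncertainty over the non-reconstructed consonants."""
    nonrec = [
        s for s in signs
        if SIGNS[s].get("type") == "cons" and not SIGNS[s].get("rec")
    ]
    if not nonrec:
        return 1000
    return sum(sign_uncertainty(s) for s in nonrec) / len(nonrec)


print(word_uncertainty([1, 2, 3, 4, 5, 6]))  # 4 / 4 = 1.0
print(word_uncertainty([5]))                 # only reconstructed: 1000
```

The `or 0` idiom matters because a feature lookup that has no value yields `None`, which cannot be summed.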

Now we collect the set of all line numbers that our result lines have:

In [None]:
{F.ln.v(result[0]) for result in results}

What we really want to know is how the result lines are distributed over the line numbers.

In [None]:
import collections

In [None]:
distribution = collections.Counter()

for result in results:
  lineNumber = F.ln.v(result[0])
  distribution[lineNumber] += 1
  
print(distribution)

An overwhelming majority of the results is on line 3.

Let's make the output a bit more friendly:

In [None]:
for (lineNumber, amount) in sorted(distribution.items()):
  print(f'line {lineNumber:>2} is home to {amount:>3} results')
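
The same counting pattern could also address the question this section opened with: how uncertainty is distributed over the verbs. The sketch below uses made-up scores, so the numbers say nothing about the corpus. In the notebook, a comparable dictionary might be built with something like `{verb: uncertainty(verb) for verb in verbs}`, which is an assumption and not a step taken in the text above.

```python
import collections

# Hypothetical per-verb scores, keyed by made-up word numbers.
scores = {101: 1.0, 102: 0.25, 103: 1.0, 104: 1000, 105: 0.5, 106: 0.25}

# Count how many verbs share each uncertainty score.
score_distribution = collections.Counter(scores.values())

for (score, amount) in sorted(score_distribution.items()):
    print(f'uncertainty {score:>7.2f} occurs in {amount:>2} verbs')
```

With real data the scores will take many distinct values, so rounding them first (for example to one decimal) may give a more readable table.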

We can now inspect more closely what is going on, for example where results appear late in the tablet, after line 16:

In [None]:
results16 = A.search('''
line ln>16
  =: word
    =: sign reading=um
    <: sign reading=ma
    :=
  < sign reading=ma
  :=
''')

And we can show them here too:

In [None]:
A.table(results16)

But at this point it might be easier to take the new query back to the Text-Fabric browser and query it there:

<img src="images/results16.png" width="1200">