# Sanskrit to Text-Fabric
This notebook converts a
[Sanskrit text resource](https://etcbc.github.io/text-fabric-data/features/sanskrit/dcs/0_home.html), suggested by [Tyler Neill (Leipzig)](http://www.gko.uni-leipzig.de/indologie-zaw/institut/mitarbeiter-in-forschungsprojekten/tyler-graham-neill.html)
into Text-Fabric.

## Disclaimer
Note that this corpus has some problems.

**This is only a demonstrator**

The purpose is to set up a Text-Fabric processing pipeline for this kind of resource.

## Structure of the corpus
The resource is a collection of files, which for now we treat as *books*.
The lines in the files are numbered with two hierarchical numbers.
We treat the principal number as a *chapter* number and the other as a *verse* number.

## Modeling the data in Text-Fabric
Slots correspond to letters, and we skip the spaces, but not without having them leave a trail.
So the primary data is considered to be a consecutive stream of non-white characters.

We also have a node type `word`, whose text values are the words as delimited in the original text.

We give the letter nodes a feature `char`, which contains the unicode character at that position.
Besides `char`, letter nodes also have a feature `trailer`, which is a space if that letter is at the end of an
original word, and the empty string in other cases.
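
The slot model described above can be illustrated with a minimal sketch. It assumes plain whitespace-separated words; the helper name `to_slots` and the sample phrase are hypothetical and do not occur in the corpus code.

```python
def to_slots(text):
    """Split text into (char, trailer) pairs, one pair per non-space character."""
    slots = []
    for word in text.split():
        for i, letter in enumerate(word):
            # The last letter of each word records the space that followed it.
            trailer = ' ' if i == len(word) - 1 else ''
            slots.append((letter, trailer))
    return slots


sample = 'tat tvam asi'  # hypothetical sample text
slots = to_slots(sample)

# Concatenating char + trailer restores the word boundaries
# (with one extra space after the final word).
rebuilt = ''.join(char + trailer for (char, trailer) in slots)

print(len(slots))     # 10 slots: the two spaces are not slots
print(repr(rebuilt))  # 'tat tvam asi '
```

Because the space is stored on the preceding letter, the stream of slots stays free of whitespace while the original segmentation remains recoverable.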

# Preparation
We import some of the most generic Python modules.

In [4]:
import os, re, collections
from glob import glob

Here is the import of the Text-Fabric library.

In [5]:
from tf.fabric import Fabric
from tf.timestamp import Timestamp

We use variables to point to the input repository.

In [6]:
REPO = os.path.expanduser('~/github/sanskrit_text_dcs')
TEXT_DIR = '{}/corpora'.format(REPO)

The output is a dataset in the `text-fabric-data` repository, where it sits alongside the 
Hebrew Bible and the Greek New Testament. 

This place is used only temporarily.
Once it has been decided to pursue this path further, we will seek a destination under the control of
the researchers that actually work with this corpus.

We supply a location to the `Fabric()` call where there is currently no dataset.
TF will warn about missing grid features (`otype`, `oslots`), but that will not concern us,
because in the following cells we will construct these features.

In [10]:
tm = Timestamp()
TF = Fabric(locations=['~/github/text-fabric-data'], modules=['sanskrit/dcs'])

This is Text-Fabric 2.3.7
Api reference : https://github.com/ETCBC/text-fabric/wiki/Api
Tutorial      : https://github.com/ETCBC/text-fabric/blob/master/docs/tutorial.ipynb
Data sources  : https://github.com/ETCBC/text-fabric-data
Data docs     : https://etcbc.github.io/text-fabric-data
Shebanq docs  : https://shebanq.ancient-data.org/text
Slack team    : https://shebanq.slack.com/signup
Questions? Ask shebanq@ancient-data.org for an invite to Slack
0 features found and 0 ignored


  0.00s Grid feature "otype" not found in
/Users/dirk/github/text-fabric-data/sanskrit/dcs
  0.00s Grid feature "oslots" not found in
/Users/dirk/github/text-fabric-data/sanskrit/dcs


  0.01s Grid feature "otext" not found. Working without Text-API



# Reading the corpus

We read each file in turn. Extracting information is deferred to `readText()` below.
When errors arise, we collect them here. Only a handful of errors will be displayed.

In [11]:
errors = collections.defaultdict(list)
slotNum = 0
nodeFeatures = collections.defaultdict(dict)
edgeFeatures = collections.defaultdict(dict)

CHAR = 'char'
TRAILER='trailer'
WORD = 'word'
BOOK = 'book'
SECTION = 'chapter'
LINE = 'verse'
nodes = collections.defaultdict(list)

def showErrorSummary():
    errorTexts = sorted(errors.keys())[0:3]
    for errorText in errorTexts:
        for error in errors[errorText][0:3]:
            print(error)

def readCorpus():
    tm.indent(reset=True)
    tm.info('Reading corpus')
    os.chdir(TEXT_DIR)
    errors.clear()
    nodeFeatures.clear()
    edgeFeatures.clear()
    nodes.clear()
    global slotNum
    slotNum = 0
    textFiles = sorted(os.path.splitext(f)[0] for f in glob('*.txt'))
    print('{} texts'.format(len(textFiles)))
    for textFile in textFiles:
        readText(textFile)
    if len(errors):
        print('There were {} errors'.format(sum(len(errors[textFile]) for textFile in errors)))
        showErrorSummary()
    else:
        print('No errors')
    print('''
{} slots
{} words in source
{} lines
{} sections
{} books
'''.format(
        slotNum,
        len(nodes[WORD]),
        len(nodes[LINE]),
        len(nodes[SECTION]),
        len(nodes[BOOK]),
    ))
    tm.info('Done')

We set up a few regular expressions to scan the data lines of the text files.

In [12]:
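# A data line is: text, one or more slashes, then "(section.line)".
# Raw strings keep the backslashes intact; the dot between the two numbers is escaped.
linePat = re.compile(r'^\s*([^/]*)/+\s*\(([^.)]+)\.([^)]+)\)\s*')
emptyLinePat = re.compile(r'^\s*$')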
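
To see what such a pattern captures, here is a small self-contained sketch. The pattern is written as a raw string with the dot between the two numbers escaped, which is an assumption about the intended format; the names `line_pat` and `parse_line` are hypothetical. The two sample lines are taken from the conversion notes in this notebook.

```python
import re

# Assumed line format: text, one or more slashes, then "(section.line)".
line_pat = re.compile(r'^\s*([^/]*)/+\s*\(([^.)]+)\.([^)]+)\)\s*')


def parse_line(line):
    """Return (text, section, line_nr), or None if the line is irregular."""
    match = line_pat.match(line)
    if not match:
        return None
    return (match.group(1).rstrip(), match.group(2), match.group(3))


regular = 'hariśvetaṃ tathā vaṃśe pītaśvetaṃ ca śūkare // (28.1)'
irregular = "iti śrīskānde gokarṇakhaṇḍe śrīgokarṇamāhātmye sāroddhāre prathamo 'dhyāyaḥ / // (88.1)"

print(parse_line(regular))    # text, then '28' and '1'
print(parse_line(irregular))  # None: a second run of slashes precedes the number
```

The irregular line fails because the text group may not contain a slash, so only one run of slashes can sit between the text and the numbering.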

The next function, `readText()`, reads a text file line by line, separates the text from its numbering,
and splits the words of the text into characters.
It will construct dictionaries that correspond to the features that we will write out later into the new
Text-Fabric resource.

In this function the slot numbers (for the letters) will be determined.
We postpone the numbering of non-slot nodes.
We only collect lists of non-slot nodes and their feature values.

In [13]:
def readText(textFile):
    global slotNum
    with open('{}.txt'.format(textFile)) as f:
        bookName = textFile
        bookStart = slotNum + 1
        curSection = None
        sectionStart = slotNum + 1
        for (n, line) in enumerate(f):
            lineStart = slotNum + 1
            line = line.rstrip('\n')
            if emptyLinePat.match(line): continue
            match = linePat.match(line)
            if not match:
                errors[textFile].append('{}:{} - unexpected line\n\t{}\n'.format(textFile, n + 1, line))
                continue
            text = match.group(1).rstrip()
            sectionNr = match.group(2)
            if sectionNr is None:
                break
            if curSection != sectionNr:
                if curSection is not None:
                    sectionEnd = slotNum
                    nodes[SECTION].append((sectionStart, sectionEnd, {SECTION: curSection, BOOK: bookName}))
                curSection = sectionNr
                sectionStart = slotNum + 1                
            lineNr = match.group(3)
            words = text.split()
            for word in words:
                wordStart = slotNum + 1
                for letter in word:
                    slotNum += 1                 
                    nodeFeatures[CHAR][slotNum] = letter
                    nodeFeatures[TRAILER][slotNum] = ''
                wordEnd = slotNum
                nodeFeatures[TRAILER][slotNum] = ' '
                nodes[WORD].append((wordStart, wordEnd, {WORD: word}))
            lineEnd = slotNum
            nodes[LINE].append((lineStart, lineEnd, {LINE: lineNr, SECTION: curSection, BOOK: bookName}))
        sectionEnd = slotNum
        if curSection is None:
            print('Empty book {}'.format(textFile))
        else:    
            nodes[SECTION].append((sectionStart, sectionEnd, {SECTION: curSection, BOOK: bookName}))
            bookEnd = slotNum
            nodes[BOOK].append((bookStart, bookEnd, {BOOK: bookName}))
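
The way section boundaries are derived from a running slot counter can be shown in isolation. This is a simplified sketch, not the function above: it assumes each input line has already been reduced to a pair of section number and slot count, and the name `section_spans` is hypothetical.

```python
def section_spans(lines):
    """Turn (section_nr, n_slots) pairs into (first_slot, last_slot, section_nr) spans."""
    spans = []
    slot = 0      # last slot number handed out so far
    cur = None    # section currently being collected
    start = 1     # first slot of the current section
    for (section_nr, n_slots) in lines:
        if section_nr != cur:
            # A new section starts: close the previous one, if there was one.
            if cur is not None:
                spans.append((start, slot, cur))
            cur = section_nr
            start = slot + 1
        slot += n_slots
    # Close the section that is still open at the end of the input.
    if cur is not None:
        spans.append((start, slot, cur))
    return spans


spans = section_spans([('1', 5), ('1', 3), ('2', 4)])
print(spans)  # [(1, 8, '1'), (9, 12, '2')]
```

A section is only closed when the next one begins or the input ends, which is why the final span has to be appended after the loop.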

## Conversion notes

190 texts

### Empty Texts

```
Empty book Gřḍhārthaprakāśaka
Empty book Kaulāvalīnirṇaya
Empty book Mṛgendraṭīkā
Empty book Nyāyacandrikāpaṇjikā
Empty book Śārṅgadharasaṃhitādīpikā
Empty book Tantrasaṃgraha
Empty book Tantrāloka
```

**Action taken**

Skipped them altogether

### Irregular lines

```
There were 3 errors
Agastīyaratnaparīkṣā:55 - unexpected line
		[... auein Vers / Satzjh] // (27.2)hariśvetaṃ tathā vaṃśe pītaśvetaṃ ca śūkare // (28.1)

Gokarṇapurāṇasāraḥ:185 - unexpected line
		iti śrīskānde gokarṇakhaṇḍe śrīgokarṇamāhātmye sāroddhāre prathamo 'dhyāyaḥ / // (88.1)

Rasādhyāya:130 - unexpected line
		[... auein Vers / Satzjh] // (64.2)tāmrāt sūtaṃ rasāttāmraṃ pātanāya pṛthakkṛtam / (65.1)
```
        
**Action taken**

Cases 1 and 3: inserted a newline and changed the first `/` into a `~`

Case 2: removed the `//`

Here we call the reading functions.

In [14]:
readCorpus()

  0.00s Reading corpus
190 texts
Empty book Gřḍhārthaprakāśaka
Empty book Kaulāvalīnirṇaya
Empty book Mṛgendraṭīkā
Empty book Nyāyacandrikāpaṇjikā
Empty book Śārṅgadharasaṃhitādīpikā
Empty book Tantrasaṃgraha
Empty book Tantrāloka
No errors

1161379 slots
136409 words in source
25729 lines
13010 sections
183 books

  1.40s Done


# Metadata

We supply the necessary metadata for the new features.
We also have a few generic fields that will be added to all features.

In [17]:
metaData = {
    '': dict(
        createdBy='Tyler Neill and Dirk Roorda',
        name='Sanskrit_Corpus_DCS',
        title='Sanskrit Corpus',
        provenance='[DCS](http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/index.php)',
        description='DCS, the Digital Corpus of Sanskrit, is a searchable collection of lemmatized Sanskrit texts. It offers free internet access to a part of the database of the linguistic program SanskritTagger, which has been under constant development since 1999.'
    ),
    'otext': {
        'sectionFeatures': ','.join((BOOK, SECTION, LINE)),
        'sectionTypes': ','.join((BOOK, SECTION, LINE)),
        'fmt:text-orig-full': '{{{}}}'.format(CHAR),
        'fmt:text-orig-segmented': '{{{}}}{{{}}}'.format(CHAR, TRAILER),
    },
    'otype': {
        'valueType': 'str',        
    },
    'oslots': {
        'valueType': 'str',
    },
    'book@sa': {
        'valueType': 'str',
        'language': 'Saṃskṛtam',
        'languageCode': 'sa',
        'languageEnglish': 'sanskrit',
    },
    'trailer': {
        'valueType': 'str',
    }
}
nodeFeatures['book@sa'] = nodeFeatures[BOOK]

for (sectionType) in (CHAR, WORD, LINE, SECTION, BOOK):
    metaData.setdefault(sectionType, {})['valueType'] = 'int' if sectionType in {LINE, SECTION} else 'str'
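
The doubled braces in the two text formats are easy to misread, so here is a small sketch of what those expressions evaluate to. The rendering step uses plain `str.format` on hypothetical slot dictionaries; it only illustrates the template idea and is not how Text-Fabric applies formats internally.

```python
CHAR = 'char'
TRAILER = 'trailer'

# '{{' and '}}' are literal braces, '{}' is replaced by the feature name.
full = '{{{}}}'.format(CHAR)
segmented = '{{{}}}{{{}}}'.format(CHAR, TRAILER)
print(full)       # {char}
print(segmented)  # {char}{trailer}

# Hypothetical slots for the two words "tat asi".
slots = [
    {'char': 't', 'trailer': ''},
    {'char': 'a', 'trailer': ''},
    {'char': 't', 'trailer': ' '},
    {'char': 'a', 'trailer': ''},
    {'char': 's', 'trailer': ''},
    {'char': 'i', 'trailer': ' '},
]

# Illustration only: fill each template per slot and join the results.
text_full = ''.join(full.format(**slot) for slot in slots)
text_segmented = ''.join(segmented.format(**slot) for slot in slots)
print(repr(text_full))       # 'tatasi'
print(repr(text_segmented))  # 'tat asi '
```

So the first format yields the unbroken character stream, while the second one restores the word boundaries through the trailers.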

We add features that contain the frequency and rank of all words in the corpus.

In [18]:
def computeStatistics():
    tm.info('Computing statistics')
    wstats = {
        'freq': collections.Counter(),
        'rank': {},
    }
    word = {}

    words = [n[0] for n in nodeFeatures['otype'].items() if n[1] == WORD]

    for w in words:
        occ = nodeFeatures[WORD][w]
        wstats['freq'][occ] += 1
    rank = -1
    prev_n = -1
    amount = 1
    for (x, n) in sorted(wstats['freq'].items(), key=lambda y: (-y[1], y[0])):
        if n == prev_n:
            amount += 1
        else:
            rank += amount
            amount = 1
        prev_n = n
        wstats['rank'][x] = rank
    tm.info('Done')

    tm.info('Adding statistics as features')
    occFeatures = {}
    for ft in ('freq', 'rank'):
        occFeatures[ft] = {}
        metaData.setdefault(ft, {})['valueType'] = 'int'

    for w in words:
        occ = nodeFeatures[WORD][w]
        for ft in ['freq', 'rank']:
            occFeatures[ft][w] = str(wstats[ft][occ])

    nodeFeatures.update(occFeatures)
    tm.info('Done')
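
The ranking scheme in this function gives equally frequent words the same rank and then skips ranks accordingly, starting at 0. The following stand-alone sketch reproduces that logic on a tiny hypothetical word list; the name `freq_and_rank` is an assumption made for the example.

```python
import collections


def freq_and_rank(words):
    """Return (freq, rank): occurrence counts and 0-based ranks with ties sharing a rank."""
    freq = collections.Counter(words)
    rank = {}
    cur_rank = -1
    prev_n = -1
    amount = 1  # how many words share the current rank
    # Most frequent first; ties are ordered alphabetically.
    for (word, n) in sorted(freq.items(), key=lambda y: (-y[1], y[0])):
        if n == prev_n:
            amount += 1
        else:
            # Jump over as many ranks as there were words in the previous tie.
            cur_rank += amount
            amount = 1
        prev_n = n
        rank[word] = cur_rank
    return freq, rank


freq, rank = freq_and_rank(['ca', 'ca', 'ca', 'tu', 'tu', 'hi', 'hi', 'iti'])
print(dict(freq))  # {'ca': 3, 'tu': 2, 'hi': 2, 'iti': 1}
print(rank)        # {'ca': 0, 'hi': 1, 'tu': 1, 'iti': 3}
```

Note that rank 2 is not used: `hi` and `tu` share rank 1, so the next word receives rank 3.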

# Compiling into TF

Here we put all bits together:

1. the features for letters and words
2. the nodes for words, verses, chapters and books
3. the metadata.

In [19]:
def makeTextFabric():
    tm.indent(reset=True)
    tm.info('Generating text-fabric dataset')
    nodeFeatures['otype'] = dict((n, 'letter') for n in range(1, slotNum + 1))
    nodeNum = slotNum
    for (nodeType) in (WORD, LINE, SECTION, BOOK):
        for (start, end, feats) in nodes[nodeType]:
            nodeNum += 1
            nodeFeatures['otype'][nodeNum] = nodeType
            for feat in feats:
                nodeFeatures[feat][nodeNum] = feats[feat]
            edgeFeatures['oslots'][nodeNum] = list(range(start, end + 1))
    computeStatistics()
    TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=metaData)
    tm.info('Done')
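
The numbering scheme used here can be tried out without Text-Fabric installed. In this sketch slots occupy the numbers 1 to `slot_count`, and every other node receives the next free number, type by type. The function name `number_nodes` and the toy data are hypothetical; feature values and saving are left out.

```python
def number_nodes(slot_count, nodes_by_type, type_order):
    """Assign node numbers after the slots and record which slots each node covers."""
    otype = {n: 'letter' for n in range(1, slot_count + 1)}
    oslots = {}
    node = slot_count
    for node_type in type_order:
        for (start, end) in nodes_by_type[node_type]:
            node += 1
            otype[node] = node_type
            # A node is linked to the full, inclusive range of slots it spans.
            oslots[node] = list(range(start, end + 1))
    return otype, oslots


otype, oslots = number_nodes(
    10,
    {'word': [(1, 3), (4, 7), (8, 10)], 'verse': [(1, 10)]},
    ('word', 'verse'),
)
print(otype[11], oslots[11])  # word [1, 2, 3]
print(otype[14], oslots[14])  # verse [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Numbering the non-slot nodes type by type keeps all nodes of one type in a contiguous block of node numbers.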

In [20]:
makeTextFabric()

  0.00s Generating text-fabric dataset
  0.70s Computing statistics
  1.38s Done
  1.38s Adding statistics as features
  1.62s Done
  0.00s Exporting 10 node and 1 edge and 1 config features to /Users/dirk/github/text-fabric-data/sanskrit/dcs:
   |     0.09s T book                 to /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     0.08s T book@sa              to /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     0.08s T chapter              to /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     2.14s T char                 to /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     0.24s T freq                 to /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     0.66s T otype                to /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     0.23s T rank                 to /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     2.07s T trailer              to /Users/dirk/github/text-fabric-data/sanskrit/dcs
   |     0.06s T verse              

See the [tutorial](https://github.com/ETCBC/text-fabric/blob/master/docs/tutorialSanskrit.ipynb)
for getting started *using* the new text-fabric dataset.