<img src="images/skanda.jpg" width="400"/>
<img src="images/tf-small.png" width="70"/>

# The Skandapurāṇa project

We convert the text of the Skandapurāṇa to Text-Fabric (TF).

The texts come from Peter Bisschop's [Skandapurāṇa project](https://www.universiteitleiden.nl/en/research/research-projects/humanities/the-skandapurāṇa-project#tab-1).

We used the transliterated and the Devanāgarī representations of the texts as found [here](https://www.universiteitleiden.nl/en/research/research-projects/humanities/the-skandapurāṇa-project#tab-4).

## Details of the TF modeling

We take the Devanāgarī character as the smallest unit, the slot.
Character nodes are called `char`.
The Unicode representation of a char is stored in the feature `dchar`.

If a letter is the last letter of a word, we set its feature `last` to a space,
otherwise the empty string.

Words are the maximal stretches of Devanāgarī-chars that do not contain a space.
Word nodes have node type `word`.

The Unicode representation of a word is stored in the feature `dword`.

The transliteration of a word is stored in the feature `tword`.
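
As an illustration of the character and word model, here is a minimal sketch, not the conversion code itself. The helper `chars_with_last` and the sample words are hypothetical. It assumes that a `char` is a single Unicode code point, which is what iterating over a Python string yields, so a consonant and a dependent vowel sign count as separate slots.

```python
def chars_with_last(line):
    """Split a line into (dchar, last) pairs, one pair per code point."""
    pairs = []
    for word in line.split():
        for (i, ch) in enumerate(word):
            # only the final character of a word gets a space in `last`
            last = ' ' if i == len(word) - 1 else ''
            pairs.append((ch, last))
    return pairs


sample = 'नमः शिवाय'  # two words, chosen only for illustration
pairs = chars_with_last(sample)

# concatenating dchar + last for every slot gives back the text,
# with a trailing space after the final word
rebuilt = ''.join(ch + last for (ch, last) in pairs)
print(len(pairs), repr(rebuilt))
```

Storing the space in `last` rather than as a slot of its own keeps the slots purely textual, while the original spacing can still be reproduced.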

The top-level sectional unit is the node type `text`, and corresponds to the contents of a single file.

Nodes of type `text` have the following features:

* `name`: the name of the text. Usually this is just the number, but in some texts
  some characters are appended to the number;
* `number`: the integer corresponding to the first triplet of digits
  after the `SP` at the start of each line;
* `volume`: the volume label, as given in the description of the project.

Texts are subdivided into `verse`s.

Nodes of type `verse` have the following features:

* `number`: the integer corresponding to the second triplet of digits 
  after the `SP` at the start of each line.
  
Verses are subdivided into `line`s.

Nodes of type `line` have the following features:

* `number`: the integer corresponding to the last digit
  after the `SP` at the start of each line.
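
The numbering scheme can be sketched as follows. This is an illustration under assumptions: the helper `parse_label` and the sample labels are hypothetical, and the layout (`SP`, three digits for the text, three digits for the verse, one character for the line) is inferred from the description above. A `:` in the last position is counted as line 0, as the conversion code below does. The labels of the special S and RA texts are not covered.

```python
import re

# assumed layout: SP + 3 digits (text) + 3 digits (verse) + 1 char (line)
LABEL_RE = re.compile(r'^SP(\d{3})(\d{3})([\d:])')


def parse_label(label):
    """Return (text, verse, line) numbers, or None if the label does not fit."""
    match = LABEL_RE.match(label)
    if match is None:
        return None
    (text, verse, line) = match.groups()
    # a colon instead of a digit is counted as line 0
    return (int(text), int(verse), 0 if line == ':' else int(line))


print(parse_label('SP0010021'))  # hypothetical label
print(parse_label('SP001002:'))  # hypothetical label
```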


# The program

Import generic Python modules

In [1]:
import os, re, collections
from glob import glob

Here is the import of the Text-Fabric library.

In [2]:
from tf.fabric import Fabric
from tf.timestamp import Timestamp

In [3]:
tm = Timestamp()

## Configuration
We use variables to point to the input directories and the output tf directory.

In [4]:
BASE = os.path.expanduser('~/github/Dans-labs/text-fabric-data/sanskrit/sp')
ORIG = f'{BASE}/devanagari'
TRANS = f'{BASE}/transliteration'
TF_DIR = f'{BASE}/tf'

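We glean the volume membership from the project description.
Each volume maps to a pair: for the volumes of numbered texts, the first and last text number;
for the S and RA recensions, the text number and a file name suffix.
`RAS` gives the range of the numbered RA files.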

In [5]:
VOLUMES = dict(
  I=(1,25),
  IIA=(26,30),
  IIB=(31,52),
  III=(53,69),
  S=(167,'s'),
  RA=(167, 'ra'),  
)
RAS = (1,5)

# node types

TEXT = 'text'
VERSE = 'verse'
LINE = 'line'
WORD = 'word'
CHAR = 'char'

NODE_TYPES = f'''
  {CHAR}
  {WORD}
  {LINE}
  {VERSE}
  {TEXT}
'''.strip().split()

# features

OTYPE = 'otype'
OSLOTS = 'oslots'
VOLUME = 'volume'
NAME = 'name'
NUMBER = 'number'
DWORD = 'dword'
TWORD = 'tword'
DCHAR = 'dchar'
LAST = 'last'

## File list

We generate the list of files in the corpus from the configuration.
The result is a list of items, where each item consists of
a document number, a document name, the name of the file with the transliterated text,
and the name of the file with the Devanāgarī text.

We also map the document names to volume labels.

Beyond the numbered texts there are a few special texts: the S and RA recensions.

In [6]:
def makeFileList():
  textVolume = {}
  texts = []
  for (vol, (n, x)) in VOLUMES.items():
    if type(x) is int:
      for i in range(n, x + 1):
        fileStart = f'st{i:>03}'
        name = f'{i:>03}'
        texts.append((i, name, f'{fileStart}.txt', f'{fileStart}_d.txt'))
        textVolume[name] = vol
    else:
      fileStart = f'st{n:>03}'
      if x == 's':
        name = f'{n}S'
        texts.append((n, name, f'{fileStart}_{x}.txt', f'{fileStart}_{x}_d.txt'))
        textVolume[name] = vol
      else:
        for i in range(RAS[0], RAS[1] + 1):
          name = f'{n}RA{i}'
          texts.append((n, name, f'{fileStart}_{x}{i}.txt', f'{fileStart}_{x}{i}_d.txt'))
          textVolume[name] = vol
  return (texts, textVolume)
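
To make the naming scheme concrete, the following self-contained sketch repeats the formatting used in `makeFileList` for one numbered text and for the special texts. The helper `file_names` is hypothetical and only serves as an illustration.

```python
def file_names(number, suffix=''):
    """Return (transliteration file, Devanāgarī file) for one text."""
    start = f'st{number:>03}'  # zero-padded to three digits
    tail = f'_{suffix}' if suffix else ''
    return (f'{start}{tail}.txt', f'{start}{tail}_d.txt')


print(file_names(7))
print(file_names(167, 's'))
print(file_names(167, 'ra1'))
```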

## Read the corpus files

For each text item in the list, we read the associated files from disk.
The file contents are chopped up into lines, and the text-containing lines
are chopped up into words.

All this data is collected in one big data structure.
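
The chopping of a single line can be sketched as follows. The sample line is invented for illustration; the expression is the one used in `readText`: everything after the first `|` is discarded, and the remainder is split on whitespace, so that the first item is the label and the rest are the words.

```python
def chop(line):
    """Split a corpus line into its label and words, dropping text after '|'."""
    return line.rstrip('\n').split('|', 1)[0].split()


# hypothetical line: label, three transliterated words, then a remark
sample = 'SP0010011 namaḥ śivāya śāntāya | remark\n'
(label, *words) = chop(sample)
print(label, words)
```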

In [7]:
def readText(n, name, transFile, devFile):
  good = True
  transPath = f'{TRANS}/{transFile}'
  devPath = f'{ORIG}/{devFile}'
  results = []
  for path in (transPath, devPath):
    if not os.path.isfile(path):
      tm.error(f'{name}: file does not exist: {path}')
      good = False
      continue
    with open(path, encoding='utf8') as fh:
      results.append([
        line.rstrip('\n').split('|', 1)[0].split()
        for line in fh
        if line.startswith('SP')
      ])
  return results if good else None

In [8]:
def readCorpus(texts):
  data = []
  tm.indent(reset=True)
  tm.info('Reading corpus')
  for item in texts:
    tm.info(f'\t{item[1]}')
    results = readText(*item)
    if results is not None:
      data.append((item[0], item[1], *results))
  tm.info('Done')
  return data

## Proto TF

We process the data and compose the feature data for the `oslots` edge feature and
the desired node features.

In this stage we cannot know the eventual node numbers, so we identify each node by node type and node number within its type.
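
A minimal sketch of this bookkeeping, with invented words instead of real data: every node is keyed by a `(type, number)` pair, and `oslots` records which slots a node occupies. The variable names are chosen for this illustration only.

```python
oslots = {}
otype = {}
cur_slot = 0

for (w, word) in enumerate(['om', 'namah'], start=1):  # invented words
    first_slot = cur_slot + 1
    for ch in word:
        cur_slot += 1
        otype[('char', cur_slot)] = 'char'
    otype[('word', w)] = 'word'
    # a word occupies the slots of its characters
    oslots[('word', w)] = set(range(first_slot, cur_slot + 1))

print(oslots)
```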

In [9]:
def protoTf(data, textVolume):
  curSlot = 0
  curWord = 0
  curLine = 0
  curVerse = 0
  curText = 0
  edgeFeatures = {
    OSLOTS: {},
  }
  nodeFeatures = {
    OTYPE: {},
    VOLUME: {},
    NAME: {},
    NUMBER: {},
    DWORD: {},
    TWORD: {},
    DCHAR: {},
    LAST: {},
  }

  tm.indent(reset=True)
  tm.info('Proto TF generation')
  for (textNum, textName, transData, devData) in data:
    tm.info(f'\t{textName}')
    curText += 1
    nodeFeatures[OTYPE][(TEXT, curText)] = TEXT
    nodeFeatures[NAME][(TEXT, curText)] = textName
    nodeFeatures[NUMBER][(TEXT, curText)] = textNum
    nodeFeatures[VOLUME][(TEXT, curText)] = textVolume[textName]
    firstTextSlot = curSlot + 1
    firstVerseSlot = curSlot + 1
    firstLineSlot = curSlot + 1
    verseNum = None
    lineNum = None
    for (i, dLine) in enumerate(devData):
      label = dLine[0]
      dWords = dLine[1:]
      tWords = transData[i][1:]
      labelNumbers = label[10:14] if label[2:4] == 'ra' else label[5:9]
      thisVerseNum = int(labelNumbers[0:3])
      thisLineNum = int(0 if labelNumbers[3] == ':' else labelNumbers[3])
      if thisLineNum != lineNum or thisVerseNum != verseNum:
        if lineNum is not None:
          curLine += 1
          nodeFeatures[OTYPE][(LINE, curLine)] = LINE
          nodeFeatures[NUMBER][(LINE, curLine)] = lineNum
          edgeFeatures[OSLOTS][(LINE, curLine)] = set(range(firstLineSlot, curSlot + 1))
          firstLineSlot = curSlot + 1
        lineNum = thisLineNum
      if thisVerseNum != verseNum:
        if verseNum is not None:
          curVerse += 1
          nodeFeatures[OTYPE][(VERSE, curVerse)] = VERSE
          nodeFeatures[NUMBER][(VERSE, curVerse)] = verseNum
          edgeFeatures[OSLOTS][(VERSE, curVerse)] = set(range(firstVerseSlot, curSlot + 1))
          firstVerseSlot = curSlot + 1
        verseNum = thisVerseNum
      for (j, dWord) in enumerate(dWords):
        tWord = tWords[j]
        curWord += 1
        nodeFeatures[OTYPE][(WORD, curWord)] = WORD
        nodeFeatures[DWORD][(WORD, curWord)] = dWord
        nodeFeatures[TWORD][(WORD, curWord)] = tWord
        edgeFeatures[OSLOTS][(WORD, curWord)] = set(range(curSlot + 1, curSlot + 1 + len(dWord)))
        for d in dWord:
          curSlot += 1
          nodeFeatures[OTYPE][(CHAR, curSlot)] = CHAR
          nodeFeatures[DCHAR][(CHAR, curSlot)] = d
          nodeFeatures[LAST][(CHAR, curSlot)] = ''
        nodeFeatures[LAST][(CHAR, curSlot)] = ' '
    if verseNum is not None:
      curVerse += 1
      nodeFeatures[OTYPE][(VERSE, curVerse)] = VERSE
      nodeFeatures[NUMBER][(VERSE, curVerse)] = verseNum
      edgeFeatures[OSLOTS][(VERSE, curVerse)] = set(range(firstVerseSlot, curSlot + 1))
    if lineNum is not None:
      curLine += 1
      nodeFeatures[OTYPE][(LINE, curLine)] = LINE
      nodeFeatures[NUMBER][(LINE, curLine)] = lineNum
      edgeFeatures[OSLOTS][(LINE, curLine)] = set(range(firstLineSlot, curSlot + 1))
    edgeFeatures[OSLOTS][(TEXT, curText)] = set(range(firstTextSlot, curSlot + 1))
  tm.info('Done')
  tm.info('Checking whether all slots are contained in a word, line, verse and text')
  typeSlots = {}
  for ((nType, n), slots) in edgeFeatures[OSLOTS].items():
    typeSlots.setdefault(nType, set())
    typeSlots[nType] |= slots
  for (nType, slots) in typeSlots.items():
    minSlot = min(slots)
    maxSlot = max(slots)
    ok = minSlot == 1 and maxSlot == curSlot and len(slots) == maxSlot
    okRep = 'ok' if ok else '!!!'
    print(f'{nType:<8}: {len(slots)} elements between {minSlot} - {maxSlot} ({okRep})')
  return (nodeFeatures, edgeFeatures)      

## TF features

We create real TF feature data, by ordering all nodes into one big sequence.
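
The renumbering can be sketched as follows. The counts are the ones reported in the output of the conversion further on; the helper `real_node` is a hypothetical name. Each type gets an offset equal to the total number of nodes of the types before it, so the slots keep the numbers 1 to 347770 and the other types follow.

```python
NODE_TYPES = ['char', 'word', 'line', 'verse', 'text']
counts = dict(char=347770, word=45101, line=10012, verse=4532, text=75)

offsets = {}
cur_offset = 0
for ntp in NODE_TYPES:
    offsets[ntp] = cur_offset
    cur_offset += counts[ntp]


def real_node(ntp, n):
    """Map a (type, number) pair to its final TF node number."""
    return n + offsets[ntp]


print(offsets)
print(real_node('word', 1), real_node('text', 75))
```

Putting the slot type first matters: TF expects the slots to occupy the node numbers from 1 upwards, with all other nodes after them.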

In [10]:
def tfFeatures(nodeFeaturesProto, edgeFeaturesProto):
  errors = set()
  nodeTypeSets = {}
  for ((otp, n), xtp) in nodeFeaturesProto[OTYPE].items():
    if otp != xtp:
      errors.add((otp, n, xtp))
    nodeTypeSets.setdefault(otp, set()).add(n)
  print(f'{len(errors)} inconsistencies')
  curOffset = 0
  offsets = {}
  for otp in NODE_TYPES:
    offsets[otp] = curOffset
    ns = nodeTypeSets[otp]
    minNtp = min(ns)
    if minNtp != 1:
      print(f'Node type {otp} starts with {minNtp}')
    maxNtp = max(ns)
    if maxNtp != len(ns):
      print(f'Node type {otp} has holes in the sequence: {len(ns)} vs {maxNtp}')
    print(f'{otp:<8}: {maxNtp:>6}')
    curOffset += maxNtp
  nodeFeatures = {}
  edgeFeatures = {}
  for (feature, data) in nodeFeaturesProto.items():
    featureData = {}
    for ((ntp, n), value) in data.items():
      featureData[n + offsets[ntp]] = value
    nodeFeatures[feature] = featureData
  for (feature, data) in edgeFeaturesProto.items():
    featureData = {}
    for ((ntp, n), value) in data.items():
      featureData[n + offsets[ntp]] = value
    edgeFeatures[feature] = featureData
  for (otp, offset) in offsets.items():
    print(f'{otp} has offset {offset}')
    for i in range(max((offset - 3, 1)), offset + 4):
      print(f'{i:>6} is a {nodeFeatures["otype"][i]}')
  return (nodeFeatures, edgeFeatures)

## Main steps

Here we execute the main steps, as defined in the functions above.

In [11]:
(texts, textVolume) = makeFileList()

In [12]:
data = readCorpus(texts)

  0.00s Reading corpus
  0.00s 	001
  0.01s 	002
  0.01s 	003
  0.02s 	004
  0.02s 	005
  0.03s 	006
  0.03s 	007
  0.04s 	008
  0.04s 	009
  0.05s 	010
  0.05s 	011
  0.05s 	012
  0.06s 	013
  0.07s 	014
  0.07s 	015
  0.07s 	016
  0.08s 	017
  0.08s 	018
  0.09s 	019
  0.09s 	020
  0.10s 	021
  0.10s 	022
  0.11s 	023
  0.11s 	024
  0.12s 	025
  0.12s 	026
  0.13s 	027
  0.13s 	028
  0.14s 	029
  0.15s 	030
  0.15s 	031
  0.16s 	032
  0.17s 	033
  0.18s 	034
  0.18s 	035
  0.19s 	036
  0.19s 	037
  0.20s 	038
  0.21s 	039
  0.21s 	040
  0.21s 	041
  0.22s 	042
  0.22s 	043
  0.23s 	044
  0.23s 	045
  0.24s 	046
  0.24s 	047
  0.24s 	048
  0.25s 	049
  0.25s 	050
  0.26s 	051
  0.26s 	052
  0.27s 	053
  0.27s 	054
  0.28s 	055
  0.28s 	056
  0.29s 	057
  0.29s 	058
  0.30s 	059
  0.30s 	060
  0.31s 	061
  0.31s 	062
  0.32s 	063
  0.32s 	064
  0.33s 	065
  0.33s 	066
  0.34s 	067
  0.34s 	068
  0.35s 	069
  0.35s 	167S
  0.36s 	167RA1
  0.36s 	167RA2
  0.37s 	167RA3
  0.37s 	167RA4
  

In [13]:
(nodeFeaturesProto, edgeFeaturesProto) = protoTf(data, textVolume)

  0.00s Proto TF generation
  0.00s 	001
  0.01s 	002
  0.02s 	003
  0.03s 	004
  0.04s 	005
  0.06s 	006
  0.06s 	007
  0.10s 	008
  0.11s 	009
  0.12s 	010
  0.13s 	011
  0.14s 	012
  0.16s 	013
  0.19s 	014
  0.20s 	015
  0.21s 	016
  0.22s 	017
  0.22s 	018
  0.23s 	019
  0.24s 	020
  0.26s 	021
  0.28s 	022
  0.28s 	023
  0.31s 	024
  0.32s 	025
  0.35s 	026
  0.36s 	027
  0.37s 	028
  0.39s 	029
  0.52s 	030
  0.53s 	031
  0.56s 	032
  0.61s 	033
  0.64s 	034
  0.66s 	035
  0.69s 	036
  0.71s 	037
  0.72s 	038
  0.72s 	039
  0.73s 	040
  0.73s 	041
  0.74s 	042
  0.74s 	043
  0.75s 	044
  0.75s 	045
  0.76s 	046
  0.76s 	047
  0.77s 	048
  0.77s 	049
  0.78s 	050
  0.79s 	051
  0.80s 	052
  0.97s 	053
  0.98s 	054
  0.99s 	055
  1.00s 	056
  1.03s 	057
  1.05s 	058
  1.05s 	059
  1.06s 	060
  1.08s 	061
  1.09s 	062
  1.12s 	063
  1.13s 	064
  1.15s 	065
  1.17s 	066
  1.18s 	067
  1.20s 	068
  1.21s 	069
  1.25s 	167S
  1.29s 	167RA1
  1.30s 	167RA2
  1.31s 	167RA3
  1.34s 	167R

In [14]:
(nodeFeatures, edgeFeatures) = tfFeatures(nodeFeaturesProto, edgeFeaturesProto)

0 inconsistencies
char    : 347770
word    :  45101
line    :  10012
verse   :   4532
text    :     75
char has offset 0
     1 is a char
     2 is a char
     3 is a char
word has offset 347770
347767 is a char
347768 is a char
347769 is a char
347770 is a char
347771 is a word
347772 is a word
347773 is a word
line has offset 392871
392868 is a word
392869 is a word
392870 is a word
392871 is a word
392872 is a line
392873 is a line
392874 is a line
verse has offset 402883
402880 is a line
402881 is a line
402882 is a line
402883 is a line
402884 is a verse
402885 is a verse
402886 is a verse
text has offset 407415
407412 is a verse
407413 is a verse
407414 is a verse
407415 is a verse
407416 is a text
407417 is a text
407418 is a text


# Metadata

We supply the necessary metadata for the new features.
We also have a few generic fields that will be added to all features.

In [15]:
metaData = {
  '': dict(
    createdBy='Peter Bisschop et al.',
    convertedBy='Dirk Roorda',
    name='Adhyāyas',
    title='Skandapurāṇa Project',
    source1='http://hum2.leidenuniv.nl/pdf/skandapurana-project/SP_all_devanagari.zip',
    source2='http://hum2.leidenuniv.nl/pdf/skandapurana-project/SP_all_transliteration.zip',
    provenance='https://www.universiteitleiden.nl/en/research/research-projects/humanities/the-skandapurāṇa-project',
    description='volumes I-III of the critical edition of the Skandapurāṇa online',
  ),
  'otext': {
    'sectionFeatures': ','.join((NAME, NUMBER, NUMBER)),
    'sectionTypes': ','.join((TEXT, VERSE, LINE)),
    'fmt:text-orig-full': f'{{{DCHAR}}}{{{LAST}}}',
    'fmt:text-orig-word': f'{{{DWORD}}} ',
    'fmt:text-trans-word': f'{{{TWORD}}} ',
  },
  'name@en': {
    'valueType': 'str',
    'language': 'english',
    'languageCode': 'en',
    'languageEnglish': 'English',
  },
}
nodeFeatures['name@en'] = nodeFeatures[NAME]

for feat in (OSLOTS, OTYPE, NAME, DWORD, TWORD, DCHAR, LAST, VOLUME):
  metaData.setdefault(feat, {})['valueType'] = 'str'
for feat in (NUMBER,):
  metaData.setdefault(feat, {})['valueType'] = 'int'
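
The triple braces in the `fmt:` entries are easy to misread: inside an f-string, `{{` and `}}` produce literal braces, while the inner `{DCHAR}` is interpolated. The sketch below shows the resulting templates and mimics how such a template could be filled in for one slot. The use of `str.format` is only an illustration, not a claim about how Text-Fabric applies formats internally.

```python
DCHAR = 'dchar'
LAST = 'last'
DWORD = 'dword'

char_fmt = f'{{{DCHAR}}}{{{LAST}}}'
word_fmt = f'{{{DWORD}}} '
print(char_fmt)
print(repr(word_fmt))

# filling in the template for a single slot (illustration only)
print(repr(char_fmt.format(dchar='य', last=' ')))
```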

## Save data as TF data set

Text-Fabric provides a `save` function that writes all the data we have composed
as a valid TF data set.

In [16]:
TF = Fabric(locations=[TF_DIR])
TF.save(nodeFeatures=nodeFeatures, edgeFeatures=edgeFeatures, metaData=metaData)

This is Text-Fabric 5.5.18
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

11 features found and 0 ignored
  0.00s Exporting 9 node and 1 edge and 1 config features to /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf:
   |     0.55s T dchar                to /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.08s T dword                to /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.59s T last                 to /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.00s T name                 to /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.00s T name@en              to /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.03s T number               to /Users/dirk/github/Dans-labs/text-fabric-data/

# Work with TF

See the [tutorial](Skandapurāṇa.ipynb)