<img src="images/skanda.jpg" width="400"/>
<img src="images/tficon-small.png" width="70"/>

# Tutorial

We show a few things that you can do with the Skandapurāṇa corpus in Text-Fabric format.

# Load modules

In [1]:
import os

In [2]:
from tf.fabric import Fabric

In [3]:
SK = os.path.expanduser('~/github/Dans-labs/text-fabric-data/sanskrit/sp/tf')

# Start Text-Fabric

In [4]:
TF = Fabric(locations=[SK])

This is Text-Fabric 5.5.18
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

11 features found and 0 ignored


# Load features

In [5]:
api = TF.load('volume dchar dword tword last number name')
api.makeAvailableIn(locals())

  0.00s loading features ...
   |     0.00s B name                 from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.00s B number               from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.15s B dchar                from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.03s B dword                from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.05s B last                 from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.03s B tword                from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.00s B volume               from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
  0.70s All features loaded/computed - for details use loadLog()


# Work

Now we can do real work.

We start with basic exploration.

# Showing

Line 2 in verse 3 in document 4:

In [6]:
node = T.nodeFromSection(('004', 3, 2))
print(T.text(L.d(node, otype='char')))

तेजसा जगदाविश्य आजगाम तदन्तिकम् 


A function to print a line with section indicator.
The line will be printed both in devanagari and in transcription.

In [7]:
def printLine(line, caption=False):
  section = T.sectionFromNode(line)
  words = L.d(line, otype='word')
  if caption:
    print('{} {}:{}'.format(*section))
  print(f'  {T.text(words, fmt="text-orig-word")}')
  print(f'  {T.text(words, fmt="text-trans-word")}')

We print the line with node number 400,000.
Note that this number has no recognizable meaning; it is just a kind of bar code.

In [8]:
printLine(400000, caption=True)

058 2:2
  सा तुष्टा वरदानेन चिन्तयन्ती तदा वरम् 
  sā tuṣṭā varadānena cintayantī tadā varam 


Text 40, divided into verses and lines:

In [9]:
def printText(name):
  node = T.nodeFromSection((name,))
  print(f'Text {name}')
  for verse in L.d(node, otype='verse'):
    print(f'verse {F.number.v(verse)}')
    for line in L.d(verse, otype='line'):
      print(f' line {F.number.v(line)}')
      printLine(line)

In [10]:
printText('040')

Text 040
verse 1
 line 0
  सुशर्मोवाच 
  suśarmovāca 
 line 1
  अतः परं प्रवक्ष्यामि कुम्भीपाकं महाभयम् 
  ataḥ paraṃ pravakṣyāmi kumbhīpākaṃ mahābhayam 
 line 2
  श्रोतॄणामपि तत्कालं भयदं ह्यकृतात्मनाम् 
  śrotṝṇāmapi tatkālaṃ bhayadaṃ hyakṛtātmanām 
verse 2
 line 1
  आयस्यस्तत्र बह्व्यश्च अञ्जनाचलसंनिभाः 
  āyasyastatra bahvyaśca añjanācalasaṃnibhāḥ 
 line 2
  कुम्भ्यस्तैलेन संपूर्णा वह्नितप्ताः सुदुःसहाः 
  kumbhyastailena saṃpūrṇā vahnitaptāḥ suduḥsahāḥ 
verse 3
 line 1
  दुष्कृतींस्तासु तप्तासु बद्ध्वा बद्ध्वा भयावहाः 
  duṣkṛtīṃstāsu taptāsu baddhvā baddhvā bhayāvahāḥ 
 line 2
  चरन्ति राक्षसा घोराः क्रन्दमानान्सुभैरवम् 
  caranti rākṣasā ghorāḥ krandamānānsubhairavam 
verse 4
 line 1
  वर्षकोटीश्चतस्रश्च पच्यन्ते तत्र जन्तवः 
  varṣakoṭīścatasraśca pacyante tatra jantavaḥ 
 line 2
  ये तानिमान्निबोध त्वमुच्यमानान्मया विभो 
  ye tānimānnibodha tvamucyamānānmayā vibho 
verse 5
 line 1
  इष्टकापाककारी च कुम्भपाचक एव च 
  iṣṭakāpākakārī ca kumbhapācaka eva ca 
 line 2
  तौ विनाशयते 

## Counting

How many letters?

In [11]:
len(F.otype.s('char'))

347770

How many words?

In [12]:
len(F.otype.s('word'))

45101

How many lines?

In [13]:
len(F.otype.s('line'))

10012

How many verses?

In [14]:
len(F.otype.s('verse'))

4532

How many texts?

In [15]:
len(F.otype.s('text'))

75

Frequency list of the devanagari characters:

In [16]:
for (d, f) in F.dchar.freqList():
  print(f'{f:>5} x {d}')

45382 x ्
28360 x ा
25100 x त
19963 x र
17067 x व
15626 x म
15421 x ि
15097 x न
14838 x य
14379 x स
10345 x े
 9746 x द
 9473 x ु
 9468 x प
 9218 x ं
 8263 x क
 6714 x च
 6498 x ः
 6211 x श
 5790 x ो
 4754 x ह
 4605 x ष
 4430 x भ
 4187 x ग
 3376 x ण
 3209 x ल
 3074 x ी
 2957 x ज
 2794 x ध
 2729 x ै
 2597 x ृ
 2308 x थ
 1831 x ू
 1587 x ब
 1204 x ट
  980 x ऽ
  963 x अ
  818 x ौ
  706 x उ
  655 x ख
  655 x ञ
  574 x छ
  570 x ङ
  491 x ए
  436 x इ
  431 x घ
  330 x ड
  311 x ठ
  295 x आ
  228 x †
  218 x फ
  182 x -
   90 x ऋ
   77 x ढ
   60 x ऊ
   36 x ॄ
   26 x ई
   15 x ऐ
   13 x झ
    4 x ओ
    4 x औ
    1 x .


How many distinct words, how many hapaxes, and the frequency list of the top 20:

In [17]:
freqWords = F.dword.freqList()
print(f'{len(freqWords)} distinct words')

21980 distinct words


In [18]:
print()
hapaxes = [x[0] for x in freqWords if x[1] == 1]
print(f'{len(hapaxes)} hapaxes')
print('\n'.join(hapaxes[0:10]))


17932 hapaxes
अकम्पनो
अकरोत्किं
अकरोत्किमिति
अकरोदुदकक्रियाम्
अकरोद्भैरवं
अकस्मात्सम्प्रदृश्यते
अकस्मादथ
अकस्मादभवत्सर्वं
अकार्याणि
अकिंचना


In [19]:
for (d, f) in freqWords[0:20]:
  print(f'{f:>5} x {d}')

 1312 x च
  527 x स
  450 x उवाच
  419 x न
  330 x ते
  265 x नमः
  246 x सनत्कुमार
  231 x चैव
  221 x तु
  195 x तदा
  191 x तं
  191 x ततो
  184 x वै
  175 x हि
  166 x तथा
  162 x -
  156 x नमो
  153 x मे
  142 x तत्र
  142 x व्यास
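
The same counts can be reproduced outside Text-Fabric on a toy word list. This is a minimal sketch: `freqListToy` and `toyWords` are hypothetical names, the words are invented, and the ordering (highest frequency first, ties broken by value) is an assumption modeled on the output above.

```python
from collections import Counter

def freqListToy(values):
  """Return (value, count) pairs, most frequent first, ties broken by value."""
  counts = Counter(values)
  return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy word list, invented for illustration (not taken from the corpus)
toyWords = ['ca', 'sa', 'ca', 'uvāca', 'na', 'ca', 'sa', 'tadā']

toyFreq = freqListToy(toyWords)
# A hapax is a word that occurs exactly once
toyHapaxes = [word for (word, count) in toyFreq if count == 1]

print(f'{len(toyFreq)} distinct words')
print(f'{len(toyHapaxes)} hapaxes')
for (word, count) in toyFreq:
  print(f'{count:>5} x {word}')
```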


# Parallels

We find parallels between lines.

We compare each pair of lines in the following way:

* the set of devanagari letters occurring in each line is computed;
* the similarity of two lines is defined as the number of characters
  in the intersection of both sets divided by the number of characters in the union of both sets;
* note that if two lines have the same sets of characters, their similarity is 1;
* we store the similarity of each pair of lines that are sufficiently similar;
* we use similarity thresholds: numbers between 0 and 1. If two lines have a similarity
  at or above the similarity threshold in use, we consider them similar;
* we play with similarity thresholds and list the similar lines for those thresholds.

In [20]:
def sim(a, b): return len(a & b) / len(a | b)
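
To see the measure at work, here is a small worked example on two toy strings. The strings are invented for illustration and are in transcription rather than devanagari. The function `sim` is repeated so that the example runs on its own.

```python
def sim(a, b):
  return len(a & b) / len(a | b)

# Character sets of two toy strings (invented for illustration)
setA = set('tadā')   # 4 distinct characters
setB = set('tathā')  # 4 distinct characters, the repeated t counts once

shared = setA & setB    # 3 characters occur in both
combined = setA | setB  # 5 characters occur in at least one

print(len(shared), len(combined))  # 3 5
print(sim(setA, setB))             # 3 / 5
print(sim(setA, setA))             # identical sets give 1.0
```

Note that `sim` divides by the size of the union, so it would raise a `ZeroDivisionError` if both sets were empty.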

## Preparation

Comparing all lines with each other takes a lot of computation: roughly 10,000 * 10,000 / 2,
or 50,000,000 comparisons.
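
The exact number of unordered pairs among n lines is n * (n - 1) / 2, slightly less than the rough estimate n * n / 2. A short sketch, using the line count of 10012 reported in the Counting section; `pairCount` is a hypothetical helper name.

```python
from itertools import combinations

def pairCount(n):
  """Number of unordered pairs of distinct items among n items."""
  return n * (n - 1) // 2

nLines = 10012  # number of line nodes reported in the Counting section

exact = pairCount(nLines)
estimate = nLines * nLines // 2

print(exact)
print(estimate)

# Sanity check of the formula on a small case
assert pairCount(5) == len(list(combinations(range(5), 2)))
```

The difference between both numbers is n / 2, which is negligible at this scale.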

It pays off to do as little work as possible per similarity comparison.

Beforehand, we map each line node to the set of characters it contains.

In [21]:
lines = F.otype.s('line')

def prepare():
  lineSets = {}
  for line in lines:
    chars = set()
    for word in L.d(line, otype='word'):
      chars |= set(F.dword.v(word))
    lineSets[line] = chars
  return lineSets

If two lines are too dissimilar, we do not store their similarity, in order to save space.

In [22]:
MIN_THRESHOLD = 0.75

In [23]:
lineSets = prepare()

## Filling the similarity matrix

In [24]:
def compare():
  nLines = len(lines)
  indent(reset=True)
  matrix = {}
  info(f'{int(nLines * nLines / 2)} comparisons to do')
  c = 0
  k = 0
  chunkSize = 1000000
  info(f'{c:>8} comparisons and {len(matrix):>8} similarities stored')
  for i in range(0, nLines - 1):
    for j in range(i + 1, nLines):
      c += 1
      k += 1
      if k == chunkSize:
        k = 0
        info(f'{c:>8} comparisons and {len(matrix):>8} similarities stored')
      li = lines[i]
      lj = lines[j]
      s = sim(lineSets[li], lineSets[lj])
      if s >= MIN_THRESHOLD:
        matrix[(li, lj)] = s
  info(f'{c:>8} comparisons and {len(matrix):>8} similarities stored')
  return matrix
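
The logic of `compare()` can be tried on a handful of toy sets before running it on the whole corpus. This is a sketch under assumptions: `compareToy` and `toySets` are hypothetical names, the sets are invented, and `itertools.combinations` stands in for the nested index loops, without the progress messages.

```python
from itertools import combinations

def sim(a, b):
  return len(a & b) / len(a | b)

def compareToy(lineSets, minThreshold):
  """Store the similarity of every pair that reaches minThreshold."""
  matrix = {}
  # combinations yields each unordered pair exactly once
  for (li, lj) in combinations(sorted(lineSets), 2):
    s = sim(lineSets[li], lineSets[lj])
    if s >= minThreshold:
      matrix[(li, lj)] = s
  return matrix

# Toy "lines": node number -> set of characters (invented for illustration)
toySets = {
  1: set('abcd'),
  2: set('abce'),
  3: set('abcd'),
  4: set('xyz'),
}

toyMatrix = compareToy(toySets, 0.5)
print(toyMatrix)
```

Only three of the six pairs are stored: the pairs involving line 4 share no characters with the others and fall below the threshold.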

In [25]:
matrix = compare()

  0.00s 50120072 comparisons to do
  0.00s        0 comparisons and        0 similarities stored
  3.24s  1000000 comparisons and     1618 similarities stored
  6.82s  2000000 comparisons and     3414 similarities stored
    10s  3000000 comparisons and     4743 similarities stored
    14s  4000000 comparisons and     5797 similarities stored
    17s  5000000 comparisons and     7135 similarities stored
    20s  6000000 comparisons and     9564 similarities stored
    23s  7000000 comparisons and    12522 similarities stored
    26s  8000000 comparisons and    15407 similarities stored
    30s  9000000 comparisons and    17576 similarities stored
    33s 10000000 comparisons and    19895 similarities stored
    36s 11000000 comparisons and    21592 similarities stored
    39s 12000000 comparisons and    22666 similarities stored
    43s 13000000 comparisons and    23752 similarities stored
    46s 14000000 comparisons and    26474 similarities stored
    49s 15000000 comparisons and   

# Selecting the most similar lines

We filter the matrix for similarities at or above a certain threshold.

In [26]:
def filterMatrix(matrix, threshold):
  return {
    (li, lj, s)
    for ((li, lj), s) in matrix.items()
    if s >= threshold
  }

How many pairs have similarity 1?

In [27]:
matrix100 = filterMatrix(matrix, 1)
len(matrix100)

32215

How many of these are identical pairs?

In [28]:
def devText(line):
  words = L.d(line, otype='word')
  return T.text(words, fmt='text-orig-word')

def lineEq(li, lj):
  return devText(li) == devText(lj)

matrixEq = {
  (li, lj)
  for (li, lj, s) in matrix100
  if lineEq(li, lj)
}

len(matrixEq)

32180
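
The two counts differ because similarity 1 only means that two lines use the same set of characters, not that their texts are equal. A small sketch with the transcriptions of two chapter-ending lines from this corpus. Working on the transcription instead of the devanagari, and the helper name `charSet`, are assumptions made for readability; the notebook itself compares devanagari character sets.

```python
def sim(a, b):
  return len(a & b) / len(a | b)

def charSet(text):
  # Ignore spaces, since the notebook builds its sets word by word
  return set(text.replace(' ', ''))

lineA = "iti skandapurāṇe viṃśatitamo 'dhyāyaḥ"
lineB = "iti skandapurāṇe saptaviṃśatitamo 'dhyāyaḥ"

# The extra word part "sapta" adds no character that lineA lacks
print(sim(charSet(lineA), charSet(lineB)))
print(lineA == lineB)
```

The similarity is 1.0 although the texts differ, so such a pair counts toward the first number but not toward the second.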

Now list the 100 most similar pairs of lines, excluding the perfectly identical pairs.

In [32]:
similars = []
for (li, lj, s) in sorted(
  filterMatrix(matrix, 0.95),
  key=lambda x: -x[2],
):
  if (li, lj) in matrixEq:
    continue
  if len(similars) >= 100:
    break
  similars.append((li, lj, s))
print(f'Collected {len(similars)} pairs')

Collected 100 pairs


In [33]:
for (li, lj, s) in similars: 
  print(f'similarity {s}:')
  printLine(li, caption=True)
  printLine(lj, caption=True)
  print()

similarity 1.0:
020 999:9
  इति स्कन्दपुराणे विंशतितमो ऽध्यायः 
  iti skandapurāṇe viṃśatitamo 'dhyāyaḥ 
027 999:9
  इति स्कन्दपुराणे सप्तविंशतितमो ऽध्यायः 
  iti skandapurāṇe saptaviṃśatitamo 'dhyāyaḥ 

similarity 1.0:
013 999:9
  इति स्कन्दपुराणे नाम त्रयोदशो ऽध्यायः 
  iti skandapurāṇe nāma trayodaśo 'dhyāyaḥ 
017 999:9
  इति स्कन्दपुराणे सप्तदशमो ऽध्यायः 
  iti skandapurāṇe saptadaśamo 'dhyāyaḥ 

similarity 1.0:
018 999:9
  इति स्कन्दपुराणे ऽष्टादशमो ऽध्यायः 
  iti skandapurāṇe 'ṣṭādaśamo 'dhyāyaḥ 
167S 999:9
  इति स्कन्दपुराणे सप्तषष्ट्युत्तरशततमो ऽध्यायः 
  iti skandapurāṇe saptaṣaṣṭyuttaraśatatamo 'dhyāyaḥ 

similarity 1.0:
042 999:9
  स्कन्दपुराणे द्वाचत्वारिंशो ऽध्यायः 
  skandapurāṇe dvācatvāriṃśo 'dhyāyaḥ 
047 999:9
  स्कन्दपुराणे सप्तचत्वारिंशो ऽध्यायः 
  skandapurāṇe saptacatvāriṃśo 'dhyāyaḥ 

similarity 1.0:
040 999:9
  स्कन्दपुराणे चत्वारिंशो ऽध्यायः 
  skandapurāṇe catvāriṃśo 'dhyāyaḥ 
044 999:9
  स्कन्दपुराणे चतुश्चत्वारिंशो ऽध्यायः 
  skandapurāṇe catuścatvāriṃśo 'dhy