<img src="images/skanda.jpg" width="400"/>
<img src="images/tf-small.png" width="70"/>

# Tutorial

We show a few things that you can do with the Skandapurāṇa corpus in Text-Fabric format.

# Load modules

In [1]:
import os
import pickle

For the next cells you have to install additional Python modules:

```
pip3 install python-Levenshtein
pip3 install text-fabric
```

In [2]:
from Levenshtein import ratio
from tf.fabric import Fabric

In [3]:
BASE = os.path.expanduser('~/github/Dans-labs/text-fabric-data/sanskrit/sp')
SK = f'{BASE}/tf'
TEMP_DIR = f'{BASE}/_temp'

os.makedirs(TEMP_DIR, exist_ok=True)

# Start Text-Fabric

In [4]:
TF = Fabric(locations=[SK])

This is Text-Fabric 5.5.18
Api reference : https://dans-labs.github.io/text-fabric/Api/General/
Tutorial      : https://github.com/Dans-labs/text-fabric/blob/master/docs/tutorial.ipynb
Example data  : https://github.com/Dans-labs/text-fabric-data

11 features found and 0 ignored


# Load features

In [5]:
api = TF.load('volume dchar dword tword last number name')
api.makeAvailableIn(locals())

  0.00s loading features ...
   |     0.00s B name                 from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.01s B number               from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.14s B dchar                from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.03s B dword                from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.06s B last                 from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.03s B tword                from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
   |     0.00s B volume               from /Users/dirk/github/Dans-labs/text-fabric-data/sanskrit/sp/tf
  0.71s All features loaded/computed - for details use loadLog()


# Work

Now we can do real work.

We start with basic exploration.

# Showing

Line 2 in verse 3 in document 4:

In [6]:
node = T.nodeFromSection(('004', 3, 2))
print(T.text(L.d(node, otype='char')))

तेजसा जगदाविश्य आजगाम तदन्तिकम् 


A function to print a line with section indicator.
The line will be printed both in devanagari and in transcription.

In [7]:
def printLine(line, caption=False):
  section = T.sectionFromNode(line)
  words = L.d(line, otype='word')
  if caption:
    print('{} {}:{}'.format(*section))
  print(f'  {T.text(words, fmt="text-orig-word")}')
  print(f'  {T.text(words, fmt="text-trans-word")}')

We print the line with node number 400,000.
Note that this number has no recognizable meaning; it is just an identifier, a kind of bar code.

In [8]:
printLine(400000, caption=True)

058 2:2
  सा तुष्टा वरदानेन चिन्तयन्ती तदा वरम् 
  sā tuṣṭā varadānena cintayantī tadā varam 


Text 040, divided into verses and lines:

In [9]:
def printText(name):
  node = T.nodeFromSection((name,))
  print(f'Text {name}')
  for verse in L.d(node, otype='verse'):
    print(f'verse {F.number.v(verse)}')
    for line in L.d(verse, otype='line'):
      print(f' line {F.number.v(line)}')
      printLine(line)

In [10]:
printText('040')

Text 040
verse 1
 line 0
  सुशर्मोवाच 
  suśarmovāca 
 line 1
  अतः परं प्रवक्ष्यामि कुम्भीपाकं महाभयम् 
  ataḥ paraṃ pravakṣyāmi kumbhīpākaṃ mahābhayam 
 line 2
  श्रोतॄणामपि तत्कालं भयदं ह्यकृतात्मनाम् 
  śrotṝṇāmapi tatkālaṃ bhayadaṃ hyakṛtātmanām 
verse 2
 line 1
  आयस्यस्तत्र बह्व्यश्च अञ्जनाचलसंनिभाः 
  āyasyastatra bahvyaśca añjanācalasaṃnibhāḥ 
 line 2
  कुम्भ्यस्तैलेन संपूर्णा वह्नितप्ताः सुदुःसहाः 
  kumbhyastailena saṃpūrṇā vahnitaptāḥ suduḥsahāḥ 
verse 3
 line 1
  दुष्कृतींस्तासु तप्तासु बद्ध्वा बद्ध्वा भयावहाः 
  duṣkṛtīṃstāsu taptāsu baddhvā baddhvā bhayāvahāḥ 
 line 2
  चरन्ति राक्षसा घोराः क्रन्दमानान्सुभैरवम् 
  caranti rākṣasā ghorāḥ krandamānānsubhairavam 
verse 4
 line 1
  वर्षकोटीश्चतस्रश्च पच्यन्ते तत्र जन्तवः 
  varṣakoṭīścatasraśca pacyante tatra jantavaḥ 
 line 2
  ये तानिमान्निबोध त्वमुच्यमानान्मया विभो 
  ye tānimānnibodha tvamucyamānānmayā vibho 
verse 5
 line 1
  इष्टकापाककारी च कुम्भपाचक एव च 
  iṣṭakāpākakārī ca kumbhapācaka eva ca 
 line 2
  तौ विनाशयते 

## Counting

How many letters?

In [11]:
len(F.otype.s('char'))

347770

How many words?

In [12]:
len(F.otype.s('word'))

45101

How many lines?

In [13]:
len(F.otype.s('line'))

10012

How many verses?

In [14]:
len(F.otype.s('verse'))

4532

How many texts?

In [15]:
len(F.otype.s('text'))

75

Frequency list of the devanagari characters:

In [16]:
for (d, f) in F.dchar.freqList():
  print(f'{f:>5} x {d}')

45382 x ्
28360 x ा
25100 x त
19963 x र
17067 x व
15626 x म
15421 x ि
15097 x न
14838 x य
14379 x स
10345 x े
 9746 x द
 9473 x ु
 9468 x प
 9218 x ं
 8263 x क
 6714 x च
 6498 x ः
 6211 x श
 5790 x ो
 4754 x ह
 4605 x ष
 4430 x भ
 4187 x ग
 3376 x ण
 3209 x ल
 3074 x ी
 2957 x ज
 2794 x ध
 2729 x ै
 2597 x ृ
 2308 x थ
 1831 x ू
 1587 x ब
 1204 x ट
  980 x ऽ
  963 x अ
  818 x ौ
  706 x उ
  655 x ख
  655 x ञ
  574 x छ
  570 x ङ
  491 x ए
  436 x इ
  431 x घ
  330 x ड
  311 x ठ
  295 x आ
  228 x †
  218 x फ
  182 x -
   90 x ऋ
   77 x ढ
   60 x ऊ
   36 x ॄ
   26 x ई
   15 x ऐ
   13 x झ
    4 x ओ
    4 x औ
    1 x .


How many distinct words, how many hapaxes, and the frequency list of the top 20:

In [17]:
freqWords = F.dword.freqList()
print(f'{len(freqWords)} distinct words')

21980 distinct words


In [18]:
print()
hapaxes = [x[0] for x in freqWords if x[1] == 1]
print(f'{len(hapaxes)} hapaxes')
print('\n'.join(hapaxes[0:10]))


17932 hapaxes
अकम्पनो
अकरोत्किं
अकरोत्किमिति
अकरोदुदकक्रियाम्
अकरोद्भैरवं
अकस्मात्सम्प्रदृश्यते
अकस्मादथ
अकस्मादभवत्सर्वं
अकार्याणि
अकिंचना


In [19]:
for (d, f) in freqWords[0:20]:
  print(f'{f:>5} x {d}')

 1312 x च
  527 x स
  450 x उवाच
  419 x न
  330 x ते
  265 x नमः
  246 x सनत्कुमार
  231 x चैव
  221 x तु
  195 x तदा
  191 x तं
  191 x ततो
  184 x वै
  175 x हि
  166 x तथा
  162 x -
  156 x नमो
  153 x मे
  142 x तत्र
  142 x व्यास
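
The same kind of counts can be reproduced without Text-Fabric from any list of tokens. Below is a minimal sketch with `collections.Counter` on an invented token list. Sorting by descending frequency and then by value is an assumption about the order that `freqList()` uses, not something taken from its documentation.

```python
from collections import Counter

tokens = ['ca', 'sa', 'ca', 'na', 'ca', 'sa', 'tu']  # invented sample, not corpus data

counts = Counter(tokens)

# (value, frequency) pairs, most frequent first, ties in alphabetical order
freqWords = sorted(counts.items(), key=lambda x: (-x[1], x[0]))

# Hapaxes are the values that occur exactly once
hapaxes = [w for (w, f) in freqWords if f == 1]

print(freqWords)
print(hapaxes)
```

With this layout, a hapax list is simply the tail of the frequency list.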


# Parallels

We find parallels between lines.

We compare each pair of lines in the following way:

* the set of devanagari letters occurring in each line is computed;
* the similarity of two lines is defined as the number of characters
  in the intersection of both sets divided by the number of characters in the union of both sets;
* note that if two lines have the same sets of characters, their similarity is 1;
* we store the similarity of each pair of lines that are sufficiently similar;
* we use similarity thresholds: numbers between 0 and 1. If two lines have a similarity
  at or above the threshold in use, we consider them similar;
* we play with similarity thresholds and list the similar lines for those thresholds.
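
The measure described in the list above is the Jaccard index of two character sets. Here is a minimal sketch on invented Latin-script strings (not corpus data). The name `jaccard` is ours, and treating two empty sets as identical is an assumption.

```python
def jaccard(a, b):
  """Size of the intersection divided by the size of the union."""
  union = a | b
  if not union:
    # Assumption: two empty sets count as identical
    return 1.0
  return len(a & b) / len(union)


# Two different strings built from the same characters score 1.0,
# which is why this measure produces false positives
sameChars = jaccard(set('listen'), set('silent'))

# Sharing 3 of 5 distinct characters gives 3 / 5
partial = jaccard(set('abcd'), set('bcde'))

print(sameChars, partial)
```

Word order, repetition, and line length play no role here: only the inventory of characters counts.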

## Discussion

Our similarity measure is blunt: if two lines are composed of the same set of characters, they are considered 100% similar. We expect to get many false positives.
We also expect that the real parallels may be obscured by the many false positives.

Yet, we do not know what will happen, so we just start with a coarse but quick method.
Later, when we refine, we can observe how much more sophistication buys us.

We call the coarse, set-based method `SET`.

In [20]:
def simSET(a, b): return len(a & b) / len(a | b)

For the more sophisticated method
we'll use the module [python-Levenshtein](https://pypi.org/project/python-Levenshtein/)
and especially its method `ratio()`.

This method expresses similarity in terms of how many edits you need to get from one string to the other.
Another way to characterize it is as twice the length of the longest common subsequence of the two strings, divided by their combined length.

In the sequel, we call this method `LCS`.
We do not have to program it, because the Levenshtein module we have imported contains
an efficient implementation of it: the function `ratio()`.

In [21]:
simLCS = ratio
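
To make the LCS characterization concrete, here is a pure-Python sketch. It assumes that `ratio()` equals twice the length of the longest common subsequence divided by the combined length of both strings. The names `lcsLength` and `lcsRatio` are illustrative and not part of the Levenshtein module, and this quadratic loop is far slower than the library function.

```python
def lcsLength(a, b):
  """Length of the longest common subsequence, by dynamic programming."""
  prev = [0] * (len(b) + 1)
  for ca in a:
    cur = [0]
    for (j, cb) in enumerate(b, start=1):
      if ca == cb:
        cur.append(prev[j - 1] + 1)
      else:
        cur.append(max(prev[j], cur[j - 1]))
    prev = cur
  return prev[-1]


def lcsRatio(a, b):
  """Assumed equivalent of ratio(): 2 * LCS / (len(a) + len(b))."""
  total = len(a) + len(b)
  if total == 0:
    # Assumption: two empty strings are fully similar
    return 1.0
  return 2 * lcsLength(a, b) / total


print(lcsRatio('kitten', 'sitting'))
```

For `kitten` and `sitting` the longest common subsequence is `ittn`, so the sketch yields 8/13. Unlike the SET method, this measure is sensitive to the order of the characters.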

## Preparation

Comparing all lines with each other takes a lot of computations: roughly 10,000 * 10,000 / 2,
or 50,000,000.
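
The estimate can be made exact: n lines give n(n-1)/2 unordered pairs. Here is a small sketch using the line count of 10012 reported in the counting section.

```python
from itertools import combinations
from math import comb

nLines = 10012  # number of line nodes reported earlier in this tutorial

# Number of unordered pairs of distinct lines: n * (n - 1) / 2
nPairs = comb(nLines, 2)
print(nPairs)

# Sanity check on a small case: 4 items give 6 pairs
assert len(list(combinations(range(4), 2))) == comb(4, 2) == 6
```

That is 50,115,066 comparisons, close to the rough figure above.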

It pays off to do as little work as possible per similarity comparison.

Beforehand, we map each line node to the set of characters it contains.

We will vary between methods of similarity computation, which will also require different
preparation routines.

Here we define preparation for the SET method, as outlined above.

In [22]:
lines = F.otype.s('line')

def prepareSET():
  lineSets = {}
  for line in lines:
    chars = set()
    for word in L.d(line, otype='word'):
      chars |= set(F.dword.v(word))
    lineSets[line] = chars
  return lineSets

In order to use the LCS method, we need to compare strings.
We will ignore the spacing between words.

So we define a new preparation step that computes, for each line, its devanagari text as one continuous string.

In [23]:
def prepareLCS():
  lineStrings = {}
  for line in lines:
    lineString = ''.join(
      F.dword.v(word)
      for word in L.d(line, otype='word')
    )
    lineStrings[line] = lineString
  return lineStrings

In [24]:
def prepare(method):
  return prepareSET() if method == 'SET' else prepareLCS()

If 2 lines are too dissimilar, we will not store their similarity, in order to save space.

In [25]:
MIN_THRESHOLD = 0.75

## Filling the similarity matrix

Computing the similarity matrix is a bit labor-intensive.
Before we compute it, we check whether we have computed it before and saved it to disk.
If so, we load the matrix from there.
If not, we compute it, and save it to disk.

Note that for each method we compute a separate similarity matrix.
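
The compute-or-load pattern described above can be sketched in isolation. The helper `loadOrCompute`, the demo file name, and the toy data are hypothetical; the example writes to a temporary directory so that it does not touch the real matrix files.

```python
import os
import pickle
import tempfile


def loadOrCompute(path, computeFn):
  """Return the pickled result at path, computing and saving it if absent."""
  if os.path.exists(path):
    with open(path, 'rb') as fh:
      return pickle.load(fh)
  result = computeFn()
  with open(path, 'wb') as fh:
    pickle.dump(result, fh)
  return result


calls = []

def expensive():
  calls.append(1)  # record that the computation really ran
  return {(1, 2): 0.8, (1, 3): 0.9}  # invented similarities


with tempfile.TemporaryDirectory() as tmpDir:
  cachePath = os.path.join(tmpDir, 'matrixDEMO')
  first = loadOrCompute(cachePath, expensive)   # computes and saves
  second = loadOrCompute(cachePath, expensive)  # loads from disk

print(first == second, len(calls))
```

A consequence of this pattern is that a saved file is reused as long as it exists. If the corpus or the minimum threshold changes, the saved matrix has to be deleted by hand before it is recomputed.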

In [26]:
MATRIX_FILE_BASE = f'{TEMP_DIR}/matrix'

def compare(lineData, method):
  matrixFile = MATRIX_FILE_BASE + method
  sim = simSET if method == 'SET' else simLCS
  indent(reset=True)
  if os.path.exists(matrixFile):
    info('Using existing matrix ...')
    with open(matrixFile, 'rb') as mf:
      matrix = pickle.load(mf)
    info(f'{len(matrix)} similarities')
    return matrix
  
  nLines = len(lines)
  matrix = {}
  info(f'{nLines * (nLines - 1) // 2} comparisons to do')
  c = 0
  k = 0
  chunkSize = 1000000
  info(f'{c:>8} comparisons and {len(matrix):>8} similarities stored')
  for i in range(0, nLines - 1):
    for j in range(i + 1, nLines):
      c += 1
      k += 1
      if k == chunkSize:
        k = 0
        info(f'{c:>8} comparisons and {len(matrix):>8} similarities stored')
      li = lines[i]
      lj = lines[j]
      s = sim(lineData[li], lineData[lj])
      if s >= MIN_THRESHOLD:
        matrix[(li, lj)] = s
  info(f'{c:>8} comparisons and {len(matrix):>8} similarities stored')
  with open(matrixFile, 'wb') as mf:
    pickle.dump(matrix, mf)
  return matrix

Now together:

In [27]:
def parallels(method):
  lineData = prepare(method)
  matrix = compare(lineData, method)
  return (lineData, matrix)

In [28]:
(lineSets, matrixSET) = parallels('SET')

  0.00s Using existing matrix ...
  0.05s 93603 similarities


In [29]:
(lineStrings, matrixLCS) = parallels('LCS')

  0.00s Using existing matrix ...
  0.02s 35177 similarities


# Exploring the results

We need to filter the matrix in order to focus on the most similar pairs.

The following filter function works for matrices in general.

In [30]:
def filterMatrix(matrix, threshold):
  return {
    (li, lj, s)
    for ((li, lj), s) in matrix.items()
    if s >= threshold
  }
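
To see what the filter returns, here is a small sketch on an invented matrix. The function is repeated so that the example runs on its own; the node numbers and similarities are made up.

```python
def filterMatrix(matrix, threshold):
  # Same logic as the function above, repeated to keep this example self-contained
  return {
    (li, lj, s)
    for ((li, lj), s) in matrix.items()
    if s >= threshold
  }


toyMatrix = {(1, 2): 1.0, (1, 3): 0.8, (2, 3): 0.95}  # invented values

# Keep the pairs with similarity at or above 0.95
strict = filterMatrix(toyMatrix, 0.95)

# The result is a set of triples, so sort it to rank by similarity
ranked = sorted(strict, key=lambda x: -x[2])
print(ranked)
```

Because the result is an unordered set, every listing further on sorts it by the third member of each triple before printing.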

# Working with the SET results

We start investigating the results of the SET method.

## Selecting the most similar lines

We filter the matrix for similarities above a certain threshold.

How many have similarity 1?

In [31]:
matrix100 = filterMatrix(matrixSET, 1)
len(matrix100)

32215

How many of these are identical pairs?

In [32]:
def devText(line):
  words = L.d(line, otype='word')
  return T.text(words, fmt='text-orig-word')

def transText(line):
  words = L.d(line, otype='word')
  return T.text(words, fmt='text-trans-word')

def lineEq(li, lj):
  return devText(li) == devText(lj)

matrixEq = {
  (li, lj)
  for (li, lj, s) in matrix100
  if lineEq(li, lj)
}

len(matrixEq)

32180

We want to spot interesting similarities.
But before we do that, we want to get a feeling for all those identical pairs.

How many distinct lines are there with multiple occurrences in the corpus?

In [33]:
repeatedLines = {devText(li) for (li, lj) in matrixEq}
len(repeatedLines)

66

So only a few distinct lines are repeated verbatim, but some of them are repeated very often. Here they are:

In [34]:
print('\n'.join(sorted(repeatedLines)))

- - - - - - - - - - - - - - - - 
अभितः स्तूयमानश्च सूतमागधवन्दिभिः 
अभेद्यश्चैव वज्रेण मत्प्रसादाद्भविष्यसि 
अमरो जरया त्यक्तः सर्वदुःखविवर्जितः 
उपमन्युरुवाच 
उशीरबीजः शैलेन्द्रस्तत्राश्रमपदं महत् 
ऋषय ऊचुः 
कारोहणमिति ख्यातं त्रिनेत्रायतनं महत् 
काष्ठकूट उवाच 
कुबेर उवाच 
कृमिकीटेषु जायन्ते सहस्राणां शतानि षट् 
ग्राह उवाच 
चतुर्थेनावतस्थे च यतः स भगवान्प्रभुः 
छगलण्डेश्वरे चैव महाभैरवजं फलम् 
जैगीषव्य उवाच 
तस्मिंश्चन्द्रप्रभं लिङ्गं स्फाटिकं मणिरञ्जितम् 
तृतीयश्चाभवन्मित्रो मथुरायां महामनाः 
त्रेतायां दिण्डिमुण्डश्च शिरांसि विनिकृत्तवान् 
दक्ष उवाच 
दधीच उवाच 
दधीचेन महद्दिव्यं पुण्यमायतनं कृतम् 
देव उवाच 
देवा ऊचुः 
देव्युवाच 
द्वितीयं द्वापरे प्राप्ते तृतीयं च कलौ युगे 
नन्द्युवाच 
नमः सहस्रनेत्राय शतनेत्राय वै नमः 
नमस्ते सर्वलोकेश नमस्ते लोकभावन 
पात्यन्ते विवशा मूढा यावत्पापं क्षयं गतम् 
पितर ऊचुः 
पितामह उवाच 
पुण्यमायतनं तत्र च्यवनेनाभिनिर्मितम् 
पुष्पभद्रमिति ख्यातं विन्ध्यप्रस्थे द्रुमावृतम् 
ब्रह्मचारी चतुर्थस्तु कुरुष्वेव सुगोत्रजः 
ब्रह्मोवाच 
भगवन्यदि तुष्टो ऽसि यदि देय

Let's get the transcriptions as well, plus the number of times each line is repeated.

In [35]:
repeatedOccs = {}
for line in lines:
  d = devText(line)
  if d in repeatedLines:
    repeatedOccs.setdefault(d, set()).add(line)

In [36]:
for (repeated, occs) in sorted(
  repeatedOccs.items(), 
  key=lambda x: -len(x[1]),
):
  print(f'{len(occs):>3} x\n\t{repeated}\n\t{transText(sorted(occs)[0])}')

246 x
	सनत्कुमार उवाच 
	sanatkumāra uvāca 
 38 x
	देव उवाच 
	deva uvāca 
 30 x
	व्यास उवाच 
	vyāsa uvāca 
 26 x
	देव्युवाच 
	devyuvāca 
 21 x
	ब्रह्मोवाच 
	brahmovāca 
 15 x
	सुशर्मोवाच 
	suśarmovāca 
 11 x
	पितर ऊचुः 
	pitara ūcuḥ 
  8 x
	वायुरुवाच 
	vāyuruvāca 
  8 x
	नन्द्युवाच 
	nandyuvāca 
  7 x
	- - - - - - - - - - - - - - - - 
	- - - - - - - - - - - - - - - - 
  7 x
	उपमन्युरुवाच 
	upamanyuruvāca 
  6 x
	देवा ऊचुः 
	devā ūcuḥ 
  6 x
	सुकेश उवाच 
	sukeśa uvāca 
  4 x
	पितामह उवाच 
	pitāmaha uvāca 
  4 x
	शिलाद उवाच 
	śilāda uvāca 
  4 x
	काष्ठकूट उवाच 
	kāṣṭhakūṭa uvāca 
  3 x
	ऋषय ऊचुः 
	ṛṣaya ūcuḥ 
  3 x
	विष्णुरुवाच 
	viṣṇuruvāca 
  3 x
	ग्राह उवाच 
	grāha uvāca 
  3 x
	वसिष्ठ उवाच 
	vasiṣṭha uvāca 
  3 x
	भगवानुवाच 
	bhagavānuvāca 
  3 x
	राजोवाच 
	rājovāca 
  3 x
	शक्र उवाच 
	śakra uvāca 
  3 x
	अमरो जरया त्यक्तः सर्वदुःखविवर्जितः 
	amaro jarayā tyaktaḥ sarvaduḥkhavivarjitaḥ 
  2 x
	नमः सहस्रनेत्राय शतनेत्राय वै नमः 
	namaḥ sahasranetrāya śatanetrāya vai namaḥ 
  2 x
	सर्वपा
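
A line that occurs n times contributes n(n-1)/2 identical pairs, so a handful of very frequent lines can account for most of the identical pairs. Here is a small sketch using the three highest counts from the listing above.

```python
from math import comb

# Occurrence counts taken from the top of the listing above
topCounts = {'sanatkumāra uvāca': 246, 'deva uvāca': 38, 'vyāsa uvāca': 30}

# A line occurring n times yields n * (n - 1) / 2 pairs of identical lines
pairCounts = {text: comb(n, 2) for (text, n) in topCounts.items()}

for (text, nPairs) in pairCounts.items():
  print(f'{nPairs:>6} pairs from {text}')
```

The most frequent line alone accounts for 30135 of the 32180 identical pairs.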

We continue with the non-identical pairs.

Our first attempt is to see what the 100 most similar
pairs are, excluding the identical pairs.

It turns out that most of them are between the last lines of texts,
which apparently state the ordinal number of the text within the corpus.

So in the second attempt we'll exclude these lines; they all have verse number 999.

In [37]:
similars = []
for (li, lj, s) in sorted(
  filterMatrix(matrixSET, 0.95),
  key=lambda x: -x[2],
):
  if (li, lj) in matrixEq: continue
  if len(similars) >= 100:
    break
  similars.append((li, lj, s))
print(f'Collected {len(similars)} pairs')

Collected 100 pairs


In [38]:
for (li, lj, s) in similars: 
  print(f'similarity {s}:')
  printLine(li, caption=True)
  printLine(lj, caption=True)
  print()

similarity 1.0:
020 999:9
  इति स्कन्दपुराणे विंशतितमो ऽध्यायः 
  iti skandapurāṇe viṃśatitamo 'dhyāyaḥ 
027 999:9
  इति स्कन्दपुराणे सप्तविंशतितमो ऽध्यायः 
  iti skandapurāṇe saptaviṃśatitamo 'dhyāyaḥ 

similarity 1.0:
013 999:9
  इति स्कन्दपुराणे नाम त्रयोदशो ऽध्यायः 
  iti skandapurāṇe nāma trayodaśo 'dhyāyaḥ 
017 999:9
  इति स्कन्दपुराणे सप्तदशमो ऽध्यायः 
  iti skandapurāṇe saptadaśamo 'dhyāyaḥ 

similarity 1.0:
018 999:9
  इति स्कन्दपुराणे ऽष्टादशमो ऽध्यायः 
  iti skandapurāṇe 'ṣṭādaśamo 'dhyāyaḥ 
167S 999:9
  इति स्कन्दपुराणे सप्तषष्ट्युत्तरशततमो ऽध्यायः 
  iti skandapurāṇe saptaṣaṣṭyuttaraśatatamo 'dhyāyaḥ 

similarity 1.0:
042 999:9
  स्कन्दपुराणे द्वाचत्वारिंशो ऽध्यायः 
  skandapurāṇe dvācatvāriṃśo 'dhyāyaḥ 
047 999:9
  स्कन्दपुराणे सप्तचत्वारिंशो ऽध्यायः 
  skandapurāṇe saptacatvāriṃśo 'dhyāyaḥ 

similarity 1.0:
040 999:9
  स्कन्दपुराणे चत्वारिंशो ऽध्यायः 
  skandapurāṇe catvāriṃśo 'dhyāyaḥ 
044 999:9
  स्कन्दपुराणे चतुश्चत्वारिंशो ऽध्यायः 
  skandapurāṇe catuścatvāriṃśo 'dhy

Now the second attempt, excluding lines with verse/line number 999:9.

It turns out that there are fewer than a hundred pairs with a similarity of at least 0.95.

Here they are, for what it is worth.

In [39]:
similars = []
for (li, lj, s) in sorted(
  filterMatrix(matrixSET, 0.95),
  key=lambda x: -x[2],
):
  if (li, lj) in matrixEq: continue
  vi = L.u(li, otype='verse')[0]
  vj = L.u(lj, otype='verse')[0]
  if (
    (F.number.v(vi) == 999 and F.number.v(li) == 9)
    or
    (F.number.v(vj) == 999 and F.number.v(lj) == 9)
  ):
    continue
  if len(similars) >= 100:
    break
  similars.append((li, lj, s))
print(f'Collected {len(similars)} pairs')

Collected 19 pairs


In [40]:
for (li, lj, s) in similars: 
  print(f'similarity {s}:')
  printLine(li, caption=True)
  printLine(lj, caption=True)
  print()

similarity 1.0:
012 36:1
  श्रुत्वा तु देवी तं नादं विप्रस्यार्तस्य शोभना 
  śrutvā tu devī taṃ nādaṃ viprasyārtasya śobhanā 
028 1:1
  ततो व्यास पुनर्देवी पतिं व्रतपतिं शुभा 
  tato vyāsa punardevī patiṃ vratapatiṃ śubhā 

similarity 1.0:
167S 95:1
  अन्यदुक्तरथं नाम भवस्यायतनं शुभम् 
  anyaduktarathaṃ nāma bhavasyāyatanaṃ śubham 
167RA5 17:1
  अन्यद्युक्तरथं नाम भवस्यायतनं शुभम् 
  anyadyuktarathaṃ nāma bhavasyāyatanaṃ śubham 

similarity 1.0:
033 115:1
  निवासार्थं च दिव्यं तमिन्द्रद्वीपं ददामि ते 
  nivāsārthaṃ ca divyaṃ tamindradvīpaṃ dadāmi te 
041 17:2
  स्वस्थानं च समासीनांस्त्रासयेद्वारयेदपि 
  svasthānaṃ ca samāsīnāṃstrāsayedvārayedapi 

similarity 1.0:
167S 105:1
  एको राक्षसशार्दूलो यत्राद्यापि विभीषणः 
  eko rākṣasaśārdūlo yatrādyāpi vibhīṣaṇaḥ 
167RA5 31:1
  एत्य राक्षसशार्दूलो यत्राद्यापि विभीषणः 
  etya rākṣasaśārdūlo yatrādyāpi vibhīṣaṇaḥ 

similarity 0.9565217391304348:
167S 56:2
  तं दृष्ट्वा मनुजो व्यास राजसूयफलं लभेत् 
  taṃ dṛṣṭvā manujo vyāsa rājasūyaphalaṃ labhe

Several of these pairs score 1.0 although the lines are unrelated: they merely use the same characters.
These results suggest basing the similarity on another measure,
such as [edit distance](https://en.wikipedia.org/wiki/Levenshtein_distance).

# Working with the LCS results

We follow the same exploration for the LCS method.

## Selecting the most similar lines

We filter the matrix for similarities above a certain threshold.

How many have similarity 1?

In [41]:
matrix100 = filterMatrix(matrixLCS, 1)
len(matrix100)

32180

How many of these are identical pairs?

In [42]:
matrixEq = {
  (li, lj)
  for (li, lj, s) in matrix100
  if lineEq(li, lj)
}

len(matrixEq)

32180

This is the same number we got with the SET method, unsurprisingly.

Our first attempt is to see what the 100 most similar
pairs are, excluding the identical pairs.

In [43]:
similars = []
for (li, lj, s) in sorted(
  filterMatrix(matrixLCS, 0.95),
  key=lambda x: -x[2],
):
  if (li, lj) in matrixEq: continue
  if len(similars) >= 100:
    break
  similars.append((li, lj, s))
print(f'Collected {len(similars)} pairs')

Collected 15 pairs


In [44]:
for (li, lj, s) in similars: 
  print(f'similarity {s}:')
  printLine(li, caption=True)
  printLine(lj, caption=True)
  print()

similarity 0.9743589743589743:
167S 116:2
  द्वापरे चाषढिर्भूत्वा नृत्तेनानुगृहीतवान् 
  dvāpare cāṣaḍhirbhūtvā nṛttenānugṛhītavān 
167RA5 50:2
  द्वापरे चाषढिर्भूत्वा नृत्येनानुगृहीतवान् 
  dvāpare cāṣaḍhirbhūtvā nṛtyenānugṛhītavān 

similarity 0.9705882352941176:
167S 134:1
  आश्रमो योगिनां यत्र प्रवृत्तः पापनाशनः 
  āśramo yogināṃ yatra pravṛttaḥ pāpanāśanaḥ 
167RA5 105:2
  आश्रयो योगिनां यत्र प्रवृत्तः पापनाशनः 
  āśrayo yogināṃ yatra pravṛttaḥ pāpanāśanaḥ 

similarity 0.9705882352941176:
167S 136:1
  महीनर्मदयोर्मध्यं सह्यस्य च यदुत्तरम् 
  mahīnarmadayormadhyaṃ sahyasya ca yaduttaram 
167RA5 108:1
  महीनर्मदयोर्मध्ये सह्यस्य च यदुत्तरम् 
  mahīnarmadayormadhye sahyasya ca yaduttaram 

similarity 0.96875:
019 999:9
  इति स्कन्दपुराणे ऊनविंशतितमो ऽध्यायः 
  iti skandapurāṇe ūnaviṃśatitamo 'dhyāyaḥ 
020 999:9
  इति स्कन्दपुराणे विंशतितमो ऽध्यायः 
  iti skandapurāṇe viṃśatitamo 'dhyāyaḥ 

similarity 0.967741935483871:
167S 95:1
  अन्यदुक्तरथं नाम भवस्यायतनं शुभम् 
  anyaduktarathaṃ n

There are not many pairs that have such a high similarity, and we do see the 999:9 lines.

In our second attempt, we exclude lines with verse/line number 999:9, and we lower the required
similarity to 0.8.

It also appears that many similar pairs consist of two single-word lines.
We skip those pairs as well.

Let's see what we get.

In [45]:
similars = []
for (li, lj, s) in sorted(
  filterMatrix(matrixLCS, 0.8),
  key=lambda x: -x[2],
):
  if (li, lj) in matrixEq: continue
  vi = L.u(li, otype='verse')[0]
  vj = L.u(lj, otype='verse')[0]
  if (
    (F.number.v(vi) == 999 and F.number.v(li) == 9)
    or
    (F.number.v(vj) == 999 and F.number.v(lj) == 9)
  ):
    continue
  if len(devText(li).split()) == 1 and len(devText(lj).split()) == 1:
    continue
  if len(similars) >= 100:
    break
  similars.append((li, lj, s))
print(f'Collected {len(similars)} pairs')

Collected 87 pairs


In [46]:
for (li, lj, s) in similars: 
  print(f'similarity {s}:')
  printLine(li, caption=True)
  printLine(lj, caption=True)
  print()

similarity 0.9743589743589743:
167S 116:2
  द्वापरे चाषढिर्भूत्वा नृत्तेनानुगृहीतवान् 
  dvāpare cāṣaḍhirbhūtvā nṛttenānugṛhītavān 
167RA5 50:2
  द्वापरे चाषढिर्भूत्वा नृत्येनानुगृहीतवान् 
  dvāpare cāṣaḍhirbhūtvā nṛtyenānugṛhītavān 

similarity 0.9705882352941176:
167S 134:1
  आश्रमो योगिनां यत्र प्रवृत्तः पापनाशनः 
  āśramo yogināṃ yatra pravṛttaḥ pāpanāśanaḥ 
167RA5 105:2
  आश्रयो योगिनां यत्र प्रवृत्तः पापनाशनः 
  āśrayo yogināṃ yatra pravṛttaḥ pāpanāśanaḥ 

similarity 0.9705882352941176:
167S 136:1
  महीनर्मदयोर्मध्यं सह्यस्य च यदुत्तरम् 
  mahīnarmadayormadhyaṃ sahyasya ca yaduttaram 
167RA5 108:1
  महीनर्मदयोर्मध्ये सह्यस्य च यदुत्तरम् 
  mahīnarmadayormadhye sahyasya ca yaduttaram 

similarity 0.967741935483871:
167S 95:1
  अन्यदुक्तरथं नाम भवस्यायतनं शुभम् 
  anyaduktarathaṃ nāma bhavasyāyatanaṃ śubham 
167RA5 17:1
  अन्यद्युक्तरथं नाम भवस्यायतनं शुभम् 
  anyadyuktarathaṃ nāma bhavasyāyatanaṃ śubham 

similarity 0.967741935483871:
009 14:1
  यदि तुष्टो ऽसि देवेश यदि देयो वरश्च