# Metrical Analysis of Sanskrit Ninth Class Verb Forms

## Getting Verbal Roots 

In [1]:
!mkdir -p downloads
!mkdir -p data

In [None]:
!wget -O downloads/whitney_roots.pdf http://gretil.sub.uni-goettingen.de/gretil_elib/Whi885__Whitney_Roots-ACCENTED.pdf

In [29]:
# install pdftk if not already there. eg: for ubuntu: sudo apt install pdftk
!pdftk downloads/whitney_roots.pdf cat 229 output data/whitney_roots_ninth_class.pdf

# for our control data
!pdftk downloads/whitney_roots.pdf cat 228 output data/whitney_roots_fifth_class.pdf

In [30]:
# produces data/whitney_roots_ninth_class.txt
!pdftotext data/whitney_roots_ninth_class.pdf

# produces data/whitney_roots_fifth_class.txt
!pdftotext data/whitney_roots_fifth_class.pdf

Clean up the text versions manually, fixing formatting and diacritics.

One extra step: we rewrite a form like _mī̆nā_ as _minā/mīnā_, i.e. we spell out the variation in the root vowel as two explicit stem forms. This makes the variants easier to visualize and process later. Whitney marks only three stems this way here (_mī̆nā_, _vlī̆nā_ and _dhū̆nī_), so doing this by hand is easy; with many more such forms it would be worth automating.
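If many more stems carried this marking, the manual rewrite described above could be automated along the following lines. This is only a sketch: the function name `expand_variable_length` is hypothetical, and it assumes the "short or long" vowel is encoded as a base vowel plus a combining macron and a combining breve, which may not hold for every cleaned-up text file.

```python
import re
import unicodedata

MACRON = "\u0304"  # combining macron (long vowel)
BREVE = "\u0306"   # combining breve (here: "short or long")

# After NFD decomposition, a vowel such as ī̆ is a base letter followed by
# a macron and a breve (accepted in either order). Only a/i/u are handled.
_VARIABLE_VOWEL = re.compile(f"([aiu])(?:{MACRON}{BREVE}|{BREVE}{MACRON})")


def expand_variable_length(stem):
    """Return [short, long] variants of a stem, or [stem] if nothing is marked."""
    decomposed = unicodedata.normalize("NFD", stem)
    if not _VARIABLE_VOWEL.search(decomposed):
        return [stem]
    short_form = _VARIABLE_VOWEL.sub(r"\1", decomposed)
    long_form = _VARIABLE_VOWEL.sub(r"\1" + MACRON, decomposed)
    return [unicodedata.normalize("NFC", form) for form in (short_form, long_form)]


# "mī̆nā" written with explicit escapes to make the encoding unambiguous
print("/".join(expand_variable_length("m\u012b\u0306n\u0101")))  # prints minā/mīnā
```

If a stem had several marked vowels, this sketch would shorten or lengthen all of them together; none of the three stems here needs more than that.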

Final results are in [data/whitney_roots_ninth_class_cleaned.txt](data/whitney_roots_ninth_class_cleaned.txt) and [data/whitney_roots_fifth_class_cleaned.txt](data/whitney_roots_fifth_class_cleaned.txt).

In [225]:
# TODO try to get the 9th class forms/roots directly from Lubotsky's concordance?

## Parsing Verbal Roots Info

In [9]:
# in the same folder as this notebook
import src.lib.roots as roots

# useful during testing to pick up changes in the file
import importlib
importlib.reload(roots)

whitney_roots = roots.parse_whitney_roots([
    "data/whitney_roots_ninth_class_cleaned.txt",
    "data/whitney_roots_fifth_class_cleaned.txt",
])

In [10]:
import pandas

In [11]:
df_whitney_roots = pandas.DataFrame.from_dict(whitney_roots)
df_whitney_roots.to_csv("data/whitney_roots.csv", index=None)
df_whitney_roots.head()

Unnamed: 0,root_guess,variant_no,strong_stem,weak_stem,weak_only,attestation_texts,language_period,present_class
0,i 2,,inā,inī,True,V.,Earlier,ninth
1,iṣ,,iṣṇā,iṣṇī,False,,Earlier,ninth
2,ubh,,ubhnā,ubhnī,False,V.,Earlier,ninth
3,uṣ,,uṣṇā,uṣṇī,False,V.,Earlier,ninth
4,kṣi,,kṣiṇā,kṣiṇī,False,V.B.,Earlier,ninth


## Annotating Verbal Roots with Rig Veda Attestations (Manual)

Using Lubotsky's concordance, attestation info is manually added to [data/whitney_roots.csv](data/whitney_roots.csv).

Final results are in [data/roots_manual.csv](data/roots_manual.csv).

In [12]:
#df_roots_manual = pandas.read_csv("data/roots_manual.csv")
df_roots_manual = pandas.read_csv("data/roots_manual.csv", keep_default_na=False)

In [13]:
# TODO remove test df once we have all the annotations
#df_roots_test_manual = df_roots_manual[~df_roots["notes"].isna()]
df_roots_test_manual = df_roots_manual[df_roots_manual["notes"].str.len() > 0]
df_roots_test_manual.head()

Unnamed: 0,root,variant_no,stem,weak_only,attestation_texts,language_period,rig_veda_weak_attestations,rig_veda_strong_attestations,lubotsky_page_no,notes
1,iṣ,,iṣṇā,False,,Earlier,,1.63.2d,1:,iSnAsi
18,vr̥,1.0,vr̥ṇī,True,V.,Earlier,1.180.4b 1.67.1b 4.25.3a,,2:1338-1339,avRNItam vRnIte vRnIte(accented - last syll)
37,pu,,punā,False,,Earlier & Later,9.16.3c 9.67.27d,1.133.1a 10.13.3d,1:900-,punIhi puNAmi


## Annotating Verbal Roots with Rig Veda Attestations (Own Search)

### Getting Rig Veda padapatha text (Eichler)

In [None]:
# http://www.detlef108.de/Rigveda.htm 
# http://www.detlef108.de/Notes-to-the-Rigveda-Page.htm 
!wget -O downloads/rv_padapatha_eichler.html http://www.detlef108.de/RV-Padapatha-TA3-paada-NA-UTF8.html 

In [14]:
# sudo apt install html2text
#!html2text -utf8 -width 3000 -o rv_padapatha.txt rv_padapatha.html

from bs4 import BeautifulSoup

with open("downloads/rv_padapatha_eichler.html", "r", encoding="utf-8") as input_file:
    soup = BeautifulSoup(input_file)
    
    hymns = []
    
    for para in soup.find_all("p"):
        # ignore the ending notes
        if para.contents[0].name == "span":
            continue
        
        #hymns.append(para.text.rstrip()) # no extra lines between hymns
        hymns.append(para.text)
    
    with open("data/rv_padapatha_eichler.txt", 'w', encoding="utf-8") as f:
        f.write("".join(hymns))

In [222]:
# TODO break the padapatha verse into sub-lines

### Getting Rig Veda padapatha / metrically restored texts

In [None]:
# https://github.com/cceh/c-salt_vedaweb_sources/tree/master/rigveda/versions
# description of the sources here:
# https://github.com/cceh/c-salt_vedaweb_tei/blob/master/vedaweb_corpus.tei
# https://vedaweb.uni-koeln.de/rigveda/help

!wget -O downloads/rv_padapatha_lubotsky.json https://raw.githubusercontent.com/cceh/c-salt_vedaweb_sources/master/rigveda/versions/lubotsky.json

!wget -O downloads/rv_samhitapatha_vnh.json https://raw.githubusercontent.com/cceh/c-salt_vedaweb_sources/master/rigveda/versions/vnh.json

In [16]:
# make text version from the jsons, with line numbers at the beginning
!python src/transform_json_corpus.py downloads/rv_padapatha_lubotsky.json

Successfully wrote the sanskrit text to data/rv_padapatha_lubotsky.txt

List of sanskrit chars resolved from the text:

vowels_short: ['a', 'a\\', 'i', 'l̥', 'r̥', 'r̥\\', 'u', 'u\\', '~i', '~u', 'á', 'í', 'ú', 'ŕ̥']
vowels_long: ['ai', 'au', 'au\\', 'aí', 'aú', 'e', 'e\\', 'o', 'o\\', 'r̥̄', 'r̥̄́', 'é', 'ó', 'ā', 'ā\\', 'ā́', 'ī', 'ī\\', 'ī́', 'ū', 'ū\\', 'ū́']
consonants: ['b', 'bh', 'c', 'ch', 'd', 'dh', 'g', 'gh', 'h', 'j', 'jh', 'k', 'kh', 'l', 'm', 'm̐', 'n', 'p', 'ph', 'r', 's', 't', 'th', 'v', 'y', 'ñ', 'ś', 'ḍ', 'ḍh', 'ḥ', 'ḷ', 'ḷh', 'ṁ', 'ṅ', 'ṇ', 'ṣ', 'ṭ', 'ṭh']
special_chars: [' ']
others: []

List of sanskrit chars missing:

vowels_short: ['i\\', 'r̥̀', 'à', 'ì', 'ï', 'ù', 'ü']
vowels_long: ['ai\\', 'aì', 'aù', 'è', 'ò', 'ā̀', 'ī3', 'ī̀', 'ī́3', 'ū3', 'ū̀', 'ū́3']
consonants: []
special_chars: [' ̀', "'"]


In [17]:
# make text version from the jsons, with line numbers at the beginning
!python src/transform_json_corpus.py downloads/rv_samhitapatha_vnh.json

Successfully wrote the sanskrit text to data/rv_samhitapatha_vnh.txt

List of sanskrit chars resolved from the text:

vowels_short: ['a', 'a\\', 'i', 'i\\', 'l̥', 'r̥', 'u', '~i', '~u', 'á', 'í', 'ú', 'ŕ̥']
vowels_long: ['ai', 'ai\\', 'au', 'aí', 'aú', 'e', 'e\\', 'o', 'o\\', 'r̥̄', 'r̥̄́', 'é', 'ó', 'ā', 'ā\\', 'ā́', 'ī', 'ī3', 'ī́', 'ī́3', 'ū', 'ū́']
consonants: ['b', 'bh', 'c', 'ch', 'd', 'dh', 'g', 'gh', 'h', 'j', 'jh', 'k', 'kh', 'l', 'm', 'm̐', 'n', 'p', 'ph', 'r', 's', 't', 'th', 'v', 'y', 'ñ', 'ś', 'ḍ', 'ḍh', 'ḥ', 'ḷ', 'ḷh', 'ṁ', 'ṅ', 'ṇ', 'ṣ', 'ṭ', 'ṭh']
special_chars: [' ', ' ̀']
others: ["'bh", "'d", "'dh", "'g", "'h", "'j", "'k", "'m", "'n", "'p", "'r", "'s", "'t", "'v", "'y", "'ś"]

List of sanskrit chars missing:

vowels_short: ['r̥\\', 'r̥̀', 'u\\', 'à', 'ì', 'ï', 'ù', 'ü']
vowels_long: ['au\\', 'aì', 'aù', 'è', 'ò', 'ā̀', 'ī\\', 'ī̀', 'ū3', 'ū\\', 'ū̀', 'ū́3']
consonants: []
special_chars: ["'"]


Make sure that the missing chars here are okay to ignore, or that they are just written differently in the text.
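One way to run part of this check mechanically is to compare the expected characters against the text after Unicode normalization, so that a character written in precomposed form in one place and with combining marks in another is not reported as missing. This is a minimal sketch; `chars_absent_from_text` is a hypothetical helper and is not part of `src/transform_json_corpus.py`.

```python
import unicodedata


def chars_absent_from_text(text, chars):
    """Return the chars that do not occur in text once both are NFC-normalized."""
    text_nfc = unicodedata.normalize("NFC", text)
    return [
        char for char in chars
        if unicodedata.normalize("NFC", char) not in text_nfc
    ]


# the sample text spells í as i + combining acute; the query uses precomposed í
sample = "agni\u0301m"
print(chars_absent_from_text(sample, ["\u00ed", "\u00e0"]))
```

Characters that are still reported after this step are either truly absent or written with a different convention altogether, which needs a manual look.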

### Searching text for ninth-class verbal forms

In [359]:
# TODO search text for ninth-class verbal forms , replicating vedaweb search below?
# use vidyut to identify only finite verbal forms?

## Annotating Verbal Roots with Rig Veda Attestations

In [24]:
# in the same folder as this notebook
import src.lib.roots_attestations as roots_attestations

# useful during testing to pick up changes in the file
import importlib
importlib.reload(roots_attestations)

roots, roots_attested_words_by_stanza = roots_attestations.get_attestations(whitney_roots)

[ninth] i 2: 0 strong, 0 weak attestations
note: using 'iṣ 1' for root guess iṣ (as done in vedaweb)
[ninth] iṣ 1: 1 strong, 0 weak attestations
[ninth] ubh: 2 strong, 0 weak attestations
[ninth] uṣ: 0 strong, 0 weak attestations
note: using 'kṣī' for root guess kṣi (as done in vedaweb)
[ninth] kṣī: 3 strong, 0 weak attestations
[ninth] gr̥: 0 strong, 0 weak attestations
note: using 'gr̥bhⁱ' for root guess gr̥bh (as done in vedaweb)
[ninth] gr̥bhⁱ: 10 strong, 6 weak attestations
[ninth] jū: 4 strong, 1 weak attestations
note: using 'jyā' for root guess jī (as done in vedaweb)
[ninth] jyā: 3 strong, 0 weak attestations
[ninth] dr̥: 0 strong, 0 weak attestations
[ninth] drū: 0 strong, 0 weak attestations
note: using 'pr̥̄ 1' for root guess pr̥ (as done in vedaweb)
[ninth] pr̥̄ 1: 12 strong, 6 weak attestations
[ninth] pruṣ: 0 strong, 0 weak attestations
[ninth] bhrī: 0 strong, 0 weak attestations
note: using 'mī 1' for root guess mi mī (as done in vedaweb)
[ninth] mī 1: 23 strong, 2 weak

In [25]:
import json
with open("data/roots_attestations.json", 'w', encoding="utf-8") as f:
    json.dump(roots_attested_words_by_stanza, f, indent=2, ensure_ascii=False)

import pandas
df_roots = pandas.DataFrame.from_dict(roots)
df_roots.to_csv("data/roots.csv", index=None)
df_roots.head()

Unnamed: 0,root_guess,variant_no,strong_stem,weak_stem,weak_only,attestation_texts,language_period,present_class,root,strong_attestations,strong_attestations_total,weak_attestations,weak_attestations_total
0,i 2,,inā,inī,True,V.,Earlier,ninth,i 2,,0,,0
1,iṣ,,iṣṇā,iṣṇī,False,,Earlier,ninth,iṣ 1,01.063.02,1,,0
2,ubh,,ubhnā,ubhnī,False,V.,Earlier,ninth,ubh,01.063.04 04.019.04,2,,0
3,uṣ,,uṣṇā,uṣṇī,False,V.,Earlier,ninth,uṣ,,0,,0
4,kṣi,,kṣiṇā,kṣiṇī,False,V.B.,Earlier,ninth,kṣī,04.018.12 10.027.04 10.027.13,3,,0


### Validation: Checking for missing roots

In [26]:
# roots attested in RV
df_roots.query(
    'strong_attestations_total > 0 or weak_attestations_total > 0'
).sort_values(["present_class", "language_period", "root"])

Unnamed: 0,root_guess,variant_no,strong_stem,weak_stem,weak_only,attestation_texts,language_period,present_class,root,strong_attestations,strong_attestations_total,weak_attestations,weak_attestations_total
62,dabh,,dabhno,dabhnu,False,V.B.,Earlier,fifth,dabh,,0,01.055.07,1
63,dāś,,dāśno,dāśnu,False,V.,Earlier,fifth,dāś,08.004.06,1,,0
54,i 2,,ino,inu,False,V.,Earlier,fifth,i 2,01.066.10 04.010.07 04.016.07 06.004.03 06.005...,7,06.010.07 09.029.04,2
59,ji,,jino,jinu,False,V.B.,Earlier,fifth,ji 2 jinv,05.084.01,1,,0
57,kr̥,,kr̥ṇo,kr̥ṇu,False,,Earlier,fifth,kr̥,07.018.05 01.013.12 01.018.08 01.031.07 01.048...,113,10.101.02 01.182.03 02.026.02 04.017.10 05.083...,142
66,pruṣ,,pruṣṇo,pruṣṇu,False,V.,Earlier,fifth,pruṣⁱ,,0,01.168.08 06.071.01 10.023.04,3
56,r̥ 1,,r̥ṇo,r̥ṇu,False,V.,Earlier,fifth,r̥ 1,01.030.14 01.030.15 01.035.09 01.174.02 01.174...,9,05.045.06,1
69,sagh,,saghno,saghnu,False,V.,Earlier,fifth,sagh,01.031.03,1,,0
76,spr̥,,spr̥ṇo,spr̥ṇu,False,,Earlier,fifth,spr̥,,0,10.087.07,1
55,u 1,,uno,unu,False,V.,Earlier,fifth,u 1,05.031.01,1,,0


TODO: explain the cases where `root_guess` and `root` differ.

For example, iṣ 'send' is automatically marked as `iṣ 1`, following VedaWeb (the same numbering that Lubotsky gives). Whitney actually has this root as number 2 in the main root list, but since the stem has no variants for it, the number is not marked later on.
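To see which roots were renamed during the lookup, the two columns can simply be compared. The sketch below uses a few toy records copied from the `data/roots.csv` preview above rather than the full file, so it runs on its own.

```python
# toy records mirroring a few rows of data/roots.csv shown earlier
roots_records = [
    {"root_guess": "i 2", "root": "i 2"},
    {"root_guess": "iṣ", "root": "iṣ 1"},
    {"root_guess": "ubh", "root": "ubh"},
    {"root_guess": "kṣi", "root": "kṣī"},
]

# keep only the rows where the resolved root differs from Whitney's form
renamed = [
    (record["root_guess"], record["root"])
    for record in roots_records
    if record["root_guess"] != record["root"]
]

for root_guess, root in renamed:
    print(f"{root_guess} -> {root}")
```

On the real dataframe the same filter would be `df_roots[df_roots["root_guess"] != df_roots["root"]]`.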

In [27]:
# to print without index on the left
#from IPython.display import HTML
#HTML(df_roots.to_html(index=False))

# roots not attested in RV
df_roots.query(
    'strong_attestations_total == 0 and weak_attestations_total == 0'
).sort_values(["present_class", "language_period", "root"])

Unnamed: 0,root_guess,variant_no,strong_stem,weak_stem,weak_only,attestation_texts,language_period,present_class,root,strong_attestations,strong_attestations_total,weak_attestations,weak_attestations_total
53,akṣ,,akṣṇo,akṣṇu,False,V.B.,Earlier,fifth,akṣ,,0,,0
61,dagh,,daghno,daghnu,False,B.,Earlier,fifth,dagh,,0,,0
58,kṣubh,,kṣubhno,kṣubhnu,False,B.,Earlier,fifth,kṣubh,,0,,0
68,lu,,luno,lunu,False,B.S.,Earlier,fifth,lu,,0,,0
64,pi,,pino,pinu,False,V.B.,Earlier,fifth,pi,,0,,0
65,pr̥ 1,1.0,pr̥ṇo,pr̥ṇu,False,S.,Earlier,fifth,pr̥ 1,,0,,0
67,ri,,riṇo,riṇu,False,B.,Earlier,fifth,ri,,0,,0
71,sadh,,sadhno,sadhnu,False,B.,Earlier,fifth,sadh,,0,,0
70,si,,sino,sinu,False,V.B.,Earlier,fifth,si,,0,,0
72,skabh,,skabhno,skabhnu,False,B.,Earlier,fifth,skabh,,0,,0


TODO: test these with a different length of the final root vowel, just to see if we catch anything.

Done: this did not turn up any additional attestations.
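The vowel-length test mentioned above can be sketched as follows. Both function names are hypothetical, and the sketch assumes that the stem begins with the root exactly as written and that roots end in a simple vowel (a, i, u, r̥ or their long counterparts) rather than a diphthong.

```python
import unicodedata

MACRON = "\u0304"                           # combining macron
SHORT_VOWELS = ("a", "i", "u", "r\u0325")   # r̥ is r + combining ring below


def toggle_final_vowel_length(root):
    """Flip the length of a root-final vowel; None if the root ends in a consonant."""
    decomposed = unicodedata.normalize("NFD", root)
    if decomposed.endswith(MACRON):
        toggled = decomposed[:-1]          # long -> short
    elif decomposed.endswith(SHORT_VOWELS):
        toggled = decomposed + MACRON      # short -> long
    else:
        return None
    return unicodedata.normalize("NFC", toggled)


def stem_with_toggled_root(root, stem):
    """Rebuild a stem on the toggled root, e.g. pi + pino -> pīno."""
    toggled = toggle_final_vowel_length(root)
    if toggled is None or not stem.startswith(root):
        return None
    return toggled + stem[len(root):]


print(stem_with_toggled_root("pi", "pino"))
print(toggle_final_vowel_length("dagh"))
```

Roots carrying a homonym number (such as `pr̥ 1`) would need the number stripped before being passed in.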

## Organizing Data by Verse Lines (pādas)

In [339]:
import pandas
#df_roots = pandas.read_csv("data/roots.csv")
df_roots = pandas.read_csv("data/roots.csv", keep_default_na=False)
#df_roots.head()

In [340]:
# TODO remove test df once we have all the annotations
#df_roots_test = df_roots[~df_roots["rig_veda_strong_attestations"].isna()]
#df_roots_test = df_roots[df_roots["rig_veda_strong_attestations"].str.len() > 0]
df_roots_test = df_roots.query('strong_attestations != "" or weak_attestations != ""')
df_roots_test.head()

Unnamed: 0,root_guess,variant_no,strong_stem,weak_stem,weak_only,attestation_texts,language_period,present_class,root,strong_attestations,strong_attestations_total,weak_attestations,weak_attestations_total
1,iṣ,,iṣṇā,iṣṇī,False,,Earlier,ninth,iṣ 1,01.063.02,1,,0
2,ubh,,ubhnā,ubhnī,False,V.,Earlier,ninth,ubh,01.063.04 04.019.04,2,,0
4,kṣi,,kṣiṇā,kṣiṇī,False,V.B.,Earlier,ninth,kṣī,04.018.12 10.027.04 10.027.13,3,,0
6,gr̥bh,,gr̥bhṇā,gr̥bhṇī,False,V.B.,Earlier,ninth,gr̥bhⁱ,01.055.02 01.163.02 03.030.05 05.031.07 07.101...,10,09.046.04 09.106.03 10.062.01 10.062.02 10.062...,6
7,jū,,junā,junī,False,V.,Earlier,ninth,jū,01.027.07 01.071.06 01.186.05 07.086.07,4,09.079.02,1


In [341]:
from pprint import pprint

In [342]:
rv_lines = []

roots_data = df_roots_test.to_dict("records")

for root in roots_data:
    rv_weak_line_nos = root.pop("weak_attestations").split()
    rv_strong_line_nos = root.pop("strong_attestations").split()
    # FIXME just eliminate these? since these are only stanza
    root.pop("weak_attestations_total")
    root.pop("strong_attestations_total")
    
    #rv_weak_attestations_data = root.pop("rig_veda_weak_attestations_data")
    #rv_strong_attestations_data = root.pop("rig_veda_strong_attestations_data")
    
    weak_stem = root.pop("weak_stem")
    strong_stem = root.pop("strong_stem")
    
    for line_no in rv_weak_line_nos:
        rv_lines.append({"line_no": line_no, "stem": weak_stem, "stem_type": "weak"} | root)
        
    for line_no in rv_strong_line_nos: 
        rv_lines.append({"line_no": line_no, "stem": strong_stem, "stem_type": "strong"} | root)

pprint(rv_lines[0])

{'attestation_texts': '',
 'language_period': 'Earlier',
 'line_no': '01.063.02',
 'present_class': 'ninth',
 'root': 'iṣ 1',
 'root_guess': 'iṣ',
 'stem': 'iṣṇā',
 'stem_type': 'strong',
 'variant_no': '',
 'weak_only': False}


### Parsing line numbers

In [343]:
# "1.1.1b" > "01" "001" "01" "b"
# 01.063.02 > "01" "063" "02" ""
def parse_rv_line_no(string):
    line_no_parts = string.split(".")
    
    book = line_no_parts[0].zfill(2)
    hymn = line_no_parts[1].zfill(3)
    
    last_char = line_no_parts[2][-1]
    if last_char.isalpha():
        stanza = line_no_parts[2][:-1] # drop the last char
        pada = last_char
    else:
        stanza = line_no_parts[2]
        pada = ""

    stanza = stanza.zfill(2)

    return {
        "book"    : book,
        "hymn"    : f"{book}.{hymn}",
        "stanza"  : f"{book}.{hymn}.{stanza}",
        "pada"    : f"{book}.{hymn}.{stanza}.{pada}" if pada else ""
        #"pada_id" : pada or ''
    }    

rv_lines = [line | (parse_rv_line_no(line["line_no"])) for line in rv_lines]

pprint(rv_lines[0])

{'attestation_texts': '',
 'book': '01',
 'hymn': '01.063',
 'language_period': 'Earlier',
 'line_no': '01.063.02',
 'pada': '',
 'present_class': 'ninth',
 'root': 'iṣ 1',
 'root_guess': 'iṣ',
 'stanza': '01.063.02',
 'stem': 'iṣṇā',
 'stem_type': 'strong',
 'variant_no': '',
 'weak_only': False}


## Annotating Verse Lines

### Downloading annotation data

In [18]:
!mkdir -p downloads/vedaweb

In [344]:
import os
import requests
import time

VEDAWEB_API_URL = "https://vedaweb.uni-koeln.de/rigveda/api"

rv_stanza_nos = sorted({line["stanza"] for line in rv_lines})
print(len(rv_stanza_nos))

814


In [345]:
for stanza_no in rv_stanza_nos:
    print(f"Getting data for stanza: {stanza_no}")
    
    stanza_file = f"downloads/vedaweb/{stanza_no}.json"
    if os.path.exists(stanza_file):
        # we already downloaded this stanza so continue to the next one
        continue
    
    # eg: https://vedaweb.uni-koeln.de/rigveda/api/document/id/0100102
    vedaweb_doc_id = stanza_no.replace('.', '')
    vedaweb_doc_url = f"{VEDAWEB_API_URL}/document/id/{vedaweb_doc_id}"
    
    response = requests.get(vedaweb_doc_url, timeout=30)
    # raises an exception on non-200 responses, since we want to know and act on it
    response.raise_for_status()
    
    with open(stanza_file, 'w') as f:
        f.write(response.text)
    
    # so that we don't hammer the api
    time.sleep(0.5)

print("Done!")

Getting data for stanza: 01.010.04
Getting data for stanza: 01.010.07
Getting data for stanza: 01.010.08
Getting data for stanza: 01.012.01
Getting data for stanza: 01.013.02
Getting data for stanza: 01.013.05
Getting data for stanza: 01.013.12
Getting data for stanza: 01.015.02
Getting data for stanza: 01.015.03
Getting data for stanza: 01.017.09
Getting data for stanza: 01.018.01
Getting data for stanza: 01.018.04
Getting data for stanza: 01.018.08
Getting data for stanza: 01.023.21
Getting data for stanza: 01.025.01
Getting data for stanza: 01.027.07
Getting data for stanza: 01.027.12
Getting data for stanza: 01.028.06
Getting data for stanza: 01.030.12
Getting data for stanza: 01.030.14
Getting data for stanza: 01.030.15
Getting data for stanza: 01.031.03
Getting data for stanza: 01.031.07
Getting data for stanza: 01.031.08
Getting data for stanza: 01.032.03
Getting data for stanza: 01.032.04
Getting data for stanza: 01.035.09
Getting data for stanza: 01.036.03
Getting data for sta

### Enriching the lines with text and metrical info

In [346]:
import json

In [347]:
with open("data/roots_attestations.json", encoding="utf-8") as f:
    roots_attested_words_by_stanza = json.load(f)
    
#print(roots_attested_words_by_stanza)

In [348]:
# pada labels 'a'..'g' mapped to their zero-based position in the stanza
PADA_CHAR_TO_NO = {char: no for no, char in enumerate("abcdefg")}


def pada_char_to_no(char):
    if char not in PADA_CHAR_TO_NO:
        raise Exception(f"Invalid pada char: {char}")
    return PADA_CHAR_TO_NO[char]
            
            
def get_stanza_words(stanza_padas):    
    stanza_words = {}
    
    for pada_data in stanza_padas:
        for word_grammar_data in pada_data["grammarData"]:
            word = word_grammar_data["form"]
            
            word_grammar_data_props = word_grammar_data["props"]
            word_position = word_grammar_data_props.pop("position", '')
            word_lemma_type = word_grammar_data_props.pop("lemma type", '')

            word_data = {
                # tracker for when we later search for the actual attested words
                "found": False, 
                "data": {
                    "pada_id": pada_data["id"],
                    # TODO test with this and later eliminate
                    #"pada_index": pada_data["index"], 
                    "pada_label": pada_data["label"],
                    "word": word,
                    "word_position_no": word_grammar_data["index"], # not-zero-indexed!
                    # TODO be careful of this, does not seem to be accurate
                    # (eg: for "punīhi" for 9.67.24 )
                    # TODO use these for checks
                    "word_position": word_position,
                    "word_lemma_type": word_lemma_type,
                    # FIXME pass this and use to validate further
                    #"word_lemma": word_grammar_data["lemma"]
                    "word_props": word_grammar_data_props,
                    # TODO this not needed since all of it is contained in props
                    # but pass and validate they are the same...
                    #"word_gloss": word_tracker_gloss
                }
            }
                
            if word in stanza_words:
                stanza_words[word].append(word_data)
            else:
                # need to use a list since the word may appear multiple times in the stanza
                stanza_words[word] = [word_data]
    
    return stanza_words
    

def get_words_by_pada(stanza_attested_words, stanza_padas, stanza_no=None):
    words_by_pada = []
    
    #pprint(stanza_attested_words)
    #pprint(stanza_padas)
    
    # transform data in stanza_padas to be amenable for searching the attested words
    stanza_words = get_stanza_words(stanza_padas)
    #pprint(stanza_words)
    
    for attested_word_data in stanza_attested_words:
        attested_word = attested_word_data["word"]
        attested_word_gloss = attested_word_data["gloss"]
        
        if attested_word in stanza_words:
            for word_instance in stanza_words[attested_word]:
                # if this word instance was already found, skip to the next one 
                if word_instance["found"]:
                    continue
                    
                word_instance_data = word_instance["data"]
                word_instance_lemma_type = word_instance_data.pop("word_lemma_type")
                
                # FIXME check for lemma too?
                # TODO also ensure this is not causing us to drop valid lines
                if (word_instance_lemma_type and word_instance_lemma_type != "root"):
                    print(
                        f"Skipping an instance of attested word {attested_word} because its lemma type",
                        f"'{word_instance_lemma_type}' is not root"
                    )
                    continue
                    
                if (word_instance_data["word_position"] and
                        "position" in attested_word_gloss and
                        # python "and" operator is short-circuiting so can access "position" below
                        # TODO can we trust this?
                        word_instance_data["word_position"] != attested_word_gloss["position"]
                    ):
                    # even if we skip for a valid instance we will come back to it with a valid position later
                    print(
                        f"Skipping an instance of attested word {attested_word} because its position '{word_instance_data['word_position']}'",
                        f"does not match the actual attested position '{attested_word_gloss['position']}'"
                    )
                    continue                
                
                # TODO not needed since all of this info is already in word_instance_data
                #word_instance_data["gloss"] = attested_word_gloss
                words_by_pada.append(word_instance_data)
                
                # dicts are mutable, so this updates the entry held in stanza_words directly
                word_instance["found"] = True  
                break
        else:
            # TODO handle this better? ok to let go maybe
            raise Exception(
                f"Word {attested_word} was not found in the stanza {stanza_no}: {stanza_padas}"
            )    
            
    # something went wrong and we need to investigate
    if len(words_by_pada) != len(stanza_attested_words):
        raise Exception("No of word instances by pada does not match the input no of word instances attested in the stanza")
    
    return words_by_pada            
    
    
def is_stem_present(stem, text):
    is_present = stem in text
    
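    # account for accent variation for fifth class strong and weak stems (-no-/-nu-),
    # where an accented ó/ú is typically a single precomposed character
    # ninth class stems (-nā́-/-nī́-) need no such handling: the accent there is a
    # separate combining mark after ā/ī, so the plain stem is already a substring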
    if not is_present:
        if stem[-1] == "o":
            stem_with_accent = stem[:-1] + "ó"
        elif stem[-1] == "u":
            stem_with_accent = stem[:-1] + "ú"
        else:
            stem_with_accent = stem
        
        is_present = stem_with_accent in text
    
    return is_present      
        
    
def annotate_line(line):    
    with open(f"downloads/vedaweb/{line['stanza']}.json") as f:
        stanza = json.load(f)
        
        # TODO get pada no for each line (could be multiple) using data in: 
        # roots_attested_words_by_stanza
        # we will be ultimately returning multiple lines here sometimes
        #if line["pada"]:
        #    pada_no = pada_char_to_no(line["pada"][-1])
        #    #pada_no = pada_char_to_no(line["pada_id"])
        #else:
        
        #pada_no = 0
        
        stanza_attested_words = roots_attested_words_by_stanza[line["present_class"]][line["root"]][line["stem_type"]][line["stanza"]]
            
        words_by_pada = get_words_by_pada(stanza_attested_words, stanza["padas"], line["stanza"])
        #pprint(words_by_pada)
        
        padas = []
        
        for word in words_by_pada:
            pada = line | word
            
            # TODO rename line_no to location everywhere
            pada["line_no"] = pada["stanza"] + "." + pada["pada_id"]
            # this is not needed now
            pada.pop("pada")
            #pada["pada"] = pada["stanza"] + "." + pada["pada_id"]
            
            # FIXME add each of these as a separate field too
            word_props = pada.pop("word_props")
            pada["word_gloss"] = f"{word_props['person']}.{word_props['number']}" + \
                f".{word_props['tense']}.{word_props['mood']}.{word_props['voice']}"
            
            # TODO try out getting index from stanza info directly and see if we still
            # get the same results
            pada_no = pada_char_to_no(pada["pada_id"])
            # TODO testing remove
            #pada_no = 0

            for version in stanza["versions"]:
                if version["id"] == "version_lubotsky":
                    pada["text_padapatha"] = version["form"][pada_no]
                    break
        
            for version in stanza["versions"]:
                if version["id"] == "version_vannootenholland":
                    # need to override pada no for the vnh version here, since it does
                    # not match the padapatha:
                    # https://vedaweb.uni-koeln.de/rigveda/view/id/08.039.06
                    if pada["line_no"] == "08.039.06.e" and pada["stem"] == "vr̥ṇu ūrṇu":
                        # FIXME update line_no also here?  
                        pada_no = pada_char_to_no("d")
                    # TODO deal with * at the beginning of the text here?
                    pada["text_samhitapatha"] = version["form"][pada_no]
                    pada["meter_scansion"] = version["metricalData"][pada_no]
                    break
            
            pada["stanza_meter"] = stanza["stanzaType"] or ''
    
            # TODO get these info from hellewig too?
            # historical info
            pada["stanza_strata"] = stanza["strata"]
            pada["stanza_late_addition"] = stanza["lateAdditions"] or ''
            
            # hymn extra metadata (maybe handy)
            pada["hymn_absolute_no"] = stanza["hymnAbs"]
            pada["hymn_addressee"] = stanza["hymnAddressee"]
            pada["hymn_group"] = stanza["hymnGroup"]
        
            # this shouldn't really happen since the results we got were done
            # via stem searches on the padapatha but validate, just in case
            if not any([
                is_stem_present(stem_variant, pada["text_padapatha"])
                for stem_variant in pada["stem"].split(" ")
            ]):
                raise Exception(
                    f'Stem {pada["stem"]} not found in the padapatha text "{pada["text_padapatha"]}"'
                )   
        
            padas.append(pada)

        return padas

    
#line_annotated = annotate_line(rv_lines[1])
#pprint(line_annotated)

#rv_lines_annotated = [annotate_line(line) for line in rv_lines]

rv_lines_annotated = []
for line in rv_lines:
    # TODO rename the annotate function here
    rv_lines_annotated.extend(annotate_line(line))

pprint(rv_lines_annotated[0])
print(f"\nTotal number of lines: {len(rv_lines_annotated)}")

{'attestation_texts': '',
 'book': '01',
 'hymn': '01.063',
 'hymn_absolute_no': 63,
 'hymn_addressee': 'Indra',
 'hymn_group': 'Hymns of Nodhas, Descendant of Gotama',
 'language_period': 'Earlier',
 'line_no': '01.063.02.d',
 'meter_scansion': 'SS LLS SSLS LL',
 'pada_id': 'd',
 'pada_label': 'M',
 'present_class': 'ninth',
 'root': 'iṣ 1',
 'root_guess': 'iṣ',
 'stanza': '01.063.02',
 'stanza_late_addition': '',
 'stanza_meter': 'Triṣṭubh',
 'stanza_strata': 'A',
 'stem': 'iṣṇā',
 'stem_type': 'strong',
 'text_padapatha': 'púraḥ iṣṇā́si= puruhūta pūrvī́ḥ',
 'text_samhitapatha': 'púra iṣṇā́si puruhūta pūrvī́ḥ',
 'variant_no': '',
 'weak_only': False,
 'word': 'iṣṇā́si',
 'word_gloss': '2.SG.PRS.IND.ACT',
 'word_position': 'intermediate',
 'word_position_no': 2}

Total number of lines: 881
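The accent handling in `is_stem_present` could also be done by stripping accent marks from both the stem and the text before comparing them. The sketch below is an alternative, not the check used in this notebook: `strip_accents` and `is_stem_present_normalized` are hypothetical names, and only the combining acute and grave are treated as accent marks.

```python
import unicodedata

ACCENT_MARKS = {"\u0301", "\u0300"}  # combining acute and combining grave


def strip_accents(text):
    """Remove acute/grave accents while keeping macrons, underdots, etc."""
    decomposed = unicodedata.normalize("NFD", text)
    without_accents = "".join(ch for ch in decomposed if ch not in ACCENT_MARKS)
    return unicodedata.normalize("NFC", without_accents)


def is_stem_present_normalized(stem, text):
    """Accent-insensitive substring test for a stem in a pada text."""
    return strip_accents(stem) in strip_accents(text)


print(is_stem_present_normalized("kr̥ṇo", "kr̥ṇóti"))
print(is_stem_present_normalized("punā", "punīhi"))
```

Unlike the original check, this ignores accents anywhere in the text, so it is slightly more permissive; it also does not depend on whether the text is stored in precomposed or decomposed form.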


### Saving the Final Line Results

In [349]:
df_rv_lines = pandas.DataFrame.from_dict(rv_lines_annotated)
df_rv_lines.to_csv("data/rv_lines.csv", index=None)
df_rv_lines.head(10)

Unnamed: 0,line_no,stem,stem_type,root_guess,variant_no,weak_only,attestation_texts,language_period,present_class,root,...,word_gloss,text_padapatha,text_samhitapatha,meter_scansion,stanza_meter,stanza_strata,stanza_late_addition,hymn_absolute_no,hymn_addressee,hymn_group
0,01.063.02.d,iṣṇā,strong,iṣ,,False,,Earlier,ninth,iṣ 1,...,2.SG.PRS.IND.ACT,púraḥ iṣṇā́si= puruhūta pūrvī́ḥ,púra iṣṇā́si puruhūta pūrvī́ḥ,SS LLS SSLS LL,Triṣṭubh,A,,63,Indra,"Hymns of Nodhas, Descendant of Gotama"
1,01.063.04.b,ubhnā,strong,ubh,,False,V.,Earlier,ninth,ubh,...,3.SG.PRS.INJ.ACT,vr̥trám yát vajrin= vr̥ṣakarman ubhnā́ḥ,vr̥tráṁ yád vajrin vr̥ṣakarman ubhnā́ḥ,LL L LL SSLS LL,Triṣṭubh,A,,63,Indra,"Hymns of Nodhas, Descendant of Gotama"
2,04.019.04.c,ubhnā,strong,ubh,,False,V.,Earlier,ninth,ubh,...,3.SG.IPRF.IND.ACT,dr̥ḷhā́ni aubhnāt= uśámānaḥ ójaḥ,dr̥r̥ḷhā́ni+ aubhnād uśámāna ójo,SLLS LL SSLS LL,,S,,315,Indra,Hymns to Indra
3,04.018.12.d,kṣiṇā,strong,kṣi,,False,V.B.,Earlier,ninth,kṣī,...,2.SG.IPRF.IND.ACT,yát prá ákṣiṇāḥ= pitáram pādagŕ̥hya,yát prā́kṣiṇāḥ pitáram pādagŕ̥hya,L LSL SSL LSLS,Triṣṭubh,P,"[Grassmann (G), Arnold (C1)]",314,"Dialogue Between Indra, Aditi and Vamadeva",Hymns to Indra
4,10.027.04.d,kṣiṇā,strong,kṣi,,False,V.B.,Earlier,ninth,kṣī,...,1.SG.PRS.INJ.ACT,prá tám kṣiṇām= párvate-_ pādagŕ̥hya,prá táṁ kṣiṇām párvate pādagŕ̥hya,S L SL LSL LSLS,Triṣṭubh,P,[Arnold (C1)],853,Indra,The Vasukra Hymns
5,10.027.13.c,kṣiṇā,strong,kṣi,,False,V.B.,Earlier,ninth,kṣī,...,3.SG.PRS.IND.ACT,ā́sīnaḥ ūrdhvā́m= upási} kṣiṇāti,ā́sīna ūrdhvā́m upási kṣiṇāti,LLS LL SSL SLS,Triṣṭubh,P,[Arnold (C1)],853,Indra,The Vasukra Hymns
6,09.046.04.b,gr̥bhṇī,weak,gr̥bh,,False,V.B.,Earlier,ninth,gr̥bhⁱ,...,2.PL.PRS.IND.ACT,śukrā́ gr̥bhṇīta manthínā,śukrā́ gr̥bhṇīta manthínā,LL LLS LSL,Gāyatrī,N,,758,Soma,Tirasci and Other Poets
7,09.106.03.b,gr̥bhṇī,weak,gr̥bh,,False,V.B.,Earlier,ninth,gr̥bhⁱ,...,3.SG.PRS.INJ.MED,grābhám gr̥bhṇīta sānasím,grābháṁ gr̥bhṇīta sānasím,LL LLS LSL,,A,,818,Soma,The Usnih Group
8,10.062.01.d,gr̥bhṇī,weak,gr̥bh,,False,V.B.,Earlier,ninth,gr̥bhⁱ,...,2.PL.PRS.IND.ACT,práti gr̥bhṇīta= mānavám} sumedhasaḥ,práti gr̥bhṇīta mānaváṁ sumedhasaḥ,SS LLS LSL SLSL,,C,[Arnold (C1)],888,"All the Gods or the Angiras, Thanksgiving to S...",Nabhanedistha Hymns
9,10.062.02.d,gr̥bhṇī,weak,gr̥bh,,False,V.B.,Earlier,ninth,gr̥bhⁱ,...,2.PL.PRS.IND.ACT,práti gr̥bhṇīta= mānavám} sumedhasaḥ,práti gr̥bhṇīta mānaváṁ sumedhasaḥ,SS LLS LSL SLSL,,C,[Arnold (C1)],888,"All the Gods or the Angiras, Thanksgiving to S...",Nabhanedistha Hymns
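With `meter_scansion` and `word_position_no` stored on each line, the metrical shape of the attested verb form can be read off directly. The sketch below assumes that `S` and `L` stand for short and long syllables and that each space-separated group in the scansion corresponds to one word at the same 1-based position; both helper names are hypothetical. The second assumption does not always hold: in 04.018.12.d the padapatha has five words but the scansion has four groups, because _prá ákṣiṇāḥ_ is merged in the samhita text.

```python
def scansion_syllables(meter_scansion):
    """Count the syllables in a scansion string such as 'SS LLS SSLS LL'."""
    return sum(len(group) for group in meter_scansion.split())


def word_scansion(meter_scansion, word_position_no):
    """Return the S/L pattern of the word at a 1-based position, or None."""
    groups = meter_scansion.split()
    if 1 <= word_position_no <= len(groups):
        return groups[word_position_no - 1]
    return None


# values taken from the first annotated line (01.063.02.d, iṣṇā́si)
pada = {"meter_scansion": "SS LLS SSLS LL", "word_position_no": 2}

print(scansion_syllables(pada["meter_scansion"]))                        # 11
print(word_scansion(pada["meter_scansion"], pada["word_position_no"]))   # LLS
```

The count of 11 matches the eleven-syllable Triṣṭubh line, which gives a cheap sanity check against `stanza_meter` before the per-word pattern is trusted.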


## Validating the data

### Checking for missing roots

In [350]:
import numpy as np

def print_roots_attestation_info(df_roots, df_rv_lines, present_class):
    # our starting list of roots
    roots_initial = np.sort(df_roots.query(
        f"present_class == '{present_class}'"
    )["root"].unique())
    print(f"Starting list ({len(roots_initial)}):\n{roots_initial}\n")
    
    # attested roots
    roots_present = np.sort(df_rv_lines.query(
        f"present_class == '{present_class}'"
    )["root"].unique())
    print(f"Attested ({len(roots_present)}):\n{roots_present}\n")

    # missing roots
    roots_absent = np.sort(
        np.setdiff1d(roots_initial, roots_present)
    )
    print(f"Missing ({len(roots_absent)}):\n{roots_absent}\n")

In [351]:
print_roots_attestation_info(df_roots, df_rv_lines, "ninth")

Starting list (53):
['aś' 'aśⁱ' 'bandh' 'bhrī' 'dhu dhū' 'drū' 'dr̥' 'grath' 'gr̥' 'gr̥bhⁱ'
 'gr̥hⁱ' 'gr̥̄ 1' 'hr̥̄' 'hvr̥ hru' 'i 2' 'iṣ 1' 'jyā' 'jñā' 'jū' 'kliś'
 'krī' 'kuṣ' 'kṣī' 'lu' 'mathⁱ' 'mr̥d' 'mr̥̄ 1' 'muṣⁱ' 'mī 1' 'pruṣ' 'prī'
 'pr̥̄ 1' 'puṣ' 'pū' 'ram' 'rī' 'skambhⁱ' 'spr̥' 'stambhⁱ' 'str̥̄' 'subh'
 'sā si' 'ubh' 'uṣ' 'vli vlī' 'vr̥' 'vr̥ vr̥̄' 'śam' 'ścam' 'śrathⁱ' 'śrī'
 'śrī 2' 'śr̥̄ 1']

Attested (31):
['aśⁱ' 'bandh' 'gr̥bhⁱ' 'gr̥hⁱ' 'gr̥̄ 1' 'hr̥̄' 'hvr̥ hru' 'iṣ 1' 'jyā'
 'jñā' 'jū' 'krī' 'kṣī' 'mathⁱ' 'mr̥̄ 1' 'muṣⁱ' 'mī 1' 'prī' 'pr̥̄ 1' 'pū'
 'ram' 'rī' 'skambhⁱ' 'stambhⁱ' 'str̥̄' 'sā si' 'ubh' 'vr̥ vr̥̄' 'śrathⁱ'
 'śrī' 'śr̥̄ 1']

Missing (22):
['aś' 'bhrī' 'dhu dhū' 'drū' 'dr̥' 'grath' 'gr̥' 'i 2' 'kliś' 'kuṣ' 'lu'
 'mr̥d' 'pruṣ' 'puṣ' 'spr̥' 'subh' 'uṣ' 'vli vlī' 'vr̥' 'śam' 'ścam'
 'śrī 2']



In [352]:
print_roots_attestation_info(df_roots, df_rv_lines, "fifth")

Starting list (49):
['akṣ' 'ci' 'ci 1' 'dabh' 'dagh' 'dhi' 'dhr̥ṣ' 'dhū' 'du' 'dāś' 'hi' 'i 2'
 'jagh' 'ji 2 jinv' 'kr̥' 'kṣi' 'kṣubh' 'lu' 'mi' 'mi 1' 'naś 1' 'pi'
 'pruṣⁱ' 'pr̥' 'pr̥ 1' 'ri' 'rādh' 'r̥ 1' 'r̥dh' 'sadh' 'sagh' 'si'
 'skabh' 'sku' 'spr̥' 'stabh' 'stigh' 'str̥' 'stu' 'su' 'takṣ' 'ti' 'tr̥p'
 'u 1' 'vr̥' 'vr̥ vr̥̄' 'āp' 'śak' 'śru']

Attested (23):
['ci 1' 'dabh' 'dhr̥ṣ' 'dhū' 'dāś' 'hi' 'i 2' 'ji 2 jinv' 'kr̥' 'mi 1'
 'naś 1' 'pruṣⁱ' 'r̥ 1' 'r̥dh' 'sagh' 'spr̥' 'str̥' 'su' 'tr̥p' 'u 1'
 'vr̥' 'śak' 'śru']

Missing (26):
['akṣ' 'ci' 'dagh' 'dhi' 'du' 'jagh' 'kṣi' 'kṣubh' 'lu' 'mi' 'pi' 'pr̥'
 'pr̥ 1' 'ri' 'rādh' 'sadh' 'si' 'skabh' 'sku' 'stabh' 'stigh' 'stu'
 'takṣ' 'ti' 'vr̥ vr̥̄' 'āp']
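
As a quick sanity check on the listings above: the missing roots are the starting list minus the attested roots, so the three counts should add up (53 = 31 + 22 for the ninth class, 49 = 23 + 26 for the fifth). Below is a minimal plain-Python sketch of that check. The helper name `check_attestation_partition` is hypothetical and not part of this notebook's code, and the lists are a small illustrative subset of the roots above rather than the real data frames.

```python
def check_attestation_partition(initial, attested):
    """Return the sorted missing roots, verifying that the counts are consistent.

    Hypothetical helper: it assumes every attested root was taken
    from the starting list, and raises if that is not the case.
    """
    initial, attested = set(initial), set(attested)
    unexpected = attested - initial
    if unexpected:
        raise ValueError(f"attested roots not in starting list: {sorted(unexpected)}")
    missing = initial - attested
    # starting list = attested + missing, with no overlap
    assert len(initial) == len(attested) + len(missing)
    return sorted(missing)

# illustrative subset of the ninth class roots listed above
starting = ["ubh", "kṣī", "kuṣ", "puṣ"]
found = ["ubh", "kṣī"]
missing_roots = check_attestation_partition(starting, found)
print(missing_roots)
```

Raising on unexpected attested roots is a deliberate choice here: a root that shows up in the RV lines but not in the starting list would point to a parsing or labelling problem rather than a genuinely missing form.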



TODO explain that it is fine (and expected) for these roots to be missing from the RV:

* Some roots are attested only as a weak stem before a vowel (i.e. no _nī_ form), e.g. [_uṣ_](https://vedaweb.uni-koeln.de/rigveda/view/id/09.097.39).
* Some roots are attested in later Vedic texts.
* Some roots are already marked by Whitney as belonging to the later language (to be found only in Classical / Epic Sanskrit).

TODO also try searching by alternate stem/root vowel forms for the missing roots (and also for the attested ones?), to see if they are actually recorded in those forms

### Checking found words against the padapatha text

In [353]:
# from ./meter.py
from meter import clean_lubotsky_padapatha

no_of_mismatches = 0

for line in rv_lines_annotated:
    padapatha_text_cleaned = clean_lubotsky_padapatha(line["text_padapatha"])
    padapatha_parts = padapatha_text_cleaned.split(' ')
    
    # word_position_no is 1-based, so shift it to a 0-based list index
    word_from_padapatha = padapatha_parts[line["word_position_no"] - 1]
    
    if line["word"] != word_from_padapatha:
        no_of_mismatches += 1
        print(
            f"{line['line_no']}:{line['word_position_no']}", 
            f"({line['stem']}) {line['word']} ≠ {word_from_padapatha}",
            #f"[{padapatha_text_cleaned}]",
            f"[{line['text_padapatha']}]"
        )
        
print(f"\nFound {no_of_mismatches} mismatches.")

09.104.03.a:1 (punā) punā́ta ≠ punā́tā [punā́tā+ dakṣasā́dhanam]
02.033.13.c:3 (vr̥ṇī) ávr̥ṇīta ≠ ávr̥ṇītā [yā́ni mánuḥ= ávr̥ṇītā+} pitā́ naḥ]
07.033.02.d:3 (vr̥ṇī) avr̥ṇīta ≠ avr̥ṇītā [sutā́t índraḥ= avr̥ṇītā+} vásiṣṭhān]
06.028.06.b:3 (kr̥ṇu) kr̥ṇutha ≠ kr̥ṇuthā [aśrīrám cit= kr̥ṇuthā+ suprátīkam]
01.110.03.d:3 (kr̥ṇu) akr̥ṇuta ≠ akr̥ṇutā [ékam sántam= akr̥ṇutā+} cáturvayam]
05.049.05.c:4 (kr̥ṇu) kr̥ṇutá ≠ kr̥ṇutā́ [áva etu ábhvam= kr̥ṇutā́+} várīyaḥ]
06.025.03.d:3 (kr̥ṇu) kr̥ṇuhí ≠ kr̥ṇuhī́ [jahí vŕ̥ṣṇyāni= kr̥ṇuhī́?_+} párācaḥ]
08.027.18.a:4 (kr̥ṇu) kr̥ṇutha ≠ kr̥ṇuthā [ájre?_ cit asmai= kr̥ṇuthā+} nyáñcanam]
10.067.11.a:3 (kr̥ṇu) kr̥ṇuta ≠ kr̥ṇutā [satyā́m āśíṣam= kr̥ṇutā+} vayodhaí]
10.078.08.a:4 (kr̥ṇu) kr̥ṇuta ≠ kr̥ṇutā [subhāgā́n naḥ= devāḥ kr̥ṇutā+ surátnān]
01.161.11.a:3 (kr̥ṇo) akr̥ṇotana ≠ akr̥ṇotanā [udvátsu asmai= akr̥ṇotanā+ tŕ̥ṇam]
08.045.22.c:3 (aśnu) aśnuhi ≠ aśnuhī [tr̥mpā́+ ví aśnuhī?_+ mádam]
10.066.14.d:4 (dhunu dhūnu) dhūnuta ≠ dhūnutā [asmé?_ devāsaḥ= áva dhūnu

The cases above differ only by final vowel lengthening (especially with imperatives), so they are not real mismatches and can be ignored.
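
The final vowel lengthening cases could also be filtered out automatically instead of by eye. The sketch below is one possible way to do it, not something taken from `meter.py`: the helper names `shorten_final_vowel` and `is_final_lengthening` are hypothetical, and it assumes vowel length is written with a macron, as in the output above. It decomposes each word with Unicode NFD, drops a macron on a word-final _a_, _i_ or _u_, and compares the results.

```python
import unicodedata

MACRON = "\u0304"  # combining macron: marks a long vowel once decomposed

def shorten_final_vowel(word):
    """Return `word` (NFC-normalized) with a word-final long a/i/u made short."""
    decomposed = unicodedata.normalize("NFD", word)
    # split off the combining marks (macron, accent) attached to the last letter
    end = len(decomposed)
    while end > 0 and unicodedata.combining(decomposed[end - 1]):
        end -= 1
    base, marks = decomposed[:end], decomposed[end:]
    if base and base[-1] in "aiu":
        marks = marks.replace(MACRON, "")
    return unicodedata.normalize("NFC", base + marks)

def is_final_lengthening(word, padapatha_word):
    """True if the two forms differ only in the length of the final vowel."""
    same = (unicodedata.normalize("NFC", word)
            == unicodedata.normalize("NFC", padapatha_word))
    return (not same) and (
        shorten_final_vowel(word) == shorten_final_vowel(padapatha_word)
    )

# two pairs from the output above, plus an identical pair for contrast
pairs = [("aśnuhi", "aśnuhī"), ("kr̥ṇutá", "kr̥ṇutā́"), ("kṣiṇāti", "kṣiṇāti")]
for word, padapatha_word in pairs:
    print(word, padapatha_word, is_final_lengthening(word, padapatha_word))
```

In the loop above, such a check could be used to skip these pairs before counting, so that only genuine mismatches are printed.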

TODO look at the vowel positions for these? (Not really within the scope of our investigation, though.)