# Raw Rigveda Corpus

## Getting Rigveda samhitapatha (VNH version)

In [None]:
# Source files:
# https://github.com/cceh/c-salt_vedaweb_sources/tree/master/rigveda/versions
# Description of the sources:
# https://github.com/cceh/c-salt_vedaweb_tei/blob/master/vedaweb_corpus.tei
# https://vedaweb.uni-koeln.de/rigveda/help

!wget -O downloads/rv_samhitapatha_vnh.json https://raw.githubusercontent.com/cceh/c-salt_vedaweb_sources/master/rigveda/versions/vnh.json
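The `wget -O downloads/...` call above expects the `downloads/` directory to exist already. As a minimal sketch of an alternative, the helper below creates the target directory first and then fetches the file with the standard library. The function name `download` is hypothetical and not part of this repository.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download(url, dest):
    """Fetch `url` into `dest`, creating the parent directory if needed.

    Hypothetical helper for illustration; the notebook itself uses wget.
    """
    dest = Path(dest)
    # Create e.g. downloads/ so the write does not fail on a fresh checkout
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return dest
```

It could be called as `download("https://raw.githubusercontent.com/cceh/c-salt_vedaweb_sources/master/rigveda/versions/vnh.json", "downloads/rv_samhitapatha_vnh.json")`, and the same helper would serve the other downloads in this notebook.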

In [4]:
# Convert the JSON corpus to plain text, with a line number at the start of each line
!python src/transform_json_corpus.py downloads/rv_samhitapatha_vnh.json

Successfully wrote the sanskrit text to data/rv_samhitapatha_vnh.txt

List of sanskrit chars resolved from the text:

vowels_short: ['a', 'a\\', 'i', 'i\\', 'l̥', 'r̥', 'u', '~i', '~u', 'á', 'í', 'ú', 'ŕ̥']
vowels_long: ['ai', 'ai\\', 'au', 'aí', 'aú', 'e', 'e\\', 'o', 'o\\', 'r̥̄', 'r̥̄́', 'é', 'ó', 'ā', 'ā\\', 'ā́', 'ī', 'ī3', 'ī́', 'ī́3', 'ū', 'ū́']
consonants: ['b', 'bh', 'c', 'ch', 'd', 'dh', 'g', 'gh', 'h', 'j', 'jh', 'k', 'kh', 'l', 'm', 'm̐', 'n', 'p', 'ph', 'r', 's', 't', 'th', 'v', 'y', 'ñ', 'ś', 'ḍ', 'ḍh', 'ḥ', 'ḷ', 'ḷh', 'ṁ', 'ṅ', 'ṇ', 'ṣ', 'ṭ', 'ṭh']
special_chars: [' ', ' ̀']
others: ["'bh", "'d", "'dh", "'g", "'h", "'j", "'k", "'m", "'n", "'p", "'r", "'s", "'t", "'v", "'y", "'ś"]

List of sanskrit chars missing:

vowels_short: ['r̥\\', 'r̥̀', 'u\\', 'à', 'ì', 'ï', 'ù', 'ü']
vowels_long: ['au\\', 'aì', 'aù', 'è', 'ò', 'ā̀', 'ī\\', 'ī̀', 'ū3', 'ū\\', 'ū̀', 'ū́3']
consonants: []
special_chars: ["'"]


Make sure that the missing chars here are okay to ignore, or that they are just written differently in the text.
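
One way a "missing" character can turn out to be merely written differently is Unicode composition: the text may store `à` as `a` plus a combining grave accent while the inventory holds the precomposed form, or the other way round. The sketch below counts each missing entry in the raw text and again after NFC normalisation. The function name `find_hidden_variants` and the sample string are assumptions for illustration, and this covers only one possible cause of a mismatch.

```python
import unicodedata


def find_hidden_variants(text, missing):
    """Return {char: (raw_count, nfc_count)} for each entry in `missing`.

    A raw count of 0 with an NFC count above 0 suggests the character is
    present in the text but encoded with combining marks.
    """
    nfc_text = unicodedata.normalize("NFC", text)
    report = {}
    for ch in missing:
        raw = text.count(ch)
        normalised = nfc_text.count(unicodedata.normalize("NFC", ch))
        report[ch] = (raw, normalised)
    return report


# Hypothetical sample: 'à' written as 'a' + U+0300 (combining grave accent)
sample_text = "a\u0300 gni"
report = find_hidden_variants(sample_text, ["\u00e0", "\u00fc"])
for ch, (raw, normalised) in report.items():
    print(repr(ch), "raw:", raw, "after NFC:", normalised)
```

To apply it to the real corpus, one would read `data/rv_samhitapatha_vnh.txt` into `text` and pass the lists reported as missing above.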

## Getting Rigveda padapatha (Lubotsky version)

In [None]:
!wget -O downloads/rv_padapatha_lubotsky.json https://raw.githubusercontent.com/cceh/c-salt_vedaweb_sources/master/rigveda/versions/lubotsky.json

In [3]:
# Convert the JSON corpus to plain text, with a line number at the start of each line
!python src/transform_json_corpus.py downloads/rv_padapatha_lubotsky.json

Successfully wrote the sanskrit text to data/rv_padapatha_lubotsky.txt

List of sanskrit chars resolved from the text:

vowels_short: ['a', 'a\\', 'i', 'l̥', 'r̥', 'r̥\\', 'u', 'u\\', '~i', '~u', 'á', 'í', 'ú', 'ŕ̥']
vowels_long: ['ai', 'au', 'au\\', 'aí', 'aú', 'e', 'e\\', 'o', 'o\\', 'r̥̄', 'r̥̄́', 'é', 'ó', 'ā', 'ā\\', 'ā́', 'ī', 'ī\\', 'ī́', 'ū', 'ū\\', 'ū́']
consonants: ['b', 'bh', 'c', 'ch', 'd', 'dh', 'g', 'gh', 'h', 'j', 'jh', 'k', 'kh', 'l', 'm', 'm̐', 'n', 'p', 'ph', 'r', 's', 't', 'th', 'v', 'y', 'ñ', 'ś', 'ḍ', 'ḍh', 'ḥ', 'ḷ', 'ḷh', 'ṁ', 'ṅ', 'ṇ', 'ṣ', 'ṭ', 'ṭh']
special_chars: [' ']
others: []

List of sanskrit chars missing:

vowels_short: ['i\\', 'r̥̀', 'à', 'ì', 'ï', 'ù', 'ü']
vowels_long: ['ai\\', 'aì', 'aù', 'è', 'ò', 'ā̀', 'ī3', 'ī̀', 'ī́3', 'ū3', 'ū̀', 'ū́3']
consonants: []
special_chars: [' ̀', "'"]


Make sure that the missing chars here are okay to ignore, or that they are just written differently in the text.
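
When deciding which missing characters can be ignored, it may help to compare the inventories resolved from the two versions. The sketch below does this for the `vowels_short` lists printed above, normalising to NFC first so that composed and decomposed spellings compare equal. The variable and function names are assumptions for illustration.

```python
import unicodedata


def nfc_set(items):
    """Normalise every entry to NFC and return the entries as a set."""
    return {unicodedata.normalize("NFC", s) for s in items}


# vowels_short as resolved from the samhitapatha (VNH) text
samhita_short = ['a', 'a\\', 'i', 'i\\', 'l̥', 'r̥', 'u', '~i', '~u',
                 'á', 'í', 'ú', 'ŕ̥']
# vowels_short as resolved from the padapatha (Lubotsky) text
padapatha_short = ['a', 'a\\', 'i', 'l̥', 'r̥', 'r̥\\', 'u', 'u\\', '~i', '~u',
                   'á', 'í', 'ú', 'ŕ̥']

only_samhita = sorted(nfc_set(samhita_short) - nfc_set(padapatha_short))
only_padapatha = sorted(nfc_set(padapatha_short) - nfc_set(samhita_short))

print("only in samhitapatha:", only_samhita)
print("only in padapatha:", only_padapatha)
```

The same comparison can be repeated for `vowels_long`, `special_chars`, and `others`.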

## Getting Rigveda padapatha (Eichler version)

In [None]:
# http://www.detlef108.de/Rigveda.htm 
# http://www.detlef108.de/Notes-to-the-Rigveda-Page.htm 
!wget -O downloads/rv_padapatha_eichler.html http://www.detlef108.de/RV-Padapatha-TA3-paada-NA-UTF8.html 

In [14]:
# Alternative (not used): convert with html2text instead of BeautifulSoup
# sudo apt install html2text
#!html2text -utf8 -width 3000 -o rv_padapatha.txt rv_padapatha.html

from bs4 import BeautifulSoup

# Parse the downloaded HTML; name the parser and the encoding explicitly
with open("downloads/rv_padapatha_eichler.html", "r", encoding="utf-8") as input_file:
    soup = BeautifulSoup(input_file, "html.parser")

hymns = []

for para in soup.find_all("p"):
    # Skip the closing notes: those paragraphs start with a <span>
    if para.contents and para.contents[0].name == "span":
        continue

    # Use para.text.rstrip() instead to drop the extra lines between hymns
    hymns.append(para.text)

with open("data/rv_padapatha_eichler.txt", "w", encoding="utf-8") as f:
    f.write("".join(hymns))

In [222]:
# TODO: break each padapatha verse into sub-lines
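
A possible starting point for this TODO is sketched below. It assumes that sub-lines inside a verse are separated by a single delimiter string, with `|` used purely as a placeholder. The actual delimiter in `data/rv_padapatha_eichler.txt` has not been checked here, and the function name `split_verse` is hypothetical.

```python
def split_verse(verse, delimiter="|"):
    """Split one verse into sub-lines on `delimiter`.

    The default delimiter is an assumption, not taken from the Eichler
    file. Empty pieces (e.g. from a doubled delimiter at the end of a
    verse) are dropped.
    """
    parts = [part.strip() for part in verse.split(delimiter)]
    return [part for part in parts if part]


# Hypothetical sample, not real corpus text
for sub_line in split_verse("a b | c d | e f ||"):
    print(sub_line)
```

Keeping the delimiter as a parameter lets the function be adjusted once the separator used in the file is confirmed.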