A general purpose Sanskrit word-list.
Each line is formatted as follows: inflected<space>operation
operation (to reconstruct the lemma) can have the following values:
- \lemma : the lemma itself, given in full when the operations below are not enough to derive it from the inflected form
- \= : the inflected form and the lemma are identical
- \>NUM : remove NUM characters at the end of the inflected form
- \<NUM : remove NUM characters at the beginning of the inflected form
- \<NUMa>NUMb : remove NUMa characters at the beginning and NUMb characters at the end of the inflected form
Note: see this readme
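As an illustration of the operations listed above, the following minimal sketch reconstructs a lemma from one line of the word-list. The function names `apply_operation` and `parse_line` are hypothetical, and the sample words were chosen for illustration rather than quoted from the list. It assumes that a lemma given in full never looks like one of the other operations.

```python
import re


def apply_operation(inflected, operation):
    # Reconstruct the lemma from an inflected form and its operation.
    # Every operation starts with a backslash, as in the format description.
    if not operation.startswith("\\"):
        raise ValueError("operation must start with a backslash")
    op = operation[1:]

    if op == "=":  # inflected form and lemma are identical
        return inflected

    both = re.fullmatch(r"<(\d+)>(\d+)", op)  # \<NUMa>NUMb
    if both:
        start, end = int(both.group(1)), int(both.group(2))
        return inflected[start:len(inflected) - end]

    tail = re.fullmatch(r">(\d+)", op)  # \>NUM: trim the end
    if tail:
        return inflected[:len(inflected) - int(tail.group(1))]

    head = re.fullmatch(r"<(\d+)", op)  # \<NUM: trim the beginning
    if head:
        return inflected[int(head.group(1)):]

    return op  # anything else is the lemma given in full


def parse_line(line):
    # Split "inflected<space>operation" and return (inflected, lemma).
    inflected, operation = line.rstrip("\n").split(" ", 1)
    return inflected, apply_operation(inflected, operation)


# Illustrative lines (assumed, not taken from the word-list itself).
print(parse_line("devasya \\>3"))
print(parse_line("gacCati \\gam"))
```

Computing the slice end as `len(inflected) - n` rather than using a negative index keeps the function correct when `n` is 0.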
A word-list containing all the sandhied inflected forms in Heritage's XML files and the files in input/custom_entries/.
Since this file was 128 MB at the time of writing, it is not included in the repository and must be generated with the following command:
python3 sandhify/sandhifier.py
Each line is formatted as follows:
<sandhied_inflected_form>,<initial>$<diffs>/<initial_diff>=<sandhi_type>#<POS>
- <diffs> : <diff_to_1st_lemma>;<diff_to_2nd_lemma>;…
- <diff_to_nth_lemma> : -<number_of_chars_to_delete>+<chars_to_add>
- <initial_diff> : -<sandhied_initial>+<initial>
- <sandhi_type> :
  - 0 : no sandhi
  - 1 : vowel sandhi
  - 2 : consonant sandhi 1
  - 3 : consonant sandhi 2
  - 4 : visarga sandhi
  - 5 : absolute finals sandhi
  - 6 : "cC"-words sandhi
  - 7 : special sandhi: "punar"
- <POS> :
  - -1 : multi-token lemma (see below)
  - 0 : Indeclinable
  - 1 : Noun
  - 2 : Pronoun
  - 3 : Verb
  - 4 : Preverb
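The field layout above can be read with a few string splits. The sketch below is only an illustration: `parse_entry` is a hypothetical helper, it handles exactly the single-group layout shown in the format line, and it assumes that `,`, `$`, `/`, `=` and `#` occur only as separators. The sample line is assembled from the field values documented in this readme, with the POS value `3` added as an assumption.

```python
SANDHI_TYPES = {
    0: "no sandhi",
    1: "vowel sandhi",
    2: "consonant sandhi 1",
    3: "consonant sandhi 2",
    4: "visarga sandhi",
    5: "absolute finals sandhi",
    6: '"cC"-words sandhi',
    7: 'special sandhi: "punar"',
}

POS_TAGS = {
    -1: "multi-token lemma",
    0: "Indeclinable",
    1: "Noun",
    2: "Pronoun",
    3: "Verb",
    4: "Preverb",
}


def parse_entry(line):
    # <form>,<initial>$<diffs>/<initial_diff>=<sandhi_type>#<POS>
    form, rest = line.rstrip("\n").split(",", 1)
    initial, rest = rest.split("$", 1)
    # Peel the trailing fields off from the right.
    rest, pos = rest.rsplit("#", 1)
    rest, sandhi_type = rest.rsplit("=", 1)
    diffs, initial_diff = rest.split("/", 1)
    return {
        "form": form,
        "initial": initial,
        "lemma_diffs": diffs.split(";"),
        "initial_diff": initial_diff,
        "sandhi_type": SANDHI_TYPES[int(sandhi_type)],
        "pos": POS_TAGS[int(pos)],
    }


# Assumed sample line; the POS value 3 is not taken from the real file.
entry = parse_entry("prezyate,a$-1+;-6+I/-'+a=1#3")
for key, value in entry.items():
    print(key, "=", value)
```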
The space between the sandhied words is preserved, except in vowel sandhi, where the final and initial vowels coalesce.
- inflected form: prezyate
- initial character of next word: a
- diff of first corresponding lemma: -1+ (lemma = prezyat)
- diff of second corresponding lemma: -6+I (lemma = prI)
- diff to undo sandhi of initial character of next word: -'+a (initial = a, sandhied initial = ')
- sandhi type: =1, vowel sandhi
- inflected form: aprezyata
- possible initial characters for this inflected form: A, i, u, U, f, e, E, o and O
- diff of first corresponding lemma: -1+ (lemma = aprezyat)
- diff of second corresponding lemma: -6+I (lemma = aprI)
- sandhi type: =1, vowel sandhi
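The diffs in the two examples above can be applied mechanically. The following sketch uses two hypothetical helpers, `apply_lemma_diff` and `split_initial_diff`, and assumes every diff has the shape `-<something>+<something>` described in the format section.

```python
def apply_lemma_diff(form, diff):
    # diff is "-<number_of_chars_to_delete>+<chars_to_add>"
    if not diff.startswith("-"):
        raise ValueError("a diff must start with '-'")
    count, _, suffix = diff[1:].partition("+")
    keep = len(form) - int(count or 0)  # number of leading characters kept
    return form[:keep] + suffix


def split_initial_diff(initial_diff):
    # initial_diff is "-<sandhied_initial>+<initial>"
    if not initial_diff.startswith("-"):
        raise ValueError("a diff must start with '-'")
    sandhied, _, initial = initial_diff[1:].partition("+")
    return sandhied, initial


print(apply_lemma_diff("prezyate", "-1+"))    # prezyat
print(apply_lemma_diff("prezyate", "-6+I"))   # prI
print(apply_lemma_diff("aprezyata", "-1+"))   # aprezyat
print(apply_lemma_diff("aprezyata", "-6+I"))  # aprI
print(split_initial_diff("-'+a"))             # ("'", 'a')
```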
The Part-of-Speech tags are attributed based on the file of origin in the Sanskrit Heritage Resources.
- SL_indecls.xml : Indeclinable
- SL_final.xml : Noun
- SL_nouns.xml : Noun
- SL_pronouns.xml : Pronoun
- SL_roots.xml : Verb
- SL_parts.xml : Verb
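The file-to-tag mapping above can be written as a small lookup table. This is a sketch only: `pos_code_for_file` and the directory in the sample path are hypothetical, and the numeric codes are the ones listed in the format section. Preverb is not derived from any of the listed files, so it has no entry here.

```python
from pathlib import PurePosixPath

POS_CODES = {"Indeclinable": 0, "Noun": 1, "Pronoun": 2, "Verb": 3}

FILE_TO_POS = {
    "SL_indecls.xml": "Indeclinable",
    "SL_final.xml": "Noun",
    "SL_nouns.xml": "Noun",
    "SL_pronouns.xml": "Pronoun",
    "SL_roots.xml": "Verb",
    "SL_parts.xml": "Verb",
}


def pos_code_for_file(path):
    # Only the file name matters, not the directory it sits in.
    name = PurePosixPath(path).name
    return POS_CODES[FILE_TO_POS[name]]


# "some/dir" is a placeholder directory, not the real resource location.
print(pos_code_for_file("some/dir/SL_roots.xml"))  # 3
```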
As a workaround for the incorrect segmentation that is unavoidable with the Maximal Matching strategy, we provide support for multi-token lemmas.
For example, atikramati should be segmented as ati kramati, yet atikrama is longer than ati. The Maximal Matching algorithm therefore takes the longest existing word and segments the form as atikrama ti.
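The problem can be reproduced with a toy greedy segmenter. This is a minimal sketch of longest-match segmentation, not the tokenizer actually used with these resources: `max_match` is a hypothetical name, the three-word lexicon is invented for the demonstration, and grouping uncovered characters into one leftover token is an assumption.

```python
def max_match(text, lexicon):
    # Greedy left-to-right segmentation: always take the longest known word.
    tokens = []
    unknown = ""  # characters not covered by any lexicon entry
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in lexicon:
                if unknown:
                    tokens.append(unknown)
                    unknown = ""
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No entry starts here: keep the character as leftover.
            unknown += text[i]
            i += 1
    if unknown:
        tokens.append(unknown)
    return tokens


lexicon = {"ati", "atikrama", "kramati"}
print(max_match("atikramati", lexicon))  # ['atikrama', 'ti']

# Adding the whole form as a single entry avoids the wrong split.
lexicon.add("atikramati")
print(max_match("atikramati", lexicon))  # ['atikramati']
```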
We propose to add to the lexical resources a new entry for atikramati as a whole, with the following format:
<inflected_form>,<multi-token_lemma>
- <multi-token_lemma> : <token1>⟾<token2>⟾<tokenN>
- <token> : <token_string><POS_number>_<indices>
- <indices> : <start>><end> (from the first character)
Thus, atikramati,ati4_1>3⟾kram3_4>10_-1 is analyzed as follows:
- atikramati : inflected form
- ati4_1>3 : lemma: ati, POS: Preverb, starting char: 1, ending char: 3
- kram3_4>10 : lemma: kram, POS: Verb, starting char: 4, ending char: 10
- _-1 : POS: multi-token lemma
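The analysis above can be reproduced with a short parser. The sketch below is illustrative: `parse_multi_token` is a hypothetical helper, and it assumes that lemma strings contain no digits, that indices are 1-based and inclusive, and that every multi-token entry ends in `_-1` as in the example.

```python
import re

POS_TAGS = {0: "Indeclinable", 1: "Noun", 2: "Pronoun", 3: "Verb", 4: "Preverb"}

# <token_string><POS_number>_<start>><end>
TOKEN_RE = re.compile(r"(?P<lemma>.+?)(?P<pos>\d+)_(?P<start>\d+)>(?P<end>\d+)")


def parse_multi_token(line):
    form, lemma_part = line.rstrip("\n").split(",", 1)
    if not lemma_part.endswith("_-1"):
        raise ValueError("expected the trailing multi-token marker _-1")
    lemma_part = lemma_part[: -len("_-1")]

    tokens = []
    for chunk in lemma_part.split("⟾"):
        match = TOKEN_RE.fullmatch(chunk)
        if match is None:
            raise ValueError(f"malformed token: {chunk!r}")
        start, end = int(match["start"]), int(match["end"])
        tokens.append(
            {
                "lemma": match["lemma"],
                "pos": POS_TAGS[int(match["pos"])],
                # 1-based inclusive indices map to a 0-based half-open slice.
                "surface": form[start - 1 : end],
            }
        )
    return form, tokens


form, tokens = parse_multi_token("atikramati,ati4_1>3⟾kram3_4>10_-1")
print(form)
for token in tokens:
    print(token)
```

With the example entry, the two surface slices come out as `ati` and `kramati`, matching the intended segmentation.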
Note: see this readme