Pali is a morphologically rich language, like Sanskrit or Latin.
For this reason dealing with it always gets my mind thinking about automatic ways to study morphology. The present repo is a place to keep such experiments.
$ python pali.py uppa UPPA 2 uppajji 2 uppannanti 2 uppannaphalo ...etc... 38 uppajjanti 38 uppajjeyyuṃ 40 uppajjeyya 48 uppanno 60 uppannā 87 uppannaṃ 295 uppajjati
Thu Feb 26 11:48:55 EST 2009
Some of you may be wondering, "why the ‽‽‽ would you want to count the Pali words that start with uppa?"
A fair question.
This is the first step in a larger project, which is intended to answer questions like the following:
Given a particular words, can you find other words of the same paradigm?
I give you the word "uppajjati". Can you give me back a list of words from the corpus that have the same grammatical paradigm?
Can you automatically generate lists of paradigms?
Once you can answer the first question, it should be able to generate a list of such paradigms. This means that you could automatically generate something that serves a general purpose like that of a grammatical category (a part of speech tag).
It doesn't have to just with Pali; in theory whatever works for this language could be interesting for any language with an affixing morphology. If feed this program the Latin word "amo," it should give you back "amat, amemus, amabit" and a bazillion other forms of that verb. (Note that we don't expect anything resembling the traditional ordering; if anything, we'd probably order the forms by frequency.) And if the word were a noun, say "libro," we'd want back forms from that noun's declension.
The point to the whole project is that this would be purely statistical, and thus language independent. I'm focusing on a particular language because I find that trying to think about this stuff in abstract terms makes me dizzy. It's easier for me to try to first solve a problem like "give me words like uppajjati" than "write a morphological inference engine."
Will it work? Will it be useful? We'll see.
Sat Feb 28 18:36:40 EST 2009
In search of the aorist.
In chapter 4 of Warder's Introduction to Pali, he introduces the aorist form of verbs. It's a kind of past tense, apparently.
Here's the paradigm:
|3rd person||upasaṃkami, “he approached”||upasaṃkamiṃsu|
|1st person||upasaṃkamiṃ||upasaṃkamiṃhā (or -imha)|
Okay so, who cares. But let's take this as a simple (and rather arbitrary) test case. Just from comparing what's in that table, it looks like there is a root that goes "upasaṃkam". (Actually Warder tells us that the root is "kam", "upa" and "saṃ" are prefixes—again, doesn't matter at the moment.)
Keeping in mind that our goal is to be as language-independent as possible, how can we extract that "upasaṃkam" bit from a list of the forms in the table? (Sort of cheating to assume we start with that paradigm, but we need to start somewhere.) Looks like a job for the Longest common substring (LCS) algorithm.
def LCS(S, T): # from Wikibook link above m = len(S); n = len(T) L = [ * (n+1) for i in xrange(m+1)] lcs = 0 for i in xrange(m): for j in xrange(n): if S[i] == T[j]: L[i+1][j+1] = L[i][j] + 1 lcs = max(lcs, L[i+1][j+1]) return S[:lcs] def extract_stem(formlist): stem = LCS(formlist, formlist) for i in range(len(formlist)-1): stem = LCS(stem, formlist[i]) return stem formlist = "upasaṃkami upasaṃkamiṃsu upasaṃkami upasaṃkamittha upasaṃkamiṃ upasaṃkamiṃhā upasaṃkamimha".split() print extract_stem(formlist)
So if we run that we get:
~/palihack$ python lcs_upsamkam.py upasaṃkami
Heh, all the suffixes in this whole paradigm happen to begin with "i", which makes it look as if the stem were "upasaṃkami" instead of "upasaṃkam". That's kind of interesting.
But it looks like we're heading down the wrong track. If our goal is to "exemplify the Pali aorist," we need to 1) figure out what the aorist suffixes are and 2) collect words in the corpus which occur with those suffixes.
So what if I were to ask you to do this by hand, right now, what would you do? Well, you'd look at the list of words I gave you (the ones in @formlist@ above), and you'd go through text looking for words that "shared a suffix." Now what does that mean? It means, if you saw: