In [3]:
from tamil_segmenter_modified import stem
import sys

## Segmenter Analysis

In this notebook, we analyze the segmenter used in this work. We adapted a pre-existing [Tamil stemmer](https://github.com/rdamodharan/tamil-stemmer) into a rule-based morphological segmenter using [sbl2py](https://github.com/torfsen/sbl2py). Below we show some of the analysis and evaluation we carried out on the segmenter, along with its characteristic success and failure modes.

#### Basic Examples
A small list of words shows the output of the segmenter. The expected stems are given in a comment at the end of the cell.

In [2]:
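# A small list of words to show the output of the segmenter.
# The first word is included because it appears in the recorded output below.
words = ['பாப்போம்', 'மரத்தின்', 'அக்காலம்', 'அக்காலத்தின்']
for word in words:
    print(word)
    print(str(stem(word)))

# Expected stems for the last three words: maram, kaalam, kaalam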

பாப்போம்
('பா', [], ['tense_suffix_oom'])
மரத்தின்
('மரம்', [], ['vetrumai_suffix_in'])
அக்காலம்
('காலம்', ['pronoun_prefix_a'], [])
அக்காலத்தின்
('காலம்', ['pronoun_prefix_a'], ['vetrumai_suffix_in'])
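
As the outputs above show, the segmenter returns a `(stem, prefixes, suffixes)` tuple. A minimal sketch of a helper that renders such a tuple as one readable string is given below. `format_segmentation` is a hypothetical name, not part of the segmenter, and the sample tuple is copied from the recorded output above rather than recomputed.

```python
def format_segmentation(result):
    """Render a (stem, prefixes, suffixes) tuple as 'prefix+ stem +suffix'."""
    root, prefixes, suffixes = result
    # Mark prefixes with a trailing '+' and suffixes with a leading '+'.
    parts = [p + "+" for p in prefixes] + [root] + ["+" + s for s in suffixes]
    return " ".join(parts)


# Tuple copied from the recorded output for 'அக்காலத்தின்'.
recorded = ('காலம்', ['pronoun_prefix_a'], ['vetrumai_suffix_in'])
print(format_segmentation(recorded))
# prints: pronoun_prefix_a+ காலம் +vetrumai_suffix_in
```

In the notebook itself, the same helper could be applied directly to `stem(word)`.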


#### Success Modes of the Segmenter

(1) Question prefix 'e'/('what/which'):

In [4]:
# GOOD
word = 'எக்காலம்' # kaalam
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('காலம்', ['question_prefix'], [])


(2) Suffix 'um'/('also')

In [10]:
# GOOD
word = 'அவனும்' # avan
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('அவன்', [], ['suffix_um'])


(3) Question suffix 'aa'/('is it?'):

In [8]:
# GOOD
word = 'கண்ணனா' # kannan
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('கண்ணன்', [], ['question_suffix_aa'])


(4) Suffix 'idam'/('to, with')

In [4]:
# GOOD
word = 'அவனிடம்' # avan
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('அவன்', [], ['common_suffix_idam'])


(5) Plural suffix 'gal'

In [5]:
# GOOD

word = 'மரங்கள்' # maram
print("Our stemmer: " + str(stem(word)))
# note the velarization of the nasal in the plural: மரம்/மரங்கள்

Our stemmer: ('மரம்', [], ['plural_suffix'])


(6) Causative suffix 'pi'

In [14]:
# GOOD

word = 'காண்பி' # kaan
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('காண்', [], ['command_suffix_pi'])


(7) Possible false morpheme 'manikkam' ('gem')

In [25]:
# GOOD

word = 'மாணிக்கம்'
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('மாணிக்கம்', [], [])


(8) Complex form: past progressive plural 'cey' ('to do')

In [24]:
# GOOD

word = 'செய்துக்கொண்டிர்ந்தார்'
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('செய்', [], ['tense_suffix_kondir', 'tense_suffix_aar'])


(9) Complex form: present progressive 'piri' ('to split')

In [15]:
# GOOD

word = 'பிரிகின்றன' # piri
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('பிரி', [], ['tense_suffix_kinra', 'tense_suffix_na'])


(10) Complex form: Denominal present progressive plural 1st person question 'kaStappaDa' ('how much do we struggle/experience difficulty')

In [26]:
# GOOD

word = 'எக்கஷ்டப்படுகிறோம்' # kashtam
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('கஷ்ட', ['question_prefix'], ['tense_suffix_padu', 'tense_suffix_kir', 'tense_suffix_oom'])


(11) Possible false morpheme 'uLLa'

In [30]:
# GOOD

word = 'உள்ள' # ulla
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('உள்ள', [], [])


(12) Bare progressive morpheme 'konDiru'

In [7]:
# GOOD
word = 'கொண்டிரு'
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('', [], ['tense_suffix_kondiru'])


(13) Bare root 'cey' ('to do')

In [34]:
# GOOD

word = 'செய்' # sey
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('செய்', [], [])


(14) Possible false morpheme '-ppu' in 'kalakalappu' ('excitement')

In [37]:
# GOOD
stem('கலகலப்பு')

('கலகலப்பு', [], [])

(15) Instrumental case marker '-aal'

In [46]:
# GOOD

stem('அவனால்')

('அவன்', [], ['vetrumai_suffix_aal'])

(15) Noun stem 'kaSTam' ('difficulty')

In [50]:
# GOOD

stem('கஷ்டம்')

('கஷ்டம்', [], [])

(16) Complex form: Denominal present progressive plural 1st person 'kaStappaDa' ('to struggle/experience difficulty')

In [51]:
# GOOD (some issues of polysemy with the morpheme 'padu')

stem('கஷ்டப்படுகிறோம்')

('கஷ்ட', [], ['tense_suffix_padu', 'tense_suffix_kir', 'tense_suffix_oom'])

(17) Non-segmentable stem 'makal' ('daughter')

In [55]:
# GOOD

stem('மகள்')

('மகள்', [], [])

(18) Non-segmentable stem 'arivu' ('knowledge')

In [38]:
# GOOD

word = 'அறிவு'
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('அறிவு', [], [])


(19) Bare root 'sAppiDu' ('to eat')

In [67]:
# GOOD

stem('சாப்பிடு')

('சாப்பிடு', [], [])

(20) Gerund (having VERB-ed) 'amai' ('to be')

In [73]:
# GOOD

stem('அமைத்து')

('அமை', [], ['vetrumai_suffix_thu'])

(21) Noun to adverb 'manithan' ('person')

In [80]:
# GOOD

stem('மனிதனாக') # manithan

('மனிதன்', [], ['vetrumai_suffix_aaka'])

(22) Possibly ambiguous stem 'katti' ('knife') + instrumental case 

In [86]:
# GOOD

stem('கத்தியால்') # kaththi

('கத்தி', [], ['vetrumai_suffix_aal'])

(23) Possibly ambiguous stem 'kai' ('hand') + instrumental case 

In [87]:
# GOOD

stem('கையால்') # kai

('கை', [], ['vetrumai_suffix_aal'])

(24) Passive voice + verb 'ezhuda' ('write')

In [89]:
# GOOD

stem('எழுதப்பட்டது')

('எழுத', [], ['common_suffix_pattathu'])

(24) Verb 'cey' ('do') + aspectual verb 'vai' + past 3rd person plural

In [95]:
# GOOD

stem('செய்துவைத்தார்')

('செய்', [], ['tense_suffix_thu', 'tense_suffix_vai', 'tense_suffix_aar'])

(25) Possibly ambiguous stem 'pATTi' ('grandma') 

In [10]:
# GOOD
# Resembles the gerund from a hypothetical verb 'pATTa'
stem('பாட்டி')

('பாட்டி', [], [])

#### Failure Modes of the Segmenter

(26) Compare to (1): 'எக்காளம்' ('trumpet') superficially resembles an interrogative form with the question prefix 'e'.

In [5]:
# BAD (false morpheme)

word = 'எக்காளம்' # ekkaalam (trumpet)
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('காளம்', ['question_prefix'], [])


(27) Stem 'avan' incorrectly broken; negation assigned only to the 'A' in 'illA'

In [11]:
# BAD (incorrectly split morphemes)

word = 'அவனில்லாத' # avan # BUG HERE
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('அவனில்', [], ['tense_suffix_aa', 'tense_suffix_tha'])


(28) Compare to (1): the word 'eppadi' ('how') is broken into the question prefix 'e' and 'padi' ('step/stair').

In [7]:
# BAD (false morpheme)

word = 'எப்படி' # eppadi
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('படி', ['question_prefix'], [])


(29) False morpheme: 'mayil' ('peacock') is broken into 'may' and '-il' (locative suffix)

In [17]:
# BAD (false morpheme)

word = 'மயில்' # mayil
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('மய்', [], ['vetrumai_suffix_il'])


In [19]:
# BAD (false morpheme)

word = 'புயல்' # puyal
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('புய்', [], ['vetrumai_suffix_l'])


In [None]:
# BAD (false morpheme)

word = 'வெயில்' # veyil
print("Our stemmer: " + str(stem(word)))

(30) Past participle suffix not identified

In [75]:
# BAD

stem('ஆழ்ந்த')

('ஆழ்', [], [])

(31) Eroded stem 'padutta' + incorrect morphemes

In [11]:
# BAD (multiple reasons)

word = 'படுத்தல்' # padu
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('ப்', [], ['tense_suffix_tum', 'vetrumai_suffix_l'])
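
In the output above the stem is eroded to a single consonant. Elsewhere in this notebook the stem comes back empty (the bare 'konDiru' case) or as a lone letter. A rough sanity check for flagging such results for review is sketched below. `is_degenerate_stem` is a hypothetical helper, not part of the segmenter, and its rule (empty, or one consonant with or without a pulli) is an assumption chosen to fit these examples. It will not catch every eroded stem, for example 'இர்'.

```python
PULLI = '\u0bcd'  # Tamil virama: marks a consonant that carries no vowel


def is_tamil_consonant(ch):
    """True if `ch` lies in the consonant range of the Unicode Tamil block."""
    return '\u0b95' <= ch <= '\u0bb9'


def is_degenerate_stem(stem_text):
    """Flag stems that are empty or a single consonant (assumed heuristic)."""
    if stem_text == '':
        return True
    bare = stem_text.rstrip(PULLI)  # drop a trailing pulli, if any
    return len(bare) == 1 and is_tamil_consonant(bare)


# Stems copied from outputs recorded in this notebook.
for candidate in ['ப்', 'ம', '', 'கை', 'செய்']:
    print(repr(candidate), is_degenerate_stem(candidate))
```

Stems such as 'கை' and 'செய்' pass because they contain a vowel sign or more than one letter, so legitimate short roots are not flagged by this rule.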


(32) Incorrect sequence of morphemes (for word meaning 'which types')

In [27]:
# BAD (false morpheme)

stem('எத்தகையது')

('தகை', ['question_prefix'], ['vetrumai_suffix_thu'])

(33) Unable to break up serial verb 'paNNimuDi' into 'paNNi' + 'muDi'

In [16]:
# BAD (incorrect stem)

word = 'பண்ணிமுடித்தவர்கள்'
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('பண்ணிமுடி', [], ['tense_suffix_tha', 'tense_suffix_var', 'plural_suffix'])


(34) 'Kuzhal' ('flute') broken into 'kuzh' + 'al'

In [22]:
# BAD (false morpheme)

word = 'குழல்'
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('குழ்', [], ['vetrumai_suffix_l'])
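
The false splits of 'mayil', 'puyal', and 'kuzhal' above share a pattern: a noun with no internal structure ends in a sequence that looks like a case suffix. One possible mitigation, which is not part of the segmenter evaluated here, is a small exception list consulted before the rules run. The sketch below assumes a hypothetical wrapper `segment_with_exceptions` and a hand-built list `MONOMORPHEMIC`. To stay self-contained it replays outputs recorded in this notebook through `recorded_stem` instead of calling the real `stem`.

```python
# Hypothetical exception list: words this notebook shows split on a false suffix.
MONOMORPHEMIC = {'மயில்', 'புயல்', 'குழல்'}


def segment_with_exceptions(word, segment, exceptions=MONOMORPHEMIC):
    """Return `word` unsegmented if it is a listed exception, else call `segment`."""
    if word in exceptions:
        return (word, [], [])
    return segment(word)


# Stand-in for the real `stem`, replaying two outputs recorded in this notebook.
RECORDED = {
    'மயில்': ('மய்', [], ['vetrumai_suffix_il']),
    'அவனால்': ('அவன்', [], ['vetrumai_suffix_aal']),
}


def recorded_stem(word):
    return RECORDED[word]


print(segment_with_exceptions('மயில்', recorded_stem))   # listed: left whole
print(segment_with_exceptions('அவனால்', recorded_stem))  # not listed: segmented
```

An exception list only covers words someone has already noticed, so it addresses known false morphemes and does not fix the underlying rules.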


(35) Stem 'athu': irregular sandhi form misidentified

In [31]:
# BAD (incorrect stem)

stem('அதனுடைய')

('அதன்', [], ['common_suffix_udai'])

(36) Stem 'athu': irregular sandhi form misidentified

In [39]:
# BAD (incorrect stem)

stem('அதற்கு')

('அத', [], ['tense_suffix_kku'])

(37) Unable to break down compound 'paDipparivu' into 'padippu' + 'arivu'

In [35]:
# BAD (two root problem)

word = 'படிப்பறிவு'
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('படிப்பறிவு', [], [])


(38) Word 'maRRum' ('and') oversegmented into 'ma'

In [40]:
# BAD (false morpheme)

stem('மற்றும்')

('ம', [], ['suffix_um'])

(39) Complex form incorrectly segmented: correct suffix but root 'vara' ('come') misidentified and 'ntha' specifier missing

In [43]:
# BAD

word = 'வந்தவர்களின்' # vanthavarkalin
print("Our stemmer: " + str(stem(word)))

Our stemmer: ('வ', [], ['tense_suffix_var', 'plural_suffix', 'vetrumai_suffix_in'])


(40) Complex form incorrectly segmented: correct suffix but root 'tara' ('give') misidentified and 'ntha' specifier missing

In [48]:
# BAD (misses the ntha morpheme)

stem("தந்தவர்கள்")

('த', [], ['tense_suffix_var', 'plural_suffix'])

(41) Irregular gerund of 'sappiDa' ('eat') not segmented

In [49]:
# BAD (doesn't stem irregular forms)

stem('சாப்பிட்டு')

('சாப்பிட்டு', [], [])

(42) Root misidentified in past progressive of 'colla' ('to tell')

In [57]:
# BAD (incorrect stem)

stem('சொல்லிக்கொண்டிருந்தார்')

('சொல்லி', [], ['tense_suffix_kondiru', 'tense_suffix_aar'])

(43) Unable to break down compound 'palAppazham' ('jackfruit') into 'palA' + 'pazham'

In [58]:
# BAD (multiple roots)

stem('பலாப்பழம்')

('பலாப்பழம்', [], [])

(44) Stem 'irukka' ('to be') oversegmented and eroded

In [59]:
# BAD (incorrect stem)

stem('இருக்க')

('இர்', [], ['tense_suffix_ka'])

(45) Complex form incorrectly segmented: correct suffix but root 'irukka' ('to be') misidentified and 'ntha' specifier missing

In [61]:
# BAD

stem('இருந்தவர்களின்')

('இரு', [], ['tense_suffix_var', 'plural_suffix', 'vetrumai_suffix_in'])

(46) Root 'Varuthapadu' oversegmented into 'varu' (with false morpheme 'tha')

In [65]:
# BAD (false morphemes)

stem('வருத்தப்படுத்த')

('வரு', [], ['tense_suffix_tha', 'tense_suffix_padu', 'tense_suffix_tha'])

(47) Complex form incorrectly segmented: correct suffix but root 'sAppiDa' ('to eat') not separated from specifier

In [66]:
# BAD (doesn't stem all the way)

stem('சாப்பிட்டவர்கள்')

('சாப்பிட்ட', [], ['tense_suffix_var', 'plural_suffix'])

(48) Complex form incorrectly segmented: correct suffix but root 'po' ('to go') not separated from specifier

In [68]:
# BAD (stem not normalised in tense)

stem('போனவர்கள்')

('போன', [], ['tense_suffix_var', 'plural_suffix'])

(47) Specifier 'ya' not listed in suffixes

In [69]:
# BAD (just ignores a morpheme)

stem('போகிய')

('போகி', [], [])

(48) Specifier 'ntha' not listed in suffixes

In [71]:
# BAD (another ntha)

stem('அமைந்த')

('அமை', [], [])

(49) Aspectual verb 'vidu' included in root 'amai'

In [72]:
# BAD, doesn't go all the way

stem('அமைத்துவிட்டார்')

('அமைத்துவிடு', [], ['tense_suffix_aar'])

(50) '-kkaran' profession/agentive particle not segmented

In [76]:
# BAD (irregular morphemes)

stem('வேலைக்காரன்')

('வேலைக்காரன்', [], [])

(51) Root 'aNNA' ('brother') oversegmented

In [79]:
# BAD (stems too far)

stem('அண்ணாவுக்காக') # BUG HERE also

('அண்', [], ['tense_suffix_aa', 'vetrumai_suffix_ukkaaka'])

(52) Inanimate noun class form of ablative '-ilirunthu' ignored 

In [88]:
# BAD (wrong form of stem)

stem('அதிலிருந்து')

('அதில்', [], ['vetrumai_suffix_irunthu'])

(53) Compare to the passive example in (24): 'common_suffix_pattathu' is used for the past tense, while 'tense_suffix_padu' is used for the present.

In [90]:
# BAD (different morphemes used for different tenses)

stem('எழுதப்படுகிறது')

('எழுத', [], ['tense_suffix_padu', 'tense_suffix_kira', 'vetrumai_suffix_thu'])

(54) Incorrect root for future tense verb.

In [91]:
# BAD (stem is of the wrong form)

stem('எழுதுவார்')

('எழுது', [], ['tense_suffix_aar'])

(55) Incorrect stem for kinship term 'mAmA' + verbal suffix identified for noun

In [92]:
# BAD (incorrect stem)

stem('மாமனார்')

('மாம', [], ['tense_suffix_naar'])

(55) Incorrect verb root identified for past tense

In [93]:
# BAD (incorrect root)

stem('சொன்னார்')

('சொன்', [], ['tense_suffix_naar'])

(56) 'avviDam' ('that place') broken into a nonsense root through overapplication of sandhi rules (and confusion of the word 'iDam' with the morpheme 'iDam')

In [18]:
# BAD

stem('அவ்விடம்')

('வ்', ['pronoun_prefix_a'], ['common_suffix_idam'])

(57) Complex form undersegmented (root 'kaTTa' ('to tie') left in gerund form)

In [97]:
# BAD (bad stem)

stem('கட்டியவர்கள்')

('கட்டி', [], ['tense_suffix_var', 'plural_suffix'])

(58) Complex form undersegmented (root 'pAda' ('to sing') left in gerund form).

In [19]:
# BAD (bad stem)

stem('பாடியவர்கள்')

('பாடி', [], ['tense_suffix_var', 'plural_suffix'])

(59) Stem 'paccai' ('green') oversegmented into false root with accusative ending '-ai'

In [20]:
# BAD (false morpheme)

stem('பச்சை')

('ப', [], ['vetrumai_suffix_ai'])

(60) Stem 'koTTAvi' ('yawn') oversegmented into nonsensical mix of negation and causative suffix with incorrect root

In [102]:
# BAD

stem('கொட்டாவி')

('கொடு', [], ['tense_suffix_aa', 'command_suffix_vi'])
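
The GOOD and BAD labels above were assigned by hand. A sketch of how the same stem checks could be tabulated automatically is given below. `stem_accuracy`, `RECORDED`, and `GOLD` are hypothetical names. The predictions are copied from cells in this notebook rather than recomputed, and the expected stems follow the notebook's comments and case descriptions.

```python
def stem_accuracy(segment, gold_stems):
    """Score a segmenter against a word -> expected-stem mapping.

    `segment` is any callable returning (stem, prefixes, suffixes).
    Returns (accuracy, misses); misses holds (word, predicted, expected).
    """
    if not gold_stems:
        raise ValueError("gold_stems must not be empty")
    misses = []
    for word, expected in gold_stems.items():
        predicted = segment(word)[0]  # first tuple element is the stem
        if predicted != expected:
            misses.append((word, predicted, expected))
    hits = len(gold_stems) - len(misses)
    return hits / len(gold_stems), misses


# Outputs copied from this notebook (two success cases, three failure cases).
RECORDED = {
    'மரத்தின்': ('மரம்', [], ['vetrumai_suffix_in']),
    'அவனும்': ('அவன்', [], ['suffix_um']),
    'மயில்': ('மய்', [], ['vetrumai_suffix_il']),
    'பச்சை': ('ப', [], ['vetrumai_suffix_ai']),
    'மற்றும்': ('ம', [], ['suffix_um']),
}
# Expected stems: the failure cases should have been left unsegmented.
GOLD = {
    'மரத்தின்': 'மரம்',
    'அவனும்': 'அவன்',
    'மயில்': 'மயில்',
    'பச்சை': 'பச்சை',
    'மற்றும்': 'மற்றும்',
}

accuracy, misses = stem_accuracy(RECORDED.__getitem__, GOLD)
print(f"stem accuracy: {accuracy:.2f}")
for word, predicted, expected in misses:
    print(f"{word}: got {predicted!r}, expected {expected!r}")
```

In the notebook, passing `stem` in place of `RECORDED.__getitem__` would score the live segmenter. This sketch compares stems only; checking the prefix and suffix lists would need gold annotations for those as well.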

#### Examples from Dataset