# Tokens and analysis

This demo is based on 123844 tokens with 205005 analysis.

We have a file of tokens with analyses where every line has json encoded form
```
token, [(lemma, root, root_tokens, ending, clitic, partofspeech, form), ...]
```
For example:
```
["Ühtedel", [["üks", "üks", ["üks"], "del", "", "N", "pl ad"], ["üks", "üks", ["üks"], "del", "", "P", "pl ad"]]]
```

## Create a dict analysis_examples
```
{(cap, dash, eq, ending, clitic, partofspeech, form): token, ...}
```
where 

        cap == True, if token starts with a capital letter,
        cap == False, if token starts with a lower case letter,
        cap == None, if token does not start with a letter,
        dash == True, if '-' in the root,
        dast == False if '-' not in the root,
        eq == True, if '=' in the root,
        eq == False if '=' not in the root.

In [1]:
from collections import Counter
from tokens_analysis import yield_tokens_analysis
file='../corpora/tokens_analysis.json'

analysis_examples = {}
analysis_counter = Counter()
for token, analysis in yield_tokens_analysis(file):
    if token[0].islower():
        cap = False
    elif token[0].isupper():
        cap = True
    elif not token[0].isalpha():
        cap = None
    for a in analysis:
        key = (cap, 
               '-' in a['root'], 
               '=' in a['root'], 
               a['ending'], 
               a['clitic'], 
               a['partofspeech'], 
               a['form'])
        analysis_examples[key] = token
        analysis_counter[key] += 1

print('20 most common analysis out of', len(analysis_examples), '\n')
for a, n in analysis_counter.most_common(20):
    print('{:<50} {:<20} {:>10}'.format(str(a), analysis_examples[a], n))

20 most common analysis out of 1658 

(True, False, False, '0', '', 'H', 'sg n')         Προϊόντα                  18159
(True, False, False, '0', '', 'H', 'sg g')         Εcu                       13890
(False, False, False, '0', '', 'S', 'sg n')        τόνο                      12402
(False, False, False, '0', '', 'S', 'sg g')        žürii                     12330
(False, False, False, '0', '', 'S', 'sg p')        žiletitera                 5039
(False, False, False, 'd', '', 'S', 'pl n')        žüriid                     4663
(False, False, False, '0', '', 'S', 'adt')         želatiini                  4043
(True, False, False, '0', '', 'S', 'sg n')         Žürii                      3654
(False, False, False, 't', '', 'S', 'sg p')        žiirot                     3427
(True, False, False, '0', '', 'S', 'sg g')         Žürii                      3353
(False, False, False, 's', '', 'S', 'sg in')       žanris                     3112
(False, False, False, 'st', '', 'S', 'sg el')    

## Text of tokens that have pairwise different analysis

Maybe we can use this text for testing syntax_preprocessing efficiently.

In [2]:
' '.join(list(analysis_examples.values())[:100])

'Üksikjuhte ÜRO-s ühelegi õnnestubki Teist Üleminekuperioodiks Tuulikki emadel-isadel sügavaimas sala-asju Üheks ümbritsevatesse Üsna Üritused Läänemeremaadeni viitena Ükski Üritusel Šotimaal üheteistkümnendaks Venemaa-siseste Ülemöödunud teerulli-hääletamisi Vanust ühiskondlik-majanduslike Ameerika-turneed Neilt üritustel mõjutamast-allusioneerimast üle-eelmisega üheski võtke välis- teistega 24-käigulisega vakantsele üleajalisem Ütled viite ümberkaudsetelt i-mitmuseja Paljudest Vanusega Ähvardavad vanul emba-kumba Viiakse Ülevaim Pikendamisel Ärge Võttis Noor-Tartuga idapoolseisse resultatiivseimana Polnud õlijaid Üritusele Zaragoza-Soria möödagi käsitlevais 21-aastastelt üllatusena ülimenukateks Keila-poolsele ütlemata tele-eetrisse üheski üht-kaht Sadade Šokeeriv Oscari-filmidest WC-kabiinide Muudegi Innustatuna Üldhariduslike Šokeeriv tegeldavat samadesse Üleujutuspiirkondades Ongi Üliõpilastel 45-50-ruutmeetrised dOisans-Courchevel viiakse Sobig.F-ile pisimategi tuhandetesse ütlem

## Pronouns in corpora

Find all tokens that are marked as pronouns (partofspeech P). Group the pronouns by regexes from \_pronConversions. The class 'no_search_match' is for pronouns that don't match any regex.

In [3]:
_pronConversions = ( ("emb\+.*",             ("det",)),
                     ("enda\+.*",            ("pos", "refl")),
                     ("enese\+.*",           ("pos", "refl")),
                     ("eikeegi.*",           ("indef",)),
                     ("eimiski.*",           ("indef",)),
                     ("emb-kumb.*",          ("det",)),
                     ("esimene.*",           ("dem",)),
                     ("iga\+.*",             ("det",)),
                     ("iga_sugune.*",        ("indef",)),
                     ("iga_.ks\+.*",         ("det",)),
                     ("ise\+.*",             ("pos", "det", "refl")),
                     ("ise_enese.*",         ("refl",)),
                     ("ise_sugune.*",        ("dem",)),
                     ("keegi.*",             ("indef",)),
                     ("kes.*",               ("inter rel",)),
                     ("kumb\+.*",            ("rel",)),
                     ("kumbki.*",            ("det",)),
                     ("kõik.*",              ("det",)),
                     ("k.ik.*",              ("det",)),
                     ("meie_sugune.*",       ("dem",)),
                     ("meie_taoline.*",      ("dem",)),
                     ("mihuke\+.*",          ("inter rel",)),
                     ("mihukene\+.*",        ("inter rel",)),
                     ("mille_taoline.*",     ("dem",)),
                     ("milli=?ne.*",         ("rel",)),
                     ("mina\+.*",            ("pers ps1",)),
                     (" ma\+.*",             ("pers ps1",)),
                     ("mina=?kene\+.*",      ("dem",)),
                     ("mina=?ke\+.*",        ("dem",)),
                     ("mingi\+.*",           ("indef",)),
                     ("mingi_sugune.*",      ("indef",)),
                     ("minu_sugune.*",       ("dem",)),
                     ("minu_taoline.*",      ("dem",)),
                     ("miski.*",             ("indef",)),
                     ("mis\+.*",             ("inter rel",)),
                     ("mis_sugune.*",        ("inter rel",)),
                     ("miski\+.*",           ("inter rel",)),
                     ("miski_sugune.*",      ("inter rel",)),
                     ("misu=?ke(ne)?\+.*",   ("dem",)),
                     ("mitme_sugune.*",      ("indef",)),
                     ("mitme_taoline.*",     ("indef",)),
                     ("mitmendik\+.*",       ("inter rel",)),
                     ("mitmes\+.*",          ("inter rel", "indef")),
                     ("mi=?tu.*",            ("indef",)),
                     ("miuke(ne)?\+.*",      ("inter rel",)),
                     ("muist\+.*",           ("indef",)),
                     ("muu.*",               ("indef",)),
                     ("m.lema.*",            ("det",)),
                     ("m.ne_sugune\+.*",     ("indef",)),
                     ("m.ni\+.*",            ("indef",)),
                     ("m.ningane\+.*",       ("indef",)),
                     ("m.ningas.*",          ("indef",)),
                     ("m.herdune\+.*",       ("indef", "rel")),
                     ("määntne\+.*",         ("dem",)),
                     ("na_sugune.*",         ("dem",)),
                     ("nende_sugune.*",      ("dem",)),
                     ("nende_taoline.*",     ("dem",)),
                     ("nihuke(ne)?\+.*",     ("dem",)),
                     ("nii_mi=?tu\+.*",      ("indef", "inter rel")),
                     ("nii_sugune.*",        ("dem",)),
                     ("niisama_sugune.*",    ("dem",)),
                     ("nii?su=?ke(ne)?\+.*", ("dem",)),
                     ("niuke(ne)?\+.*",      ("dem",)),
                     ("oma\+.*",             ("pos", "det", "refl")),
                     ("oma_enese\+.*",       ("pos",)),
                     ("oma_sugune\+.*",      ("dem",)),
                     ("oma_taoline\+.*",     ("dem",)),
                     ("palju.*",             ("indef",)),
                     ("sama\+.*",            ("dem",)),
                     ("sama_sugune\+.*",     ("dem",)),
                     ("sama_taoline\+.*",    ("dem",)),
                     ("samune\+.*",          ("dem",)),
                     ("see\+.*",             ("dem",)),
                     ("see_sama\+.*",        ("dem",)),
                     ("see_sam[au]ne\+.*",   ("dem",)),
                     ("see_sinane\+.*",      ("dem",)),
                     ("see_sugune\+.*",      ("dem",)),
                     ("selle_taoline\+.*",   ("dem",)),
                     ("selli=?ne\+.*",       ("dem",)),
                     ("setu\+.*",            ("indef",)),
                     ("setmes\+.*",          ("indef",)),
                     ("sihuke\+.*",          ("dem",)),
                     ("sina\+.*",            ("pers ps2",)),
                     (" sa\+.*",             ("pers ps2",)),
                     ("sinu_sugune\+.*",     ("dem",)),
                     ("sinu_taoline\+.*",    ("dem",)),
                     ("siuke(ne)?\+.*",      ("dem",)),
                     ("säherdune\+.*",       ("dem",)),
                     ("s.herdune\+.*",       ("dem",)),
                     ("säärane\+.*",         ("dem",)),
                     ("s..rane\+.*",         ("dem",)),
                     ("taoline\+.*",         ("dem",)),
                     ("teie_sugune\+.*",     ("dem",)),
                     ("teie_taoline\+.*",    ("dem",)),
                     ("teine\+.*",           ("dem",)),
                     ("teine_teise\+.*",     ("rec",)),
                     ("teist?_sugune\+.*",   ("dem",)),
                     ("tema\+.*",            ("pers ps3",)),
                     (" ta\+.*",             ("pers ps3",)),
                     ("temake(ne)?\+.*",     ("pers ps3",)),
                     ("tema_sugune\+.*",     ("dem",)),
                     ("tema_taoline\+.*",    ("dem",)),
                     ("too\+.*",             ("dem",)),
                     ("too_sama\+.*",        ("dem",)),
                     ("üks.*",               ("dem", "indef")),
                     (".ks.*",               ("dem", "indef")),
                     ("ükski.*",             ("dem", "indef")),
                     (".kski.*",             ("dem", "indef")),
                     ("üks_teise.*",         ("rec", "indef")),
                     (".ks_teise.*",         ("rec",))
)

import regex

search = regex.search
match = regex.match
pronoun_examples = {}
for a, b in _pronConversions:
    pronoun_examples[a] = []
no_search_match_examples = []

for token, analysis in yield_tokens_analysis(file):
    for a in analysis:
        if a['partofspeech'] != 'P':
            continue
        root_ec = ''.join((a['root'], '+', a['ending'], a['clitic']))
        search_match = False
        for pattern, l in pronoun_examples.items():
            if search(pattern, root_ec):
                # kas search või match? "enese" vs "iseenese"
                # kas token asemel võiks olla hoopis token.lower() või root või lemma (ja l oleks siis list asemel set)?
                l.append(token)
                search_match = True
        if not search_match:
            no_search_match_examples.append(token)
pronoun_examples['no_search_match'] = no_search_match_examples
for pattern, examples in pronoun_examples.items():
    print(pattern)
    print(examples)

m.lema.*
['Mõlema', 'Mõlemad', 'Mõlemaid', 'Mõlemal', 'Mõlemale', 'Mõlemas', 'Mõlemast', 'Mõlemate', 'mõlema', 'mõlemad', 'mõlemaid', 'mõlemaks', 'mõlemal', 'mõlemale', 'mõlemalt', 'mõlemani', 'mõlemas', 'mõlemasse', 'mõlemast', 'mõlemat', 'mõlemate', 'mõlematele', 'mõlematest']
mitme_taoline.*
[]
mihukene\+.*
['Mihukeselt']
mina=?ke\+.*
[]
mina\+.*
['-ma', 'MA', 'ME', 'ME', 'Ma', 'Me', 'Me', 'Meid', 'Meie', 'Meie', 'Meiegi', 'Meiegi', 'Meieni', 'Meil', 'Meile', 'Meilt', 'Mina', 'Minagi', 'Mind', 'Minu', 'Minugi', 'Minul', 'Minule', 'Minult', 'Minus', 'Mu', 'Mul', 'Mulgi', 'Mulle', 'Must', 'hingetõm-me', 'hingetõm-me', 'ma', 'me', 'me', 'meid', 'meidki', 'meie', 'meie', 'meiega', 'meiegi', 'meiegi', 'meieni', 'meil', 'meile', 'meilegi', 'meilgi', 'meilt', 'meis', 'meisse', 'meist', 'meistki', 'mina', 'minagi', 'mind', 'minu', 'minuga', 'minugi', 'minul', 'minule', 'minulgi', 'minult', 'minuni', 'minus', 'minusse', 'minust', 'mu', 'mul', 'mulle', 'mullegi', 'mult', 'musse', 'must', 'tul

## Morph analysis of tokens *vs* tmorftrtabel.txt

In [4]:
pos_form_pairs = {}
from estnltk.rewriting.syntax_preprocessing.syntax_preprocessing import MorphToSyntaxMorphRewriter
tmorph_file = '../estnltk/estnltk/rewriting/syntax_preprocessing/rules_files/tmorftrtabel.txt'
pos_form_pairs['in tmorftrtabel.txt'] = set(MorphToSyntaxMorphRewriter.load_fs_mrf_to_syntax_mrf_translation_rules(tmorph_file))

pos_form_pairs['in corpus'] = set()
for a in analysis_examples:
    pos_form_pairs['in corpus'].add((a[5], a[6]))
    
print('(pos, form) pairs in tmorftrtabel.txt, but not in corpus:')
print(pos_form_pairs['in tmorftrtabel.txt'] - pos_form_pairs['in corpus'])
print('\n(pos, form) pairs in corpus, but not in tmorftrtabel.txt:')
not_in_tmorftrtabel = pos_form_pairs['in corpus'] - pos_form_pairs['in tmorftrtabel.txt']
for pos_form in not_in_tmorftrtabel:
    print('{:<20}'.format(str(pos_form)), end='')
    for a in analysis_examples:
        if pos_form == (a[5], a[6]):
            print(analysis_examples[a], end=', ')
    print()

(pos, form) pairs in tmorftrtabel.txt, but not in corpus:
{('O', 'pl ter'), ('V', 'tuvat'), ('U', 'pl ab'), ('U', 'pl ter'), ('Y', 'pl ter'), ('N', 'pl ter'), ('Y', 'pl ill'), ('N', 'pl ab'), ('U', 'pl abl'), ('O', 'sg ab'), ('O', 'sg ter'), ('Y', 'sg n'), ('P', ''), ('Y', 'pl abl'), ('U', 'pl ill'), ('Y', 'pl es'), ('Y', 'pl all'), ('U', 'sg kom'), ('C', 'sg ter'), ('U', 'sg ter'), ('N', 'sg ab'), ('Y', 'pl tr'), ('N', 'pl tr'), ('V', 'nuvat'), ('C', 'pl ter'), ('U', 'pl kom'), ('U', 'sg ab'), ('U', 'pl es')}

(pos, form) pairs in corpus, but not in tmorftrtabel.txt:
('V', 'neg me')     Ärme, ärme, 
('H', '')           Ülemöödunud, Virisemata, 


## Roots that are in corpus but not in abileksikon06.utf.lx

In [5]:
# find all pairs (root, partofspeech) from abileksikon
abileksikon_file = '../estnltk/estnltk/rewriting/syntax_preprocessing/rules_files/abileksikon06utf.lx'

root_pos = set()
with open(abileksikon_file, mode='r') as f:
    while True:
        root = next(f, None)
        if root is None:
            break
        root = root.rstrip('\n')
        rule = next(f).rstrip('\n')
        if ('_K_' in rule):
            root_pos.add((root, 'K'))
        if ('_V_' in rule):
            root_pos.add((root, 'V'))
        if ('_Y_' in rule):
            root_pos.add((root, 'Y'))
        if ('_Z_' in rule):
            root_pos.add((root, 'Z'))

not_in_abileksikon = {'K': set(), 'V': set(), 'Y': set(), 'Z': set()}

file='../corpora/tokens_analysis.json'
for token, analysis in yield_tokens_analysis(file):
    for a in analysis:
        if a['partofspeech'] in {'K', 'V'}:
            if (a['root'], a['partofspeech']) not in root_pos:
                not_in_abileksikon[a['partofspeech']].add(a['root'])
        elif a['partofspeech'] in {'Y', 'Z'}:
            if len(a['root']) > 1:
                continue
            if (a['root'], a['partofspeech']) not in root_pos:
                not_in_abileksikon[a['partofspeech']].add(a['root'])

# Y ja Z jaoks me abileksikoni syntax_preprocessingu juures ei kasuta,
# piisab PunctuationTypeRewriter-st.
print('Y (max one character)')
print(not_in_abileksikon['Z'])
print('Z (max one character)')
print(not_in_abileksikon['Z'])
print('K')
print(not_in_abileksikon['K'])
print('V')

print(not_in_abileksikon['V'])

Y (max one character)
{'=', '', '[', '<', '’', '/', '*', ']', '+', '‹', '›', '>'}
Z (max one character)
{'=', '', '[', '<', '’', '/', '*', ']', '+', '‹', '›', '>'}
K
{'maa-rahu-eest', 'peast', 'ilma+st-ilma', 'jõulu-ajal', 'vastavalt', 'kohaselt', 'arust', 'alates'}
V
{'Ameerika+s-käi', 'represseeri', 'ohuta', 'ajasta', 'sirvi', 'vorpi', 'dresseeri', 'kaalutle', 'sanu', 'liputa', 'tingi', 'kasukka', 'ene', 'reserveeri', 'reiti', 'subritreeri', 'sõbrune', 'heasta', 'hubune', 'kappi', 'puuderda', 'taganda', 'tasa_arvelda', 'agiteeri', 'seiska', 'monopoliseeri', 'protokolli', 'riigi_kontrolli-poolne', 'tonnellata', 'masseeri', 'volksuta', 'sauti', 'haipi', 'eksponeeri', 'voha', 'manipuleeri', 'figureeri', 'komandeeri', 'kraami', 'osata', 'sõnasta', 'õhki', 'hõlju', 'ehku', 'optimeeri', 'ripi', 'romaani', 'tiri', 'suvereniteeti', 'seadusta', 'friigi', 'reasta', 'nõidu', 'kontsentreeri', 'talita', 'raali', 'falle', 'kulsi', 'määratle', 'liberaliseeri', 'virele', 'raporteeri', 'žsamaaliidu',

## Everything that comes afer '=' in root

Group the results by part of speech.

In [6]:
from collections import defaultdict
root_eq_ends = defaultdict(set)
file = '../corpora/tokens_analysis.json'
for token, analysis in yield_tokens_analysis(file):
    for a in analysis:
        if '=' in a['root']:
            root_split = a['root'].split('=')
            if len(root_split) > 2:
                print('more than one "=" in root:b', a['root'])
            root_eq_ends[a['partofspeech']].add(root_split[-1])
for key, values in root_eq_ends.items():
    print(key)
    print(values)

more than one "=" in root:b mõel=nu+d-tei=nud
more than one "=" in root:b mõel=nu+d-tei=nud
more than one "=" in root:b mõel=nu+d-tei=nud
more than one "=" in root:b sorki=mis-nuusuta=mis_meetod
O
{'s'}
V
{'nu+d-tege', 'nu+d-sula', 'tu+d-armasta', 'tu+d-kirjuta', 'nu+d-kirjuta'}
U
{'duim', 'nuim', 'lisim', 'vaim', 'tuim', 'tavaim', 'im'}
D
{'ti', 'liselt', 'malt', 'matult', 'laslikult', 'likumalt', 'tavamalt', 'tamatult', 'lt', 'istlikumalt', 'nult', 'likult', 'lisemalt', 'numalt', 'konniti', 'vamalt', 'valt', 'mini', 'istlikult', 'tult'}
C
{'m', 'matum', 'lisem', 'num', 'likum', 'vam', 'istlikum', 'tum', 'tavam'}
A
{'lik', 'tu+d-kirjutatud', 'ja+te_poolne', 'dav', 'mis_tarkune', 'mata', 'tuslik', 'tu+d-armastatud', 'mis_vastane', 'line', 'lik-poliitiline', 'nu+d-kirjutanud', 'luslik', 'lik-optimistlik', 'ja+te_vaheline', 'nu+d-sulav', 'istlik', 'tu', 'dud', 'nud', 'tud', 'jalik', 'tav', 'matu', 'v', 'tamatu', 'tuslik-kasvatuslik', 'ke', 'laslik', 'istlik-avangardistlik', 'mis_himuline

## All tokens that have part of speech "V" and form "neg"

In [7]:
file = '../corpora/tokens_analysis.json'
aux_neg_tokens = []
for token, analysis in yield_tokens_analysis(file):
    for a in analysis:
        if a['form'] == 'neg' and a['partofspeech'] == 'V':
            aux_neg_tokens.append(token)
            break
aux_neg_tokens

['EI', 'Ei', 'ei']