# Tokens and analysis

This demo is based on 123844 tokens with 205005 analysis.

We have a file of tokens with analyses where every line has json encoded form

       token, [(lemma, root, root_tokens, ending, clitic, partofspeech, form), ...]
For example:
```
...
["insenerid", [["insener", "insener", ["insener"], "d", "", "S", "pl n"]]]
...
["Ühtedel", [["üks", "üks", ["üks"], "del", "", "N", "pl ad"], ["üks", "üks", ["üks"], "del", "", "P", "pl ad"]]]
...
```

## Create a dict analysis_examples

       {(cap, dash, eq, ending, clitic, partofspeech, form): token, ...}

where 

        cap == True, if token starts with a capital letter,
        cap == False, if token starts with a lower case letter,
        cap == None, if token does not start with a letter,
        dash == True, if '-' in the root,
        dast == False if '-' not in the root,
        eq == True, if '=' in the root,
        eq == False if '=' not in the root.

In [1]:
from collections import Counter
from tokens_analysis import yield_tokens_analysis
file='../corpora/tokens_analysis.json'

analysis_examples = {}
analysis_counter = Counter()
for token, analysis in yield_tokens_analysis(file):
    cap = token[0].isupper()
    for a in analysis:
        key = (cap, 
               '-' in a['root'], 
               '=' in a['root'], 
               a['ending'], 
               a['clitic'], 
               a['partofspeech'], 
               a['form'])
        analysis_examples[key] = token
        analysis_counter[key] += 1

print('Ten most common analysis.\n')
for a, n in analysis_counter.most_common(10):
    print('{:<45} {:<20} {:>10}'.format(str(a), analysis_examples[a], n))

Ten most common analysis.

(True, False, False, '0', '', 'H', 'sg n')    Προϊόντα                  18159
(True, False, False, '0', '', 'H', 'sg g')    Εcu                       13890
(False, False, False, '0', '', 'S', 'sg n')   τόνο                      12430
(False, False, False, '0', '', 'S', 'sg g')   žürii                     12360
(False, False, False, '0', '', 'S', 'sg p')   žiletitera                 5051
(False, False, False, 'd', '', 'S', 'pl n')   žüriid                     4680
(False, False, False, '0', '', 'S', 'adt')    želatiini                  4055
(True, False, False, '0', '', 'S', 'sg n')    Žürii                      3654
(False, False, False, 't', '', 'S', 'sg p')   žiirot                     3431
(True, False, False, '0', '', 'S', 'sg g')    Žürii                      3353


## Text of tokens that have pairwise different analysis

Maybe we can use this text for testing syntax_preprocessing efficiently.

In [2]:
text = ' '.join(set(analysis_examples.values()))
text

'üksikasjus Viigi üürib õn-nestus üllatusena Viimasena Miljoneid ytleme.et neilegi ps-i neistki viljelejatele üldhariduslikes Noor-Tartuga ökonoomsemaid põlvini viitena OC-id ise-ennast õdedele-vendadele toodanguvõimelisemateks Ütle Teist Üheksas Uranus välis-ja = vanemadki üürilepinguid Saaremaa-poolseks 90-ndail viiendalt Vastastele Õnnepalu Ühinenud teistel Teistel Mainekamal pikemata polnud 90ndate seitsmendate-kaheksandate EVP-kroonides Ühel kuues-seitsmes Varasemate MM-ide ühegi UFO-mänguasju Seitsme-kümnendad vastavatki ümbritsevate temata Võtsime teilt PR-tegelasena Mangi Varasemaga Teie Vahemere-stiilis ärandamisvõimaluste võttev Teistsuguste šokolaadikaupadesse jõulu-Lennartiks teile osatagi ümbert Neisse viiendas ülinappe Varema tõigi XPga suurimal ümbritsemas provokatiivseltki Võid Tuli suurimast Üliõpilastel Pooltel 13-ndana üle-eestiliselt üldnimetust mida Värviplekilistes ülimalt turumajanduslikest Takistustena šokolaadikondiitritoodete vabatahtlikega Vett Veelgi Saint-E

## Pronouns in corpora

Find all tokens that are marked as pronouns (partofspeech P). Group the pronouns by regexes from \_pronConversions. The class 'no_search_match' is for pronouns that don't match any regex.

In [3]:
_pronConversions = ( ("emb\+.*",             ("det",)),
                     ("enda\+.*",            ("pos", "refl")),
                     ("enese\+.*",           ("pos", "refl")),
                     ("eikeegi.*",           ("indef",)),
                     ("eimiski.*",           ("indef",)),
                     ("emb-kumb.*",          ("det",)),
                     ("esimene.*",           ("dem",)),
                     ("iga\+.*",             ("det",)),
                     ("iga_sugune.*",        ("indef",)),
                     ("iga_.ks\+.*",         ("det",)),
                     ("ise\+.*",             ("pos", "det", "refl")),
                     ("ise_enese.*",         ("refl",)),
                     ("ise_sugune.*",        ("dem",)),
                     ("keegi.*",             ("indef",)),
                     ("kes.*",               ("inter rel",)),
                     ("kumb\+.*",            ("rel",)),
                     ("kumbki.*",            ("det",)),
                     ("kõik.*",              ("det",)),
                     ("k.ik.*",              ("det",)),
                     ("meie_sugune.*",       ("dem",)),
                     ("meie_taoline.*",      ("dem",)),
                     ("mihuke\+.*",          ("inter rel",)),
                     ("mihukene\+.*",        ("inter rel",)),
                     ("mille_taoline.*",     ("dem",)),
                     ("milli=?ne.*",         ("rel",)),
                     ("mina\+.*",            ("pers ps1",)),
                     (" ma\+.*",             ("pers ps1",)),
                     ("mina=?kene\+.*",      ("dem",)),
                     ("mina=?ke\+.*",        ("dem",)),
                     ("mingi\+.*",           ("indef",)),
                     ("mingi_sugune.*",      ("indef",)),
                     ("minu_sugune.*",       ("dem",)),
                     ("minu_taoline.*",      ("dem",)),
                     ("miski.*",             ("indef",)),
                     ("mis\+.*",             ("inter rel",)),
                     ("mis_sugune.*",        ("inter rel",)),
                     ("miski\+.*",           ("inter rel",)),
                     ("miski_sugune.*",      ("inter rel",)),
                     ("misu=?ke(ne)?\+.*",   ("dem",)),
                     ("mitme_sugune.*",      ("indef",)),
                     ("mitme_taoline.*",     ("indef",)),
                     ("mitmendik\+.*",       ("inter rel",)),
                     ("mitmes\+.*",          ("inter rel", "indef")),
                     ("mi=?tu.*",            ("indef",)),
                     ("miuke(ne)?\+.*",      ("inter rel",)),
                     ("muist\+.*",           ("indef",)),
                     ("muu.*",               ("indef",)),
                     ("m.lema.*",            ("det",)),
                     ("m.ne_sugune\+.*",     ("indef",)),
                     ("m.ni\+.*",            ("indef",)),
                     ("m.ningane\+.*",       ("indef",)),
                     ("m.ningas.*",          ("indef",)),
                     ("m.herdune\+.*",       ("indef", "rel")),
                     ("määntne\+.*",         ("dem",)),
                     ("na_sugune.*",         ("dem",)),
                     ("nende_sugune.*",      ("dem",)),
                     ("nende_taoline.*",     ("dem",)),
                     ("nihuke(ne)?\+.*",     ("dem",)),
                     ("nii_mi=?tu\+.*",      ("indef", "inter rel")),
                     ("nii_sugune.*",        ("dem",)),
                     ("niisama_sugune.*",    ("dem",)),
                     ("nii?su=?ke(ne)?\+.*", ("dem",)),
                     ("niuke(ne)?\+.*",      ("dem",)),
                     ("oma\+.*",             ("pos", "det", "refl")),
                     ("oma_enese\+.*",       ("pos",)),
                     ("oma_sugune\+.*",      ("dem",)),
                     ("oma_taoline\+.*",     ("dem",)),
                     ("palju.*",             ("indef",)),
                     ("sama\+.*",            ("dem",)),
                     ("sama_sugune\+.*",     ("dem",)),
                     ("sama_taoline\+.*",    ("dem",)),
                     ("samune\+.*",          ("dem",)),
                     ("see\+.*",             ("dem",)),
                     ("see_sama\+.*",        ("dem",)),
                     ("see_sam[au]ne\+.*",   ("dem",)),
                     ("see_sinane\+.*",      ("dem",)),
                     ("see_sugune\+.*",      ("dem",)),
                     ("selle_taoline\+.*",   ("dem",)),
                     ("selli=?ne\+.*",       ("dem",)),
                     ("setu\+.*",            ("indef",)),
                     ("setmes\+.*",          ("indef",)),
                     ("sihuke\+.*",          ("dem",)),
                     ("sina\+.*",            ("pers ps2",)),
                     (" sa\+.*",             ("pers ps2",)),
                     ("sinu_sugune\+.*",     ("dem",)),
                     ("sinu_taoline\+.*",    ("dem",)),
                     ("siuke(ne)?\+.*",      ("dem",)),
                     ("säherdune\+.*",       ("dem",)),
                     ("s.herdune\+.*",       ("dem",)),
                     ("säärane\+.*",         ("dem",)),
                     ("s..rane\+.*",         ("dem",)),
                     ("taoline\+.*",         ("dem",)),
                     ("teie_sugune\+.*",     ("dem",)),
                     ("teie_taoline\+.*",    ("dem",)),
                     ("teine\+.*",           ("dem",)),
                     ("teine_teise\+.*",     ("rec",)),
                     ("teist?_sugune\+.*",   ("dem",)),
                     ("tema\+.*",            ("pers ps3",)),
                     (" ta\+.*",             ("pers ps3",)),
                     ("temake(ne)?\+.*",     ("pers ps3",)),
                     ("tema_sugune\+.*",     ("dem",)),
                     ("tema_taoline\+.*",    ("dem",)),
                     ("too\+.*",             ("dem",)),
                     ("too_sama\+.*",        ("dem",)),
                     ("üks.*",               ("dem", "indef")),
                     (".ks.*",               ("dem", "indef")),
                     ("ükski.*",             ("dem", "indef")),
                     (".kski.*",             ("dem", "indef")),
                     ("üks_teise.*",         ("rec", "indef")),
                     (".ks_teise.*",         ("rec",))
)

import regex

search = regex.search
match = regex.match
pronoun_examples = {}
for a, b in _pronConversions:
    pronoun_examples[a] = []
no_search_match_examples = []

for token, analysis in yield_tokens_analysis(file):
    for a in analysis:
        if a['partofspeech'] != 'P':
            continue
        root_ec = ''.join((a['root'], '+', a['ending'], a['clitic']))
        search_match = False
        for pattern, l in pronoun_examples.items():
            if search(pattern, root_ec):
                # kas search või match? "enese" vs "iseenese"
                # kas token asemel võiks olla hoopis token.lower() või root või lemma (ja l oleks siis list asemel set)?
                l.append(token)
                search_match = True
        if not search_match:
            no_search_match_examples.append(token)
pronoun_examples['no_search_match'] = no_search_match_examples
for pattern, examples in pronoun_examples.items():
    print(pattern)
    print(examples)

too\+.*
['Nood', 'Tol', 'Tolle', 'Tolleks', 'Tollesse', 'Tollest', 'Too', 'nois', 'nonde', 'nood', 'toda', 'tol', 'tolle', 'tolleks', 'tollel', 'tollele', 'tolles', 'tollest', 'too']
 sa\+.*
[]
see\+.*
['60seks', 'Need', 'Neid', 'Neiks', 'Neil', 'Neile', 'Neilt', 'Neis', 'Neisse', 'Neist', 'Nende', 'Nendega', 'Nendel', 'Nendele', 'Nendelt', 'Nendes', 'Nendest', 'Nendestki', 'SEE', 'SELLEKS', 'Seda', 'Sedagi', 'See', 'Seegi', 'Seks', 'Sel', 'Selgi', 'Selle', 'Sellega', 'Sellegi', 'Selleks', 'Sellel', 'Sellele', 'Sellelt', 'Selleni', 'Selles', 'Selleski', 'Sellesse', 'Sellest', 'Sest', 'kinnita-mi-seks', 'need', 'needki', 'neid', 'neidki', 'neil', 'neile', 'neilegi', 'neilgi', 'neilt', 'neis', 'neisse', 'neist', 'neistki', 'nende', 'nendega', 'nendegi', 'nendeks', 'nendel', 'nendele', 'nendelgi', 'nendelt', 'nendeni', 'nendes', 'nendesse', 'nendest', 'nendeta', 'seda', 'sedagi', 'see', 'seegi', 'seks', 'sel', 'selgi', 'selle', 'sellega', 'sellegi', 'selleks', 'sellel', 'sellele', 'sellel

## Roots that are in korpus but not in abileksikon06.utf.lx

In [4]:
# find all pairs (root, partofspeech) from abileksikon
abileksikon_file = '../estnltk/estnltk/rewriting/syntax_preprocessing/rules_files/abileksikon06utf.lx'

root_pos = set()
with open(abileksikon_file, mode='r') as f:
    while True:
        root = next(f, None)
        if root is None:
            break
        root = root.rstrip('\n')
        rule = next(f).rstrip('\n')
        if ('_K_' in rule):
            root_pos.add((root, 'K'))
        if ('_V_' in rule):
            root_pos.add((root, 'V'))
        if ('_Y_' in rule):
            root_pos.add((root, 'Y'))
        if ('_Z_' in rule):
            root_pos.add((root, 'Z'))

not_in_abileksikon = {'K': set(), 'V': set(), 'Y': set(), 'Z': set()}

file='../corpora/tokens_analysis.json'
for token, analysis in yield_tokens_analysis(file):
    for a in analysis:
        if a['partofspeech'] in {'K', 'V'}:
            if (a['root'], a['partofspeech']) not in root_pos:
                not_in_abileksikon[a['partofspeech']].add(a['root'])
        elif a['partofspeech'] in {'Y', 'Z'}:
            if len(a['root']) > 1:
                continue
            if (a['root'], a['partofspeech']) not in root_pos:
                not_in_abileksikon[a['partofspeech']].add(a['root'])

# Y ja Z jaoks me abileksikoni syntax_preprocessingu juures ei kasuta,
# piisab PunctuationTypeRewriter-st. Siia kirutame ainult lühikesed (0 või 1 märki) näited.
print('Y')
print(not_in_abileksikon['Z'])
print('Z')
print(not_in_abileksikon['Z'])
print('K')
print(not_in_abileksikon['K'])
print('V')

print(not_in_abileksikon['V'])

Y
{'', ']', '*', '[', '<', '=', '’', '›', '‹', '/', '>', '+'}
Z
{'', ']', '*', '[', '<', '=', '’', '›', '‹', '/', '>', '+'}
K
{'peast', 'vastavalt', 'ilma+st-ilma', 'alates', 'jõulu-ajal', 'kohaselt', 'maa-rahu-eest', 'arust'}
V
{'täiusta', 'piiska', 'aja_loo-alane', 'elektrifitseeri', 'viidata', 'sisene', 'rüüta', 'revideeri', 'resolutusiooni', 'raba', 'tead+a-tund', 'kaalutle', 'hoiusta', 'tolmle', 'hõisata', 'nuia', 'irjuta', 'tarvitse', 'lühenda', 'konkretiseeri', 'dis_krediteeri', 'au-hinda', 'nuputa', 'sööbi', 'juurdle', 'kleepi', 're_organiseeri', 'haiguse+ga-pakku', 'parselda', 'kuulata', 'suune', 'jetti', 'põeta', 'defaulti', 'ühita', 'tuula', 'sääne', 'piidru', 'peaagu', 'seila', 'evi', 'paisuta', 'paneeli', 'tsŠiili', 'risusta', 'para_fraseeri', 'tippi', 'tampi', 'ene', 'tõde', 'eksle', 'sprinti', 'frakstiooni', 'rooska', 'agiteeri', 'enneta', 'purki', 'tasa_kaalustu', 'jaluta', 'hävita', 'lahva', 'guildi', 'kahe_kordistu', 'all_kirjasta', 'demonsteeri', 'laksnu', 'dis_kvali

## Find everything that comes afer '=' in root

Group the results by part of speech.

In [5]:
from collections import defaultdict
root_eq_ends = defaultdict(set)
file = '../corpora/tokens_analysis.json'
for token, analysis in yield_tokens_analysis(file):
    for a in analysis:
        if '=' in a['root']:
            root_split = a['root'].split('=')
            if len(root_split) > 2:
                print('more than one "=" in root:b', a['root'])
            root_eq_ends[a['partofspeech']].add(root_split[-1])
for key, values in root_eq_ends.items():
    print(key)
    print(values)

more than one "=" in root:b mõel=nu+d-tei=nud
more than one "=" in root:b mõel=nu+d-tei=nud
more than one "=" in root:b mõel=nu+d-tei=nud
more than one "=" in root:b sorki=mis-nuusuta=mis_meetod
V
{'nu+d-sula', 'nu+d-kirjuta', 'tu+d-kirjuta', 'nu+d-tege', 'tu+d-armasta'}
O
{'s'}
C
{'vam', 'istlikum', 'tavam', 'm', 'lisem', 'tum', 'num', 'matum', 'likum'}
A
{'nu+d-sulav', 'tuslik', 'lik-poliitiline', 'matu', 'mata', 'tud', 'laslik', 'nu+d-kirjutanud', 'dud', 'dav', 'mis_vastane', 'ja+te_vaheline', 'ke', 'luslik', 'jalik', 'tu+d-kirjutatud', 'mis_tarkune', 'line', 'tav', 'tamatu', 'lik', 'istlik', 'tu+d-armastatud', 'nud', 'mis_himuline', 'lik-optimistlik', 'ja+te_poolne', 'v', 'istlik-avangardistlik', 'tu', 'tuslik-kasvatuslik'}
S
{'mis_suund', 'mis_kiird', 'mis_tingimus', 'mis_taotlus', 'mis_plaat', 'lane', 'vus', 'mis_tulemus', 'tsus', 'kene', 'dus', 'mis_keskus', 'sus_põhi_mõte', 'mis_tseremoonia', 'tavus', 'mis_kulu', 'mis_kurk', 'matus', 'tamatus', 'du', 'ja+te_koda', 's', 'us', 'm

## Find all tokens that have part of speech "V" and form "neg"

In [6]:
file = '../corpora/tokens_analysis.json'
aux_neg_tokens = []
for token, analysis in yield_tokens_analysis(file):
    for a in analysis:
        if a['form'] == 'neg' and a['partofspeech'] == 'V':
            aux_neg_tokens.append(token)
            break
aux_neg_tokens

['EI', 'Ei', 'ei']