## Extracting information from universal dependency treebanks
Using udapi and other tools to review the content of treebanks.
I've repeated some of the examples from http://udapi.github.io/tutorial/ and from http://udapi.github.io/slides.pdf using the Turkish Universal Dependencies *.conllu file. The file itself is not included, but you can obtain it, or other treebanks, from http://universaldependencies.org.

I've also included an example of my own, where I use the structure provided by udapi to produce a custom report without writing my own Block within the udapi architecture.

In [None]:
# Use of udapy, the command-line interface to udapi (an API for Universal Dependencies).
# See http://udapi.github.io for more information.
# To install use: $ pip3 install --user --upgrade udapi

# Draw the dependency trees of the Turkish treebank as text.
# Both commands would achieve the same result.
#!cat UD_Turkish-IMST/tr_imst-ud-train.conllu | udapy -T | less -R
!udapy -T < UD_Turkish-IMST/tr_imst-ud-train.conllu | less -R

2018-05-17 16:06:37,570 [   INFO] execute - No reader specified, using read.Conllu
2018-05-17 16:06:37,570 [   INFO] execute -  ---- ROUND ----
2018-05-17 16:06:37,570 [   INFO] execute - Executing block Conllu
2018-05-17 16:06:38,531 [   INFO] execute - Executing block TextModeTrees
# sent_id = mst-0003
# text = Sanal parçacıklarsa bunların hiçbirini yapamazlar.
─┮
 │   ╭─╼ Sanal ADJ amod
 │ ╭─┾ parçacıklar NOUN csubj
 │ │ ╰─╼ sa AUX cop
 │ │ ╭─╼ bunların PRON nmod:poss
 │ ┢─┶ hiçbirini PRON obj
 ╰─┾ yapamazlar VERB root
   ╰─╼ . PUNCT punct

# sent_id = mst-0004
# text = Ona her şeyimi verdim.
─┮
 │ ╭─╼ Ona PRON obl
 │ ┢─┮ her DET obj
 │ │ ╰─╼ şeyimi NOUN compound
 ╰─┾

In [2]:
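# Query: print the form and UPOS of every node attached with the 'discourse' relation,
# then tally the (form, UPOS) pairs by frequency.
!udapy util.Eval node='if node.deprel == "discourse": print(node.form, node.upos)' < UD_Turkish-IMST/tr_imst-ud-train.conllu > disc.txt
!sort disc.txt | uniq -c | sort -rn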

2018-05-17 16:14:58,145 [   INFO] execute - No reader specified, using read.Conllu
2018-05-17 16:14:58,145 [   INFO] execute -  ---- ROUND ----
2018-05-17 16:14:58,145 [   INFO] execute - Executing block Conllu
2018-05-17 16:14:59,174 [   INFO] execute - Executing block Eval
  29 ise CCONJ
  10 Hadi INTJ
   5 ya INTJ
   4 Aman INTJ
   3 tabi INTJ
   3 of INTJ
   3 hadi INTJ
   3 a INTJ
   3 Yahu INTJ
   2 haydi INTJ
   2 ha INTJ
   2 Eee INTJ
   1 yo INTJ
   1 yazık INTJ
   1 sakın INTJ
   1 hah INTJ
   1 be INTJ
   1 abi NOUN
   1 Yoo INTJ
   1 Yo INTJ
   1 Ulan INTJ
   1 Oh INTJ
   1 Hey INTJ
   1 Haydi INTJ
   1 Eyvah INTJ
   1 Ee INTJ
   1 E INTJ
   1 Aaa INTJ
   1 A INTJ
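
The same tally can be approximated without the shell pipeline. The following is a minimal sketch, not how udapi does it: it reads the ten tab-separated CoNLL-U columns with the standard library only and counts (form, UPOS) pairs for a given relation. The `SAMPLE` fragment and the `count_by_deprel` helper are hypothetical and made up for illustration; lemmas and features are left as `_`.

```python
from collections import Counter

# Hypothetical hand-made CoNLL-U fragment (not taken from the treebank).
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tHadi\t_\tINTJ\t_\t_\t2\tdiscourse\t_\t_\n"
    "2\tgel\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = demo-2\n"
    "1\tHadi\t_\tINTJ\t_\t_\t2\tdiscourse\t_\t_\n"
    "2\tgit\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = demo-3\n"
    "1\tAman\t_\tINTJ\t_\t_\t2\tdiscourse\t_\t_\n"
    "2\tdur\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
)


def count_by_deprel(conllu_text, deprel):
    """Count (form, upos) pairs of all words attached with `deprel`."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank separators and sentence-level comments
        cols = line.split('\t')
        # Keep only ordinary words: skip MWT ranges ("2-3") and empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if cols[7] == deprel:
            counts[(cols[1], cols[3])] += 1
    return counts


counts = count_by_deprel(SAMPLE, 'discourse')
# Print in the same "count form UPOS" layout as `uniq -c`, most frequent first.
for (form, upos), n in counts.most_common():
    print(f'{n:4d} {form} {upos}')
```

To run it on the real treebank, the text of `tr_imst-ud-train.conllu` would be read from disk (with `encoding='utf-8'`) and passed in place of `SAMPLE`.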


In [5]:
# Word counts
!udapy util.Wc < UD_Turkish-IMST/tr_imst-ud-train.conllu

2018-05-17 16:27:03,405 [   INFO] execute - No reader specified, using read.Conllu
2018-05-17 16:27:03,405 [   INFO] execute -  ---- ROUND ----
2018-05-17 16:27:03,405 [   INFO] execute - Executing block Conllu
2018-05-17 16:27:04,375 [   INFO] execute - Executing block Wc
    3685 trees
   38082 words
    1087 multi-word tokens
   36970 tokens
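
The four numbers are related: each multi-word token replaces the words it spans, so tokens = words - (words inside MWTs) + MWTs. The util.See query further down reports 2199 nodes that belong to an MWT, and 38082 - 2199 + 1087 = 36970 matches the token count above. The sketch below illustrates that relation with the standard library only; it is not udapi's implementation. The `wc` helper is hypothetical, and the sample is sentence mst-0003 from the tree output above with the MWT range line reconstructed from its text (surface form "parçacıklarsa" against the words "parçacıklar" + "sa"); lemmas and features are left as `_`.

```python
# Sentence mst-0003, rebuilt from the tree shown earlier (assumed layout).
SAMPLE = (
    "# sent_id = mst-0003\n"
    "# text = Sanal parçacıklarsa bunların hiçbirini yapamazlar.\n"
    "1\tSanal\t_\tADJ\t_\t_\t2\tamod\t_\t_\n"
    "2-3\tparçacıklarsa\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tparçacıklar\t_\tNOUN\t_\t_\t6\tcsubj\t_\t_\n"
    "3\tsa\t_\tAUX\t_\t_\t2\tcop\t_\t_\n"
    "4\tbunların\t_\tPRON\t_\t_\t5\tnmod:poss\t_\t_\n"
    "5\thiçbirini\t_\tPRON\t_\t_\t6\tobj\t_\t_\n"
    "6\tyapamazlar\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    "7\t.\t_\tPUNCT\t_\t_\t6\tpunct\t_\t_\n"
)


def wc(conllu_text):
    """Count trees, words, multi-word tokens and surface tokens."""
    trees = words = mwts = mwt_words = 0
    for line in conllu_text.splitlines():
        if not line or line.startswith('#'):
            continue  # blank separator or comment
        token_id = line.split('\t')[0]
        if '-' in token_id:
            # MWT range line such as "2-3": one surface token spanning several words.
            start, end = (int(part) for part in token_id.split('-'))
            mwts += 1
            mwt_words += end - start + 1
        elif token_id.isdigit():
            words += 1
            if token_id == '1':
                trees += 1  # word 1 marks the start of a new tree
        # Empty nodes such as "1.1" are neither words nor tokens; ignore them.
    tokens = words - mwt_words + mwts
    return {'trees': trees, 'words': words,
            'multi-word tokens': mwts, 'tokens': tokens}


stats = wc(SAMPLE)
for name, value in stats.items():
    print(f'{value:8d} {name}')
```

For this sentence there are seven syntactic words but six surface tokens, because "parçacıklarsa" is written as one token and analysed as two words.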


In [11]:
# More advanced statistics.
#!udapy util.See node='node.is_nonprojective()' < UD_Turkish-IMST/tr_imst-ud-train.conllu

!udapy util.See node='node.multiword_token != None' < UD_Turkish-IMST/tr_imst-ud-train.conllu

2018-05-17 16:43:08,114 [   INFO] execute - No reader specified, using read.Conllu
2018-05-17 16:43:08,114 [   INFO] execute -  ---- ROUND ----
2018-05-17 16:43:08,114 [   INFO] execute - Executing block Conllu
2018-05-17 16:43:09,087 [   INFO] execute - Executing block See
node.multiword_token != None
matches 2199 out of 38082 nodes (5.8%) in 884 out of 3685 trees (24.0%)
=== dir (3 values) ===
          right  1346  61% delta=+27%
           left   569  25% delta=-30%
           root   284  12% delta= +3%
=== edge (44 values) ===
              1  1071  48% delta=+28%
             -2   299  13% delta= +3%
              0   284  12% delta= +3%
              2   134   6% delta= +0%
             -3   129   5% delta= +0%
=== depth (11 values) ===
              2   589  26% delta= -5%
              3   502  22% delta= -4%
              4   379  17% delta= +1%
              1   284  12% delta= +3%
              5   229  10% delta= +2%
=== children (11 values) ===
              0  1150  52% 

### Custom reporting using udapi
More complex reporting requires either writing a class that inherits from Block (udapi.core.block), or opening the file with Document (udapi.core.document) and accessing the node, MWT, and tree structures directly. The latter approach is a bit of a hack or, more generously, a rapid prototyping strategy.

Here is an example that reports on the Turkish *.conllu files using udapi tools.

We were looking at how frequently multiword tokens (MWTs) are used in Turkish, for comparison with how we are coding our own agglutinative languages of the Amazon.

In [59]:
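# Work with the CoNLL-U data directly through udapi's document structure.
from udapi.core.document import Document
from collections import Counter

filename = 'UD_Turkish-IMST/tr_imst-ud-train.conllu'
# Write as UTF-8 explicitly so Turkish characters survive on any platform.
outfile = open('Turkish-mwt.txt', 'wt', encoding='utf-8')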

count_upos = Counter()
count_upos1 = Counter()
count_deprel = Counter()
count_deprel1 = Counter()
count_lemma1 = Counter()
count_lemma = Counter()
count_mwt = Counter()
count_tokens = 0

document = Document()
document.load_conllu(filename=filename)
print('number of sentences:', len(document.bundles))
# Process all trees.
for sentence in document.bundles:
    # A monolingual treebank has a single tree per bundle,
    # so there is no need to specify a zone or iterate over zones.
    tree = sentence.get_tree()
    # NOTE: get_sentence() returns the sentence text as a string, so this sums
    # characters rather than tokens (util.Wc above reports 36970 tokens).
    count_tokens += len(tree.get_sentence())
    # Get list of all multiword tokens.
    for mwt in tree.multiword_tokens:
        #print(mwt.form)
        token = [mwt.form]
        count_mwt[mwt.form] += 1

        for i, word in enumerate(mwt.words):
            token += ['|', word.form, word.lemma, word.upos, word.deprel]
            if i == 0:
                count_lemma1[word.lemma] += 1
                count_deprel1[word.deprel] += 1
                count_upos1[word.upos] += 1
            else:
                count_lemma[word.lemma] += 1
                count_deprel[word.deprel] += 1
                count_upos[word.upos] += 1
            
        print(token, file=outfile)
        
outfile.close()
print('\n# tokens:', count_tokens)
count_MWTs = sum(count_mwt.values())
print('# MWTs:', count_MWTs)
print('proption MWTs:', count_MWTs/count_tokens)
print('MWT:', count_mwt.most_common(50))
print('\nFirst morpheme:')
print('lemma1:', count_lemma1.most_common(50))
print('\nupos1:', count_upos1)
print('\ndeprel1:', count_deprel1)
print('\nSubsequent morphemes:')
print('lemma:', count_lemma.most_common(50))
print('\nupos:', count_upos)
print('\ndeprel:', count_deprel)


number of sentences: 3685

# tokens: 226597
# MWTs: 1087
proption MWTs: 0.004797062626601411
MWT: [('yoktu', 16), ('önemli', 15), ('vardı', 9), ('arasındaki', 9), ('önceki', 9), ('hafifçe', 8), ('vardır', 8), ('vadeli', 8), ('yoksa', 7), ('gibiydi', 7), ('yoktur', 7), ('sessiz', 6), ('dolarlık', 6), ('altındaki', 5), ('üzerindeki', 5), ('?edir', 5), ('saatlik', 4), ('içindeki', 4), ('yanındaki', 4), ('doluydu', 4), ('Katana', 4), ('benim', 4), ('rahatça', 4), ('zamanki', 4), ('demektir', 4), ('adlı', 4), ('olanlar', 4), ('sağlıklı', 3), ('?eymiş', 3), ('iyice', 3), ('tarihli', 3), ('ürkütücü', 3), ('sebzedir', 3), ('zordur', 3), ('aptalca', 3), ('önümüzdeki', 3), ('Tehlikeli', 3), ('imkansızdı', 3), ('elbiseli', 3), ('saçlı', 3), ('elindeki', 3), ('buydu', 3), ('yıldır', 3), ('günkü', 3), ('yıllardır', 3), ('buradaki', 3), ('gelene', 3), ('Benim', 3), ('Yoksa', 3), ('yavaşça', 3)]

First morpheme:
lemma1: [('yok', 34), ('var', 21), ('önem', 19), ('ol', 13), ('gibi', 12), ('ben', 11), (