## Chapter 1: Explore the Parallel UD treebank (PUD)
1. Go to https://universaldependencies.org/ and download the Version 2.7 treebanks.
2. Look up the Parallel UD (PUD) treebanks, which are available for 19 languages. Their directories are named, for example, UD_English-PUD/.
3. Select a language to compare with English.
4. Make statistics about the frequencies of POS tags and dependency labels in your language compared with English: find the top-20 tags/labels and their number of occurrences. What does this tell you about the language? (This can be done with shell or Python programming or with the gf-ud tool.)
5. Convert the following four trees from CoNLL format to graphical trees by hand, on paper.
 - a short English tree (5-10 words, of your choice) and its translation.
 - a long English tree (>25 words) and its translation.
6. Draw word alignments for some non-trivial example in the PUD treebank, on paper. Use the same trees as in the previous question. What can you say about the syntactic differences between the languages?
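
A hand-drawn tree from step 5 can be cross-checked by printing the same structure as an indented outline. The following is only a minimal sketch: the four-token sentence is a toy example that is not taken from PUD, and `render_tree` is a hypothetical helper name. It groups tokens under their HEAD and walks down from the root (HEAD = 0).

```python
from collections import defaultdict

# Toy sentence, NOT taken from PUD: (id, form, head, deprel)
tokens = [
    (1, "The", 2, "det"),
    (2, "dog", 3, "nsubj"),
    (3, "barks", 0, "root"),
    (4, ".", 3, "punct"),
]

def render_tree(tokens):
    """Return the dependency tree as indented lines, starting at the root."""
    # Map each head id to the list of its dependents, in sentence order.
    children = defaultdict(list)
    for tok_id, form, head, deprel in tokens:
        children[head].append((tok_id, form, deprel))

    lines = []

    def visit(head, depth):
        # Emit every dependent of `head`, then recurse into its own dependents.
        for tok_id, form, deprel in children[head]:
            lines.append("  " * depth + f"{form} ({deprel})")
            visit(tok_id, depth + 1)

    visit(0, 0)  # HEAD = 0 marks the root
    return lines

print("\n".join(render_tree(tokens)))
```

The indentation depth corresponds to the distance from the root, which makes it easy to compare against a drawing on paper.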

Each token line in a CoNLL-U file has ten tab-separated fields (see https://universaldependencies.org/format.html):
* ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
* FORM: Word form or punctuation symbol.
* LEMMA: Lemma or stem of word form.
* UPOS: Universal part-of-speech tag.
* XPOS: Language-specific part-of-speech tag; underscore if not available.
* FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
* HEAD: Head of the current word, which is either a value of ID or zero (0).
* DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
* DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
* MISC: Any other annotation.
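
As an illustration of the field list above, a single token line can be split on tabs and paired with the ten field names. This is a minimal sketch, not part of the UD tooling: `CONLLU_FIELDS` and `parse_token_line` are hypothetical names, and the sample line is the second token of the Chinese PUD output shown later in this notebook.

```python
# Field names in the order given by the CoNLL-U format description.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 fields, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))

# Sample token line (tab-separated), matching a row of the Chinese PUD output.
line = "2\t雖然\t_\tSCONJ\tIN\t_\t8\tmark\t_\tSpaceAfter=No|Translit=suīrán\n"
token = parse_token_line(line)
print(token["form"], token["upos"], token["head"], token["deprel"])
```

All values stay strings here; HEAD and ID would need an explicit conversion before being used as numbers.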


In [32]:
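import pandas as pd

CONLLU_COLUMNS = ['id', 'form', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel', 'deps', 'misc']

def process_conllu(file):
    '''Read a CoNLL-U file (e.g. a PUD treebank) and return a DataFrame with one row per token line.'''
    with open(file, 'r', encoding='utf8') as f:
        # Token lines have exactly 10 tab-separated fields; comment lines
        # (starting with '#') and blank lines normally do not, so they are dropped.
        rows = [fields for fields in (line.split('\t') for line in f) if len(fields) == 10]
    # All values are kept as strings. Lines are not stripped, so the last
    # column (misc) keeps its trailing '\n'.
    return pd.DataFrame(rows, columns=CONLLU_COLUMNS)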
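
Because the ID field may also be a range (multiword token) or a decimal (empty node), a frame produced by process_conllu can contain rows that are not ordinary syntactic words. If a treebank contains such rows, they would be counted along with the real tokens. A minimal sketch of a filter, assuming the id column holds strings as above; `keep_syntactic_words` is a hypothetical helper and the sample frame is invented for illustration:

```python
import pandas as pd

# Invented mini-frame: a multiword token range (1-2) and an empty node (2.1)
# mixed in with ordinary tokens. The 'id' column holds strings.
sample = pd.DataFrame({
    "id": ["1-2", "1", "2", "2.1", "3"],
    "form": ["vámonos", "vamos", "nos", "_", "."],
    "upos": ["_", "VERB", "PRON", "_", "PUNCT"],
})

def keep_syntactic_words(df):
    """Keep rows whose ID is a plain integer; drop ranges and empty nodes."""
    mask = df["id"].str.match(r"^[0-9]+$")
    return df[mask]

words = keep_syntactic_words(sample)
print(words["upos"].value_counts())
```

Applying the filter before value_counts() keeps placeholder tags such as `_` out of the statistics.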


In [44]:
Chinese_PUD = 'UD_Chinese-PUD/zh_pud-ud-test.conllu'
English_PUD = 'UD_English-PUD/en_pud-ud-test.conllu'
df_chinese = process_conllu(Chinese_PUD)
df_english = process_conllu(English_PUD)
df_chinese

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,misc
0,1,"""",_,PUNCT,``,_,18,punct,_,"SpaceAfter=No|Translit=""\n"
1,2,雖然,_,SCONJ,IN,_,8,mark,_,SpaceAfter=No|Translit=suīrán\n
2,3,美國,_,PROPN,NNP,_,7,nmod,_,SpaceAfter=No|Translit=měiguó\n
3,4,的,_,PART,DEC,Case=Gen,3,case,_,SpaceAfter=No|Translit=de\n
4,5,許多,_,NUM,CD,NumType=Card,7,nummod,_,SpaceAfter=No|Translit=xǔduō\n
...,...,...,...,...,...,...,...,...,...,...
21410,31,和平,_,ADJ,JJ,_,34,amod,_,SpaceAfter=No|Translit=hépíng\n
21411,32,的,_,PART,DEC,_,31,mark:relcl,_,SpaceAfter=No|Translit=de\n
21412,33,友誼,_,NOUN,NN,_,34,compound,_,SpaceAfter=No|Translit=youyì\n
21413,34,關係,_,NOUN,NN,_,30,obj,_,SpaceAfter=No|Translit=guān係\n


In [40]:
df_chinese.upos.value_counts()

NOUN     5410
VERB     3467
PUNCT    2902
PART     1881
PROPN    1361
ADP      1288
ADV      1283
NUM       873
PRON      710
ADJ       650
AUX       618
DET       355
X         306
CCONJ     283
SCONJ      28
Name: upos, dtype: int64

In [43]:
df_english.upos.value_counts()

NOUN     4040
ADP      2493
PUNCT    2451
VERB     2156
DET      2086
PROPN    1727
ADJ      1540
PRON     1021
AUX      1014
ADV       849
CCONJ     576
NUM       455
PART      426
SCONJ     290
SYM        42
X          16
INTJ        1
Name: upos, dtype: int64
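
The two treebanks do not have the same number of tokens, so the raw counts above are easier to compare as proportions. The following is a minimal sketch, with the counts copied from the two outputs above; `compare_counts` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above.
zh_counts = pd.Series({
    "NOUN": 5410, "VERB": 3467, "PUNCT": 2902, "PART": 1881, "PROPN": 1361,
    "ADP": 1288, "ADV": 1283, "NUM": 873, "PRON": 710, "ADJ": 650,
    "AUX": 618, "DET": 355, "X": 306, "CCONJ": 283, "SCONJ": 28,
})
en_counts = pd.Series({
    "NOUN": 4040, "ADP": 2493, "PUNCT": 2451, "VERB": 2156, "DET": 2086,
    "PROPN": 1727, "ADJ": 1540, "PRON": 1021, "AUX": 1014, "ADV": 849,
    "CCONJ": 576, "NUM": 455, "PART": 426, "SCONJ": 290, "SYM": 42,
    "X": 16, "INTJ": 1,
})

def compare_counts(a, b, names=("zh", "en")):
    """Align two count Series and compare their relative frequencies."""
    # Tags missing in one language become 0 after alignment.
    table = pd.DataFrame({names[0]: a, names[1]: b}).fillna(0)
    rel = table / table.sum()  # each column now sums to 1
    rel["diff"] = rel[names[0]] - rel[names[1]]
    return rel.sort_values("diff")

rel = compare_counts(zh_counts, en_counts)
print(rel.round(3))
```

On these counts, DET has the most negative difference (much rarer in Chinese) and PART the most positive. The same helper should work for dependency labels by passing `df_chinese.deprel.value_counts().head(20)` and `df_english.deprel.value_counts().head(20)`.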