## Chapter 1: explore the parallel UD treebank (PUD)
1. Go to https://universaldependencies.org/ and download the Version 2.7 treebanks.
2. Look up the Parallel UD treebanks for the 19 languages that have one. They are named e.g. UD_English-PUD/.
3. Select a language to compare with English.
4. Compute statistics on the frequencies of POS tags and dependency labels in your language compared with English: find the top-20 tags/labels and their number of occurrences. What does this tell you about the language? (This can be done with shell or Python programming or with the gf-ud tool.)
5. Convert the following four trees from CoNLL format to graphical trees by hand, on paper.
 - a short English tree (5-10 words, of your choice) and its translation.
 - a long English tree (>25 words) and its translation.
6. Draw word alignments for some non-trivial example in the PUD treebank, on paper. Use the same trees as in the previous question. What can you say about the syntactic differences between the languages?
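
For step 4, the tag and label counts can be gathered with a few lines of Python. The following is a minimal sketch, not a reference solution: the function name `count_column` and the inline sample are illustrative assumptions, and in practice you would pass the lines of the `.conllu` file found inside a directory such as `UD_English-PUD/`. It relies only on the CoNLL-U layout, in which each token line has ten tab-separated fields with UPOS in the 4th column and DEPREL in the 8th.

```python
from collections import Counter

def count_column(lines, col):
    """Count the values in a 0-based column of CoNLL-U token lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue  # skip blank lines and sentence-level comments
        fields = line.split('\t')
        if len(fields) != 10 or not fields[0].isdigit():
            continue  # skip malformed lines, multiword ranges (1-2) and empty nodes (1.1)
        counts[fields[col]] += 1
    return counts

# Tiny hand-made sample (hypothetical, not copied from PUD)
sample = [
    "# text = Who are they?",
    "1\tWho\twho\tPRON\t_\t_\t0\troot\t_\t_",
    "2\tare\tbe\tAUX\t_\t_\t1\tcop\t_\t_",
    "3\tthey\tthey\tPRON\t_\t_\t1\tnsubj\t_\t_",
    "4\t?\t?\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "",
]

pos_counts = count_column(sample, 3)     # UPOS is the 4th field
label_counts = count_column(sample, 7)   # DEPREL is the 8th field
print(pos_counts.most_common(20))
print(label_counts.most_common(20))
```

For a real treebank, open the file with `encoding='utf8'` and pass the file object as `lines`. Running the same function on the English file and on the comparison language gives two top-20 lists that can be set side by side.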

In [57]:
import pandas as pd
from string import punctuation

def process_text(file):
    '''Read a TXT file of `word:<POS>` tokens (one sentence per line) and
    return each sentence as a '# text = ...' line plus CoNLL-U-style rows.'''
    with open(file, 'r', encoding="utf8") as f:
        # each line is a list of tokens; blank lines are skipped to avoid an IndexError below
        lines = [l.split() for l in f if l.strip()]

    for l in lines:
        if l[-1][0] not in punctuation:  # sentence lacks final punctuation: add a period
            l.append('.')

    sents = []
    for l in lines:
        rows = []  # a row represents a token; a list of rows represents a sentence
        for i, t in enumerate(l, start=1):
            if ':<' in t:
                word, pos = t.split(':<', 1)
                pos = pos[:-1]  # drop the trailing '>'
            else:
                word, pos = t, 'pos'  # untagged token: keep a placeholder POS
            # columns: position, word, '_', postag, '_', '_', head, label, '_', '_'
            rows.append([str(i), word, '_', pos, '_', '_', 'head', 'label', '_', '_'])

        # rebuild the sentence text: no space before tokens that start with punctuation
        og_text = '# text =' + ''.join(r[1] if r[1][0] in punctuation else ' ' + r[1] for r in rows)
        rows.insert(0, og_text)
        sents.append(rows)

    return sents

process_text('comp-syntax-corpus-english.txt')

[['# text = Who are they?',
  ['1', 'Who', '_', 'PRON', '_', '_', 'head', 'label', '_', '_'],
  ['2', 'are', '_', 'AUX', '_', '_', 'head', 'label', '_', '_'],
  ['3', 'they', '_', 'PRON', '_', '_', 'head', 'label', '_', '_'],
  ['4', '?', '_', 'PUNCT', '_', '_', 'head', 'label', '_', '_']],
 ['# text = A small town with two minarets glides by.',
  ['1', 'A', '_', 'DET', '_', '_', 'head', 'label', '_', '_'],
  ['2', 'small', '_', 'ADJ', '_', '_', 'head', 'label', '_', '_'],
  ['3', 'town', '_', 'NOUN', '_', '_', 'head', 'label', '_', '_'],
  ['4', 'with', '_', 'ADP', '_', '_', 'head', 'label', '_', '_'],
  ['5', 'two', '_', 'NUM', '_', '_', 'head', 'label', '_', '_'],
  ['6', 'minarets', '_', 'NOUN', '_', '_', 'head', 'label', '_', '_'],
  ['7', 'glides', '_', 'VERB', '_', '_', 'head', 'label', '_', '_'],
  ['8', 'by', '_', 'ADV', '_', '_', 'head', 'label', '_', '_'],
  ['9', '.', '_', 'PUNCT', '_', '_', 'head', 'label', '_', '_']],
 ['# text = I was just a boy with muddy shoes.',
  ['1

In [59]:
lang = 'english'
sents = process_text(f'comp-syntax-corpus-{lang}.txt')

# join each token row into one tab-separated line; sen[0] is the '# text = ...' comment
new_sents = [[sen[0]] + ['\t'.join(r) for r in sen[1:]] for sen in sents]

# write with an explicit encoding so the output does not depend on the platform default
with open(f'{lang}-tab-separated.txt', 'w', encoding='utf8') as f:
    for sent in new_sents:
        for r in sent:
            f.write(r + '\n')
        f.write('\n\n')
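
Before filling in the `head` and `label` placeholders by hand, it can help to confirm that every token line in the written file has ten tab-separated fields. A minimal sketch, assuming the file layout produced by the cell above; the helper name `bad_token_lines` is hypothetical:

```python
def bad_token_lines(lines):
    """Return (line_number, text) pairs for token lines without exactly 10 fields."""
    bad = []
    for n, line in enumerate(lines, start=1):
        line = line.rstrip('\n')
        if not line or line.startswith('#'):
            continue  # blank separators and '# text = ...' comments are not token lines
        if len(line.split('\t')) != 10:
            bad.append((n, line))
    return bad

# Illustrative input: one well-formed row and one row that is missing a column
example = [
    "# text = Who are they?",
    "1\tWho\t_\tPRON\t_\t_\thead\tlabel\t_\t_",
    "2\tare\t_\tAUX\t_\t_\thead\t_\t_",
]
print(bad_token_lines(example))
```

To check a real file, pass an open file object (for example one opened on `english-tab-separated.txt` with `encoding='utf8'`) as `lines`. An empty result means every row has the expected number of columns.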

In [68]:
eng_sents = [s[0].replace('text', 'text_en',1) for s in sents]
eng_sents

['# text_en = Who are they?',
 '# text_en = A small town with two minarets glides by.',
 '# text_en = I was just a boy with muddy shoes.',
 "# text_en = Shenzhen's traffic police have opted for unconventional penalties before.",
 '# text_en = The study of volcanoes is called volcanology, sometimes spelled vulcanology.',
 '# text_en = It was conducted just off the Mexican coast from April to June.',
 '# text_en =" Her voice literally went around the world," Leive said.',
 '# text_en = A witness told police that the victim had attacked the suspect in April.',
 "# text_en = It's most obvious when a celebrity's name is initially quite rare.",
 '# text_en = This has not stopped investors flocking to put their money in the funds.',
 '# text_en = This discordance between economic data and political rhetoric is familiar, or should be.',
 '# text_en = The feasibility study estimates that it would take passengers about four minutes to cross the Potomac River on the gondola.',
 '# text_en = he co

In [84]:
enumerate(translations)

<enumerate at 0x2441b7d9b80>

In [91]:
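# text_tokens is defined in a cell not shown here; each entry holds two comment
# lines (sent[0], sent[1]) followed by a list of token-row strings (sent[2])
with open('chinese-tab-separated.txt', 'w', encoding='utf8') as f:
    for sent in text_tokens:
        f.write(sent[0]+'\n')
        f.write(sent[1]+'\n')
        for r in sent[2]:
            f.write(r+'\n')
        f.write('\n\n')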