## Chapter 1: Explore the Parallel UD treebank (PUD)
1. Go to https://universaldependencies.org/ and download the Version 2.7 treebanks.
2. Look up the Parallel UD treebanks for the 19 languages that have one. They are named e.g. UD_English-PUD/
3. Select a language to compare with English.
4. Compute statistics on the frequencies of POS tags and dependency labels in your language compared with English: find the top-20 tags/labels and their number of occurrences. What does this tell you about the language? (This can be done with shell or Python programming or with the gf-ud tool.)
5. Convert the following four trees from CoNLL format to graphical trees by hand, on paper.
 - a short English tree (5-10 words, of your choice) and its translation.
 - a long English tree (>25 words) and its translation.
6. Draw word alignments for some non-trivial example in the PUD treebank, on paper. Use the same trees as in the previous question. What can you say about the syntactic differences between the languages?
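
For step 5, a hand-drawn tree can be cross-checked against an indented rendering built from the ID, FORM, HEAD and DEPREL columns. The following is only a sketch: `render_tree` is a hypothetical helper, and the four-token sentence is a made-up toy example, not taken from PUD.

```python
from collections import defaultdict

def render_tree(tokens):
    """Return indented lines for a dependency tree.

    tokens: list of (id, form, head, deprel) with integer id and head;
    head 0 marks the root, as in CoNLL-U.
    """
    children = defaultdict(list)
    for tok_id, form, head, deprel in tokens:
        children[head].append((tok_id, form, deprel))

    lines = []

    def walk(head, depth):
        # Visit the dependents of `head` in word order, one level deeper.
        for tok_id, form, deprel in sorted(children[head]):
            lines.append("    " * depth + f"{deprel}: {form} ({tok_id})")
            walk(tok_id, depth + 1)

    walk(0, 0)
    return lines

# Toy sentence (invented for illustration).
toy = [
    (1, "The", 2, "det"),
    (2, "cat", 3, "nsubj"),
    (3, "sleeps", 0, "root"),
    (4, ".", 3, "punct"),
]
tree_lines = render_tree(toy)
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one arc in the drawn tree, so the depth of a word on paper should match its indentation here.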

The ten CoNLL-U fields, as described at https://universaldependencies.org/format.html:
* ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
* FORM: Word form or punctuation symbol.
* LEMMA: Lemma or stem of word form.
* UPOS: Universal part-of-speech tag.
* XPOS: Language-specific part-of-speech tag; underscore if not available.
* FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
* HEAD: Head of the current word, which is either a value of ID or zero (0).
* DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
* DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
* MISC: Any other annotation.
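
As a small illustration of the field list, one token line can be split into the ten named fields like this. This is a sketch only: `Token`, `parse_token_line` and `parse_feats` are names made up here, and the sample line is assembled from the English row for "the" shown in the output further down.

```python
from typing import NamedTuple

class Token(NamedTuple):
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str

def parse_token_line(line: str) -> Token:
    """Split one tab-separated CoNLL-U token line into its ten fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return Token(*fields)

def parse_feats(feats: str) -> dict:
    """Turn 'Definite=Def|PronType=Art' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

line = "5\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t7\tdet\t7:det\t_\n"
tok = parse_token_line(line)
print(tok.form, tok.upos, tok.head, parse_feats(tok.feats))
```

All fields are kept as strings, because ID can be a range or a decimal and is therefore not always an integer.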


In [154]:
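import pandas as pd

def process_conllu(file):
    '''Reads a CoNLL-U (PUD) file and returns a dataframe with one row per token line'''
    tabs = ['id','form','lemma','upos','xpos','feats','head','deprel','deps','misc']
    with open(file, 'r', encoding="utf8") as f:
        # Skip comment lines (starting with '#'); strip the newline before splitting on tabs
        lines = [l.rstrip('\n').split('\t') for l in f if not l.startswith('#')]
    # Token lines have exactly ten fields; blank lines between sentences are dropped here
    lines = [l for l in lines if len(l) == len(tabs)]
    return pd.DataFrame(lines, columns=tabs)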
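
`process_conllu` flattens all sentences into one token list, so sentence boundaries are lost. A possible variant that keeps them is sketched below. It assumes that sentences are separated by blank lines (as the CoNLL-U format specifies); `read_conllu_with_sentences` and the `sent` column are names invented for this sketch, and the two toy sentences are not from PUD.

```python
import io
import pandas as pd

COLUMNS = ['id', 'form', 'lemma', 'upos', 'xpos',
           'feats', 'head', 'deprel', 'deps', 'misc']

def read_conllu_with_sentences(handle):
    """Read CoNLL-U lines from an open file-like object.

    Returns a dataframe with the ten CoNLL-U columns plus a running
    sentence number in the (hypothetical) column 'sent'.
    """
    rows = []
    sent_nr = 0
    in_sentence = False
    for line in handle:
        line = line.rstrip("\n")
        if not line:                  # a blank line ends the current sentence
            in_sentence = False
            continue
        if line.startswith("#"):      # comment / metadata line
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            continue
        if not in_sentence:           # first token line of a new sentence
            sent_nr += 1
            in_sentence = True
        rows.append([sent_nr] + fields)
    return pd.DataFrame(rows, columns=["sent"] + COLUMNS)

# Two toy sentences (invented for illustration).
sample = (
    "# sent_id = toy1\n"
    "1\tDogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = toy2\n"
    "1\tStop\tstop\tVERB\tVB\t_\t0\troot\t_\t_\n"
    "\n"
)
df_toy = read_conllu_with_sentences(io.StringIO(sample))
print(df_toy[["sent", "id", "form", "upos"]])
```

With a sentence column, queries over adjacent tokens can be restricted to pairs inside the same sentence, for example via `groupby("sent")`.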


In [155]:
Chinese_PUD = 'UD_Chinese-PUD/zh_pud-ud-test.conllu'
English_PUD = 'UD_English-PUD/en_pud-ud-test.conllu'
df_chinese = process_conllu(Chinese_PUD)
df_english = process_conllu(English_PUD)
df_chinese

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,misc
0,1,"""",_,PUNCT,``,_,18,punct,_,"SpaceAfter=No|Translit="""
1,2,雖然,_,SCONJ,IN,_,8,mark,_,SpaceAfter=No|Translit=suīrán
2,3,美國,_,PROPN,NNP,_,7,nmod,_,SpaceAfter=No|Translit=měiguó
3,4,的,_,PART,DEC,Case=Gen,3,case,_,SpaceAfter=No|Translit=de
4,5,許多,_,NUM,CD,NumType=Card,7,nummod,_,SpaceAfter=No|Translit=xǔduō
...,...,...,...,...,...,...,...,...,...,...
21410,31,和平,_,ADJ,JJ,_,34,amod,_,SpaceAfter=No|Translit=hépíng
21411,32,的,_,PART,DEC,_,31,mark:relcl,_,SpaceAfter=No|Translit=de
21412,33,友誼,_,NOUN,NN,_,34,compound,_,SpaceAfter=No|Translit=youyì
21413,34,關係,_,NOUN,NN,_,30,obj,_,SpaceAfter=No|Translit=guān係


In [190]:
df_english[df_english.feats != '_']  # SELECT * FROM df_english WHERE feats != '_'

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,misc
2,3,much,much,ADJ,JJ,Degree=Pos,9,nsubj,9:nsubj,_
4,5,the,the,DET,DT,Definite=Def|PronType=Art,7,det,7:det,_
5,6,digital,digital,ADJ,JJ,Degree=Pos,7,amod,7:amod,_
6,7,transition,transition,NOUN,NN,Number=Sing,3,nmod,3:nmod:of,_
7,8,is,be,AUX,VBZ,Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbF...,9,cop,9:cop,_
...,...,...,...,...,...,...,...,...,...,...
21176,19,declared,declare,VERB,VBD,Mood=Ind|Tense=Past|VerbForm=Fin,12,acl:relcl,12:acl:relcl,_
21177,20,himself,himself,PRON,PRP,Case=Acc|Gender=Masc|Number=Sing|Person=3|Pron...,19,obj,19:obj|22:nsubj:xsubj,_
21178,21,a,a,DET,DT,Definite=Ind|PronType=Art,22,det,22:det,_
21179,22,friend,friend,NOUN,NN,Number=Sing,19,xcomp,19:xcomp,_


In [204]:
# SELECT * FROM df_chinese WHERE feats != '_'
df_chinese[df_chinese.feats != '_']

Unnamed: 0,id,form,lemma,upos,xpos,feats,head,deprel,deps,misc
3,4,的,_,PART,DEC,Case=Gen,3,case,_,SpaceAfter=No|Translit=de
4,5,許多,_,NUM,CD,NumType=Card,7,nummod,_,SpaceAfter=No|Translit=xǔduō
13,14,的,_,PART,DEC,Case=Gen,13,case,_,SpaceAfter=No|Translit=de
22,23,的,_,PART,DEC,Case=Gen,22,case,_,SpaceAfter=No|Translit=de
30,31,一,_,NUM,CD,NumType=Card,30,nummod,_,SpaceAfter=No|Translit=yī
...,...,...,...,...,...,...,...,...,...,...
21384,5,1,_,NUM,CD,NumType=Card,6,nummod,_,SpaceAfter=No|Translit=1
21386,7,1,_,NUM,CD,NumType=Card,8,nummod,_,SpaceAfter=No|Translit=1
21393,14,了,_,PART,AS,Aspect=Perf,13,aux,_,SpaceAfter=No|Translit=le
21394,15,一,_,NUM,CD,NumType=Card,16,nummod,_,SpaceAfter=No|Translit=yī


In [168]:
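# NOUN rows immediately followed by a VERB row; shift(-1) avoids a KeyError on the last row
is_pair = (df_chinese.upos == 'NOUN') & (df_chinese.upos.shift(-1) == 'VERB')
idxs = df_chinese.index[is_pair].tolist()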

In [170]:
for i in idxs:
    print(df_chinese[i:i+2])

  id form lemma  upos xpos feats head deprel deps                              misc
6  7   轉型     _  NOUN   NN     _    8  nsubj    _  SpaceAfter=No|Translit=zhuǎnxíng
7  8   都是     _  VERB   VC     _   18  advcl    _     SpaceAfter=No|Translit=dōushì
   id form lemma  upos xpos feats head     deprel deps                            misc
66  8    次     _  NOUN  NNB     _   16        clf    _       SpaceAfter=No|Translit=cì
67  9   進行     _  VERB   VV     _   16  acl:relcl    _  SpaceAfter=No|Translit=jìnxíng
    id form lemma  upos xpos feats head    deprel deps                                           misc
78  20  共和黨     _  NOUN   NN     _   21  compound    _  Proper=True|SpaceAfter=No|Translit=gònghédǎng
79  21  候選人     _  VERB   VV     _   18       obj    _              SpaceAfter=No|Translit=houxuǎnrén
     id form lemma  upos xpos feats head    deprel deps                                      misc
130  27   大學     _  NOUN   NN     _   28  compound    _  Proper=True|SpaceAfter=No|

     id form lemma  upos xpos feats head deprel deps                            misc
704  17   組件     _  NOUN   NN     _   14    obj    _   SpaceAfter=No|Translit=zǔjiàn
705  18   改用     _  VERB   VV     _    0   root    _  SpaceAfter=No|Translit=gǎiyòng
    id form lemma  upos xpos feats head deprel deps                            misc
727  3   手機     _  NOUN   NN     _    4  nsubj    _   SpaceAfter=No|Translit=shǒujī
728  4   缺少     _  VERB   VV     _    0   root    _  SpaceAfter=No|Translit=quēshǎo
    id form lemma  upos xpos feats head deprel deps                          misc
730  6    個     _  NOUN  NNB     _   18    clf    _     SpaceAfter=No|Translit=gè
731  7    像     _  VERB   VV     _   18   amod    _  SpaceAfter=No|Translit=xiàng
    id form lemma  upos xpos feats head    deprel deps                             misc
747  4   專家     _  NOUN   NN     _    5  compound    _  SpaceAfter=No|Translit=zhuānjiā
748  5   警告     _  VERB   VV     _    2     ccomp    _   SpaceAfter=No|

      id form lemma  upos xpos feats head     deprel deps                            misc
1611  10   大群     _  NOUN  NNB     _   17        clf    _    SpaceAfter=No|Translit=dàqún
1612  11   自稱     _  VERB   VV     _   17  acl:relcl    _  SpaceAfter=No|Translit=zìchēng
     id form lemma  upos xpos feats head deprel deps                            misc
1627  2   人員     _  NOUN   NN     _   10  nsubj    _  SpaceAfter=No|Translit=rényuán
1628  3   付出     _  VERB   VV     _   10    acl    _    SpaceAfter=No|Translit=fùchū
      id form lemma  upos xpos feats head     deprel deps                           misc
1635  10   研究     _  NOUN   NN     _    0       root    _  SpaceAfter=No|Translit=yánjiū
1636  11   發展     _  VERB   VV     _   16  acl:relcl    _  SpaceAfter=No|Translit=fāzhǎn
      id form lemma  upos xpos feats head    deprel deps                           misc
1638  13   激素     _  NOUN   NN     _   14  compound    _    SpaceAfter=No|Translit=jīsù
1639  14  避孕藥     _  VERB   VV  

      id form lemma  upos xpos feats head deprel deps                             misc
2542  18   客戶     _  NOUN   NN     _   19  nsubj    _      SpaceAfter=No|Translit=kèhù
2543  19   監控     _  VERB   VV     _   17  ccomp    _  SpaceAfter=No|Translit=jiānkòng
      id form lemma  upos xpos feats head    deprel deps                                 misc
2545  21   數據     _  NOUN   NN     _   22  compound    _         SpaceAfter=No|Translit=shùjù
2546  22  使用量     _  VERB   VV     _   19       obj    _  SpaceAfter=No|Translit=shǐyòngliàng
      id form lemma  upos xpos feats head deprel deps                           misc
2608  23   夜晚     _  NOUN   NN     _   24    obl    _   SpaceAfter=No|Translit=yèwǎn
2609  24   入睡     _  VERB   VV     _   20  ccomp    _  SpaceAfter=No|Translit=rùshuì
      id form lemma  upos xpos feats head deprel deps                         misc
2627  42   大麻     _  NOUN   NN     _   43  nsubj    _  SpaceAfter=No|Translit=dàmá
2628  43    是     _  VERB   VC     _

      id form lemma  upos xpos feats head     deprel deps                           misc
3541  18   風格     _  NOUN   NN     _   19      nsubj    _  SpaceAfter=No|Translit=fēnggé
3542  19    似     _  VERB   VV     _   22  acl:relcl    _     SpaceAfter=No|Translit=shì
     id form lemma  upos xpos feats head  deprel deps                            misc
3603  3    次     _  NOUN  NNB     _    4  advmod    _       SpaceAfter=No|Translit=cì
3604  4   展出     _  VERB   VV     _   17   csubj    _  SpaceAfter=No|Translit=zhǎnchū
     id form lemma  upos xpos feats head deprel deps                           misc
3630  1   選區     _  NOUN   NN     _    2  nsubj    _  SpaceAfter=No|Translit=xuǎnqū
3631  2    位     _  VERB   VV     _   10    dep    _     SpaceAfter=No|Translit=wèi
      id form lemma  upos xpos feats head deprel deps                            misc
3638   9   選民     _  NOUN   NN     _   10  nsubj    _  SpaceAfter=No|Translit=xuǎnmín
3639  10    有     _  VERB   VV     _    0   root   

      id form lemma  upos xpos feats head deprel deps                            misc
4987  33   博士     _  NOUN   NN     _   34  nsubj    _    SpaceAfter=No|Translit=bóshì
4988  34   補充     _  VERB   VV     _    0   root    _  SpaceAfter=No|Translit=bǔchōng
      id form lemma  upos xpos feats head deprel deps                            misc
5094  15   概念     _  NOUN   NN     _   16  nsubj    _  SpaceAfter=No|Translit=gàiniàn
5095  16   叫做     _  VERB   VV     _    0   root    _  SpaceAfter=No|Translit=jiàozuò
     id form lemma  upos xpos feats head deprel deps                         misc
5142  5    站     _  NOUN   NN     _    6  nsubj    _  SpaceAfter=No|Translit=zhàn
5143  6    是     _  VERB   VC     _    0   root    _   SpaceAfter=No|Translit=shì
     id form lemma  upos xpos feats head deprel deps                        misc
5160  8    南     _  NOUN   NN     _    9    obl    _  SpaceAfter=No|Translit=nán
5161  9    走     _  VERB   VV     _    6  xcomp    _  SpaceAfter=No|Translit

6394  20   看到     _  VERB   VV     _   15   ccomp    _  SpaceAfter=No|Translit=kàndào
     id form lemma  upos xpos feats head deprel deps                                      misc
6525  6   大廈     _  NOUN   NN     _    4  appos    _  Proper=True|SpaceAfter=No|Translit=dàshà
6526  7   聘請     _  VERB   VV     _    0   root    _              SpaceAfter=No|Translit=聘qǐng
      id form lemma  upos xpos feats head deprel deps                                       misc
6576  15   法庭     _  NOUN   NN     _   12    obj    _  Proper=True|SpaceAfter=No|Translit=fǎtíng
6577  16   受審     _  VERB   VV     _   20    dep    _            SpaceAfter=No|Translit=shòushěn
      id form lemma  upos xpos feats head deprel deps                              misc
6583  22    起     _  NOUN  NNB     _   23    clf    _         SpaceAfter=No|Translit=qǐ
6584  23  謀殺罪     _  VERB   VV     _   20    obj    _  SpaceAfter=No|Translit=móushāzuì
      id form lemma  upos xpos feats head    deprel deps                  

      id form lemma  upos xpos feats head deprel deps                            misc
7713  10   建設     _  NOUN   NN     _    7    obj    _  SpaceAfter=No|Translit=jiànshè
7714  11   看到     _  VERB   VV     _    5  xcomp    _   SpaceAfter=No|Translit=kàndào
     id form lemma  upos xpos feats head deprel deps                            misc
7745  4   首都     _  NOUN   NN     _    5  nsubj    _  SpaceAfter=No|Translit=shǒudōu
7746  5   大放     _  VERB   VV     _    3  ccomp    _   SpaceAfter=No|Translit=dàfàng
      id form lemma  upos xpos feats head deprel deps                                misc
7760  19  時間段     _  NOUN   NN     _   14  appos    _  SpaceAfter=No|Translit=shíjiānduàn
7761  20    去     _  VERB   VV     _   11  xcomp    _           SpaceAfter=No|Translit=qù
     id form lemma  upos xpos feats head deprel deps                          misc
7789  3   地區     _  NOUN   NN     _    1    obj    _   SpaceAfter=No|Translit=deqū
7790  4   不是     _  VERB   VC     _    0   root    

      id form lemma  upos xpos feats head deprel deps                           misc
8884  13   位置     _  NOUN   NN     _   14  nsubj    _  SpaceAfter=No|Translit=wèizhì
8885  14   近海     _  VERB   VV     _   16  xcomp    _  SpaceAfter=No|Translit=jìnhǎi
      id form lemma  upos xpos feats head deprel deps                           misc
8918  27   食品     _  NOUN   NN     _   12   conj    _  SpaceAfter=No|Translit=shípǐn
8919  28   提供     _  VERB   VV     _    0   root    _  SpaceAfter=No|Translit=tígōng
      id form lemma  upos xpos feats head deprel deps                               misc
8978  26  海岸線     _  NOUN   NN     _   23    obj    _  SpaceAfter=No|Translit=hǎi'ànxiàn
8979  27   進行     _  VERB   VV     _    0   root    _     SpaceAfter=No|Translit=jìnxíng
     id form lemma  upos xpos feats head     deprel deps                        misc
9009  5    個     _  NOUN  NNB     _   11        clf    _   SpaceAfter=No|Translit=gè
9010  6    由     _  VERB   VV     _   11  acl:relcl  

      id form lemma  upos xpos feats head deprel deps                            misc
9776  25   原因     _  NOUN   NN     _   26  nsubj    _  SpaceAfter=No|Translit=yuányīn
9777  26    有     _  VERB   VV     _    0   root    _      SpaceAfter=No|Translit=yǒu
      id form lemma  upos xpos feats head    deprel deps                          misc
9804  12   自然     _  NOUN   NN     _   13  compound    _  SpaceAfter=No|Translit=zìrán
9805  13  棲息地     _  VERB   VV     _   11       obj    _  SpaceAfter=No|Translit=棲xide
      id form lemma  upos xpos feats head     deprel deps                         misc
9847  17    個     _  NOUN  NNB     _   20        clf    _    SpaceAfter=No|Translit=gè
9848  18   獨立     _  VERB   VV     _   20  acl:relcl    _  SpaceAfter=No|Translit=dúlì
     id form lemma  upos xpos feats head deprel deps                                        misc
9857  6    年     _  NOUN  NNB     _   11    obl    _                 SpaceAfter=No|Translit=nián
9858  7   揚帆     _  VERB  

       id form lemma  upos xpos feats head deprel deps                          misc
11033   9    年     _  NOUN  NNB     _   10    obl    _   SpaceAfter=No|Translit=nián
11034  10   舉辦     _  VERB   VV     _    6  xcomp    _  SpaceAfter=No|Translit=jǔbàn
      id form lemma  upos xpos feats head    deprel deps                             misc
11041  5   官府     _  NOUN   NN     _    6  compound    _    SpaceAfter=No|Translit=guānfǔ
11042  6  代理人     _  VERB   VV     _    3       obj    _  SpaceAfter=No|Translit=dàilǐrén
       id form lemma  upos xpos feats head deprel deps                           misc
11067  12   淤泥     _  NOUN   NN     _   13  nsubj    _     SpaceAfter=No|Translit=淤ní
11068  13   沉積     _  VERB   VV     _   15   amod    _  SpaceAfter=No|Translit=chénjī
      id form lemma  upos xpos feats head deprel deps                              misc
11130  5    次     _  NOUN  NNB     _    8    clf    _         SpaceAfter=No|Translit=cì
11131  6  挑釁性     _  VERB   VV     _    8

       id form lemma  upos xpos feats head     deprel deps                             misc
11764  27   附近     _  NOUN   NN     _   28     advmod    _     SpaceAfter=No|Translit=fùjìn
11765  28   生活     _  VERB   VV     _   30  acl:relcl    _  SpaceAfter=No|Translit=shēnghuó
       id form lemma  upos xpos feats head deprel deps                          misc
11767  30   社區     _  NOUN   NN     _   31    obl    _  SpaceAfter=No|Translit=shèqū
11768  31   持續     _  VERB   VV     _    0   root    _  SpaceAfter=No|Translit=chíxù
      id form lemma  upos xpos feats head deprel deps                         misc
11774  3   區域     _  NOUN   NN     _    4  nsubj    _  SpaceAfter=No|Translit=qūyù
11775  4    是     _  VERB   VC     _    0   root    _   SpaceAfter=No|Translit=shì
      id form lemma  upos xpos feats head       deprel deps                           misc
11789  5   日出     _  NOUN   NN     _    6  obl:patient    _   SpaceAfter=No|Translit=rìchū
11790  6   標記     _  VERB   VV     _  

      id form lemma  upos xpos feats head deprel deps                                       misc
12667  8   軍隊     _  NOUN   NN     _    9  nsubj    _  Proper=True|SpaceAfter=No|Translit=jūnduì
12668  9   佔領     _  VERB   VV     _    2  xcomp    _               SpaceAfter=No|Translit=佔lǐng
      id form lemma  upos xpos feats head deprel deps                           misc
12696  5   軍隊     _  NOUN   NN     _    6  nsubj    _  SpaceAfter=No|Translit=jūnduì
12697  6   遭到     _  VERB   VV     _    0   root    _  SpaceAfter=No|Translit=zāodào
      id form lemma  upos xpos feats head deprel deps                             misc
12710  8   人口     _  NOUN   NN     _    9  nsubj    _    SpaceAfter=No|Translit=rénkǒu
12711  9   降到     _  VERB   VV     _    0   root    _  SpaceAfter=No|Translit=jiàngdào
       id form lemma  upos xpos feats head deprel deps                            misc
12727  11   歷史     _  NOUN   NN     _    9   conj    _    SpaceAfter=No|Translit=lìshǐ
12728  12   組成     

       id form lemma  upos xpos feats head    deprel deps                              misc
13591  16   坐標     _  NOUN   NN     _   17  compound    _    SpaceAfter=No|Translit=zuòbiāo
13592  17  分辨率     _  VERB   VV     _   12       obj    _  SpaceAfter=No|Translit=fēnbiànlǜ
       id form lemma  upos xpos feats head    deprel deps                                misc
13597  22   雷達     _  NOUN   NN     _   23  compound    _        SpaceAfter=No|Translit=léidá
13598  23  控制員     _  VERB   VV     _   36     nsubj    _  SpaceAfter=No|Translit=kòngzhìyuán
       id form lemma  upos xpos feats head    deprel deps                              misc
13602  27  高精度     _  NOUN   NN     _   28  compound    _  SpaceAfter=No|Translit=gāojīngdù
13603  28  顯示器     _  VERB   VV     _   26       obj    _  SpaceAfter=No|Translit=xiǎnshìqì
       id form lemma  upos xpos feats head deprel deps                          misc
13616  41   飛機     _  NOUN   NN     _   42  nsubj    _  SpaceAfter=No|Translit=fē

      id form lemma  upos xpos feats head deprel deps                          misc
14868  3    人     _  NOUN   NN     _   10  nsubj    _    SpaceAfter=No|Translit=rén
14869  4   參與     _  VERB   VV     _   10  advcl    _  SpaceAfter=No|Translit=cānyǔ
       id form lemma  upos xpos feats head  deprel deps                            misc
14874   9   其中     _  NOUN   NN     _   10  advmod    _  SpaceAfter=No|Translit=qízhōng
14875  10   包括     _  VERB   VV     _    0    root    _   SpaceAfter=No|Translit=bāokuò
      id form lemma  upos xpos feats head deprel deps                          misc
14913  8   數據     _  NOUN   NN     _    9  nsubj    _  SpaceAfter=No|Translit=shùjù
14914  9   證實     _  VERB   VV     _    0   root    _   SpaceAfter=No|Translit=證shí
       id form lemma  upos xpos feats head deprel deps                           misc
14917  12   地點     _  NOUN   NN     _   13  nsubj    _  SpaceAfter=No|Translit=dediǎn
14918  13   符合     _  VERB   VV     _    9  ccomp    _    Sp

       id form lemma  upos xpos feats head deprel deps                            misc
15761  10    日     _  NOUN  NNB     _   11    obl    _       SpaceAfter=No|Translit=rì
15762  11   出院     _  VERB   VV     _   13  advcl    _  SpaceAfter=No|Translit=chūyuàn
      id form lemma  upos xpos feats head deprel deps                                        misc
15782  4   上帝     _  NOUN   NN     _   11  nsubj    _  Proper=True|SpaceAfter=No|Translit=shàngdì
15783  5   拯救     _  VERB   VV     _   11  advcl    _                 SpaceAfter=No|Translit=拯jiù
       id form lemma  upos xpos feats head deprel deps                           misc
15831   9   必要     _  NOUN   NN     _    8    obj    _   SpaceAfter=No|Translit=bìyào
15832  10   建立     _  VERB   VV     _    0   root    _  SpaceAfter=No|Translit=jiànlì
       id form lemma  upos xpos feats head       deprel deps                            misc
15857  18    家     _  NOUN   NN     _   19  obl:patient    _      SpaceAfter=No|Translit=jiā
1

       id form lemma  upos xpos feats head     deprel deps                           misc
16757  13    個     _  NOUN  NNB     _   17        clf    _      SpaceAfter=No|Translit=gè
16758  14   嵌入     _  VERB   VV     _   17  acl:relcl    _  SpaceAfter=No|Translit=qiànrù
       id form lemma  upos xpos feats head deprel deps                               misc
16825  26    名     _  NOUN  NNB     _   27    clf    _        SpaceAfter=No|Translit=míng
16826  27  駕駛員     _  VERB   VV     _   28  nsubj    _  SpaceAfter=No|Translit=jiàshǐyuán
       id form lemma  upos xpos feats head deprel deps                          misc
16829  30   飛機     _  NOUN   NN     _   33  nsubj    _  SpaceAfter=No|Translit=fēijī
16830  31    在     _  VERB   VV     _   33  advcl    _    SpaceAfter=No|Translit=zài
       id form lemma  upos xpos feats head deprel deps                             misc
16831  32   海上     _  NOUN   NN     _   31    obj    _  SpaceAfter=No|Translit=hǎishàng
16832  33   迫降     _  VERB   

       id form lemma  upos xpos feats head deprel deps                            misc
17713  21   安保     _  NOUN   NN     _   22  nsubj    _   SpaceAfter=No|Translit='ānbǎo
17714  22   使用     _  VERB   VV     _   20  ccomp    _  SpaceAfter=No|Translit=shǐyòng
       id form lemma  upos xpos feats head deprel deps                             misc
17730  10    日     _  NOUN  NNB     _   11    obl    _        SpaceAfter=No|Translit=rì
17731  11   出生     _  VERB   VV     _   31    dep    _  SpaceAfter=No|Translit=chūshēng
       id form lemma  upos xpos feats head     deprel deps                        misc
17739  19    個     _  NOUN  NNB     _   24        clf    _   SpaceAfter=No|Translit=gè
17740  20    反     _  VERB   VV     _   24  acl:relcl    _  SpaceAfter=No|Translit=fǎn
       id form lemma  upos xpos feats head deprel deps                            misc
17750  30   家庭     _  NOUN   NN     _   31  nsubj    _  SpaceAfter=No|Translit=jiātíng
17751  31    受     _  VERB   VV     _   

       id form lemma  upos xpos feats head deprel deps                             misc
18793  19   員工     _  NOUN   NN     _   17   conj    _  SpaceAfter=No|Translit=yuángōng
18794  20   接管     _  VERB   VV     _    0   root    _   SpaceAfter=No|Translit=jiēguǎn
      id form lemma  upos xpos feats head     deprel deps                           misc
18803  8   戰末     _  NOUN   NN     _    9        obl    _  SpaceAfter=No|Translit=zhànmò
18804  9   沒有     _  VERB   VV     _   16  acl:relcl    _  SpaceAfter=No|Translit=méiyǒu
      id form lemma  upos xpos feats head deprel deps                                 misc
18879  6  市政廳     _  NOUN   NN     _   18  nsubj    _  SpaceAfter=No|Translit=shìzhèngtīng
18880  7    建     _  VERB   VV     _   18  advcl    _          SpaceAfter=No|Translit=jiàn
      id form lemma  upos xpos feats head     deprel deps                         misc
18931  4    個     _  NOUN  NNB     _   12        clf    _    SpaceAfter=No|Translit=gè
18932  5    供     _  V

       id form lemma  upos xpos feats head deprel deps                               misc
20042   9  紡織品     _  NOUN   NN     _    7   conj    _  SpaceAfter=No|Translit=fǎngzhīpǐn
20043  10  批發商     _  VERB   VV     _    2    obj    _   SpaceAfter=No|Translit=pīfāshāng
      id form lemma  upos xpos feats head deprel deps                          misc
20058  4    中     _  NOUN   NN     _    5    obl    _  SpaceAfter=No|Translit=zhōng
20059  5   逃離     _  VERB   VV     _   16  xcomp    _  SpaceAfter=No|Translit=táolí
       id form lemma  upos xpos feats head deprel deps                           misc
20117  20  藝術家     _  NOUN   NN     _   32  nsubj    _  SpaceAfter=No|Translit=yì術jiā
20118  21    給     _  VERB   VV     _   25  advcl    _     SpaceAfter=No|Translit=gěi
      id form lemma  upos xpos feats head deprel deps                           misc
20140  7   妻子     _  NOUN   NN     _    8    obl    _    SpaceAfter=No|Translit=qīzi
20141  8   回到     _  VERB   VV     _   13  advcl  

      id form lemma  upos xpos feats head deprel deps                             misc
20834  3   中心     _  NOUN   NN     _    4  nsubj    _  SpaceAfter=No|Translit=zhōngxīn
20835  4    屬     _  VERB   VV     _   19    dep    _       SpaceAfter=No|Translit=shǔ
       id form lemma  upos xpos feats head deprel deps                         misc
20849  18   地區     _  NOUN   NN     _   19  nsubj    _  SpaceAfter=No|Translit=deqū
20850  19    是     _  VERB   VC     _    0   root    _   SpaceAfter=No|Translit=shì
       id form lemma  upos xpos feats head deprel deps                            misc
20853  22   地殼     _  NOUN   NN     _   23  nsubj    _     SpaceAfter=No|Translit=deké
20854  23   潛沒     _  VERB   VV     _   24  xcomp    _  SpaceAfter=No|Translit=qiánméi
       id form lemma  upos xpos feats head deprel deps                           misc
20870  13   不前     _  NOUN   NN     _    8   conj    _  SpaceAfter=No|Translit=bùqián
20871  14   導致     _  VERB   VV     _    0   root    _

In [148]:
def print_freq(df, column, top_freq=20):
    '''Prints the top_freq most frequent values of a column with counts and percentages.
    Percentages are relative to the rows printed, not to the whole column.'''
    counts = df[column].value_counts()[:top_freq]
    total = int(counts.sum())
    for value, count in counts.items():
        print(f'{value}\t{int(count)}\t{round(100*int(count)/total, 2)}%')

In [149]:
print('ENG UPOS:')
print_freq(df_english, 'upos')
print('\nCHI UPOS:')
print_freq(df_chinese, 'upos')

ENG UPOS:
NOUN	4040	19.07%
ADP	2493	11.77%
PUNCT	2451	11.57%
VERB	2156	10.18%
DET	2086	9.85%
PROPN	1727	8.15%
ADJ	1540	7.27%
PRON	1021	4.82%
AUX	1014	4.79%
ADV	849	4.01%
CCONJ	576	2.72%
NUM	455	2.15%
PART	426	2.01%
SCONJ	290	1.37%
SYM	42	0.2%
X	16	0.08%
INTJ	1	0.0%

CHI UPOS:
NOUN	5410	25.26%
VERB	3467	16.19%
PUNCT	2902	13.55%
PART	1881	8.78%
PROPN	1361	6.36%
ADP	1288	6.01%
ADV	1283	5.99%
NUM	873	4.08%
PRON	710	3.32%
ADJ	650	3.04%
AUX	618	2.89%
DET	355	1.66%
X	306	1.43%
CCONJ	283	1.32%
SCONJ	28	0.13%


In [147]:
ch = df_chinese.upos.value_counts().to_dict()
en = df_english.upos.value_counts().to_dict()
en

{'NOUN': 4040,
 'ADP': 2493,
 'PUNCT': 2451,
 'VERB': 2156,
 'DET': 2086,
 'PROPN': 1727,
 'ADJ': 1540,
 'PRON': 1021,
 'AUX': 1014,
 'ADV': 849,
 'CCONJ': 576,
 'NUM': 455,
 'PART': 426,
 'SCONJ': 290,
 'SYM': 42,
 'X': 16,
 'INTJ': 1}
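
The two count dictionaries are easier to compare as relative frequencies side by side. A minimal sketch, using the UPOS counts printed above; `compare_counts` and the column names `en_pct`, `zh_pct` and `diff` are made up here.

```python
import pandas as pd

def compare_counts(a, b, names=("en", "zh")):
    """Return per-tag percentages for two count dicts plus their difference."""
    df = pd.DataFrame({names[0]: pd.Series(a), names[1]: pd.Series(b)}).fillna(0)
    pct = 100 * df / df.sum()          # each column divided by its own total
    pct.columns = [f"{n}_pct" for n in names]
    pct["diff"] = pct.iloc[:, 1] - pct.iloc[:, 0]
    return pct.sort_values("diff")

# Counts copied from the UPOS output above.
en = {'NOUN': 4040, 'ADP': 2493, 'PUNCT': 2451, 'VERB': 2156, 'DET': 2086,
      'PROPN': 1727, 'ADJ': 1540, 'PRON': 1021, 'AUX': 1014, 'ADV': 849,
      'CCONJ': 576, 'NUM': 455, 'PART': 426, 'SCONJ': 290, 'SYM': 42,
      'X': 16, 'INTJ': 1}
ch = {'NOUN': 5410, 'VERB': 3467, 'PUNCT': 2902, 'PART': 1881, 'PROPN': 1361,
      'ADP': 1288, 'ADV': 1283, 'NUM': 873, 'PRON': 710, 'ADJ': 650,
      'AUX': 618, 'DET': 355, 'X': 306, 'CCONJ': 283, 'SCONJ': 28}

table = compare_counts(en, ch)
print(table.round(2))
```

Sorting by `diff` puts the tags that are relatively more frequent in English at the top and those more frequent in Chinese at the bottom. Tags missing from one language (SYM and INTJ for Chinese) are counted as zero.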

In [151]:
print('ENG XPOS:')
print_freq(df_english, 'xpos')
print('\nCHI XPOS:')
print_freq(df_chinese, 'xpos')

ENG XPOS:
NN	3119	15.72%
IN	2716	13.69%
DT	2121	10.69%
NNP	1483	7.48%
JJ	1445	7.29%
NNS	1125	5.67%
,	1002	5.05%
.	1000	5.04%
VBD	875	4.41%
RB	773	3.9%
VBN	591	2.98%
CC	576	2.9%
VB	507	2.56%
PRP	490	2.47%
CD	460	2.32%
VBZ	439	2.21%
VBG	332	1.67%
TO	267	1.35%
VBP	259	1.31%
PRP$	255	1.29%

CHI XPOS:
NN	4667	22.75%
VV	3350	16.33%
NNP	1361	6.63%
DEC	1335	6.51%
RB	1283	6.25%
IN	1191	5.81%
,	1134	5.53%
.	1001	4.88%
CD	919	4.48%
NNB	799	3.89%
JJ	606	2.95%
PRP	543	2.65%
AS	398	1.94%
VC	364	1.77%
DT	355	1.73%
FW	298	1.45%
MD	286	1.39%
CC	285	1.39%
)	171	0.83%
(	170	0.83%
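
Task 4 also asks for dependency labels. `print_freq(df_english, 'deprel')` and `print_freq(df_chinese, 'deprel')` should work the same way as for the POS columns. Since labels such as `acl:relcl` carry language-specific subtypes, it may help to also count the universal part only. The following is a sketch: `deprel_counts` is a hypothetical helper, and the sample list just reuses labels seen in the rows printed earlier, so its counts are not real frequencies.

```python
import pandas as pd

def deprel_counts(deprels, collapse_subtypes=False, top=20):
    """Count dependency labels; optionally strip ':subtype' suffixes first."""
    s = pd.Series(list(deprels))
    if collapse_subtypes:
        # 'acl:relcl' -> 'acl', 'obl:patient' -> 'obl'; plain labels are unchanged
        s = s.str.split(":").str[0]
    return s.value_counts().head(top)

# Labels as they appear in rows shown earlier (illustration only).
sample = ["nsubj", "acl:relcl", "compound", "obj", "acl",
          "mark:relcl", "nsubj", "obl:patient", "obl"]

full = deprel_counts(sample)
collapsed = deprel_counts(sample, collapse_subtypes=True)
print(full)
print(collapsed)
```

On the real data this could be called as `deprel_counts(df_chinese.deprel, collapse_subtypes=True)`, which makes the Chinese and English label inventories directly comparable.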
