
# Chapter 1: explore the parallel UD treebank (PUD)

## Question 4 

First, we read the relevant data:

In [0]:
from pandas import read_csv # a convenient (if slightly hacky) way to load tab-separated CoNLL-U files

# columns 1, 3 and 7 of a CoNLL-U line are the word form, the universal POS tag and the dependency relation
en_table = read_csv('en_pud-ud-test.conllu', sep="\t", comment="#", usecols=[1,3,7], names=["token", "POS tag", "Dep. label"])
es_table = read_csv('es_pud-ud-test.conllu', sep="\t", comment="#", usecols=[1,3,7], names=["token", "POS tag", "Dep. label"])
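
Since `read_csv` is not designed for CoNLL-U, a few lines may be misread: `comment="#"` discards everything from the first `#` on a line, the default quoting treats `"` as a quote character, and multiword-token ranges (IDs such as `2-3`), if the file contains them, end up as extra rows. A minimal sketch of a stricter reader is given below. The function name `read_conllu_rows` and the sample sentence are hypothetical and only serve as an illustration; this is not the code used to produce the tables in this notebook.

```python
# Hypothetical toy sentence in CoNLL-U format (10 tab-separated columns), not PUD data.
SAMPLE = "\n".join([
    "# sent_id = toy-1",
    "# text = Voy al parque",
    "1\tVoy\tir\tVERB\t_\t_\t0\troot\t_\t_",
    "2-3\tal\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\ta\ta\tADP\t_\t_\t4\tcase\t_\t_",
    "3\tel\tel\tDET\t_\t_\t4\tdet\t_\t_",
    "4\tparque\tparque\tNOUN\t_\t_\t1\tobl\t_\t_",
    "",
])

def read_conllu_rows(text):
    """Return (form, upos, deprel) triples for the syntactic words in a CoNLL-U string."""
    rows = []
    for line in text.splitlines():
        # skip blank lines (sentence boundaries) and comment lines
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id = cols[0]
        # skip multiword-token ranges ("2-3") and empty nodes ("5.1")
        if "-" in token_id or "." in token_id:
            continue
        rows.append((cols[1], cols[3], cols[7]))
    return rows

rows = read_conllu_rows(SAMPLE)
for row in rows:
    print(row)
```

The resulting list of triples can be passed to `pandas.DataFrame` with the same three column names if a table is needed.
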

In [10]:
en_table

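| | token | POS tag | Dep. label |
|---|---|---|---|
| 0 | “ | PUNCT | punct |
| 1 | While | SCONJ | mark |
| 2 | much | ADJ | nsubj |
| 3 | of | ADP | case |
| 4 | the | DET | det |
| ... | ... | ... | ... |
| 21178 | a | DET | det |
| 21179 | friend | NOUN | xcomp |
| 21180 | of | ADP | case |
| 21181 | peace | NOUN | nmod |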


In [11]:
es_table

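| | token | POS tag | Dep. label |
|---|---|---|---|
| 0 | Aunque | ADP | mark |
| 1 | no | ADV | advmod |
| 2 | haya | VERB | advcl |
| 3 | precedentes | NOUN | obj |
| 4 | para | ADP | case |
| ... | ... | ... | ... |
| 23300 | amigo | NOUN | acl:relcl |
| 23301 | de | ADP | case |
| 23302 | la | DET | det |
| 23303 | paz | NOUN | nmod |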


We then get the top 10 tags and dependency labels and their number of occurrences.

In [0]:
from nltk import FreqDist

def top10(col):
  return FreqDist(col).most_common(10)

top10_en_tags = top10(en_table["POS tag"])
top10_es_tags = top10(es_table["POS tag"])
top10_en_deps = top10(en_table["Dep. label"])
top10_es_deps = top10(es_table["Dep. label"])
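
For readers without NLTK installed, the same kind of ranking can be sketched with the standard library: `FreqDist` builds on `collections.Counter`, whose `most_common` method returns (item, count) pairs in decreasing order of frequency. The helper name `top_n` and the toy tag list below are hypothetical and only illustrate the idea.

```python
from collections import Counter

def top_n(values, n=10):
    """Return the n most frequent items with their counts, most frequent first."""
    return Counter(values).most_common(n)

# Hypothetical toy column of POS tags, standing in for en_table["POS tag"].
toy_tags = ["NOUN", "ADP", "NOUN", "DET", "NOUN", "ADP", "VERB"]
print(top_n(toy_tags, 2))
```
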

In [0]:
def print_sidebyside(top10en, top10es):
  # print two ranked lists of (item, count) pairs in adjacent columns
  fmt = '{:<5}{:<20}{}'
  print(fmt.format('', 'English', 'Spanish'))
  for i, (p1, p2) in enumerate(zip(top10en, top10es)):
    print(fmt.format(i + 1, str(p1), str(p2)))

As shown below, `NOUN` and `ADP` are the first and second most frequent POS tags in both treebanks, but Spanish has far more adpositions than English. This is not surprising since, for instance, many verbs that are directly followed by their object in English require a preposition in Spanish (e.g. _dream_ vs _soñar con_).

Unlike in Spanish, the third most frequent tag in the English treebank is `PUNCT`, which suggests that English sentences and clauses are generally shorter than their Spanish counterparts.

Another significant difference between the two languages is the number of determiners. This may be because articles are omitted more often in English, for instance in generic statements like "Children should go to the park every day" (where we say something about children in general), which can be translated into Spanish as "_Los_ niños tendrían que ir al parque todos los días".

At a higher level, however, the 10 most common POS tags are the same in both treebanks, which shows that English and Spanish, though morphologically very different, do have some similarities.

In [14]:
print_sidebyside(top10_en_tags, top10_es_tags)

     English             Spanish
1    ('NOUN', 4040)      ('NOUN', 4721)
2    ('ADP', 2493)       ('ADP', 4167)
3    ('PUNCT', 2301)     ('DET', 3260)
4    ('VERB', 2156)      ('PUNCT', 2216)
5    ('DET', 2086)       ('VERB', 2115)
6    ('PROPN', 1727)     ('ADJ', 1434)
7    ('ADJ', 1540)       ('PROPN', 1220)
8    ('PRON', 1021)      ('PRON', 1035)
9    ('AUX', 1014)       ('ADV', 895)
10   ('ADV', 849)        ('AUX', 735)
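
Because the two tables have different numbers of rows, raw counts are not directly comparable. A minimal sketch of a normalised comparison follows, using counts copied from the output above. The totals are an assumption: they are taken to be the row counts of the two data frames shown earlier (last index plus one).

```python
# Counts copied from the output above (a subset of the top 10).
en_counts = {"NOUN": 4040, "ADP": 2493, "PUNCT": 2301, "DET": 2086}
es_counts = {"NOUN": 4721, "ADP": 4167, "PUNCT": 2216, "DET": 3260}

# Assumption: total rows = last displayed index + 1 for each data frame.
EN_TOTAL = 21182
ES_TOTAL = 23304

def shares(counts, total):
    """Map each tag to its percentage of all rows, rounded to one decimal."""
    return {tag: round(100 * n / total, 1) for tag, n in counts.items()}

en_shares = shares(en_counts, EN_TOTAL)
es_shares = shares(es_counts, ES_TOTAL)

print(f"{'tag':<6}{'English':>9}{'Spanish':>9}")
for tag in en_counts:
    print(f"{tag:<6}{en_shares[tag]:>8}%{es_shares[tag]:>8}%")
```
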


When it comes to dependency labels, we again see that the top 10 for English are, disregarding order, the same as the top 10 for Spanish, confirming the syntactic similarities between the two languages.

In many ways, for instance with respect to `det` and `punct`, these statistics "match" those for POS tags.

Apart from this, the two languages seem to differ significantly in the number of `nsubj`, `nmod` and `amod` relations. The fact that English has more `nsubj` relations seems to suggest that English sentences are divided into more clauses, each with its own syntactic subject. As for modifiers, it seems that some Spanish nominal modifiers are expressed as adjectival modifiers in English and/or vice versa.

In [15]:
print_sidebyside(top10_en_deps, top10_es_deps)

     English             Spanish
1    ('case', 2499)      ('case', 3648)
2    ('punct', 2301)     ('det', 3425)
3    ('det', 2047)       ('punct', 2216)
4    ('nsubj', 1393)     ('nmod', 1766)
5    ('amod', 1336)      ('obl', 1521)
6    ('obl', 1237)       ('amod', 1276)
7    ('nmod', 1076)      ('nsubj', 1168)
8    ('root', 1000)      ('root', 996)
9    ('obj', 876)        ('advmod', 829)
10   ('advmod', 852)     ('obj', 773)
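
The comparison of the two rankings can also be made explicit in code. The sketch below copies the counts from the output above, checks that both top-10 lists contain the same labels and sorts the labels by the size of the gap between the Spanish and English counts. The variable names are hypothetical.

```python
# Counts copied from the output above.
en_deps = {"case": 2499, "punct": 2301, "det": 2047, "nsubj": 1393, "amod": 1336,
           "obl": 1237, "nmod": 1076, "root": 1000, "obj": 876, "advmod": 852}
es_deps = {"case": 3648, "det": 3425, "punct": 2216, "nmod": 1766, "obl": 1521,
           "amod": 1276, "nsubj": 1168, "root": 996, "advmod": 829, "obj": 773}

# True if both top-10 lists contain exactly the same labels, whatever the order.
same_labels = set(en_deps) == set(es_deps)
print("same top-10 labels:", same_labels)

# Spanish count minus English count, largest absolute gap first.
gaps = sorted(((label, es_deps[label] - en_deps[label]) for label in en_deps),
              key=lambda pair: abs(pair[1]), reverse=True)
for label, gap in gaps:
    print(f"{label:<8}{gap:+}")
```
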


## Question 5

![Question 5, part a](https://raw.githubusercontent.com/harisont/comp-syntax-2020/master/lab1/q5a.JPG)

![Question 5, part b](https://raw.githubusercontent.com/harisont/comp-syntax-2020/master/lab1/q5b.JPG)

## Question 6

![Question 6](https://raw.githubusercontent.com/harisont/comp-syntax-2020/master/lab1/q6.JPG)

In terms of word order, this sentence shows that, while adjectives generally precede the nouns they modify in English, they generally follow them in Spanish.

Furthermore, the sentence contains an example of a phenomenon described above: in English, "games" has no determiner, while in Spanish it does.

The other apparent differences are mostly due to the fact that the sentence is not translated literally.
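
The observation about adjective order could in principle be checked on a whole treebank by comparing the position of each `amod` dependent with that of its head. This requires the ID and HEAD columns of the CoNLL-U file (columns 0 and 6), which are not loaded in the tables above. The sketch below only illustrates the idea on two hypothetical toy sentences, with a hypothetical helper `amod_order`; it is not run on PUD.

```python
# Hypothetical toy sentences as (id, form, head, deprel) tuples; not PUD data.
english = [(1, "the", 3, "det"), (2, "red", 3, "amod"), (3, "house", 0, "root")]
spanish = [(1, "la", 2, "det"), (2, "casa", 0, "root"), (3, "roja", 2, "amod")]

def amod_order(sentence):
    """Count adjectival modifiers placed before and after their head."""
    before = sum(1 for tok_id, _, head, rel in sentence
                 if rel == "amod" and tok_id < head)
    after = sum(1 for tok_id, _, head, rel in sentence
                if rel == "amod" and tok_id > head)
    return before, after

print("English (before, after):", amod_order(english))
print("Spanish (before, after):", amod_order(spanish))
```
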