## Natural Language Processing - Summer Term 2024
### Hochschule Karlsruhe
### Lecturer: Prof. Dr. Jannik Strötgen
### Tutor: Paul Löhr

Group:
- Daniel Schneider
- Leonie Bäder
- Maximilian Hoffmann

# Exercise 03

You will learn about:
    
- The Brown Corpus
- Part of Speech (POS) tagging
- Unigram and Bigram taggers

---

## Task 1 - The Brown Corpus (8 P):

---

### Part 1

In the following, we will use the _Brown Corpus_. In one or two sentences, describe what the _Brown Corpus_ is and how it can be used for POS tagging.

### Answer:
The Brown Corpus is a POS-tagged collection of English texts, totaling over one million words and sampled from a wide range of sources and genres. Because it provides word sequences labeled with their corresponding POS tags, it can be used to train and evaluate automatic POS taggers.

### Part 2

We start by analyzing which tags occur in the Brown Corpus. For this, you should extract the `tagged_words` first. Then

1. List the first 20 entries and
2. then list the ten most common tags in the category `news`.

In the lecture, we use the Brown Corpus POS tags (default, i.e., `tagset=None`).

In [15]:
import nltk
nltk.download('brown')
from nltk.corpus import brown

### Answer:

In [4]:
# Get tagged words from the Brown Corpus
tagged_words = brown.tagged_words(tagset=None)

# List the first 20 entries
print("First 20 entries of tagged words:")
print(tagged_words[:20])
print('\n')
# Get tagged words specifically from the 'news' category
tagged_words_news = brown.tagged_words(categories='news', tagset=None)

# Calculate frequency distribution of tags in the 'news' category
tag_freq_dist = nltk.FreqDist(tag for (word, tag) in tagged_words_news)
print("10 most common tags in the category 'news':")
print(tag_freq_dist.most_common(10))

First 20 entries of tagged words:
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS')]


10 most common tags in the category 'news':
[('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866), (',', 5133), ('NNS', 5066), ('.', 4452), ('JJ', 4392), ('CC', 2664), ('VBD', 2524)]
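`nltk.FreqDist` is a subclass of `collections.Counter`, so the counting step can be illustrated without NLTK. The following is only a small sketch on the 20 entries printed above (not on the whole `news` category); the variable names are illustrative.

```python
from collections import Counter

# First 20 (word, tag) pairs, copied from the output above.
first_20 = [
    ('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'),
    ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'),
    ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'),
    ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'),
    ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'),
]

# Count how often each tag occurs, ignoring the words.
tag_counts = Counter(tag for _, tag in first_20)

# The two most frequent tags among these 20 entries.
top_two = tag_counts.most_common(2)
print(top_two)
```

Even on this tiny sample, `NN` and `AT` come out on top, which matches their high rank in the full `news` distribution.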


### Part 3

In the previous part, you should get ten different POS tags. For each tag, what does it stand for?

### Answer:

```
NN  --> singular or mass noun (e.g. car)
IN  --> preposition (e.g. at, on)
AT  --> article (e.g. a, the)
NP  --> proper noun or part of name phrase (e.g. London)
,   --> comma 
NNS --> plural noun (e.g. cars)
.   --> sentence closer (e.g. ., !, ?)
JJ  --> adjective (e.g. funny)
CC  --> coordinating conjunction (e.g. and, or)
VBD --> verb, past tense (e.g. took)
```

---

## Task 2 - POS Tagging (12 P)

### Part 1

Use a Unigram tagger, trained on the Brown corpus, to tag the example sentence from the Penn treebank (see also https://www.nltk.org/_modules/nltk/corpus/reader/tagged.html)

For which words does it completely fail?

In [14]:
import nltk
nltk.download('treebank')
from nltk.corpus import brown, treebank
from nltk.tag import UnigramTagger

In [6]:
treebank_test = list(treebank.words()[0:20])
print(treebank_test)

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.', 'Mr.', 'Vinken']


### Answer:

In [7]:
# Train a Unigram tagger on the Brown Corpus
unigram_tagger = UnigramTagger(brown.tagged_sents())

# Tag the example sentence using the Unigram tagger
for word, tag in unigram_tagger.tag(treebank_test):
    print(word, '->', tag)

Pierre -> NP
Vinken -> None
, -> ,
61 -> CD
years -> NNS
old -> JJ
, -> ,
will -> MD
join -> VB
the -> AT
board -> NN
as -> CS
a -> AT
nonexecutive -> None
director -> NN
Nov. -> NP
29 -> CD
. -> .
Mr. -> NP
Vinken -> None


### Answer:

The Unigram tagger completely fails for words it has not seen during training (i.e. words that do not occur in the Brown Corpus). In our case, these are the proper noun 'Vinken' and the adjective 'nonexecutive', which both receive the tag `None`.
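The failure on unseen words can be pictured as a plain dictionary lookup. The sketch below is a simplified stand-in for a unigram tagger, not NLTK's implementation: `lexicon` is copied from the tagger output above, `tag_words` is a hypothetical helper, and the fallback tag `'NN'` is an assumption (the same default used with `DefaultTagger('NN')` later in this exercise).

```python
# Toy lexicon: word -> tag, copied from the Unigram tagger output above.
lexicon = {
    'Pierre': 'NP', ',': ',', '61': 'CD', 'years': 'NNS', 'old': 'JJ',
    'will': 'MD', 'join': 'VB', 'the': 'AT', 'board': 'NN', 'as': 'CS',
    'a': 'AT', 'director': 'NN', 'Nov.': 'NP', '29': 'CD', '.': '.',
    'Mr.': 'NP',
}

def tag_words(words, lexicon, default=None):
    """Look up each word; unseen words receive `default` (None = no fallback)."""
    return [(word, lexicon.get(word, default)) for word in words]

sentence = ['Mr.', 'Vinken', 'will', 'join', 'the', 'board']

# Without a fallback, the unseen word 'Vinken' is tagged None.
no_fallback = tag_words(sentence, lexicon)

# With a default tag (assumed here: 'NN'), every word receives some tag.
with_fallback = tag_words(sentence, lexicon, default='NN')

print(no_fallback)
print(with_fallback)
```

A default tag does not make the guess for 'Vinken' correct (the reference tag is a proper noun), but it avoids leaving words untagged.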

### Part 2

Compare the tags with the reference tags from the Penn treebank.

### Answer

In [8]:
treebank.tagged_words()[:20]

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.'),
 ('Mr.', 'NNP'),
 ('Vinken', 'NNP')]

```
Equivalent tags across tagsets (Brown -> Penn): NP -> NNP, AT -> DT; Brown CS (subordinating conjunction) falls under Penn IN

Word         | Treebank | Unigram Tagger | Match
-------------------------------------------------
Pierre       | NNP      | NP             | 1
Vinken       | NNP      | None           | 0
,            | ,        | ,              | 1
61           | CD       | CD             | 1
years        | NNS      | NNS            | 1
old          | JJ       | JJ             | 1
,            | ,        | ,              | 1
will         | MD       | MD             | 1
join         | VB       | VB             | 1
the          | DT       | AT             | 1
board        | NN       | NN             | 1
as           | IN       | CS             | 1
a            | DT       | AT             | 1
nonexecutive | JJ       | None           | 0
director     | NN       | NN             | 1
Nov.         | NNP      | NP             | 1
29           | CD       | CD             | 1
.            | .        | .              | 1
Mr.          | NNP      | NP             | 1
Vinken       | NNP      | None           | 0
-------------------------------------------------
                                         | 17/20
```
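The count in the table can be reproduced programmatically. The following is a minimal sketch that copies both tag sequences from the outputs above; `BROWN_TO_PENN` and `normalize` are illustrative names, and the mapping covers only the three equivalences needed for this sentence, not a general Brown-to-Penn conversion.

```python
# Tag equivalences used in the comparison above (only valid for this example).
BROWN_TO_PENN = {'NP': 'NNP', 'AT': 'DT', 'CS': 'IN'}

words = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join',
         'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29',
         '.', 'Mr.', 'Vinken']

# Output of the Brown-trained Unigram tagger (None = unseen word).
unigram_tags = ['NP', None, ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB', 'AT',
                'NN', 'CS', 'AT', None, 'NN', 'NP', 'CD', '.', 'NP', None]

# Reference tags from the Penn Treebank.
treebank_tags = ['NNP', 'NNP', ',', 'CD', 'NNS', 'JJ', ',', 'MD', 'VB', 'DT',
                 'NN', 'IN', 'DT', 'JJ', 'NN', 'NNP', 'CD', '.', 'NNP', 'NNP']

def normalize(tag):
    """Map a Brown tag to its Penn counterpart; leave other tags unchanged."""
    return BROWN_TO_PENN.get(tag, tag)

matches = sum(normalize(u) == t for u, t in zip(unigram_tags, treebank_tags))
mismatched = [w for w, u, t in zip(words, unigram_tags, treebank_tags)
              if normalize(u) != t]

print(f"{matches}/{len(words)} tags match")
print("Mismatched words:", mismatched)
```

The only mismatches are the words that received `None`, which agrees with the table.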

### Part 3

Now train 
 1. a Unigram tagger,
 2. a Bigram tagger,
 3. and a Brill tagger (using rule brill24)
 
with a subset of the Brown Corpus. This might take 1-2 minutes.

Then, validate and compare their performance on a different subset of the Brown corpus.

In [9]:
from nltk.tag import UnigramTagger, BigramTagger, DefaultTagger
from nltk.tag.brill_trainer import BrillTaggerTrainer
from nltk.tag.brill import brill24

In [10]:
n_cutoff = 20000
brown_sents_train = brown.tagged_sents()[:n_cutoff] # training corpus
brown_sents_test = brown.tagged_sents()[n_cutoff:] # reference corpus

### Answer:

In [83]:
# Train Unigram tagger
unigram_tagger = UnigramTagger(brown_sents_train)

# Train Bigram tagger (no backoff; the backoff variant is trained in a later cell)
bigram_tagger = BigramTagger(brown_sents_train)

# Train Brill tagger with DefaultTagger
brill_tagger_default = BrillTaggerTrainer(DefaultTagger('NN'), templates=brill24()).train(brown_sents_train)

# Train Brill tagger with UnigramTagger
brill_tagger_unigram = BrillTaggerTrainer(unigram_tagger, templates=brill24()).train(brown_sents_train)

# Train Brill tagger with BigramTagger
brill_tagger_bigram = BrillTaggerTrainer(bigram_tagger, templates=brill24()).train(brown_sents_train)

In [84]:
# Evaluate Unigram tagger
acc_unigram_tagger = unigram_tagger.accuracy(brown_sents_test)
print("Accuracy Unigram Tagger:", acc_unigram_tagger)

# Evaluate Bigram tagger
acc_bigram_tagger = bigram_tagger.accuracy(brown_sents_test)
print("Accuracy Bigram Tagger:", acc_bigram_tagger)

# Evaluate Brill tagger with DefaultTagger
acc_brill_tagger_default = brill_tagger_default.accuracy(brown_sents_test)
print("Accuracy Brill Tagger Default:", acc_brill_tagger_default)

# Evaluate Brill tagger with UnigramTagger
acc_brill_tagger_unigram = brill_tagger_unigram.accuracy(brown_sents_test)
print("Accuracy Brill Tagger Unigram:", acc_brill_tagger_unigram)

# Evaluate Brill tagger with BigramTagger
acc_brill_tagger_bigram = brill_tagger_bigram.accuracy(brown_sents_test)
print("Accuracy Brill Tagger Bigram:", acc_brill_tagger_bigram)

Accuracy Unigram Tagger: 0.8615109858152631
Accuracy Bigram Tagger: 0.2070602081522892
Accuracy Brill Tagger Default: 0.6558425651587828
Accuracy Brill Tagger Unigram: 0.8909027459406198
Accuracy Brill Tagger Bigram: 0.727414929213168


In [13]:
# Train Bigram tagger with Unigram tagger as backoff
bigram_tagger_backoff = BigramTagger(brown_sents_train, backoff=unigram_tagger)

acc_bigram_tagger_backoff = bigram_tagger_backoff.accuracy(brown_sents_test)
print("Accuracy Bigram Tagger Backoff:", acc_bigram_tagger_backoff)

Accuracy Bigram Tagger Backoff: 0.9403488364136496


### Part 4

Discuss the scores of your taggers. Which one performs better, and why?

### Answer:

In the first comparison, the Brill tagger initialized with the Unigram tagger scores highest (about 0.891, versus 0.862 for the plain Unigram tagger). This can be attributed to its methodology, which combines statistical tagging with rule-based correction. First, the underlying Unigram tagger creates basic tag assignments based on word frequencies. The Brill tagger then refines these using contextual transformation rules learned from the training data, which capture and correct common tagging errors or ambiguities. Overall, however, the Bigram tagger with Unigram backoff performs best (about 0.940): it uses the preceding tag as context wherever that context was seen in training, and falls back to the word's most frequent tag otherwise.

On the other hand, the Unigram tagger (about 0.862) clearly outperforms the Bigram tagger without backoff (about 0.207). The main reason is data sparsity: the Bigram tagger only assigns a tag if it has seen the exact combination of preceding tag and current word during training, and returns `None` otherwise. Once a word is tagged `None`, the context of the following word is unseen as well, so the failure tends to propagate through the rest of the sentence. The Unigram tagger only needs to have seen the word itself, which is far more likely. This probably also explains the comparatively low score of the Brill tagger built on the plain Bigram tagger (about 0.727), whose rules have to repair a very weak initial tagging.
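The sparsity problem of a bigram tagger without backoff can be reproduced on a toy scale. The sketch below is a simplified pure-Python illustration, not NLTK's implementation: the training sentences, the `'<s>'` start marker and all function names are made up for this example.

```python
from collections import Counter, defaultdict

def train_unigram(sents):
    """Map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def train_bigram(sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in sents:
        prev = '<s>'  # sentence-start marker (assumption of this sketch)
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag(words, bigram, unigram=None):
    """Tag with the bigram table; optionally back off to the unigram table."""
    tags, prev = [], '<s>'
    for word in words:
        t = bigram.get((prev, word))
        if t is None and unigram is not None:
            t = unigram.get(word)
        tags.append(t)
        prev = t  # a None tag makes the next context unseen as well
    return tags

train = [
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VB')],
    [('a', 'DT'), ('cat', 'NN'), ('sleeps', 'VB')],
]
unigram = train_unigram(train)
bigram = train_bigram(train)

test = ['the', 'bird', 'barks']  # 'bird' is unseen
without_backoff = tag(test, bigram)
with_backoff = tag(test, bigram, unigram)
print(without_backoff)
print(with_backoff)
```

Without backoff, the unseen word 'bird' also costs the known word 'barks' its tag, because the context `(None, 'barks')` never occurs in training. With backoff, only 'bird' itself stays untagged.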

### Part 5

Discuss ideas for improving the implementations and the quality of the taggers. You are not required to implement the improvement ideas.

### Answer:

- Improve by using a larger and more diverse (POS-tagged) training corpus (at least 250,000 words)
- Explore N-gram models with higher order (e.g. Trigram ...)
- Optimize hyperparameters for each tagger model
- Choose more suitable templates

---

## Task 3 - Unigram and Bigram Taggers (pen and paper) (10 P):

**Training data:**

His [PRP] raise [NN] was [VB] five [CD] dollars [NN] . [SYM]
We [PRP] usually [RB] get [VB] a [DT] raise [NN] at [IN] the [DT] start [NN] of [IN] the [DT] year [NN] . [SYM]
A [DT] major [JJ] success [NN] helped [VB] to [TO] raise [VB] our [PRP] spirits [NN] . [SYM]



**Test sentence:**

It [PRP] looks [VB] like [CC] a [DT] fine [JJ] place [NN] to [TO] raise [NN or VB?] children [NN] . [SYM]


### Part 1: Unigram Tagger

Given the training data, determine the most likely tag for the word "raise" in the test sentence, using Unigram tagging method:


### Answer:

**Naive solution**
1. Create a frequency distribution for the different tags in the training data given the word 'raise'
2. $p(tag=NN|word=raise) = \frac{2}{3}, p(tag=VB|word=raise) = \frac{1}{3}$
3. Most likely tag for the word 'raise' is the tag with the greatest frequency in the training data --> tag = NN --> wrong for the test sentence

**Consider frequencies (Bayes' rule), so apply the adapted formula:**
$$ P(A|B) = \frac{P(B|A)P(A)}{P(B)} $$

1. $P(B) = p(word=raise) = \frac{3}{27}$ is the same for both tags and can be ignored
2. Create frequency distributions to apply Bayes' rule
3. $p(tag=NN|word=raise) \propto p(word=raise|tag=NN) \cdot p(tag=NN) = \frac{2}{7} \cdot \frac{7}{27} = \frac{2}{27} \approx 0.074$
$p(tag=VB|word=raise) \propto p(word=raise|tag=VB) \cdot p(tag=VB) = \frac{1}{4} \cdot \frac{4}{27} = \frac{1}{27} \approx 0.037$
4. Most likely tag for the word 'raise' is the tag with the highest score according to Bayes' rule --> tag = NN --> wrong for the test sentence. Dividing both scores by $P(B) = \frac{3}{27}$ gives $\frac{2}{3}$ and $\frac{1}{3}$ again, so both approaches agree.
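The counts behind both calculations can be checked with a few lines of Python. This is a minimal sketch: the `word/TAG` encoding of the training data and the variable names are choices made for this example only.

```python
from collections import Counter
from fractions import Fraction

# Training data from Task 3, encoded as word/TAG tokens.
training = """\
His/PRP raise/NN was/VB five/CD dollars/NN ./SYM
We/PRP usually/RB get/VB a/DT raise/NN at/IN the/DT start/NN of/IN the/DT year/NN ./SYM
A/DT major/JJ success/NN helped/VB to/TO raise/VB our/PRP spirits/NN ./SYM"""

pairs = [tuple(tok.rsplit('/', 1))
         for line in training.splitlines() for tok in line.split()]

n = len(pairs)                                  # total number of tokens
word_tag = Counter(pairs)                       # count(word, tag)
tag_counts = Counter(tag for _, tag in pairs)   # count(tag)
raise_total = sum(c for (w, _), c in word_tag.items() if w == 'raise')

candidates = ('NN', 'VB')

# Naive solution: relative tag frequency given the word 'raise'.
naive = {t: Fraction(word_tag[('raise', t)], raise_total) for t in candidates}

# Bayes' rule without the constant denominator: p(word|tag) * p(tag).
bayes = {t: Fraction(word_tag[('raise', t)], tag_counts[t])
            * Fraction(tag_counts[t], n)
         for t in candidates}

best = max(bayes, key=bayes.get)
print(naive, bayes, best)
```

Using `Fraction` keeps the results exact, so they can be compared directly with the hand calculation.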

### Part 2 - Bigram Tagger:

Given the training data (in Task 3), determine the most likely tag for the word "raise" in the test sentence, using Bigram tagging method:

It [PRP] looks [VB] like [CC] a [DT] fine [JJ] place [NN] to [TO] raise [NN or VB?] children [NN] . [SYM]


1. Create frequency distributions based on the preceding tags of the word 'raise'
2. $p(tag=NN|word=raise, prev\_tag=TO) \propto p(tag=NN|prev\_tag=TO) \cdot p(word=raise|tag=NN) = \frac{0}{1} \cdot \frac{2}{7} = 0$
$p(tag=VB|word=raise, prev\_tag=TO) \propto p(tag=VB|prev\_tag=TO) \cdot p(word=raise|tag=VB) = \frac{1}{1} \cdot \frac{1}{4} = 0.25$
3. Most likely tag for the word 'raise' is the tag with the highest score according to the Bigram tagging method --> tag = VB --> correct for the test sentence
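The bigram calculation can be verified in the same way. The sketch below recomputes the transition and emission counts from the training data; the `word/TAG` encoding and all variable names are illustrative, and transitions are counted within sentences only.

```python
from collections import Counter
from fractions import Fraction

# Training data from Task 3, encoded as word/TAG tokens.
training = """\
His/PRP raise/NN was/VB five/CD dollars/NN ./SYM
We/PRP usually/RB get/VB a/DT raise/NN at/IN the/DT start/NN of/IN the/DT year/NN ./SYM
A/DT major/JJ success/NN helped/VB to/TO raise/VB our/PRP spirits/NN ./SYM"""

sents = [[tuple(tok.rsplit('/', 1)) for tok in line.split()]
         for line in training.splitlines()]

tag_counts = Counter(tag for sent in sents for _, tag in sent)
emission = Counter(pair for sent in sents for pair in sent)  # count(word, tag)

# count(previous tag, tag), collected within each sentence.
transition = Counter((sent[i - 1][1], sent[i][1])
                     for sent in sents for i in range(1, len(sent)))

# How often each tag occurs as a predecessor.
prev_counts = Counter()
for (prev, _), count in transition.items():
    prev_counts[prev] += count

prev_tag, word = 'TO', 'raise'
scores = {
    t: Fraction(transition[(prev_tag, t)], prev_counts[prev_tag])
       * Fraction(emission[(word, t)], tag_counts[t])
    for t in ('NN', 'VB')
}
best = max(scores, key=scores.get)
print(scores, best)
```

Because 'to' is followed by a verb in its only training occurrence, the transition term alone rules out `NN` here. With more training data, this term would be a proper probability rather than 0 or 1.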

---

#### Submitting your results:

To submit your results, please:

- save this file, i.e., `ex??_assignment.ipynb`.
- if you reference any external files (e.g., images), please create a zip or rar archive and put the notebook files and all referenced files in there.

**Remarks:**
    
- Do not copy any code from the Internet. In case you want to use publicly available code, please, add the reference to the respective code snippet.
- Check your code compiles and executes, even after you have restarted the Kernel.
- Submit your written solutions and the coding exercises within the provided spaces and not otherwise.
- Write the names of your partner and your name in the top section.