# Homework 2

The maximum score of this homework is 100+10 points. Grading is listed in this table:

| Grade | Score range |
| --- | --- |
| 5 | 85+ |
| 4 | 70-84 |
| 3 | 55-69 |
| 2 | 40-54 |
| 1 | 0-39 |

Most exercises include tests which should pass if your solution is correct.
However, successful tests do not guarantee that your solution is correct.
The homework is partially autograded using many hidden tests.
Test cells cannot be modified and empty cells cannot be deleted.

Your solution should replace placeholder lines such as:

    ### YOUR CODE HERE
    raise NotImplementedError()

You don't need to write separate functions except when the function header is predefined.
Variable names must be derived from the public test.
Please do not add new cells; they will be ignored by the autograder.

**VERY IMPORTANT** Before submitting your solution (pushing to the git repo),
run your notebook with `Kernel -> Restart & Run All` and make sure it
runs without exceptions.

If your code fails the public tests (the ones you see), you will automatically receive 0 points for that exercise.

## Submission

GitHub Classroom will accept your last pushed version before the deadline.
You do not need to send the homework to the instructor.

## Plagiarism

When preparing their homework, students are reminded to pay special attention to Title 32, Sections 92-93 of the Code of Studies (quoted below). Any content from external sources must be stated in the student's own words AND accompanied by citations. Copying and pasting from an external source should be avoided, and any text copied must be placed between quotation marks. Reports that violate these rules cannot receive a passing grade.

"**Section 92**

(1) The works of another person will be used as follows: a) if a work of another person is used in whole or in part (e.g. by copying, citation, translation from another language or presentation), the source and the name of the author will be indicated if this name is included in the source or – in case of orally presented works – may be clearly identified; b) the work of another person or any part of that will be used – up to a quantity reasonably corresponding to the nature and purpose of the student work – identified as quotations.

(2) Instructors are entitled to review compliance with requirements in this article with computer programmes and databases.

(3) The use of works of another person and the acknowledgement of use will be governed by applicable laws and the relevant rules of the specific discipline.

**Section 93**

(1) If a student fails to meet rules regarding use of works of another person in whole or in part, the student work will be considered as not assessable and the student will not be allowed to obtain the credit of the concerned subject in the specific term.

(2) It will be deemed a disciplinary offence if a student – in breach of the rules regarding use of works of another person – submits or presents a work of another person fully or in a significant part verbatim (word for word) or in terms of its basic concepts or the combined version of several works of another person(s) as their own work.

(3) Based on subsection (1) of Section 52/A. of the Higher Education Act, compliance with the rules regarding the use of works of another person in a master thesis may be reviewed up to five years following the issue of the degree certificate. In case of violation of the above rules, section 52/A of the Higher Education Act will apply."

(BME Code of Studies, p.50)

In [None]:
import numpy as np
from collections import defaultdict
import os
import types

# Data setup

You will work with a sample of approximately 1 million words from the UMBC WebBase corpus.
There is also a smaller toy sample of 100 sentences used for the public tests and quick debugging.
Both files are prepackaged as a zip file in your starter repository.

If you are running the code from somewhere else and the data is not there, it will be downloaded and unpacked.
You can set the data directory by changing the value of the `HW2_DATA_DIR` environment variable.

If you just cloned your starter repository in the recommended way, you should see the data extracted on the first run and no output on later runs.

In [None]:
# setting up data directory (used by the instructor)

data_dir = os.getenv("HW2_DATA_DIR")
if data_dir is None:
    data_dir = ""
    
data_zip_path = os.path.join(data_dir, "data.zip")

if not os.path.exists(data_zip_path):
    print("Download data")
    import urllib.request
    # URLopener is deprecated; urlretrieve downloads the URL to a local file
    urllib.request.urlretrieve("http://avalon.aut.bme.hu/~judit/resources/umbc/data.zip", data_zip_path)
    print("Data downloaded")

unzip_path = os.path.join(data_dir, "umbc_sample_toy.txt")

if not os.path.exists(os.path.join(data_dir, "umbc_sample_toy.txt")) or \
        not os.path.exists(os.path.join(data_dir, "umbc_sample_1M.txt")):
    print("Extracting data")
    from zipfile import ZipFile
    with ZipFile(data_zip_path) as myzip:
        myzip.extractall(data_dir)
    print("Data extraction done")

# 1. Write a generator function that reads a text corpus. (10 points)

Each sentence occupies two lines. The first line is the sentence itself. Tokens (words) are separated by spaces.
The second line contains the POS (part-of-speech) tag of each token, also separated by spaces.
Then comes an empty line before the next sentence.
The end of the file may or may not contain an empty line.

```
This is a sentence .
DT VBZ DT NN .

This is the next sentence .
DT VBZ DT JJ NN .
```

Your task is to create a generator function with one parameter, the file's name.
The generator should yield one sentence at a time.
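
As a reminder of what a generator function is, here is a minimal sketch unrelated to the corpus. The name `count_up_to` is made up for illustration. Any function that contains `yield` returns a generator object when called, and its values are produced lazily, one at a time.

```python
import types


def count_up_to(n):
    """Hypothetical generator: yield the integers 0, 1, ..., n-1 one at a time."""
    for i in range(n):
        yield i


gen = count_up_to(3)
# Calling the function does not run its body; it only creates a generator object.
print(isinstance(gen, types.GeneratorType))  # True
print(list(gen))  # [0, 1, 2]
```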

A sentence is a list of token-POS tag pairs (tuples), so the first sentence in the example would look like this:

```
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]
```

and the second sentence:

```
[('This', 'DT'), ('is', 'VBZ'), ('the', 'DT'), ('next', 'JJ'), ('sentence', 'NN'), ('.', '.')]

```
Some sentences may be malformed, meaning that the number of tokens and the number of POS tags differ. **Skip** these sentences.
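
The pairing and the length check can be illustrated on a single two-line record. This is only a sketch of that one step, not of reading the file. The helper name `pair_tokens_and_tags` is hypothetical and not part of the assignment.

```python
def pair_tokens_and_tags(token_line, tag_line):
    """Hypothetical helper: pair tokens with POS tags, or return None if the lengths differ."""
    tokens = token_line.split()
    tags = tag_line.split()
    if len(tokens) != len(tags):
        return None  # malformed record, the caller should skip it
    return list(zip(tokens, tags))


print(pair_tokens_and_tags("This is a sentence .", "DT VBZ DT NN ."))
print(pair_tokens_and_tags("This is broken", "DT VBZ"))  # None
```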

In [None]:
def read_corpus(fn):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
toy_data_fn = os.path.join(data_dir, "umbc_sample_toy.txt")

toy_sentences = read_corpus(toy_data_fn)

assert isinstance(toy_sentences, types.GeneratorType)

toy_sentences = list(toy_sentences)

# the data contains invalid sentences that you should skip
assert len(toy_sentences) < 100
assert toy_sentences[0] == [('Then', 'RB'),
 ('Jon', 'NNP'),
 ('falls', 'VBZ'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('pretty', 'JJ'),
 ('new', 'JJ'),
 ('vet', 'NN'),
 ('played', 'VBN'),
 ('by', 'IN'),
 ('Jennifer', 'NNP'),
 ('Love', 'NNP'),
 ('Hewitt', 'NNP'),
 ('.', '.')]

# 2. Basic statistics (15 points)

Compute basic statistics on the corpus. Try to iterate over the file only once and compute all of these statistics as you go.

The statistics you are supposed to compute and store in the appropriate variables (see the tests below):

1. number of sentences,
2. number of tokens (words),
3. number of types (unique words),
4. length of the longest sentence,
5. length of the shortest sentence,
6. word frequencies,
7. POS frequencies.

You are **NOT** allowed to use `collections.Counter` but you are free to use `collections.defaultdict`.

Use the variable `umbc_fn` as a parameter of `read_corpus`.
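
Counting without `collections.Counter` could look like the following sketch, shown on a made-up token list rather than the corpus. The variable names are illustrative only.

```python
from collections import defaultdict

sample_tokens = ["the", "cat", "saw", "the", "dog"]  # made-up data

freq = defaultdict(int)  # missing keys start at 0
for token in sample_tokens:
    freq[token] += 1

print(dict(freq))
print(len(freq))  # number of distinct tokens: 4
```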

In [None]:
umbc_fn = os.path.join(data_dir, "umbc_sample_1M.txt")

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print("Number of sentences: {}".format(sentence_no))
print("Number of tokens (words): {}".format(token_no))
print("Number of types (unique words): {}".format(type_no))
print("Length of the longest sentence: {}".format(sentence_maxlen))
print("Length of the shortest sentence: {}".format(sentence_minlen))

In [None]:
assert type(sentence_no) == int
assert type(token_no) == int
assert type(type_no) == int
assert type(sentence_maxlen) == int
assert type(sentence_minlen) == int
assert sentence_no == 41989

In [None]:
assert type(word_freq) == defaultdict or type(word_freq) == dict
assert type(pos_freq) == defaultdict or type(pos_freq) == dict

## 2.2 What are the 10 most frequent words and the 5 most frequent POS tags?

The lists should be in decreasing order.
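
One possible way to get the keys of a frequency dictionary in decreasing order of frequency is to sort them by their counts. The sketch below uses made-up counts and illustrative variable names.

```python
sample_freq = {"the": 5, "cat": 2, "dog": 3, ".": 4}  # made-up counts

# Sort keys by their count, largest first, then keep the first k.
k = 2
top_k = sorted(sample_freq, key=sample_freq.get, reverse=True)[:k]
print(top_k)  # ['the', '.']
```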

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print(most_frequent_words)
print(most_frequent_pos)
assert type(most_frequent_words) == list
assert type(most_frequent_pos) == list
assert len(most_frequent_words) == 10
assert type(most_frequent_words[0]) == str
assert most_frequent_pos[0] == 'NN'

# 3. Hapax legomenon (10 points)

A hapax legomenon (or hapax) is a word that appears only once in the dataset. Since word frequencies roughly follow Zipf's law, there will always be a large number of hapaxes, regardless of the size of the corpus.

Aside from theoretical reasons, there are technical reasons for filtering these words, such as limited memory.

Compute the number of hapax words in the corpus using your previous statistics.
Do not read the file again.
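
To make the definition concrete, here is a small sketch on a made-up frequency dictionary. The names are illustrative and are not the variables expected by the tests.

```python
sample_freq = {"the": 5, "cat": 1, "dog": 3, "platypus": 1}  # made-up counts

# A hapax is a word whose count is exactly 1.
sample_hapaxes = [word for word, count in sample_freq.items() if count == 1]
print(sample_hapaxes)  # ['cat', 'platypus']
print(len(sample_hapaxes))  # 2
```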

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print(hapax_no)
assert type(hapax_no) == int
assert 0 < hapax_no < len(word_freq)

## The `_UNK_` symbol

Rare words including hapax words are most often replaced by a common symbol such as `_UNK_` (unknown).
During inference time (when running your model on unseen data), we may encounter words either never seen in the training data or deemed too rare and replaced with `_UNK_`.
The best way to handle these words is to replace them with `_UNK_` since our model is prepared to handle these symbols.

Follow this strategy in your `viterbi` function.

Tip: you don't actually have to replace these words, you can check if they're in the `word_idx` dictionary. `dict.get` might be useful.
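
A minimal sketch of the `dict.get` idea, using a made-up mapping: the second argument of `dict.get` is returned when the key is missing, so unknown words fall back to the id of `_UNK_`.

```python
toy_word_idx = {"the": 0, "cat": 1, "_UNK_": 2}  # made-up mapping

unk_id = toy_word_idx["_UNK_"]
sentence = ["the", "platypus", "cat"]

# "platypus" is not in the mapping, so it receives the id of _UNK_.
ids = [toy_word_idx.get(word, unk_id) for word in sentence]
print(ids)  # [0, 2, 1]
```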

## 3.1 Set of frequent words.

Create a set of words that appear at least `min_freq` times in the data.
Do not read the file again.

In [None]:
min_freq = 2

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print(len(top_words))
assert type(top_words) == set
assert 0 < len(top_words) < len(word_freq)
assert "dog" in top_words
assert "titillating" not in top_words

# 4. Mapping and counters (25 points)

## 4.1 Word and POS mapping

Create a word mapping and a POS mapping that map each word and each POS tag to a unique integer id. The word ids should range from 0 to `len(top_words)-1`, while the POS ids should range from 0 to `len(pos_idx)-1`.
You only need to map the non-rare words.

Include the unknown symbol (`_UNK_`) in the word mapping. What should its id be?

Add a `START` symbol to the POS mapping. It will be used in the Viterbi algorithm.
Do not read the file again.
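
A mapping to consecutive integer ids can be built with `enumerate`, as in the sketch below. The toy set and the choice to sort the words first (to make the ids reproducible) are assumptions made for illustration and are not required by the assignment.

```python
toy_top_words = {"cat", "the", "dog"}  # made-up set of frequent words

# Sorting is optional; it only makes the assignment of ids deterministic.
toy_word_idx = {word: i for i, word in enumerate(sorted(toy_top_words))}

# Appending a special symbol at the end gives it the next free id.
toy_word_idx["_UNK_"] = len(toy_word_idx)
print(toy_word_idx)  # {'cat': 0, 'dog': 1, 'the': 2, '_UNK_': 3}
```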

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len(word_idx) == len(top_words) + 1  # _UNK_ included
assert '_UNK_' in word_idx
assert 'START' in pos_idx

## 4.2 Emission and transition counts


Hidden Markov Models were introduced in the [Tagging lecture](https://github.com/bmeaut/python_nlp_2018_spring/blob/master/course_material/07_Tagging/07_Tagging_lecture.ipynb) as follows:


* Like the Markov model, we take only the $n$ preceding tokens into consideration
* The idea behind the model is very different:
    * We imagine an automaton that is always in a **(hidden) state**
    * In each state, it emits something we can observe
    * The task is to find out which is _the most probable_ state sequence that generates the observations
* In the POS tagging context,
    * The words in the text are the **observed events**
    * The POS tags are the hidden states
    
    
You will build a simple Hidden Markov Model for English POS tagging.
In this case $n$ equals 1, so only the previous token is taken into consideration.

Transition and emission probabilities are derived from the corpus. Compute the emission and transition counts from the corpus. Do not iterate over the data more than once.

You need to build two matrices as follows ($\#$ means _count of_):

* a $|L| \times |V|$ matrix of integers
  $$M(i,j) = \# \ i^\text{th} \text{ pos tag emitting word } j$$
* an $|L|\times |L|$ matrix of integers
  $$N(i,j) = \# \ j^\text{th} \text{ pos after } i^\text{th} \text { pos}$$
  
**CAUTION** Both matrices are defined differently from the ones in the lab exercise.

### How to handle the POS tag of the first word in a sentence?

HMMs need an explicit way of handling starting probabilities which is usually solved with special start state(s).
Build the transition matrix as if there is an artificial `START` state at the beginning of every sentence.
You don't actually have to add a symbol to the sentence.
The start state does not affect the emission probabilities. 
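
The role of the artificial `START` state can be sketched on made-up tag sequences. The mapping and the data below are illustrative only. The previous state is initialised to `START` at the beginning of each sentence, so the first real tag is counted as a transition from `START`.

```python
import numpy as np

toy_pos_idx = {"DT": 0, "NN": 1, "VBZ": 2, "START": 3}  # made-up mapping
toy_tag_sequences = [["DT", "NN", "VBZ"], ["NN", "VBZ"]]  # made-up data

n_tags = len(toy_pos_idx)
toy_transitions = np.zeros((n_tags, n_tags), dtype=int)

for tags in toy_tag_sequences:
    prev = toy_pos_idx["START"]  # every sentence begins in the START state
    for tag in tags:
        cur = toy_pos_idx[tag]
        toy_transitions[prev, cur] += 1  # row: previous tag, column: current tag
        prev = cur

print(toy_transitions)
```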

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
print(emission_counts.shape)
print(transition_counts.shape)
assert(emission_counts[pos_idx["NN"], word_idx["_UNK_"]] == 4717)

dog_id = word_idx["dog"]
noun_id = pos_idx["NN"]
vbz_id = pos_idx["VBZ"]
noun_dog_count = emission_counts[noun_id, dog_id]
print("The state NN emitted the word dog {} times".format(emission_counts[noun_id, dog_id]))
print("The state VBZ followed the state NN {} times".format(transition_counts[noun_id, vbz_id]))
assert noun_dog_count == 64

assert type(emission_counts) == np.ndarray
assert type(transition_counts) == np.ndarray

# counts should not be negative
assert np.all(emission_counts >= 0)
assert np.all(transition_counts >= 0)

## 4.3 Emission and transition probabilities

Empirical emission and transition probabilities are defined as

* a $|L|\times |V|$ matrix of floats
  $$P_1(i,j) = \frac{\# \ i^\text{th} \text{ pos tag emitting word } j}{\# \ \text{pos tag } i}$$
* an $|L|\times |L|$ matrix of floats
  $$P_2(i,j) = \frac{\# \ j^\text{th} \text{ pos after } i^\text{th} \text { pos}}{\# \ \text{pos tag } i}$$

Create these probability matrices using the transition and emission counts from the previous exercise.

You should see a warning for emission probabilities since one of the rows sums to 0.
It is the row corresponding to the `START` state.
Add an artificial emission of the `_UNK_` symbol from the `START` state to solve this (see the tests below).

Use 64-bit floats.
Do not use log-probabilities.
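
Row-wise normalisation of a count matrix can be sketched with NumPy broadcasting. The matrix below is made up and has no all-zero row, so no warning is triggered here.

```python
import numpy as np

toy_counts = np.array([[2, 1, 1],
                       [0, 3, 1]], dtype=np.int64)  # made-up counts

# keepdims=True keeps the sums as a column vector, so each row is divided by its own sum.
row_sums = toy_counts.sum(axis=1, keepdims=True)
toy_probs = toy_counts / row_sums  # true division of integer arrays yields float64

print(toy_probs)
print(toy_probs.dtype)  # float64
```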

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert np.count_nonzero(emission_counts[pos_idx['START']]) != 0
assert type(emission_probs) == np.ndarray
assert type(transition_probs) == np.ndarray
assert emission_counts.shape == emission_probs.shape
assert transition_counts.shape == transition_probs.shape
assert np.allclose(emission_probs.sum(axis=1), 1)
assert np.allclose(transition_probs.sum(axis=1), 1)

# probabilities cannot be negative
assert np.all(emission_probs >= 0)
assert np.all(transition_probs >= 0)

# 5. Viterbi algorithm (30 points)

Implement the Viterbi algorithm for $n=1$ (the model only takes into account the previous state).

The `viterbi` function takes five parameters:
1. the sentence as a list of tokens,
2. the transition probability matrix,
3. the emission probability matrix,
4. the word-to-id mapping,
5. and the POS-tag-to-id mapping,

and it returns a list of POS tags.
It is the same length as the input sentence and each POS tag corresponds to one token in the input sentence.

Your function should handle out-of-vocabulary words as if they're `_UNK_` symbols.

Remember that the HMM always starts from an artificial `START` state, then proceeds to the first _real_ state (a POS tag in this case).
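
One common way to write the first-order Viterbi recurrence, using $P_1$ (emission) and $P_2$ (transition) from 4.3, is sketched below. The symbols $\delta$, $\psi$, $w_t$ and $T$ are notation assumed here for illustration, not names from the assignment: $w_t$ is the id of the $t^\text{th}$ word of the sentence (or the id of `_UNK_`), $T$ is the sentence length, and $i$, $j$ range over POS ids.

$$\delta_1(j) = P_2(\text{START}, j) \cdot P_1(j, w_1)$$

$$\delta_t(j) = \max_i \ \delta_{t-1}(i) \cdot P_2(i, j) \cdot P_1(j, w_t), \quad t = 2, \dots, T$$

$$\psi_t(j) = \arg\max_i \ \delta_{t-1}(i) \cdot P_2(i, j), \quad t = 2, \dots, T$$

The last tag is $\arg\max_j \delta_T(j)$, and the remaining tags are recovered by following the backpointers $\psi_t$ from $t = T$ down to $t = 2$.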

In [None]:
def viterbi(sentence, transitions, emissions, word_idx, pos_idx):
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
def run_viterbi(sentence):
    """Call viterbi with global parameters from previous exercises"""
    if isinstance(sentence, str):
        sentence = sentence.split()
    return viterbi(sentence, transition_probs, emission_probs, word_idx, pos_idx)

tags = run_viterbi("The cat runs .")
print(tags)
assert tags == ['DT', 'NN', 'VBZ', '.']
work1 = run_viterbi("My work here is done .")
work2 = run_viterbi("I work here .")
print("Work should receive different POS tags\n"
      "My work here is done: {}\n"
      "I work here: {}".format(work1[1], work2[1]))
print(viterbi(["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "fox", "."],
              transition_probs, emission_probs, word_idx, pos_idx))

## 5.2 Additional tests (extra 10 points)

Add your own tests that demonstrate the effect of context in POS tagging, such as the previous example, where _work_ has a different POS tag in different contexts.

2 points / example.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# PEP8, Code cleanness (10 points)

This cell is here for technical reasons. You will receive feedback on your code quality here, and you do not need to write anything.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()