In [18]:
import re
import json
from collections import defaultdict

### Task 1: Vocabulary Creation (20 points)
The first task is to create a vocabulary using the training data. In an HMM, one important problem when creating the vocabulary is handling unknown words. One simple solution is to replace rare words whose occurrences are less than a threshold (e.g. 3) with a special token `<unk>`.

Task. Create a vocabulary using the training data in the file train and
output the vocabulary into a txt file named vocab.txt. The format of the
vocabulary file is that each line contains a word type, its index in
the vocabulary and its number of occurrences, separated by the tab symbol
`\t`. The first line should be the special token `<unk>` and the
following lines should be sorted by occurrences in descending order. Note that we can only use the training data to create the vocabulary, without touching the development and test data. What is the selected
threshold for unknown words replacement? What is the total size of your
vocabulary, and what is the total number of occurrences of the special token `<unk>`
after replacement?

In [15]:
n_threshold = 3
train_vocab = defaultdict(int)

# File importing: the word is read from the second tab-separated column
with open('../../data/vocab-data/train', 'r') as tr_file:
    lines = tr_file.readlines()

# Create vocab, skipping the blank lines that separate sentences
for line in lines:
    if line.strip():
        word = line.rstrip('\n').split('\t')[1]
        # Note: this strips every non-word character, so pure-punctuation
        # tokens all collapse to the empty string
        cleaned_word = re.sub(r'\W+', '', word)
        train_vocab[cleaned_word] += 1

# Handle <unk> tokens: words seen n_threshold times or fewer are folded into <unk>
unk_count = sum(v for v in train_vocab.values() if v <= n_threshold)
new_vocab = {k: v for k, v in train_vocab.items() if v > n_threshold}

# <unk> goes on the first line; the remaining words follow in descending count order
sorted_vocab = [('<unk>', unk_count)] + sorted(new_vocab.items(), key=lambda item: item[1], reverse=True)

# File writing: "w" overwrites earlier output instead of appending to it
with open("../../data/hmm/train_vocab.txt", "w") as f:
    for index, (word, count) in enumerate(sorted_vocab, start=1):
        # word index count
        f.write(f"{word}\t{index}\t{count}\n")
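
The questions in the task (vocabulary size and `<unk>` count) can be checked on a small example before running on the full training file. The following is a minimal sketch, not the assignment's reference solution: `build_vocab` and the toy counts are hypothetical, and it follows the cell above in treating words seen `threshold` times or fewer as rare.

```python
from collections import Counter

def build_vocab(counts, threshold):
    """Return [(word, index, count)] with <unk> first, then words by descending count.

    Hypothetical helper: words seen `threshold` times or fewer are folded into <unk>.
    """
    unk_count = sum(c for c in counts.values() if c <= threshold)
    kept = [(w, c) for w, c in counts.items() if c > threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    rows = [("<unk>", unk_count)] + kept
    return [(w, i, c) for i, (w, c) in enumerate(rows, start=1)]

# Made-up counts, for illustration only
toy_counts = Counter({"the": 10, "cat": 5, "sat": 4, "zebra": 1, "quokka": 2})
vocab = build_vocab(toy_counts, threshold=3)

print(len(vocab))  # vocabulary size, including <unk>
print(vocab[0])    # the <unk> row: (word, index, total rare occurrences)
```

Here "zebra" and "quokka" fall at or below the threshold, so their three occurrences are counted under `<unk>`.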


The second task is to learn an HMM from the training data. Remember that
the solutions for the emission and transition parameters in an HMM take the
following form:

$$t(s' \mid s) = \frac{\mathrm{count}(s \to s')}{\mathrm{count}(s)}$$

$$e(x \mid s) = \frac{\mathrm{count}(s \to x)}{\mathrm{count}(s)}$$

where t(·|·) is the transition parameter and e(·|·) is the emission parameter.
Task. Learn a model using the training data in the file train and output
the learned model into a model file in JSON format, named hmm.json. The
model file should contain two dictionaries for the emission and transition
parameters, respectively. The first dictionary, named transition, contains
items with pairs of (s, s′) as key and t(s′|s) as value. The second dictionary,
named emission, contains items with pairs of (s, x) as key and e(x|s) as value.
How many transition and emission parameters are in your HMM?

In [17]:
n_threshold = 1
train_vocab = defaultdict(int)
transition_counts = defaultdict(int)
emission_counts = defaultdict(int)
state_counts = defaultdict(int)

# First pass: read sentences as lists of (word, state) and count words
sentences, current = [], []
with open('../../data/vocab-data/train', 'r') as tr_file:
    for line in tr_file:
        line = line.strip()
        if line:
            parts = line.split('\t')
            if len(parts) >= 3:  # index, word, tag
                word, state = parts[1], parts[2]
                cleaned_word = re.sub(r'\W+', '', word)
                train_vocab[cleaned_word] += 1
                current.append((cleaned_word, state))
        elif current:
            sentences.append(current)  # a blank line ends the sentence
            current = []
if current:
    sentences.append(current)

# Words seen n_threshold times or fewer are replaced by <unk>
rare_words = {word for word, count in train_vocab.items() if count <= n_threshold}

# Second pass: emission and transition counts, with rare words mapped to <unk>
for sentence in sentences:
    prev_state = None  # Reset at the start of each sentence
    for word, state in sentence:
        token = '<unk>' if word in rare_words else word
        emission_counts[(state, token)] += 1
        state_counts[state] += 1
        if prev_state is not None:
            transition_counts[(prev_state, state)] += 1
        prev_state = state

# Calculate probabilities: t(s'|s) and e(x|s), both normalized by count(s)
transition_probs = {k: v / state_counts[k[0]] for k, v in transition_counts.items()}
emission_probs = {k: v / state_counts[k[0]] for k, v in emission_counts.items()}

# HMM model for JSON: tuple keys are not valid JSON keys, so they are joined with a comma
hmm_model = {
    "transition": {f"{k[0]},{k[1]}": v for k, v in transition_probs.items()},
    "emission": {f"{k[0]},{k[1]}": v for k, v in emission_probs.items()}
}

with open("../../data/hmm/hmm.json", "w") as f:
    json.dump(hmm_model, f, ensure_ascii=False, indent=4)
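
Because JSON keys must be strings, the pair keys are flattened to `"s,s'"` and `"s,x"` in the cell above, so a decoder has to split them again when it loads the model. The following is a minimal sketch of that round trip. It uses a small hypothetical in-memory model rather than the real `hmm.json`, `parse_table` is an illustrative helper, and splitting on the first comma assumes that no tag itself contains a comma (if the tag set includes `,`, a different separator would be needed).

```python
import json
from collections import defaultdict

def parse_table(flat):
    """Turn {"a,b": p} back into {("a", "b"): p}, splitting on the first comma."""
    return {tuple(key.split(",", 1)): prob for key, prob in flat.items()}

# Hypothetical toy model in the same layout as hmm.json (values are made up)
toy_json = json.dumps({
    "transition": {"DT,NN": 0.75, "DT,JJ": 0.25, "NN,VB": 1.0},
    "emission": {"DT,the": 1.0, "NN,cat": 0.5, "NN,<unk>": 0.5},
})

model = json.loads(toy_json)
transition = parse_table(model["transition"])
emission = parse_table(model["emission"])

# Sanity check: the emission probabilities of each state should sum to 1
totals = defaultdict(float)
for (state, _word), prob in emission.items():
    totals[state] += prob

print(len(transition), len(emission))  # number of transition and emission parameters
print(dict(totals))
```

The two lengths printed at the end are the quantities the task asks for. Transition probabilities out of a state can sum to less than 1, because a tag at the end of a sentence is counted in count(s) but has no outgoing transition.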


The third task is to implement the greedy decoding algorithm with the HMM.
Task. Implement the greedy decoding algorithm and evaluate it on the
development data. What is the accuracy on the dev data? Predict the
part-of-speech tags of the sentences in the test data and output the
predictions in a file named greedy.out, in the same format as the training data.
We also provide an evaluation script eval.py to evaluate the results of the
model. To use the script, you need to prepare your prediction file in the same
format as the training data, then execute the command line:
python eval.py -p {predicted file} -g {gold-standard file}
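
Greedy decoding tags a sentence from left to right, choosing at each position the tag s that maximizes t(s|previous tag) · e(x|s) and never revisiting earlier choices. The following is a minimal sketch under stated assumptions, not a reference solution: `greedy_decode`, `accuracy`, and all toy probabilities are hypothetical. The model cell above records no sentence-initial tag probabilities, so the `initial` dictionary is an additional assumption. Words outside the vocabulary are mapped to `<unk>`.

```python
def greedy_decode(words, tags, initial, transition, emission, vocab):
    """Tag `words` left to right, choosing the locally best tag at each step."""
    predicted = []
    prev = None
    for word in words:
        token = word if word in vocab else "<unk>"
        best_tag, best_score = None, -1.0
        for tag in tags:
            # First word uses the (assumed) initial distribution; later words use t(tag | prev)
            trans = initial.get(tag, 0.0) if prev is None else transition.get((prev, tag), 0.0)
            score = trans * emission.get((tag, token), 0.0)
            if score > best_score:
                best_tag, best_score = tag, score
        predicted.append(best_tag)
        prev = best_tag
    return predicted

def accuracy(gold, pred):
    """Fraction of positions where the predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Hypothetical toy model (values are made up for illustration)
tags = ["DT", "NN", "VB"]
initial = {"DT": 0.8, "NN": 0.2}
transition = {("DT", "NN"): 1.0, ("NN", "VB"): 0.7, ("NN", "NN"): 0.3, ("VB", "DT"): 1.0}
emission = {("DT", "the"): 1.0, ("NN", "dog"): 0.4, ("NN", "<unk>"): 0.6,
            ("VB", "barks"): 0.5, ("VB", "<unk>"): 0.5}
vocab = {"the", "dog", "barks", "<unk>"}

pred = greedy_decode(["the", "wug", "barks"], tags, initial, transition, emission, vocab)
print(pred)
print(accuracy(["DT", "NN", "VB"], pred))
```

On long sentences or large tag sets, comparing log-probabilities instead of raw products is a common way to avoid numerical underflow.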


The fourth task is to implement the Viterbi decoding algorithm with the HMM.
Task. Implement the Viterbi decoding algorithm and evaluate it on the
development data. What is the accuracy on the dev data? Predict the
part-of-speech tags of the sentences in the test data and output the
predictions in a file named viterbi.out, in the same format as the training data.
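
Viterbi decoding finds the single most probable tag sequence for the whole sentence by dynamic programming: for every position and tag it stores the best score of any sequence ending in that tag, plus a back-pointer to the previous tag that achieved it. The following is a minimal sketch in log space, under the same kind of assumptions as any toy example: `viterbi_decode`, the `initial` distribution over sentence-initial tags, and all probabilities below are hypothetical and not taken from the assignment data.

```python
import math

def viterbi_decode(words, tags, initial, transition, emission, vocab):
    """Return the most probable tag sequence for `words` under the HMM."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    tokens = [w if w in vocab else "<unk>" for w in words]
    if not tokens:
        return []

    # scores[i][tag]: best log-probability of any tag sequence ending in `tag` at position i
    scores = [{t: log(initial.get(t, 0.0)) + log(emission.get((t, tokens[0]), 0.0)) for t in tags}]
    backptr = [{}]

    for i in range(1, len(tokens)):
        scores.append({})
        backptr.append({})
        for tag in tags:
            emit = log(emission.get((tag, tokens[i]), 0.0))
            # Best previous tag for reaching `tag` at position i
            best_prev = max(tags, key=lambda p: scores[i - 1][p] + log(transition.get((p, tag), 0.0)))
            scores[i][tag] = scores[i - 1][best_prev] + log(transition.get((best_prev, tag), 0.0)) + emit
            backptr[i][tag] = best_prev

    # Trace back from the best final tag
    last = max(tags, key=lambda t: scores[-1][t])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return path[::-1]

# Hypothetical two-tag model: the locally best first tag is A (0.5 * 0.6 > 0.5 * 0.4),
# but the best full sequence is B, B (0.12 versus 0.09 for A, B and 0.06 for A, A)
tags = ["A", "B"]
initial = {"A": 0.5, "B": 0.5}
transition = {("A", "A"): 0.5, ("A", "B"): 0.5, ("B", "B"): 1.0}
emission = {("A", "x"): 0.6, ("A", "y"): 0.4, ("B", "x"): 0.4, ("B", "y"): 0.6}
vocab = {"x", "y"}

best_path = viterbi_decode(["x", "y"], tags, initial, transition, emission, vocab)
print(best_path)
```

Working with sums of log-probabilities rather than products keeps the scores from underflowing on long sentences; zero probabilities are represented as negative infinity.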