The UPOS field contains a part-of-speech tag from the universal POS tag set, while the XPOS field optionally contains a language-specific (or even treebank-specific) part-of-speech / morphological tag.
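
To make the two fields concrete, here is a minimal sketch of reading UPOS and XPOS from a single CoNLL-U token line. It relies only on the standard ten tab-separated CoNLL-U columns; the helper name `parse_token_line` and the example line are illustrative assumptions and are not taken from `process_conllu`.

```python
# Column order of a CoNLL-U token line (ten tab-separated fields).
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by field name.

    Hypothetical helper for illustration only.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} fields, got {len(values)}")
    token = dict(zip(CONLLU_FIELDS, values))
    # An underscore marks an unspecified value; XPOS is optional.
    if token["xpos"] == "_":
        token["xpos"] = None
    return token


# Made-up example line: UPOS is DET, XPOS is the treebank-specific tag DT.
example = "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_"
token = parse_token_line(example)
print(token["form"], token["upos"], token["xpos"])
```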

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from functions import pickle_load
from process_conllu import ConlluDataset

In [2]:
def merge_word_hidden_pos(viterbi_pos, true_pos, obs_seqs):
    merged_pos = []
    for seq_idx, (pred_hidden_seq, true_hidden_seq, obs_seq) in enumerate(zip(
        viterbi_pos, true_pos, obs_seqs)):
        for hidden_state, pos, obs_state in zip(pred_hidden_seq, true_hidden_seq, obs_seq):
            merged_pos.append((seq_idx, hidden_state, pos, obs_state))
    return pd.DataFrame(merged_pos, columns=["sentence", "hidden state", "pos", "token"])

In [12]:
hmm_upos_df = pd.read_csv("results/eval_hmm_upos.csv")
hmm_xpos_df = pd.read_csv("results/eval_hmm_xpos.csv")
bert_upos_df = pd.read_csv("results/eval_bert_upos.csv")
bert_xpos_df = pd.read_csv("results/eval_bert_xpos.csv")
dataset: ConlluDataset = pickle_load("checkpoints/dataset.pkl")
viterbi_upos: list[list[int]] = pickle_load("checkpoints/viterbi_upos.pkl")
viterbi_xpos: list[list[int]] = pickle_load("checkpoints/viterbi_xpos.pkl")

In [13]:
sentence_len = [len(seq) for seq in dataset.sequences]
# Attach sentence lengths before dropna() so pandas keeps them aligned with the rows that survive
hmm_upos_df["sentence length"] = sentence_len
hmm_xpos_df["sentence length"] = sentence_len
bert_upos_df["sentence length"] = sentence_len
bert_xpos_df["sentence length"] = sentence_len


# Normalised VOI is undefined in some rows (division by zero entropy); those rows are dropped below
hmm_upos_df["normalised voi"] = hmm_upos_df["normalised voi"].astype(float)
hmm_xpos_df["normalised voi"] = hmm_xpos_df["normalised voi"].astype(float)
bert_upos_df["normalised voi"] = bert_upos_df["normalised voi"].astype(float)
bert_xpos_df["normalised voi"] = bert_xpos_df["normalised voi"].astype(float)
hmm_upos_df.dropna(inplace=True)
hmm_xpos_df.dropna(inplace=True)
bert_upos_df.dropna(inplace=True)
bert_xpos_df.dropna(inplace=True)

I initially hypothesised that the forward-backward algorithm would perform better on UPOS than on XPOS, because for the same dataset size the XPOS transition matrix is sparser. Surprisingly, it captured the more fine-grained XPOS better.
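
The sparsity argument can be made concrete with some rough arithmetic. The sketch below assumes 17 hidden states for UPOS (the size of the universal tag set) and 45 for XPOS; the XPOS figure is an assumption, inferred only from the highest hidden-state index (44) that appears in the outputs further down. The token total is the sum of the hidden-state counts reported later in this notebook.

```python
N_TOKENS = 949_701       # sum of the hidden-state counts reported in this notebook
N_UPOS_STATES = 17       # universal POS tag set
N_XPOS_STATES = 45       # assumption: highest XPOS state index seen is 44


def tokens_per_transition(n_tokens, n_states):
    """Average number of tokens available per transition-matrix entry."""
    return n_tokens / (n_states * n_states)


upos_density = tokens_per_transition(N_TOKENS, N_UPOS_STATES)
xpos_density = tokens_per_transition(N_TOKENS, N_XPOS_STATES)
print(f"UPOS: {N_UPOS_STATES ** 2} transitions, {upos_density:.0f} tokens each")
print(f"XPOS: {N_XPOS_STATES ** 2} transitions, {xpos_density:.0f} tokens each")
```

Under these assumptions the XPOS model has roughly seven times fewer tokens per transition parameter, which is the intuition behind the original hypothesis.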

It is likewise surprising that the V-measure was higher for the more fine-grained XPOS. It is no surprise, however, that BERT captured POS better than the forward-backward algorithm, by a large margin of about 0.2 in both cases.
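
For reference, here is a minimal pure-Python sketch of the V-measure as it is usually defined: the harmonic mean of homogeneity and completeness, both derived from conditional entropies between gold tags and predicted states. This is not the evaluation code that produced the CSV files; the function names and toy inputs are assumptions for illustration.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(targets, given):
    """Conditional entropy H(targets | given) in nats."""
    n = len(targets)
    joint = Counter(zip(given, targets))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(true_tags, pred_states):
    """Harmonic mean of homogeneity and completeness."""
    h_true = entropy(true_tags)
    h_pred = entropy(pred_states)
    # By convention a score is 1 when the corresponding entropy is zero.
    homogeneity = 1.0 if h_true == 0 else 1.0 - conditional_entropy(true_tags, pred_states) / h_true
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred_states, true_tags) / h_pred
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


gold = ["DET", "NOUN", "VERB"]
perfect = v_measure(gold, [3, 7, 1])    # each tag has its own state
collapsed = v_measure(gold, [6, 6, 6])  # every token in one state
print(perfect, collapsed)
```

In the toy example, a sentence whose tokens all receive the same hidden state scores zero because homogeneity is zero, even though completeness is trivially one.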

Given the large number of zeros in the V-measure for the HMM on UPOS, it was important to find out whether sentence length played a role. I hypothesised that the HMM would perform poorly on both extremely short and extremely long sentences. For extremely short sentences, the first token is difficult to predict accurately. For extremely long sentences, the path probabilities come very close to zero and the HMM may fail completely. Moreover, Viterbi decoding returns only the single most probable path under the model, so an early mistake can cascade through the rest of the sentence.
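
The worry about vanishing probabilities on long sentences is commonly handled by working in log space. The sketch below first shows a raw product underflowing to zero, then runs a small log-space Viterbi decoder. Note that Viterbi is an exact dynamic program over the model's scores, so a poor path reflects the model's parameters rather than the search. The toy HMM parameters and the function name `viterbi_log` are assumptions; the decoder used to produce `viterbi_upos.pkl` is not shown in this notebook and may differ.

```python
from math import log

# Multiplying many small probabilities underflows in float64 ...
naive = 1.0
for _ in range(200):
    naive *= 1e-3
# ... while the equivalent sum of logs stays finite.
log_total = 200 * log(1e-3)
print(naive, log_total)


def viterbi_log(log_pi, log_A, log_B, obs):
    """Most probable state path for obs, scored entirely in log space."""
    n_states = len(log_pi)
    delta = [log_pi[s] + log_B[s][obs[0]] for s in range(n_states)]
    backptrs = []
    for o in obs[1:]:
        prev, delta, step_ptr = delta, [], []
        for j in range(n_states):
            # Best predecessor i for landing in state j at this step.
            best_i = max(range(n_states), key=lambda i: prev[i] + log_A[i][j])
            step_ptr.append(best_i)
            delta.append(prev[best_i] + log_A[best_i][j] + log_B[j][o])
        backptrs.append(step_ptr)
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for step_ptr in reversed(backptrs):
        path.append(step_ptr[path[-1]])
    return path[::-1], delta[last]


# Toy two-state HMM (made-up numbers): states tend to persist and
# state s mostly emits symbol s.
log_pi = [log(0.5), log(0.5)]
log_A = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
log_B = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]

path, score = viterbi_log(log_pi, log_A, log_B, [0, 0, 1, 1])
print(path, score)
```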

It was clear that among sentences up to length 22, with the distribution centred on lengths 6-8, a large number went completely off track. This effect disappeared entirely for longer sentences.
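
One way to check a sentence-length effect like this is to bin sentences by length and compute the share of zero scores per bin. The sketch below uses a toy frame; the `"v measure"` column name, the bin edges, and all values are assumptions for illustration, not the contents of the evaluation CSVs.

```python
import pandas as pd

# Toy stand-in for an evaluation frame (made-up values).
toy_df = pd.DataFrame({
    "sentence length": [3, 5, 7, 8, 12, 20, 25, 31, 40, 6],
    "v measure": [0.0, 0.4, 0.0, 0.0, 0.5, 0.0, 0.6, 0.55, 0.7, 0.3],
})

# Right-inclusive bins: (0, 5], (5, 10], (10, 22], (22, 50].
bins = [0, 5, 10, 22, 50]
toy_df["length bin"] = pd.cut(toy_df["sentence length"], bins=bins)

# Share of sentences in each bin whose V-measure is exactly zero.
zero_share = (
    toy_df.assign(is_zero=toy_df["v measure"].eq(0.0))
    .groupby("length bin", observed=False)["is_zero"]
    .mean()
)
print(zero_share)
```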

In [4]:
viterbi_upos_df = merge_word_hidden_pos(viterbi_upos, dataset.upos, dataset.sequences)
viterbi_xpos_df = merge_word_hidden_pos(viterbi_xpos, dataset.xpos, dataset.sequences)

In [5]:
viterbi_upos_df["hidden state"].groupby(viterbi_upos_df["hidden state"]).size()

hidden state
0      69307
1      81589
2      37603
3      52910
4      71532
5      44226
6      37772
7      29463
8      53126
9      50964
10     78790
11     62379
12    118718
13     49410
14     48730
15     20747
16     42435
Name: hidden state, dtype: int64

In [6]:
upos_hidden_state_to_token = viterbi_upos_df[["hidden state", "token"]]\
  .groupby("token")\
  .agg(count=("token", "count"),
       hidden_states=("hidden state", frozenset))\
  .reset_index()

In [7]:
upos_hidden_state_to_token.sort_values("count", ascending=False).head(20)

       token  count  hidden_states
19     ,      48723  (2, 5, 8, 9, 10, 11, 12, 15)
30807  the    47975  (2, 3, 4, 7, 9, 10, 13, 15, 16)
27     .      39020  (0, 1, 6, 7, 10, 12, 13)
108    [NUM]  23927  (0, 3, 4, 6, 9, 12)
21245  of     23005  (0, 1, 2, 4, 6, 7, 11, 13, 15, 16)
31175  to     22352  (0, 2, 3, 4, 5, 7, 8, 10, 11, 12, 16)
279    a      20149  (0, 1, 2, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16)
15156  in     16931  (2, 4, 5, 9, 10, 12, 13, 14, 15)
1383   and    16668  (2, 3, 5, 6, 8, 9, 11, 12, 13, 16)
15     's      9326  (0, 1, 3, 4, 5, 6, 8, 11, 12, 13, 14, 16)


In [8]:
{len(s) for s in upos_hidden_state_to_token["hidden_states"]}

{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}

In [5]:
viterbi_xpos_df["hidden state"].groupby(viterbi_xpos_df["hidden state"]).size()

hidden state
6     663367
10     48723
30     39020
40    104502
44     94089
Name: hidden state, dtype: int64
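
The counts above can be turned into shares to see how concentrated the XPOS state usage is. This is a small sketch that uses only the five counts printed in this output; the variable names are illustrative.

```python
# Hidden-state counts copied from the output above.
xpos_state_counts = {6: 663367, 10: 48723, 30: 39020, 40: 104502, 44: 94089}

total_tokens = sum(xpos_state_counts.values())
state_shares = {state: count / total_tokens
                for state, count in xpos_state_counts.items()}

# Print states from most to least used.
for state, share in sorted(state_shares.items(), key=lambda kv: -kv[1]):
    print(f"state {state:>2}: {share:.1%}")
```

Only five hidden states are ever used, and state 6 alone accounts for roughly 70% of all tokens.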

In [6]:
xpos_hidden_state_to_token = viterbi_xpos_df[["hidden state", "token"]]\
  .groupby("token")\
  .agg(count=("token", "count"),
       hidden_states=("hidden state", frozenset))\
  .reset_index()

In [7]:
xpos_hidden_state_to_token.sort_values("count", ascending=False).head(20)

       token  count  hidden_states
19     ,      48723  (10)
30807  the    47975  (40)
27     .      39020  (30)
108    [NUM]  23927  (6)
21245  of     23005  (6)
31175  to     22352  (6)
279    a      20149  (6)
15156  in     16931  (6)
1383   and    16668  (6)
15     's      9326  (6)


In [26]:
{len(s) for s in xpos_hidden_state_to_token["hidden_states"]}

{1, 2}

In [9]:
xpos_hidden_state_to_token[np.array([len(s) for s in xpos_hidden_state_to_token["hidden_states"]]) == 2]["hidden_states"]

24       (40, 6)
303      (40, 6)
426      (40, 6)
636      (40, 6)
724      (40, 6)
          ...   
33631    (40, 6)
33910    (40, 6)
34230    (40, 6)
34241    (40, 6)
34270    (40, 6)
Name: hidden_states, Length: 191, dtype: object

In [10]:
s = xpos_hidden_state_to_token[np.array([len(s) for s in xpos_hidden_state_to_token["hidden_states"]]) == 2]["hidden_states"]
s.groupby(s).size()

hidden_states
(40, 6)    191
Name: hidden_states, dtype: int64