<a href="https://colab.research.google.com/github/dadebeats/HMM/blob/main/Practice2_HMM_POS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **POS** Tagging with **Hidden Markov Models**

Important note: *The dev dataset is used to check how unknown words are handled in both the emission and transition matrices. The training dataset also gets an additional category, the UNK tag, which covers unknown words. The dev dataset is also used for tuning smoothing and the other hyperparameters. The most infrequent words are assigned to UNK.*

In [2]:
#comment this line if not run in Colab
!pip install conllu

Collecting conllu
  Downloading conllu-4.5.3-py2.py3-none-any.whl (16 kB)
Installing collected packages: conllu
Successfully installed conllu-4.5.3


In [3]:
#comment this cell if not run in Colab
from google.colab import drive

# Mounting Google Drive in Google Colab environment
# Mount Google Drive to the '/content/drive' directory
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [4]:
# Importing all the libraries here
import os
from collections import defaultdict
import itertools
import conllu
#from hmm import HMM

In [5]:
# Comment out this cell if not running in Colab, and uncomment the "from hmm import HMM" line in the import cell above
%run '/content/drive/MyDrive/Master HAP LAP/LAP-CS /hmm.py'

# Step 1: Converting Data into Tokens and Tags

## Initializing The Global Variables

- "dataset_paths" is a dictionary keyed by language name. Each value is a list with the file locations of the training, dev and testing datasets.
- "parsed_datasets" is a dictionary keyed by language name. Each value is a list of the three datasets, each parsed into sentences.

- "tokenized_datasets" is a dictionary keyed by language name. Each value is a list of three further lists, for training, dev and testing. Each of these holds one list per sentence, made of (token, tag) tuples.
    - tokenized_datasets[ language ][ 0 ] = training dataset, a list of sentences of (token, tag) tuples
    - tokenized_datasets[ language ][ 1 ] = dev dataset, a list of sentences of (token, tag) tuples
    - tokenized_datasets[ language ][ 2 ] = testing dataset, a list of sentences of (token, tag) tuples
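The nesting described above can be illustrated with a toy stand-in. The following is only a sketch: `toy_tokenized` and the `TRAIN`, `DEV`, `TEST` constants are hypothetical names used for illustration, and the sentences are shortened examples rather than real corpus content.

```python
# Toy stand-in for the real structure: language -> [train, dev, test].
# Each split is a list of sentences; each sentence is a list of (token, tag) tuples.
toy_tokenized = {
    "English": [
        [[("Insights", "NOUN"), ("from", "ADP"), ("Eye", "NOUN")]],  # train
        [[("Introduction", "NOUN")]],                                 # dev
        [[("The", "DET"), ("end", "NOUN")]],                          # test
    ]
}

TRAIN, DEV, TEST = 0, 1, 2  # hypothetical index names

# Pick the first training sentence and split it into parallel lists.
first_train_sentence = toy_tokenized["English"][TRAIN][0]
tokens = [token for token, _ in first_train_sentence]
tags = [tag for _, tag in first_train_sentence]

print(tokens)
print(tags)
```

Indexing goes language first, then split, then sentence, then token position.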

In [6]:
train_data_index = 0
dev_data_index = 1
test_data_index = 2

- "POS_dataset_tags" is a dictionary keyed by language name which stores the set of all tags found for that language

In [7]:

# Define the paths to the .conllu datasets
dataset_paths = {
    'English': ['/content/drive/MyDrive/Master HAP LAP/LAP-CS /English corpus/en_gum-ud-train.conllu', '/content/drive/MyDrive/Master HAP LAP/LAP-CS /English corpus/en_gum-ud-dev.conllu', '/content/drive/MyDrive/Master HAP LAP/LAP-CS /English corpus/en_gum-ud-test.conllu'],
    'Basque':  ['/content/drive/MyDrive/Master HAP LAP/LAP-CS /Basque corpus/eu_bdt-ud-train.conllu', '/content/drive/MyDrive/Master HAP LAP/LAP-CS /Basque corpus/eu_bdt-ud-dev.conllu', '/content/drive/MyDrive/Master HAP LAP/LAP-CS /Basque corpus/eu_bdt-ud-test.conllu']
}

# Initialize lists within a dictionary to store parsed data for each dataset
parsed_datasets = {
    'English': [],
    'Basque': []
}

# Initialize lists within a dictionary to store a tuple of (tokens, tags)
tokenized_datasets = {
    'English': [],
    'Basque': []
}

# Initialize lists within a dictionary to store a list of tags within each dataset
POS_dataset_tags = {
    'English': set(),
    'Basque': set()
}


## Parsing the Data Files

- In the following code we parse the .conllu files and store all three datasets (training, dev and testing) of both languages in the "parsed_datasets" variable.
- We create a function "read_and_parse_conllu" which takes "file_path" as a parameter and returns the sentences parsed by conllu.

In [8]:
# Function to read and parse a .conllu file
def read_and_parse_conllu(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return conllu.parse(file.read())

In [9]:
# Iterate through each language dataset
for language, data_paths in dataset_paths.items():
    print(f"\nLanguage: {language}\nData Paths:\n\t {data_paths}\n")

    for data_path in data_paths:
        parsed_data = read_and_parse_conllu(data_path)
        parsed_datasets[language].append(parsed_data)
        print(f"Parsed data for {language}: {data_path}")

    print(f"\nNumber of Sentence Datasets: {len(parsed_datasets[language])}\n")


Language: English
Data Paths:
	 ['/content/drive/MyDrive/Master HAP LAP/LAP-CS /English corpus/en_gum-ud-train.conllu', '/content/drive/MyDrive/Master HAP LAP/LAP-CS /English corpus/en_gum-ud-dev.conllu', '/content/drive/MyDrive/Master HAP LAP/LAP-CS /English corpus/en_gum-ud-test.conllu']

Parsed data for English: /content/drive/MyDrive/Master HAP LAP/LAP-CS /English corpus/en_gum-ud-train.conllu
Parsed data for English: /content/drive/MyDrive/Master HAP LAP/LAP-CS /English corpus/en_gum-ud-dev.conllu
Parsed data for English: /content/drive/MyDrive/Master HAP LAP/LAP-CS /English corpus/en_gum-ud-test.conllu

Number of Sentence Datasets: 3


Language: Basque
Data Paths:
	 ['/content/drive/MyDrive/Master HAP LAP/LAP-CS /Basque corpus/eu_bdt-ud-train.conllu', '/content/drive/MyDrive/Master HAP LAP/LAP-CS /Basque corpus/eu_bdt-ud-dev.conllu', '/content/drive/MyDrive/Master HAP LAP/LAP-CS /Basque corpus/eu_bdt-ud-test.conllu']

Parsed data for Basque: /content/drive/MyDrive/Master HAP LAP

## Exploring the Parsed Datasets

Here we explore some basic information about the datasets we have just parsed.

In [10]:
# Iterate through the languages and their corresponding datasets
for idx, language in enumerate(parsed_datasets.keys()):
    print(f"\nDataset of Language {idx + 1}: {language}")

    # Define dataset names
    dataset_names = ["training", "dev", "testing"]

    # Print the number of sentence datasets for each language
    print(f"Number of Sentence Datasets: {len(parsed_datasets[language])}")

    # Iterate through the sentence datasets for each language
    for idx, sentence_dataset in enumerate(parsed_datasets[language]):
        dataset_name = dataset_names[idx]

        # Print the number of sentences in each dataset for the current language
        print(f"{language} Sentences in the {dataset_name} Dataset: {len(sentence_dataset)}")


Dataset of Language 1: English
Number of Sentence Datasets: 3
English Sentences in the training Dataset: 8548
English Sentences in the dev Dataset: 1117
English Sentences in the testing Dataset: 1096

Dataset of Language 2: Basque
Number of Sentence Datasets: 3
Basque Sentences in the training Dataset: 5396
Basque Sentences in the dev Dataset: 1798
Basque Sentences in the testing Dataset: 1799


## Parsed Dataset into Token and Tags

Here we extract the word ("form") and the POS tag ("upos", universal part of speech) of every token and store them as a tuple that is appended to the list for its sentence.

In [11]:
for language in parsed_datasets.keys():
    # English, Basque

    for sentence_dataset in parsed_datasets[language]:
        # 0 train, 1 dev, 2 test
        temp_tokenized_list = []

        for sentence in sentence_dataset:
            # Single sentence from one of the datasets above
            temp_sentence_list = []

            for token in sentence:
                # single token from the sentence

                temp_sentence_list.append((token['form'], token['upos']))
                POS_dataset_tags[language].add(token['upos'])

            temp_tokenized_list.append(temp_sentence_list)

        tokenized_datasets[language].append(temp_tokenized_list)
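To make the extraction concrete without depending on Drive paths or the conllu package, here is a dependency-free sketch that reads the FORM and UPOS columns (the second and fourth of the ten tab-separated CoNLL-U columns) from an in-memory string. The function name `extract_token_tag_pairs` and the sample sentence are hypothetical. Unlike the loop above, this sketch skips multiword-token range lines such as `1-2`, whose UPOS column is `_`; that may be where the `_` tag in the English tag set comes from, though this has not been verified against the corpus.

```python
def extract_token_tag_pairs(conllu_text):
    """Return a list of sentences, each a list of (form, upos) pairs."""
    sentences, current = [], []
    for line in conllu_text.splitlines():
        if not line.strip():              # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):          # sentence-level comment line
            continue
        columns = line.split("\t")
        token_id, form, upos = columns[0], columns[1], columns[3]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
            continue
        current.append((form, upos))
    if current:                           # file may not end with a blank line
        sentences.append(current)
    return sentences


sample = (
    "# text = Don't go\n"
    "1-2\tDon't\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tDo\tdo\tAUX\t_\t_\t3\taux\t_\t_\n"
    "2\tn't\tnot\tPART\t_\t_\t3\tadvmod\t_\t_\n"
    "3\tgo\tgo\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
pairs = extract_token_tag_pairs(sample)
print(pairs)
```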

## Exploring the Tokenized and Tagged Data

Here we look at the number of sentences in each dataset and at all the tags that occur in each language. We also print some example sentences from each dataset.

In [12]:
languages = ['English', 'Basque']

# Iterating through each language
for language in languages:
    for dataset_type, sentences in zip(['Train', 'Dev', 'Test'], tokenized_datasets[language]):
        print(f"{language} {dataset_type} Sentences: {len(sentences)}")
        print("Example:\n",*sentences[:2], sep='\n')
    print() # for better readability, no other purpose

# Printing the Part of Speech of both Languages
for language in languages:
    print(f"{language} POS Tags list: {POS_dataset_tags[language]}")


English Train Sentences: 8548
Example:

[('Aesthetic', 'ADJ'), ('Appreciation', 'NOUN'), ('and', 'CCONJ'), ('Spanish', 'ADJ'), ('Art', 'NOUN'), (':', 'PUNCT')]
[('Insights', 'NOUN'), ('from', 'ADP'), ('Eye', 'NOUN'), ('-', 'PUNCT'), ('Tracking', 'NOUN')]
English Dev Sentences: 1117
Example:

[('Introduction', 'NOUN')]
[('Research', 'NOUN'), ('on', 'ADP'), ('adult', 'NOUN'), ('-', 'PUNCT'), ('learned', 'VERB'), ('second', 'ADJ'), ('language', 'NOUN'), ('(', 'PUNCT'), ('L2', 'NOUN'), (')', 'PUNCT'), ('has', 'AUX'), ('provided', 'VERB'), ('considerable', 'ADJ'), ('insight', 'NOUN'), ('into', 'ADP'), ('the', 'DET'), ('neurocognitive', 'ADJ'), ('mechanisms', 'NOUN'), ('underlying', 'VERB'), ('the', 'DET'), ('learning', 'NOUN'), ('and', 'CCONJ'), ('processing', 'NOUN'), ('of', 'ADP'), ('L2', 'NOUN'), ('grammar', 'NOUN'), ('[', 'PUNCT'), ('1', 'NUM'), (']', 'PUNCT'), ('–', 'SYM'), ('[', 'PUNCT'), ('11', 'NUM'), (']', 'PUNCT'), ('.', 'PUNCT')]
English Test Sentences: 1096
Example:

[('The', 'D

In [16]:
tokenized_datasets['English'][0]

[[('Aesthetic', 'ADJ'),
  ('Appreciation', 'NOUN'),
  ('and', 'CCONJ'),
  ('Spanish', 'ADJ'),
  ('Art', 'NOUN'),
  (':', 'PUNCT')],
 [('Insights', 'NOUN'),
  ('from', 'ADP'),
  ('Eye', 'NOUN'),
  ('-', 'PUNCT'),
  ('Tracking', 'NOUN')],
 [('Claire', 'PROPN'),
  ('Bailey', 'PROPN'),
  ('-', 'PUNCT'),
  ('Ross', 'PROPN'),
  ('claire.bailey-ross@port.ac.uk', 'PROPN'),
  ('University', 'PROPN'),
  ('of', 'ADP'),
  ('Portsmouth', 'PROPN'),
  (',', 'PUNCT'),
  ('United', 'VERB'),
  ('Kingdom', 'PROPN')],
 [('Andrew', 'PROPN'),
  ('Beresford', 'PROPN'),
  ('a.m.beresford@durham.ac.uk', 'PROPN'),
  ('Durham', 'PROPN'),
  ('University', 'PROPN'),
  (',', 'PUNCT'),
  ('United', 'VERB'),
  ('Kingdom', 'PROPN')],
 [('Daniel', 'PROPN'),
  ('Smith', 'PROPN'),
  ('daniel.smith2@durham.ac.uk', 'PROPN'),
  ('Durham', 'PROPN'),
  ('University', 'PROPN'),
  (',', 'PUNCT'),
  ('United', 'VERB'),
  ('Kingdom', 'PROPN')],
 [('Claire', 'PROPN'),
  ('Warwick', 'PROPN'),
  ('c.l.h.warwick@durham.ac.uk', 'PROPN

In [13]:
# Training of HMM model

hmm_model = HMM()
hmm_model.train(tokenized_datasets['English'][0]) # default is use_log_prob=False


In [14]:
print(hmm_model.get_emission_matrix().keys())
print(hmm_model.get_transition_matrix().keys())

dict_keys(['NOUN', 'INTJ', 'PART', '_', '<START>', 'AUX', 'PUNCT', 'ADP', '<STOP>', 'ADV', 'DET', 'SCONJ', 'ADJ', 'CCONJ', 'VERB', 'NUM', 'PROPN', 'X', 'SYM', 'UNK', 'PRON'])
dict_keys(['NOUN', 'INTJ', 'PART', '_', '<START>', 'AUX', 'PUNCT', 'ADP', '<STOP>', 'ADV', 'DET', 'SCONJ', 'ADJ', 'CCONJ', 'VERB', 'NUM', 'PROPN', 'X', 'SYM', 'UNK', 'PRON'])


In [15]:
hmm_model.phi

{'NOUN': 0.14763900764773766,
 'INTJ': 0.008078259257709027,
 'PART': 0.020778646129192353,
 '_': 0.01312492899383517,
 'AUX': 0.04464867644508757,
 'PUNCT': 0.12330257894390663,
 'ADP': 0.08404738129264107,
 '<STOP>': 0.05111845921106919,
 'ADV': 0.042017711179808535,
 'DET': 0.07221401706539743,
 'SCONJ': 0.013800608709690921,
 'ADJ': 0.058413408355706506,
 'CCONJ': 0.02905422778179731,
 'VERB': 0.09365638397742153,
 'NUM': 0.017687261942489492,
 'PROPN': 0.053821178074492194,
 'X': 0.001817757819647331,
 'SYM': 0.0014111540968314807,
 'UNK': -inf,
 'PRON': 0.07236948319470937}

In [16]:
tokenized_datasets['English'][0][0]

[('Aesthetic', 'ADJ'),
 ('Appreciation', 'NOUN'),
 ('and', 'CCONJ'),
 ('Spanish', 'ADJ'),
 ('Art', 'NOUN'),
 (':', 'PUNCT')]

In [17]:
prediction = hmm_model.predict([
                                ["I",  "really", "like", "the", "movie."],\
                                ["I", "flew", "inside", "the", "ocean", "of", "wonders"], \
                                ["Aesthetic", "Appreciation", "and", "Spanish", "Art", ":"]
                                ])
print(*prediction, sep='\n')


-- automatically select first possible tag: NOUN
-- automatically select first possible tag: PART
-- automatically select first possible tag: NOUN
-- automatically select first possible tag: ADP
['PRON', 'ADV', 'ADP', 'DET', 'UNK']
['PRON', 'UNK', 'NOUN', 'PART', 'NOUN', 'ADP', 'UNK']
['ADJ', 'NOUN', 'CCONJ', 'ADJ', 'NOUN', 'PUNCT']
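The code of `hmm.py` is not shown in this notebook, so how `predict` decodes a sentence is not known here. A common choice for HMM taggers is the Viterbi algorithm, sketched below under that assumption. The function `viterbi`, its parameters, the `floor` value and the toy probabilities are all hypothetical and unrelated to the trained model above.

```python
import math


def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most likely tag sequence under a first-order HMM, scored in log space."""
    if not tokens:
        return []

    def lp(p):
        return math.log(max(p, floor))    # the floor avoids log(0)

    # best[i][t]: best log score of any tag path ending in tag t at position i
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(tokens[0], 0)) for t in tags}]
    back = [{}]                           # back[i][t]: previous tag on that best path
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda item: item[1],
            )
            best[i][t] = score + lp(emit_p[t].get(tokens[i], 0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Toy model, invented for illustration only.
toy_tags = ["DET", "NOUN", "VERB"]
toy_start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
toy_trans = {
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"VERB": 0.7, "NOUN": 0.3},
    "VERB": {"DET": 0.6, "NOUN": 0.4},
}
toy_emit = {"DET": {"the": 0.9}, "NOUN": {"dog": 0.5, "barks": 0.1}, "VERB": {"barks": 0.6}}

predicted = viterbi(["the", "dog", "barks"], toy_tags, toy_start, toy_trans, toy_emit)
print(predicted)  # ['DET', 'NOUN', 'VERB']
```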


# Step 2: Labeling the "UNK" tags in the dataset

## Filtering the dataset: words with low frequency (at most n=5 occurrences in the training data) receive the UNK tag

TODO: check the number of UNK tags in the dataset.

In [18]:
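# Count how often each word form occurs in the training data.
# Note: the counts are pooled over both languages in a single map.
word_to_count_map = defaultdict(int)
for language in languages:
    train_data = tokenized_datasets[language][train_data_index]
    for sentence in train_data:
        for word, tag in sentence:
            word_to_count_map[word] += 1

# Words seen at most LOW_FREQUENCY times in training get the UNK tag.
# Words absent from training have a count of 0, so they become UNK too.
LOW_FREQUENCY = 5
for language in languages:
    for idx in range(3):  # 0 train, 1 dev, 2 test
        temp_data = tokenized_datasets[language][idx]

        for i, sentence in enumerate(temp_data):
            new_sentence = []
            for word, tag in sentence:
                if word_to_count_map[word] <= LOW_FREQUENCY:
                    new_sentence.append((word, "UNK"))
                else:
                    new_sentence.append((word, tag))
            tokenized_datasets[language][idx][i] = new_sentence  # Update the sentence in the original dataset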
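The cell above pools word counts over both languages. A possible variation, shown here only as a sketch, counts within a single language's training split and returns new lists instead of modifying the data in place. The helper `relabel_rare_words` and the toy sentences are hypothetical.

```python
from collections import Counter


def relabel_rare_words(train_sentences, sentences, threshold=5, unk_tag="UNK"):
    """Give unk_tag to every word seen at most `threshold` times in train_sentences."""
    counts = Counter(word for sentence in train_sentences for word, _ in sentence)
    return [
        [(word, unk_tag if counts[word] <= threshold else tag) for word, tag in sentence]
        for sentence in sentences
    ]


toy_train = [[("the", "DET"), ("cat", "NOUN")], [("the", "DET"), ("dog", "NOUN")]]
toy_dev = [[("the", "DET"), ("bird", "NOUN")]]

# With threshold=1, "the" (seen twice) keeps its tag; "bird" (never seen) becomes UNK.
relabeled = relabel_rare_words(toy_train, toy_dev, threshold=1)
print(relabeled)  # [[('the', 'DET'), ('bird', 'UNK')]]
```

Counting per language keeps a word form that is frequent in one corpus from escaping the UNK label in the other.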

# Step 3: Use validation set to find the best hyperparameters

The training function uses these hyperparameters:
- use_log_prob: bool = False
- smoothing_factor: Optional[float] = None
- apply_smoothing_in_emission_matrix: bool = False
- apply_smoothing_in_transition_matrix: bool = False
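How `hmm.py` applies `smoothing_factor` is not shown in this notebook. One common interpretation is additive (add-k) smoothing, where k is added to every count before normalizing. The sketch below illustrates that idea for emission probabilities only; the function `smoothed_emission_probs` and the toy data are hypothetical and may differ from what the `HMM` class actually does.

```python
from collections import Counter, defaultdict


def smoothed_emission_probs(tagged_sentences, k=0.1):
    """Estimate P(word | tag) with add-k smoothing over the training vocabulary."""
    tag_word_counts = defaultdict(Counter)
    vocabulary = set()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_word_counts[tag][word] += 1
            vocabulary.add(word)

    probs = {}
    for tag, word_counts in tag_word_counts.items():
        # Every vocabulary word receives k pseudo-counts under every tag.
        total = sum(word_counts.values()) + k * len(vocabulary)
        probs[tag] = {word: (word_counts[word] + k) / total for word in vocabulary}
    return probs


toy_data = [[("the", "DET"), ("dog", "NOUN")], [("the", "DET"), ("cat", "NOUN")]]
emission = smoothed_emission_probs(toy_data, k=1)

print(emission["DET"]["the"])   # (2 + 1) / (2 + 1 * 3) = 0.6
print(emission["DET"]["dog"])   # (0 + 1) / (2 + 1 * 3) = 0.2
```

With smoothing, a word never seen with a tag still gets a small non-zero probability, so a single unseen pair no longer zeroes out a whole tag path.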

In [19]:
# Define the hyperparam grid
use_log_prob_options = [True, False]
smoothing_factor_options = [0.1, 0.5, 1, 1.5, 2]
# smoothing_factor_options = [1]
apply_smoothing_in_emission_matrix_options = [True, False]
apply_smoothing_in_transition_matrix_options = [True, False]

# Generate all possible combinations of hyperparameters
hyperparams_configurations = list(itertools.product(
    use_log_prob_options,
    smoothing_factor_options,
    apply_smoothing_in_emission_matrix_options,
    apply_smoothing_in_transition_matrix_options
))

# Find the best configuration
best_accuracy = {
    'English': 0,
    'Basque': 0
}
best_configuration = {
    'English': (),
    'Basque': ()
}

for language in languages:

    validation_data = tokenized_datasets[language][dev_data_index]
    train_data = tokenized_datasets[language][train_data_index]
    test_data = tokenized_datasets[language][test_data_index]

    for hpc in hyperparams_configurations:
        use_log_prob, smoothing_factor, apply_smoothing_emission, apply_smoothing_transition = hpc
        hmm_model = HMM()
        hmm_model.train(train_data,
                        use_log_prob=use_log_prob,
                        smoothing_factor=smoothing_factor,
                        apply_smoothing_in_emission_matrix=apply_smoothing_emission,
                        apply_smoothing_in_transition_matrix=apply_smoothing_transition)

        print("Exploring accuracy for: " + str(hpc))
        hpc_accuracy, hpc_pred_tags, hpc_orig_tags = hmm_model.accuracy(validation_data)
        if hpc_accuracy > best_accuracy[language]:
            best_accuracy[language] = hpc_accuracy
            best_configuration[language] = hpc
            # print("Found new best config: ", best_configuration, hpc_accuracy)


print("Best config for English: ", best_configuration['English'], best_accuracy['English'])
print("Best config for Basque: ", best_configuration['Basque'], best_accuracy['Basque'])


The last 5000 lines of the output stream have been truncated.
-- automatically select first possible tag: CCONJ
-- automatically select first possible tag: VERB
-- automatically select first possible tag: ADV
-- automatically select first possible tag: AUX
-- automatically select first possible tag: NOUN
-- automatically select first possible tag: CCONJ
-- automatically select first possible tag: NOUN
-- automatically select first possible tag: AUX
-- automatically select first possible tag: NOUN
-- automatically select first possible tag: AUX
-- automatically select first possible tag: NOUN
-- automatically select first possible tag: ADP
-- automatically select first possible tag: NOUN
-- automatically select first possible tag: CCONJ
-- automatically select first possible tag: AUX
-- automatically select first possible tag: AUX
-- automatically select first possible tag: NOUN
-- automatically select first possible tag: CCONJ
-- automatically select first possible tag: NO
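
One likely reason `use_log_prob` is part of the grid is numerical underflow: multiplying many small probabilities along a sentence can round to zero in floating point, while summing their logarithms stays finite. The numbers below are invented for illustration and do not come from the trained model.

```python
import math

# 80 decoding steps, each contributing a probability of 1e-5 (made-up values).
probabilities = [1e-5] * 80

product = 1.0
for p in probabilities:
    product *= p                 # 1e-400 is below the float64 range: becomes 0.0

log_score = sum(math.log(p) for p in probabilities)   # about -921, still usable

print(product)
print(log_score)
```

Two paths that both underflow to 0.0 cannot be ranked against each other, whereas their log scores can still be compared.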

# Step 4: Evaluation of the model on test set


## Evaluation on the English dataset


In [20]:
#training dataset for English examples
train_english = tokenized_datasets['English'][0]

In [21]:
#test dataset for English examples
test_english = tokenized_datasets['English'][2]

In [22]:
#training of the model for English
hmm_model.train(train_english, use_log_prob=True, smoothing_factor=0.1, apply_smoothing_in_emission_matrix=True, \
                apply_smoothing_in_transition_matrix=True)

In [23]:
acc_english, pred_tags_english, orig_tags_english = hmm_model.accuracy(test_english)

In [51]:
print("Accuracy for English test set:", acc_english)

Accuracy for English test set: 0.917505329433345


In [24]:
len(pred_tags_english) == len(orig_tags_english)

True

In [25]:
orig_tags = set()

for tag in orig_tags_english:
  orig_tags.add(tag)

orig_tags

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'UNK',
 'VERB',
 'X',
 '_'}

In [26]:
POS_dataset_tags['English']

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'VERB',
 'X',
 '_'}

In [27]:
correct_tags = {'UNK': 0}
total_tags = {'UNK': 0}

for tag in POS_dataset_tags['English']:
    correct_tags[tag] = 0
    total_tags[tag] = 0

for idx, tag in enumerate(orig_tags_english):
  if tag == pred_tags_english[idx]:
    correct_tags[tag] += 1

  total_tags[tag] += 1
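The per-tag cells that follow all repeat the same division. As a sketch, the counting and the division can be wrapped in one helper that reports every tag at once. It assumes, as the loop above suggests, that the gold and predicted tags are flat, equally long lists. The name `per_tag_accuracy` and the toy lists are hypothetical.

```python
from collections import Counter


def per_tag_accuracy(gold_tags, predicted_tags):
    """Map each gold tag to a (correct, total, accuracy) triple."""
    correct, total = Counter(), Counter()
    for gold, predicted in zip(gold_tags, predicted_tags):
        total[gold] += 1
        if gold == predicted:
            correct[gold] += 1
    # Only tags that occur in the gold data appear, so total[tag] is never zero.
    return {tag: (correct[tag], total[tag], correct[tag] / total[tag]) for tag in total}


gold = ["NOUN", "VERB", "NOUN", "ADJ"]
pred = ["NOUN", "NOUN", "NOUN", "ADJ"]
report = per_tag_accuracy(gold, pred)

for tag, (n_correct, n_total, accuracy) in sorted(report.items()):
    print(f"{tag:<6} {n_correct}/{n_total} = {accuracy:.3f}")
```

Note that this measures, for each gold tag, the share of its tokens that were predicted correctly (per-tag recall), which is also what `correct_tags[tag]/total_tags[tag]` computes.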

In [28]:
#Accuracy for adjectives in English

adj_acc = correct_tags['ADJ']/total_tags['ADJ']
print(adj_acc)

0.9253365973072215


In [29]:
#Accuracy for adpositions in English

adp_acc = correct_tags['ADP']/total_tags['ADP']
print(adp_acc)

0.9174447174447175


In [30]:
#Accuracy for adverbs in English

adv_acc = correct_tags['ADV']/total_tags['ADV']
print(adv_acc)

0.7942998760842627


In [31]:
#Accuracy for auxiliaries in English

aux_acc = correct_tags['AUX']/total_tags['AUX']
print(aux_acc)

0.9349945828819068


In [32]:
#Accuracy for coordinating conjunctions in English

cconj_acc = correct_tags['CCONJ']/total_tags['CCONJ']
print(cconj_acc)

0.9740634005763689


In [33]:
#Accuracy for determiners in English

det_acc = correct_tags['DET']/total_tags['DET']
print(det_acc)

0.9685681024447031


In [34]:
#Accuracy for interjections in English

intj_acc = correct_tags['INTJ']/total_tags['INTJ']
print(intj_acc)

0.5873015873015873


In [35]:
#Accuracy for nouns in English

noun_acc = correct_tags['NOUN']/total_tags['NOUN']
print(noun_acc)

0.9229086932750137


In [36]:
#Accuracy for numerals in English

num_acc = correct_tags['NUM']/total_tags['NUM']
print(num_acc)

0.9809885931558935


In [37]:
#Accuracy for particles in English

part_acc = correct_tags['PART']/total_tags['PART']
print(part_acc)

0.7913669064748201


In [38]:
#Accuracy for pronouns in English

pron_acc = correct_tags['PRON']/total_tags['PRON']
print(pron_acc)

0.9194113524877365


In [39]:
#Accuracy for proper nouns in English

propn_acc = correct_tags['PROPN']/total_tags['PROPN']
print(propn_acc)

0.7542372881355932


In [40]:
#Accuracy for punctuation marks in English

punct_acc = correct_tags['PUNCT']/total_tags['PUNCT']
print(punct_acc)

0.9968503937007874


In [41]:
#Accuracy for subordinating conjunctions in English

sconj_acc = correct_tags['SCONJ']/total_tags['SCONJ']
print(sconj_acc)

0.5368852459016393


In [42]:
#Accuracy for symbols in English

sym_acc = correct_tags['SYM']/total_tags['SYM']
print(sym_acc)

0.7142857142857143


In [43]:
#Accuracy for verbs in English

verb_acc = correct_tags['VERB']/total_tags['VERB']
print(verb_acc)

0.8347169811320755


In [44]:
#Accuracy for 'other' in English

x_acc = correct_tags['X']/total_tags['X']
print(x_acc)

0.36363636363636365


In [46]:
#Accuracy for unknown words in English

unk_acc = correct_tags['UNK']/total_tags['UNK']
print(unk_acc)

0.9391143911439115


## Evaluation on the Basque dataset

In [81]:
#training dataset for Basque examples
train_basque = tokenized_datasets['Basque'][0]

In [82]:
#test dataset for Basque examples
test_basque = tokenized_datasets['Basque'][2]

In [83]:
#training of the model for Basque
hmm_model.train(train_basque, use_log_prob=True, smoothing_factor=0.1, \
                apply_smoothing_in_emission_matrix=False, apply_smoothing_in_transition_matrix=True)

In [84]:
acc_basque, pred_tags_basque, orig_tags_basque = hmm_model.accuracy(test_basque)

In [85]:
print("Accuracy for Basque test set:", acc_basque)

Accuracy for Basque test set: 0.9529006318207927


In [86]:
len(pred_tags_basque) == len(orig_tags_basque)

True

In [87]:
orig_tags = set()

for tag in orig_tags_basque:
  orig_tags.add(tag)

orig_tags

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'UNK',
 'VERB',
 'X'}

In [88]:
POS_dataset_tags['Basque']

{'ADJ',
 'ADP',
 'ADV',
 'AUX',
 'CCONJ',
 'DET',
 'INTJ',
 'NOUN',
 'NUM',
 'PART',
 'PRON',
 'PROPN',
 'PUNCT',
 'SCONJ',
 'SYM',
 'VERB',
 'X'}
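
Comparing the two sets printed above shows which tags from the Basque inventory never occur as gold tags in the test set. Any such tag has a total of zero, so dividing `correct/total` for it would raise a `ZeroDivisionError`. The sketch below copies both sets from the cell outputs; the variable names are hypothetical.

```python
# Copied from the POS_dataset_tags['Basque'] output above.
tag_inventory = {
    'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM',
    'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X',
}
# Copied from the orig_tags output for the Basque test set above.
gold_test_tags = {
    'ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM',
    'PART', 'PRON', 'PROPN', 'PUNCT', 'UNK', 'VERB', 'X',
}

never_in_test = sorted(tag_inventory - gold_test_tags)
only_in_test = sorted(gold_test_tags - tag_inventory)

print(never_in_test)  # ['SCONJ', 'SYM']
print(only_in_test)   # ['UNK']
```

`UNK` appears only on the gold side because it was introduced by the relabeling in Step 2, after the tag inventory had been collected.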

In [89]:
correct_tags_basque = {'UNK': 0}
total_tags_basque = {'UNK': 0}

for tag in POS_dataset_tags['Basque']:
    correct_tags_basque[tag] = 0
    total_tags_basque[tag] = 0

for idx, tag in enumerate(orig_tags_basque):
  if tag == pred_tags_basque[idx]:
    correct_tags_basque[tag] += 1

  total_tags_basque[tag] += 1

In [90]:
#Accuracy for adjectives in Basque

adj_acc_basque = correct_tags_basque['ADJ']/total_tags_basque['ADJ']
print(adj_acc_basque)

0.9495967741935484


In [91]:
#Accuracy for adpositions in Basque

adp_acc_basque = correct_tags_basque['ADP']/total_tags_basque['ADP']
print(adp_acc_basque)

0.9021739130434783


In [92]:
#Accuracy for adverbs in Basque

adv_acc_basque = correct_tags_basque['ADV']/total_tags_basque['ADV']
print(adv_acc_basque)

0.8935611038107752


In [93]:
#Accuracy for auxiliaries in Basque

aux_acc_basque = correct_tags_basque['AUX']/total_tags_basque['AUX']
print(aux_acc_basque)

0.8947134606841404


In [94]:
#Accuracy for coordinating conjunctions in Basque

cconj_acc_basque = correct_tags_basque['CCONJ']/total_tags_basque['CCONJ']
print(cconj_acc_basque)

0.986863711001642


In [95]:
#Accuracy for determiners in Basque

det_acc_basque = correct_tags_basque['DET']/total_tags_basque['DET']
print(det_acc_basque)

0.9754335260115607


In [96]:
#Accuracy for interjections in Basque

intj_acc_basque = correct_tags_basque['INTJ']/total_tags_basque['INTJ']
print(intj_acc_basque)

0.0


In [97]:
#Accuracy for nouns in Basque

noun_acc_basque = correct_tags_basque['NOUN']/total_tags_basque['NOUN']
print(noun_acc_basque)

0.9288451012588944


In [98]:
#Accuracy for numerals in Basque

num_acc_basque = correct_tags_basque['NUM']/total_tags_basque['NUM']
print(num_acc_basque)

0.9976525821596244


In [99]:
#Accuracy for particles in Basque

part_acc_basque = correct_tags_basque['PART']/total_tags_basque['PART']
print(part_acc_basque)

1.0


In [100]:
#Accuracy for pronouns in Basque

pron_acc_basque = correct_tags_basque['PRON']/total_tags_basque['PRON']
print(pron_acc_basque)

0.9603960396039604


In [101]:
#Accuracy for proper nouns in Basque

propn_acc_basque = correct_tags_basque['PROPN']/total_tags_basque['PROPN']
print(propn_acc_basque)

0.9598930481283422


In [102]:
#Accuracy for punctuation marks in Basque

punct_acc_basque = correct_tags_basque['PUNCT']/total_tags_basque['PUNCT']
print(punct_acc_basque)

1.0


In [103]:
#Accuracy for subordinating conjunctions in Basque

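# SCONJ never occurs as a gold tag in the Basque test set, so guard against division by zero
sconj_total = total_tags_basque['SCONJ']
sconj_acc_basque = correct_tags_basque['SCONJ']/sconj_total if sconj_total else 0.0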
print(sconj_acc_basque)

0.0


In [104]:
#Accuracy for symbols in Basque

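# SYM never occurs as a gold tag in the Basque test set, so guard against division by zero
sym_total = total_tags_basque['SYM']
sym_acc_basque = correct_tags_basque['SYM']/sym_total if sym_total else 0.0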
print(sym_acc_basque)

0.0


In [105]:
#Accuracy for verbs in Basque

verb_acc_basque = correct_tags_basque['VERB']/total_tags_basque['VERB']
print(verb_acc_basque)

0.9067401441288682


In [106]:
#Accuracy for 'other' in Basque

x_acc_basque = correct_tags_basque['X']/total_tags_basque['X']
print(x_acc_basque)

0.625


In [107]:
#Accuracy for unknown words in Basque

unk_acc_basque = correct_tags_basque['UNK']/total_tags_basque['UNK']
print(unk_acc_basque)

0.9614888123924269
