# **POS** Tagger Using **Hidden Markov Models**

Important note: *The dev dataset is used to check how unknown tags are handled in both the emission and transition matrices. The training dataset will also get an additional category, UNK, for unknown words: the most infrequent words are assigned to UNK. The dev dataset will also be used for smoothing and further tuning.*
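
The UNK idea from the note can be sketched as follows. This is only an illustration, not the implementation used later in this notebook: the helper name "replace_rare_words", the "min_count" threshold and the toy data are all assumptions made for the example.

```python
from collections import Counter

UNK = "UNK"  # placeholder token for rare/unknown words (name assumed for this sketch)

def replace_rare_words(tagged_tokens, min_count=2):
    """Replace every word seen fewer than `min_count` times with UNK.

    `tagged_tokens` is a flat list of (token, tag) tuples.
    """
    counts = Counter(word for word, _ in tagged_tokens)
    return [
        (word if counts[word] >= min_count else UNK, tag)
        for word, tag in tagged_tokens
    ]

# Toy data, made up for illustration only
train = [("the", "DET"), ("cat", "NOUN"), ("the", "DET"), ("dog", "NOUN")]
mapped = replace_rare_words(train, min_count=2)
print(mapped)
# [('the', 'DET'), ('UNK', 'NOUN'), ('the', 'DET'), ('UNK', 'NOUN')]
```

With this mapping in place, UNK receives its own emission counts, so a word never seen in training can still be scored at tagging time.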

In [1]:
# Importing all the libraries here
import os
import conllu

# Step 1: Converting Data into Tokens and Tags 

## Initializing the Global Variables

- "dataset_paths" is a dictionary keyed by language name. Each value is a list with the file locations of the training, dev and testing datasets, in that order.
- "parsed_datasets" is a dictionary keyed by language name. Each value is a list of the datasets after they have been parsed into sentences.

- "tokenized_datasets" is a dictionary keyed by language name. Each value is a list of three further lists, one each for training, dev and testing.
    - tokenized_datasets[ language ][ 0 ] = training dataset, a list of (token, tag) tuples
    - tokenized_datasets[ language ][ 1 ] = dev dataset, a list of (token, tag) tuples
    - tokenized_datasets[ language ][ 2 ] = testing dataset, a list of (token, tag) tuples

- "POS_dataset_tags" is a dictionary that stores the set of all tags found for each language.

In [2]:
# Define the paths to the .conllu datasets
dataset_paths = {
    'English': ['Data/English/en_gum-ud-train.conllu', 'Data/English/en_gum-ud-dev.conllu', 'Data/English/en_gum-ud-test.conllu'],
    'Basque':  ['Data/Basque/eu_bdt-ud-train.conllu', 'Data/Basque/eu_bdt-ud-dev.conllu', 'Data/Basque/eu_bdt-ud-test.conllu']
}

# Initialize lists within a dictionary to store parsed data for each dataset
parsed_datasets = {
    'English': [],
    'Basque': []
}

# Initialize lists within a dictionary to store lists of (token, tag) tuples
tokenized_datasets = {
    'English': [],
    'Basque': []
}

# Initialize sets within a dictionary to store the unique tags of each language
POS_dataset_tags = {
    'English': set(),
    'Basque': set()
}


## Parsing the Data Files

- In the following code we parse the .conllu files and store all three datasets (training, dev and testing) of both languages in the "parsed_datasets" variable.
- We create a function "read_and_parse_conllu" which takes "file_path" as a parameter and returns the sentences parsed by the conllu library.

In [3]:
# Function to read and parse a .conllu file
def read_and_parse_conllu(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return conllu.parse(file.read())
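
To make the file format concrete: a CoNLL-U file stores one token per line in ten tab-separated columns, where the second column is FORM and the fourth is UPOS. Lines starting with "#" are comments, and a blank line ends a sentence. The dependency-free sketch below reads only those two columns. It is not how the conllu library works internally; the function name "read_form_upos" and the sample text are made up for illustration.

```python
# A made-up three-token sentence in CoNLL-U layout (10 tab-separated columns)
SAMPLE = (
    "# text = The cat sleeps\n"
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

def read_form_upos(text):
    """Return a list of sentences, each a list of (FORM, UPOS) tuples."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            current.append((cols[1], cols[3]))  # FORM, UPOS
    if current:
        # Handle a final sentence that is not followed by a blank line
        sentences.append(current)
    return sentences

parsed = read_form_upos(SAMPLE)
print(parsed)
# [[('The', 'DET'), ('cat', 'NOUN'), ('sleeps', 'VERB')]]
```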

In [4]:
# Iterate through each language dataset
for language, data_paths in dataset_paths.items():
    print(f"\nLanguage: {language}\nData Paths:\n\t {data_paths}\n")
    
    for data_path in data_paths:
        parsed_data = read_and_parse_conllu(data_path)
        parsed_datasets[language].append(parsed_data)
        print(f"Parsed data for {language}: {data_path}")
    
    print(f"\nNumber of Sentence Datasets: {len(parsed_datasets[language])}\n")


Language: English
Data Paths:
	 ['Data/English/en_gum-ud-train.conllu', 'Data/English/en_gum-ud-dev.conllu', 'Data/English/en_gum-ud-test.conllu']

Parsed data for English: Data/English/en_gum-ud-train.conllu
Parsed data for English: Data/English/en_gum-ud-dev.conllu
Parsed data for English: Data/English/en_gum-ud-test.conllu

Number of Sentence Datasets: 3


Language: Basque
Data Paths:
	 ['Data/Basque/eu_bdt-ud-train.conllu', 'Data/Basque/eu_bdt-ud-dev.conllu', 'Data/Basque/eu_bdt-ud-test.conllu']

Parsed data for Basque: Data/Basque/eu_bdt-ud-train.conllu
Parsed data for Basque: Data/Basque/eu_bdt-ud-dev.conllu
Parsed data for Basque: Data/Basque/eu_bdt-ud-test.conllu

Number of Sentence Datasets: 3



## Exploring the Parsed Datasets

Here we explore basic information about the datasets we have just parsed.

In [5]:
# Names of the three splits, in the order they were parsed
dataset_names = ["training", "dev", "testing"]

# Iterate through the languages and their corresponding datasets
for lang_idx, language in enumerate(parsed_datasets, start=1):
    print(f"\nDataset of Language {lang_idx}: {language}")

    # Print the number of sentence datasets for each language
    print(f"Number of Sentence Datasets: {len(parsed_datasets[language])}")

    # Iterate through the sentence datasets for each language
    for dataset_name, sentence_dataset in zip(dataset_names, parsed_datasets[language]):
        # Print the number of sentences in each dataset for the current language
        print(f"{language} Sentences in the {dataset_name} Dataset: {len(sentence_dataset)}")


Dataset of Language 1: English
Number of Sentence Datasets: 3
English Sentences in the training Dataset: 8548
English Sentences in the dev Dataset: 1117
English Sentences in the testing Dataset: 1096

Dataset of Language 2: Basque
Number of Sentence Datasets: 3
Basque Sentences in the training Dataset: 5396
Basque Sentences in the dev Dataset: 1798
Basque Sentences in the testing Dataset: 1799


## Converting the Parsed Datasets into Tokens and Tags

Here we extract the word ("form") and the POS tag ("upos", universal part of speech) of each token and store them as a tuple that is appended to our list.

In [6]:
for language in parsed_datasets.keys():
    # English, Basque
    
    for sentence_dataset in parsed_datasets[language]:
        # 0 train, 1 dev, 2 test
        temp_tokenized_list = []
        
        for sentence in sentence_dataset:
            # A single sentence from one of the datasets above
            
            for token in sentence:
                # single token from the sentence

                temp_tokenized_list.append((token['form'], token['upos']))
                POS_dataset_tags[language].add(token['upos'])
        
        tokenized_datasets[language].append(temp_tokenized_list)
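
One thing to be aware of: the English tag set printed in the next section contains "_". This most likely comes from multiword-token rows (for example a contraction listed on a range line such as "1-2"), which carry no UPOS of their own. The sketch below shows one possible way to skip such rows. It is not applied in this notebook, so the counts reported here still include them. It uses plain dictionaries that imitate parsed tokens, and it assumes that range rows have a tuple as "id" while ordinary words have an integer; the helper name "is_regular_token" is hypothetical.

```python
def is_regular_token(token):
    """True for ordinary words, False for range rows or rows without a UPOS."""
    return isinstance(token["id"], int) and token["upos"] != "_"

# Hand-written stand-in for one parsed sentence (illustration only)
sentence = [
    {"id": (1, "-", 2), "form": "don't", "upos": "_"},  # multiword-token row
    {"id": 1, "form": "do", "upos": "AUX"},
    {"id": 2, "form": "n't", "upos": "PART"},
    {"id": 3, "form": "go", "upos": "VERB"},
]

pairs = [(t["form"], t["upos"]) for t in sentence if is_regular_token(t)]
print(pairs)
# [('do', 'AUX'), ("n't", 'PART'), ('go', 'VERB')]
```

Filtering this way would change the token counts and remove "_" from the tag set, so it is a design choice rather than a required fix.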

## Exploring the Tokenized and Tagged Data

Here we explore the number of tokens in each dataset and all the tags that occur in each language. Furthermore, we look at a few examples from each dataset.

In [7]:
languages = ['English', 'Basque']

# Iterating through each language
for language in languages:
    for dataset_type, tokens in zip(['Train', 'Dev', 'Test'], tokenized_datasets[language]):
        print(f"{language} {dataset_type} Tokens: {len(tokens)}")
        print(f"Example: {tokens[:5]}")
    print() # for better readability, no other purpose

# Printing the Part of Speech of both Languages
for language in languages:
    print(f"{language} POS Tags list: {POS_dataset_tags[language]}")


English Train Tokens: 150143
Example: [('Aesthetic', 'ADJ'), ('Appreciation', 'NOUN'), ('and', 'CCONJ'), ('Spanish', 'ADJ'), ('Art', 'NOUN')]
English Dev Tokens: 19964
Example: [('Introduction', 'NOUN'), ('Research', 'NOUN'), ('on', 'ADP'), ('adult', 'NOUN'), ('-', 'PUNCT')]
English Test Tokens: 20171
Example: [('The', 'DET'), ('prevalence', 'NOUN'), ('of', 'ADP'), ('discrimination', 'NOUN'), ('across', 'ADP')]

Basque Train Tokens: 72974
Example: [('Gero', 'ADV'), (',', 'PUNCT'), ('lortutako', 'VERB'), ('masa', 'NOUN'), ('molde', 'NOUN')]
Basque Dev Tokens: 24095
Example: [('Atenasen', 'PROPN'), ('ordea', 'CCONJ'), (',', 'PUNCT'), ('beste', 'DET'), ('bost', 'NUM')]
Basque Test Tokens: 24374
Example: [('Familian', 'NOUN'), (',', 'PUNCT'), ('aldiz', 'CCONJ'), (',', 'PUNCT'), ('ez', 'PART')]

English POS Tags list: {'PROPN', 'PRON', 'NOUN', '_', 'NUM', 'PUNCT', 'VERB', 'PART', 'ADP', 'AUX', 'ADV', 'DET', 'X', 'SCONJ', 'ADJ', 'INTJ', 'SYM', 'CCONJ'}
Basque POS Tags list: {'PROPN', 'PRON',

# Step 2: Implementing HMM and Viterbi

Here we implement the Hidden Markov Model by calculating the emission matrix and the transition matrix from the tagged tokens. We then use the Viterbi algorithm to find the most likely tag sequence for a given input.
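
As a rough preview of how these pieces fit together, here is a self-contained toy sketch. It is not the implementation built in the following parts. It assumes add-alpha smoothing, a "<s>" start symbol, dictionary-based counts instead of matrices, and a three-sentence corpus invented for the example; the function names "train_hmm" and "viterbi" are likewise illustrative.

```python
import math
from collections import Counter, defaultdict

def train_hmm(sentences, alpha=1.0):
    """Count transitions and emissions; return add-alpha smoothed log-prob functions."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        prev = "<s>"  # assumed start-of-sentence symbol
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + alpha) / (total + alpha * len(tags)))

    def log_emit(tag, word):
        total = sum(emit[tag].values())
        # The extra +1 reserves probability mass for unseen words
        return math.log((emit[tag][word] + alpha) / (total + alpha * (len(vocab) + 1)))

    return tags, log_trans, log_emit

def viterbi(words, tags, log_trans, log_emit):
    """Return the most likely tag sequence for a non-empty list of words."""
    # Each column maps tag -> (best log-score ending in tag, previous tag)
    best = [{t: (log_trans("<s>", t) + log_emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + log_trans(p, t)) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + log_emit(t, word), prev)
        best.append(column)
    # Backtrack from the best final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Invented toy corpus: a list of sentences, each a list of (token, tag) tuples
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
tags, log_trans, log_emit = train_hmm(corpus)
result = viterbi(["the", "cat", "barks"], tags, log_trans, log_emit)
print(result)
# ['DET', 'NOUN', 'VERB']
```

Working in log space avoids numerical underflow on long sentences, since the path score becomes a sum rather than a product of many small probabilities.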

## Part 1: Calculating the Emission Matrix