### Transformers - Assignment 3



In [1]:
!pip install -U transformers datasets peft evaluate seqeval



In [2]:
!pip install fasttext



In [3]:
import requests
import numpy as np
import pandas as pd
import fasttext
import fasttext.util
import torch
import transformers
import evaluate
import datasets

- **Downloading and Processing CoNLL-U Data**
  - The function `download_and_read_conllu(url)` fetches a CoNLL-U formatted dataset from a URL and processes it into a structured DataFrame.
  - It extracts **words (tokens)** and **Part-of-Speech (POS) tags** from each sentence.
  - A unique **sentence ID** is assigned to each sentence.

- **Dataset Handling**
  - URLs for training, development (validation), and test datasets are specified.
  - The function is called to load and process each dataset into separate Pandas DataFrames (`train_df`, `dev_df`, `test_df`).

In [4]:
import requests
import pandas as pd

def download_and_read_conllu(url):
    """
    Downloads and processes a CoNLL-U formatted file.
    Extracts words and POS tags from each sentence and stores them in a DataFrame.
    Adds an 'id' column to uniquely identify each sentence.
    """
    response = requests.get(url)
    response.raise_for_status()  # Raise an exception for HTTP errors

    data = []
    sentence_tokens, sentence_tags = [], []
    sentence_id = 1  # Unique sentence ID

    for line in response.text.splitlines():
        line = line.strip()

        if not line:  # End of sentence
            if sentence_tokens:
                data.append({"id": sentence_id, "tokens": " ".join(sentence_tokens), "POS": " ".join(sentence_tags)})
                sentence_tokens, sentence_tags = [], []
                sentence_id += 1
            continue

        if line.startswith('#'):
            continue  # Skip comments

        tokens = line.split('\t')
        if len(tokens) > 3:
            word, pos_tag = tokens[1], tokens[3]
            if pos_tag not in {"_"}:
                sentence_tokens.append(word)
                sentence_tags.append(pos_tag)

    return pd.DataFrame(data, columns=["id", "tokens", "POS"])

# URLs for dataset files
train_url = "https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-train.conllu"
dev_url = "https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-dev.conllu"
test_url = "https://raw.githubusercontent.com/UniversalDependencies/UD_English-EWT/master/en_ewt-ud-test.conllu"

# Load datasets
train_df = download_and_read_conllu(train_url)
dev_df = download_and_read_conllu(dev_url)
test_df = download_and_read_conllu(test_url)

# Print dataset sizes
print(f"Loaded {len(train_df)} training sentences")
print(f"Loaded {len(dev_df)} validation sentences")
print(f"Loaded {len(test_df)} test sentences")


Loaded 12544 training sentences
Loaded 2001 validation sentences
Loaded 2077 test sentences
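To make the parsing rule concrete, here is a minimal sketch that applies the same logic to a small in-memory sample instead of a downloaded file. The sample sentences and the helper name `parse_conllu` are made up for illustration, and only the first four CoNLL-U columns are included. It shows why multiword-token range lines (whose POS column is `_`) do not end up in the token list.

```python
# Hypothetical sample; columns are ID, FORM, LEMMA, UPOS (the remaining columns are omitted).
sample = "\n".join([
    "# text = I don't know.",
    "1\tI\tI\tPRON",
    "2-3\tdon't\t_\t_",       # multiword-token range line: UPOS is "_"
    "2\tdo\tdo\tAUX",
    "3\tn't\tnot\tPART",
    "4\tknow\tknow\tVERB",
    "5\t.\t.\tPUNCT",
    "",
    "# text = Hi!",
    "1\tHi\thi\tINTJ",
    "2\t!\t!\tPUNCT",
])

def parse_conllu(text):
    """Return a list of (tokens, tags) pairs, one per sentence."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank line ends a sentence
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        if line.startswith("#"):          # comment / metadata line
            continue
        cols = line.split("\t")
        if len(cols) > 3 and cols[3] != "_":
            tokens.append(cols[1])
            tags.append(cols[3])
    if tokens:                            # flush a final sentence without a trailing blank line
        sentences.append((tokens, tags))
    return sentences

parsed = parse_conllu(sample)
for toks, pos in parsed:
    print(toks, pos)
```

The range line `2-3 don't` is skipped, so the first sentence keeps the syntactic words `do` and `n't`, each with its own tag.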


In [5]:
train_df

Unnamed: 0,id,tokens,POS
0,1,Al - Zaman : American forces killed Shaikh Abd...,PROPN PUNCT PROPN PUNCT ADJ NOUN VERB PROPN PR...
1,2,[ This killing of a respected cleric will be c...,PUNCT DET NOUN ADP DET ADJ NOUN AUX AUX VERB P...
2,3,DPA : Iraqi authorities announced that they ha...,PROPN PUNCT ADJ NOUN VERB SCONJ PRON AUX VERB ...
3,4,Two of them were being run by 2 officials of t...,NUM ADP PRON AUX AUX VERB ADP NUM NOUN ADP DET...
4,5,"The MoI in Iraq is equivalent to the US FBI , ...",DET PROPN ADP PROPN AUX ADJ ADP DET PROPN PROP...
...,...,...,...
12539,12540,"Of course , they could n't call him either to ...",ADP NOUN PUNCT PRON AUX PART VERB PRON ADV PAR...
12540,12541,On Monday I called and again it was a big to -...,ADP PROPN PRON VERB CCONJ ADV PRON AUX DET ADJ...
12541,12542,Supposedly they will be holding it for me this...,ADV PRON AUX AUX VERB PRON ADP PRON DET NOUN P...
12542,12543,The employees at this Sear's are completely ap...,DET NOUN ADP DET PROPN AUX ADV ADJ CCONJ PRON ...


In [6]:
dev_df.head(1)

Unnamed: 0,id,tokens,POS
0,1,From the AP comes this story :,ADP DET PROPN VERB DET NOUN PUNCT


In [7]:
test_df.head(1)

Unnamed: 0,id,tokens,POS
0,1,What if Google Morphed Into GoogleOS ?,PRON SCONJ PROPN VERB ADP PROPN PUNCT


- The code flattens all POS tags from the training dataset (train_df), collects them into a list, and removes duplicates by converting them into a set. Finally, it converts the set back into a list and prints the unique POS tags.

In [8]:
all_pos_tags = [tags.split() for tags in train_df['POS']]
flat_pos_tags = [tag for sublist in all_pos_tags for tag in sublist]
unique_pos_tags = set(flat_pos_tags)
unique_pos_tags = list(unique_pos_tags)
print(unique_pos_tags)

['CCONJ', 'ADV', 'VERB', 'ADP', 'DET', 'PRON', 'AUX', 'PART', 'PUNCT', 'X', 'SCONJ', 'NOUN', 'SYM', 'ADJ', 'NUM', 'PROPN', 'INTJ']


In [10]:
pos_mapping_dict = {
    'CCONJ': 0, 'ADV': 1, 'VERB': 2, 'ADP': 3, 'DET': 4, 'PRON': 5, 'AUX': 6,
    'PART': 7, 'PUNCT': 8, 'X': 9, 'SCONJ': 10, 'NOUN': 11, 'SYM': 12,
    'ADJ': 13, 'NUM': 14, 'PROPN': 15, 'INTJ': 16
}

def map_pos_to_category(pos_tags):
    """Map a list of POS tag strings to their integer category IDs."""
    return [pos_mapping_dict[tag] for tag in pos_tags]
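Because the order of a Python `set` of strings can change between runs, a hard-coded dictionary like the one above is one way to keep label IDs stable. An alternative, sketched here with a hypothetical tag list, is to derive the mapping from the sorted tag set. Note that this would assign different IDs than `pos_mapping_dict`, so it is an illustration of the idea rather than a drop-in replacement.

```python
# Hypothetical tag sequence, used only for illustration.
tags = ["NOUN", "VERB", "NOUN", "ADJ", "PUNCT", "VERB"]

# Sorting the unique tags makes the tag -> ID assignment reproducible across runs.
tag_to_id = {tag: idx for idx, tag in enumerate(sorted(set(tags)))}
id_to_tag = {idx: tag for tag, idx in tag_to_id.items()}

encoded = [tag_to_id[t] for t in tags]
decoded = [id_to_tag[i] for i in encoded]

print(tag_to_id)
print(encoded)
```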

- The tokens and POS columns in train_df, test_df, and dev_df are converted from space-separated strings to lists using .str.split().
- The POS tags are further transformed into numerical categories using the map_pos_to_category function.

In [11]:
train_df['tokens'] = train_df['tokens'].str.split()
test_df['tokens'] = test_df['tokens'].str.split()
dev_df['tokens'] = dev_df['tokens'].str.split()
train_df['POS'] = train_df['POS'].str.split()
test_df['POS'] = test_df['POS'].str.split()
dev_df['POS'] = dev_df['POS'].str.split()
train_df['POS'] = train_df['POS'].apply(map_pos_to_category)
test_df['POS'] = test_df['POS'].apply(map_pos_to_category)
dev_df['POS'] = dev_df['POS'].apply(map_pos_to_category)
train_df.head(5)

Unnamed: 0,id,tokens,POS
0,1,"[Al, -, Zaman, :, American, forces, killed, Sh...","[15, 8, 15, 8, 13, 11, 2, 15, 15, 15, 8, 15, 8..."
1,2,"[[, This, killing, of, a, respected, cleric, w...","[8, 4, 11, 3, 4, 13, 11, 6, 6, 2, 5, 11, 3, 11..."
2,3,"[DPA, :, Iraqi, authorities, announced, that, ...","[15, 8, 13, 11, 2, 10, 5, 6, 2, 3, 14, 13, 11,..."
3,4,"[Two, of, them, were, being, run, by, 2, offic...","[14, 3, 5, 6, 6, 2, 3, 14, 11, 3, 4, 15, 3, 4,..."
4,5,"[The, MoI, in, Iraq, is, equivalent, to, the, ...","[4, 15, 3, 15, 6, 13, 3, 4, 15, 15, 8, 1, 5, 6..."


- The pandas DataFrames are converted into Hugging Face `Dataset` objects.
- They are stored in a `DatasetDict` so that the three splits can be managed together.
- This layout is convenient for token-level NLP tasks such as POS tagging.

In [12]:
from datasets import Dataset, DatasetDict

# Convert pandas DataFrames to Hugging Face Datasets
train_dataset = Dataset.from_pandas(train_df[['id','tokens', 'POS']])
validation_dataset = Dataset.from_pandas(dev_df[['id','tokens', 'POS']])
test_dataset = Dataset.from_pandas(test_df[['id','tokens', 'POS']])

# Construct a DatasetDict containing train, validation, and test sets
pos_dataset = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    "test": test_dataset
})

# Display dataset structure
print(pos_dataset)

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'POS'],
        num_rows: 12544
    })
    validation: Dataset({
        features: ['id', 'tokens', 'POS'],
        num_rows: 2001
    })
    test: Dataset({
        features: ['id', 'tokens', 'POS'],
        num_rows: 2077
    })
})


### Explanation of the `DatasetDict` Output  

- **Structure of the DatasetDict**  
  - The `DatasetDict` consists of three subsets: `train`, `validation`, and `test`.  
  - Each subset is stored as a `Dataset` object with predefined features and row counts.  

- **Features (Columns) in Each Dataset**  
  - **`id`**: A unique identifier for each sentence.  
  - **`tokens`**: A list of words (tokens) in each sentence.  
  - **`POS`**: A list of numerical POS tag labels corresponding to the tokens.  

- **Dataset Sizes**  
  - **Training Set (`train`)**: Contains **12,544** sentences.  
  - **Validation Set (`validation`)**: Contains **2,001** sentences.  
  - **Test Set (`test`)**: Contains **2,077** sentences.  

This dataset is now ready for NLP tasks, such as training a POS-tagging model.


- **Model Selection (used in the cells below)**  
  - `model_name = "distilbert-base-cased"` specifies **DistilBERT**, a lightweight version of BERT.  
  - The `AutoTokenizer` from Hugging Face is used to load the tokenizer for this model.  

- **Tokenization Process**  
  - `tokenizer = AutoTokenizer.from_pretrained(model_name)` loads the pretrained tokenizer.  
  - `tokenizer(pos_dataset["train"][0]["tokens"], is_split_into_words=True)` tokenizes the first sentence from the training dataset.  
    - The `is_split_into_words=True` argument ensures that the input is treated as a **list of words**, not a single string.  
  - `tokenized_input["input_ids"]` provides the **token IDs** representing the sentence in numerical form.  

- **Token Conversion**  
  - `tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])` converts the token IDs back into **actual tokens**.  
  - The script prints the **original tokens** (`pos_dataset["train"][0]["tokens"]`) and the **tokenized version** (`tokens`).  

This step is essential because **BERT-based models use subword tokenization (WordPiece)**, meaning some words may be split into smaller units, which affects the alignment of POS tags.
- Model card: https://huggingface.co/distilbert/distilbert-base-cased

In [13]:
model_name  = "distilbert-base-cased"
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_input = tokenizer(pos_dataset["train"][0]["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(pos_dataset["train"][0]["tokens"])
print(tokens)

['Al', '-', 'Zaman', ':', 'American', 'forces', 'killed', 'Shaikh', 'Abdullah', 'al', '-', 'Ani', ',', 'the', 'preacher', 'at', 'the', 'mosque', 'in', 'the', 'town', 'of', 'Qaim', ',', 'near', 'the', 'Syrian', 'border', '.']
['[CLS]', 'Al', '-', 'Z', '##aman', ':', 'American', 'forces', 'killed', 'S', '##hai', '##kh', 'Abdullah', 'al', '-', 'An', '##i', ',', 'the', 'preacher', 'at', 'the', 'mosque', 'in', 'the', 'town', 'of', 'Q', '##ai', '##m', ',', 'near', 'the', 'Syrian', 'border', '.', '[SEP]']
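The splits in the output above (`Zaman` into `Z`, `##aman`; `Qaim` into `Q`, `##ai`, `##m`) follow from WordPiece's greedy longest-match-first lookup. The sketch below reproduces that lookup with a tiny hand-picked vocabulary. The vocabulary and the function name `wordpiece` are assumptions for illustration; the real tokenizer uses DistilBERT's full vocabulary and additional preprocessing.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter prefix
        if piece is None:
            return [unk]                       # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary (assumption), chosen to mirror the pieces seen in the output above.
toy_vocab = {"Al", "Z", "##aman", "Q", "##ai", "##m", "forces"}

for w in ["Al", "Zaman", "Qaim", "forces"]:
    print(w, "->", wordpiece(w, toy_vocab))
```

Since one word can become several pieces, the label list and the token list no longer have the same length, which is the alignment problem handled in the next step.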




### Explanation of the Code

#### **1. Function: `align_ner_labels_with_tokens(examples)`**
This function tokenizes input text and aligns **Part-of-Speech (POS) labels** with the tokenized words, ensuring correct label assignment for subword tokenization.

- **Tokenization Step**  
  - The function applies `tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)`, which:
    - Tokenizes words while maintaining word-token alignment.
    - Ensures input is treated as a **list of words**, not a single string.
    - Truncates if the sentence exceeds the model’s max token limit.

- **Aligning POS Labels to Tokenized Words**  
  - `word_mapping = tokenized_data.word_ids(batch_index=index)`: Retrieves word indices for each token after tokenization.  
  - The loop iterates through each **token**:
    - **Special tokens** (e.g., `[CLS]`, `[SEP]`) are assigned `-100` to be ignored during training.
    - **First subword** of a word receives the correct POS label (`ner_tags[word_idx]`).
    - **Subsequent subwords** get `-100`, preventing incorrect label assignments.

- **Final Output**  
  - The function returns a tokenized dictionary where the `"labels"` field contains aligned POS tags.
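The alignment rule can be tried without a tokenizer by supplying the word indices by hand. In the sketch below, `word_ids` is written out manually (an assumption standing in for `tokenized_data.word_ids(...)`) for the second training sentence, where `cleric` is split into three subwords; the helper name `align_labels` is hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else."""
    aligned, prev = [], None
    for word_idx in word_ids:
        if word_idx is None:              # special tokens such as [CLS] / [SEP]
            aligned.append(ignore_index)
        elif word_idx != prev:            # first subword of a new word
            aligned.append(word_labels[word_idx])
        else:                             # later subwords of the same word
            aligned.append(ignore_index)
        prev = word_idx
    return aligned

# Word-level labels for "[ This killing of a respected cleric will be causing us trouble ... ]"
pos = [8, 4, 11, 3, 4, 13, 11, 6, 6, 2, 5, 11, 3, 11, 7, 2, 8, 8]

# Hand-written word indices: word 6 ("cleric") spans three subword tokens.
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 6, 6] + list(range(7, 18)) + [None]

labels = align_labels(word_ids, pos)
print(labels)
```

Only 18 of the 22 positions carry a real label; the remaining four are `-100` and are skipped by the loss.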

---

#### **2. Applying the Function to the Dataset**
```python
tokenized_POS = pos_dataset.map(align_ner_labels_with_tokens, batched=True)
```


In [14]:
def align_ner_labels_with_tokens(examples):
    """
    Tokenizes input text and aligns Part-of-Speech (POS) labels with the tokenized words.

    :param examples: A batch containing "tokens" (lists of words) and "POS" (the corresponding integer labels).
    :type examples: dict
    :return: The tokenized inputs with an added "labels" field holding the aligned POS labels.
    :rtype: dict
    """

    tokenized_data = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )

    updated_labels = []
    for index, ner_tags in enumerate(examples["POS"]):
        word_mapping = tokenized_data.word_ids(batch_index=index)
        label_ids = []
        prev_word_idx = None

        for word_idx in word_mapping:
            if word_idx is None:
                label_ids.append(-100)  # Ignore special tokens
            elif word_idx != prev_word_idx:
                label_ids.append(ner_tags[word_idx])  # Assign label only to the first subword
            else:
                label_ids.append(-100)  # Ignore subsequent subword tokens
            prev_word_idx = word_idx

        updated_labels.append(label_ids)

    tokenized_data["labels"] = updated_labels
    return tokenized_data
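
# Tokenize and align every split, then the training split on its own
# (tokenized_train_dataset holds the same data as tokenized_POS["train"]).
tokenized_POS = pos_dataset.map(align_ner_labels_with_tokens, batched=True)
tokenized_train_dataset = pos_dataset['train'].map(align_ner_labels_with_tokens, batched=True)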


In [15]:
tokenized_train_dataset[1]

{'id': 2,
 'tokens': ['[',
  'This',
  'killing',
  'of',
  'a',
  'respected',
  'cleric',
  'will',
  'be',
  'causing',
  'us',
  'trouble',
  'for',
  'years',
  'to',
  'come',
  '.',
  ']'],
 'POS': [8, 4, 11, 3, 4, 13, 11, 6, 6, 2, 5, 11, 3, 11, 7, 2, 8, 8],
 'input_ids': [101,
  164,
  1188,
  3646,
  1104,
  170,
  9581,
  172,
  2879,
  1596,
  1209,
  1129,
  3989,
  1366,
  3819,
  1111,
  1201,
  1106,
  1435,
  119,
  166,
  102],
 'attention_mask': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1],
 'labels': [-100,
  8,
  4,
  11,
  3,
  4,
  13,
  11,
  -100,
  -100,
  6,
  6,
  2,
  5,
  11,
  3,
  11,
  7,
  2,
  8,
  8,
  -100]}

- The `DataCollatorForTokenClassification` dynamically pads input sequences to ensure that **variable-length sentences** are processed correctly during training.  

In [16]:
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
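As a rough picture of what dynamic padding means, the sketch below pads a small batch to the length of its longest sequence in plain Python. It is a simplified stand-in, not the library's implementation: the function name `pad_batch` and the two short feature dictionaries are assumptions, the pad token ID of 0 matches the `padding_idx=0` shown in the model printout, and labels are padded with `-100` so that padded positions are ignored by the loss.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad every sequence in the batch to the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)   # 0 marks padding
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch

# Two illustrative examples of different lengths (IDs and labels are made up for the sketch).
features = [
    {"input_ids": [101, 1188, 102], "labels": [-100, 4, -100]},
    {"input_ids": [101, 1188, 3646, 1104, 102], "labels": [-100, 4, 11, 3, -100]},
]

batch = pad_batch(features)
for key, rows in batch.items():
    print(key, rows)
```

Padding per batch rather than to a global maximum keeps short batches short, which saves computation on a corpus with widely varying sentence lengths.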

- Loads a transformer model (here `distilbert-base-cased`) using `AutoModelForTokenClassification`.  
  - Adds a classification head to predict **POS tags** for each token in a sentence.  
  - `num_labels=len(unique_pos_tags)`: Sets the number of possible POS tag outputs.  
  - Utilizes **transfer learning** for efficient and accurate **POS tagging**.

In [17]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer, EarlyStoppingCallback
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(unique_pos_tags))

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
model

DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
   

### **DistilBERT for Token Classification Model Breakdown**

- **`DistilBertForTokenClassification`**  
  - A DistilBERT model adapted for **token classification tasks** (e.g., POS tagging, Named Entity Recognition).  
  - Includes a classification head (`classifier`) to predict **POS tags** for each token.

#### **1. Embeddings Layer**  
- **`word_embeddings`:** Converts input tokens into 768-dimensional vectors.  
- **`position_embeddings`:** Adds positional information to tokens.  
- **`LayerNorm`:** Normalizes embeddings for stability.  
- **`dropout`:** Prevents overfitting by randomly disabling neurons.

#### **2. Transformer Layers**  
- **Contains 6 `TransformerBlock` layers** (BERT-base has 12; using fewer layers makes the model smaller and faster).  
- **Each block consists of:**  
  - **Self-Attention (`DistilBertSdpaAttention`)**:  
    - `q_lin`, `k_lin`, `v_lin`: Linear layers for query, key, and value transformations.  
    - `out_lin`: Final output transformation after attention.  
  - **Feed-Forward Network (`FFN`)**:  
    - **`lin1` & `lin2`**: Linear layers to expand and compress hidden states.  
    - **`GELUActivation`**: Non-linear activation function.  
  - **Layer Normalization (`LayerNorm`)** ensures stable training.  
  - **Dropout layers** prevent overfitting.

#### **3. Classification Head**  
- **`dropout`:** Applied before classification to improve generalization.  
- **`classifier (Linear(768 → 17))`:**  
  - Maps 768-dimensional embeddings to **17 POS tags**.  
  - Predicts one label per token.
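A quick worked calculation shows how small the newly initialized head is. It only assumes the `Linear(768 → 17)` shape described above, with one bias per output label.

```python
hidden_size = 768   # dimensionality of each token representation
num_labels = 17     # number of POS tags

weight_params = hidden_size * num_labels   # one weight per (feature, label) pair
bias_params = num_labels                   # one bias per label
head_params = weight_params + bias_params

print(f"classifier.weight: {weight_params}")
print(f"classifier.bias:   {bias_params}")
print(f"total head params: {head_params}")
```

These are the `classifier.weight` and `classifier.bias` tensors reported as newly initialized when the model was loaded; everything else comes from the pretrained checkpoint.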

In [20]:
!pip install optuna



### **Hyperparameter Optimization with Optuna**

- **Optuna for Hyperparameter Tuning**  
  - Searches for the best **learning rate, batch size, and weight decay**.  
  - **Disables Weights & Biases logging** to prevent external tracking.  

- **Training Arguments & Model Training**  
  - Runs up to **5 epochs** per trial and evaluates every `eval_steps` training steps using `Trainer`.  
  - **Early stopping** ends a trial when the validation loss fails to improve by more than 0.005 for 3 consecutive evaluations.  

- **Optimization Process**  
  - Runs **5 trials** (`n_trials=5`) to minimize **evaluation loss**.  
  - Stores and **retrieves the best hyperparameters** for fine-tuning.
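The evaluation interval in the `objective` function depends on the batch size through the expression `max(1, 60 - (batch_size // 8) * 5)`. The sketch below tabulates that interval for the four candidate batch sizes, together with the number of optimizer steps per epoch for the 12,544 training sentences. The steps-per-epoch figure assumes a single device and no gradient accumulation.

```python
import math

train_size = 12544                 # number of training sentences
candidate_batch_sizes = [16, 32, 48, 64]

schedule = {}
for bs in candidate_batch_sizes:
    eval_every = max(1, 60 - (bs // 8) * 5)          # same expression as in objective()
    steps_per_epoch = math.ceil(train_size / bs)     # assumes one device, no accumulation
    schedule[bs] = (eval_every, steps_per_epoch)
    print(f"batch_size={bs:2d}  eval every {eval_every} steps  "
          f"{steps_per_epoch} steps/epoch")
```

Larger batches are therefore evaluated after fewer steps, so the number of training examples seen between two evaluations varies much less than the step counts suggest.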

In [21]:
import optuna
import datetime
import os
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

# Disable Weights & Biases logging
os.environ["WANDB_DISABLED"] = "true"

def objective(trial):
    """
    Objective function for Optuna hyperparameter optimization.

    :param trial: Optuna trial object for suggesting hyperparameters.
    :type trial: optuna.trial.Trial
    :return: Evaluation loss for the given set of hyperparameters.
    :rtype: float
    """

    # Suggest hyperparameters
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("per_device_train_batch_size", [16, 32, 48, 64])
    weight_decay = trial.suggest_float("weight_decay", 0.0, 0.3)
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

    # Define training arguments
    training_args = TrainingArguments(
        output_dir=f'./results_{timestamp}',
        num_train_epochs=5,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size * 2,
        weight_decay=weight_decay,
        logging_dir='./logs',
        warmup_steps=500,
        eval_steps=max(1, 60 - (batch_size // 8) * 5),
        save_steps=max(1, 60 - (batch_size // 8) * 5),
        evaluation_strategy="steps",
        load_best_model_at_end=True,
        save_total_limit=3
        # report_to="none"  # Prevents logging to external services
    )

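    # Initialize Trainer.
    # Caveat: `model` is the single module-level instance, so its weights carry over
    # from one trial to the next; reload the model here to make trials independent.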
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_POS["validation"],
        tokenizer=tokenizer,
        data_collator=data_collator,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.005)]
    )

    # Train and evaluate model
    trainer.train()

    return trainer.evaluate()["eval_loss"]

# Create and run the Optuna study
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=5)

# Retrieve and display the best hyperparameters
best_params = study.best_params
print("Best hyperparameters:", best_params)

[I 2025-03-09 17:13:46,291] A new study created in memory with name: no-name-d0a39c2f-8afb-4f55-8ace-f73dea3d5a2e
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Step,Training Loss,Validation Loss
30,No log,2.870504
60,No log,2.779237
90,No log,2.61137
120,No log,2.287072
150,No log,1.716381
180,No log,1.000399
210,No log,0.554113
240,No log,0.364774
270,No log,0.288595
300,No log,0.248307


[I 2025-03-09 17:19:05,115] Trial 0 finished with value: 0.14634588360786438 and parameters: {'learning_rate': 1.5272274250707158e-05, 'per_device_train_batch_size': 48, 'weight_decay': 0.18693442360451104}. Best is trial 0 with value: 0.14634588360786438.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Step,Training Loss,Validation Loss
40,No log,0.140381
80,No log,0.139779
120,No log,0.138417
160,No log,0.134814


[I 2025-03-09 17:20:16,129] Trial 1 finished with value: 0.13481415808200836 and parameters: {'learning_rate': 4.66979459200209e-05, 'per_device_train_batch_size': 32, 'weight_decay': 0.1693780446459859}. Best is trial 1 with value: 0.13481415808200836.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Step,Training Loss,Validation Loss
20,No log,0.133555
40,No log,0.132324
60,No log,0.132414
80,No log,0.132605


[I 2025-03-09 17:21:22,376] Trial 2 finished with value: 0.1323237270116806 and parameters: {'learning_rate': 1.8461392465934126e-05, 'per_device_train_batch_size': 64, 'weight_decay': 0.27338349586364324}. Best is trial 2 with value: 0.1323237270116806.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Step,Training Loss,Validation Loss
50,No log,0.13255
100,No log,0.132699
150,No log,0.134789
200,No log,0.135926


[I 2025-03-09 17:22:14,789] Trial 3 finished with value: 0.1325504183769226 and parameters: {'learning_rate': 1.7717009481526326e-05, 'per_device_train_batch_size': 16, 'weight_decay': 0.21536890692269628}. Best is trial 2 with value: 0.1323237270116806.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Step,Training Loss,Validation Loss
30,No log,0.131147
60,No log,0.131936
90,No log,0.133508
120,No log,0.13179


[I 2025-03-09 17:23:30,695] Trial 4 finished with value: 0.1311473548412323 and parameters: {'learning_rate': 2.861470579172581e-05, 'per_device_train_batch_size': 48, 'weight_decay': 0.007091599862263786}. Best is trial 4 with value: 0.1311473548412323.


Best hyperparameters: {'learning_rate': 2.861470579172581e-05, 'per_device_train_batch_size': 48, 'weight_decay': 0.007091599862263786}


### **Best Hyperparameters Found by Optuna**

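- **`learning_rate`:** `2.86e-05`  
  - Lies inside the searched range (`1e-5` to `5e-5`, sampled on a log scale).  

- **`per_device_train_batch_size`:** `48`  
  - One of the four candidate batch sizes (16, 32, 48, 64).  

- **`weight_decay`:** `0.0071`  
  - Close to the lower end of the searched range (0.0 to 0.3), i.e. very light regularization.  

These values should be read with caution. The `objective` function reuses the same `model` object in every trial, so each trial continues from the weights left by the previous one (trial 1 already starts at a validation loss of about 0.14). The final losses of trials 1 to 4 differ by only a few thousandths, and later trials benefit from the training done in earlier ones, so the ranking reflects trial order as well as the hyperparameters.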


### **Computing Evaluation Metrics for POS Tagging**

- **Accuracy & Macro F1-Score Calculation**  
  - Uses **scikit-learn's `accuracy_score`** to compute **accuracy** (the Hugging Face `evaluate` accuracy metric is loaded in the cell but not used).  
  - Uses **scikit-learn's `f1_score`** to compute **macro F1-score**, treating all POS tags equally.  

- **Steps in `compute_metrics(p)` Function**  
  1. **Extract predictions & labels**, converting logits to label indices.  
  2. **Map numerical labels to POS tags** using the inverse of `pos_mapping_dict`.  
  3. **Ignore special tokens (`-100`)** to focus on actual words.  
  4. **Flatten predictions & labels** to calculate **accuracy** and **macro F1-score**.

In [22]:
import numpy as np
from sklearn.metrics import f1_score, accuracy_score
import evaluate

# Load Hugging Face accuracy metric
accuracy_metric = evaluate.load("accuracy")

def compute_metrics(p):
    """
    Computes accuracy and macro F1-score for POS tagging evaluation.
    """
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)  # Convert logits to label indices

    # Reverse mapping: {0: "CCONJ", 1: "ADV", ..., 16: "INTJ"}
    index_to_pos = {v: k for k, v in pos_mapping_dict.items()}

    # Convert numerical labels back to text using index_to_pos
    true_labels = [ [index_to_pos[label] for label in label_row if label != -100] for label_row in labels ]
    pred_labels = [ [index_to_pos[pred] for pred, label in zip(pred_row, label_row) if label != -100]
                    for pred_row, label_row in zip(predictions, labels) ]

    # Flatten lists to compute accuracy and F1-score
    true_labels_flat = [label for seq in true_labels for label in seq]
    pred_labels_flat = [label for seq in pred_labels for label in seq]

    # Compute accuracy
    accuracy = accuracy_score(true_labels_flat, pred_labels_flat)

    # Compute macro F1-score
    macro_f1 = f1_score(true_labels_flat, pred_labels_flat, average="macro", labels=list(pos_mapping_dict.keys()))

    return {
        "accuracy": accuracy,
        "macro_f1": macro_f1
    }
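To see why macro F1 can sit well below accuracy, the following sketch computes both by hand on a tiny made-up example in which a rare tag is never predicted. The helper names and the data are assumptions for illustration; the notebook itself relies on scikit-learn for these numbers.

```python
def f1_for_label(y_true, y_pred, label):
    """Per-class F1; returns 0.0 when the class has no true positives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)

# Made-up data: 8 NOUN tokens and 2 rare X tokens, with every token predicted as NOUN.
y_true = ["NOUN"] * 8 + ["X"] * 2
y_pred = ["NOUN"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred, labels=["NOUN", "X"])

print(f"accuracy = {accuracy:.3f}")
print(f"macro F1 = {macro:.3f}")
```

Accuracy stays at 0.8 because the frequent class dominates, while the rare class contributes an F1 of 0 and pulls the macro average down to roughly 0.44.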


### **Training the POS Tagging Model with Optimized Hyperparameters**

- **Best Hyperparameter Selection**  
  - Uses the **best hyperparameter combination** (`learning_rate`, `batch_size`, `weight_decay`) found via **Optuna optimization**.  

- **Training Arguments (`TrainingArguments`)**  
  - **`num_train_epochs=5`**: Trains for 5 epochs.  
  - **`learning_rate`**: Set to the optimal **2.86e-05**.  
  - **Dynamic `eval_steps` and `save_steps`**: Adjusts based on batch size for **efficient evaluation and checkpointing**.  
  - **`load_best_model_at_end=True`**: Ensures the best model (lowest eval loss) is retained.  

- **Trainer Setup (`Trainer`)**  
  - Uses **tokenized train & validation datasets**.  
  - Includes **early stopping** (`patience=5`) to **prevent overfitting**.  
  - Uses **`compute_metrics` function** to evaluate **accuracy & F1-score**.  

- **Model Training (`trainer.train()`)**  
  - Starts training using **DistilBERT for POS tagging** with the optimized hyperparameters.  


In [23]:
# Best hyperparameter combination found by Optuna
best_params = {'learning_rate': 2.861470579172581e-05, 'per_device_train_batch_size': 48, 'weight_decay': 0.007091599862263786}

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    learning_rate=best_params['learning_rate'],
    per_device_train_batch_size=best_params['per_device_train_batch_size'],
    per_device_eval_batch_size=best_params['per_device_train_batch_size'] * 2,
    weight_decay=best_params['weight_decay'],
    logging_dir='./logs',
    warmup_steps=500,
    # Integer division keeps eval_steps/save_steps whole numbers, as in objective()
    eval_steps=max(1, 60 - (best_params['per_device_train_batch_size'] // 8) * 5),
    save_steps=max(1, 60 - (best_params['per_device_train_batch_size'] // 8) * 5),
    evaluation_strategy="steps",
    load_best_model_at_end=True,
    save_total_limit=3
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_POS["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5, early_stopping_threshold=0.005)],
    compute_metrics=compute_metrics
)

trainer.train()

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Step,Training Loss,Validation Loss,Accuracy,Macro F1
30,No log,0.13076,0.963782,0.894495
60,No log,0.1317,0.96402,0.895044
90,No log,0.13338,0.963901,0.895907
120,No log,0.131699,0.963623,0.895605
150,No log,0.128118,0.964259,0.896611
180,No log,0.129765,0.963941,0.896307


TrainOutput(global_step=180, training_loss=0.08387692769368489, metrics={'train_runtime': 101.11, 'train_samples_per_second': 620.315, 'train_steps_per_second': 12.956, 'total_flos': 150406168353408.0, 'train_loss': 0.08387692769368489, 'epoch': 0.6870229007633588})
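The run above stops at step 180, well before the end of the first epoch. The sketch below replays the six validation losses from the table through a simplified patience rule to show why. It is an approximation written for this explanation, not the library's code: it assumes that an evaluation resets the counter only when it beats the best loss so far by more than the threshold, and that the best loss is updated on any improvement.

```python
def early_stop_index(losses, patience, threshold):
    """Return the index of the evaluation that triggers early stopping, or None."""
    best = None
    bad_evals = 0
    for i, loss in enumerate(losses):
        improved_enough = best is None or (loss < best and abs(loss - best) > threshold)
        if improved_enough:
            bad_evals = 0
        else:
            bad_evals += 1
        if best is None or loss < best:    # track the best loss, however small the gain
            best = loss
        if bad_evals >= patience:
            return i
    return None

# Validation losses logged every 30 steps in the final training run.
val_losses = [0.13076, 0.1317, 0.13338, 0.131699, 0.128118, 0.129765]

stop_idx = early_stop_index(val_losses, patience=5, threshold=0.005)
stop_step = (stop_idx + 1) * 30
best_step = (val_losses.index(min(val_losses)) + 1) * 30

print("stopped at step", stop_step)
print("best checkpoint at step", best_step)
```

Under these assumptions the fifth evaluation improves on the best loss by only about 0.003, which is below the 0.005 threshold, so it still counts against the patience budget and training ends one evaluation later. The checkpoint with the lowest loss (step 150) is the one restored by `load_best_model_at_end=True`.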

In [24]:
trainer.evaluate() # for validation set

{'eval_loss': 0.12811832129955292,
 'eval_accuracy': 0.9642587365324216,
 'eval_macro_f1': 0.8966113454943577,
 'eval_runtime': 3.6912,
 'eval_samples_per_second': 542.107,
 'eval_steps_per_second': 5.689,
 'epoch': 0.6870229007633588}

### **Model Evaluation Results In DEV Set for POS Tagging**

- **Performance Metrics**  
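  - **`eval_loss = 0.1281`** → Matches the lowest validation loss of the run (step 150), as expected with `load_best_model_at_end=True`.  
  - **`eval_accuracy = 96.43%`** → Model correctly predicts about **96.4%** of POS tags.  
  - **`eval_macro_f1 = 0.8966`** → The macro F1-score is noticeably lower than the accuracy, which suggests that some (likely rare) POS tags are predicted less reliably than the frequent ones.  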

- **Evaluation Efficiency**  
  - **`eval_runtime = 3.69s`** → Evaluation completed in **under 4 seconds**.  
  - **`eval_samples_per_second = 542.1`** → Model processes **542 samples/sec**.  
  - **`eval_steps_per_second = 5.69`** → Model evaluates **~5.7 steps/sec**.  

- **Epoch Progress**  
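  - **`epoch = 0.69`** → Early stopping ended this run before the first epoch completed. The same `model` object had already been trained during the Optuna trials, so the validation loss was close to its plateau from the first evaluation onward.  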


### **Generating a Detailed Classification Report for POS Tagging Test Set**


In [25]:
from sklearn.metrics import classification_report
import numpy as np

# Get predictions from trainer
predictions, labels, _ = trainer.predict(tokenized_POS["test"])
predictions = np.argmax(predictions, axis=2)

# Reverse mapping from index to POS tag
index_to_pos = {v: k for k, v in pos_mapping_dict.items()}

# Convert numerical labels back to text using the pos_mapping_dict
true_labels = [ [index_to_pos[label] for label in label_row if label != -100] for label_row in labels ]
pred_labels = [ [index_to_pos[pred] for pred, label in zip(pred_row, label_row) if label != -100]
                for pred_row, label_row in zip(predictions, labels) ]

# Flatten the lists for sklearn
true_labels_flat = [label for seq in true_labels for label in seq]
pred_labels_flat = [label for seq in pred_labels for label in seq]

# Print classification report using sklearn
print("Classification Report:")
print(classification_report(true_labels_flat, pred_labels_flat, labels=list(pos_mapping_dict.keys())))


Classification Report:
              precision    recall  f1-score   support

       CCONJ       0.99      0.99      0.99       736
         ADV       0.94      0.93      0.94      1183
        VERB       0.97      0.98      0.97      2606
         ADP       0.97      0.98      0.98      2030
         DET       1.00      0.99      0.99      1896
        PRON       0.99      0.99      0.99      2166
         AUX       0.99      0.99      0.99      1543
        PART       0.99      0.99      0.99       649
       PUNCT       0.99      1.00      0.99      3096
           X       0.00      0.00      0.00        42
       SCONJ       0.98      0.94      0.96       384
        NOUN       0.94      0.94      0.94      4123
         SYM       0.93      0.80      0.86       109
         ADJ       0.93      0.93      0.93      1794
         NUM       0.94      0.96      0.95       542
       PROPN       0.88      0.91      0.90      2076
        INTJ       0.96      0.76      0.85       121

   

### **POS Tagging Model - Classification Report Summary**

- **Overall Performance**  
  - **Accuracy**: **96%** – The model correctly predicts POS tags for most tokens.  
  - **Weighted F1-score**: **96%** – Strong overall performance across all POS tags.  
  - **Macro F1-score**: **90%** – Suggests slight performance imbalance across classes.  

- **High-Performing POS Tags**  
  - **Punctuation (`PUNCT`)**: **99% Precision & 100% Recall** – Very high accuracy in identifying punctuation.  
  - **Determiners (`DET`), Pronouns (`PRON`), and Auxiliary Verbs (`AUX`)**: **99% F1-score** – Strong prediction accuracy.  
  - **Verbs (`VERB`) & Adpositions (`ADP`)**: **97-98% F1-score** – Well-classified key word categories.  

- **Challenging POS Tags**  
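  - **`X` (Other)**: **0% F1-score** – None of the 42 test tokens tagged `X` (the catch-all tag for words that fit no other category) is predicted correctly.  
  - **`SYM` (Symbols) & `INTJ` (Interjections)**: **76-80% Recall** (85-86% F1) – Precision stays high (93-96%), so the model mostly misses these tags rather than over-predicting them.  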
  - **Proper Nouns (`PROPN`)**: **88% Precision & 91% Recall** – Slightly lower precision, possibly due to capitalization variations.  
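
The gap between the macro and weighted F1-scores can be reproduced from the per-class rows of the report. The sketch below copies the rounded `f1-score` and `support` columns from the table above, so the results are approximate; the variable names are illustrative.

```python
# (f1-score, support) per tag, copied from the classification report above.
report = {
    "CCONJ": (0.99, 736),  "ADV":   (0.94, 1183), "VERB":  (0.97, 2606),
    "ADP":   (0.98, 2030), "DET":   (0.99, 1896), "PRON":  (0.99, 2166),
    "AUX":   (0.99, 1543), "PART":  (0.99, 649),  "PUNCT": (0.99, 3096),
    "X":     (0.00, 42),   "SCONJ": (0.96, 384),  "NOUN":  (0.94, 4123),
    "SYM":   (0.86, 109),  "ADJ":   (0.93, 1794), "NUM":   (0.95, 542),
    "PROPN": (0.90, 2076), "INTJ":  (0.85, 121),
}

total = sum(support for _, support in report.values())

# Macro F1: every tag counts equally, regardless of frequency.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted F1: each tag counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total

# Macro F1 with the `X` tag left out, to show how much one rare tag costs.
macro_f1_without_x = (
    sum(f1 for tag, (f1, _) in report.items() if tag != "X") / (len(report) - 1)
)

print(f"tokens: {total}")
print(f"macro F1: {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
print(f"macro F1 without X: {macro_f1_without_x:.3f}")
```

With these rounded inputs, `X` covers fewer than 0.2% of the test tokens yet accounts for most of the difference: dropping it raises the macro average from about 0.90 to about 0.95, while the weighted average stays near 0.96.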


- **Predicts POS Tags for Input Text**  
  - Tokenizes the input and runs it through a **DistilBERT-based POS tagging model**.  
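  - **Filters out special tokens** (`[CLS]`, `[SEP]`) and strips the `##` prefix from subword pieces (the pieces are not merged back into whole words).  
  - Maps predicted **POS tag indices to text labels** using the inverse of `pos_mapping_dict`.  
  - Returns a **Pandas DataFrame** with one row per word piece: its position, the piece itself, and its predicted POS tag.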

In [26]:
import torch
import pandas as pd

# Build an ID -> POS tag lookup under a new name, so that re-running this cell
# does not flip `pos_mapping_dict` back and forth.
id_to_pos = {v: k for k, v in pos_mapping_dict.items()}

def get_prediction(text):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()  # Make sure dropout is disabled for inference

    inputs = tokenizer(text, truncation=True, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    # outputs[0] holds the logits; take the first (only) sequence in the batch
    probs = outputs[0][0].softmax(1)
    token_list = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

    # Extract tokens and their predicted tags, skipping special tokens ([CLS], [SEP])
    tokens_n_tags = []
    for i, (token, tag_id) in enumerate(zip(token_list, probs.argmax(axis=1))):
        if token not in ["[CLS]", "[SEP]"]:  # Ignore special tokens
            token = token.replace("##", "")  # Strip the subword prefix (pieces stay separate)
            pos_tag = id_to_pos.get(tag_id.item(), "UNKNOWN")
            tokens_n_tags.append((i, token, pos_tag))  # i is the position in the tokenized input (0 is [CLS])

    # Convert to Pandas DataFrame
    df = pd.DataFrame(tokens_n_tags, columns=['index', 'token', 'tag'])

    return df

# Example usage
text1 = "I want to go to Mykonos."
result_df = get_prediction(text1)

result_df

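|   | index | token | tag   |
|---|-------|-------|-------|
| 0 | 1     | I     | PRON  |
| 1 | 2     | want  | VERB  |
| 2 | 3     | to    | PART  |
| 3 | 4     | go    | VERB  |
| 4 | 5     | to    | ADP   |
| 5 | 6     | My    | PROPN |
| 6 | 7     | kon   | PROPN |
| 7 | 8     | os    | PROPN |
| 8 | 9     | .     | PUNCT |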


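- In the example above, every token receives a plausible tag, so the model appears to work properly. Note that `Mykonos` is split into three word pieces (`My`, `kon`, `os`), each tagged `PROPN`.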
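
If one row per word is preferred over one row per word piece, the pieces can be merged after prediction. The sketch below is a minimal illustration, not part of the notebook's pipeline: `merge_wordpieces` is a hypothetical helper, it assumes continuation pieces still carry their `##` prefix (so it would run before the prefix is stripped), and keeping the tag of the first piece is one common convention among several.

```python
def merge_wordpieces(pieces):
    """Merge WordPiece tokens into words, keeping the tag of the first piece.

    `pieces` is a list of (token, tag) pairs in which continuation pieces
    still start with "##".
    """
    words = []
    for token, tag in pieces:
        if token.startswith("##") and words:
            # Append the continuation to the previous word, keep its tag.
            prev_token, prev_tag = words[-1]
            words[-1] = (prev_token + token[2:], prev_tag)
        else:
            words.append((token, tag))
    return words


# Raw pieces assumed for the example sentence (prefixes reconstructed by hand).
pieces = [
    ("I", "PRON"), ("want", "VERB"), ("to", "PART"), ("go", "VERB"),
    ("to", "ADP"), ("My", "PROPN"), ("##kon", "PROPN"), ("##os", "PROPN"),
    (".", "PUNCT"),
]

merged = merge_wordpieces(pieces)
for word, tag in merged:
    print(f"{word}\t{tag}")
```

Here the three pieces of `Mykonos` collapse into a single `PROPN` row, leaving seven rows for the sentence.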

## **Comparison of POS Tagging Models with Baselines (on Test set)**

| Model                 | Accuracy           | Macro-avg F1         |
|-----------------------|--------------------|----------------------|
| **Majority Baseline** | **0.86**           | **0.80**             |
| **Optimal MLP**       | **0.91**           | **0.83**             |
| **Optimal RNN**       | **0.92**           | **0.86**             |
| **Optimal CNN**       | **0.92**           | **0.84**             |
| **Transformers Model** | **0.96**           | **0.90**             |


### **Observations:**
- **Transformer Model achieves the highest accuracy (96%)**, surpassing **CNN, RNN, MLP, and Majority Baseline**.
- **Macro F1-score (0.90)** is higher than all previous models, indicating strong performance across all POS tags.
- **Evaluation is fast**: the dev-set evaluation ran in 3.69s (about `542 samples/sec`).
- **RNN and CNN remain the strongest non-Transformer models** (0.92 accuracy), but the Transformer leads them by 4 accuracy points and by 4-6 macro-F1 points.
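
One way to read the accuracy gap is as a relative reduction in error rate. The sketch below uses the accuracies from the table above; since those are rounded to two decimals, the percentages are approximate, and the variable names are illustrative.

```python
# Test-set accuracies copied from the comparison table above.
baseline_accuracy = {
    "Majority Baseline": 0.86,
    "Optimal MLP": 0.91,
    "Optimal RNN": 0.92,
    "Optimal CNN": 0.92,
}
transformer_accuracy = 0.96
transformer_error = 1 - transformer_accuracy

# Relative error reduction: the share of a baseline's errors that the
# Transformer model removes.
error_reduction = {}
for name, accuracy in baseline_accuracy.items():
    baseline_error = 1 - accuracy
    error_reduction[name] = (baseline_error - transformer_error) / baseline_error
    print(f"{name}: {error_reduction[name]:.0%} fewer errors")
```

Read this way, moving from 0.92 to 0.96 accuracy roughly halves the number of mistagged tokens relative to the RNN and CNN models.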