NAME - GAURI RAMESH KARKHILE

ROLL NO - 391023

BATCH - A1



---

## 📘 Project Overview: Hindi POS Tagging using BERT

### **Objective**
To build a **Part-of-Speech (POS) Tagging system for Hindi** sentences using a **pre-trained multilingual BERT model** (`bert-base-multilingual-cased`).

---

### **Key Components**

1. **Dataset**  
   Hindi Universal Dependencies (UD) Treebank (`hi_hdtb`) in **CoNLL-U format** containing annotated sentences with POS tags.

2. **Model**  
   Pre-trained **`bert-base-multilingual-cased`** model from Hugging Face, fine-tuned for token classification.

3. **Libraries Used**
   - Hugging Face 🤗 Transformers & Datasets
   - SeqEval for evaluation metrics
   - Google Colab for running the project

---

## 🔁 Project Workflow

```mermaid
flowchart TD
    A[Start] --> B[Download Hindi UD Dataset]
    B --> C[Parse .conllu Files]
    C --> D[Create tag2id and id2tag Mappings]
    D --> E[Convert to Hugging Face Dataset]
    E --> F[Load BERT Tokenizer]
    F --> G[Tokenize + Align Labels]
    G --> H[Load BERT Model for Token Classification]
    H --> I[Setup Training Arguments]
    I --> J[Train the Model using Trainer]
    J --> K[Evaluate Model on Validation Set]
    K --> L[Test on New Sentences]
    L --> M[End]
```

---

## 🧾 Detailed Process Explanation

### 1. **Dataset Acquisition**
Download the Hindi Treebank dataset (`train`, `dev`, and `test` splits) in `.conllu` format from the [Universal Dependencies GitHub repository](https://github.com/UniversalDependencies/UD_Hindi-HDTB). The notebook fetches the raw files with `requests`; `wget` works equally well.

### 2. **Data Parsing**
Read `.conllu` files and extract:
- `tokens` (words in the sentence)
- `POS tags` (their corresponding universal part-of-speech labels)
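
As a minimal sketch of the extraction described above, the snippet below parses a small in-memory CoNLL-U sample. The two-token sample sentence and the helper name `read_tokens_and_upos` are illustrative assumptions, not part of the treebank or the notebook. Each token line has 10 tab-separated columns, with the word form in column 2 and the universal POS tag in column 4.

```python
# Hypothetical two-token sample in CoNLL-U layout (10 tab-separated columns).
SAMPLE = (
    "# text = यह घर\n"
    "1\tयह\tयह\tDET\tDEM\t_\t2\tdet\t_\t_\n"
    "2\tघर\tघर\tNOUN\tNN\t_\t0\troot\t_\t_\n"
    "\n"
)

def read_tokens_and_upos(text):
    """Return one (tokens, upos_tags) pair per sentence."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Index 0 is ID, index 1 is FORM, index 3 is UPOS.
            # Plain integer IDs only: ranges ("1-2") and empty nodes ("1.1") are skipped.
            if len(cols) == 10 and cols[0].isdigit():
                tokens.append(cols[1])
                tags.append(cols[3])
    if tokens:
        sentences.append((tokens, tags))
    return sentences

parsed = read_tokens_and_upos(SAMPLE)
print(parsed)
```

Keeping tokens and tags as parallel lists per sentence matches the layout that the later dataset conversion step expects.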

### 3. **Label Mapping**
Create:
- `tag2id`: Map POS tags to numerical IDs.
- `id2tag`: Reverse map for inference output decoding.
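
A small sketch of the two mappings, using made-up tag sequences in place of the training split. Sorting the tag set before enumerating keeps the IDs stable across runs, which matters when a saved model is reloaded later.

```python
# Hypothetical tag sequences; the notebook derives these from the training split.
tag_sequences = [["DET", "NOUN", "VERB"], ["PRON", "VERB", "PUNCT"]]

# Sorted so that every run assigns the same ID to the same tag.
unique_tags = sorted({tag for seq in tag_sequences for tag in seq})
tag2id = {tag: idx for idx, tag in enumerate(unique_tags)}
id2tag = {idx: tag for tag, idx in tag2id.items()}

# Round trip: tags -> IDs for training, IDs -> tags for decoding predictions.
encoded = [[tag2id[t] for t in seq] for seq in tag_sequences]
decoded = [[id2tag[i] for i in seq] for seq in encoded]
print(tag2id)
print(encoded)
```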

### 4. **Dataset Conversion**
Convert token and tag lists into Hugging Face `Dataset` format for model consumption.

### 5. **Tokenizer**
Use `bert-base-multilingual-cased` tokenizer. It handles subword tokenization (WordPiece), crucial for languages like Hindi.

### 6. **Tokenize and Align Labels**
For each tokenized input:
- Assign each word-level POS tag to the first subword token of its word.
- Assign `-100` to special tokens (`[CLS]`, `[SEP]`) and to the remaining subwords of each word, so the loss ignores those positions.
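
The alignment rule can be sketched without loading a tokenizer. The `word_ids` list below is a hypothetical stand-in for what a fast tokenizer's `word_ids()` returns: `None` for special tokens, otherwise the index of the word each subword came from. The label IDs are arbitrary illustrative values.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def align_labels(word_ids, word_labels):
    """Label the first subword of each word; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)            # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])    # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)            # continuation subword
        previous = word_id
    return aligned

# Hypothetical layout: [CLS] w0 w1a w1b w2 [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [5, 7, 9]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Labelling only the first subword means every word contributes exactly one term to the loss, regardless of how many pieces WordPiece splits it into.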

### 7. **Model Setup**
Use `AutoModelForTokenClassification` with `num_labels = len(tag2id)` to fine-tune the BERT model.

### 8. **Training Configuration**
Define training hyperparameters using `TrainingArguments`, including:
- Learning rate
- Batch size
- Evaluation strategy
- Epochs

### 9. **Training the Model**
Train using Hugging Face `Trainer` by providing:
- Tokenized training & validation datasets
- Model
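- Data collator (`DataCollatorForTokenClassification`, built from the tokenizer, which pads inputs and labels per batch)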
- Metric computation function (like F1)

### 10. **Evaluation**
Run the model on the validation set and compute:
- Accuracy
- Precision
- Recall
- F1-score
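
Since `seqeval` is designed for chunk-style labels (for example `B-PER`, `I-PER`), a per-token view of the same metrics can be a useful cross-check for POS tags. The sketch below is an assumption about how such a check could look, not the notebook's metric function: it computes token accuracy and macro-averaged F1 in plain Python, skipping positions labelled `-100`. The function name and the toy inputs are hypothetical.

```python
from collections import Counter

def token_level_scores(pred_ids, label_ids, ignore_index=-100):
    """Token accuracy and macro F1 over positions that carry a real label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    correct = total = 0
    for pred_seq, label_seq in zip(pred_ids, label_ids):
        for pred, gold in zip(pred_seq, label_seq):
            if gold == ignore_index:
                continue
            total += 1
            if pred == gold:
                correct += 1
                tp[gold] += 1
            else:
                fp[pred] += 1
                fn[gold] += 1
    # F1 per tag, then an unweighted mean over all tags seen in gold or predictions.
    f1_per_tag = {}
    for tag in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        f1_per_tag[tag] = 2 * tp[tag] / denom if denom else 0.0
    macro_f1 = sum(f1_per_tag.values()) / len(f1_per_tag) if f1_per_tag else 0.0
    accuracy = correct / total if total else 0.0
    return {"accuracy": accuracy, "macro_f1": macro_f1}

# Toy batch: two sequences with masked positions.
preds = [[0, 1, 2, 2], [1, 1, 0]]
labels = [[-100, 1, 2, 0], [1, 2, -100]]
scores = token_level_scores(preds, labels)
print(scores)
```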

### 11. **Inference / Testing**
Test the model with custom Hindi sentences to predict POS tags for each token using the trained model.
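
At inference time the model returns one prediction per subword, including `[CLS]` and `[SEP]`, so the predictions have to be mapped back to words before printing. A minimal sketch of that step follows, with hypothetical `word_ids`, prediction IDs, and tag mapping standing in for real tokenizer and model output.

```python
def decode_word_tags(words, word_ids, pred_ids, id2tag):
    """Pair each word with the tag predicted at its first subword."""
    tags = {}
    for position, word_id in enumerate(word_ids):
        # Skip special tokens (None) and continuation subwords (already seen).
        if word_id is not None and word_id not in tags:
            tags[word_id] = id2tag[pred_ids[position]]
    # tags.get() yields None for any word cut off by truncation.
    return [(word, tags.get(i)) for i, word in enumerate(words)]

id2tag = {0: "NOUN", 1: "PRON", 2: "VERB"}   # hypothetical mapping
words = ["w0", "w1", "w2"]
word_ids = [None, 0, 1, 1, 2, None]          # [CLS] w0 w1a w1b w2 [SEP]
pred_ids = [2, 1, 0, 2, 2, 0]                # one prediction per subword
tagged = decode_word_tags(words, word_ids, pred_ids, id2tag)
print(tagged)
```

Reading the first subword mirrors the training setup, where only that position carried a label.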

---



In [None]:
!pip install transformers datasets seqeval




In [None]:
import os
import requests

# Create a directory for the dataset
os.makedirs("data", exist_ok=True)

# URLs for the dataset files
urls = {
    "train": "https://raw.githubusercontent.com/UniversalDependencies/UD_Hindi-HDTB/master/hi_hdtb-ud-train.conllu",
    "dev": "https://raw.githubusercontent.com/UniversalDependencies/UD_Hindi-HDTB/master/hi_hdtb-ud-dev.conllu",
    "test": "https://raw.githubusercontent.com/UniversalDependencies/UD_Hindi-HDTB/master/hi_hdtb-ud-test.conllu"
}

# Download the files
for split, url in urls.items():
    response = requests.get(url, timeout=60)
    # Fail loudly on HTTP errors instead of saving an error page as data
    response.raise_for_status()
    # Write the raw bytes so the UTF-8 content is stored unchanged
    with open(f"data/{split}.conllu", "wb") as f:
        f.write(response.content)


In [None]:
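def parse_conllu(filepath):
    """Read a CoNLL-U file and return parallel lists of tokens and UPOS tags per sentence."""
    sentences = []
    tags = []
    with open(filepath, "r", encoding="utf-8") as f:
        tokens = []
        pos_tags = []
        for line in f:
            line = line.strip()
            if line == "":
                # A blank line marks the end of a sentence
                if tokens:
                    sentences.append(tokens)
                    tags.append(pos_tags)
                    tokens = []
                    pos_tags = []
            elif not line.startswith("#"):
                parts = line.split("\t")
                if len(parts) != 10:
                    continue
                # Skip multiword ranges ("1-2") and empty nodes ("1.1"), which carry no UPOS tag
                if not parts[0].isdigit():
                    continue
                tokens.append(parts[1])    # FORM column
                pos_tags.append(parts[3])  # UPOS column
        # Flush the last sentence if the file does not end with a blank line
        if tokens:
            sentences.append(tokens)
            tags.append(pos_tags)
    return sentences, tags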

# Parse the datasets
train_sentences, train_tags = parse_conllu("data/train.conllu")
dev_sentences, dev_tags = parse_conllu("data/dev.conllu")
test_sentences, test_tags = parse_conllu("data/test.conllu")


In [None]:
from datasets import Dataset

# Create datasets
train_dataset = Dataset.from_dict({"tokens": train_sentences, "tags": train_tags})
dev_dataset = Dataset.from_dict({"tokens": dev_sentences, "tags": dev_tags})
test_dataset = Dataset.from_dict({"tokens": test_sentences, "tags": test_tags})


In [None]:
# Build the label vocabulary from the training split (sorted for stable IDs)
unique_tags = sorted({tag for doc in train_tags for tag in doc})
tag2id = {tag: idx for idx, tag in enumerate(unique_tags)}
id2tag = {idx: tag for tag, idx in tag2id.items()}

# Encode the tags
def encode_tags(example):
    """Map one example's POS tags to integer label IDs."""
    return {"labels": [tag2id[tag] for tag in example["tags"]]}

train_dataset = train_dataset.map(encode_tags)
dev_dataset = dev_dataset.map(encode_tags)
test_dataset = test_dataset.map(encode_tags)


Examples mapped per split: train 13306, dev 1659, test 1684

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["labels"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Apply the tokenization
train_dataset = train_dataset.map(tokenize_and_align_labels, batched=True)
dev_dataset = dev_dataset.map(tokenize_and_align_labels, batched=True)
test_dataset = test_dataset.map(tokenize_and_align_labels, batched=True)



In [None]:
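from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(tag2id),
    # Store the tag names in the model config so saved checkpoints decode to POS tags
    id2label=id2tag,
    label2id=tag2id
)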



Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
!pip install --upgrade transformers -q

In [None]:
from transformers import TrainingArguments, Trainer
import os

# Disable Weights & Biases logging
os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir="./results",
    # Recent transformers releases name this argument 'eval_strategy' (formerly 'evaluation_strategy')
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=2,
    # Must match eval_strategy so the best checkpoint can be reloaded at the end
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    report_to="none"  # Disable all reporting integrations
)

In [None]:
import numpy as np
from seqeval.metrics import accuracy_score, precision_score, recall_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=2)

    # Drop positions labelled -100 (special tokens and continuation subwords)
    true_predictions = [
        [id2tag[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [id2tag[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]

    # Note: seqeval targets IOB-style chunk labels. accuracy_score is per token,
    # but precision/recall/F1 are computed over spans that seqeval infers from the
    # tag strings, so they should not be read as per-token POS scores.
    return {
        "accuracy": accuracy_score(true_labels, true_predictions),
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
    }


In [None]:
from transformers import DataCollatorForTokenClassification

# Pads input IDs and labels to the longest sequence in each batch
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

# Fine-tune the model
trainer.train()

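| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|-------|---------------|-----------------|----------|-----------|--------|----|
| 1 | 0.1018 | 0.103564 | 0.969219 | 0.96062 | 0.962781 | 0.961699 |
| 2 | 0.0891 | 0.092497 | 0.973677 | 0.967822 | 0.968484 | 0.968153 |
| 3 | 0.0547 | 0.090345 | 0.974359 | 0.968528 | 0.968875 | 0.968702 |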




TrainOutput(global_step=2496, training_loss=0.12412288118726932, metrics={'train_runtime': 762.1117, 'train_samples_per_second': 52.378, 'train_steps_per_second': 3.275, 'total_flos': 1680984735181248.0, 'train_loss': 0.12412288118726932, 'epoch': 3.0})
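
The reported `global_step=2496` can be checked against the configuration with a short calculation. This sketch assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook. The 13306 training examples come from the dataset map output.

```python
import math

train_examples = 13306   # training sentences reported by the dataset map step
batch_size = 16          # per_device_train_batch_size
epochs = 3               # num_train_epochs
num_devices = 1          # assumption: one GPU, no gradient accumulation

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

Under those assumptions the result is 832 steps per epoch and 2496 steps in total, which agrees with the training output.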

In [None]:
trainer.evaluate()




{'eval_loss': 0.09034549444913864,
 'eval_accuracy': 0.9743589743589743,
 'eval_precision': 0.9685280510849026,
 'eval_recall': 0.9688752729524492,
 'eval_f1': 0.9687016309040846,
 'eval_runtime': 7.4419,
 'eval_samples_per_second': 222.926,
 'eval_steps_per_second': 13.975,
 'epoch': 3.0}

In [None]:
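import torch

# Ensure the model is in evaluation mode
model.eval()

# Sample Hindi sentence
# Note: split() leaves the danda attached ("है।"); the treebank tokenizes punctuation separately
sentence = "यह एक उदाहरण वाक्य है।"
words = sentence.split()

# Tokenize the pre-split words; keep the BatchEncoding so word_ids() stays available
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Move inputs to the same device as the model
inputs = {k: v.to(model.device) for k, v in encoding.items()}

# Get model predictions (one per subword token, including [CLS] and [SEP])
with torch.no_grad():
    logits = model(**inputs).logits
predictions = torch.argmax(logits, dim=-1)[0].tolist()

# Keep the prediction at the first subword of each word, matching the training labels
predicted_tags = {}
for position, word_idx in enumerate(encoding.word_ids(batch_index=0)):
    if word_idx is not None and word_idx not in predicted_tags:
        predicted_tags[word_idx] = id2tag[predictions[position]]

# Display words with their corresponding POS tags
for i, word in enumerate(words):
    print(f"{word}: {predicted_tags[i]}")


Recorded output from a run that zipped `sentence.split()` directly against the per-subword predictions. Position 0 belongs to `[CLS]`, so the tags below are shifted relative to the words and do not show the model's per-word predictions:

यह: PUNCT
एक: PRON
उदाहरण: NUM
वाक्य: NOUN
है।: ADJ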




---

## ✅ Conclusion

In this project, we built a **Part-of-Speech (POS) Tagging system for Hindi** by fine-tuning the **multilingual BERT model** (`bert-base-multilingual-cased`) on the **Hindi UD Treebank dataset** with Hugging Face's NLP ecosystem. The work covered:

- Tokenizing the text and aligning word-level POS tags with BERT's subword structure
- Fine-tuning a pre-trained transformer model for **token classification** on a real-world annotated dataset
- Evaluating the model with **accuracy**, **precision**, **recall**, and **F1-score**, reaching about 97.4% token accuracy on the validation set after three epochs
- Building an inference step that predicts POS tags for new Hindi sentences

This project shows how **transfer learning** with **transformers** applies to POS tagging for a morphologically rich language such as Hindi. It provides a starting point for future work in **syntactic parsing**, **NER**, and **multilingual NLP applications**.

---
