# Report: Hindi Chunking using XLM-RoBERTa for Token Classification

## 1. Introduction
The objective of this project is to develop a chunking model for Hindi by fine-tuning a pre-trained multilingual model, **XLM-RoBERTa**, for token classification. Chunking is the process of grouping individual tokens (words) into syntactic units or phrases, such as noun phrases (NP), verb phrases (VP), or prepositional phrases (PP). The project preprocesses the dataset, creates chunk labels in BIO format, and fine-tunes a transformer-based model to predict those labels.

In this report, we describe the dataset preparation, model architecture, experiments, and evaluation results for chunking Hindi text. We utilized the Universal Dependencies (UD) Hindi Treebank and categorized tokens based on their Universal POS tags into six categories: **NP**, **VP**, **ADJP**, **ADVP**, **PP**, and **Other**.

## 2. Dataset and Preprocessing
The dataset used for this task is the **Hindi UD Treebank (HDTB)**, which provides annotated text in the CoNLL-U format. We preprocessed the dataset to remove extra columns, normalize the sentence structure, and generate chunk labels. The data is split into training, development, and test sets.
<!-- - **Training Set**: 13,304 sentences
- **Development Set**: 1,657 sentences
- **Test Set**: 1,661 sentences -->

The preprocessing steps involved:
1. Trimming each line to retain the first 10 columns of CoNLL-U files.
2. Loading sentences and extracting tokens, Universal POS tags (UPOS), and dependency relations.
3. Assigning chunk labels (BIO format) based on UPOS tags.
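
The first two steps above can be sketched as follows. This is a minimal illustration, not the project's actual loader: the function name `read_conllu`, the dictionary keys, and the sample sentence are assumptions made for the example. Skipping multiword-token ranges and empty nodes (IDs such as `1-2` or `1.1`) is also an assumption based on the CoNLL-U format.

```python
def read_conllu(text):
    """Parse CoNLL-U text into sentences of {form, upos, deprel} dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments carry no token information.
            continue
        # Step 1: keep only the first 10 tab-separated columns.
        cols = line.split("\t")[:10]
        # Assumption: skip malformed rows, multiword ranges (1-2), empty nodes (1.1).
        if len(cols) < 10 or not cols[0].isdigit():
            continue
        # Step 2: FORM is column 2, UPOS column 4, DEPREL column 8 (1-based).
        current.append({"form": cols[1], "upos": cols[3], "deprel": cols[7]})
    if current:
        sentences.append(current)
    return sentences


# Made-up two-token sample (not taken from HDTB); the first row has an extra column.
sample = (
    "# sent_id = demo\n"
    "1\tराम\tराम\tPROPN\tNNP\t_\t2\tnsubj\t_\t_\textra\n"
    "2\tगया\tजा\tVERB\tVM\t_\t0\troot\t_\t_\n"
    "\n"
)
sentences = read_conllu(sample)
```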

Chunk labels were generated by categorizing tokens as follows:
- **NP**: For tokens labeled as PROPN, NOUN, or PRON.
- **VP**: For tokens labeled as VERB.
- **ADJP**: For tokens labeled as ADJ.
- **ADVP**: For tokens labeled as ADV.
- **PP**: For tokens labeled as ADP.
- **Other**: For all other tokens.
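
A minimal sketch of how the mapping above could be turned into BIO labels. The report does not state how chunk boundaries are decided, so the rule used here is an assumption: a token continues the current chunk (`I-`) when the previous token has the same category, and otherwise opens a new one (`B-`). Writing the **Other** category as `O` is likewise an assumption, and `upos_to_bio` is a hypothetical helper name.

```python
# Category mapping taken from the list above; anything missing counts as Other.
UPOS_TO_CHUNK = {
    "PROPN": "NP", "NOUN": "NP", "PRON": "NP",
    "VERB": "VP",
    "ADJ": "ADJP",
    "ADV": "ADVP",
    "ADP": "PP",
}


def upos_to_bio(upos_tags):
    """Convert a sentence's UPOS tags into BIO chunk labels."""
    labels = []
    previous = None  # chunk category of the previous token, None for Other
    for tag in upos_tags:
        chunk = UPOS_TO_CHUNK.get(tag)
        if chunk is None:
            labels.append("O")            # assumption: Other is written as O
        elif chunk == previous:
            labels.append("I-" + chunk)   # same category continues the chunk
        else:
            labels.append("B-" + chunk)   # category change opens a new chunk
        previous = chunk
    return labels


bio = upos_to_bio(["PROPN", "NOUN", "ADP", "VERB", "AUX", "PUNCT"])
```

Under this rule, two adjacent nouns always merge into a single NP, which may not match a linguistically motivated chunker.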

### Data Representation:
Each sentence was represented with:
- **Tokens**: Word-level tokens of the sentence, as given in the treebank (split into subwords later, during tokenization).
- **UPOS**: Universal part-of-speech tags for each token.
- **Chunk Labels**: BIO (Beginning-Inside-Outside) chunk labels for syntactic chunks.

## 3. Model and Experimental Setup
For this task, we employed **XLM-RoBERTa**, a transformer-based multilingual model, fine-tuned for token classification using chunk labels. The model was initialized with pre-trained weights from the **xlm-roberta-base** checkpoint and fine-tuned on the labeled Hindi chunking data, with gradient checkpointing enabled to reduce memory usage.

### Training Configuration:
- **Batch Size**: 2 (due to memory constraints)
- **Learning Rate**: 3e-5
- **Epochs**: 5
- **Evaluation Strategy**: Evaluated on the development set after each epoch using precision, recall, and F1-score.
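
The report does not say how precision, recall, and F1 were averaged. The sketch below assumes token-level, support-weighted averaging and assumes that positions labeled `-100` are excluded from scoring. The function name `weighted_prf` is hypothetical and this is not the project's `compute_metrics()`.

```python
from collections import Counter

IGNORE_INDEX = -100  # assumption: label id marking positions excluded from scoring


def weighted_prf(gold, pred):
    """Support-weighted precision, recall and F1 over flat lists of label ids."""
    pairs = [(g, p) for g, p in zip(gold, pred) if g != IGNORE_INDEX]
    total = len(pairs)
    if total == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    support = Counter(g for g, _ in pairs)              # gold count per label
    predicted = Counter(p for _, p in pairs)            # predicted count per label
    correct = Counter(g for g, p in pairs if g == p)    # true positives per label
    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total                              # weight by gold support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"precision": precision, "recall": recall, "f1": f1}


scores = weighted_prf([0, 0, 1, 1, -100], [0, 1, 1, 1, 0])
```

A chunk-level (span-based) evaluation would give different numbers, so the averaging choice should be stated alongside any reported scores.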

### Training Setup
The model was trained for 5 epochs with a learning rate of 3e-5, a batch size of 2, and weight decay of 0.01. Gradients were accumulated over 4 steps, giving an effective batch size of 8 per device and compensating for the small per-step batch. The `compute_metrics()` function was used to evaluate the model, calculating precision, recall, and F1-score. Fine-tuning was performed using the Trainer API from the Hugging Face Transformers library.
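
The hyperparameters above can be collected in one place, as in the sketch below. The keys follow the keyword names of `TrainingArguments` in Hugging Face Transformers, but the dictionary itself and the helper `effective_batch_size` are illustrative assumptions, and the single-device default is a guess about the hardware.

```python
# Values taken from the report; key names mirror TrainingArguments keywords.
hyperparams = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 3e-5,
    "num_train_epochs": 5,
    "weight_decay": 0.01,
    "gradient_checkpointing": True,
}


def effective_batch_size(params, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return (
        params["per_device_train_batch_size"]
        * params["gradient_accumulation_steps"]
        * num_devices  # assumption: one device unless stated otherwise
    )


# With transformers installed, these could be unpacked as
# TrainingArguments(output_dir="<hypothetical path>", **hyperparams).
print(effective_batch_size(hyperparams))  # 8
```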

### Tokenization and Alignment:
The Hindi sentences were tokenized with the XLM-RoBERTa tokenizer, which splits words into subword tokens. Word-level chunk labels were aligned to these subword tokens as follows:
- The first subword token of each word retains the word's chunk label.
- Subsequent subword tokens of the same word are ignored during training unless the word carries an "I-" label (e.g., I-NP for inside a noun phrase), in which case the label is propagated to them.
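
One possible reading of this strategy is sketched below. It assumes `word_ids` is the per-subword word index list that a fast Hugging Face tokenizer returns from `BatchEncoding.word_ids()` (with `None` for special tokens), and that ignored positions are marked with `-100`, the default ignore index of PyTorch's cross-entropy loss. The function name `align_labels` and the example inputs are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_labels, label2id):
    """Map word-level BIO labels onto subword positions."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as <s> and </s> carry no label.
            aligned.append(IGNORE_INDEX)
        elif wid != previous:
            # First subword of a word keeps the word's label.
            aligned.append(label2id[word_labels[wid]])
        elif word_labels[wid].startswith("I-"):
            # Later subwords inherit the label only inside an I- chunk.
            aligned.append(label2id[word_labels[wid]])
        else:
            # Later subwords of B- or O words are ignored.
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned


# Three words, the first two split into two subwords each (illustrative only).
example = align_labels(
    [None, 0, 0, 1, 1, 2, None],
    ["B-NP", "I-NP", "B-VP"],
    {"B-NP": 0, "I-NP": 1, "B-VP": 2},
)
```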

### Label Mapping:
The chunk labels were encoded as integers, and a label-to-ID mapping was created to convert BIO labels to numeric form. This was crucial for training the token classification model.
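
A minimal sketch of such a mapping, assuming the label set consists of `O` plus `B-`/`I-` variants of the five phrase categories. The exact label inventory and ordering used in the project are not stated, so both are assumptions.

```python
CHUNK_TYPES = ["NP", "VP", "ADJP", "ADVP", "PP"]

# Assumed inventory: O first, then B-/I- pairs for each phrase category.
labels = ["O"] + [
    f"{prefix}-{chunk}" for chunk in CHUNK_TYPES for prefix in ("B", "I")
]

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}
```

Dictionaries of this shape can be supplied as the `label2id` and `id2label` configuration values when loading a token classification model with Transformers, so that predictions decode back to readable labels.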

## 4. Results and Analysis
The performance of the fine-tuned model was evaluated on both the **development** and **test** sets. Below are the key results from the experiments:

### Validation Set Results:
| Metric      | Value    |
|-------------|----------|
| Precision   | 98.652%   |
| Recall      | 98.656%   |
| F1-score    | 98.653%   |

### Test Set Results:
| Metric      | Value    |
|-------------|----------|
| Precision   | 98.515%   |
| Recall      | 98.516%   |
| F1-score    | 98.515%   |

The model performed very well in chunking Hindi text, achieving an overall F1-score of **98.515%** on the test set, within 0.14 percentage points of the development-set score. This suggests that XLM-RoBERTa is effective for syntactic chunking in a low-resource language like Hindi.

The classification report shows that the model performed particularly well on noun phrases (NP) and verb phrases (VP), while the performance was slightly lower for smaller categories such as prepositional phrases (PP) and adjective phrases (ADJP).

### Error Analysis:
Some of the errors in chunking occurred due to:
1. **Ambiguity in short sentences**: The model sometimes misclassified short, ambiguous sentences where the POS context was insufficient.
2. **Subword tokenization errors**: Subword tokenization occasionally led to incorrect alignment between tokens and chunk labels, particularly for inflected forms or compound words in Hindi.

## 5. Conclusion and Future Work
This project successfully fine-tuned a pre-trained transformer model for the task of syntactic chunking in Hindi. The use of XLM-RoBERTa showed promising results, achieving high precision, recall, and F1 scores. The model demonstrated robustness in recognizing larger phrase chunks, such as noun and verb phrases.

<!-- In future work, we could explore:
1. **Subword alignment improvements**: Improving the handling of subword tokens to reduce misalignment errors.
2. **Data augmentation**: Introducing additional training data to cover more linguistic phenomena in Hindi.
3. **Cross-lingual transfer**: Applying the model to other low-resource languages using transfer learning techniques. -->

### References:
1. Universal Dependencies Hindi Treebank (HDTB).
2. Conneau, A., et al. (2020). "Unsupervised Cross-lingual Representation Learning at Scale."

This report demonstrates the potential of transformer models for syntactic chunking in low-resource languages, highlighting the flexibility and power of pre-trained multilingual models like XLM-RoBERTa for NLP tasks.