<a href="https://colab.research.google.com/github/dhanu902/FoodieChat-Bot/blob/main/BOT_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Load Datasets**

In [1]:
import json
from datasets import Dataset

# -----> Load from Drive
with open('/content/drive/MyDrive/ChatBot/Preprocessed/bio_train.json') as f:
  train_data = json.load(f)

with open('/content/drive/MyDrive/ChatBot/Preprocessed/bio_test.json') as f:
  test_data = json.load(f)

with open('/content/drive/MyDrive/ChatBot/Preprocessed/bio_val.json') as f:
  val_data = json.load(f)

# -----> Convert to HuggingFace DataSet
train_dataset = Dataset.from_list(train_data)
test_dataset = Dataset.from_list(test_data)
val_dataset = Dataset.from_list(val_data)

---
**🔄 Why use `datasets.Dataset`**
  * HuggingFace `Trainer` works directly with this format
  * Enables batch processing, smart caching and easy tokenization through `.map()`
  * Backed by Apache Arrow, so it is usually more memory-efficient than Pandas or a plain list for token classification tasks
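
Since the alignment step later indexes `tags` by word position, a quick sanity check on the raw records before calling `Dataset.from_list` may save a confusing error further down. The sketch below assumes each record is a dictionary with `tokens` and `tags` keys, as the later cells imply. The records, the `B-FOOD` tag, and the helper name `find_misaligned` are made up for illustration.

```python
# Hypothetical records in the shape the notebook loads from the JSON files
records = [
    {"tokens": ["book", "a", "hotel"], "tags": ["O", "O", "O"]},
    {"tokens": ["pizza", "in", "Colombo"], "tags": ["B-FOOD", "O"]},  # one tag missing
]

def find_misaligned(records):
    """Return the indices of records whose tokens and tags differ in length."""
    return [i for i, r in enumerate(records) if len(r["tokens"]) != len(r["tags"])]

bad = find_misaligned(records)
print(bad)  # [1]
```

An empty list means every word has exactly one tag, which is what the word-level lookup in the alignment function relies on.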

## **Extract Unique Tags & Build Label Maps**

In [11]:
u_tags = sorted({tag for sample in train_dataset for tag in sample['tags']})
tag_2_id = {tag: index for index, tag in enumerate(u_tags)}
id_2_tag = {index: tag for tag, index in tag_2_id.items()}
num_labels = len(u_tags)

🔄 **Why**
  * BERT expects numeric IDs, not string labels
  * These maps allow you to:
    * convert `tags` → IDs for training
    * convert IDs → `tags` during inference

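---

#### 1. Extract Unique Tags

* `u_tags` collects every distinct tag found in `train_dataset` and sorts it, so the ID assignment is the same on every run.

👉 Example:
If your dataset has: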

```python
train_dataset = [
    {"tokens": ["John", "lives", "in", "London"], "tags": ["B-PER", "O", "O", "B-LOC"]},
    {"tokens": ["IBM", "is", "a", "company"], "tags": ["B-ORG", "O", "O", "O"]}
]
```

then:

```
u_tags = ['B-LOC', 'B-ORG', 'B-PER', 'O']
```

---

#### 2. Map Tags → IDs

```python
tag_2_id = {tag: index for index, tag in enumerate(u_tags)}
```

* Creates a dictionary mapping **each tag to a unique ID** (integer).
* Needed because ML models work with numbers, not text.

👉 Example:

```
tag_2_id = {
  'B-LOC': 0,
  'B-ORG': 1,
  'B-PER': 2,
  'O': 3
}
```

---

#### 3. Map IDs → Tags

```python
id_2_tag = {index: tag for tag, index in tag_2_id.items()}
```

* Reverse dictionary to map **IDs back to tags**.
* Useful when decoding model predictions back to human-readable NER labels.

👉 Example:

```
id_2_tag = {
  0: 'B-LOC',
  1: 'B-ORG',
  2: 'B-PER',
  3: 'O'
}
```

---

#### 4. Count Labels

```python
num_labels = len(u_tags)
```

* Counts total unique NER tags.
* Used to define the **output layer size** of the model (e.g., final softmax classifier).

👉 Example:

```
num_labels = 4
```

---

### 🎯 Summary in Words

* `u_tags` → List of all unique NER tags.
* `tag_2_id` → Dictionary: Tag → Numeric ID (for training).
* `id_2_tag` → Dictionary: Numeric ID → Tag (for decoding).
* `num_labels` → Number of unique tags (size of model’s output).
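
As a rough illustration of how the two maps work together, the sketch below encodes one sample's tags and decodes them again. It also checks for tags that appear in another split but not in training: the maps are built from `train_dataset` only, so such a tag would raise a `KeyError` during label alignment. The sample data, the `B-DATE` tag, and the helper name `unseen_tags` are assumptions made for this example.

```python
# Toy samples in the same shape as the notebook's data (hypothetical values)
samples = [
    {"tokens": ["John", "lives", "in", "London"], "tags": ["B-PER", "O", "O", "B-LOC"]},
    {"tokens": ["IBM", "is", "a", "company"], "tags": ["B-ORG", "O", "O", "O"]},
]

u_tags = sorted({tag for sample in samples for tag in sample["tags"]})
tag_2_id = {tag: index for index, tag in enumerate(u_tags)}
id_2_tag = {index: tag for tag, index in tag_2_id.items()}

# Round trip: tags -> IDs -> tags
encoded = [tag_2_id[t] for t in samples[0]["tags"]]
decoded = [id_2_tag[i] for i in encoded]
print(encoded)  # [2, 3, 3, 0]
print(decoded)  # ['B-PER', 'O', 'O', 'B-LOC']

def unseen_tags(split, known):
    """Tags that occur in `split` but are missing from the training label map."""
    return sorted({t for s in split for t in s["tags"]} - set(known))

val_samples = [{"tokens": ["Monday"], "tags": ["B-DATE"]}]
print(unseen_tags(val_samples, tag_2_id))  # ['B-DATE']
```

If the check returns anything for the validation or test split, one option is to build `u_tags` from all three splits instead of the training split alone.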

---


## **Tokenize + Align Labels**

In [22]:
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

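def Tokenize_AlignLabels(ex):
    # Tokenize pre-split words; pad or truncate every example to 64 tokens
    tokenize = tokenizer(ex['tokens'],
                         is_split_into_words=True,
                         padding='max_length',
                         truncation=True,
                         max_length=64)

    # For each token, the index of the word it came from (None for special tokens)
    word_ids = tokenize.word_ids()
    label_ids = []
    prev_wordIDX = None

    for wordIDX in word_ids:
      if wordIDX is None:
        # [CLS], [SEP], [PAD]: -100 is ignored by the loss
        label_ids.append(-100)
      elif wordIDX != prev_wordIDX:
        # First subword of a word: use the word's tag
        label_ids.append(tag_2_id[ex['tags'][wordIDX]])
      else:
        # Continuation subword: reuse the tag of the first subword
        label_ids.append(tag_2_id[ex['tags'][wordIDX]])

      prev_wordIDX = wordIDX

    tokenize['labels'] = label_ids
    return tokenize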

🧠 **What is `BertTokenizerFast`?**

* ✅ It's a **"Fast"** version of the standard `BertTokenizer`.
* ✅ Built using the 🤗 **Tokenizers library** (written in Rust), which makes it:

  * ⚡ Much faster
  * ✅ More efficient
  * 🔍 Better support for **word-level alignment** (important for NER tasks)

---

**🔍 Why use `BertTokenizerFast` instead of `BertTokenizer`?**

| Feature                  | BertTokenizer (slow)                         | BertTokenizerFast 🚀                    |
| ------------------------ | -------------------------------------------- | --------------------------------------- |
| **Speed**                | Slower (Python backend)                      | 🚀 Faster (Rust backend)                |
| **Word alignment (NER)** | Must be implemented manually                 | ✅ Built-in `word_ids()`                 |
| **Token offsets**        | ❌ No                                         | ✅ Supports offset mappings              |
| **Subword tracking**     | Limited                                      | ✅ Better control over subwords          |
| **Hugging Face use**     | Still supported, but lacks alignment helpers | ✅ Recommended for token classification  |

---

**🧪 Why is it well suited to NER / BIO tagging?**

In **NER**, you must align each **tag** with the correct **token**.

Example:

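```python
sentence = ["book", "a", "hotel"]
tokens = tokenizer(sentence, is_split_into_words=True)
print(tokens.word_ids())
# [None, 0, 1, 2, None]  (None marks [CLS] and [SEP]; assumes each word is a single WordPiece)
```

This maps each token position back to the index of its **original word**.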

✅ `BertTokenizerFast` makes it easy with `.word_ids()`, so you can:

* Assign tags to **subwords**
* Skip `[CLS]`, `[SEP]`, `[PAD]` tokens

---

**✅ Code Comparison**

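```python
# Fast tokenizer: supports .word_ids()
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# Slow tokenizer, for comparison: BertTokenizer.from_pretrained('bert-base-cased')
```

⚠️ If you accidentally use `BertTokenizer`, `.word_ids()` is unavailable (calling it raises an error), which breaks label alignment in NER.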

---

**📌 TL;DR**

👉 Prefer **`BertTokenizerFast`** for modern **NER / token classification** tasks,
especially when you care about:

* ✅ Label alignment
* ✅ Speed
* ✅ Hugging Face compatibility

In [23]:
train_dataset = train_dataset.map(Tokenize_AlignLabels)
test_dataset = test_dataset.map(Tokenize_AlignLabels)
val_dataset = val_dataset.map(Tokenize_AlignLabels)

train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])
val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'labels'])

Map:   0%|          | 0/58107 [00:00<?, ? examples/s]

Map:   0%|          | 0/12452 [00:00<?, ? examples/s]

Map:   0%|          | 0/12452 [00:00<?, ? examples/s]

#### 1. Tokenize input words

* `ex['tokens']`: the sentence as a list of words (e.g., `["John", "lives", "in", "London"]`).
* `is_split_into_words=True`: tells the tokenizer these are **already split words**, not a raw string.
* `padding='max_length'`, `truncation=True`, `max_length=64`: pad shorter inputs and truncate longer ones so every example is exactly `64` tokens long.
* Output: dictionary with `input_ids`, `attention_mask`, etc.
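
To make the fixed-length behavior concrete, here is a simplified sketch of what padding and truncation do to `input_ids` and `attention_mask`. It is not the tokenizer's actual implementation: the helper `pad_or_truncate` is hypothetical, the two middle token IDs are made up, and unlike the real tokenizer it does not keep `[SEP]` as the last token when it truncates.

```python
PAD_ID = 0  # [PAD] has ID 0 in BERT's vocabulary

def pad_or_truncate(input_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = input_ids[:max_length]          # cut anything beyond max_length
    mask = [1] * len(ids)                 # 1 = real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding

# 101 and 102 are [CLS] and [SEP]; 1287 and 2491 are arbitrary example IDs
ids, mask = pad_or_truncate([101, 1287, 2491, 102], max_length=6)
print(ids)   # [101, 1287, 2491, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The `attention_mask` is what lets the model skip the padded positions, while the `-100` labels described below keep them out of the loss.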

---

#### 2. Get word-to-token alignment

```python
word_ids = tokenize.word_ids()
```

* Maps each **subword token** back to the original **word index**.
  Example (illustrative; the actual split depends on the tokenizer's vocabulary):

```
tokens: ["playing"]
subwords: ["play", "##ing"]
word_ids: [0, 0]
```

Special tokens like `[CLS]`, `[SEP]`, and `[PAD]` → `None`.
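
For intuition only, the mapping can be approximated from the WordPiece strings themselves, because a `##` prefix marks a piece that continues the previous word. The helper `word_ids_from_wordpieces` below is a hypothetical heuristic, not how the tokenizer computes `word_ids()`. The real method tracks the original word boundaries directly, so it also handles cases this sketch gets wrong, such as a pre-split word containing punctuation.

```python
SPECIAL = {"[CLS]", "[SEP]", "[PAD]"}

def word_ids_from_wordpieces(pieces):
    """Approximate word_ids(): '##' marks a continuation of the previous word."""
    word_ids, current = [], -1
    for piece in pieces:
        if piece in SPECIAL:
            word_ids.append(None)       # special tokens belong to no word
        elif piece.startswith("##"):
            word_ids.append(current)    # same word as the previous piece
        else:
            current += 1                # a new word starts here
            word_ids.append(current)
    return word_ids

pieces = ["[CLS]", "play", "##ing", "chess", "[SEP]", "[PAD]"]
print(word_ids_from_wordpieces(pieces))  # [None, 0, 0, 1, None, None]
```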

---

#### 3. Initialize label list

```python
label_ids = []
prev_wordIDX = None
```

* We’ll build a list of labels (NER tags) aligned with tokens.
* `prev_wordIDX` stores the previous token's word index, so a repeated index signals a continuation subword.

---

#### 4. Assign labels to tokens

```python
for wordIDX in word_ids:
    if wordIDX is None:
        label_ids.append(-100)
```

* If token is special (`[CLS]`, `[SEP]`, `[PAD]`) → assign `-100`.
  👉 `-100` is ignored by PyTorch’s loss function (`CrossEntropyLoss`).

```python
elif wordIDX != prev_wordIDX:
    label_ids.append(tag_2_id[ex['tags'][wordIDX]])
```

* If it’s the **first token** of a word → assign the corresponding NER label.
* Example: `"John" → B-PER`.

```python
else:
    label_ids.append(tag_2_id[ex['tags'][wordIDX]])
```

* If it’s a **continuation subword**, the same label is applied, so both branches currently append the same ID.
  (Some setups instead assign `-100` or the matching `"I-XXX"` tag here, depending on the labeling scheme.)

```python
prev_wordIDX = wordIDX
```

* Update tracker so we know whether the next token belongs to the same word.
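
The loop above can be sketched as a standalone function that works on a hand-written `word_ids` list, which makes it easy to compare ways of labeling continuation subwords. The `subword_strategy` parameter and the function name `align_labels` are assumptions for illustration. Only the `"same"` option mirrors what the notebook does, and the `"inside"` option assumes the matching `I-XXX` tag exists in the label map.

```python
IGNORE = -100  # label value skipped by the loss

def align_labels(word_ids, tags, tag_2_id, subword_strategy="same"):
    """Expand word-level tags to token-level label IDs.

    subword_strategy:
      "same"   - continuation subwords copy the word's tag (the notebook's behavior)
      "ignore" - continuation subwords get -100
      "inside" - continuation subwords of a B-XXX word get I-XXX
    """
    label_ids, prev = [], None
    for wid in word_ids:
        if wid is None:
            label_ids.append(IGNORE)
        elif wid != prev or subword_strategy == "same":
            label_ids.append(tag_2_id[tags[wid]])
        elif subword_strategy == "ignore":
            label_ids.append(IGNORE)
        else:  # "inside"
            tag = tags[wid]
            if tag.startswith("B-"):
                tag = "I-" + tag[2:]
            label_ids.append(tag_2_id[tag])
        prev = wid
    return label_ids

tag_2_id = {"B-LOC": 0, "I-LOC": 1, "O": 2}
tags = ["O", "B-LOC"]
word_ids = [None, 0, 1, 1, None]  # the second word is split into two subwords

print(align_labels(word_ids, tags, tag_2_id, "same"))    # [-100, 2, 0, 0, -100]
print(align_labels(word_ids, tags, tag_2_id, "ignore"))  # [-100, 2, 0, -100, -100]
print(align_labels(word_ids, tags, tag_2_id, "inside"))  # [-100, 2, 0, 1, -100]
```

With `"same"`, a split `B-LOC` word yields two consecutive `B-LOC` labels, which reads like two separate entities under a strict BIO interpretation. The other two options avoid that, and which one works better for this dataset would need to be tested.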

---

#### 5. Attach labels to tokenized data

```python
tokenize['labels'] = label_ids
return tokenize
```

* Adds the aligned labels to the tokenized dictionary.
* Final dictionary has:

  * `input_ids`
  * `attention_mask`
  * `labels` (NER tag IDs aligned to tokens)

---

### 📊 Example Walkthrough

Sentence:

```python
ex = {
  "tokens": ["John", "lives", "in", "New", "York"],
  "tags": ["B-PER", "O", "O", "B-LOC", "I-LOC"]
}
```

Tokenizer output (simplified):

```
tokens: [CLS], John, lives, in, New, York, [SEP], [PAD] ...
word_ids: [None, 0, 1, 2, 3, 4, None, None ...]
```

Aligned labels (shown as tag names for readability; the function stores their integer IDs):

```
labels: [-100, B-PER, O, O, B-LOC, I-LOC, -100, -100 ...]
```

---

### 🎯 Summary

* **Purpose**: Aligns NER tags with subword tokens from the tokenizer.
* **Why needed**: Word-level labels must match subword tokens for training.
* **Special tokens**: `-100` so they’re ignored in loss.
* **Output**: Dictionary with `input_ids`, `attention_mask`, and aligned `labels`.
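
To show what "ignored in loss" means in practice, the following pure-Python sketch computes a mean cross-entropy that skips every position labeled `-100`. It imitates the effect of `ignore_index=-100` and is not PyTorch's implementation. The function name `masked_cross_entropy` and the toy logits are made up, and returning `0.0` when every position is ignored is a choice made for this sketch.

```python
import math

IGNORE = -100

def masked_cross_entropy(logits, labels, ignore_index=IGNORE):
    """Mean negative log-likelihood over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special or padding token: contributes nothing
        # log of the softmax denominator, shifted by the max for numerical stability
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0

logits = [[5.0, 0.0], [0.0, 0.0], [0.0, 5.0]]  # 3 tokens, 2 classes
labels = [-100, 0, 1]                          # first token is ignored
loss = masked_cross_entropy(logits, labels)
print(round(loss, 4))  # 0.3499

# Changing the logits of the ignored position leaves the loss unchanged
other = masked_cross_entropy([[-9.0, 9.0]] + logits[1:], labels)
print(other == loss)  # True
```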

## **Load BERT for Token Classification**

In [24]:
from transformers import BertForTokenClassification

model = BertForTokenClassification.from_pretrained('bert-base-cased',
                                                      num_labels=num_labels,
                                                      id2label=id_2_tag,
                                                      label2id=tag_2_id)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## **Trainer & Training Arguments**

In [None]:
from transformers import TrainingArguments, Trainer
from sklearn.metrics import classification_report
import numpy as np

def compute_metrics(pred):
    predictions, labels = pred
    preds = np.argmax(predictions, axis=2)

    # Flatten to token level and drop ignored positions (-100):
    # sklearn's classification_report expects flat label sequences
    true_labels = [id_2_tag[l] for label_seq in labels
                   for l in label_seq if l != -100]
    true_preds = [id_2_tag[p] for pred_seq, label_seq in zip(preds, labels)
                  for p, l in zip(pred_seq, label_seq) if l != -100]

    report = classification_report(true_labels, true_preds, output_dict=True)
    return {
        'f1': report["weighted avg"]["f1-score"],
        'accuracy': report["accuracy"]
    }

args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    save_strategy='epoch',