# 🌾 AgriBot NER Training with PhoBERT (IoT Focus)

Train a PhoBERT-based NER model for a Vietnamese agricultural IoT chatbot.

**Entity Types (6 types):**
- `DATE`: tháng này, quý 1, năm nay, tháng 11, ...
- `CROP`: cam sành, lúa ST25, xoài cát chu, ...
- `AREA`: khu A, khu B, khu 1, ...
- `DURATION`: 5 phút, 10 phút, 1 giờ, ...
- `DEVICE`: máy bơm, đèn, tưới, bơm, ...
- `METRIC`: nhiệt độ, độ ẩm, ánh sáng, ...

## 📦 Step 1: Install Dependencies

In [None]:
!pip install transformers datasets torch scikit-learn seqeval pandas -q

## 📊 Step 2: Load Training Data

**Upload the CSV file** generated by `generate_ner_data_v2.py`.

In [None]:
import json
import pandas as pd
from typing import List, Dict
from google.colab import files

# Target format for each example:
#   Text: ngô ngọt bị bệnh gì
#   Entities: [(0, 8, 'CROP')]

# Upload CSV file
print("📤 Please upload your CSV file (generated by generate_ner_data_v2.py)")
uploaded = files.upload()

# Get the uploaded filename
csv_filename = list(uploaded.keys())[0]
print(f"\n📂 Loading data from {csv_filename}...")

# Load CSV
df = pd.read_csv(csv_filename)

# Convert to training format: [(text, [(start, end, label), ...]), ...]
training_data = []

for _, row in df.iterrows():
    text = row['text']
    entities_json = json.loads(row['entities'])
    # Convert to (start, end, type) tuples
    entities = [(e['start'], e['end'], e['type']) for e in entities_json]
    training_data.append((text, entities))

print(f"\n✅ Loaded {len(training_data)} training examples from CSV")

# Show sample
print("\n📝 Sample data:")
for i, (text, entities) in enumerate(training_data[:5]):
    print(f"\n{i+1}. Text: {text}")
    print(f"   Entities: {entities}")

# Sample output:
# 📝 Sample data:
# 1. Text: ngô ngọt bị bệnh gì
#    Entities: [(0, 8, 'CROP')]
# 2. Text: Thông tin về giống hành tím
#    Entities: [(19, 27, 'CROP')]
# 3. Text: Sâu đục thân ở mướp xử lý như thế nào
#    Entities: [(15, 19, 'CROP')]
# 4. Text: Cách trồng dưa lưới
#    Entities: [(11, 19, 'CROP')]
# 5. Text: Tôi muốn biết về giống cam Cao Phong
#    Entities: [(23, 36, 'CROP')]

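Character offsets are easy to get wrong, so it can help to confirm that every span slices out the expected text before converting anything. The sketch below assumes the `text` / `entities` columns and the `start` / `end` / `type` keys used in the cell above; the helper name `check_spans` and the inline two-row CSV are hypothetical stand-ins for the uploaded file. It also assumes the generator emitted NFC-normalized text, since offsets are counted in code points.

```python
import csv
import io
import json
import unicodedata

# Hypothetical two-row CSV in the same shape as the uploaded file.
SAMPLE_CSV = '''text,entities
ngô ngọt bị bệnh gì,"[{""start"": 0, ""end"": 8, ""type"": ""CROP""}]"
Cách trồng dưa lưới,"[{""start"": 11, ""end"": 19, ""type"": ""CROP""}]"
'''

def check_spans(text, entities):
    """Return (surface, type) for each (start, end, type) span; raise on bad offsets."""
    surfaces = []
    for start, end, entity_type in entities:
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is outside text of length {len(text)}")
        surface = text[start:end]
        if surface != surface.strip():
            raise ValueError(f"span ({start}, {end}) starts or ends on whitespace: {surface!r}")
        surfaces.append((surface, entity_type))
    return surfaces

# Assumption: offsets were computed on NFC text, so normalize before slicing.
rows = list(csv.DictReader(io.StringIO(unicodedata.normalize("NFC", SAMPLE_CSV))))
checked = []
for row in rows:
    spans = [(e["start"], e["end"], e["type"]) for e in json.loads(row["entities"])]
    checked.append(check_spans(row["text"], spans))

print(checked)
```

Running the same check over `training_data` would surface off-by-one offsets before they silently turn into wrong BIO tags.
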
## 🔄 Step 3: Convert to BIO Format

In [None]:
def convert_to_bio_format(data: List[tuple]) -> List[Dict]:
    """
    Convert character-span annotations to word-level BIO tags.

    The source data marks entities by character position (start, end), not by
    word, while the model needs one label per word.

    Input:
        text = "ngô ngọt bị bệnh gì"
        entities = [(0, 8, 'CROP')]
    Output:
        {
            "tokens": ["ngô", "ngọt", "bị", "bệnh", "gì"],
            "ner_tags": ["B-CROP", "I-CROP", "O", "O", "O"]
        }

    Steps:
        1. Iterate over the samples.
        2. Tokenize each sentence with split().
        3. Build a character-index -> word-index map.
        4. Tag the first word of each entity B-<type> and the rest I-<type>.

    Worked example:
        text = "Bật tưới khu A trong 5 phút"
        entities = [
            (4, 8, 'DEVICE'),     # "tưới"
            (9, 14, 'AREA'),      # "khu A"
            (21, 27, 'DURATION')  # "5 phút"
        ]
        Step 1, split into words:
            words = ["Bật", "tưới", "khu", "A", "trong", "5", "phút"]
            labels = ["O", "O", "O", "O", "O", "O", "O"]
        Step 2, build the map (the spaces at 3, 8, 12, ... get no entry):
            char_to_word = {
                0: 0, 1: 0, 2: 0,        # "Bật"
                4: 1, 5: 1, 6: 1, 7: 1,  # "tưới"
                9: 2, 10: 2, 11: 2,      # "khu"
                13: 3,                   # "A"
                ...
            }
        Step 3, assign labels:
            (4, 8, 'DEVICE'): range(4, 8) covers word {1}
                labels[1] = "B-DEVICE"
            (9, 14, 'AREA'): range(9, 14) covers words {2, 3}
                labels[2] = "B-AREA", labels[3] = "I-AREA"
            (21, 27, 'DURATION'): range(21, 27) covers words {5, 6}
                labels[5] = "B-DURATION", labels[6] = "I-DURATION"
        Result:
            {
                "tokens": ["Bật", "tưới", "khu", "A", "trong", "5", "phút"],
                "ner_tags": ["O", "B-DEVICE", "B-AREA", "I-AREA", "O", "B-DURATION", "I-DURATION"]
            }
    """
    bio_data = []
    # Iterate over the samples, unpacking the text and entities of each
    for text, entities in data:
        # Tokenize by whitespace; every word starts as "O"
        words = text.split()
        labels = ['O'] * len(words)

        # Loop 1 (words): map each character index to the index of its word.
        # For "ngô ngọt ...": chars 0-2 -> word 0 ("ngô"), chars 4-7 -> word 1 ("ngọt").
        # Whitespace positions (3, 8, ...) are deliberately left out of the map.
        char_to_word = {}
        current_pos = 0
        for word_idx, word in enumerate(words):
            word_start = text.find(word, current_pos)
            word_end = word_start + len(word)
            for char_idx in range(word_start, word_end):
                char_to_word[char_idx] = word_idx
            current_pos = word_end

        # Loop 2 (entities): collect the words each span touches, then tag them.
        # For (0, 8, 'CROP'): chars 0-2 hit word 0, char 3 is a space (no entry),
        # chars 4-7 hit word 1, so entity_words = {0, 1}.
        for start, end, entity_type in entities:
            entity_words = set()
            for char_idx in range(start, end):
                if char_idx in char_to_word:
                    entity_words.add(char_to_word[char_idx])

            entity_words = sorted(entity_words)
            if entity_words:
                # First word gets the B- tag
                labels[entity_words[0]] = f"B-{entity_type}"
                # Remaining words get the I- tag
                for word_idx in entity_words[1:]:
                    labels[word_idx] = f"I-{entity_type}"

        bio_data.append({
            "tokens": words,
            "ner_tags": labels
        })

    return bio_data

bio_dataset = convert_to_bio_format(training_data)

# Display first example
print("\n📝 Example BIO format:")
example = bio_dataset[0]
for token, tag in zip(example['tokens'], example['ner_tags']):
    print(f"{token:20} → {tag}")

print(f"\n✅ Converted {len(bio_dataset)} examples to BIO format")

# Sample output:
# 📝 Example BIO format:
# ngô                  → B-CROP
# ngọt                 → I-CROP
# bị                   → O
# bệnh                 → O
# gì                   → O
# ✅ Converted 2000 examples to BIO format

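One way to sanity-check the conversion is to run it backwards: rebuild character spans from the BIO tags and compare them with the original annotations. The function below is a minimal sketch of that inverse, not part of the notebook; the name `bio_to_spans` is hypothetical, and it assumes the original text equals the tokens joined by single spaces.

```python
def bio_to_spans(tokens, tags):
    """Rebuild (start, end, type) character spans from word-level BIO tags.

    Assumes the original text is the tokens joined by single spaces.
    """
    spans = []
    pos = 0          # character offset of the current token
    current = None   # [start, end, type] of the entity being built
    for token, tag in zip(tokens, tags):
        start, end = pos, pos + len(token)
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and current[2] == tag[2:]:
            current[1] = end   # extend the open entity
        else:
            if current:
                spans.append(tuple(current))
            current = None
        pos = end + 1          # skip the single space between tokens
    if current:
        spans.append(tuple(current))
    return spans

tokens = ["Bật", "tưới", "khu", "A", "trong", "5", "phút"]
tags = ["O", "B-DEVICE", "B-AREA", "I-AREA", "O", "B-DURATION", "I-DURATION"]
print(bio_to_spans(tokens, tags))
```

Any sample where the rebuilt spans differ from the CSV annotations points to an offset that starts or ends in the middle of a word, or to text containing repeated whitespace.
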
## 🏷️ Step 4: Create Label Mapping

In [None]:
# Extract all unique labels
all_labels = set()
for example in bio_dataset:
    all_labels.update(example['ner_tags'])

# Sort labels: 'O' first, then alphabetically, which puts every B- tag before the I- tags
label_list = sorted(all_labels, key=lambda x: (x != 'O', x))

# Create label mappings
label2id = {label: idx for idx, label in enumerate(label_list)}
id2label = {idx: label for label, idx in label2id.items()}

print(f"\n🏷️ Total labels: {len(label_list)}")
print("\nLabel mapping:")
for label, idx in label2id.items():
    print(f"{idx:2d}: {label}")

# Save label mapping (entity types sorted so the file is stable across runs)
label_mapping = {
    "label_to_id": label2id,
    "id_to_label": id2label,
    "entity_types": sorted({label.split('-', 1)[1] for label in label_list if '-' in label})
}

with open('label_mapping.json', 'w', encoding='utf-8') as f:
    json.dump(label_mapping, f, ensure_ascii=False, indent=2)

print("\n✅ Saved label_mapping.json")

# Sample output:
# 🏷️ Total labels: 13
#
# Label mapping:
#  0: O
#  1: B-AREA
#  2: B-CROP
#  3: B-DATE
#  4: B-DEVICE
#  5: B-DURATION
#  6: B-METRIC
#  7: I-AREA
#  8: I-CROP
#  9: I-DATE
# 10: I-DEVICE
# 11: I-DURATION
# 12: I-METRIC
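
One detail worth knowing when `label_mapping.json` is read back later: JSON object keys are always strings, so the integer keys of `id_to_label` come back as `"0"`, `"1"`, and so on. The sketch below rebuilds the 13-label mapping from the six entity types and shows the round trip in memory; the variable names `raw_id2label` and `restored_id2label` are illustrative only.

```python
import json

ENTITY_TYPES = ["AREA", "CROP", "DATE", "DEVICE", "DURATION", "METRIC"]

# One "O" label plus a B-/I- pair per entity type: 1 + 2 * 6 = 13 labels.
label_list = ["O"] + [f"B-{t}" for t in ENTITY_TYPES] + [f"I-{t}" for t in ENTITY_TYPES]
label2id = {label: idx for idx, label in enumerate(label_list)}
id2label = {idx: label for label, idx in label2id.items()}

# Round-trip through JSON text, as saving and reloading label_mapping.json would.
payload = json.dumps({"label_to_id": label2id, "id_to_label": id2label}, ensure_ascii=False)
loaded = json.loads(payload)

# The integer IDs come back as string keys and need converting before use.
raw_id2label = loaded["id_to_label"]
restored_id2label = {int(idx): label for idx, label in raw_id2label.items()}

print(len(label_list))            # 13
print(sorted(raw_id2label)[:3])   # ['0', '1', '10']
print(restored_id2label[4])       # B-DEVICE
```

A service that loads the mapping and indexes it with the integer output of `argmax` would otherwise hit a `KeyError`.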

## 📚 Step 5: Prepare Dataset for Training

In [None]:
from datasets import Dataset
from sklearn.model_selection import train_test_split

# Convert to HuggingFace Dataset format
def prepare_dataset(bio_data, label2id):
    dataset_dict = {
        "tokens": [],    # list of token lists, one per sample
        "ner_tags": []   # list of label-ID lists (IDs taken from label2id)
    }

    for example in bio_data:
        dataset_dict["tokens"].append(example["tokens"])
        # Convert labels to IDs
        tag_ids = [label2id[tag] for tag in example["ner_tags"]]
        dataset_dict["ner_tags"].append(tag_ids)
    # Wrap the dict in a Dataset object, the format the Transformers Trainer expects
    return Dataset.from_dict(dataset_dict)

# bio_dataset is a list of items shaped like:
#   {"tokens": ["ngô", "ngọt", "bị", "bệnh", "gì"],
#    "ner_tags": ["B-CROP", "I-CROP", "O", "O", "O"]}
#   {"tokens": ["Bật", "tưới", "khu", "A"],
#    "ner_tags": ["O", "B-DEVICE", "B-AREA", "I-AREA"]}
#   ... 1998 more samples

# Split train/validation (80/20)
train_data, val_data = train_test_split(bio_dataset, test_size=0.2, random_state=42)

# Example dataset_dict for three samples (IDs follow the Step 4 mapping):
# dataset_dict = {
#     "tokens": [
#         ["ngô", "ngọt", "bị", "bệnh", "gì"],
#         ["Bật", "tưới", "khu", "A"],
#         ["Chi", "phí", "tháng", "này"]
#     ],
#     "ner_tags": [
#         [2, 8, 0, 0, 0],   # B-CROP, I-CROP, O, O, O
#         [0, 4, 1, 7],      # O, B-DEVICE, B-AREA, I-AREA
#         [0, 0, 3, 9]       # O, O, B-DATE, I-DATE
#     ]
# }
train_dataset = prepare_dataset(train_data, label2id)
val_dataset = prepare_dataset(val_data, label2id)

print(f"\n📊 Dataset split:")
print(f"  Training: {len(train_dataset)} examples")
print(f"  Validation: {len(val_dataset)} examples")

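Because the split is random rather than stratified, a rare entity type could end up in only one of the two splits. A quick count of `B-` tags per split makes that visible. This is a sketch under the assumption that the data has the `tokens` / `ner_tags` shape shown above; the helper names and toy splits are hypothetical.

```python
from collections import Counter

def entity_counts(bio_data):
    """Count entities per type by counting B- tags (one per entity)."""
    counts = Counter()
    for example in bio_data:
        for tag in example["ner_tags"]:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts

def missing_types(train_data, val_data):
    """Entity types that appear in one split but not in the other."""
    train_types = set(entity_counts(train_data))
    val_types = set(entity_counts(val_data))
    return {
        "only_in_train": sorted(train_types - val_types),
        "only_in_val": sorted(val_types - train_types),
    }

# Hypothetical toy splits in the same shape as train_data / val_data.
toy_train = [
    {"tokens": ["ngô", "ngọt", "bị", "bệnh", "gì"],
     "ner_tags": ["B-CROP", "I-CROP", "O", "O", "O"]},
    {"tokens": ["Bật", "tưới", "khu", "A"],
     "ner_tags": ["O", "B-DEVICE", "B-AREA", "I-AREA"]},
]
toy_val = [
    {"tokens": ["Chi", "phí", "tháng", "này"],
     "ner_tags": ["O", "O", "B-DATE", "I-DATE"]},
]

print(entity_counts(toy_train))
print(missing_types(toy_train, toy_val))
```

With 2000 examples and six types an empty bucket is unlikely, but a heavily skewed count is still worth knowing about before reading the validation F1.
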
## 🤖 Step 6: Load PhoBERT Model

In [None]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "vinai/phobert-base"
num_labels = len(label_list)

print(f"Loading {model_name}...")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

print(f"✅ Model loaded on {device}")
print(f"   Number of labels: {num_labels}")

## 🔧 Step 7: Tokenize Dataset

In [None]:
# PhoBERT's tokenizer does not support word_ids(), so labels are aligned manually.
#
# Goal: turn word-level NER data (BIO) into subword-level data that PhoBERT can train on.
#
# Input (word level):
#   tokens   = ["ngô", "ngọt", "bị", "bệnh", "gì"]
#   ner_tags = ["B-CROP", "I-CROP", "O", "O", "O"]
#
# PhoBERT works on subword tokens, not words. The model only sees individual
# tokens and does not know that "ngô" is a complete word, and one word may be
# split into several pieces (PhoBERT's BPE marks a non-final piece with a
# trailing "@@"). So if a word becomes two tokens, which token gets the label?
# Here every piece inherits its word's label; special tokens and padding get -100.
#
# Illustration (the split of "bệnh" is made up for the example):
#   Text:     "ngô ngọt bị bệnh gì"
#   Words:    ngô      ngọt     bị   bệnh   gì
#   Tags:     B-CROP   I-CROP   O    O      O
#   Subwords: <s>   ngô     ngọt    bị  b@@  ệnh  gì  </s>  <pad>  ...
#   Labels:   -100  B-CROP  I-CROP  O   O    O    O   -100  -100   ...
#
# This cell performs four main steps:
#   1. Join the words into a full sentence.
#        Input:  ["Bật", "máy", "bơm"]
#        Output: "Bật máy bơm"
#   2. Tokenize with the PhoBERT tokenizer (max_length=128, padding="max_length").
#   3. Initialize labels with -100, the ignore index that is skipped when
#      computing the loss; it stays on special tokens and padding.
#   4. Align labels from word level to token level. For each token:
#        a. Skip it if it is a special token (<s>, </s>, <pad>).
#        b. Decode the token to text.
#        c. Compare it with the current word (word_idx).
#        d. If it matches: labels[i] = ner_tags[word_idx], and advance word_idx
#           once the word is complete.
#        e. If it does not match: advance word_idx to the next word and assign
#           that word's label instead.

def tokenize_and_align_labels(examples):
    """
    Tokenize text and align NER labels with subword tokens.
    PhoBERT's tokenizer doesn't support word_ids(), so alignment is done manually.
    """
    # Standard HF Trainer input format for NER
    tokenized_inputs = {
        "input_ids": [],
        "attention_mask": [],
        "labels": []
    }
    special_ids = {tokenizer.bos_token_id, tokenizer.eos_token_id, tokenizer.pad_token_id}

    # Iterate over each sample
    for tokens, ner_tags in zip(examples["tokens"], examples["ner_tags"]):
        # Join tokens back into text: the tokenizer is given the full sentence
        text = " ".join(tokens)

        # Tokenize the full text
        encoding = tokenizer(
            text,
            truncation=True,       # truncate if longer than 128 tokens
            max_length=128,
            padding="max_length",
            return_tensors=None    # return Python lists, not tensors
        )
        # Illustrative result (IDs other than 0/1/2 are made up):
        #   pieces:         [<s>, ngô, ng@@, ọt, bị, </s>, <pad>, ...]
        #   input_ids:      [0, 1234, 5678, 9012, 3456, 2, 1, ...]   # 128 values
        #   attention_mask: [1, 1, 1, 1, 1, 1, 0, ...]               # 1 = real token, 0 = padding
        # where 0 = <s> (sentence start), 2 = </s> (sentence end), 1 = <pad>.
        token_ids = encoding["input_ids"]

        # Initialize labels with -100 (ignore index)
        labels = [-100] * len(token_ids)

        # Manual alignment: match each word to its tokens
        word_idx = 0
        for i, token_id in enumerate(token_ids):
            # Skip special tokens; their labels stay -100
            if token_id in special_ids:
                continue

            # Decode token
            token_text = tokenizer.decode([token_id], skip_special_tokens=True).strip()

            # Strip subword markers before comparing with the word: "@@" marks a
            # non-final BPE piece, "_" joins syllables of a word-segmented token
            token_clean = token_text.replace("@@", "").replace("_", " ").strip()

            if not token_clean:
                continue

            # Try to match this token to the current word
            if word_idx < len(tokens):
                word = tokens[word_idx]

                # Substring match: fragile when the next word happens to be
                # contained in the current one, but enough for short commands
                if token_clean.lower() in word.lower():
                    # Assign the label for this word
                    labels[i] = ner_tags[word_idx]

                    # A token equal to the whole word finishes that word
                    if token_clean.lower() == word.lower():
                        word_idx += 1
                else:
                    # Token belongs to the next word: advance and use its label
                    word_idx += 1
                    if word_idx < len(tokens):
                        labels[i] = ner_tags[word_idx]

        tokenized_inputs["input_ids"].append(encoding["input_ids"])
        tokenized_inputs["attention_mask"].append(encoding["attention_mask"])
        tokenized_inputs["labels"].append(labels)

    return tokenized_inputs

# Tokenize datasets
print("Tokenizing training dataset...")
tokenized_train = train_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=train_dataset.column_names
)

print("Tokenizing validation dataset...")
tokenized_val = val_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=val_dataset.column_names
)

print("✅ Datasets tokenized")
print(f"   Training samples: {len(tokenized_train)}")
print(f"   Validation samples: {len(tokenized_val)}")

# Walkthrough of one sample (token IDs other than 0/1/2 are made up; label IDs
# follow the Step 4 mapping):
#
#   tokens   = ["Bật", "máy", "bơm", "ở", "khu", "A", "trong", "30", "phút"]
#   ner_tags = [0, 4, 10, 0, 1, 7, 0, 5, 11]
#
#   Word:  Bật  máy       bơm       ở  khu     A       trong  30          phút
#   Tag:   O    B-DEVICE  I-DEVICE  O  B-AREA  I-AREA  O      B-DURATION  I-DURATION
#   ID:    0    4         10        0  1       7       0      5           11
#
#   text = " ".join(tokens)   # "Bật máy bơm ở khu A trong 30 phút"
#
#   token_ids = [
#       0,      # <s> (start)
#       8901,   # Bật
#       2345,   # máy
#       6789,   # bơm
#       1234,   # ở
#       5678,   # khu
#       9012,   # A
#       3456,   # trong
#       7890,   # 30
#       4567,   # phút
#       2,      # </s> (end)
#       1, 1, 1, ...  # <pad> (padding up to 128)
#   ]
#
#   labels = [-100] * 128   # 128 positions, all -100 (ignored)
#   word_idx = 0            # start at the first word
#
#   i = 0, token_id = 0:
#       0 == tokenizer.bos_token_id, so the token is skipped; labels[0] stays -100
#
#   i = 1, token_id = 8901, word_idx = 0 (current word "Bật"):
#       token_clean = "Bật"; word = tokens[0] = "Bật"
#       "Bật" in "Bật" is True   -> labels[1] = ner_tags[0] = 0   (O)
#       "Bật" == "Bật" is True   -> word_idx = 1 (move on to "máy")
#
#   i = 2, token_id = 2345, word_idx = 1 (current word "máy"):
#       token_clean = "máy"; word = tokens[1] = "máy"
#       "máy" in "máy" is True   -> labels[2] = ner_tags[1] = 4   (B-DEVICE)
#       "máy" == "máy" is True   -> word_idx = 2 (move on to "bơm")
#
# Final output:
#   tokenized_inputs = {
#       "input_ids":      [0, 8901, 2345, 6789, 1234, 5678, 9012, 3456, 7890, 4567, 2, 1, 1, ...],
#       "attention_mask": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ...],
#       "labels":         [-100, 0, 4, 10, 0, 1, 7, 0, 5, 11, -100, -100, -100, ...]
#   }
#
# Properties:
#   - Every token has exactly one label.
#   - Special tokens and padding are labeled -100.
#   - The result is ready to feed to the model for training.

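An alternative to matching decoded tokens against words is to tokenize one word at a time, so each piece knows which word it came from by construction. The sketch below illustrates the idea with a hypothetical toy splitter (`toy_tokenize_word`) in place of a real tokenizer, so it runs on its own; with PhoBERT one would presumably pass `tokenizer.tokenize` instead and convert the pieces to IDs afterwards. It is not the notebook's method, and the `label_all_pieces` flag is an illustrative addition.

```python
IGNORE = -100

def toy_tokenize_word(word):
    """Hypothetical stand-in for a subword tokenizer: cut a word into
    3-character pieces and mark every non-final piece with '@@'."""
    pieces = [word[i:i + 3] for i in range(0, len(word), 3)]
    return [p + "@@" for p in pieces[:-1]] + pieces[-1:]

def align_word_labels(words, tag_ids, tokenize_word, label_all_pieces=True):
    """Tokenize word by word and give each piece a label or IGNORE."""
    pieces, labels = ["<s>"], [IGNORE]
    for word, tag_id in zip(words, tag_ids):
        word_pieces = tokenize_word(word)
        pieces.extend(word_pieces)
        if label_all_pieces:
            # Every piece inherits the word's label (what the cell above does)
            labels.extend([tag_id] * len(word_pieces))
        else:
            # Only the first piece carries the label; the rest are ignored by the loss
            labels.extend([tag_id] + [IGNORE] * (len(word_pieces) - 1))
    pieces.append("</s>")
    labels.append(IGNORE)
    return pieces, labels

words = ["Bật", "tưới", "khu", "A"]
tag_ids = [0, 4, 1, 7]   # O, B-DEVICE, B-AREA, I-AREA (Step 4 mapping)

print(align_word_labels(words, tag_ids, toy_tokenize_word))
print(align_word_labels(words, tag_ids, toy_tokenize_word, label_all_pieces=False))
```

Labeling only the first piece keeps a split `B-` word from producing two `B-` labels in a row, at the cost of giving the model fewer supervised positions; which choice works better here would need to be measured.
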
## 🎯 Step 8: Define Training Arguments

In [None]:
# Step 8 (replacement TrainingArguments cell): wandb logging disabled

import os

import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score
from transformers import TrainingArguments, Trainer
from transformers import DataCollatorForTokenClassification

# Disable wandb
os.environ["WANDB_DISABLED"] = "true"

# Data collator: combines several samples into one batch so that input_ids,
# attention_mask, and labels all have the same length.
# Example batch of three samples:
# batch = {
#     "input_ids": [
#         [0, 1234, 5678, 2, 1, 1, ...],      # Sample 1
#         [0, 9012, 3456, 7890, 2, 1, ...],   # Sample 2
#         [0, 2345, 6789, 2, 1, 1, ...]       # Sample 3
#     ],  # Shape: (3, 128)
#
#     "attention_mask": [
#         [1, 1, 1, 1, 0, 0, ...],
#         [1, 1, 1, 1, 1, 0, ...],
#         [1, 1, 1, 1, 0, 0, ...]
#     ],  # Shape: (3, 128)
#
#     "labels": [
#         [-100, 1, 2, -100, -100, ...],
#         [-100, 0, 3, 4, -100, ...],
#         [-100, 5, 6, -100, -100, ...]
#     ]  # Shape: (3, 128)
# }
data_collator = DataCollatorForTokenClassification(tokenizer)

# Metric computation
def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    
    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
    }

# Training arguments (with wandb disabled)
training_args = TrainingArguments(
    output_dir="./ner_model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    push_to_hub=False,
    report_to="none",  # Disable all reporting (wandb, tensorboard, etc.)
)

print("✅ Training arguments configured")
print(f"   Epochs: {training_args.num_train_epochs}")
print(f"   Batch size: {training_args.per_device_train_batch_size}")
print(f"   Learning rate: {training_args.learning_rate}")
print(f"   Logging: Disabled (no wandb)")
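
The precision, recall, and F1 returned by `compute_metrics` are entity-level scores: a predicted entity only counts when its type and both boundaries match the gold entity exactly. The sketch below is a simplified re-implementation of that idea for illustration, not seqeval's code; in particular it ignores stray `I-` tags with no preceding `B-`, which seqeval may treat differently. The function names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (type, start, end) entities in one BIO sequence (end exclusive)."""
    entities = set()
    start, current_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel "O" flushes the last entity
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            entities.add((current_type, start, i))
            start, current_type = None, None
        if tag.startswith("B-"):
            start, current_type = i, tag[2:]
    return entities

def entity_prf(true_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1 with exact matching."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold = extract_entities(true_tags)
        pred = extract_entities(pred_tags)
        tp += len(gold & pred)
        n_pred += len(pred)
        n_true += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

true_seqs = [["O", "B-DEVICE", "B-AREA", "I-AREA"], ["B-CROP", "I-CROP", "O"]]
pred_seqs = [["O", "B-DEVICE", "B-AREA", "O"], ["B-CROP", "I-CROP", "O"]]

print(entity_prf(true_seqs, pred_seqs))
```

In this toy case six of the seven tags are correct, yet only two of the three entities match, because the truncated `AREA` span counts as both a missed entity and a wrong prediction. That is why the entity-level F1 can sit well below token accuracy.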

## 🚀 Step 9: Train Model

In [None]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Start training
print("\n🚀 Starting training...\n")
trainer.train()

print("\n✅ Training completed!")

## 📊 Step 10: Evaluate Model

In [None]:
# Evaluate on validation set
results = trainer.evaluate()

print("\n📊 Evaluation Results:")
for key, value in results.items():
    print(f"  {key}: {value:.4f}")

## 💾 Step 11: Save Model

In [None]:
# Save model and tokenizer
output_dir = "./ner_extractor_final"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

# Copy label mapping
import shutil
shutil.copy('label_mapping.json', f'{output_dir}/label_mapping.json')

print(f"\n✅ Model saved to {output_dir}")
print("\n📦 Files to download:")
print("  - config.json")
print("  - pytorch_model.bin (or model.safetensors)")
print("  - tokenizer files written by save_pretrained (e.g. vocab.txt, bpe.codes)")
print("  - label_mapping.json")

## 🧪 Step 12: Test Model

In [None]:
# Test on new examples
test_examples = [
    "Bật tưới khu A trong 5 phút",
    "Độ ẩm ở khu B là bao nhiêu",
    "Chi phí tháng này",
    "Tắt đèn khu C",
    "Nhiệt độ khu 1 hiện tại",
    "Doanh thu quý 2",
    "Cách trồng cam sành"
]

def predict_entities(text):
    # Tokenize
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Predict (eval mode disables dropout)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=-1)[0]

    # Decode
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    labels = [id2label[p.item()] for p in predictions]

    # Extract entities
    entities = []
    current_entity = None
    prev_continues = False  # True when the previous piece ended with "@@"

    for token, label in zip(tokens, labels):
        if token in ["<s>", "</s>", "<pad>"]:
            continue

        # "@@" marks a non-final subword piece; "_" joins syllables of a segmented word
        piece = token.replace("@@", "").replace("_", " ").strip()

        if prev_continues:
            # Continuation of the previous word: glue it on without a space and
            # keep the decision made on the word's first piece
            if current_entity:
                current_entity["text"] += piece
        elif label.startswith("B-"):
            if current_entity:
                entities.append(current_entity)
            current_entity = {"type": label[2:], "text": piece}
        elif label.startswith("I-") and current_entity:
            current_entity["text"] += " " + piece
        elif label == "O" and current_entity:
            entities.append(current_entity)
            current_entity = None

        prev_continues = token.endswith("@@")

    if current_entity:
        entities.append(current_entity)

    return entities

print("\n🧪 Testing model on new examples:\n")
for example in test_examples:
    entities = predict_entities(example)
    print(f"Text: {example}")
    print(f"Entities: {entities}")
    print()

## 📥 Step 13: Download Model Files

In [None]:
# Zip model files for easy download
!zip -r ner_model.zip ner_extractor_final/
print("✅ Model zipped as ner_model.zip")
print("\n📥 Download ner_model.zip from Colab Files panel")
print("\n📋 Deployment instructions:")
print("1. Extract ner_model.zip")
print("2. Copy files to: C:\\Users\\ADMIN\\Desktop\\ex\\apps\\python-ai-service\\models\\ner_extractor\\")
print("3. Restart Python AI service")