# Thai N-NER: Thai Nested Named Entity Recognition

This demo notebook is a tutorial on using Thai N-NER for nested named entity recognition in Thai text. For background, see the article [Thai N-NER](https://medium.com/airesearch-in-th/thai-n-ner-thai-nested-named-entity-recognition-1969f8fe91f0).

## 1. Setup and Preprocessing

In [1]:
!pip install seqeval
!pip install pythainlp
!pip install transformers==4.29.2
!pip install sentencepiece
!pip install gdown
!pip install thai_nner
!pip install protobuf==3.20.3
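
Because two of the packages above are pinned, it can help to confirm what actually got installed before moving on. The following is a minimal sketch using only the standard library; the `EXPECTED` table and the `check_versions` helper are illustrative names, not part of Thai N-NER, and the pins are simply copied from the install cell.

```python
from importlib import metadata

# Pins copied from the install cell; None means "any version is acceptable".
EXPECTED = {
    "seqeval": None,
    "pythainlp": None,
    "transformers": "4.29.2",
    "sentencepiece": None,
    "gdown": None,
    "thai_nner": None,
    "protobuf": "3.20.3",
}


def check_versions(expected):
    """Return {package: (installed_version_or_None, ok_flag)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and (wanted is None or installed == wanted)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        status = "ok" if ok else "CHECK"
        print(f"{name:15s} {installed or 'missing':12s} {status}")
```

A package reported as `CHECK` is either missing or installed at a version other than the pinned one; in Colab, restarting the runtime after installation is often needed before a changed version is picked up.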


# Model checkpoints

> Thai N-NER provides the necessary resources, including models, datasets, and pre-trained language models, here: [Thai N-NER (resources)](https://drive.google.com/drive/folders/1Dy-360iZ9hIA-xA0yizSwmpM8sx6rrjJ?usp=sharing)

To use them, follow these steps:

1. Add the shared folder [Thai N-NER (resources)](https://drive.google.com/drive/folders/1Dy-360iZ9hIA-xA0yizSwmpM8sx6rrjJ?usp=sharing) to your Google Drive.
* Open the shared folder link in your web browser.
* Click the folder named "thai-nner" at the top of the page.
* In the menu bar, click "Organize", then click "Add shortcut to Drive" (you may see an icon that looks like a Drive logo with a plus sign).
* Select "My Drive".


In [None]:
# Clone the Thai-NNER repository from GitHub and make it the working directory
!git clone https://github.com/vistec-AI/Thai-NNER.git
%cd /content/Thai-NNER

# Mount your drive to Google Colab.

In [None]:
# Mount Google Drive so the shared thai-nner folder is accessible
from google.colab import drive
drive.mount('/content/drive/')

# Create symbolic links from the repository's data directory to the shared language models and checkpoints
!ln -s "/content/drive/MyDrive/thai-nner/lm" ./data/lm
!ln -s "/content/drive/MyDrive/thai-nner/checkpoints" ./data/checkpoints
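
If the Drive shortcut was not added, or was added somewhere other than "My Drive", the two links above are still created but point at nothing. The sketch below is one way to catch that early; `missing_resources` is a hypothetical helper written for this tutorial, and the two paths are the link names used in the cell above.

```python
from pathlib import Path


def missing_resources(paths):
    """Return the paths that do not resolve to an existing directory."""
    # Path.is_dir() follows symlinks, so a dangling link counts as missing.
    return [str(p) for p in map(Path, paths) if not p.is_dir()]


if __name__ == "__main__":
    required = ["data/lm", "data/checkpoints"]
    missing = missing_resources(required)
    if missing:
        print("Not found (check the Drive shortcut and the mount):", missing)
    else:
        print("All resource directories are reachable.")
```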

# Inference

In [None]:
import json
import torch
import argparse
from tqdm import tqdm
from tabulate import tabulate

from utils.unique import unique
import model.loss as module_loss
import model.model as module_arch
import model.metric as module_metric
from parse_config import ConfigParser
import data_loader.data_loaders as module_data

PAD = '<pad>'

In [None]:
# Path to the pre-trained checkpoint, reached through the data/checkpoints link created earlier
resume = 'data/checkpoints/1102_151935/checkpoint.pth'

In [None]:
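# Reuse the training template's command-line parser inside the notebook.
args = argparse.ArgumentParser(description='PyTorch Template')
args.add_argument('-c', '--config', default=None, type=str, help='config file path (default: None)')
args.add_argument('-r', '--resume', default=resume, type=str, help='path to the checkpoint to load (default: the `resume` path set above)')
args.add_argument('-d', '--device', default=None, type=str, help='indices of GPUs to enable (default: all)')
# Jupyter starts the kernel with `-f <connection file>`; this option appears to exist only to absorb that flag.
args.add_argument('-f', '--file', default=None, type=str, help='unused; absorbs the kernel connection file passed by Jupyter')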
config = ConfigParser.from_args(args)
logger = config.get_logger('test')

# build model architecture
model = config.init_obj('arch', module_arch)

# get function handles of loss and metrics
criterion = getattr(module_loss, config['loss'])
metric_fns = [getattr(module_metric, met) for met in config['metrics']]

logger.info('Loading checkpoint: {} ...'.format(config.resume))
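# map_location='cpu' lets a checkpoint saved on a GPU load on a CPU-only runtime;
# the model is moved to the target device further down.
checkpoint = torch.load(config.resume, map_location='cpu')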
state_dict = checkpoint['state_dict']

if config['n_gpu'] > 1:
    model = torch.nn.DataParallel(model)

model.load_state_dict(state_dict)
layers_train = config._config['trainer']['layers_train']

# prepare model for testing
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()

total_loss = 0.0
total_metrics = torch.zeros(len(metric_fns))
# logger.info(model)
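
A note on the `-f` option in the cell above: a Jupyter kernel is launched with `-f <connection file>` on its command line, so a parser that reads `sys.argv` inside a notebook has to tolerate that flag. Declaring a dummy `-f` option is one way to do it. The sketch below shows an alternative with plain `argparse`, using `parse_known_args`. It is standalone and illustrative: `build_parser` and `parse_notebook_args` are made-up names, and whether `ConfigParser.from_args` could be adapted this way is not something this notebook shows.

```python
import argparse


def build_parser():
    # Illustrative parser with a single option; it does not declare -f.
    parser = argparse.ArgumentParser(description="notebook-safe parser (illustrative)")
    parser.add_argument("-r", "--resume", default=None, type=str,
                        help="path to the checkpoint to load")
    return parser


def parse_notebook_args(argv):
    # parse_known_args returns (namespace, leftovers) instead of exiting
    # when it meets flags the parser does not declare.
    args, unknown = build_parser().parse_known_args(argv)
    return args, unknown


if __name__ == "__main__":
    # Simulated kernel command line; the connection-file path is made up.
    demo_args, demo_unknown = parse_notebook_args(
        ["-f", "/tmp/kernel-1234.json", "-r", "checkpoint.pth"]
    )
    print(demo_args.resume)
    print(demo_unknown)
```

The trade-off is that `parse_known_args` silently ignores every unrecognised flag, including typos, whereas a declared dummy option only swallows `-f`.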

In [None]:
# Load only a few test examples.
config.config['data_loader']['args']['sample_data'] = True

data_loader = config.init_obj('data_loader', module_data)
test_data_loader = data_loader.get_test()

> Now, let's use the pre-trained Thai N-NER model checkpoint to run inference and predict NE tags.

In [None]:
from utils.prediction import predict, get_dict_prediction, show
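
# Rough English gloss: "Today, 27 January 2568 (Buddhist Era), is a day with very nice weather."
text = " วันนี้วันที่ 27 มกราคม 2568 เป็นวันที่อากาศดีมาก "

tokens, out = predict(model, text, data_loader, config)
tokens = [tk for tk in tokens if tk != PAD]  # drop padding tokens
print("|".join(tokens), "\n")
for x in out:
    show(x)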
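
In nested NER, one entity span may sit inside another, for example a day, a month and a year inside a full date. The structure that `predict` returns is not shown in this notebook, so the sketch below does not parse it. Instead it uses a hypothetical `(start, end, label)` span format over token indices, with made-up labels, to illustrate what "nested" means. `nesting_depths` is an illustrative helper, not a Thai N-NER function.

```python
from typing import Dict, List, Tuple

# Hypothetical span format: (start token, end token exclusive, label).
Span = Tuple[int, int, str]


def nesting_depths(spans: List[Span]) -> Dict[Span, int]:
    """Map each span to the number of strictly larger spans that contain it."""
    depths = {}
    for s in spans:
        depths[s] = sum(
            1
            for o in spans
            if (o[0], o[1]) != (s[0], s[1]) and o[0] <= s[0] and s[1] <= o[1]
        )
    return depths


if __name__ == "__main__":
    # Toy English example; tokens and labels are illustrative only.
    tokens = ["27", "January", "2025"]
    spans = [(0, 3, "date"), (0, 1, "day"), (1, 2, "month"), (2, 3, "year")]
    ordered = sorted(nesting_depths(spans).items(), key=lambda kv: (kv[1], kv[0]))
    for (start, end, label), depth in ordered:
        print("  " * depth + f"{label}: {' '.join(tokens[start:end])}")
```

Outer spans are printed first and inner spans are indented beneath them, which mirrors how a nested tagger assigns several layers of labels to the same stretch of text.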

In [None]:
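# Rough English gloss: "The 40 Years of 14 October Committee for Complete Democracy."
text = "คณะกรรมการ 40 ปี 14 ตุลาเพื่อประชาธิปไตรสมบูรณ์"
tokens, out = predict(model, text, data_loader, config)
tokens = [tk for tk in tokens if tk != PAD]  # drop padding tokens
print("|".join(tokens), "\n")
for x in out:
    show(x)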

In [None]:
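# Rough English gloss: "On 18 January 2568 at 11:15 in Nakhon Phanom province, Mr. Thaksin Shinawatra, former prime minister,
# gave an interview about Mr. Chada Thaiset, former deputy interior minister, signing an order revoking the Alpine golf course land and returning it to monastic land only days before his term ended."
text = " วันที่ 18 มกราคม 2568 เมื่อเวลา 11.15 น. ที่จ.นครพนม นายทักษิณ ชินวัตร อดีตนายกฯ ให้สัมภาษณ์กรณีนายชาดา ไทยเศรษฐ์ อดีต รมช.มหาดไทย เซ็นคำสั่งเพิกถอนที่ดินสนามกอล์ฟอัลไพน์ กลับคืนเป็นที่ธรณีสงฆ์ ก่อนหมดวาระเพียงไม่กี่วัน "
tokens, out = predict(model, text, data_loader, config)
tokens = [tk for tk in tokens if tk != PAD]  # drop padding tokens
print("|".join(tokens), "\n")
for x in out:
    show(x)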

In [None]:
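# Rough English gloss: "The Ministry of Public Health gives preliminary figures for illness from PM2.5 dust: 144,000 patients in just 3 weeks of January,
# mostly skin conditions, eye inflammation and asthma; 5 provinces had dust levels above 75 micrograms per cubic metre continuously, at the red level."
text = " สธ.กางตัวเลขเบื้องต้นคนป่วยจากปัญหาฝุ่น PM2.5 แค่ 3 สัปดาห์ของเดือน ม.ค.พุ่ง 144,000 คนส่วนใหญ่ผิวหนัง ตาอักเสบ โรคหืด พบ 5 จังหวัดค่าฝุ่นเกิน 75 มคก.ต่อ ลบ.ม.ต่อเนื่องเกิน 3 ในระดับสีแดง "
tokens, out = predict(model, text, data_loader, config)
tokens = [tk for tk in tokens if tk != PAD]  # drop padding tokens
print("|".join(tokens), "\n")
for x in out:
    show(x)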
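
The inference cells above repeat the same steps for each sentence: predict, drop padding tokens, print. A small wrapper can run a whole list of texts in one go. The sketch below is a suggestion rather than part of the Thai N-NER code: `tag_texts` is a hypothetical helper, and `fake_predict` is a stand-in so the example runs without the model. In the notebook one would pass something like `lambda t: predict(model, t, data_loader, config)` as `predict_fn`.

```python
def tag_texts(texts, predict_fn, pad="<pad>"):
    """Run predict_fn on each text and strip padding tokens from the result.

    predict_fn is assumed to take one string and return (tokens, out),
    matching how `predict` is called in the cells above.
    """
    results = []
    for text in texts:
        tokens, out = predict_fn(text)
        tokens = [tk for tk in tokens if tk != pad]
        results.append((tokens, out))
    return results


def fake_predict(text):
    # Stand-in for the real model: whitespace tokens plus two padding tokens,
    # and an empty list in place of the predicted entities.
    return text.split() + ["<pad>", "<pad>"], []


if __name__ == "__main__":
    for tokens, out in tag_texts(["a b", "c"], fake_predict):
        print("|".join(tokens))
```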